From 920406f1de6a901048593233f5a66fc00ab52f29 Mon Sep 17 00:00:00 2001
From: David Salgado
Date: Thu, 5 Oct 2017 13:43:26 +0100
Subject: [PATCH 1/3] Correct 'ghostscripts' to 'ghostscript'

For the installation instructions, the command should be `brew install ghostscript` not 'ghostscripts'
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d6de584..d830dac 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ Python library to extract text from any file type compatiable with [TIKA](http:/
 
 ##### Installation
 1. Download tika-server-1.7.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.7.jar)
-2. Mac: `brew install ghostscripts` Ubuntu: `sudo apt-get install ghostscript`
+2. Mac: `brew install ghostscript` Ubuntu: `sudo apt-get install ghostscript`
 3. Mac: `brew install tesseract` Ubuntu: `sudo apt-get install tesseract-ocr`
 4. Mac: `brew tap homebrew/x11` and `brew install xpdf` Ubuntu: `sudo apt-get install poppler-utils`
 5. Install Python dependencies with `pip install -r requirements.txt`

From 3cbd3f727036b76df9ac3f42cb7cae1e07a2cb3f Mon Sep 17 00:00:00 2001
From: David Salgado
Date: Thu, 5 Oct 2017 13:54:55 +0100
Subject: [PATCH 2/3] Update Tika version to 1.16

Tika server version 1.7 is no longer available. 1.16 is the current version.
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d830dac..3bec818 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Python library to extract text from any file type compatiable with [TIKA](http:/
 - [Xpdf](http://www.foolabs.com/xpdf/)
 
 ##### Installation
-1. Download tika-server-1.7.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.7.jar)
+1. Download tika-server-1.16.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.16.jar)
 2. Mac: `brew install ghostscript` Ubuntu: `sudo apt-get install ghostscript`
 3. Mac: `brew install tesseract` Ubuntu: `sudo apt-get install tesseract-ocr`
 4. Mac: `brew tap homebrew/x11` and `brew install xpdf` Ubuntu: `sudo apt-get install poppler-utils`
@@ -21,7 +21,7 @@ Python library to extract text from any file type compatiable with [TIKA](http:/
 ##### Usage
 These script assume that an instance of Tika server is running.
 Starting Tika Servers
-`java -jar tika-server-1.7.jar --port 9998`
+`java -jar tika-server-1.16.jar --port 9998`
 
 In Python script
 ```python

From 97a44bd045e7b622211b9a528390c4a7d9d1863c Mon Sep 17 00:00:00 2001
From: David Salgado
Date: Fri, 6 Oct 2017 09:21:43 +0100
Subject: [PATCH 3/3] Update the python script example

The API seems to have changed since the previous example was written.
---
 README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3bec818..66c66e3 100644
--- a/README.md
+++ b/README.md
@@ -25,8 +25,11 @@ Starting Tika Servers
 
 In Python script
 ```python
-from textextraction.extractors import text_extractor
-text_extractor(doc_path=doc_path, force_convert=False)
+
+from textextraction.extractors import (TextExtraction)
+
+text = TextExtraction(doc_path).doc_to_text()
+
 ```
 
 ##### Tests