Commit 22a2294

add extracting text from images in pdf files tutorial
1 parent 2b08213 commit 22a2294

6 files changed, with 570 additions and 0 deletions.

README.md

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Create a Watchdog in Python](https://www.thepythoncode.com/article/create-a-watchdog-in-python). ([code](general/directory-watcher))
 - [How to Watermark PDF Files in Python](https://www.thepythoncode.com/article/watermark-in-pdf-using-python). ([code](general/add-watermark-pdf))
 - [Highlighting Text in PDF with Python](https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python). ([code](handling-pdf-files/highlight-redact-text))
+- [How to Extract Text from Images in PDF Files with Python](https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python). ([code](handling-pdf-files/pdf-ocr))


 - ### [Web Scraping](https://www.thepythoncode.com/topic/web-scraping)

handling-pdf-files/pdf-ocr/README.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# [How to Extract Text from Images in PDF Files with Python](https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python)
+To run this:
+- `pip3 install -r requirements.txt`
+-
+```
+$ python pdf_ocr.py --help
+```
+
+**Output:**
+```
+usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g]
+
+Available Options
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -i INPUT_PATH, --input-path INPUT_PATH
+                        Enter the path of the file or the folder to process
+  -a {Highlight,Redact}, --action {Highlight,Redact}
+                        Choose to highlight or to redact
+  -s SEARCH_STR, --search-str SEARCH_STR
+                        Enter a valid search string
+  -p PAGES, --pages PAGES
+                        Enter the pages to consider in the PDF file, e.g. (0,1)
+  -g, --generate-output
+                        Generate text content in a CSV file
+```
+- To extract text from the scanned image in the `image.pdf` file:
+```
+$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a Highlight
+```
+Here, `-s` sets the keyword to search for, `-i` the input file, and `-o` the output PDF file. `--generate-output` (or `-g`) generates a CSV file containing all the text extracted from the images in the PDF file, and `-a` specifies the action to perform in the output PDF file: "Highlight" highlights the target keyword, while "Redact" redacts it instead.
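
The `pdf_ocr.py` source is not part of this view, so the following is only a minimal sketch of an `argparse` parser that would accept the options listed in the help output. The function name `build_parser` is hypothetical. The `-o` flag appears in the example command but not in the help text, so its long name (`--output-file`) and destination are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the README help output."""
    parser = argparse.ArgumentParser(prog="pdf_ocr.py", description="Available Options")
    parser.add_argument("-i", "--input-path", dest="input_path", required=True,
                        help="Enter the path of the file or the folder to process")
    parser.add_argument("-a", "--action", dest="action", choices=["Highlight", "Redact"],
                        help="Choose to highlight or to redact")
    parser.add_argument("-s", "--search-str", dest="search_str",
                        help="Enter a valid search string")
    parser.add_argument("-p", "--pages", dest="pages",
                        help="Enter the pages to consider in the PDF file, e.g. (0,1)")
    parser.add_argument("-g", "--generate-output", dest="generate_output",
                        action="store_true",
                        help="Generate text content in a CSV file")
    # Assumption: -o is used in the example command but is absent from the
    # help output, so this long name and dest are hypothetical.
    parser.add_argument("-o", "--output-file", dest="output_file",
                        help="(hypothetical) path of the output PDF file")
    return parser


# Parse the README's example command; this only reads the argument list
# and does not open or create any files.
args = build_parser().parse_args(
    ["-s", "BERT", "-i", "image.pdf", "-o", "output.pdf",
     "--generate-output", "-a", "Highlight"]
)
print(vars(args))
```

Because `-a` is declared with `choices`, any value other than `Highlight` or `Redact` is rejected by the parser before the script would start any OCR work.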

handling-pdf-files/pdf-ocr/image.pdf

162 KB
Binary file not shown.
