Commit 22a2294

add extracting text from images in pdf files tutorial
1 parent 2b08213 commit 22a2294

File tree

6 files changed

+570
-0
lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Create a Watchdog in Python](https://www.thepythoncode.com/article/create-a-watchdog-in-python). ([code](general/directory-watcher))
 - [How to Watermark PDF Files in Python](https://www.thepythoncode.com/article/watermark-in-pdf-using-python). ([code](general/add-watermark-pdf))
 - [Highlighting Text in PDF with Python](https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python). ([code](handling-pdf-files/highlight-redact-text))
+- [How to Extract Text from Images in PDF Files with Python](https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python). ([code](handling-pdf-files/highlight-redact-text))
 
 
 - ### [Web Scraping](https://www.thepythoncode.com/topic/web-scraping)

handling-pdf-files/pdf-ocr/README.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# [How to Extract Text from Images in PDF Files with Python](https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python)
+To run this:
+- `pip3 install -r requirements.txt`
+-
+```
+$ python pdf_ocr.py --help
+```
+
+**Output:**
+```
+usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g]
+
+Available Options
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -i INPUT_PATH, --input-path INPUT_PATH
+                        Enter the path of the file or the folder to process
+  -a {Highlight,Redact}, --action {Highlight,Redact}
+                        Choose to highlight or to redact
+  -s SEARCH_STR, --search-str SEARCH_STR
+                        Enter a valid search string
+  -p PAGES, --pages PAGES
+                        Enter the pages to consider in the PDF file, e.g. (0,1)
+  -g, --generate-output
+                        Generate text content in a CSV file
+```
+- To extract text from the scanned image in the `image.pdf` file:
+```
+$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a Highlight
+```
+Here `-s` sets the keyword to search for, `-i` the input file, and `-o` the output PDF file. `--generate-output` (or `-g`) generates a CSV file containing all the text extracted from the images in the PDF file. `-a` specifies the action to perform in the output PDF file: "Highlight" highlights the target keyword, and "Redact" redacts it instead.
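
The command-line interface described in this README can be approximated with `argparse`. The block below is only a sketch rebuilt from the help output and the example invocation above, not the actual contents of `pdf_ocr.py`. The helper names `parse_pages` and `build_parser` are hypothetical. The `-o` flag appears in the example command but not in the usage line, so its long name `--output-file` and its help text are assumptions.

```python
import argparse


def parse_pages(value):
    """Turn a string such as "(0,1)" or "0,1" into a tuple of ints."""
    cleaned = value.strip().strip("()")
    try:
        return tuple(int(part) for part in cleaned.split(",") if part.strip())
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid page list: {value!r}")


def build_parser():
    """Build a parser mirroring the options listed in the help output."""
    parser = argparse.ArgumentParser(prog="pdf_ocr.py",
                                     description="Available Options")
    parser.add_argument("-i", "--input-path", required=True,
                        help="Enter the path of the file or the folder to process")
    parser.add_argument("-a", "--action", choices=["Highlight", "Redact"],
                        help="Choose to highlight or to redact")
    parser.add_argument("-s", "--search-str",
                        help="Enter a valid search string")
    parser.add_argument("-p", "--pages", type=parse_pages,
                        help="Enter the pages to consider in the PDF file, e.g. (0,1)")
    parser.add_argument("-g", "--generate-output", action="store_true",
                        help="Generate text content in a CSV file")
    # Assumption: -o is used in the README example but is absent from the
    # usage line, so this long option name and help text are hypothetical.
    parser.add_argument("-o", "--output-file",
                        help="Path of the output PDF file (assumed)")
    return parser


if __name__ == "__main__":
    # Same arguments as the README example, passed explicitly for illustration.
    args = build_parser().parse_args(
        ["-s", "BERT", "-i", "image.pdf", "-o", "output.pdf",
         "--generate-output", "-a", "Highlight"]
    )
    print(args)
```

Converting `-p` with a `type=` callable keeps page validation inside the parser, so a malformed value is reported as a normal usage error rather than failing later during processing.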

handling-pdf-files/pdf-ocr/image.pdf

162 KB
Binary file not shown.
