Commit bb35419

add converting pdf to image tutorial
1 parent 7d418f5 commit bb35419

5 files changed: +57 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [Highlighting Text in PDF with Python](https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python). ([code](handling-pdf-files/highlight-redact-text))
 - [How to Extract Text from Images in PDF Files with Python](https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python). ([code](handling-pdf-files/pdf-ocr))
 - [How to Convert PDF to Docx in Python](https://www.thepythoncode.com/article/convert-pdf-files-to-docx-in-python). ([code](handling-pdf-files/convert-pdf-to-docx))
+- [How to Convert PDF to Images in Python](https://www.thepythoncode.com/article/convert-pdf-files-to-images-in-python). ([code](handling-pdf-files/convert-pdf-to-image))


 - ### [Web Scraping](https://www.thepythoncode.com/topic/web-scraping)
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# [How to Convert PDF to Images in Python](https://www.thepythoncode.com/article/convert-pdf-files-to-images-in-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To convert the PDF file `bert-paper.pdf` into several images (image per page):
+```
+$ python convert_pdf2image.py bert-paper.pdf
+```
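The command above writes one PNG per page, named after the input file. The following is a minimal sketch of that naming rule, taken from the `output_file` expression in `convert_pdf2image.py`. The helper name `expected_outputs`, the `docs/` prefix, and the page count of 3 are assumptions for illustration only; the actual number of pages in `bert-paper.pdf` is not stated in this commit.

```python
import os
from typing import List


def expected_outputs(input_file: str, page_count: int) -> List[str]:
    """List the PNG names the script would write (hypothetical helper).

    Mirrors the script's naming rule: <input stem>_page<N>.png,
    with N starting at 1.
    """
    # basename() drops any directory part, so the images land in the
    # current working directory rather than next to the PDF.
    stem = os.path.splitext(os.path.basename(input_file))[0]
    return [f"{stem}_page{n}.png" for n in range(1, page_count + 1)]


if __name__ == "__main__":
    # Assumed page count of 3, for illustration only.
    for name in expected_outputs("docs/bert-paper.pdf", 3):
        print(name)
```

Because only the base name of the input path is used, running the script on PDFs with the same file name from different directories would overwrite the earlier images.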
Binary file not shown.
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+import fitz
+
+from typing import Optional, Tuple
+import os
+
+
+def convert_pdf2img(input_file: str, pages: Optional[Tuple[int, ...]] = None):
+    """Convert a PDF to PNG images, writing one file per page."""
+    # Open the document
+    pdfIn = fitz.open(input_file)
+    output_files = []
+    # Iterate over the pages
+    for pg in range(pdfIn.pageCount):
+        # Skip pages that were not requested (`pages` holds 0-based page indices)
+        if pages is not None and pg not in pages:
+            continue
+        # Select a page
+        page = pdfIn[pg]
+        rotate = 0
+        # Each page is rendered as a single image; the zoom factor scales its resolution.
+        # zoom = 1.33333333 ---> image size = 1056*816
+        # zoom = 2 ---> 2 * default resolution (text is clear, image text is hard to read), small file size, image size = 1584*1224
+        # zoom = 4 ---> 4 * default resolution (text is clear, image text is barely readable), large file size
+        # zoom = 8 ---> 8 * default resolution (text is clear, image text is readable), large file size
+        zoom_x = 2
+        zoom_y = 2
+        # A zoom factor of 2 is used to keep text clear
+        # preRotate applies the rotation, if any is needed
+        mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
+        pix = page.getPixmap(matrix=mat, alpha=False)
+        output_file = f"{os.path.splitext(os.path.basename(input_file))[0]}_page{pg+1}.png"
+        pix.writePNG(output_file)
+        output_files.append(output_file)
+    pdfIn.close()
+    summary = {
+        "File": input_file, "Pages": str(pages), "Output File(s)": str(output_files)
+    }
+    # Print a summary
+    print("## Summary ########################################################")
+    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
+    print("###################################################################")
+    return output_files
+
+
+if __name__ == "__main__":
+    import sys
+    input_file = sys.argv[1]
+    convert_pdf2img(input_file)
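Two parts of the script can be checked without PyMuPDF installed: the page filter and the zoom comments. The sketch below is not part of the commit, and both helper names (`select_pages`, `zoomed_size`) are hypothetical. It treats the page filter as an integer membership test, since comparing string forms would let `"1"` match inside `"(10,)"`. It also assumes a US Letter page of 612 x 792 points and one pixel per point at zoom 1, which matches the 1056*816 and 1584*1224 figures quoted in the comments; PyMuPDF's own rounding may differ slightly.

```python
from typing import Iterable, List, Optional, Tuple


def select_pages(page_count: int, pages: Optional[Iterable[int]] = None) -> List[int]:
    """Return the 0-based page indices to render (hypothetical helper).

    With pages=None every page is selected, as in the script's default call.
    """
    if pages is None:
        return list(range(page_count))
    wanted = set(pages)
    # Integer membership, so requesting page 10 does not also select 0 or 1.
    return [pg for pg in range(page_count) if pg in wanted]


def zoomed_size(width_pt: float, height_pt: float, zoom: float) -> Tuple[int, int]:
    """Approximate pixel size of a page rendered at a given zoom (hypothetical helper).

    Assumes one PDF point maps to one pixel at zoom 1.
    """
    return round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    print(select_pages(12, (1, 10)))          # [1, 10]
    print(zoomed_size(612, 792, 2))           # (1224, 1584)
    print(zoomed_size(612, 792, 1.33333333))  # (816, 1056)
```

Doubling the zoom doubles both dimensions, so the pixel count, and roughly the uncompressed image size, grows fourfold. That is consistent with the comments' note that larger zoom factors produce larger files.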
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+PyMuPDF==1.18.9
