Commit e696905

update pdf tables extractor tutorial
1 parent 4c2a0e8 commit e696905

7 files changed: +33 -8 lines changed

5.09 MB
Binary file not shown.

general/pdf-table-extractor/README.md

Lines changed: 4 additions & 6 deletions
@@ -1,8 +1,6 @@
 # [How to Extract PDF Tables in Python](https://www.thepythoncode.com/article/extract-pdf-tables-in-python-camelot)
 To run this:
-- You need to install required dependencies for the library [here](https://camelot-py.readthedocs.io/en/master/user/install-deps.html#install-deps).
-- `pip3 install -r requirements.txt`
-- Extract PDFs of the file `foo.pdf`:
-```
-python pdf_table_extractor.py foo.pdf
-```
+- You need to install required dependencies for the camelot library [here](https://camelot-py.readthedocs.io/en/master/user/install-deps.html#install-deps).
+- `pip3 install -r requirements.txt`.
+- `pdf_table_extractor_camelot.py` is using camelot library.
+- `pdf_table_extractor_tabula.py` is using tabula-py library.

general/pdf-table-extractor/pdf_table_extractor.py renamed to general/pdf-table-extractor/pdf_table_extractor_camelot.py

Lines changed: 3 additions & 1 deletion
@@ -13,8 +13,10 @@
 # print the first table as Pandas DataFrame
 print(tables[0].df)
 
-# export individually
+# export individually as CSV
 tables[0].to_csv("foo.csv")
+# export individually as Excel (.xlsx extension)
+tables[0].to_excel("foo.xlsx")
 
 # or export all in a zip
 tables.export("foo.csv", f="csv", compress=True)
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+import tabula
+import os
+# uncomment if you want to pass pdf file from command line arguments
+# import sys
+
+# read PDF file
+# uncomment if you want to pass pdf file from command line arguments
+# tables = tabula.read_pdf(sys.argv[1], pages="all")
+tables = tabula.read_pdf("1710.05006.pdf", pages="all")
+
+# save them in a folder
+folder_name = "tables"
+if not os.path.isdir(folder_name):
+    os.mkdir(folder_name)
+# iterate over extracted tables and export as excel individually
+for i, table in enumerate(tables, start=1):
+    table.to_excel(os.path.join(folder_name, f"table_{i}.xlsx"), index=False)
+
+# convert all tables of a PDF file into a single CSV file
+# supported output_formats are "csv", "json" or "tsv"
+tabula.convert_into("1710.05006.pdf", "output.csv", output_format="csv", pages="all")
+# convert all PDFs in a folder into CSV format
+# `pdfs` folder should exist in the current directory
+tabula.convert_into_by_batch("pdfs", output_format="csv", pages="all")
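
Review note: the tabula script hard-codes the PDF name and keeps the command-line variant in comments. A possible restructuring is sketched here, not as the author's code but as one way to do it. The names `export_tables` and `main` are hypothetical. The sketch assumes `tabula.read_pdf(..., pages="all")` returns a list of pandas DataFrames, as the script above relies on, and it writes CSV rather than Excel so that no Excel writer package is needed.

```python
import os
import sys


def export_tables(tables, folder_name="tables"):
    """Write each table to its own CSV file in folder_name and return the paths.

    Each item in `tables` only needs a pandas-style `to_csv(path, index=False)`
    method (hypothetical helper, not part of tabula-py).
    """
    # makedirs with exist_ok=True replaces the isdir()/mkdir() pair in the script
    os.makedirs(folder_name, exist_ok=True)
    paths = []
    for i, table in enumerate(tables, start=1):
        path = os.path.join(folder_name, f"table_{i}.csv")
        table.to_csv(path, index=False)
        paths.append(path)
    return paths


def main(argv):
    """Extract all tables from the PDF named in argv[1]; return an exit code."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} file.pdf", file=sys.stderr)
        return 1
    # imported lazily so export_tables stays usable without tabula-py and Java
    import tabula

    tables = tabula.read_pdf(argv[1], pages="all")
    for path in export_tables(tables):
        print(path)
    return 0
```

To use it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file and run it with the PDF path as the only argument.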
5.09 MB
Binary file not shown.
82.2 KB
Binary file not shown.
Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
-camelot-py[cv]
+camelot-py[cv]
+tabula-py
