PDF Text Extraction

This document discusses two Python libraries, PyPDF2 and PDFMiner, that can be used to extract text from PDF documents. PyPDF2 allows users to split PDF documents into individual pages, extract document information, merge pages, and encrypt and decrypt files. PDFMiner focuses on extracting and analyzing text data from PDFs and can convert PDFs into other text formats like HTML. The document provides an example of using these libraries to split a 708-page PDF into smaller files, extract and clean the text, and export it to readable text files.

Uploaded by

Esha Sachan

PDF Text Extraction

We are going to look at two Python libraries, PyPDF2 and PDFMiner.
Both are written specifically to work with PDF files. We will use
them in one project: splitting a 708-page PDF file into separate
smaller files, extracting the text, cleaning it, and then exporting it
to easily readable text files.
PyPDF2-
A pure-Python library built as a PDF toolkit. It is capable of:

• extracting document information (title, author, …)
• splitting documents page by page
• merging documents page by page
• cropping pages
• merging multiple pages into a single page
• encrypting and decrypting PDF files
• and more!
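
As an illustration of splitting page by page, here is a minimal sketch of how a long PDF could be cut into smaller files. It assumes PyPDF2 3.x (the PdfReader and PdfWriter classes); the function names, the chunk size of 100 pages, and the output file naming are hypothetical choices, not something the original project specifies.

```python
from pathlib import Path


def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs; stop is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_pdf(source_path, out_dir, chunk_size=100):
    """Write consecutive chunks of source_path as separate PDF files."""
    # Imported here so chunk_ranges can be used without PyPDF2 installed.
    # Assumes PyPDF2 3.x, where these class names are available.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(source_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers for readability (assumed scheme).
        target = out_dir / f"pages_{start + 1:04d}-{stop:04d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

With a 708-page file and the default chunk size, chunk_ranges yields eight ranges, the last one covering pages 701 to 708.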

Because it is pure Python, it should run on any Python platform without
any dependencies on external libraries. It can also work entirely on
in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3)
rather than file streams, allowing for PDF manipulation in memory. It
is therefore a useful tool for websites that manage or manipulate PDFs.
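
To make the in-memory point concrete, the sketch below reads a PDF from bytes and returns a trimmed PDF as bytes, without touching the disk. It assumes Python 3 and PyPDF2 3.x; keep_first_pages is a hypothetical helper name introduced only for this example.

```python
import io


def keep_first_pages(pdf_bytes, count):
    """Return a new PDF, as bytes, holding at most the first `count` pages."""
    # Imported lazily so this module loads even where PyPDF2 is missing.
    # Assumes PyPDF2 3.x (PdfReader / PdfWriter).
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))   # read from memory, not disk
    writer = PdfWriter()
    for index in range(min(count, len(reader.pages))):
        writer.add_page(reader.pages[index])

    buffer = io.BytesIO()
    writer.write(buffer)                        # write to memory, not disk
    return buffer.getvalue()
```

A web application could, for example, pass the bytes of an uploaded file to such a function and send the returned bytes straight back in the response.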
PDFMiner-
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting and
analyzing text data. PDFMiner allows one to obtain the exact location
of text in a page, as well as other information such as fonts or lines. It
includes a PDF converter that can transform PDF files into other text
formats (such as HTML). It has an extensible PDF parser that can be
used for purposes other than text analysis.
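
The extract, clean, and export steps could look roughly like the sketch below. It assumes the pdfminer.six distribution, which provides extract_text and extract_pages in pdfminer.high_level. The cleaning rules in clean_text are illustrative assumptions only, since the original does not say how the text was cleaned, and all function names here are hypothetical.

```python
import re
from pathlib import Path


def clean_text(raw):
    """Normalize whitespace in extracted text (illustrative rules only)."""
    text = raw.replace("\x0c", "\n")        # form feeds mark page breaks
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # drop empty lines


def pdf_to_text_file(pdf_path, txt_path):
    """Extract text from a PDF, clean it, and save it as UTF-8 text."""
    # Lazy import: requires the pdfminer.six package (an assumption).
    from pdfminer.high_level import extract_text

    raw = extract_text(str(pdf_path))
    Path(txt_path).write_text(clean_text(raw), encoding="utf-8")


def text_boxes(pdf_path):
    """Yield (page_number, bounding_box, text) for each text block."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(str(pdf_path)), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points.
                yield page_number, element.bbox, element.get_text()
```

The text_boxes generator shows the location information mentioned above: each text block comes with its bounding box on the page.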
