0% found this document useful (0 votes)
106 views3 pages

Introducing PyMuPDF4LLM

Uploaded by

Uc Ngô
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
106 views3 pages

Introducing PyMuPDF4LLM

Uploaded by

Uc Ngô
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Introducing PyMuPDF4LLM

medium.com/@pymupdf/introducing-pymupdf4llm-d2c39442f445

PyMuPDF 29 tháng 5, 2024

Quickly enable typical operations for RAG

PyMuPDF

Recently we decided to enhance our RAG/LLM solutions for PyMuPDF with a new
convenience library to quickly enable typical operations for RAG.

We wanted a library to make it trivial to extract information from PDF ( and other ) files in
Python. We also wanted a library which:

is easy to install
is as simple to use as possible
gives the AI community exactly the API they need for RAG & LLM

Installation
Install via pip with:

pip install pymupdf4llm

PyMuPDF4LLM Features
PyMuPDF4LLM is based on top of the tried and tested PyMuPDF and utilizes the library
behind the scenes to achieve the following:

Support for multi-column pages


Support for image and vector graphics extraction (and inclusion of references in the
MD text)
Support for page chunking output
Direct support for output as LlamaIndex Documents

1/3
Multi-Column Pages
The text extraction can handle document layouts with multiple columns and meaning that
“newspaper” type layouts are supported. The associated Markdown output will faithfully
represent the intended reading order.

Image Support
PyMuPDF4LLM will also output image files alongside the Markdown if we request
write_images:

pymupdf4llmoutput = pymupdf4llm.to_markdown(, write_images=)

The resulting output will create a markdown text output with references to any images
that may have been found in the document. The images will be saved to the location from
where you have run the Python script and the markdown will have logically referenced
them with the correct markdown syntax for images.

Page Chunking
We can obtain output with enriched semantic information if we request page_chunks:

pymupdf4llmoutput = pymupdf4llm.to_markdown(, page_chunks=)

This delivers a list of dictionary objects for each page of the document with the following
schema:

— dictionary consisting of the document’s metadata.


— list of Table of Contents items pointing to the page.
— list of tables on this page.
— list of images on the page.
— list of vector graphics rectangles on the page.
— page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.

LlamaIndex Documents Output


If you are using LlamaIndex for your LLM application then you are in luck!
PyMuPDF4LLM has a seamless integration as follows:

pymupdf4llmllama_reader = pymupdf4llm.LlamaMarkdownReader()llama_docs =
llama_reader.load_data()

2/3
With these simple 3 lines of code you will receive LLamaIndex document objects from the
PDF file input for use with your LLM application!

Conclusion
We hope that you find the new library convenient and easy to use, please contact us on
Discord @ #pymupdf with any questions or feedback.

We also welcome new ideas and any issues, please contact us on our Github PyMuPDF
RAG Issue board to explain and discuss.

Finally, for full documentation, including the API reference, see the: PyMuPDF4LLM
documentation.

We hope you enjoy the new library!

3/3

You might also like