pymupdf4llm 1.27.2.3
pip install pymupdf4llm
PyMuPDF Utilities for LLM/RAG
Maintainers: haraldlieder, jamie.lemon, jorjmckie, julian.smith_artifex.com
Meta
- License: Dual Licensed - GNU AFFERO GPL 3.0 or Artifex Commercial License
- Author: Artifex
- Requires: Python >=3.10
Project description
PyMuPDF4LLM
Turn PDF and other documents into clean, LLM-ready data in one line of code. No GPU, no cloud, no tokens required.
PyMuPDF4LLM is a lightweight extension for PyMuPDF that converts documents into structured Markdown, JSON, and plain text optimised for RAG pipelines, vector embeddings, and LLM ingestion. It handles multi-column layouts, tables, images, headers, and scanned pages with automatic OCR, all powered by the MuPDF C engine.

    import pymupdf4llm

    md = pymupdf4llm.to_markdown("research-paper.pdf")
    # Feed directly into your LLM, vector store, or chunker

Why PyMuPDF4LLM?
- One import, three output formats: Markdown, JSON, and plain text out of the box
- No GPU, no cloud: runs on any machine that can run Python
- Layout-aware: multi-column pages, reading-order reconstruction, table detection
- Smart OCR: automatically OCRs only the regions that need it, skipping clean text
- Framework integrations: drop-in support for LlamaIndex and LangChain
- Page chunking: chunk output by page with full metadata per chunk, ready for vector stores
- 10 to 250× cheaper than vision-based LLM extraction approaches
Installation

    pip install pymupdf4llm

This automatically installs or upgrades PyMuPDF and PyMuPDF Layout as dependencies.
Optional: Office document support (PyMuPDF Pro)
Extend support to Word, Excel, PowerPoint, and HWP/HWPX by pairing with PyMuPDF Pro:

    pip install pymupdfpro

Quick start
Markdown output

    import pymupdf4llm

    md = pymupdf4llm.to_markdown("document.pdf")
    print(md)

JSON output

    import pymupdf4llm

    data = pymupdf4llm.to_json("document.pdf")
    # Returns bounding box info, layout data, and text per element
    print(data)

Plain text output

    import pymupdf4llm

    text = pymupdf4llm.to_text("document.pdf")
    print(text)

Save to file

    import pymupdf4llm
    from pathlib import Path

    md = pymupdf4llm.to_markdown("document.pdf")
    Path("output.md").write_bytes(md.encode())

Features
Output formats
| Format | API | Best for |
|---|---|---|
| Markdown | to_markdown(path) | LLM prompts, RAG pipelines, vector embeddings |
| JSON | to_json(path) | Custom pipelines needing bbox + layout metadata |
| Plain text | to_text(path) | Search indexing, simple NLP tasks |
| LlamaIndex docs | LlamaMarkdownReader().load_data(path) | Direct LlamaIndex integration |
Extraction capabilities
| Feature | Description |
|---|---|
| Layout analysis | Reconstructs natural reading order across single and multi-column pages |
| Table detection | Finds and converts tables to GitHub-compatible Markdown |
| Header detection | Maps font sizes to # heading levels; custom header detection via IdentifyHeaders or TocHeaders is available in legacy mode after pymupdf4llm.use_layout(False) |
| Inline formatting | Detects and preserves bold, italic, monospace, and code blocks |
| Image extraction | Extracts embedded images and inlines references in Markdown output |
| Vector graphics | Detects and includes references to vector graphic elements |
| Page chunking | With page_chunks=True in layout mode, returns chunk dicts containing metadata, toc_items, page_boxes, and text |
| Hybrid OCR | Automatically OCRs only image-covered or illegible regions; skips clean digital text. |
| Header / footer removal | Configurable exclusion of repetitive page headers and footers |
| Selective pages | Process a subset of pages via the pages parameter |
| TOC-driven headers | Use the document's table of contents to derive heading hierarchy |
Hybrid OCR Strategy
PyMuPDF4LLM applies OCR selectively, only where it is actually needed. Rather than blindly sending every page through an OCR engine (slow and counterproductive on clean text), or naively skipping OCR on mixed documents (leaving scanned regions unreadable), it analyses each page first and makes a targeted decision. This selective approach typically reduces OCR processing time by around 50%.
How it works
Before a page is processed, PyMuPDF4LLM analyzes its content to decide whether OCR should be used to unlock the full content. There are four conditions that can lead to OCR the page:
- Too many illegible characters (οΏ½)
- Presence of (many) vector graphics that simulate text
- Presence of a previous OCR text layer. This condition can be deselected which accepts a previous OCR and will not execute OCR again for the page.
- Presence of images containing text.
The result of all four paths is merged into a single, seamless output. There is no distinction in the Markdown between pages extracted natively and pages recovered via OCR.
Why it matters
OCR is roughly 1,000× slower than native text extraction. Applying it indiscriminately to a large document is expensive, and applying full-page OCR on top of already-readable text can actually degrade output quality by introducing recognition errors. The hybrid approach avoids both problems:
- Reduces OCR processing time by around 50% compared to full-document OCR
- Preserves the precision of native digital text extraction where the text layer is clean
- Recovers only what is broken, leaving surrounding content intact
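The arithmetic behind these figures can be sketched with a simple cost model. The model is an assumption made for illustration: every page is extracted natively once, each OCR'd page costs an additional 1,000 units, and the 100-page document with 50 pages needing OCR is an invented example, not a measurement.

```python
def relative_cost(pages: int, ocr_pages: int, ocr_slowdown: float = 1000.0) -> float:
    """Total time in units of one native page extraction.

    Assumes each page is extracted natively once and each OCR'd page
    additionally costs `ocr_slowdown` units (the rough 1,000x figure).
    """
    if not 0 <= ocr_pages <= pages:
        raise ValueError("ocr_pages must be between 0 and pages")
    return pages + ocr_pages * ocr_slowdown


full = relative_cost(pages=100, ocr_pages=100)   # OCR on every page
hybrid = relative_cost(pages=100, ocr_pages=50)  # OCR only where needed
print(full, hybrid, round(hybrid / full, 3))     # 100100.0 50100.0 0.5
```

Under this model the native extraction cost is negligible, so total time tracks the number of pages sent to OCR almost exactly.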
OCR triggers
Two situations cause OCR to be invoked automatically:
- No text at all: the page is image-covered with no selectable content. PyMuPDF4LLM also checks image quality heuristics to distinguish a scanned text page from a photograph, avoiding wasted OCR effort on pages that contain no readable text regardless.
- Garbled text: the page has a text layer, but too many characters are unreadable. Only the broken spans are targeted, not the full page.
Configuration
The default behaviour requires no configuration. Just install Tesseract and it works:

    import pymupdf4llm

    # OCR is triggered automatically wherever needed
    md = pymupdf4llm.to_markdown("mixed-document.pdf")

For cases where you need more control:

    import pymupdf4llm

    # Force OCR on every page (e.g. known-corrupt text layer)
    md = pymupdf4llm.to_markdown("document.pdf", force_ocr=True)

    # Force OCR on specific pages only
    md = pymupdf4llm.to_markdown("document.pdf", pages=[2, 3, 4], force_ocr=True)

    # Disable OCR entirely (pages with no text will return empty strings)
    md = pymupdf4llm.to_markdown("document.pdf", use_ocr=False)

    # Set OCR resolution (default 300 dpi; higher values cost quadratically more)
    md = pymupdf4llm.to_markdown("document.pdf", ocr_dpi=150)

    # Specify OCR language
    md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra")

    # Bring your own OCR function (my_ocr_fn is a callable you define)
    md = pymupdf4llm.to_markdown("document.pdf", ocr_function=my_ocr_fn)

Note: force_ocr=True on a clean, text-based PDF will slow processing significantly and may reduce output quality. Use it only when you have reason to distrust the native text layer.
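The remark that higher resolutions "cost quadratically more" follows from pixel count: doubling the dpi doubles both the width and the height of the rendered page. A small sketch of that relationship, assuming rendered pixel count is a fair proxy for OCR cost (the function name is hypothetical):

```python
def relative_pixel_count(dpi: int, baseline_dpi: int = 300) -> float:
    """Pixels rendered per page relative to the baseline resolution."""
    if dpi <= 0 or baseline_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return (dpi / baseline_dpi) ** 2


for dpi in (150, 300, 600):
    print(dpi, relative_pixel_count(dpi))
# 150 0.25
# 300 1.0
# 600 4.0
```

So ocr_dpi=150 renders about a quarter of the pixels of the 300 dpi default, at the price of less detail for the OCR engine to work with.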
OCR engine selection
PyMuPDF4LLM automatically selects the best available OCR engine at runtime, with no manual configuration needed. It supports Tesseract (via PyMuPDF's built-in integration) and rapidocr_onnxruntime, choosing whichever is installed. If neither is available, the default behavior is to disable OCR and emit a warning. If OCR is explicitly required (for example, force_ocr=True / ALWAYS mode), an exception is raised with installation instructions.
Find out more with the full PyMuPDF4LLM OCR documentation.
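The fallback behaviour described above can be pictured with a small probe. This is a sketch under assumptions, not the library's detection code: checking for a `tesseract` binary on the PATH, the probe order, and both function names are hypothetical.

```python
import importlib.util
import shutil
import warnings


def detect_ocr_engine(
    has_module=lambda name: importlib.util.find_spec(name) is not None,
    has_binary=lambda name: shutil.which(name) is not None,
):
    """Return "tesseract", "rapidocr", or None (probe order is an assumption)."""
    if has_binary("tesseract"):
        return "tesseract"
    if has_module("rapidocr_onnxruntime"):
        return "rapidocr"
    return None


def require_ocr_engine(force: bool = False, **probes):
    """Warn and continue without OCR by default; raise when OCR is forced."""
    engine = detect_ocr_engine(**probes)
    if engine is None:
        if force:
            raise RuntimeError(
                "No OCR engine found; install Tesseract or rapidocr_onnxruntime"
            )
        warnings.warn("No OCR engine found; OCR disabled")
    return engine
```

The probes are injectable so the logic can be exercised without installing either engine.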
Framework integrations
| Framework | Method |
|---|---|
| LlamaIndex | pymupdf4llm.LlamaMarkdownReader().load_data("doc.pdf") |
| LangChain | from langchain_community.document_loaders import PyMuPDFLoader |
| LangChain + chunking | MarkdownTextSplitter on to_markdown() output |
Usage examples
Page chunking for RAG

    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown("document.pdf", page_chunks=True)

    for chunk in chunks:
        print(chunk["metadata"]["page_number"])  # page number
        print(chunk["metadata"]["title"])        # document title
        print(chunk["text"])                     # markdown text for this page
        print(chunk["page_boxes"])               # page layout boxes for this page

Each chunk contains full document metadata alongside the page content, ready to insert into a vector store.
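Before insertion into a vector store, page chunks usually need to be flattened into records with stable IDs. The helper below is a minimal sketch: the record layout, the `chunks_to_records` name, and the sample data are assumptions for illustration, and the sample is hand-written rather than real library output.

```python
def chunks_to_records(chunks, source_id):
    """Turn page-chunk dicts into flat records with stable IDs."""
    records = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        # Accept either page key that appears in this document.
        page = meta.get("page_number", meta.get("page"))
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip empty pages rather than embedding blanks
        records.append({
            "id": f"{source_id}:p{page}",
            "text": text,
            "metadata": {"title": meta.get("title", ""), "page": page},
        })
    return records


# Hand-written sample shaped like page chunks (not real library output).
sample = [
    {"metadata": {"page_number": 1, "title": "Report"}, "text": "# Intro\n\nHello"},
    {"metadata": {"page_number": 2, "title": "Report"}, "text": "   "},
]
records = chunks_to_records(sample, source_id="report.pdf")
print(records[0]["id"])  # report.pdf:p1
```

Deriving the ID from the source name and page number keeps re-ingestion idempotent: processing the same document again overwrites records instead of duplicating them.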
LlamaIndex integration

    import pymupdf4llm

    reader = pymupdf4llm.LlamaMarkdownReader()
    docs = reader.load_data("document.pdf")

    # docs is a list of LlamaIndex Document objects
    for doc in docs:
        print(doc.text)

LangChain integration

    from langchain_community.document_loaders import PyMuPDFLoader
    from langchain.text_splitter import MarkdownTextSplitter
    import pymupdf4llm

    # Option A: via LangChain loader
    loader = PyMuPDFLoader("document.pdf")
    pages = loader.load()

    # Option B: via to_markdown + splitter
    md = pymupdf4llm.to_markdown("document.pdf")
    splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.create_documents([md])

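If LangChain is not available, Markdown output can also be split on its heading structure with the standard library alone. The sketch below is a simplified stand-in, not LangChain's MarkdownTextSplitter: it ignores chunk size and overlap and starts a new section at each ATX heading outside code fences. The function name is hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
FENCE = "`" * 3  # code-fence marker


def split_by_headings(md: str) -> list[str]:
    """Split Markdown into sections, starting a new one at each ATX heading."""
    sections, current, in_fence = [], [], False
    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        if not in_fence and current and HEADING.match(line):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Each returned section keeps its heading line, so the section title travels with the text it introduces.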
Extract specific pages

    import pymupdf4llm

    # Only extract pages 0, 1, and 5
    md = pymupdf4llm.to_markdown("document.pdf", pages=[0, 1, 5])

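The example above uses 0-based page indexes, whereas people usually quote page numbers starting at 1. A small helper can translate a human-style range string into the list expected by `pages`. The `parse_page_spec` name and the "1-3,6" syntax are assumptions for this sketch, not part of the library.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Convert a 1-based spec such as "1-3,6" into sorted 0-based page indexes."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start - 1, end))  # shift to 0-based, end inclusive
    return sorted(pages)


print(parse_page_spec("1-2,6"))  # [0, 1, 5]
```

The result can then be passed as `pages=parse_page_spec("1-2,6")`.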
Extract images alongside text

    import pymupdf4llm

    md = pymupdf4llm.to_markdown(
        "document.pdf",
        write_images=True,      # save extracted images to disk
        image_path="./images",  # directory for saved images
        image_format="png",     # output format
        dpi=150,                # image resolution
    )

Custom header detection
Note: this is only available in legacy mode, that is, after calling pymupdf4llm.use_layout(False).

    import pymupdf
    import pymupdf4llm

    pymupdf4llm.use_layout(False)

    doc = pymupdf.open("document.pdf")

    # Automatic: scan font sizes to determine heading levels
    headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3)
    md = pymupdf4llm.to_markdown(doc, hdr_info=headers)

    # TOC-driven: use the document's table of contents
    toc_headers = pymupdf4llm.TocHeaders(doc)
    md = pymupdf4llm.to_markdown(doc, hdr_info=toc_headers)

    # Custom callable: full control over heading logic
    def my_headers(span, page=None):
        if span["size"] > 16:
            return "# "
        elif span["size"] > 12:
            return "## "
        return ""

    md = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)

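The custom callable is plain Python, so its logic can be built and tested without opening a document. The factory below is a sketch that generalises the size thresholds shown above; `make_header_fn` is a hypothetical helper, and it assumes only what the example shows, namely that the callable receives a span dict with a "size" key and returns a heading prefix string.

```python
def make_header_fn(thresholds):
    """Build a header callable from (min_font_size, level) pairs."""
    ordered = sorted(thresholds, reverse=True)  # check the largest sizes first

    def header_prefix(span, page=None):
        for min_size, level in ordered:
            if span["size"] > min_size:
                return "#" * level + " "
        return ""  # body text: no heading prefix

    return header_prefix


my_headers = make_header_fn([(16, 1), (12, 2)])
print(repr(my_headers({"size": 18})))  # '# '
print(repr(my_headers({"size": 14})))  # '## '
print(repr(my_headers({"size": 10})))  # ''
```

Keeping the thresholds in data rather than in an if/elif chain makes it easier to tune them per document family.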
Automatic OCR for scanned documents

    import pymupdf4llm

    # OCR is triggered automatically for pages with no selectable text.
    # No configuration needed; just install Tesseract language packs as required.
    md = pymupdf4llm.to_markdown("scanned-report.pdf")

Output format reference
Markdown (to_markdown)
GitHub-compatible Markdown with:
- # to ###### headings derived from font size hierarchy
- **bold**, *italic*, `monospace` inline formatting
- Fenced code blocks for detected code spans
- GFM pipe tables for detected table regions
- Image references for extracted images
- Ordered and unordered lists
JSON (to_json)
Structured output containing bounding box coordinates, layout element types, font metadata, and text content for every detected element on each page. This is useful for building custom rendering or retrieval pipelines.
Page chunks (with page_chunks=True)
Each page is returned as a dict:

    {
        "metadata": {
            "format": "PDF 1.7",
            "title": "...",
            "author": "...",
            "page": 3,
            "page_count": 42,
            "file_path": "document.pdf",
            # ...
        },
        "toc_items": [[2, "Section Title", 3], ...],
        "text": "## Section Title\n\nBody text...",
        "tables": [...],
        "images": [...],
        "graphics": [...],
    }

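One use for toc_items is to record which sections a page belongs to, which helps when filtering retrieval results. The helper below is a sketch that assumes each entry starts with a level and a title, as in the dict above; `attach_section_titles` and the sample chunk are hypothetical, not library output.

```python
def attach_section_titles(chunk, max_level=2):
    """Return a copy of the chunk's metadata with a 'sections' list added."""
    titles = [
        item[1]
        for item in chunk.get("toc_items", [])
        if item[0] <= max_level  # item = [level, title, page, ...]
    ]
    metadata = dict(chunk.get("metadata", {}))  # copy; the chunk is left unchanged
    metadata["sections"] = titles
    return metadata


# Hand-written sample following the dict layout above (not real library output).
sample_chunk = {
    "metadata": {"title": "Report", "page": 3},
    "toc_items": [[2, "Section Title", 3], [3, "Detail", 3]],
    "text": "## Section Title\n\nBody text...",
}
enriched = attach_section_titles(sample_chunk)
print(enriched["sections"])  # ['Section Title']
```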
Supported document formats
| Format | Notes |
|---|---|
| PDF | Full support including scanned pages (via OCR) |
| XPS / OXPS | Text and image extraction |
| EPUB / MOBI / FB2 | Chapter-aware extraction |
| Images (PNG, JPG, TIFF…) | Single-page extraction with optional OCR |
| Office (DOCX, XLSX, PPTX, HWP) | Requires PyMuPDF Pro |
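In a batch pipeline it can help to check up front which inputs fall into the Office row. A minimal sketch, assuming file extensions are a reliable indicator of format; the `needs_pymupdf_pro` helper is hypothetical and the suffix set is taken from the Office row plus HWPX from the installation notes.

```python
from pathlib import Path

# Suffixes from the Office row above, plus HWPX from the installation notes.
PRO_SUFFIXES = {".docx", ".xlsx", ".pptx", ".hwp", ".hwpx"}


def needs_pymupdf_pro(path: str) -> bool:
    """Return True if the file extension is one listed as requiring PyMuPDF Pro."""
    return Path(path).suffix.lower() in PRO_SUFFIXES


print(needs_pymupdf_pro("report.DOCX"))  # True
print(needs_pymupdf_pro("paper.pdf"))    # False
```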
Performance
PyMuPDF4LLM is built on MuPDF, a best-in-class C rendering engine, and requires no GPU. Compared to vision-based LLM extraction:
- 10× faster on standard cloud instances
- Up to 250× lower infrastructure cost
- Matches or exceeds vision-LLM accuracy on table detection
- Smart OCR processes only the regions that need it, reducing OCR time by ~50%
Documentation
Full API reference, guides, and examples at pymupdf.readthedocs.io/en/latest/pymupdf4llm.
Related projects
| Project | Description |
|---|---|
| PyMuPDF | The core library: low-level PDF manipulation, rendering, annotation |
| PyMuPDF Pro | Adds Office and HWP document support |
| pymupdf-fonts | Extended font collection for PyMuPDF text output |
Licensing
PyMuPDF4LLM and PyMuPDF are maintained by Artifex Software, Inc.
- Open source: GNU AGPL v3. Free for open-source projects.
- Commercial: separate licences available from Artifex for proprietary applications.
Contributing
Contributions are welcome. Please open an issue before submitting large pull requests.
Support this project
If you find this useful, please consider giving it a star; it helps others discover it!
File details
Details for the file pymupdf4llm-1.27.2.3.tar.gz.
File metadata
- Download URL: pymupdf4llm-1.27.2.3.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 42ec1a47ddc62be3f4f40c116d27618611c6f9fa366719016d9ddc3f3a3dc22b |
| MD5 | e0ba147dfabdec92daf25ba85eed12e6 |
| BLAKE2b-256 | 87c0e3830452d82032c3d82a9879616c05bf0c51e0dea03c1d80d57b3a6ec0d1 |
File details
Details for the file pymupdf4llm-1.27.2.3-py3-none-any.whl.
File metadata
- Download URL: pymupdf4llm-1.27.2.3-py3-none-any.whl
- Upload date:
- Size: 77.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd724b79fa3f06a5b28d7a65f7acfa8de56e04bdb603ac2d6dff315e0d151aaa |
| MD5 | 50ae5f420256f5318f5f8d0629763ce1 |
| BLAKE2b-256 | e63884bf29f4dd72e6c450546df6ca8f53021f764fd945ba67dcc235d39bc20e |
