URL: https://www.firecrawl.dev/blog/fire-pdf-launch

Introducing Fire-PDF: Firecrawl's New PDF Parsing Engine


Eric Ciarla, Apr 14, 2026

PDF parsing has always been one of the hardest parts of web scraping. Most PDFs aren't simple text. They contain scanned pages, multi-column layouts, tables, formulas, and mixed content. Until now, every solution forced a tradeoff: fast but inaccurate, or accurate but too slow to run at scale. Today we're shipping Fire-PDF, a PDF parsing engine built to eliminate that tradeoff.

P.S. Every PDF sent through our API now goes through Fire-PDF automatically. No configuration needed.

What is Fire-PDF?

Fire-PDF is our new Rust-based PDF parsing engine.

It converts any PDF (scanned, text-based, or mixed) into structured markdown. Our open-source Rust library pdf-inspector classifies each page in milliseconds. Text-based pages go straight to native extraction, skipping the GPU entirely. Only scanned or image-heavy content hits the neural layout model and OCR.

The result is clean markdown with correct reading order, preserved tables, formulas in LaTeX, and proper multi-column structure.

How Fire-PDF makes web data extraction 5x faster

Compared to our previous PDF parser, Fire-PDF is 3.5-5.7x faster, averaging under 400ms per page.

Speed comes from two places:

  • First, text-based pages never touch the GPU. They get native extraction through pdf-inspector in milliseconds.
  • Second, the GPU fleet uses lane-based routing to isolate requests by document size, so a 200-page report never impacts latency for a single-page invoice.
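To make lane-based routing concrete, here is a minimal sketch in Python. It is not Fire-PDF's scheduler: the lane names, the page-count thresholds, and the `Job`, `pick_lane`, and `route` names are all hypothetical, chosen only to show how sizing requests into separate queues keeps a long report from sitting in front of a one-page invoice.

```python
from dataclasses import dataclass

# Hypothetical lane boundaries: (maximum page count, lane name).
# The real thresholds are not published; these are placeholders.
LANES = [(5, "small"), (50, "medium")]
DEFAULT_LANE = "large"


@dataclass(frozen=True)
class Job:
    doc_id: str
    pages: int  # document size used for routing


def pick_lane(job: Job) -> str:
    """Return the first lane whose page limit covers the job."""
    for max_pages, lane in LANES:
        if job.pages <= max_pages:
            return lane
    return DEFAULT_LANE


def route(jobs: list[Job]) -> dict[str, list[str]]:
    """Group jobs into per-lane queues so each lane only holds similar sizes."""
    queues: dict[str, list[str]] = {lane: [] for _, lane in LANES}
    queues[DEFAULT_LANE] = []
    for job in jobs:
        queues[pick_lane(job)].append(job.doc_id)
    return queues


queues = route([Job("invoice", 1), Job("report", 200)])
print(queues)
```

Because each lane drains independently, the invoice in the `small` queue never waits behind the 200-page report in the `large` queue.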

Smarter about what hits GPU

Most documents are not fully scanned. A financial report might have 150 text-based pages and 60 scanned ones. With our previous pipeline, all of it went through OCR.

With Fire-PDF, only the pages that need it do.

pdf-inspector is our open-source Rust library that classifies every page by analyzing PDF internals (font encodings, text operators, and image coverage) in milliseconds, without rendering.

  • Text-based pages get instant native extraction.
  • Only scanned or image-heavy pages go through GPU.

For mixed documents, this can eliminate GPU processing for the majority of pages, which translates directly into faster processing and lower cost.
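As a rough illustration of that decision, the sketch below classifies pages from three signals named above and then works through the 150/60 report. It is written in Python for readability and is not pdf-inspector's actual logic: the `PageStats` fields, the `needs_ocr` rule, and the 0.5 image-coverage threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    text_ops: int          # number of text-showing operators found on the page
    has_unicode_map: bool  # whether the fonts carry a usable character mapping
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


def needs_ocr(page: PageStats, max_image_coverage: float = 0.5) -> bool:
    """Hypothetical rule: OCR when text is absent, undecodable, or mostly image."""
    if page.text_ops == 0 or not page.has_unicode_map:
        return True
    return page.image_coverage > max_image_coverage


def gpu_fraction(pages: list[PageStats]) -> float:
    """Share of pages that still need the GPU path."""
    if not pages:
        return 0.0
    return sum(needs_ocr(p) for p in pages) / len(pages)


# The financial report from the text: 150 text-based pages and 60 scanned ones.
report = [PageStats(400, True, 0.05)] * 150 + [PageStats(0, False, 1.0)] * 60
print(f"{gpu_fraction(report):.1%} of pages need GPU")  # 60 / 210 = 28.6%
```

Under these assumptions, about 71% of that report skips the GPU path entirely.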

Layout-aware accuracy for complex documents

Speed alone isn't enough if tables come out garbled or multi-column text comes out of order. Fire-PDF uses a neural document layout model to detect text blocks, tables, formulas, images, headers, and footers individually, then handles each region type correctly.

  • Tables get higher token limits and up to 25 seconds to generate accurate markdown table output
  • Formulas get formula-specific prompts and are preserved in LaTeX
  • Text regions get tight 12-second, 256-token budgets for efficiency
  • Reading order is predicted neurally, with XY-cut projection as a fallback for multi-column layouts
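One way to picture these per-region settings is a small lookup table. The sketch below is illustrative Python, not Fire-PDF's configuration: the 25-second table timeout, the 12-second text timeout, and the 256-token text limit come from the list above, while the table and formula token limits, the formula timeout, and the prompt wording are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionBudget:
    timeout_s: float
    max_tokens: int
    prompt: str


# Text and table timeouts and the 256-token text limit are from the post.
# All other values (token limits for tables and formulas, the formula
# timeout, and the prompt strings) are placeholders for illustration.
BUDGETS = {
    "text": RegionBudget(12.0, 256, "Transcribe the text in this region."),
    "table": RegionBudget(25.0, 2048, "Convert this table to a markdown table."),
    "formula": RegionBudget(12.0, 512, "Transcribe this formula as LaTeX."),
}


def budget_for(region_type: str) -> RegionBudget:
    """Look up the budget for a region type, defaulting to the text budget."""
    return BUDGETS.get(region_type, BUDGETS["text"])


print(budget_for("table"))
```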

Each region gets tuned parameters rather than one-size-fits-all OCR. The difference shows on documents that previously came through as jumbled text. Think: financial tables, academic papers with equations, legal filings with dense columns.
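For readers unfamiliar with XY-cut, here is a minimal sketch of the textbook recursive version in Python. It is a generic illustration, not Fire-PDF's fallback code: the `xy_cut` and `_widest_gap` names, the `min_gap` value, and the choice to cut at the widest whitespace gap on either axis are assumptions.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def _widest_gap(boxes: list[Box], axis: int, min_gap: float):
    """Find the widest empty gap in the projection onto one axis (0 = x, 1 = y).

    Returns (gap, boxes_before, boxes_after), or None if no gap exceeds min_gap.
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    best_gap, best_idx = min_gap, None
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][axis] - reach
        if gap > best_gap:
            best_gap, best_idx = gap, i
        reach = max(reach, ordered[i][axis + 2])
    if best_idx is None:
        return None
    return best_gap, ordered[:best_idx], ordered[best_idx:]


def xy_cut(boxes: list[Box], min_gap: float = 5.0) -> list[Box]:
    """Order boxes by recursively cutting at the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a horizontal band cut (y axis) and a column cut (x axis).
    cuts = [c for c in (_widest_gap(boxes, 1, min_gap), _widest_gap(boxes, 0, min_gap)) if c]
    if not cuts:
        # No clean cut: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    _, first, second = max(cuts, key=lambda c: c[0])
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


header = (50, 50, 560, 80)
left1, left2 = (50, 100, 290, 200), (50, 220, 290, 300)
right1, right2 = (320, 100, 560, 200), (320, 220, 560, 300)
order = xy_cut([right2, left1, header, right1, left2])
print(order == [header, left1, left2, right1, right2])  # True
```

The full-width header blocks any column cut at the top level, so it is split off first. Inside the body, the 30-unit gutter is wider than the 20-unit paragraph spacing, so the columns are separated before the paragraphs within them.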

How the pipeline works

Fire-PDF runs every PDF through five stages:

  1. Classify: pdf-inspector scans the PDF's internal structure in milliseconds, classifying each page as text-based or needing OCR.
  2. Render: Pages needing OCR are rendered to images at 200 DPI. Oversized pages are automatically capped or sliced.
  3. Layout Detection: Rendered images go through a neural document layout model on GPU, returning bounding boxes, element types, and reading order.
  4. Extraction: Text-based pages use native extraction (no GPU). Scanned regions are sent to a vision-language model (GLM-OCR) with task-specific prompts and parameters per region type.
  5. Assembly: Results are sorted by reading order and assembled into markdown. Tables become markdown tables. Formulas are preserved in LaTeX. Geometric deduplication removes overlapping detections.
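The assembly stage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fire-PDF's code: it uses intersection-over-union (IoU) as the overlap test, a hypothetical 0.8 threshold, a hypothetical `Region` record, and `$$` delimiters for formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    order: int      # reading-order index from layout detection
    kind: str       # "text", "table", or "formula"
    box: tuple      # (x0, y0, x1, y1) in page pixels
    content: str    # extracted text, markdown table, or LaTeX
    score: float    # detector confidence


def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dedupe(regions: list[Region], threshold: float = 0.8) -> list[Region]:
    """Keep the highest-scoring region among detections that overlap heavily."""
    kept: list[Region] = []
    for r in sorted(regions, key=lambda r: r.score, reverse=True):
        if all(iou(r.box, k.box) < threshold for k in kept):
            kept.append(r)
    return kept


def assemble(regions: list[Region]) -> str:
    """Dedupe, sort by reading order, and join the regions into markdown."""
    parts = []
    for r in sorted(dedupe(regions), key=lambda r: r.order):
        parts.append(f"$$\n{r.content}\n$$" if r.kind == "formula" else r.content)
    return "\n\n".join(parts)


page = [
    Region(2, "formula", (0, 60, 100, 80), "E = mc^2", 0.95),
    Region(1, "text", (0, 30, 100, 50), "Intro", 0.90),
    Region(1, "text", (1, 30, 100, 50), "Intro (duplicate)", 0.50),
    Region(0, "text", (0, 0, 100, 20), "# Title", 0.99),
]
print(assemble(page))
```

The two "Intro" boxes overlap almost completely, so only the higher-scoring one survives, and the output follows reading order rather than detection order.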

Try it today

Fire-PDF is live for all Firecrawl users. Every PDF you send through the API uses it automatically - no configuration required.
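As a starting point, here is a hedged sketch of sending a PDF URL to the scrape endpoint using only the Python standard library. The endpoint version, the request body fields, and the response shape (`data.markdown`) are assumptions based on the public v1 API, so check the docs before relying on them. `build_scrape_request` and `scrape_pdf_markdown` are hypothetical helper names, and the URL and key you pass in are your own.

```python
import json
import urllib.request

# Assumption: the v1 scrape endpoint. Check the current docs for the latest version.
SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for markdown output."""
    body = json.dumps({"url": pdf_url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_pdf_markdown(pdf_url: str, api_key: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the markdown field, or an empty string if absent."""
    req = build_scrape_request(pdf_url, api_key)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        payload = json.load(resp)
    return payload.get("data", {}).get("markdown", "")
```

Calling `scrape_pdf_markdown("https://example.com/report.pdf", "<your API key>")` would return the parsed markdown. Nothing in the request selects Fire-PDF, which matches the point above that no configuration is required.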

Get started with Fire-PDF · Read the docs
Eric Ciarla @ericciarla
Cofounder of Firecrawl
About the Author
Eric Ciarla is a co-founder of Firecrawl. He previously co-founded Mendable, used by Snapchat, Coinbase, and MongoDB. He's been building products in the AI and data space since 2022.