URL: https://www.firecrawl.dev/blog/introducing-parse

Introducing /parse: Turn any document into LLM-ready data


Eric Ciarla, Apr 28, 2026
๐Ÿ‘ Introducing /parse: Turn any document into LLM-ready data image

With Firecrawl, you can already pull clean markdown from any URL, including PDFs hosted on the web. But a lot of the documents you need to process (contracts, reports, invoices, uploaded files) live on disk, not on the web. Today we're launching /parse, so you can upload files directly and get back the same clean, structured output Firecrawl returns for web pages.

What is Firecrawl /parse?

/parse runs local files through the same parsing engine that powers /scrape. PDFs come back with reading order preserved and tables intact. Word docs shed their XML noise. Spreadsheets become clean tabular markdown. You can ask for a summary or structured JSON extraction in the same call. No post-processing needed.

Supported formats: PDF, DOCX, DOC, ODT, RTF, XLSX, XLS, and HTML. Files up to 50 MB.
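
Since the format list and size cap are fixed, you can catch most rejections before uploading. The following is a minimal sketch of a client-side check, not part of any Firecrawl SDK: `preflight`, `SUPPORTED_EXTENSIONS`, and `MAX_BYTES` are hypothetical names, and it assumes that "50 MB" means 50 * 1024 * 1024 bytes and that the file extension reflects the actual file type.

```python
from pathlib import Path

# Extensions derived from the supported-formats list above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".odt", ".rtf", ".xlsx", ".xls", ".html"}
# Assumption: the 50 MB limit is counted in binary megabytes.
MAX_BYTES = 50 * 1024 * 1024


def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems
```

In an upload handler you might call `preflight(path.name, path.stat().st_size)` and only send the request when the returned list is empty. The server remains the source of truth, so treat this as a way to fail fast rather than as a guarantee.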

A Rust-based engine that's up to 5x faster

Under the hood, /parse is powered by a Rust-based engine averaging under 400ms per page. Instead of routing every document through OCR, it classifies pages first and only sends what actually needs it to the GPU.

  • Native extraction for text-based pages. Our open-source Rust library pdf-inspector reads PDF internals (fonts, text operators, image coverage) to pull text directly in milliseconds, without rendering.
  • GPU only where it matters. Scanned and image-heavy pages get routed through a GPU fleet with lane-based isolation, so a 200-page report never slows down a single-page invoice.
  • Layout-aware accuracy. A neural layout model detects tables, formulas, text blocks, and headers individually, then tunes parameters per region. Tables get higher token budgets, formulas are preserved in LaTeX, and reading order is predicted neurally for multi-column documents.
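
To make the classify-then-route idea concrete, here is a toy sketch in Python. It is not how the Rust engine or pdf-inspector is implemented: the `PageStats` fields, the `route_page` and `route_document` helpers, and both thresholds are invented for illustration. It only shows the general shape of deciding per page whether native text extraction is enough or OCR is needed.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page signals; the real engine's inputs are not documented here."""
    extractable_chars: int  # characters recoverable directly from the page's text operators
    image_coverage: float   # fraction of the page area covered by images, 0.0 to 1.0


def route_page(stats: PageStats, min_chars: int = 50, max_image_coverage: float = 0.8) -> str:
    """Return "native" for text-based pages and "ocr" for scanned or image-heavy ones.

    Both thresholds are illustrative guesses, not values from Firecrawl.
    """
    if stats.extractable_chars >= min_chars and stats.image_coverage <= max_image_coverage:
        return "native"
    return "ocr"


def route_document(pages: list[PageStats]) -> dict[str, list[int]]:
    """Group zero-based page indices by the lane they would be sent to."""
    lanes: dict[str, list[int]] = {"native": [], "ocr": []}
    for index, stats in enumerate(pages):
        lanes[route_page(stats)].append(index)
    return lanes
```

For a three-page file where only the first page carries real text, `route_document` would put page 0 in the native lane and pages 1 and 2 in the OCR lane, so only two pages would ever reach a GPU.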

How /parse makes document processing easier

One pipeline for web pages and files

If you're already using Firecrawl to research the web, your pipeline can now also read email attachments, downloaded reports, and user-uploaded files.

import json

import requests

# Upload a local file as multipart form data; the parse options travel as a JSON string.
with open("contract.pdf", "rb") as f:
    response = requests.post(
        "https://api.firecrawl.dev/v2/parse",
        headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
        files={"file": f},
        data={
            "options": json.dumps({
                # Ask for clean markdown plus structured JSON extraction in one call.
                "formats": ["markdown", "json"],
                "json": {
                    "schema": {
                        "type": "object",
                        "properties": {
                            "parties": {"type": "array", "items": {"type": "string"}},
                            "effective_date": {"type": "string"},
                            "total_value": {"type": "string"}
                        }
                    }
                }
            })
        }
    )

# Each requested format comes back as a key under "data".
data = response.json()["data"]
print(data["markdown"])
print(data["json"])

Structured extraction from internal documents

Pass a JSON schema alongside your file and /parse returns typed fields like line items, dates, parties, and totals in a single call. On Enterprise plans with Zero Data Retention (ZDR) enabled, parsed output is never stored, so data from contracts, medical records, and internal reports stays secure.
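
As a rough sketch of what a schema with line items could look like, the helper below builds an `options` string in the same shape as the contract example above (a "json" format plus a nested schema). `build_extraction_options` is a hypothetical helper, the invoice field names are made up, and whether the service enforces the JSON Schema `required` keyword is an assumption to verify against the docs.

```python
import json
from typing import Optional


def build_extraction_options(properties: dict, required: Optional[list] = None) -> str:
    """Serialize an `options` form field that requests JSON extraction.

    Mirrors the shape of the contract example: formats plus json.schema.
    """
    schema = {"type": "object", "properties": properties}
    if required:
        # Standard JSON Schema keyword; server-side enforcement is an assumption.
        schema["required"] = required
    return json.dumps({"formats": ["json"], "json": {"schema": schema}})


# Hypothetical invoice schema; none of these field names come from Firecrawl.
invoice_properties = {
    "invoice_date": {"type": "string"},
    "total": {"type": "string"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "string"},
            },
        },
    },
}

options = build_extraction_options(invoice_properties, required=["total"])
```

The resulting string can be passed as `data={"options": options}` in the upload request. Keeping amounts as strings, as the contract example does with `total_value`, avoids guessing at currency formatting during extraction.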

RAG ingestion for user uploads

When users upload PDFs or DOCX files to your app, /parse turns them into embedding-ready markdown in one call. Structure is preserved, tables stay intact, and a summary comes back in the same response, ready to chunk and send to your vector store.
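
The chunking step happens on your side. Here is one possible sketch that splits the returned markdown at headings and packs sections into chunks under a character budget. `chunk_markdown` and the 1500-character default are illustrative choices, not a Firecrawl feature, and the heading split is naive: a line starting with `# ` inside a code block would also be treated as a heading.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split markdown at headings, then pack sections into chunks of at most max_chars.

    A single section longer than max_chars is kept whole rather than cut mid-table.
    """
    # Split before every line that starts with 1-6 '#' characters followed by a space.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    chunks: list[str] = []
    current = ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        if current and len(candidate) > max_chars:
            # Adding this section would exceed the budget, so close the current chunk.
            chunks.append(current)
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

You would call it as `chunk_markdown(data["markdown"])` and embed each returned string. Splitting on headings keeps a table together with the heading that introduces it, which tends to matter more for retrieval quality than hitting an exact chunk size.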

A few things to know

  • 50 MB limit, fixed file types. HTML, PDF, DOCX, DOC, ODT, RTF, XLSX, and XLS are supported. Other formats return an UNSUPPORTED_FILE_TYPE error.
  • Every call re-parses. Results are never cached, so repeat uploads of the same file are billed each time. Billing follows the same credit model as /scrape: one call plus any LLM formats you request.
  • Scanned PDFs depend on scan quality. Image-only PDFs go through OCR. Clean scans parse cleanly; low-resolution or handwritten scans produce lower-quality output.
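
Because repeat uploads are billed each time, you may want to deduplicate on your side. Below is a minimal sketch of an in-memory cache keyed by a hash of the file bytes and the options string. `ParseCache` and the `parse_fn` callable are hypothetical; `parse_fn` stands in for whatever function performs your actual upload.

```python
import hashlib
from typing import Callable


class ParseCache:
    """In-memory cache keyed by the SHA-256 of the file bytes plus the options string."""

    def __init__(self, parse_fn: Callable[[bytes, str], dict]):
        # parse_fn is a hypothetical callable that performs the real upload.
        self._parse_fn = parse_fn
        self._results: dict[str, dict] = {}
        self.billed_calls = 0

    def parse(self, content: bytes, options: str = "") -> dict:
        # Hash the content first so the key has a fixed-length prefix before the options.
        content_digest = hashlib.sha256(content).digest()
        key = hashlib.sha256(content_digest + options.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = self._parse_fn(content, options)
            self.billed_calls += 1
        return self._results[key]
```

Including the options in the key means the same file requested with a different schema still triggers a fresh call. Note that storing parsed output locally works against the goal of Zero Data Retention, so this only makes sense for documents you are allowed to keep.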

Try it today

/parse is available now for all Firecrawl API users. Send it a document, get back clean context your agents can use.

Get started with /parse · Read the docs

๐Ÿ‘ placeholder
About the Author
Eric Ciarla (@ericciarla), co-founder of Firecrawl
Eric Ciarla is a co-founder of Firecrawl. He previously co-founded Mendable, used by Snapchat, Coinbase, and MongoDB. He's been building products in the AI and data space since 2022.