Agentic Document Extractor

Pricing: Pay per event

Extract RAG-ready chunks with provenance from PDFs, scans, images, DOCX, XLSX, PPTX, CSV, TXT, and Markdown using a local-first Apify Actor.

Developer: Solutions Smart

Maintained by Community

Last modified: 3 months ago

Extract public documents into clean, RAG-ready chunks with provenance.

This Actor downloads documents from public URLs, converts them into normalized semantic blocks, and outputs structured chunks that are ready for vector databases, search pipelines, LLM retrieval, and downstream automation. It is designed for practical ingestion workflows where you want deterministic extraction, traceable source context, and clean machine-readable output instead of raw OCR dumps.

Why use it

Converts common business documents into structured chunks, not just plain text blobs
Preserves provenance with page ranges and bounding boxes when available
Handles mixed document sets in one run
Exposes stable SUMMARY and MANIFEST records for orchestration and monitoring
Works well as a preprocessing step for RAG, indexing, classification, and enrichment pipelines

🧾 Supported formats

PDF
Images: PNG, JPG, JPEG, TIFF, WEBP, GIF
DOCX
XLSX
CSV
PPTX
TXT
Markdown

How extraction works

PDFs use the embedded text layer first for speed and accuracy
Sparse or scanned PDFs can fall back to OCR depending on ocrFallbackMode
Images are processed with OCR
DOCX files are converted into headings, paragraphs, lists, and tables
XLSX and CSV files are converted into sheet-aware table blocks
PPTX files prefer LibreOffice-to-PDF conversion and fall back to XML text extraction when needed
Chunking is deterministic and based on structure, page boundaries, tables, size limits, and overlap
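
The chunking rule in the last item can be illustrated with a minimal Python sketch. It is not the Actor's actual implementation: the Block shape, the greedy packing strategy, the hard split of oversized blocks, and the choice to emit tables as standalone chunks are all assumptions made for illustration. It also only records page ranges and does not break chunks at page boundaries. The two size parameters mirror chunkMaxChars and chunkOverlapChars from the input.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical normalized block: text plus the page it came from."""
    text: str
    page: int
    kind: str = "paragraph"  # "table" blocks are kept out of mixed chunks


def _slices(text, size):
    """Hard-split text into pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_blocks(blocks, max_chars=1800, overlap_chars=200):
    """Greedily pack blocks into chunks of at most max_chars characters.

    When a chunk is closed because the next piece would not fit, its last
    overlap_chars characters seed the following chunk. The output depends
    only on the input, so repeated runs give identical chunks.
    """
    if not 0 <= overlap_chars < max_chars - 1:
        raise ValueError("overlap_chars must be >= 0 and well below max_chars")

    chunks = []
    text, start, end, fresh = "", None, None, False  # fresh: has non-overlap text

    def emit(body, first_page, last_page):
        chunks.append({"chunkIndex": len(chunks), "text": body,
                       "pageStart": first_page, "pageEnd": last_page})

    def close(carry):
        nonlocal text, start, end, fresh
        if fresh:
            emit(text, start, end)
        if carry and overlap_chars > 0:
            # Keep the tail as overlap; it belongs to the last page seen.
            text, start = text[-overlap_chars:], end
        else:
            text, start, end = "", None, None
        fresh = False

    for block in blocks:
        if block.kind == "table":
            close(carry=False)  # a table never shares a chunk with prose
            for piece in _slices(block.text, max_chars):
                emit(piece, block.page, block.page)
            continue
        # Pieces are sized so that overlap + separator + piece always fits.
        for piece in _slices(block.text, max_chars - overlap_chars - 1):
            candidate = f"{text}\n{piece}" if text else piece
            if len(candidate) > max_chars and fresh:
                close(carry=True)
                candidate = f"{text}\n{piece}" if text else piece
            text, end, fresh = candidate, block.page, True
            if start is None:
                start = block.page
    close(carry=False)
    return chunks
```

With the default values, two 1000-character paragraphs on pages 1 and 2 produce two chunks, and the second chunk begins with the last 200 characters of the first.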

🎯 Typical use cases

Preparing document corpora for RAG or vector search
Normalizing invoices, reports, slide decks, and spreadsheets before AI processing
Building ingestion pipelines that need both chunk text and source provenance
Converting legacy documents into structured JSON for automation workflows

📥 Input example

Use the documents field to provide public file URLs, and tune chunking or OCR behavior with the remaining options as needed.

{
  "documents": [
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/skew.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/typewriter.png"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.xlsx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pptx"
    }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": ["eng"],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
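
When the input is assembled programmatically, a small helper can build the same structure and catch obvious mistakes before a run starts. The sketch below is only an illustration: build_run_input is a hypothetical helper, and its two checks (http or https URLs only, overlap smaller than the chunk size) are assumptions rather than documented validation rules of the Actor. The field names and default values are taken from the example above.

```python
import json
from urllib.parse import urlparse


def build_run_input(urls, *, chunk_max_chars=1800, chunk_overlap_chars=200,
                    ocr_languages=("eng",), ocr_fallback_mode="auto",
                    max_concurrency=3, max_pages_per_document=200):
    """Build an input object shaped like the example (hypothetical helper)."""
    for url in urls:
        parts = urlparse(url)
        # Assumption: only plain http(s) URLs are usable, since v1 fetches
        # public URLs without cookies or auth headers.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a public http(s) URL: {url!r}")
    # Assumption: an overlap as large as the chunk itself makes no sense.
    if not 0 <= chunk_overlap_chars < chunk_max_chars:
        raise ValueError("chunkOverlapChars must be smaller than chunkMaxChars")
    return {
        "documents": [{"url": url} for url in urls],
        "maxConcurrency": max_concurrency,
        "ocrLanguages": list(ocr_languages),
        "ocrFallbackMode": ocr_fallback_mode,
        "chunkMaxChars": chunk_max_chars,
        "chunkOverlapChars": chunk_overlap_chars,
        "maxPagesPerDocument": max_pages_per_document,
        "emitMarkdown": True,
        "emitRawText": True,
        "emitBoundingBoxes": True,
    }


if __name__ == "__main__":
    run_input = build_run_input(["https://example.com/report.pdf"])
    print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted into the Apify Console as JSON or, presumably, passed as run_input when starting the Actor through the Apify API client for Python.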

📤 Output

The Actor writes one dataset item per chunk and also stores two stable records in the default key-value store:

SUMMARY for run-level metrics 📊
MANIFEST for per-document status, warnings, and failure reporting 🗂️

Each chunk item includes:

documentId, sourceUrl, fileType
chunkId, chunkIndex, chunkType
text, markdown
pageStart, pageEnd
sectionPath
bbox
charCount, tokenEstimate
language
extractionMode

🧩 Example dataset item

{
  "documentId": "caa40e3b17148c75",
  "sourceUrl": "https://example.com/report.pdf",
  "fileType": "pdf",
  "chunkId": "caa40e3b17148c75-1",
  "chunkIndex": 0,
  "chunkType": "page",
  "text": "Quarterly revenue report...",
  "markdown": "Quarterly revenue report...",
  "pageStart": 1,
  "pageEnd": 2,
  "sectionPath": ["Executive Summary"],
  "bbox": {
    "pageNumber": 1,
    "x": 90,
    "y": 71.28,
    "width": 431.88,
    "height": 68.16
  },
  "charCount": 324,
  "tokenEstimate": 81,
  "language": "eng",
  "extractionMode": "text_layer"
}
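
A downstream pipeline typically turns each dataset item into a record for a vector database and keeps the provenance fields as metadata. The following Python sketch shows one possible mapping. The function names and the record layout are hypothetical, and bbox is read defensively because bounding boxes are only present when available.

```python
def format_citation(item):
    """Build a human-readable source reference from a chunk item."""
    start, end = item["pageStart"], item["pageEnd"]
    pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
    label = f"{item['sourceUrl']} ({pages})"
    section = " > ".join(item.get("sectionPath") or [])
    return f"{label}, section: {section}" if section else label


def to_vector_record(item):
    """Map a chunk item to a generic id/text/metadata record.

    The layout is an assumption; adapt it to the vector store in use.
    """
    return {
        "id": item["chunkId"],
        "text": item["text"],
        "metadata": {
            "documentId": item["documentId"],
            "sourceUrl": item["sourceUrl"],
            "pageStart": item["pageStart"],
            "pageEnd": item["pageEnd"],
            "sectionPath": item.get("sectionPath") or [],
            "bbox": item.get("bbox"),  # may be missing or None
            "citation": format_citation(item),
        },
    }
```

In the example item, tokenEstimate (81) equals charCount (324) divided by 4, which is consistent with a simple characters-per-token heuristic, although the formula itself is not documented here.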

🛠️ Operational notes

Public URLs only in v1
Runs are deterministic and do not require an LLM provider
OCR quality depends on the source file and available OCR tooling
PPTX conversion uses LibreOffice when available and falls back gracefully when it is not

🚧 Current limitations

Public URLs only in v1. No cookies, auth headers, or private file fetch support.
Advanced form semantics, checkbox state extraction, and layout-aware table reconstruction are intentionally limited.
Scanned PDF OCR depends on rasterization tooling being available.

Price

The Actor charges only after a document has been extracted successfully. Once the charge limit for a configured event is reached, it stops starting new documents.
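
The charging behavior described above can be pictured with a short Python sketch. This is an illustration only, not the Actor's code: the function, the status labels, and the single event counter are hypothetical, and on the Apify platform the actual charging is handled by the platform rather than by a local counter.

```python
def process_with_charge_limit(documents, extract, max_charged_events):
    """Process documents in order, charging one event per success.

    extract: callable that raises on failure (hypothetical stand-in for
    the real extraction step).
    Returns (number_of_charged_events, [(document, status), ...]).
    """
    charged = 0
    results = []
    for doc in documents:
        if charged >= max_charged_events:
            # Limit reached: do not start any further documents.
            results.append((doc, "skipped_charge_limit"))
            continue
        try:
            extract(doc)
        except Exception:
            # Failed extraction: recorded, but nothing is charged.
            results.append((doc, "failed"))
            continue
        charged += 1  # charge only after a successful extraction
        results.append((doc, "succeeded"))
    return charged, results
```

For example, with a limit of two events and a failing second document, the first and third documents are charged and any later ones are skipped.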

URL: https://apify.com/solutionssmart/agentic-document-extractor-local