URL: https://apify.com/solutionssmart/agentic-document-extractor-local

Document Extractor for RAG, OCR & AI Pipelines · Apify


Pricing

Pay per event

Agentic Document Extractor

Extract RAG-ready chunks with provenance from PDFs, scans, images, DOCX, XLSX, PPTX, CSV, TXT, and Markdown using a local-first Apify Actor.

Developer

πŸ‘ Solutions Smart

Solutions Smart

Maintained by Community

Last modified: 3 months ago

Agentic Document Extractor

Extract public documents into clean, RAG-ready chunks with provenance.

This Actor downloads documents from public URLs, converts them into normalized semantic blocks, and outputs structured chunks that are ready for vector databases, search pipelines, LLM retrieval, and downstream automation. It is designed for practical ingestion workflows where you want deterministic extraction, traceable source context, and clean machine-readable output instead of raw OCR dumps.

Why use it

  • Converts common business documents into structured chunks, not just plain text blobs
  • Preserves provenance with page ranges and bounding boxes when available
  • Handles mixed document sets in one run
  • Exposes stable SUMMARY and MANIFEST records for orchestration and monitoring
  • Works well as a preprocessing step for RAG, indexing, classification, and enrichment pipelines

🧾 Supported formats

  • PDF
  • Images: PNG, JPG, JPEG, TIFF, WEBP, GIF
  • DOCX
  • XLSX
  • CSV
  • PPTX
  • TXT
  • Markdown

How extraction works

  • PDFs use the embedded text layer first for speed and accuracy
  • Sparse or scanned PDFs can fall back to OCR depending on ocrFallbackMode
  • Images are processed with OCR
  • DOCX files are converted into headings, paragraphs, lists, and tables
  • XLSX and CSV files are converted into sheet-aware table blocks
  • PPTX files prefer LibreOffice-to-PDF conversion and fall back to XML text extraction when needed
  • Chunking is deterministic and based on structure, page boundaries, tables, size limits, and overlap
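
As a rough illustration of the size-limit and overlap part of chunking, here is a minimal sketch. It is not the Actor's implementation: the function name `chunk_text` is hypothetical, and the sketch ignores structure, page boundaries, and tables, which the Actor also takes into account. The defaults mirror the `chunkMaxChars` and `chunkOverlapChars` values from the input example.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1800, overlap_chars: int = 200) -> List[str]:
    """Split text into character windows that share `overlap_chars` characters.

    Hypothetical helper for illustration only. The Actor's real chunker also
    considers structure, page boundaries, and tables, which are not modelled.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")

    step = max_chars - overlap_chars  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(4000))
    for index, chunk in enumerate(chunk_text(sample)):
        print(index, len(chunk))
```

Because the windows depend only on the input text and the two parameters, repeated runs over the same text yield the same chunks, which is the property the word "deterministic" refers to here.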

🎯 Typical use cases

  • Preparing document corpora for RAG or vector search
  • Normalizing invoices, reports, slide decks, and spreadsheets before AI processing
  • Building ingestion pipelines that need both chunk text and source provenance
  • Converting legacy documents into structured JSON for automation workflows

📥 Input example

Use the documents array to provide public file URLs, and tune chunking or OCR behavior with the remaining fields as needed.

{
  "documents": [
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/skew.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/typewriter.png"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.xlsx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pptx"
    }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": ["eng"],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
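
When the input is generated by a script rather than typed by hand, a small builder can catch obvious mistakes before a run starts. The sketch below is an assumption-laden example: `build_run_input` is a hypothetical helper, and its two checks (http or https URLs only, overlap smaller than the chunk size) are sanity checks chosen for illustration, not validation rules documented by the Actor. The field names and default values are copied from the input example above.

```python
import json
from typing import Any, Dict, Iterable
from urllib.parse import urlparse


def build_run_input(
    urls: Iterable[str],
    chunk_max_chars: int = 1800,
    chunk_overlap_chars: int = 200,
) -> Dict[str, Any]:
    """Assemble an input object shaped like the example above.

    Hypothetical helper: the checks are illustrative assumptions,
    not rules published by the Actor.
    """
    documents = []
    for url in urls:
        parsed = urlparse(url)
        # The Actor accepts public URLs only, so reject anything that is
        # clearly not an http(s) address.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        documents.append({"url": url})
    if not documents:
        raise ValueError("at least one document URL is required")
    if chunk_overlap_chars >= chunk_max_chars:
        raise ValueError("chunk_overlap_chars should be smaller than chunk_max_chars")

    return {
        "documents": documents,
        "maxConcurrency": 3,
        "ocrLanguages": ["eng"],
        "ocrFallbackMode": "auto",
        "chunkMaxChars": chunk_max_chars,
        "chunkOverlapChars": chunk_overlap_chars,
        "maxPagesPerDocument": 200,
        "emitMarkdown": True,
        "emitRawText": True,
        "emitBoundingBoxes": True,
    }


if __name__ == "__main__":
    run_input = build_run_input(["https://example.com/report.pdf"])
    print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the Actor's input editor or passed to whichever client is used to start the run.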

📤 Output

The Actor writes one dataset item per chunk and also stores two stable records in the default key-value store:

  • SUMMARY for run-level metrics 📊
  • MANIFEST for per-document status, warnings, and failure reporting 🗂️

Each chunk item includes:

  • documentId, sourceUrl, fileType
  • chunkId, chunkIndex, chunkType
  • text, markdown
  • pageStart, pageEnd
  • sectionPath
  • bbox
  • charCount, tokenEstimate
  • language
  • extractionMode

🧩 Example dataset item

{
  "documentId": "caa40e3b17148c75",
  "sourceUrl": "https://example.com/report.pdf",
  "fileType": "pdf",
  "chunkId": "caa40e3b17148c75-1",
  "chunkIndex": 0,
  "chunkType": "page",
  "text": "Quarterly revenue report...",
  "markdown": "Quarterly revenue report...",
  "pageStart": 1,
  "pageEnd": 2,
  "sectionPath": ["Executive Summary"],
  "bbox": {
    "pageNumber": 1,
    "x": 90,
    "y": 71.28,
    "width": 431.88,
    "height": 68.16
  },
  "charCount": 324,
  "tokenEstimate": 81,
  "language": "eng",
  "extractionMode": "text_layer"
}
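
One way to use an item like this downstream is to turn it into a record for a vector store while keeping the provenance fields as metadata. The following is a minimal sketch under stated assumptions: `format_citation` and `to_vector_record` are hypothetical helpers, and the `id` / `text` / `metadata` record shape is a generic convention rather than something the Actor or any particular vector database prescribes. Only field names shown in the example item are read.

```python
from typing import Any, Dict


def format_citation(item: Dict[str, Any]) -> str:
    """Build a human-readable source reference from a chunk item (hypothetical helper)."""
    parts = [item["sourceUrl"]]
    start, end = item.get("pageStart"), item.get("pageEnd")
    if start is not None:
        # Single page if there is no distinct end page, otherwise a page range.
        parts.append(f"p. {start}" if end in (None, start) else f"pp. {start}-{end}")
    section = " > ".join(item.get("sectionPath") or [])
    if section:
        parts.append(section)
    return ", ".join(parts)


def to_vector_record(item: Dict[str, Any]) -> Dict[str, Any]:
    """Map a chunk item to a generic id/text/metadata record (assumed shape)."""
    return {
        "id": item["chunkId"],
        "text": item["text"],
        "metadata": {
            "documentId": item["documentId"],
            "sourceUrl": item["sourceUrl"],
            "fileType": item["fileType"],
            "pageStart": item.get("pageStart"),
            "pageEnd": item.get("pageEnd"),
            "extractionMode": item.get("extractionMode"),
            "bbox": item.get("bbox"),  # passed through unchanged; may be absent
            "citation": format_citation(item),
        },
    }


if __name__ == "__main__":
    example_item = {
        "documentId": "caa40e3b17148c75",
        "sourceUrl": "https://example.com/report.pdf",
        "fileType": "pdf",
        "chunkId": "caa40e3b17148c75-1",
        "text": "Quarterly revenue report...",
        "pageStart": 1,
        "pageEnd": 2,
        "sectionPath": ["Executive Summary"],
        "extractionMode": "text_layer",
    }
    print(to_vector_record(example_item)["metadata"]["citation"])
```

Keeping the citation next to the embedded text lets a retrieval step show where an answer came from without a second lookup.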

πŸ› οΈ Operational notes

  • Public URLs only in v1
  • Runs are deterministic and do not require an LLM provider
  • OCR quality depends on the source file and available OCR tooling
  • PPTX conversion uses LibreOffice when available and falls back gracefully when it is not

🚧 Current limitations

  • Public URLs only in v1. No cookies, auth headers, or private file fetch support.
  • Advanced form semantics, checkbox state extraction, and layout-aware table reconstruction are intentionally limited.
  • Scanned PDF OCR depends on rasterization tooling being available.

Price

The Actor charges only after a document has been extracted successfully. Once the charge limit for a configured event is reached, it does not start any further documents.
