PDF to Markdown RAG-Ready

Premium PDF scraper that preserves tables and structure. Optimized for RAG.

Pricing

From $1.00 per 1,000 RAG-ready chunks

Developer

Dmitry Goncharov

Maintained by Community

Actor stats

Last modified

6 months ago

PDF to Markdown RAG-Ready Scraper

🚀 Convert complex PDF documents into clean, structured Markdown — perfectly optimized for RAG pipelines, LLM fine-tuning, and AI agents.

Why This Actor?

Extracting text from PDFs is easy, but extracting meaning is hard. This Actor is specifically tuned for the needs of modern AI:

Feature	Standard PDF Parsers	This Actor
Table Preservation	❌ Scrambled text	✅ Structured Markdown tables
Hierarchical Headings	❌ Flat text	✅ Nested sections (H1-H6)
Semantic Chunking	❌ Arbitrary splits	✅ Context-aware RAG chunks
Metadata Extraction	❌ Minimal	✅ Author, Title, Creator, Dates
RAG-Ready Output	❌ Full file only	✅ Chunked JSON for Vector DBs

🎯 RAG-Ready Output

Every PDF is broken down into semantically coherent chunks, ready to be indexed into Chroma, Pinecone, or Weaviate:

{
  "url": "https://example.com/report.pdf",
  "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
  "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
  "docMetadata": {
    "title": "Annual Report 2024",
    "author": "Corporate Strategy Team",
    "pageCount": 42
  }
}
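To make the idea of heading-aware chunks concrete, here is a minimal sketch that splits Markdown into records shaped like the example above. It is not the Actor's actual implementation: the function name `chunk_markdown`, the regex, and the size heuristic are illustrative assumptions, and it ignores edge cases such as `#` lines inside code blocks.

```python
import re

# ATX-style Markdown headings: one to six "#" characters, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, url: str, chunk_size: int = 1000) -> list[dict]:
    """Hypothetical helper: split Markdown into chunks that carry their heading trail."""
    records: list[dict] = []
    trail: list[tuple[int, str]] = []  # (level, title) of the currently open headings
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, unless they hold only headings or blanks.
        has_body = any(ln.strip() and not HEADING_RE.match(ln) for ln in buffer)
        text = "\n".join(buffer).strip()
        buffer.clear()
        if has_body:
            records.append({
                "url": url,
                "chunk": text,
                "headings": [title for _, title in trail],
            })

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before the trail changes
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then open the new one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        elif buffer and sum(len(ln) + 1 for ln in buffer) + len(line) > chunk_size:
            flush()  # size limit reached inside a section
        buffer.append(line)
    flush()
    return records


if __name__ == "__main__":
    sample = "## 3. Financial Growth\n\n### 3.1 Quarterly Results\nOur revenue grew by 15%."
    for record in chunk_markdown(sample, "https://example.com/report.pdf"):
        print(record["headings"], "->", record["chunk"])
```

Splitting only at headings and line boundaries keeps each chunk inside a single section, so the `headings` list stays valid for every chunk. A single line longer than `chunk_size` is left whole in this sketch.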

Key Features

Structural Integrity: Preserves bold text, lists, and hierarchical structure.
Premium OCR: Handles scanned PDFs and image-heavy documents (optional).
Embedded Tables: Converts complex PDF tables into clean Markdown format.
Smart Metadata: Automatically extracts document info for better context in RAG.
Pay-Per-Event: No fixed monthly costs. You pay only for what you process.

🔗 LangChain Integration (Python)

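# Requires: pip install langchain-community apify-client
# The loader reads the API token from the APIFY_API_TOKEN environment variable.
# Older LangChain releases exposed these classes under langchain.document_loaders
# and langchain.docstore.document.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    # Map each dataset item (one RAG chunk) to a LangChain Document.
    dataset_mapping_function=lambda item: Document(
        page_content=item["chunk"],
        metadata={
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            # Assumption: docMetadata may be missing, e.g. with includeMetadata disabled.
            **item.get("docMetadata", {}),
        },
    ),
)
docs = loader.load()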
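If you index the chunks without LangChain, a small adapter can turn each dataset item into an ID, text, and flat metadata triple. The sketch below assumes the item shape from the RAG-Ready Output example. The helper name `to_vector_record` and the hashing scheme are illustrative choices, not part of the Actor. Metadata is flattened to scalars because several vector stores reject nested values.

```python
import hashlib


def to_vector_record(item: dict) -> dict:
    """Hypothetical adapter: dataset item -> {id, text, metadata} for a vector store."""
    metadata: dict = {}
    # Copy document-level metadata first, coercing non-scalar values to strings.
    for key, value in item.get("docMetadata", {}).items():
        if isinstance(value, (str, int, float, bool)):
            metadata[key] = value
        elif value is not None:
            metadata[key] = str(value)
    # Set these last so document metadata cannot overwrite them.
    metadata["source"] = item["url"]
    metadata["headings"] = " > ".join(item.get("headings", []))

    # Deterministic ID: re-indexing the same chunk overwrites instead of duplicating.
    digest = hashlib.sha256(f'{item["url"]}\n{item["chunk"]}'.encode("utf-8")).hexdigest()
    return {"id": digest[:32], "text": item["chunk"], "metadata": metadata}


if __name__ == "__main__":
    example = {
        "url": "https://example.com/report.pdf",
        "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
        "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
        "docMetadata": {"title": "Annual Report 2024", "pageCount": 42},
    }
    print(to_vector_record(example))
```

The resulting `id`, `text`, and `metadata` fields can then be passed to whichever upsert call your vector database client provides.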

Input Parameters

Field	Type	Description
`urls`	Array	List of PDF URLs to process
`chunkSize`	Number	Maximum characters per semantic chunk (default: 1000)
`enableChunking`	Boolean	Whether to split document into RAG chunks
`includeMetadata`	Boolean	Include original PDF metadata in output
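As a rough illustration of how these parameters fit together, the sketch below builds a run input and starts the Actor with the `apify-client` Python package. The helper names, the `True` defaults for the two Boolean fields, and the Actor ID placeholder are assumptions for this example. Only the field names and the `chunkSize` default of 1000 come from the table above.

```python
import os


def build_run_input(
    urls: list[str],
    chunk_size: int = 1000,
    enable_chunking: bool = True,   # assumed default, not documented above
    include_metadata: bool = True,  # assumed default, not documented above
) -> dict:
    """Hypothetical helper: validate arguments and build the Actor input object."""
    if not urls:
        raise ValueError("urls must contain at least one PDF URL")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of characters")
    return {
        "urls": list(urls),
        "chunkSize": chunk_size,
        "enableChunking": enable_chunking,
        "includeMetadata": include_metadata,
    }


def run_actor(actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_API_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    run_input = build_run_input(["https://example.com/report.pdf"], chunk_size=800)
    # Replace the placeholder with this Actor's ID from its Apify Store page.
    items = run_actor("<username>/<actor-name>", run_input)
    print(f"Received {len(items)} chunks")
```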

Pricing

Pay per Event:

Actor Start: $0.01 per GB of memory
RAG-Ready Chunk: $0.001 per extracted chunk
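For budgeting, the two events above can be combined into a simple estimate. This sketch assumes they are the only charges for a run and that the start event is billed once per GB of memory allocated to it. The function name is illustrative.

```python
from decimal import Decimal

START_PRICE_PER_GB = Decimal("0.01")  # "Actor Start" event, per GB of memory
CHUNK_PRICE = Decimal("0.001")        # "RAG-Ready Chunk" event, per extracted chunk


def estimate_run_cost(chunks: int, memory_gb: float = 1) -> Decimal:
    """Estimated cost in USD for one run, under the assumptions stated above."""
    if chunks < 0 or memory_gb <= 0:
        raise ValueError("chunks must be >= 0 and memory_gb must be > 0")
    return START_PRICE_PER_GB * Decimal(str(memory_gb)) + CHUNK_PRICE * chunks


if __name__ == "__main__":
    # 1,000 chunks on a 1 GB run: 0.01 + 1.00 USD
    print(estimate_run_cost(1000, memory_gb=1))
```

`Decimal` is used instead of floats so that small per-chunk prices add up without rounding drift.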

Author

Built with ❤️ by HEDELKA for the AI Engineering community.

Questions? Open a GitHub issue or contact us on the Apify platform.
