URL: https://apify.com/hedelka/pdf-to-markdown-rag

PDF to Markdown RAG-Ready · Apify


Pricing

from $1.00 / 1,000 rag-ready chunks

PDF to Markdown RAG-Ready

Premium PDF scraper that preserves tables and structure. Optimized for RAG.

Rating: 0.0 (0)

Developer: Dmitry Goncharov

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 10
  • Monthly active users: 2
  • Last modified: 6 months ago

PDF to Markdown RAG-Ready Scraper

🚀 Convert complex PDF documents into clean, structured Markdown, optimized for RAG pipelines, LLM fine-tuning, and AI agents.

Why This Actor?

Extracting text from PDFs is easy, but extracting meaning is hard. This Actor is specifically tuned for the needs of modern AI:

| Feature | Standard PDF Parsers | This Actor |
| --- | --- | --- |
| Table Preservation | ❌ Scrambled text | ✅ Structured Markdown tables |
| Hierarchical Headings | ❌ Flat text | ✅ Nested sections (H1-H6) |
| Semantic Chunking | ❌ Arbitrary splits | ✅ Context-aware RAG chunks |
| Metadata Extraction | ❌ Minimal | ✅ Author, Title, Creator, Dates |
| RAG-Ready Output | ❌ Full file only | ✅ Chunked JSON for Vector DBs |
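
To make the "Semantic Chunking" row more concrete, here is a minimal sketch of heading-aware chunking. It is not this Actor's implementation, which the page does not describe; the function name `chunk_markdown` and the splitting rules are assumptions made purely for illustration.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str, chunk_size: int = 1000) -> list[dict]:
    """Split Markdown into chunks that never cross a heading boundary.

    Hypothetical illustration: each chunk records the heading trail that
    was active where the chunk starts. A single line longer than
    chunk_size is kept whole rather than split.
    """
    chunks: list[dict] = []
    trail: list[tuple[int, str]] = []  # (level, title) of open headings
    buffer: list[str] = []             # lines of the chunk being built
    buffer_trail: list[str] = []       # heading titles for that chunk

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"chunk": text, "headings": list(buffer_trail)})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # a heading always starts a new chunk
            level, title = len(match.group(1)), match.group(2)
            # Close headings at the same or a deeper level, then open this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, title))
            buffer_trail[:] = [name for _, name in trail]
        elif buffer and sum(len(part) + 1 for part in buffer) + len(line) > chunk_size:
            flush()  # adding this line would exceed the size limit
        buffer.append(line)
    flush()
    return chunks


sample = "# A\nintro\n## B\nbody one\nbody two\n# C\nend"
for piece in chunk_markdown(sample):
    print(piece["headings"], repr(piece["chunk"]))
```

Splitting on headings first and on size second keeps each chunk inside one section, which is what lets a heading trail be attached to it.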

🎯 RAG-Ready Output

Every PDF is broken down into semantically coherent chunks, ready to be indexed into Chroma, Pinecone, or Weaviate:

{
  "url": "https://example.com/report.pdf",
  "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
  "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
  "docMetadata": {
    "title": "Annual Report 2024",
    "author": "Corporate Strategy Team",
    "pageCount": 42
  }
}
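
As a sketch of how a record like the one above might be prepared for a vector store, the snippet below parses it into a small dataclass and derives an ID, the text to embed, and flat metadata. The `ChunkRecord` class, the ID scheme, and the breadcrumb format are assumptions for illustration, not part of the Actor's output contract.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """Hypothetical wrapper around one output item."""

    url: str
    chunk: str
    headings: list[str] = field(default_factory=list)
    doc_metadata: dict = field(default_factory=dict)

    @classmethod
    def from_item(cls, item: dict) -> "ChunkRecord":
        # Assumption: headings and docMetadata may be absent, so default them.
        return cls(
            url=item["url"],
            chunk=item["chunk"],
            headings=list(item.get("headings") or []),
            doc_metadata=dict(item.get("docMetadata") or {}),
        )

    def record_id(self) -> str:
        # Deterministic ID: re-indexing the same chunk yields the same key.
        digest = hashlib.sha256(f"{self.url}\n{self.chunk}".encode("utf-8"))
        return digest.hexdigest()[:16]

    def embedding_text(self) -> str:
        # Prefix the heading trail so the embedded text carries section context.
        breadcrumb = " > ".join(self.headings)
        return f"{breadcrumb}\n\n{self.chunk}" if breadcrumb else self.chunk

    def flat_metadata(self) -> dict:
        # Join headings into one string, since many stores expect scalar values.
        return {
            "source": self.url,
            "headings": " > ".join(self.headings),
            **self.doc_metadata,
        }


# Raw string so the JSON escape \n reaches json.loads unchanged.
raw = r"""
{
  "url": "https://example.com/report.pdf",
  "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
  "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
  "docMetadata": {"title": "Annual Report 2024",
                  "author": "Corporate Strategy Team", "pageCount": 42}
}
"""

record = ChunkRecord.from_item(json.loads(raw))
print(record.record_id())
print(record.embedding_text())
print(record.flat_metadata())
```

The ID, embedding text, and metadata can then be passed to whichever client library the chosen vector database provides.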

Key Features

  • Structural Integrity: Preserves bold text, lists, and hierarchical structure.
  • Premium OCR: Handles scanned PDFs and image-heavy documents (optional).
  • Embedded Tables: Converts complex PDF tables into clean Markdown format.
  • Smart Metadata: Automatically extracts document info for better context in RAG.
  • Pay-Per-Event: No fixed monthly costs. You pay only for what you process.

🔗 LangChain Integration (Python)

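# Requires the langchain-community and apify-client packages.
# Older LangChain releases exposed these as langchain.document_loaders
# and langchain.docstore.document.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    # Map each dataset item (one chunk) to a LangChain Document.
    dataset_mapping_function=lambda item: Document(
        page_content=item["chunk"],
        metadata={
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            # docMetadata may be missing if includeMetadata is off.
            **item.get("docMetadata", {}),
        },
    ),
)
docs = loader.load()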

Input Parameters

| Field | Type | Description |
| --- | --- | --- |
| urls | Array | List of PDF URLs to process |
| chunkSize | Number | Maximum characters per semantic chunk (default: 1000) |
| enableChunking | Boolean | Whether to split the document into RAG chunks |
| includeMetadata | Boolean | Include original PDF metadata in output |
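
The parameters above might be assembled into a run input as follows. This is a sketch under stated assumptions: `build_run_input` and `run_actor` are hypothetical helpers, the Actor ID is inferred from the store URL, and the defaults chosen for the two boolean fields are guesses because the table does not state them. `run_actor` relies on the `apify-client` package and is not executed here.

```python
from urllib.parse import urlparse

ACTOR_ID = "hedelka/pdf-to-markdown-rag"  # assumption: inferred from the store URL


def build_run_input(urls, chunk_size=1000, enable_chunking=True, include_metadata=True):
    """Validate arguments and return a dict that uses the documented field names."""
    if not urls:
        raise ValueError("urls must contain at least one PDF URL")
    for url in urls:
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url!r}")
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return {
        "urls": list(urls),
        "chunkSize": chunk_size,
        "enableChunking": enable_chunking,
        "includeMetadata": include_metadata,
    }


def run_actor(token, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://example.com/report.pdf"], chunk_size=800)
print(run_input)
# items = run_actor("YOUR_APIFY_TOKEN", run_input)  # needs a real token and network access
```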

Pricing

Pay per Event:

  • Actor Start: $0.01 per GB of memory
  • RAG-Ready Chunk: $0.001 per extracted chunk
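
A worked estimate using the two event prices above. It assumes the Actor Start event is charged once per run in proportion to the memory allocated; the page does not spell out the exact billing rule, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

ACTOR_START_PER_GB = Decimal("0.01")  # USD per GB of memory, assumed per run
PRICE_PER_CHUNK = Decimal("0.001")    # USD per extracted chunk


def estimate_cost(chunks: int, memory_gb: int = 1, runs: int = 1) -> Decimal:
    """Estimated USD cost for `runs` runs that produce `chunks` chunks in total."""
    start_cost = ACTOR_START_PER_GB * memory_gb * runs
    chunk_cost = PRICE_PER_CHUNK * chunks
    return start_cost + chunk_cost


# One run with 4 GB of memory that yields 2,500 chunks:
# 0.01 * 4 + 0.001 * 2500 = 0.04 + 2.50 = 2.54 USD
print(estimate_cost(2500, memory_gb=4))
```

At these rates the per-chunk charge dominates: 1,000 chunks cost $1.00, which matches the headline price.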

Author

Built with ❤️ by HEDELKA for the AI Engineering community.

Questions? Open a GitHub issue or contact us on the Apify platform.
