RAG Document Converter

Pricing

$4.00/month + usage

Convert PDF, DOCX, PPTX, and other documents to clean Markdown optimized for RAG pipelines. Preserves structure, tables, and headers. Powered by IBM Docling.

Rating

0.0

(0)

Developer

Web Harvester

Maintained by Community

Last modified

4 months ago

What This Actor Does

Multi-format support - PDF, DOCX, PPTX, XLSX, HTML, images
Structure preservation - Keeps headers, tables, lists intact
RAG-optimized - Clean Markdown for LLM ingestion
Section chunking - Split by headers for vector stores
Metadata extraction - Title, author, page count

Use Cases

Use Case	Description
RAG Pipelines	Convert docs for retrieval-augmented generation
Knowledge Bases	Build searchable documentation
Content Migration	Convert legacy documents
LLM Context	Prepare documents for LLM analysis
Document Search	Index documents for semantic search

Input Examples

Basic PDF to Markdown

{
  "fileUrls": ["https://example.com/document.pdf"],
  "outputFormat": "markdown"
}

With Section Chunking

{
  "fileUrls": ["https://example.com/report.pdf"],
  "outputFormat": "markdown",
  "chunkBySection": true
}

Multiple Formats

{
  "fileUrls": [
    "https://example.com/doc.pdf",
    "https://example.com/slides.pptx",
    "https://example.com/data.xlsx"
  ],
  "outputFormat": "markdown"
}

With OCR

{
  "fileUrls": ["https://example.com/scanned.pdf"],
  "outputFormat": "markdown",
  "enableOcr": true
}
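
The inputs above can also be sent from Python. The following is a minimal sketch, not the author's documented method: it assumes the `apify-client` package, an API token in an `APIFY_TOKEN` environment variable, an Actor ID inferred from the page URL, and that results are written to the run's default dataset. The helper names `build_run_input` and `run_converter` are hypothetical.

```python
import os

# Assumption: Actor ID taken from the store URL, not stated in the README.
ACTOR_ID = "web.harvester/rag-document-converter"


def build_run_input(file_urls, output_format="markdown", **options):
    """Assemble an input object shaped like the examples above."""
    if not file_urls:
        raise ValueError("fileUrls must contain at least one URL")
    run_input = {"fileUrls": list(file_urls), "outputFormat": output_format}
    run_input.update(options)  # e.g. chunkBySection=True, enableOcr=True
    return run_input


def run_converter(run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumption: one result object per document lands in the default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Under these assumptions, `run_converter(build_run_input(["https://example.com/scanned.pdf"], enableOcr=True))` would submit the "With OCR" input and return one result object per document.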

Configuration

Parameter	Type	Default	Description
`fileUrls`	array	-	Document URLs (required)
`outputFormat`	string	"markdown"	Output format
`enableOcr`	boolean	false	Use OCR for scanned docs
`preserveTables`	boolean	true	Convert tables
`extractImages`	boolean	false	Extract embedded images
`chunkBySection`	boolean	false	Split by headers
`includeMetadata`	boolean	true	Include doc metadata
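
To catch typos before paying for a run, the parameter table can be mirrored in a small client-side check. This is a sketch under stated assumptions: the names, types, and defaults are copied from the table, the allowed `outputFormat` values come from the Output Formats list in this README, and `with_defaults` is a hypothetical helper. The Actor's own validation may differ.

```python
# Parameter names, types, and defaults copied from the Configuration table.
SCHEMA = {
    "fileUrls": (list, None),          # required, no default
    "outputFormat": (str, "markdown"),
    "enableOcr": (bool, False),
    "preserveTables": (bool, True),
    "extractImages": (bool, False),
    "chunkBySection": (bool, False),
    "includeMetadata": (bool, True),
}
OUTPUT_FORMATS = {"markdown", "html", "json", "text"}


def with_defaults(user_input):
    """Return a complete input dict with documented defaults and type checks."""
    unknown = set(user_input) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, (expected_type, default) in SCHEMA.items():
        value = user_input.get(name, default)
        if not isinstance(value, expected_type):
            raise TypeError(f"{name} must be of type {expected_type.__name__}")
        resolved[name] = value
    if not resolved["fileUrls"]:
        raise ValueError("fileUrls is required and must not be empty")
    if resolved["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputFormat: {resolved['outputFormat']}")
    return resolved
```

A misspelled key such as `enableOCR` raises `ValueError` here instead of being silently ignored.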

Supported Formats

Format	Extensions
PDF	.pdf
Word	.docx
PowerPoint	.pptx
Excel	.xlsx
HTML	.html, .htm
Images	.png, .jpg, .jpeg, .tiff, .bmp
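
A client-side pre-check against this table can filter out unsupported URLs before a run. The sketch below assumes the format can be guessed from the URL's file extension. The Actor itself may detect formats differently (for example from the response content type), and `detect_format` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the Supported Formats table.
SUPPORTED = {
    "PDF": {".pdf"},
    "Word": {".docx"},
    "PowerPoint": {".pptx"},
    "Excel": {".xlsx"},
    "HTML": {".html", ".htm"},
    "Images": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
}


def detect_format(url):
    """Return the table's format name for a URL's extension, or None if unlisted."""
    # Use only the path so query strings like ?download=1 are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    for name, extensions in SUPPORTED.items():
        if suffix in extensions:
            return name
    return None
```

URLs for which `detect_format` returns `None` can be logged and skipped rather than submitted.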

Output Formats

Format	Description
markdown	Clean Markdown (default, RAG-optimized)
html	HTML with structure
json	Lossless structured JSON
text	Plain text

Output

{
  "source": "https://example.com/document.pdf",
  "outputFormat": "markdown",
  "outputUrl": "https://api.apify.com/v2/key-value-stores/.../records/converted-12345.md",
  "contentPreview": "# Document Title\n\n## Introduction\n\nThis document covers...",
  "metadata": {
    "title": "Annual Report 2024",
    "pageCount": 42
  },
  "pageCount": 42,
  "success": true
}
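
Because `contentPreview` appears to hold only the start of the document, the full text would normally be read from `outputUrl`. A minimal sketch using only the standard library follows. It assumes the record is UTF-8 text and that a private store accepts the API token as a bearer header. `fetch_full_markdown` and its `opener` parameter are hypothetical names, the latter added so the function can be exercised without network access.

```python
import urllib.request


def fetch_full_markdown(item, token=None, opener=urllib.request.urlopen):
    """Download the complete converted document referenced by outputUrl."""
    request = urllib.request.Request(item["outputUrl"])
    if token:
        # Assumption: the key-value store record requires authentication.
        request.add_header("Authorization", f"Bearer {token}")
    with opener(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

The returned string can then replace `contentPreview` wherever the whole document is needed, such as before chunking.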

With Section Chunking

{
  "source": "https://example.com/document.pdf",
  "sections": [
    {"title": "Introduction", "content": "..."},
    {"title": "Methodology", "content": "..."},
    {"title": "Results", "content": "..."}
  ],
  "sectionCount": 3,
  "success": true
}
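
Chunked output like this maps naturally onto vector-store records. The following sketch assumes each section carries the `title` and `content` fields shown above. The record shape and the ID scheme (a hash of source URL and section position) are illustrative choices, not something the Actor defines.

```python
import hashlib


def sections_to_records(item):
    """Turn one chunked output item into a list of vector-store-ready records."""
    records = []
    for index, section in enumerate(item.get("sections", [])):
        text = section["content"].strip()
        if not text:
            continue  # skip empty sections rather than embedding blanks
        # Hypothetical ID scheme: hash of source URL plus section position,
        # so re-running the same document yields the same IDs.
        key = f"{item['source']}#{index}".encode("utf-8")
        records.append({
            "id": hashlib.sha256(key).hexdigest()[:16],
            "text": f"{section['title']}\n\n{text}",
            "metadata": {"source": item["source"], "section": section["title"]},
        })
    return records
```

Prepending the section title to the text keeps the heading context available to the embedding model, which is one common design choice among several.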

RAG Integration

LangChain

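# Requires: pip install langchain-text-splitters
from langchain_text_splitters import MarkdownTextSplitter

# `result` is one item from the Actor's output (see Output above).
# contentPreview may be shortened; fetch outputUrl for the full document.
markdown = result["contentPreview"]
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)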

LlamaIndex

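# llama-index 0.10+ exposes Document from llama_index.core
from llama_index.core import Document
# `markdown` and `result` come from the LangChain example above
doc = Document(text=markdown, metadata=result["metadata"])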

Cost Estimation

Scale	Documents	Compute Units
Small	10	~0.05
Medium	50	~0.2
Large	200	~0.8
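
The table works out to roughly 0.004 to 0.005 compute units per document. For batch sizes between the listed rows, a rough estimate can be interpolated. The sketch below is only arithmetic over the three table rows: it assumes usage scales about linearly with document count, whereas real usage also depends on document size, OCR, and memory settings. `estimate_compute_units` is a hypothetical helper.

```python
# (documents, approximate compute units) rows from the table above.
TABLE = [(10, 0.05), (50, 0.2), (200, 0.8)]


def estimate_compute_units(documents):
    """Interpolate linearly between table rows; scale proportionally outside them."""
    if documents < 0:
        raise ValueError("documents must be non-negative")
    first_docs, first_cu = TABLE[0]
    last_docs, last_cu = TABLE[-1]
    if documents <= first_docs:
        # Below the smallest row: use the smallest row's per-document rate.
        return documents * first_cu / first_docs
    if documents >= last_docs:
        # Above the largest row: use the largest row's per-document rate.
        return documents * last_cu / last_docs
    for (d0, c0), (d1, c1) in zip(TABLE, TABLE[1:]):
        if d0 <= documents <= d1:
            return c0 + (c1 - c0) * (documents - d0) / (d1 - d0)
```

For example, 100 documents falls between the Medium and Large rows and comes out at about 0.4 compute units under this assumption.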

Technical Details

Language: Python 3.12
Library: IBM Docling
Memory: 1GB-4GB (depends on document size)
Features: 10x faster with DoclingParseV2

Limitations

OCR requires additional processing time
Very large documents may need more memory
Some complex layouts may lose formatting

Keywords: docling, rag, pdf, markdown, convert, document, llm, retrieval, langchain, llamaindex

PDF to Markdown RAG-Ready

hedelka/pdf-to-markdown-rag

Premium PDF scraper that preserves tables and structure. Optimized for RAG.

Dmitry Goncharov

PDF to Markdown & JSON Converter (Docling)

actorzlab/docling-pdf-converter

Convert PDF documents to clean Markdown, structured JSON, and plain text using IBM's open-source Docling AI. Handles text PDFs and scanned documents (OCR), extracts tables and images. No external API key required — runs fully on-device.

Khalil Drissi

PDF URL to Markdown, Tables & RAG Extractor

thescrapelab/Apify-PDF-url-scraper

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Inus Grobler

PDF to Text API | Document Extraction for LLMs & RAG

andok/pdf-text-converter

Convert bulk PDF documents via URL into clean, raw text. The perfect document scraper for LLMs, vector databases, and RAG pipelines.

Andok

Pandoc Document Converter - HTML to Markdown, DOCX, EPUB, PPTX

scrapeworks/pandoc-document-converter

Convert documents between formats with Pandoc in the cloud: HTML to Markdown for LLMs and RAG, Markdown to Word DOCX, EPUB e-books, PowerPoint PPTX, LaTeX, reStructuredText and more. Feed it URLs or raw text, get one converted document per input.

Nicolas van Arkens

Docling

vancura/docling

Docling document parser & converter – Convert documents into structured data without complexity. This Actor leverages the powerful Docling library to parse and transform various document formats into clean, structured outputs ready for analysis or integration.

Václav Vančura

433

5.0

Markdown RAG Chunker

codepoetry/markdown-rag-chunker

Chunk any document for RAG — PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.

CodePoetry

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Manas Mantri

AI-Ready Website Crawler

optimus-fulcria/ai-ready-website-crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Fulcria Labs

Agentic Document Extractor

solutionssmart/agentic-document-extractor-local

Extract RAG-ready chunks with provenance from PDFs, scans, images, DOCX, XLSX, PPTX, CSV, TXT, and Markdown using a local-first Apify Actor.

Solutions Smart

URL: https://apify.com/web.harvester/rag-document-converter