
URL: https://apify.com/web.harvester/rag-document-converter

RAG Document Converter - PDF to Markdown with Docling · Apify


Pricing

$4.00/month + usage


RAG Document Converter

Convert PDF, DOCX, PPTX, and other documents to clean Markdown optimized for RAG pipelines. Preserves structure, tables, and headers. Powered by IBM Docling.


Rating: 0.0 (0)

Developer: Web Harvester

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 months ago


📄 Convert documents to clean Markdown optimized for RAG pipelines

πŸ‘ Apify Actor

What This Actor Does

  • Multi-format support - PDF, DOCX, PPTX, XLSX, HTML, images
  • Structure preservation - Keeps headers, tables, lists intact
  • RAG-optimized - Clean Markdown for LLM ingestion
  • Section chunking - Split by headers for vector stores
  • Metadata extraction - Title, author, page count

Use Cases

| Use Case | Description |
| --- | --- |
| RAG Pipelines | Convert docs for retrieval-augmented generation |
| Knowledge Bases | Build searchable documentation |
| Content Migration | Convert legacy documents |
| LLM Context | Prepare documents for LLM analysis |
| Document Search | Index documents for semantic search |

Input Examples

Basic PDF to Markdown

{
  "fileUrls": ["https://example.com/document.pdf"],
  "outputFormat": "markdown"
}

With Section Chunking

{
  "fileUrls": ["https://example.com/report.pdf"],
  "outputFormat": "markdown",
  "chunkBySection": true
}

Multiple Formats

{
  "fileUrls": [
    "https://example.com/doc.pdf",
    "https://example.com/slides.pptx",
    "https://example.com/data.xlsx"
  ],
  "outputFormat": "markdown"
}

With OCR

{
  "fileUrls": ["https://example.com/scanned.pdf"],
  "outputFormat": "markdown",
  "enableOcr": true
}
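The input examples above can also be submitted from Python. The following is a minimal sketch, assuming the `apify-client` package is installed, that the Actor ID matches the store URL path (`web.harvester/rag-document-converter`), and that the Actor pushes its results to the run's default dataset. The helper names `build_input` and `run_converter` are hypothetical and not part of the Actor.

```python
from typing import Any

# Assumption: the Actor ID is the same as the path in the store URL.
ACTOR_ID = "web.harvester/rag-document-converter"


def build_input(file_urls: list[str], **options: Any) -> dict[str, Any]:
    """Build an Actor input dict; options are fields such as enableOcr."""
    if not file_urls:
        raise ValueError("fileUrls is required and must not be empty")
    run_input: dict[str, Any] = {
        "fileUrls": list(file_urls),
        "outputFormat": "markdown",
    }
    run_input.update(options)  # e.g. enableOcr=True, chunkBySection=True
    return run_input


def run_converter(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the Actor and return its dataset items (needs network access)."""
    # Imported lazily so build_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumption: results are written to the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    print(build_input(["https://example.com/scanned.pdf"], enableOcr=True))
```

Keeping input construction separate from the network call makes the input easy to inspect or log before any compute is spent.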

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| fileUrls | array | - | Document URLs (required) |
| outputFormat | string | "markdown" | Output format |
| enableOcr | boolean | false | Use OCR for scanned docs |
| preserveTables | boolean | true | Convert tables |
| extractImages | boolean | false | Extract embedded images |
| chunkBySection | boolean | false | Split by headers |
| includeMetadata | boolean | true | Include doc metadata |
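The parameters and defaults listed above can be mirrored in a small typed object so that mistakes are caught before a run starts. This is a sketch only: the class name `ConverterInput` is hypothetical, the defaults are copied from the table, and the allowed `outputFormat` values are taken from the Output Formats section of this page.

```python
from dataclasses import asdict, dataclass

# Values listed under "Output Formats" on this page.
ALLOWED_FORMATS = {"markdown", "html", "json", "text"}


@dataclass
class ConverterInput:
    """Client-side mirror of the Configuration table (hypothetical helper)."""

    fileUrls: list[str]
    outputFormat: str = "markdown"
    enableOcr: bool = False
    preserveTables: bool = True
    extractImages: bool = False
    chunkBySection: bool = False
    includeMetadata: bool = True

    def __post_init__(self) -> None:
        if not self.fileUrls:
            raise ValueError("fileUrls is required")
        if self.outputFormat not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported outputFormat: {self.outputFormat!r}")

    def to_dict(self) -> dict:
        """Return a plain dict suitable for use as the Actor input."""
        return asdict(self)


if __name__ == "__main__":
    cfg = ConverterInput(["https://example.com/report.pdf"], chunkBySection=True)
    print(cfg.to_dict())
```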

Supported Formats

| Format | Extensions |
| --- | --- |
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx |
| HTML | .html, .htm |
| Images | .png, .jpg, .jpeg, .tiff, .bmp |
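A client-side pre-check against the extensions listed above can filter out URLs that are unlikely to convert. This sketch assumes the file type can be inferred from the URL path; the Actor itself may detect formats differently, and the function names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def is_supported(url: str) -> bool:
    """Return True if the URL path ends in a listed extension."""
    path = urlparse(url).path  # drops query string and fragment
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_supported(urls: list[str]) -> tuple[list[str], list[str]]:
    """Split URLs into (supported, skipped) while preserving order."""
    supported = [u for u in urls if is_supported(u)]
    skipped = [u for u in urls if not is_supported(u)]
    return supported, skipped


if __name__ == "__main__":
    print(split_supported([
        "https://example.com/doc.pdf",
        "https://example.com/archive.zip",
    ]))
```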

Output Formats

| Format | Description |
| --- | --- |
| markdown | Clean Markdown (default, RAG-optimized) |
| html | HTML with structure |
| json | Lossless structured JSON |
| text | Plain text |

Output

{
  "source": "https://example.com/document.pdf",
  "outputFormat": "markdown",
  "outputUrl": "https://api.apify.com/v2/key-value-stores/.../records/converted-12345.md",
  "contentPreview": "# Document Title\n\n## Introduction\n\nThis document covers...",
  "metadata": {
    "title": "Annual Report 2024",
    "pageCount": 42
  },
  "pageCount": 42,
  "success": true
}
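Because `contentPreview` appears to hold only the start of the document, the full text would normally be read from `outputUrl`. The sketch below assumes the record URL is readable without extra authentication and is UTF-8 encoded; `load_markdown` and its `fetch` parameter are hypothetical names, with `fetch` included so the logic can be exercised without network access.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def _http_get(url: str, timeout: float = 30.0) -> str:
    """Download a URL and decode it as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def load_markdown(result: dict, fetch: Optional[Callable[[str], str]] = None) -> str:
    """Return the converted text for one output item.

    Prefers the full document at outputUrl and falls back to
    contentPreview when no URL is present.
    """
    if not result.get("success"):
        raise RuntimeError(f"conversion failed for {result.get('source')}")
    output_url = result.get("outputUrl")
    if output_url:
        return (fetch or _http_get)(output_url)
    return result["contentPreview"]
```

Passing a custom `fetch` is also a convenient place to add an API token or retries if the storage turns out to require them.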

With Section Chunking

{
  "source": "https://example.com/document.pdf",
  "sections": [
    {"title": "Introduction", "content": "..."},
    {"title": "Methodology", "content": "..."},
    {"title": "Results", "content": "..."}
  ],
  "sectionCount": 3,
  "success": true
}
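The `sections` array above maps naturally onto chunk records for a vector store. The following sketch assumes only the fields shown in the sample output; the record shape (`id`, `text`, `metadata`) and the function name `sections_to_chunks` are illustrative choices, not something the Actor produces.

```python
import hashlib


def sections_to_chunks(result: dict) -> list[dict]:
    """Turn a section-chunked output item into vector-store records."""
    source = result["source"]
    chunks = []
    for index, section in enumerate(result.get("sections", [])):
        text = section["content"].strip()
        if not text:
            continue  # skip sections with no body text
        # Stable ID derived from the source URL and section position.
        digest = hashlib.sha1(f"{source}#{index}".encode("utf-8")).hexdigest()
        chunks.append({
            "id": digest[:12],
            # Keep the heading in the text so the chunk carries its own context.
            "text": f"## {section['title']}\n\n{text}",
            "metadata": {
                "source": source,
                "title": section["title"],
                "position": index,
            },
        })
    return chunks
```

Deriving the ID from the source and position means re-running the conversion on an unchanged document yields the same IDs, which simplifies upserts.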

RAG Integration

LangChain

from langchain_text_splitters import MarkdownTextSplitter  # older releases: langchain.text_splitter

# Get Markdown from the Actor output. contentPreview may be truncated,
# so fetch outputUrl when the full document is needed.
markdown = result["contentPreview"]
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)

LlamaIndex

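from llama_index.core import Document  # llama-index < 0.10: from llama_index import Document

# `markdown` and `result` are the variables from the LangChain example
doc = Document(text=markdown, metadata=result["metadata"])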

Cost Estimation

| Scale | Documents | Compute Units |
| --- | --- | --- |
| Small | 10 | ~0.05 |
| Medium | 50 | ~0.2 |
| Large | 200 | ~0.8 |
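For batch sizes between the listed rows, a rough figure can be obtained by interpolating the table. This is a sketch under the assumption that cost scales roughly linearly between the published points. The table values are approximate, and real usage will vary with document size and OCR, so the result is an estimate only.

```python
# (documents, compute units) pairs copied from the Cost Estimation table.
COST_POINTS = [(10, 0.05), (50, 0.2), (200, 0.8)]


def estimate_compute_units(documents: int) -> float:
    """Estimate compute units by linear interpolation of the table."""
    if documents < 0:
        raise ValueError("documents must be non-negative")
    first_docs, first_cu = COST_POINTS[0]
    if documents <= first_docs:
        # Below the first row, scale by that row's per-document rate.
        return documents * first_cu / first_docs
    for (d0, c0), (d1, c1) in zip(COST_POINTS, COST_POINTS[1:]):
        if documents <= d1:
            return c0 + (c1 - c0) * (documents - d0) / (d1 - d0)
    # Beyond the last row, extrapolate with the last per-document rate.
    last_docs, last_cu = COST_POINTS[-1]
    return documents * last_cu / last_docs


if __name__ == "__main__":
    for n in (5, 100, 1000):
        print(n, round(estimate_compute_units(n), 3))
```

The per-document rate implied by the table is about 0.005 compute units for the small batch and about 0.004 for the larger ones.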

Technical Details

  • Language: Python 3.12
  • Library: IBM Docling
  • Memory: 1GB-4GB (depends on document size)
  • Features: DoclingParseV2 PDF backend (described as 10x faster)

Limitations

  • OCR requires additional processing time
  • Very large documents may need more memory
  • Some complex layouts may lose formatting

Keywords: docling, rag, pdf, markdown, convert, document, llm, retrieval, langchain, llamaindex
