PDF to Markdown Converter - Extract & Format Text

Pricing

$50.00 per 1,000 converted documents

Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.

Developer

daehwan kim

Maintained by Community

Last modified: 23 days ago

Office to Markdown — RAG-Ready Document Extractor

Convert PDF, DOCX, PPTX, XLSX, HTML, images, audio, and 20+ other formats into clean LLM-ready Markdown in one API call. Powered by Microsoft MarkItDown — the highest-fidelity open-source document-to-Markdown engine available.

Optimized for RAG pipelines, embedding ingestion, AI knowledge bases, and document understanding workflows where the quality of upstream chunking determines the quality of downstream retrieval.

v2.0 (2026-05-25) — Upgraded from PDF-only to multi-format. pdfUrl is kept as a legacy alias of fileUrl for backwards compatibility.

Why This Actor

pdf-parse and pdfplumber extract raw text but lose structure: headings collapse, tables are stringified, and lists flatten. For RAG, that means lower retrieval precision and more hallucinations downstream.

Microsoft MarkItDown preserves:

Heading hierarchy → #, ##, ### mapped from document outline
Tables → real Markdown tables, not pipe-broken text
Lists → bullet and numbered list integrity
Code blocks → code fences preserved
Image alt-text → embedded into the flow for context

For DOCX and PPTX, semantic structure (slide titles, footnotes, comments) is preserved. For images and audio, OCR / transcription fallback runs automatically.
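
Preserved headings are what make structure-aware chunking possible. The sketch below is one way to split the returned Markdown into sections that carry their heading trail, which can then be embedded as retrieval chunks. `split_by_headings` is a hypothetical helper written for this illustration; it is not part of the Actor or of MarkItDown, and it assumes ATX-style (`#`) headings.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, each tagged with its heading trail."""
    sections: list[dict] = []
    trail: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []               # lines collected for the current section
    in_fence = False                   # True while inside a fenced code block

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": [title for _, title in trail], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside code fences are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level before descending.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections
```

Each section's `path` (for example `["Tree of Thoughts", "Abstract"]`) can be prepended to the chunk text before embedding, so a retrieved chunk still carries its position in the document.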

Supported Formats

Category	Formats
Documents	PDF, DOCX, PPTX, XLSX, ODT, RTF
Web / Markup	HTML, HTM, XML, MHTML
Data	CSV, JSON, TSV
Plain	TXT, MD
Images (with OCR)	PNG, JPG, JPEG, GIF, BMP, WEBP
Audio (with transcription)	MP3, WAV, M4A
Archives	ZIP (recursive), EPUB
Others	YouTube URLs (transcript), Outlook MSG

Max file size: 100 MB per request.

Use Cases

RAG ingestion — Convert document libraries into Markdown chunks before embedding with OpenAI / Voyage / Cohere
AI knowledge bases — Bulk import company wikis, training material, manuals into vector DBs
Document Q&A — Pre-process source documents for Claude / GPT structured extraction
Compliance archival — Normalize multi-format historical records to searchable Markdown
Migration projects — Move from SharePoint / Confluence to modern docs-as-code platforms
LLM fine-tuning data prep — Clean Markdown corpus from heterogeneous source files

Input

Field	Type	Required	Description
`fileUrl`	string	✅	Direct HTTPS URL to a supported document (max 100 MB)
`pdfUrl`	string	—	Legacy alias for `fileUrl` (v1 compatibility)
`includePageBreaks`	boolean	—	Insert horizontal-rule between pages (PDF/PPTX). Default `false`
`truncateChars`	integer	—	Cap Markdown output at N characters. `0` = no cap (default)

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageBreaks": true,
  "truncateChars": 0
}
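
When the input is assembled in code, it can help to catch obvious mistakes before a run starts. The following is a minimal sketch of such a helper; `build_run_input` is a hypothetical name, and rejecting negative `truncateChars` values is an assumption made here for safety, not documented Actor behavior.

```python
from urllib.parse import urlparse


def build_run_input(file_url: str,
                    include_page_breaks: bool = False,
                    truncate_chars: int = 0) -> dict:
    """Build the Actor input dict from the fields in the Input table."""
    parsed = urlparse(file_url)
    # The Input table asks for a direct HTTPS URL.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("fileUrl must be a direct HTTPS URL")
    # Assumption: only 0 (no cap) or a positive cap makes sense here.
    if truncate_chars < 0:
        raise ValueError("truncateChars must be 0 or a positive integer")
    return {
        "fileUrl": file_url,
        "includePageBreaks": include_page_breaks,
        "truncateChars": truncate_chars,
    }


run_input = build_run_input("https://arxiv.org/pdf/2305.10601",
                            include_page_breaks=True)
```

The resulting dict matches the JSON example above and can be passed as `run_input` to the Apify client.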

Output

One dataset item per run:

Field	Type	Description
`fileUrl`	string	Source URL
`fileFormat`	string	Detected file extension (pdf, docx, ...)
`byteSize`	integer	Bytes downloaded
`charCount`	integer	Character length of resulting Markdown
`wordCount`	integer	Whitespace-tokenized word count
`markdown`	string	Final cleaned Markdown
`disclaimer`	string	Conversion accuracy notice
`error`	string	Populated only on failure

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "fileFormat": "pdf",
  "byteSize": 1043820,
  "charCount": 48230,
  "wordCount": 7821,
  "markdown": "# Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n## Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
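
Because `error` is populated only on failure, a consumer should check it before trusting `markdown`. The sketch below shows one way to do that; `markdown_from_item` is a hypothetical helper, and treating an empty `markdown` string as a failure is an assumption of this example rather than something the Actor specifies.

```python
def markdown_from_item(item: dict) -> str:
    """Return the Markdown from one dataset item, or raise if conversion failed."""
    error = item.get("error")
    if error:
        raise RuntimeError(f"Conversion failed for {item.get('fileUrl')}: {error}")
    markdown = item.get("markdown") or ""
    # Assumption: an empty result is as unusable as an explicit error.
    if not markdown.strip():
        raise RuntimeError(f"No Markdown returned for {item.get('fileUrl')}")
    return markdown


item = {
    "fileUrl": "https://arxiv.org/pdf/2305.10601",
    "fileFormat": "pdf",
    "markdown": "# Tree of Thoughts\n\n## Abstract\n\n...",
}
text = markdown_from_item(item)
```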

Pricing

$0.05 per document converted (event: pdf-converted)
Charged only after successful conversion + dataset push
No charge on download / conversion failures
Apify platform compute usage is billed separately to users (passOnCosts enabled)
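
As a quick worked example of the per-document price, the sketch below estimates the conversion charge for a batch. It covers only the $0.05 event fee; platform compute, which is billed separately, is not modeled. The function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_DOCUMENT = Decimal("0.05")  # USD, from the pricing list above


def estimate_conversion_cost(successful_documents: int) -> Decimal:
    """Estimate the event charge; failed conversions are not billed."""
    if successful_documents < 0:
        raise ValueError("successful_documents must not be negative")
    return PRICE_PER_DOCUMENT * successful_documents


print(estimate_conversion_cost(1000))  # → 50.00
```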

Quick Start

curl

curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "fileUrl": "https://arxiv.org/pdf/2305.10601",
    "includePageBreaks": true
  }'

Python (Apify Client)

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("ntriqpro/pdf-to-markdown").call(run_input={
    "fileUrl": "https://example.com/report.docx",
})

# Each run pushes one item to its default dataset.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items[0]["markdown"])
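
For more than one document, the same calls can be wrapped in a loop that separates successes from failures. This is a sketch under stated assumptions: `convert_many` is a hypothetical helper, it runs the Actor once per URL in sequence, and it takes the client as a parameter so it can be exercised with a stand-in object.

```python
def convert_many(client, urls, actor_id="ntriqpro/pdf-to-markdown"):
    """Run the Actor once per URL and return (results, failures) keyed by URL."""
    results, failures = {}, {}
    for url in urls:
        run = client.actor(actor_id).call(run_input={"fileUrl": url})
        if run is None:
            failures[url] = "run did not finish"
            continue
        items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
        item = items[0] if items else {}
        # `error` is populated only on failure (see the Output table).
        if item.get("error") or not item.get("markdown"):
            failures[url] = item.get("error") or "no markdown returned"
        else:
            results[url] = item["markdown"]
    return results, failures
```

With the real client this would be called as `convert_many(ApifyClient("YOUR_TOKEN"), [...])`. Since each document is a separate run, failed URLs can be retried without repeating the successful ones.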

JavaScript (Apify Client)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
    fileUrl: 'https://example.com/slides.pptx',
});

// Each run pushes one item to its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Limitations

Limitation	Detail
Scanned image-only PDFs	OCR is applied but accuracy depends on scan quality
Encrypted / password-protected files	Not supported
Files > 100 MB	Hard-rejected to protect compute cost
Non-Latin scripts (CJK, Arabic, etc.)	Supported but proofreading recommended for production
Streaming sources (S3 signed URLs)	Supported as long as URL is HTTPS-reachable

Always validate critical extractions against the source.
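
Since files over 100 MB are rejected, a caller may want to check the size before starting a run. The sketch below inspects the `Content-Length` header from a prior HEAD request. `exceeds_size_limit` is a hypothetical helper, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption; the Actor's exact threshold is not stated.

```python
from typing import Mapping, Optional

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB


def exceeds_size_limit(headers: Mapping[str, str]) -> Optional[bool]:
    """Return True or False from Content-Length, or None if the size is unknown."""
    value = next(
        (v for k, v in headers.items() if k.lower() == "content-length"),
        None,
    )
    # Servers may omit the header or send something unparseable.
    if value is None or not str(value).strip().isdigit():
        return None
    return int(str(value).strip()) > MAX_BYTES


too_big = exceeds_size_limit({"Content-Length": "1043820"})
```

A `None` result means the server did not report a size, in which case the only way to find out is to submit the file and check the `error` field.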

Technology Stack

Microsoft MarkItDown (MIT) — multi-format → Markdown converter
httpx (BSD) — Async HTTP client with streaming + size cap
Apify SDK for Python (Apache 2.0) — Actor runtime

Disclaimer

This Actor is an unofficial open-source wrapper around Microsoft MarkItDown. It is not affiliated with, sponsored by, or endorsed by Microsoft Corporation. Conversion fidelity depends on source-document structure; results are provided for informational and AI ingestion purposes only and are not a substitute for human review of critical or regulated documents.

Changelog

2.0 (2026-05-25) — Migrated to Python + Microsoft MarkItDown. Multi-format support (DOCX, PPTX, XLSX, HTML, images, audio, etc.). Output schema enriched with fileFormat / byteSize / charCount. pdfUrl retained as alias of fileUrl.
1.0 (2026-04-14) — Initial release with pdf-parse JavaScript backend (PDF only).

🔗 Related Actors by ntriqpro

invoice-extraction-mcp — Structured line-item extraction from invoice PDFs
blueprint-intelligence — AI floor-plan and architectural-drawing analyzer
content-factory — Convert documents into quizzes, flashcards, slide decks, podcast scripts

⭐ Rate this Actor

If this saves you time, please leave a review — it helps other teams discover it.

URL: https://apify.com/ntriqpro/pdf-to-markdown
