
URL: https://apify.com/ntriqpro/pdf-to-markdown

PDF to Markdown Converter - Clean Text Extraction · Apify


๐Ÿ‘ PDF to Markdown Converter - Extract & Format Text avatar

PDF to Markdown Converter - Extract & Format Text

Pricing

$50.00 / 1,000 PDFs converted

Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.

Rating: 0.0 (0)

Developer: daehwan kim

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 23 days ago

Office to Markdown: RAG-Ready Document Extractor

Convert PDF, DOCX, PPTX, XLSX, HTML, images, audio, and 20+ other formats into clean LLM-ready Markdown in one API call. Powered by Microsoft MarkItDown, the highest-fidelity open-source document-to-Markdown engine available.

Optimized for RAG pipelines, embedding ingestion, AI knowledge bases, and document understanding workflows where the quality of upstream chunking determines the quality of downstream retrieval.

v2.0 (2026-05-25): Upgraded from PDF-only to multi-format. pdfUrl is kept as a legacy alias of fileUrl for backwards compatibility.

Why This Actor

pdf-parse and pdfplumber extract raw text but lose structure: headings collapse, tables stringify, lists flatten. For RAG, that means lower retrieval precision and more hallucination downstream.

Microsoft MarkItDown preserves:

  • Heading hierarchy → #, ##, ### mapped from document outline
  • Tables → real Markdown tables, not pipe-broken text
  • Lists → bullet and numbered list integrity
  • Code blocks → fenced code blocks preserved
  • Image alt-text → embedded into the flow for context

For DOCX and PPTX, semantic structure (slide titles, footnotes, comments) is preserved. For images and audio, OCR / transcription fallback runs automatically.

Supported Formats

| Category | Formats |
| --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX, ODT, RTF |
| Web / Markup | HTML, HTM, XML, MHTML |
| Data | CSV, JSON, TSV |
| Plain | TXT, MD |
| Images (with OCR) | PNG, JPG, JPEG, GIF, BMP, WEBP |
| Audio (with transcription) | MP3, WAV, M4A |
| Archives | ZIP (recursive), EPUB |
| Others | YouTube URLs (transcript), Outlook MSG |

Max file size: 100 MB per request.
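
The format table and the size limit above can be checked on the client before a run is started. The following is a minimal sketch, not the Actor's own validation: the helper name `preflight`, the reading of "100 MB" as 100 × 1024 × 1024 bytes, and the rule that purely numeric suffixes (as in arXiv URLs) are not treated as file extensions are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import List, Optional
from urllib.parse import urlparse

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024

# Extensions copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    "pdf", "docx", "pptx", "xlsx", "odt", "rtf",
    "html", "htm", "xml", "mhtml",
    "csv", "json", "tsv",
    "txt", "md",
    "png", "jpg", "jpeg", "gif", "bmp", "webp",
    "mp3", "wav", "m4a",
    "zip", "epub",
    "msg",
}


def preflight(url: str, byte_size: Optional[int] = None) -> List[str]:
    """Return a list of likely problems; an empty list means none were found."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL is not HTTPS")
    ext = PurePosixPath(parsed.path).suffix.lstrip(".").lower()
    # Suffixes without any letter (e.g. ".10601" in an arXiv URL) are ignored,
    # because they are probably not file extensions.
    if any(ch.isalpha() for ch in ext) and ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"extension '.{ext}' is not in the supported list")
    if byte_size is not None and byte_size > MAX_BYTES:
        problems.append("file exceeds 100 MB")
    return problems
```

URLs without a recognizable extension pass this check untouched, since the extension alone says little about what the server will actually return.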

Use Cases

  • RAG ingestion: Convert document libraries into Markdown chunks before embedding with OpenAI / Voyage / Cohere
  • AI knowledge bases: Bulk import company wikis, training material, manuals into vector DBs
  • Document Q&A: Pre-process source documents for Claude / GPT structured extraction
  • Compliance archival: Normalize multi-format historical records to searchable Markdown
  • Migration projects: Move from SharePoint / Confluence to modern docs-as-code platforms
  • LLM fine-tuning data prep: Clean Markdown corpus from heterogeneous source files
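
For the RAG ingestion case, one common way to turn the returned Markdown into chunks is to split it at headings. The sketch below is one possible approach under simple assumptions, not something the Actor does itself: `chunk_by_heading` is a hypothetical helper, only ATX headings (`#` to `######`) are recognized, and lines inside fenced code blocks are never treated as headings.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks: List[Dict[str, str]] = []
    title = ""
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it has neither a heading nor any text.
        if title or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading, which can be stored as metadata next to the embedding so that retrieved passages stay traceable to a section of the source document.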

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| fileUrl | string | Yes | Direct HTTPS URL to a supported document (max 100 MB) |
| pdfUrl | string | No | Legacy alias for fileUrl (v1 compatibility) |
| includePageBreaks | boolean | No | Insert horizontal-rule between pages (PDF/PPTX). Default false |
| truncateChars | integer | No | Cap Markdown output at N characters. 0 = no cap (default) |

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageBreaks": true,
  "truncateChars": 0
}
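
Callers that still hold v1-style configuration may want to normalize it before starting a run. The helper below is a client-side sketch only: `build_run_input` is a hypothetical name, and the choice to prefer fileUrl when both fields are present is an assumption, since the page does not state which one wins.

```python
from typing import Any, Dict


def build_run_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize caller-supplied options into the documented input fields."""
    # Assumption: fileUrl takes precedence over the legacy pdfUrl alias.
    url = raw.get("fileUrl") or raw.get("pdfUrl")
    if not url:
        raise ValueError("fileUrl (or the legacy pdfUrl) is required")
    if not str(url).startswith("https://"):
        raise ValueError("fileUrl must be a direct HTTPS URL")

    truncate = int(raw.get("truncateChars", 0))  # 0 means no cap
    if truncate < 0:
        raise ValueError("truncateChars must be 0 or a positive integer")

    return {
        "fileUrl": url,
        "includePageBreaks": bool(raw.get("includePageBreaks", False)),
        "truncateChars": truncate,
    }
```

The returned dictionary uses only the four documented field names, so it can be passed as the run input unchanged.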

Output

One dataset item per run:

| Field | Type | Description |
| --- | --- | --- |
| fileUrl | string | Source URL |
| fileFormat | string | Detected file extension (pdf, docx, ...) |
| byteSize | integer | Bytes downloaded |
| charCount | integer | Character length of resulting Markdown |
| wordCount | integer | Whitespace-tokenized word count |
| markdown | string | Final cleaned Markdown |
| disclaimer | string | Conversion accuracy notice |
| error | string | Populated only on failure |

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "fileFormat": "pdf",
  "byteSize": 1043820,
  "charCount": 48230,
  "wordCount": 7821,
  "markdown": "# Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n## Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
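
Because error is populated only on failure, a consumer can check it first and then sanity-check the counts against the returned text. This is a rough sketch with a hypothetical helper name, `summarize_item`; it assumes charCount equals the length of the markdown string and wordCount equals a plain whitespace split, which the field descriptions suggest but do not guarantee.

```python
from typing import Any, Dict


def summarize_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Raise on a failed conversion, otherwise compare counts with the text."""
    if item.get("error"):
        raise RuntimeError(f"conversion failed: {item['error']}")
    markdown = item.get("markdown", "")
    return {
        "format": item.get("fileFormat"),
        # Assumption: charCount is the length of the returned string.
        "chars_match": len(markdown) == item.get("charCount"),
        # Assumption: wordCount is a plain whitespace split.
        "words_match": len(markdown.split()) == item.get("wordCount"),
    }
```

A mismatch is not necessarily an error (for example, the counts may be computed before truncation), but it is a cheap signal that an item deserves a closer look.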

Pricing

  • $0.05 per document converted (event: pdf-converted)
  • Charged only after successful conversion + dataset push
  • No charge on download / conversion failures
  • Apify platform compute usage is billed separately to users (passOnCosts enabled)
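
As a worked example of the per-event price above, the event charge for a batch is simply the number of successful conversions times $0.05. The function name `estimate_event_cost` is hypothetical, and the estimate deliberately leaves out platform compute, which is billed separately.

```python
from decimal import Decimal

PRICE_PER_DOCUMENT = Decimal("0.05")  # USD per pdf-converted event


def estimate_event_cost(successful_conversions: int) -> Decimal:
    """Event charges only; failed conversions are not charged."""
    if successful_conversions < 0:
        raise ValueError("successful_conversions must not be negative")
    return PRICE_PER_DOCUMENT * successful_conversions


# 1,000 successful conversions cost 50.00 USD in event charges,
# which matches the "$50.00 / 1,000" figure shown in the store listing.
print(estimate_event_cost(1000))
```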

Quick Start

curl

curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "fileUrl": "https://arxiv.org/pdf/2305.10601",
    "includePageBreaks": true
  }'

Python (Apify Client)

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("ntriqpro/pdf-to-markdown").call(run_input={
    "fileUrl": "https://example.com/report.docx"
})

# Each run produces one dataset item
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items[0]["markdown"])

JavaScript (Apify Client)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
    fileUrl: 'https://example.com/slides.pptx'
});

// Each run produces one dataset item
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Limitations

| Limitation | Detail |
| --- | --- |
| Scanned image-only PDFs | OCR is applied but accuracy depends on scan quality |
| Encrypted / password-protected files | Not supported |
| Files > 100 MB | Hard-rejected to protect compute cost |
| Non-Latin scripts (CJK, Arabic, etc.) | Supported but proofreading recommended for production |
| Streaming sources (S3 signed URLs) | Supported as long as URL is HTTPS-reachable |

Always validate critical extractions against the source.

Disclaimer

This Actor is an unofficial open-source wrapper around Microsoft MarkItDown. It is not affiliated with, sponsored by, or endorsed by Microsoft Corporation. Conversion fidelity depends on source-document structure; results are provided for informational and AI ingestion purposes only and are not a substitute for human review of critical or regulated documents.

Changelog

  • 2.0 (2026-05-25): Migrated to Python + Microsoft MarkItDown. Multi-format support (DOCX, PPTX, XLSX, HTML, images, audio, etc.). Output schema enriched with fileFormat / byteSize / charCount. pdfUrl retained as alias of fileUrl.
  • 1.0 (2026-04-14): Initial release with pdf-parse JavaScript backend (PDF only).

โญ Rate this Actor

If this saves you time, please leave a review โ€” it helps other teams discover it.
