URL: https://apify.com/web.harvester/pdf-to-markdown-converter

PDF to Markdown Converter - OCR with Tesseract.js · Apify


Pricing

$4.00/month + usage

PDF to Markdown Converter

Convert PDFs to clean Markdown with optional OCR for scanned documents. Uses PDF.js for text extraction and Tesseract.js for optical character recognition.

Rating

0.0 (0)

Developer

๐Ÿ‘ Web Harvester

Web Harvester

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 5
  • Monthly active users: 0
  • Last modified: 4 months ago

Convert PDFs to clean Markdown with optional OCR for scanned documents. Lightweight alternative to heavy document processing tools.

Features

  • Fast Text Extraction: Uses PDF.js for native text PDFs
  • OCR Support: Tesseract.js for scanned/image documents
  • Smart Mode: Auto-detects best extraction method per page
  • Layout Preservation: Maintains document structure
  • Multi-language OCR: 14+ languages supported
  • Batch Processing: Convert multiple PDFs at once

Input

Parameter       Type     Default  Description
file            string   -        Upload a PDF file
pdfUrls         array    -        URLs of PDFs to convert
mode            string   "quick"  Extraction mode
language        string   "eng"    OCR language
preserveLayout  boolean  true     Preserve document structure
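
A minimal sketch of assembling this input in Python before passing it to the Actor. The parameter names and defaults come from the table above, and the mode names from the Extraction Modes list; the helper `build_input` and its validation rules are assumptions for illustration, not the Actor's own logic.

```python
from typing import Any

# Mode names as documented; validating them client-side is an
# illustrative assumption, not something the Actor requires.
VALID_MODES = {"quick", "ocr", "combined"}


def build_input(
    pdf_urls: list[str],
    mode: str = "quick",
    language: str = "eng",
    preserve_layout: bool = True,
) -> dict[str, Any]:
    """Return an input object using the documented parameter names."""
    if not pdf_urls:
        raise ValueError("pdf_urls must contain at least one URL")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {
        "pdfUrls": list(pdf_urls),
        "mode": mode,
        "language": language,
        "preserveLayout": preserve_layout,
    }
```

Rejecting an unknown mode before the run starts avoids paying for a run that would fail or silently fall back to a default.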

Extraction Modes

  • quick: Fast extraction using PDF.js - best for native text PDFs
  • ocr: Tesseract OCR - use for scanned documents or images
  • combined: Auto-detects per page - uses OCR when text extraction fails
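
The per-page fallback described for combined mode could look roughly like the sketch below. The Actor's real detection heuristic is not documented, so the character threshold, the function names, and the `"tesseract.js"` method label are all assumptions; only the `"pdf.js"` label appears in the Actor's output example.

```python
from typing import Callable

# Hypothetical threshold: a page yielding fewer characters than this from
# direct text extraction is treated as scanned and sent to OCR.
MIN_TEXT_CHARS = 20


def extract_page(
    page_index: int,
    extract_text: Callable[[int], str],
    run_ocr: Callable[[int], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> tuple[str, str]:
    """Return (text, method) for one page, falling back to OCR."""
    text = extract_text(page_index).strip()
    if len(text) >= min_chars:
        return text, "pdf.js"
    # Direct extraction produced too little text, so try OCR instead.
    return run_ocr(page_index).strip(), "tesseract.js"
```

Passing the two extractors in as callables keeps the decision logic testable without loading a PDF or an OCR engine.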

Output

Results are saved to the dataset:

{
  "status": "success",
  "fileName": "document.pdf",
  "pdfUrl": "https://...",
  "markdown": "# Document Title\n\nContent here...",
  "pageCount": 5,
  "extractionMethod": "pdf.js",
  "characterCount": 12345
}
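
As one way to consume these results, the sketch below writes the `markdown` field of a dataset item to a `.md` file. The field names come from the example above; the helper `save_markdown`, the fallback file name, and the choice to skip any item whose `status` is not `"success"` are assumptions.

```python
from pathlib import Path
from typing import Optional


def save_markdown(item: dict, out_dir: Path) -> Optional[Path]:
    """Write one result's Markdown into out_dir; skip non-success items."""
    if item.get("status") != "success":
        return None
    # Derive "document.md" from "document.pdf"; the fallback name is arbitrary.
    stem = Path(item.get("fileName") or "output.pdf").stem
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{stem}.md"
    target.write_text(item["markdown"], encoding="utf-8")
    return target
```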

Use Cases

  1. LLM Preprocessing: Convert PDFs for AI/RAG pipelines
  2. Documentation Migration: Convert PDF docs to Markdown
  3. Content Extraction: Pull text from reports and papers
  4. Accessibility: Make PDF content more accessible
  5. Archive Conversion: Convert legacy PDFs to modern format

Supported Languages (OCR)

  • English, French, German, Spanish, Italian
  • Portuguese, Dutch, Polish, Russian
  • Chinese (Simplified/Traditional)
  • Japanese, Korean, Arabic
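
For reference, the listed languages map to the standard Tesseract language codes below. That the Actor's `language` parameter accepts exactly these codes is an assumption based on its documented default of `"eng"`; the lookup helper is hypothetical.

```python
# Standard Tesseract traineddata codes for the 14 languages listed above.
# Whether the Actor accepts each of these values is an assumption.
TESSERACT_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Italian": "ita",
    "Portuguese": "por",
    "Dutch": "nld",
    "Polish": "pol",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
}


def language_code(name: str) -> str:
    """Look up the OCR code for a language name from the list."""
    try:
        return TESSERACT_CODES[name]
    except KeyError:
        raise ValueError(f"Unsupported OCR language: {name!r}") from None
```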

Example

# Using Apify CLI
apify run -i '{
  "pdfUrls": ["https://example.com/document.pdf"],
  "mode": "combined",
  "language": "eng"
}'
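
The same run could be started from Python with the official `apify-client` package, sketched below. The Actor ID is assumed from the store URL, and `convert_pdfs` is a hypothetical wrapper; it takes the client as an argument so it can be exercised without network access.

```python
from typing import Any, Iterator

# Assumed Actor ID, taken from the path of the store URL.
ACTOR_ID = "web.harvester/pdf-to-markdown-converter"


def convert_pdfs(
    client: Any, pdf_urls: list[str], mode: str = "combined"
) -> Iterator[dict]:
    """Run the Actor and yield its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run_input = {"pdfUrls": pdf_urls, "mode": mode, "language": "eng"}
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()


# Usage with the real client (requires `pip install apify-client`
# and an Apify API token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   for item in convert_pdfs(client, ["https://example.com/document.pdf"]):
#       print(item["fileName"], item["status"])
```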

Technical Notes

  • Quick mode is 10-50x faster than OCR
  • OCR quality depends on scan quality and resolution
  • Combined mode adds per-page analysis overhead to decide whether OCR is needed
  • Large PDFs may require more memory
  • Some complex layouts may not convert perfectly
