URL: https://apify.com/cspnair/pdf-ocr-api

PDF OCR API - Multi-Model Text Extraction

Extract and convert text from PDF documents using advanced optical character recognition technology with support for multiple AI models.

🌟 Features

Multi-Model OCR Support

Choose from 8 different OCR engines based on your needs:

  • Google Vision API - High accuracy commercial OCR with excellent language support
  • DeepSeek OCR - Advanced AI-powered text extraction
  • Amazon Textract - AWS-powered document analysis optimized for PDFs
  • Azure AI Vision - Microsoft's computer vision OCR service
  • OpenAI GPT-4 Vision - State-of-the-art multimodal AI for complex documents
  • Hugging Face - Open-source transformer models for text extraction
  • Google Gemini - Latest Google multimodal AI technology
  • Native (Tesseract.js) - Free, no API key required, runs entirely in-container

Document Processing Features

  • ✅ Batch Processing - Process multiple PDFs simultaneously
  • ✅ Multi-Language Support - English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Korean, Arabic, Danish
  • ✅ Structure Preservation - Maintain document layout and formatting
  • ✅ Page Range Selection - Process specific pages or page ranges
  • ✅ Multiple Output Formats - JSON, Plain Text, or Markdown
  • ✅ High Resolution - 300 DPI conversion for optimal OCR accuracy
  • ✅ Metadata Extraction - Extract PDF metadata (title, author, dates)
  • ✅ Pay-Per-Page Pricing - Fair billing based on actual pages processed (see ./BILLING.md)

📋 Input Parameters

Required

  • ocrModel - OCR model to use (default: "native")
  • pdfUrls - Array of PDF document URLs to process

Optional

  • language - Document language (default: "eng")
  • preserveFormatting - Maintain document structure (default: true)
  • extractImages - Extract images from PDF (default: false)
  • outputFormat - Output format: "json", "text", or "markdown" (default: "json")
  • pageRange - Pages to process: "all", "1-5", "1,3,5" (default: "all")

API Keys (model-specific)

  • googleVisionApiKey - For Google Vision API
  • deepseekApiKey - For DeepSeek OCR
  • awsAccessKeyId, awsSecretAccessKey, awsRegion - For Amazon Textract
  • azureEndpoint, azureApiKey - For Azure AI Vision
  • openaiApiKey - For OpenAI GPT-4 Vision
  • huggingfaceApiKey - For Hugging Face models
  • geminiApiKey - For Google Gemini

🚀 Quick Start

Example Input (Native OCR - No API Key Required)

{
  "ocrModel": "native",
  "pdfUrls": [
    "https://example.com/document.pdf"
  ],
  "language": "eng",
  "outputFormat": "json",
  "pageRange": "all"
}
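
One way to submit this input from a script is the Apify REST API. The sketch below uses only the Python standard library and makes several assumptions: the actor ID `cspnair~pdf-ocr-api` is inferred from this listing's store URL, the synchronous `run-sync-get-dataset-items` endpoint is used, and `build_request` and `run_ocr` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed from the store URL (cspnair/pdf-ocr-api); the API path uses "~" instead of "/".
ACTOR_ID = "cspnair~pdf-ocr-api"
API_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"


def build_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST request that starts a run and returns its dataset items."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_ocr(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the JSON array of result items."""
    with urllib.request.urlopen(build_request(token, run_input), timeout=timeout) as resp:
        return json.load(resp)
```

Synchronous endpoints are time-limited, so large batches may be better served by starting the run asynchronously and reading the dataset once it finishes.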

Example with Google Vision API

{
  "ocrModel": "google-vision",
  "googleVisionApiKey": "YOUR_API_KEY",
  "pdfUrls": [
    "https://example.com/document.pdf",
    "https://example.com/another-document.pdf"
  ],
  "language": "eng",
  "preserveFormatting": true,
  "outputFormat": "markdown"
}

Process Specific Pages

{
  "ocrModel": "native",
  "pdfUrls": ["https://example.com/large-document.pdf"],
  "pageRange": "1-5,10,15-20",
  "outputFormat": "text"
}
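
To preview which pages a `pageRange` string selects, the syntax shown above ("all", "1-5", "1,3,5", or a mix) can be expanded locally. This is an illustrative sketch only: `parse_page_range` is a hypothetical helper, and clipping ranges to the document's page count is an assumption about how out-of-range pages are handled.

```python
def parse_page_range(spec: str, page_count: int) -> list:
    """Expand a pageRange string into a sorted list of 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages beyond the end of the document are silently dropped.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)
```

For an 18-page document, `"1-5,10,15-20"` expands to pages 1 through 5, page 10, and pages 15 through 18.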

📤 Output Format

JSON Output (default)

{
  "pdfUrl": "https://example.com/document.pdf",
  "fileName": "document.pdf",
  "ocrModel": "native",
  "language": "eng",
  "success": true,
  "extractedAt": "2024-11-04T10:30:00.000Z",
  "pageCount": 5,
  "totalCharacters": 12450,
  "averageConfidence": 0.94,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page 1 content...",
      "confidence": 0.95,
      "width": 2480,
      "height": 3508
    }
  ],
  "fullText": "Complete document text..."
}
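
The per-page `confidence` values in the JSON output make it straightforward to flag pages that may need manual review. The sketch below assumes the field names shown in the sample above; the 0.9 threshold and the helper names `low_confidence_pages` and `summarize` are arbitrary choices for illustration.

```python
def low_confidence_pages(result: dict, threshold: float = 0.9) -> list:
    """Return page numbers whose OCR confidence falls below the threshold."""
    return [
        page["pageNumber"]
        for page in result.get("pages", [])
        if page.get("confidence", 0.0) < threshold  # a missing confidence counts as low
    ]


def summarize(result: dict) -> dict:
    """Condense one result item into a few review-oriented fields."""
    pages = result.get("pages", [])
    return {
        "fileName": result.get("fileName"),
        "pages": len(pages),
        "characters": sum(len(page.get("text", "")) for page in pages),
        "flagged": low_confidence_pages(result),
    }
```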

Text Output

{
  "output": "Complete document text as plain string...",
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page 1 content..."
    }
  ]
}

Markdown Output

{
  "output": "# document.pdf\n\n**Pages:** 5\n\n## Page 1\n\nContent...",
  "pages": [
    {
      "pageNumber": 1,
      "markdown": "## Page 1\n\nContent..."
    }
  ]
}
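
If a run was made with JSON output and Markdown is needed later, the layout in the sample above can be approximated client-side. This sketch is a guess at that layout, not the actor's renderer: the function names are hypothetical, and joining pages with a blank line is an assumption, since the sample shows only one page.

```python
def page_markdown(page: dict) -> str:
    """Render one page as a level-2 heading followed by its text."""
    return f"## Page {page['pageNumber']}\n\n{page['text']}"


def document_markdown(result: dict) -> str:
    """Render a JSON result item in the shape of the Markdown sample."""
    header = f"# {result['fileName']}\n\n**Pages:** {result['pageCount']}"
    # Assumption: consecutive pages are separated by a single blank line.
    body = "\n\n".join(page_markdown(page) for page in result["pages"])
    return f"{header}\n\n{body}"
```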

💡 Use Cases

Business & Legal

  • Contract analysis and digitization
  • Legal document processing
  • Invoice and receipt extraction
  • Compliance document archiving

Academic & Research

  • Research paper text extraction
  • Academic document digitization
  • Literature review automation
  • Citation extraction

Content & Publishing

  • Book digitization
  • Magazine and newspaper archiving
  • Historical document preservation
  • Content migration projects

Development & Integration

  • Document management systems
  • Search and indexing pipelines
  • Data extraction workflows
  • Archive digitization projects

🔧 Supported Languages

  • English (eng)
  • Spanish (spa)
  • French (fra)
  • German (deu)
  • Italian (ita)
  • Portuguese (por)
  • Russian (rus)
  • Chinese Simplified (chi_sim)
  • Japanese (jpn)
  • Korean (kor)
  • Arabic (ara)
  • Danish (dan)

📊 Model Comparison

| Model | Speed | Accuracy | Cost | Best For |
|---|---|---|---|---|
| Native (Tesseract) | ⚡⚡⚡ | 85% | Free | Testing, simple docs |
| Google Vision | ⚡⚡ | 95% | $$ | Production, multi-language |
| Amazon Textract | ⚡⚡ | 96% | $$ | Forms, tables, structured docs |
| Azure Vision | ⚡⚡ | 94% | $$ | Enterprise integration |
| OpenAI GPT-4 | ⚡ | 94% | $$$ | Complex layouts, handwriting |
| Gemini | ⚡⚡ | 93% | $$ | Modern documents |

🎯 Best Practices

For Optimal Results

  1. Use high-quality PDF sources (not scanned at low resolution)
  2. Select the appropriate language setting
  3. Use premium models for complex layouts or handwriting
  4. Process pages in batches for large documents
  5. Enable formatting preservation for structured documents

Performance Tips

  1. Use page ranges to process only needed pages
  2. Batch multiple PDFs in a single run
  3. Choose Native OCR for simple, clear documents
  4. Use premium models only when necessary

Cost Optimization

  1. Start with Native OCR for testing
  2. Use page ranges to avoid processing unnecessary pages
  3. Batch process to reduce overhead
  4. Monitor API costs for premium models
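
The "start with Native OCR, use premium models only when necessary" advice can be turned into a two-pass workflow: run everything with `native` first, then resubmit only the documents that failed or scored poorly. The sketch below assumes the `success`, `averageConfidence`, and `pdfUrl` fields from the JSON output; the 0.85 cut-off and the name `plan_premium_retries` are illustrative.

```python
def plan_premium_retries(native_results: list, min_confidence: float = 0.85) -> list:
    """Pick the PDF URLs worth resubmitting to a paid OCR model."""
    retry = []
    for result in native_results:
        failed = not result.get("success", False)
        weak = result.get("averageConfidence", 0.0) < min_confidence
        if failed or weak:
            retry.append(result["pdfUrl"])
    return retry
```

The returned URLs can then be used as `pdfUrls` in a second run with a premium `ocrModel`.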

📈 Performance

  • Processing Speed: 5-30 seconds per page (varies by model)
  • Concurrent Processing: Up to 10 PDFs simultaneously
  • Maximum File Size: 100MB per PDF
  • Supported Formats: PDF (any version)
  • Resolution: 300 DPI conversion
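
Given the limits above, a long URL list can be split into runs of at most 10 PDFs, and the 5-30 seconds per page figure gives a rough time window. This is a planning sketch only: both helpers are hypothetical, and the time estimate assumes pages are processed one after another.

```python
MAX_CONCURRENT_PDFS = 10           # "Up to 10 PDFs simultaneously"
SECONDS_PER_PAGE = (5, 30)         # documented range, varies by model


def batch_urls(urls: list, size: int = MAX_CONCURRENT_PDFS) -> list:
    """Split a URL list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def estimate_seconds(page_count: int) -> tuple:
    """Return a (best case, worst case) duration for sequential page processing."""
    fastest, slowest = SECONDS_PER_PAGE
    return (fastest * page_count, slowest * page_count)
```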

💰 Pricing

This actor uses pay-per-event pricing:

  • $0.01 per PDF processed successfully (configurable)
  • Failed PDFs are not charged
  • Events tracked: pdf_processed
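
Under the rules above, the expected charge for a run can be estimated from its results by counting successful PDFs. The sketch assumes the default $0.01 price and the `success` flag from the JSON output; `estimate_charge` is a hypothetical helper, and the figure excludes any fees billed by third-party OCR providers.

```python
from decimal import Decimal

PRICE_PER_PDF = Decimal("0.01")  # default price per pdf_processed event


def estimate_charge(results: list, price: Decimal = PRICE_PER_PDF) -> Decimal:
    """Multiply the number of successful PDFs by the per-PDF price."""
    billable = sum(1 for result in results if result.get("success"))
    return billable * price  # failed PDFs are not charged
```

`Decimal` is used instead of `float` so that small currency amounts add up exactly.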

🆘 Support

For issues, questions, or feature requests:

  • Check the Apify documentation
  • Review the input schema for parameter details
  • Ensure API keys are valid and have sufficient quota
  • Verify PDF files are accessible and not corrupted

🔄 Version History

v1.0

  • Initial release
  • Support for 8 OCR models
  • Multi-language support (12 languages)
  • Batch processing capabilities
  • Multiple output formats (JSON, Text, Markdown)
  • Page range selection
  • Structure preservation
  • Pay-per-event pricing

📚 Related Actors

  • Receipt OCR API - Specialized for receipt processing
  • Invoice OCR API - Optimized for invoice extraction
  • Form OCR API - Structured form data extraction

Transform your PDF documents into searchable, structured data! 📄✨
