URL: https://apify.com/shahabuddin38/pdf-to-json

PDF to JSON API | OCR PDF Parser & Data Extraction · Apify


Pricing

from $3.50 / 1,000 results

Convert PDF files into structured JSON with optional OCR, table extraction, key-value detection, and metadata parsing. Ideal for invoices, receipts, contracts, statements, forms, and document automation workflows. Supports digital and scanned PDFs for API-ready data extraction.

Rating

0.0

(0)

Developer

๐Ÿ‘ Shahab Uddin

Shahab Uddin

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 10
  • Monthly active users: 1
  • Last modified: 2 months ago

PDF to JSON API

This Apify Actor converts PDF files into normalized JSON. For real workloads it accepts direct http or https PDF URLs. It also ships with a bundled smoke-test input, builtin://sample.pdf, so that Apify Store QA does not depend on a third-party sample URL staying online.

What it supports

  • Text extraction from standard text-based PDFs
  • Optional table extraction
  • Optional metadata output
  • Multiple PDF URLs per run
  • Apify dataset views, key-value store summaries, and live status page

Current limitations

  • OCR for image-only or scanned PDFs is not included in this version
  • Production URLs must be directly downloadable over http or https
  • maxPages limits how many pages are parsed for text extraction, but pageCount still gives the document's full page count as reported by the PDF parser
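
The URL rule above can be checked before a run is started. The following is a minimal sketch, not code from the Actor: the helper name classifyPdfUrl and its three result labels are hypothetical, and it assumes that builtin://sample.pdf is the only bundled input, since that is the only one this page names.

```typescript
// Hypothetical pre-flight check; the Actor's own validation is not shown on this page.
type PdfUrlKind = "builtin" | "remote" | "unsupported";

function classifyPdfUrl(raw: string): PdfUrlKind {
  // Assumption: the bundled sample is the only builtin:// input.
  if (raw === "builtin://sample.pdf") return "builtin";

  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return "unsupported"; // not an absolute URL (for example a bare file name)
  }

  // Production URLs must use http or https.
  if (parsed.protocol === "http:" || parsed.protocol === "https:") return "remote";
  return "unsupported";
}
```

A scheme check like this cannot tell whether the URL is directly downloadable; a link that redirects to a login or viewer page still passes it.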

Input example

{
  "pdfUrls": [
    "builtin://sample.pdf"
  ],
  "extractTables": false,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxDownloadRetries": 4,
  "requestTimeoutSecs": 30,
  "saveDebugSnapshots": false,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
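
For callers that build this input in code, a typed version can help catch misspelled field names. The sketch below takes its field names and values from the example above. The interface name, the choice to make every field optional, and the helper withExampleDefaults are assumptions for illustration, not the Actor's published schema, which lives in .actor/INPUT_SCHEMA.json.

```typescript
// Field names mirror the input example; optionality is an assumption.
interface PdfToJsonInput {
  pdfUrls?: string[];
  extractTables?: boolean;
  includeMetadata?: boolean;
  outputFormat?: string;
  maxDownloadRetries?: number;
  requestTimeoutSecs?: number;
  saveDebugSnapshots?: boolean;
  proxyConfiguration?: { useApifyProxy: boolean };
}

// Values copied from the input example above.
const EXAMPLE_DEFAULTS: Required<PdfToJsonInput> = {
  pdfUrls: ["builtin://sample.pdf"],
  extractTables: false,
  includeMetadata: true,
  outputFormat: "json",
  maxDownloadRetries: 4,
  requestTimeoutSecs: 30,
  saveDebugSnapshots: false,
  proxyConfiguration: { useApifyProxy: false },
};

// Hypothetical helper: fills unset fields from the example values.
function withExampleDefaults(input: PdfToJsonInput = {}): Required<PdfToJsonInput> {
  return {
    ...EXAMPLE_DEFAULTS,
    ...input,
    // Use the bundled sample when pdfUrls is omitted.
    pdfUrls: input.pdfUrls ?? [...EXAMPLE_DEFAULTS.pdfUrls],
  };
}
```

Calling withExampleDefaults({ pdfUrls: ["https://example.com/a.pdf"] }) keeps the caller's URL and fills the remaining fields from the example.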

Apify QA compatibility

Apify Store's automated health check runs the Actor with its default input and expects a succeeded run with a non-empty default dataset within 5 minutes. To keep this Actor healthy:

  • The input schema now uses both prefill and default for pdfUrls, which avoids older tasks or integrations failing when the field is omitted.
  • The schema also pre-fills the lightweight default options, matching Apify's daily default-input health check path.
  • The default sample is bundled into the Actor as builtin://sample.pdf, so daily checks do not rely on a third-party PDF host.
  • The custom Dockerfile builds main.ts during the Apify image build and runs the generated dist/main.js, so production cannot drift from the TypeScript source.
  • The runtime falls back to the bundled sample when pdfUrls is omitted, which protects legacy runs that were created before the field existed.
  • The deprecated legacy inputs extractKeyValuePairs and useOcr are still accepted as hidden no-op fields so older saved Apify inputs do not fail validation.
  • JSON records in the default key-value store are written with an explicit application/json content type so they satisfy Apify's key-value-store schema validation rules for collections that use jsonSchema.
  • Real URL downloads now use retryable browser-like requests, optional Apify proxy support, and optional debug snapshots when a target serves HTML or an anti-bot page instead of a PDF.
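
The last point, a target that serves HTML or an anti-bot page instead of a PDF, can be detected from the first bytes of the response, because PDF files begin with the ASCII marker %PDF-. The sketch below shows that check in isolation. The function name looksLikePdf is hypothetical, and this page does not say how the Actor itself makes the decision.

```typescript
// Hypothetical helper; not taken from the Actor's source.
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII for "%PDF-"
  if (bytes.length < magic.length) return false;
  // Strict check: the marker must be at offset 0.
  return magic.every((expected, i) => bytes[i] === expected);
}

// Usage with in-memory samples standing in for downloaded bodies.
const encoder = new TextEncoder();
const pdfBody = encoder.encode("%PDF-1.4\n");
const htmlBody = encoder.encode("<!DOCTYPE html><html></html>");
console.log(looksLikePdf(pdfBody), looksLikePdf(htmlBody));
```

Checking only offset 0 is a simplifying assumption. Some PDF readers tolerate a few junk bytes before the marker, so a production check may want to scan a short prefix instead.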

If every PDF fails, the Actor now ends in a failed status instead of silently reporting a successful run with only error items.
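
That rule can be written as a small predicate over the per-PDF results. This is a sketch under assumptions: the names ItemOutcome and allPdfsFailed are hypothetical, and treating a run with no items as "not all failed" is a choice made here, not behavior this page documents.

```typescript
// Only the success flag from the output records is needed for this decision.
interface ItemOutcome {
  success: boolean;
}

// Hypothetical predicate: true when at least one PDF was processed and none succeeded.
function allPdfsFailed(items: ItemOutcome[]): boolean {
  // Assumption: an empty list does not count as "every PDF failed".
  return items.length > 0 && items.every((item) => !item.success);
}
```

A run with one failure and one success therefore still counts as succeeded, and the failed PDF appears as an error item in the dataset.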

Output example

{
  "sourceUrl": "builtin://sample.pdf",
  "fileName": "dummy.pdf",
  "pageCount": 1,
  "metadata": {
    "PDFFormatVersion": "1.4"
  },
  "text": "Dummy PDF file",
  "tables": [],
  "success": true,
  "processedAt": "2026-04-18T10:00:00.000Z"
}
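
When consuming dataset items, a runtime check on the record shape guards against error items and future schema changes. The sketch below derives its field names from the output example. The interface name PdfRecord, the guard isPdfRecord, and the decision to treat metadata as optional are assumptions. The shape of failed records is not shown on this page, so the guard matches only records that look like the example.

```typescript
// Shape inferred from the output example; metadata is assumed optional.
interface PdfRecord {
  sourceUrl: string;
  fileName: string;
  pageCount: number;
  metadata?: Record<string, unknown>;
  text: string;
  tables: unknown[];
  success: boolean;
  processedAt: string;
}

// Hypothetical type guard for items shaped like the example record.
function isPdfRecord(value: unknown): value is PdfRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.sourceUrl === "string" &&
    typeof v.pageCount === "number" &&
    typeof v.text === "string" &&
    typeof v.success === "boolean" &&
    Array.isArray(v.tables)
  );
}

// Collect the extracted text of every successful record.
function successfulTexts(items: unknown[]): string[] {
  return items.filter(isPdfRecord).filter((r) => r.success).map((r) => r.text);
}
```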

Apify outputs

  • Dataset items: one normalized record per processed PDF
  • RUN_SUMMARY: compact run summary in the default key-value store
  • RESULTS.json or RESULTS.pretty.json: aggregated export of all dataset items
  • DEBUG_*: optional diagnostic metadata and HTML/text previews for blocked downloads when saveDebugSnapshots is enabled
  • Live view:
    • / HTML dashboard
    • /health compact JSON counters
    • /status full in-memory run state

Local development

npm install
npm run build
npm start

Run the same smoke path Apify cares about locally with:

npm run smoke

For local TypeScript changes, rebuild before running or use:

npm run dev

The runtime start command intentionally launches dist/main.js; the Docker build runs npm run build first and then prunes development dependencies before startup.

The actor source of truth is main.ts, and the Apify Console input UI is defined in .actor/INPUT_SCHEMA.json.
