Pdf to json

Pricing

from $3.50 / 1,000 results

Convert text-based PDF files into structured JSON with optional table extraction and metadata output. Suited to invoices, receipts, contracts, statements, forms, and document automation workflows. OCR for scanned or image-only PDFs and key-value detection are not included in this version.

Developer

Shahab Uddin

Maintained by Community

Last modified: 2 months ago

PDF to JSON API

This Apify Actor converts PDF files into normalized JSON. For real workloads it accepts direct http or https PDF URLs. It also ships with a bundled smoke-test input, builtin://sample.pdf, so that Apify Store QA does not depend on a third-party sample URL staying online.
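
The accepted source forms described above can be sketched as a small classifier. This is a minimal illustration, not the Actor's actual code: the function name classifySource and the returned labels are hypothetical, and only the builtin://sample.pdf identifier and the http/https requirement come from the description above.

```typescript
// Hypothetical helper: the names and labels here are illustrative assumptions.
type SourceKind = "builtin" | "remote" | "unsupported";

const BUILTIN_SAMPLE = "builtin://sample.pdf";

function classifySource(source: string): SourceKind {
  // The bundled smoke-test sample is matched literally.
  if (source === BUILTIN_SAMPLE) return "builtin";
  try {
    const { protocol } = new URL(source);
    return protocol === "http:" || protocol === "https:" ? "remote" : "unsupported";
  } catch {
    // Not a parseable absolute URL (for example, a bare file name).
    return "unsupported";
  }
}
```

A caller could use such a check to reject unsupported entries in pdfUrls before any download is attempted.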

What it supports

Text extraction from standard text-based PDFs
Optional table extraction
Optional metadata output
Multiple PDF URLs per run
Apify dataset views, key-value store summaries, and live status page

Current limitations

OCR for image-only or scanned PDFs is not included in this version
Production URLs must be directly downloadable over http or https
maxPages limits how many pages are parsed for text extraction, but pageCount still gives the full page count of the document as reported by the PDF parser
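
The maxPages behaviour noted above can be pictured with a short sketch. It assumes the parser has already produced one string per page. The function name extractText and the parsedPages field are hypothetical and are not part of the Actor's documented output.

```typescript
// Illustrative only: `pages` stands in for per-page text from some PDF parser.
interface ExtractedText {
  pageCount: number;   // full page count of the document
  parsedPages: number; // hypothetical field: pages actually included in `text`
  text: string;
}

function extractText(pages: string[], maxPages?: number): ExtractedText {
  // Apply the limit only when a positive maxPages is given.
  const limit =
    maxPages !== undefined && maxPages > 0
      ? Math.min(maxPages, pages.length)
      : pages.length;
  const selected = pages.slice(0, limit);
  return {
    pageCount: pages.length, // not reduced by maxPages
    parsedPages: selected.length,
    text: selected.join("\n"),
  };
}
```

With three pages and maxPages set to 2, this sketch returns text from the first two pages while pageCount stays at 3.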

Input example

{
  "pdfUrls": [
    "builtin://sample.pdf"
  ],
  "extractTables": false,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxDownloadRetries": 4,
  "requestTimeoutSecs": 30,
  "saveDebugSnapshots": false,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
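
As a rough sketch of how this input might be sent to the Actor over HTTP, the helper below builds a request for the Apify API's synchronous run endpoint. The helper name buildRunRequest and the PdfToJsonInput interface are hypothetical. The Actor ID is an assumption derived from the store URL, and the field types are inferred from the example above rather than from the input schema.

```typescript
// Field types are inferred from the input example; all fields treated as optional.
interface PdfToJsonInput {
  pdfUrls?: string[];
  extractTables?: boolean;
  includeMetadata?: boolean;
  outputFormat?: string;
  maxDownloadRetries?: number;
  requestTimeoutSecs?: number;
  saveDebugSnapshots?: boolean;
  proxyConfiguration?: { useApifyProxy: boolean };
}

interface RunRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Assumption: "username~actor-name" form of the ID, taken from the store URL.
const ACTOR_ID = "shahabuddin38~pdf-to-json";

function buildRunRequest(token: string, input: PdfToJsonInput): RunRequest {
  return {
    // Runs the Actor and responds with the default dataset items.
    url: `https://api.apify.com/v2/acts/${ACTOR_ID}/run-sync-get-dataset-items`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(input),
    },
  };
}
```

Passing the result to fetch(url, init) with a valid API token should return the dataset items as JSON, provided the run finishes within the synchronous endpoint's time limit.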

Apify QA compatibility

Apify Store's automated health check runs the Actor with its default input and expects a succeeded run with a non-empty default dataset within 5 minutes. To keep this Actor healthy:

The input schema now uses both prefill and default for pdfUrls, which avoids older tasks or integrations failing when the field is omitted.
The schema also pre-fills the lightweight default options, matching Apify's daily default-input health check path.
The default sample is bundled into the Actor as builtin://sample.pdf, so daily checks do not rely on a third-party PDF host.
The custom Dockerfile builds main.ts during the Apify image build and runs the generated dist/main.js, so production cannot drift from the TypeScript source.
The runtime falls back to the bundled sample when pdfUrls is omitted, which protects legacy runs that were created before the field existed.
The deprecated legacy inputs extractKeyValuePairs and useOcr are still accepted as hidden no-op fields so older saved Apify inputs do not fail validation.
JSON records in the default key-value store are written with an explicit application/json content type so they satisfy Apify's key-value-store schema validation rules for collections that use jsonSchema.
Real URL downloads now use retryable browser-like requests, optional Apify proxy support, and optional debug snapshots when a target serves HTML or an anti-bot page instead of a PDF.
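
One way to notice that a target served HTML or an anti-bot page instead of a PDF is to inspect the first bytes of the download, since PDF files begin with the ASCII signature %PDF-. The sketch below is an assumption about how such a check could look, not the Actor's actual detection logic. The name classifyPayload and its labels are hypothetical.

```typescript
// Hypothetical check: classifies a downloaded body by its leading bytes.
type PayloadKind = "pdf" | "html" | "unknown";

function classifyPayload(bytes: Uint8Array): PayloadKind {
  // Decode only the leading bytes; the PDF header is plain ASCII.
  let head = "";
  const limit = Math.min(bytes.length, 1024);
  for (let i = 0; i < limit; i++) head += String.fromCharCode(bytes[i]);

  if (head.slice(0, 5) === "%PDF-") return "pdf";
  // Typical openings of an HTML error page or challenge page.
  if (/^\s*(<!doctype html|<html)/i.test(head)) return "html";
  return "unknown";
}
```

A download classified as html would be a natural candidate for a retry or for a debug snapshot when saveDebugSnapshots is enabled.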

If every PDF fails, the Actor now ends in a failed status instead of silently reporting a successful run with only error items.
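
The failure rule above can be expressed as a small decision function. This is a sketch under assumptions: the name finalStatus is hypothetical, and treating an empty result list as a failure is a choice made here for illustration, not a behaviour confirmed by the description.

```typescript
// Hypothetical sketch of the run-status decision.
type RunStatus = "SUCCEEDED" | "FAILED";

function finalStatus(results: Array<{ success: boolean }>): RunStatus {
  // Succeeds as soon as one PDF was processed successfully.
  // Assumption: an empty result list is also treated as a failure.
  return results.some((r) => r.success) ? "SUCCEEDED" : "FAILED";
}
```

Under this rule a run with mixed results still succeeds, and the individual error items remain visible in the dataset.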

Output example

{
  "sourceUrl": "builtin://sample.pdf",
  "fileName": "dummy.pdf",
  "pageCount": 1,
  "metadata": {
    "PDFFormatVersion": "1.4"
  },
  "text": "Dummy PDF file",
  "tables": [],
  "success": true,
  "processedAt": "2026-04-18T10:00:00.000Z"
}
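
For consumers of the dataset, the record above could be typed and summarized roughly as follows. The interface mirrors the fields in the example. Which fields are present on failed records is not documented here, so marking most of them optional is an assumption, and summarize is a hypothetical helper.

```typescript
// Mirrors the output example; optional fields are an assumption for failed items.
interface PdfRecord {
  sourceUrl: string;
  fileName?: string;
  pageCount?: number;
  metadata?: Record<string, unknown>;
  text?: string;
  tables?: unknown[];
  success: boolean;
  processedAt: string;
}

function summarize(records: PdfRecord[]) {
  const ok = records.filter((r) => r.success);
  return {
    total: records.length,
    succeeded: ok.length,
    failed: records.length - ok.length,
    // Sum page counts of successful records, treating a missing count as 0.
    totalPages: ok.reduce((sum, r) => sum + (r.pageCount ?? 0), 0),
  };
}
```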

Apify outputs

Dataset items: one normalized record per processed PDF
RUN_SUMMARY: compact run summary in the default key-value store
RESULTS.json or RESULTS.pretty.json: aggregated export of all dataset items
DEBUG_*: optional diagnostic metadata and HTML/text previews for blocked downloads when saveDebugSnapshots is enabled
Live view:
- / HTML dashboard
- /health compact JSON counters
- /status full in-memory run state
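
The three live view paths can be pictured as a small routing function over the in-memory run state. This is only a sketch: the RunState fields, the HTML body, and the route function itself are hypothetical, and only the paths and their general purpose come from the list above.

```typescript
// Hypothetical shapes: field names are assumptions, not the Actor's real state.
interface RunState {
  processed: number;
  succeeded: number;
  failed: number;
  items: unknown[];
}

interface RouteResponse {
  status: number;
  contentType: string;
  body: string;
}

function route(path: string, state: RunState): RouteResponse {
  switch (path) {
    case "/": // HTML dashboard
      return {
        status: 200,
        contentType: "text/html; charset=utf-8",
        body: `<h1>PDF to JSON</h1><p>${state.succeeded} of ${state.processed} succeeded</p>`,
      };
    case "/health": { // compact JSON counters
      const { processed, succeeded, failed } = state;
      return { status: 200, contentType: "application/json", body: JSON.stringify({ processed, succeeded, failed }) };
    }
    case "/status": // full in-memory run state
      return { status: 200, contentType: "application/json", body: JSON.stringify(state) };
    default:
      return { status: 404, contentType: "text/plain", body: "Not found" };
  }
}
```

Keeping the routing as a pure function of the state makes it easy to test without starting an HTTP server.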

Local development

npm install
npm run build
npm start

To run the same smoke test locally that Apify's health check exercises:

npm run smoke

For local TypeScript changes, rebuild before running or use:

npm run dev

The runtime start command intentionally launches dist/main.js; the Docker build runs npm run build first and then prunes development dependencies before startup.

The Actor's source of truth is main.ts, and the Apify Console input UI is defined in .actor/INPUT_SCHEMA.json.

URL: https://apify.com/shahabuddin38/pdf-to-json
