Pdf Json Extractor

Under maintenance

Pricing

from $50.00 / 1,000 results

Convert any PDF into structured JSON using AI and OCR (Tesseract or Google Vision). Supports custom schemas, validation, and auto-repair. Ideal for invoices, contracts, receipts, and automation workflows. Fast, accurate, and easy to integrate.

Developer

Peerapat Pongnipakorn

Maintained by Community

Last modified: 6 months ago

PDF → Structured JSON Extractor (Apify Actor)

This Apify Actor extracts structured JSON from PDF files using PDF parsing + optional OCR + LLM-based schema extraction.

Features

Accepts a pdfUrl (HTTP) or pdfBase64 (string) as input
Extracts raw text using pdf-parse and optionally OCR (stub)
Sends the text and a user-provided schema to an LLM to return strict JSON
Pushes the extraction result to the Actor's dataset

Quick start

Update main.js's callLLM function to call your chosen LLM provider (OpenAI, Anthropic, Google)
(Optional) Implement runOCR using Tesseract or a cloud OCR API
Run apify push to deploy to your Apify account, then run the Actor with input.json

Example input.json

{
  "pdfUrl": "https://example.com/invoice123.pdf",
  "schema": {
    "invoice_number": "string",
    "invoice_date": "date",
    "total_amount": "number",
    "items": [{ "name": "string", "qty": "number", "price": "number" }]
  },
  "aiModel": "gpt-4o-mini",
  "ocr": false,
  "returnFormat": "json"
}
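
The input accepts either pdfUrl or pdfBase64. A minimal sketch of how the PDF source could be resolved is shown here. The helper name resolvePdfSource and the rule that exactly one of the two fields must be set are assumptions for illustration, not confirmed Actor behavior.

```javascript
// Hypothetical helper: decide where the PDF bytes come from.
// The field names pdfUrl and pdfBase64 come from the README;
// the "exactly one" rule and the return shape are assumptions.
function resolvePdfSource(input) {
  const hasUrl = typeof input.pdfUrl === 'string' && input.pdfUrl.length > 0;
  const hasBase64 = typeof input.pdfBase64 === 'string' && input.pdfBase64.length > 0;

  if (hasUrl === hasBase64) {
    throw new Error('Provide exactly one of pdfUrl or pdfBase64');
  }

  if (hasUrl) {
    // new URL() throws on malformed input, which rejects bad URLs early.
    const url = new URL(input.pdfUrl);
    if (url.protocol !== 'http:' && url.protocol !== 'https:') {
      throw new Error(`Unsupported protocol: ${url.protocol}`);
    }
    return { kind: 'url', url: url.href };
  }

  // Decode the base64 payload into raw bytes for the PDF parser.
  return { kind: 'buffer', data: Buffer.from(input.pdfBase64, 'base64') };
}
```

Failing fast on ambiguous input avoids spending an LLM call on a document that could not be loaded.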

Notes

The starter callLLM function is a stub for testing and must be replaced with an actual LLM API call before production use.
Consider rate limits and cost of LLM calls. Offer batching or model selection in your product.
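
When replacing the callLLM stub, the prompt construction and response parsing can be kept separate from the provider call. The sketch below shows one possible shape. The helper names buildExtractionMessages and parseLLMJson are hypothetical, and the role/content message format is an assumption that fits common chat-style APIs rather than any specific provider.

```javascript
// Hypothetical helpers for a callLLM replacement. Nothing here calls a real API.
const FENCE = '`'.repeat(3); // three backticks, built indirectly for readability

// Build a provider-neutral chat payload from the PDF text and the user schema.
function buildExtractionMessages(pdfText, schema) {
  return [
    {
      role: 'system',
      content:
        'You extract data from documents. Reply with a single JSON object ' +
        'that matches the given schema. Use null for missing values. ' +
        'Do not add commentary or code fences.',
    },
    {
      role: 'user',
      content:
        `Schema:\n${JSON.stringify(schema, null, 2)}\n\n` +
        `Document text:\n${pdfText}`,
    },
  ];
}

// Models sometimes wrap JSON in code fences despite instructions,
// so strip a leading and trailing fence before parsing.
function parseLLMJson(raw) {
  const cleaned = raw
    .trim()
    .replace(new RegExp('^' + FENCE + '(?:json)?\\s*', 'i'), '')
    .replace(new RegExp('\\s*' + FENCE + '$'), '');
  return JSON.parse(cleaned); // throws SyntaxError if the reply is not JSON
}
```

Keeping these two functions pure makes them easy to unit test without spending LLM credits.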

Suggested pricing

Free: 20 PDFs / month
Starter: $19 / month (200 PDFs)
Pro: $49 / month (1000 PDFs)
Business: $149 / month (10k PDFs)

Validation & LLM retry behavior

This Actor now validates the extracted JSON using ajv when you provide a JSON Schema as the schema input. If the JSON does not validate, the Actor will automatically attempt to repair it by sending a targeted prompt to the LLM (up to 2 repair attempts).
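
The validate-then-repair flow described above can be sketched roughly as follows. To keep the example free of dependencies, the validator and the repair call are passed in as functions. In the Actor these roles would presumably be played by an ajv-compiled validator and an LLM call. The names validateWithRepair and buildRepairPrompt, the prompt wording, and the result shape are all assumptions.

```javascript
// Sketch of a validate-then-repair loop with injected dependencies.
const MAX_REPAIR_ATTEMPTS = 2; // matches the limit stated in the README

// Build a targeted prompt that lists the validation errors (wording is assumed).
function buildRepairPrompt(candidate, errors) {
  return (
    'The JSON below failed schema validation. Return corrected JSON only.\n' +
    `Errors:\n${errors.map((e) => `- ${e}`).join('\n')}\n` +
    `JSON:\n${JSON.stringify(candidate)}`
  );
}

// validate(data) returns an array of error strings ([] means valid).
// repair(prompt) returns a Promise of a new candidate object.
async function validateWithRepair(candidate, validate, repair) {
  let current = candidate;
  for (let attempt = 0; attempt <= MAX_REPAIR_ATTEMPTS; attempt += 1) {
    const errors = validate(current);
    if (errors.length === 0) {
      return { valid: true, data: current, repairAttempts: attempt };
    }
    if (attempt === MAX_REPAIR_ATTEMPTS) {
      // Out of attempts: return the last candidate with its errors.
      return { valid: false, data: current, errors, repairAttempts: attempt };
    }
    current = await repair(buildRepairPrompt(current, errors));
  }
}
```

Returning the invalid result with its errors, rather than throwing, lets the caller decide whether to store a partial extraction.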

LLM calls use p-retry with exponential backoff for transient failures (retries on 5xx and rate-limit responses). You can control retry counts and model via the input parameters.
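
To illustrate the retry policy, here is a hand-rolled sketch of exponential backoff that retries only on 5xx and rate-limit (429) responses. It is not the p-retry API and is not the Actor's actual code. The function names, the base delay, the cap, and the assumption that errors carry a numeric status property are all illustrative.

```javascript
// Hand-rolled retry sketch; the Actor itself is described as using p-retry.

// Retry only on rate limits (429) and server errors (5xx).
function isRetryable(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay doubles on each attempt and is capped at maxMs (values are assumed).
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn until it succeeds, a non-retryable error occurs,
// or the retry budget is used up. Assumes errors expose err.status.
async function withRetry(fn, retries = 3) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err.status)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits between attempts are 500 ms, 1000 ms, and 2000 ms. Client errors such as 400 fail immediately, since retrying them would only add cost.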

OCR Options (Tesseract or Google Vision)

This Actor supports optional OCR when ocr is enabled in the input. You can select the OCR engine via the input ocrOptions.engine field.

ocrOptions example

"ocr":true,
"ocrOptions":{"engine":"tesseract"}

or for Google Vision:

"ocr":true,
"ocrOptions":{"engine":"google"}

Tesseract (offline)

Uses tesseract.js (Node). This allows OCR without external APIs but adds a larger dependency.
No env vars needed. Install dependencies and run the Actor as usual.

Google Vision (cloud OCR)

Uses the Google Cloud Vision API with the DOCUMENT_TEXT_DETECTION feature. Requires a GOOGLE_API_KEY environment variable holding an API key for a project with the Vision API enabled.
Set the key in the environment before running:

$ export GOOGLE_API_KEY="YOUR_GOOGLE_VISION_API_KEY"
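
For reference, a DOCUMENT_TEXT_DETECTION request against the Vision API's images:annotate REST method could look roughly like this. It is a sketch, not the Actor's code. It assumes the PDF page has already been rendered to an image buffer, since images:annotate takes images rather than PDFs, and that a global fetch is available (Node 18 or later). The function names are hypothetical.

```javascript
// Build the JSON body for a single-image DOCUMENT_TEXT_DETECTION request.
function buildVisionRequest(imageBuffer) {
  return {
    requests: [
      {
        image: { content: imageBuffer.toString('base64') },
        features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
      },
    ],
  };
}

// Hypothetical OCR call for one rendered page image.
async function googleVisionOcr(imageBuffer, apiKey = process.env.GOOGLE_API_KEY) {
  if (!apiKey) throw new Error('GOOGLE_API_KEY is not set');

  const res = await fetch(
    `https://vision.googleapis.com/v1/images:annotate?key=${encodeURIComponent(apiKey)}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(buildVisionRequest(imageBuffer)),
    },
  );
  if (!res.ok) throw new Error(`Vision API returned ${res.status}`);

  const body = await res.json();
  // fullTextAnnotation is absent when no text was detected.
  return body.responses?.[0]?.fullTextAnnotation?.text ?? '';
}
```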

Behavior notes

The Actor will attempt pdf-parse extraction first. If ocr is true and extracted text is short or empty, the configured OCR engine will be invoked.
OCR can be slower and more expensive (Google Vision costs), so use it only for scanned PDFs.
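
The fallback rule above can be expressed as two small checks. This is a sketch under stated assumptions. The README says only that OCR runs when the extracted text is "short or empty", so the 50-character threshold is a guess, as are the default engine and the function names.

```javascript
// Assumed threshold; the README does not define what counts as "short".
const MIN_TEXT_CHARS = 50;

// OCR runs only when it is enabled and pdf-parse produced little or no text.
function shouldRunOcr(input, extractedText) {
  if (!input.ocr) return false;
  return extractedText.trim().length < MIN_TEXT_CHARS;
}

// Resolve the engine from ocrOptions.engine; the tesseract default is an assumption.
function pickOcrEngine(input) {
  const engine = input.ocrOptions?.engine ?? 'tesseract';
  if (engine !== 'tesseract' && engine !== 'google') {
    throw new Error(`Unknown OCR engine: ${engine}`);
  }
  return engine;
}
```

Checking the text length first means digital PDFs with a usable text layer skip OCR, even when ocr is true.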

URL: https://apify.com/p6t_p10n/pdf-json-extractor