URL: https://apify.com/jungle_synthesizer/pdf-to-json-parser

PDF to JSON Parser (AI-Powered) · Apify


Pricing

Pay per event

Convert PDF documents into structured JSON. Extracts text, tables, and fields from any PDF URL. Optional AI structuring pass (BYO OpenAI key) turns raw text into clean, organized JSON ready for automation or analysis.

Rating: 0.0 (0)

Developer: BowTiedRaccoon (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 12 days ago

Convert PDF documents into structured JSON. Supply a list of public PDF URLs: the actor downloads each file, extracts text from every page, and returns clean, organized output. Add your OpenAI API key to get an AI-powered structuring pass that turns raw text into categorized JSON fields.

What it does

  • Accepts a list of public PDF URLs (up to 50 MB per file)
  • Downloads each PDF to temporary storage and extracts text per page using native PDF parsing
  • Processes every page for complete coverage; no pages are skipped
  • Optionally runs an AI structuring pass (OpenAI GPT-4o-mini or GPT-4o) that organizes the raw text into titled sections, tables, key fields, and metadata
  • Returns one dataset record per PDF with the full extracted text, per-page breakdown, and AI output
  • Saves error records for PDFs that fail to download or parse, and the run continues

Use cases

  • Invoice and receipt extraction for accounting automation
  • Contract and legal document analysis
  • Academic paper indexing and summarization
  • Form data extraction from government or regulatory PDFs
  • Report parsing for data pipelines
  • Bulk document conversion for RAG / LLM pipelines

Input

| Field | Type | Required | Description |
|---|---|---|---|
| pdfUrls | Array | Yes | Public PDF URLs to process. Must be directly downloadable. |
| openaiApiKey | String | No | Your OpenAI API key (sk-...). Enables AI structuring. Not stored. |
| extractionPrompt | String | No | Custom prompt for the AI structuring pass. Leave blank to use the default (extracts title, author, summary, sections, tables, key fields). |
| model | Select | No | OpenAI model: gpt-4o-mini (default, fast) or gpt-4o (most capable). |
| maxItems | Integer | No | Maximum PDFs to process per run. Default: 15. |
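
As an illustration of the input fields above, here is a minimal Python sketch that assembles an input object. The helper name `build_run_input` is hypothetical, and it assumes `pdfUrls` takes plain URL strings; the listing only says "Array", so the actor may expect a different item shape.

```python
from typing import Optional

# The two model options named in the input table.
ALLOWED_MODELS = ("gpt-4o-mini", "gpt-4o")


def build_run_input(
    pdf_urls: list[str],
    openai_api_key: Optional[str] = None,
    extraction_prompt: Optional[str] = None,
    model: str = "gpt-4o-mini",
    max_items: int = 15,
) -> dict:
    """Assemble an input object using the field names from the input table.

    Assumption: pdfUrls is a list of plain URL strings.
    """
    if not pdf_urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")

    run_input = {"pdfUrls": list(pdf_urls), "model": model, "maxItems": max_items}
    # Optional fields are omitted when unset so the documented defaults apply.
    if openai_api_key:
        run_input["openaiApiKey"] = openai_api_key
    if extraction_prompt:
        run_input["extractionPrompt"] = extraction_prompt
    return run_input


example_input = build_run_input(["https://example.com/invoice-2024-01.pdf"])
print(example_input)
```

With the `apify-client` package, an object like this would typically be passed as `ApifyClient(token).actor("jungle_synthesizer/pdf-to-json-parser").call(run_input=example_input)`. The actor ID here is inferred from the page URL.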

Output

One dataset record per PDF:

| Field | Type | Description |
|---|---|---|
| sourceUrl | String | Original PDF URL |
| pageCount | Number | Number of pages in the PDF |
| rawText | String | Full extracted text (all pages concatenated) |
| pages | String | JSON array of per-page text: [{"page": 1, "text": "..."}] |
| structuredJson | String | AI-structured output as JSON string (null if no API key supplied) |
| model | String | OpenAI model used (null if AI pass skipped) |
| processedAt | String | ISO timestamp when processing completed |
| status | String | success or error |
| errorMsg | String | Error message on failure, null on success |

Example record (native extraction only)

{
"sourceUrl":"https://example.com/invoice-2024-01.pdf",
"pageCount":2,
"rawText":"Invoice #INV-2024-001\nDate: January 15, 2024\n...",
"pages":"[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"},{\"page\":2,\"text\":\"Payment terms...\"}]",
"structuredJson":null,
"model":null,
"processedAt":"2026-06-07T12:00:00.000Z",
"status":"success",
"errorMsg":null
}

Example record (with AI structuring)

{
"sourceUrl":"https://example.com/invoice-2024-01.pdf",
"pageCount":2,
"rawText":"Invoice #INV-2024-001\nDate: January 15, 2024\n...",
"pages":"[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"}]",
"structuredJson":"{\"title\":\"Invoice #INV-2024-001\",\"date\":\"January 15, 2024\",\"key_fields\":{\"invoice_number\":\"INV-2024-001\",\"amount\":\"$1,250.00\"}}",
"model":"gpt-4o-mini",
"processedAt":"2026-06-07T12:00:00.000Z",
"status":"success",
"errorMsg":null
}
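
Because `pages` and `structuredJson` arrive as JSON strings rather than nested objects, a consumer has to parse them a second time after reading the record. The sketch below shows one way to do that in Python; `decode_record` is a hypothetical helper, and the sample record is a trimmed copy of the example above.

```python
import json


def decode_record(record: dict) -> dict:
    """Return a copy of a dataset record with its JSON-string fields parsed."""
    decoded = dict(record)
    # "pages" is documented as a JSON array serialized into a string.
    pages_raw = record.get("pages")
    decoded["pages"] = json.loads(pages_raw) if pages_raw else []
    # "structuredJson" is null when the AI pass was skipped or failed.
    structured_raw = record.get("structuredJson")
    decoded["structuredJson"] = json.loads(structured_raw) if structured_raw else None
    return decoded


record = {
    "sourceUrl": "https://example.com/invoice-2024-01.pdf",
    "pageCount": 2,
    "pages": '[{"page":1,"text":"Invoice #INV-2024-001..."}]',
    "structuredJson": '{"title":"Invoice #INV-2024-001",'
                      '"key_fields":{"invoice_number":"INV-2024-001","amount":"$1,250.00"}}',
    "status": "success",
}

decoded = decode_record(record)
print(decoded["pages"][0]["page"])
print(decoded["structuredJson"]["key_fields"]["invoice_number"])
```

The structure inside `structuredJson` depends on the prompt and the model, so downstream code should treat keys such as `key_fields` as optional.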

Notes

  • Native extraction works on any text-based PDF (invoices, reports, forms, contracts). Scanned image-only PDFs return empty text; OCR for image PDFs is not currently supported.
  • AI structuring is additive. Even when the OpenAI call fails (rate limit, invalid key, network error), the actor returns the native extraction record with structuredJson: null rather than failing the run.
  • Custom prompts let you tailor the structuring output for a specific document type. For example: "Extract all line items as an array of {description, quantity, unit_price, total}".
  • File size limit: 50 MB per PDF. Larger files are rejected with an error record.
  • OpenAI costs are billed to your API key separately from actor usage.
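
The notes above imply several distinct outcomes per record: a failed download or parse, a successful parse with empty text (likely a scanned PDF), a successful parse without AI output, and a fully structured result. A rough Python sketch for sorting records into those groups follows; the function name, bucket names, and sample records are all illustrative and not part of the actor.

```python
def triage(records: list[dict]) -> dict[str, list[str]]:
    """Group source URLs by outcome, using only the documented output fields."""
    buckets: dict[str, list[str]] = {"failed": [], "no_text": [], "no_ai": [], "ok": []}
    for rec in records:
        url = rec.get("sourceUrl", "")
        if rec.get("status") != "success":
            # Download or parse failure; errorMsg carries the reason.
            buckets["failed"].append(url)
        elif not (rec.get("rawText") or "").strip():
            # Parsed but empty: likely a scanned, image-only PDF.
            buckets["no_text"].append(url)
        elif rec.get("structuredJson") is None:
            # Native text only: no API key was supplied or the OpenAI call failed.
            buckets["no_ai"].append(url)
        else:
            buckets["ok"].append(url)
    return buckets


# Hypothetical records for illustration.
sample = [
    {"sourceUrl": "https://example.com/a.pdf", "status": "error", "errorMsg": "download failed"},
    {"sourceUrl": "https://example.com/b.pdf", "status": "success", "rawText": "", "structuredJson": None},
    {"sourceUrl": "https://example.com/c.pdf", "status": "success", "rawText": "Invoice", "structuredJson": None},
    {"sourceUrl": "https://example.com/d.pdf", "status": "success", "rawText": "Invoice", "structuredJson": "{}"},
]
result = triage(sample)
print(result)
```

When no API key is supplied, every successful record lands in `no_ai`, which is expected rather than a sign of failure.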
