URL: https://apify.com/zenomastro/pdf-to-structured-data

PDF to JSON/CSV Data Extractor · Apify


Pricing

from $10.00 / 1,000 PDFs processed

PDF to Structured Data (JSON/CSV)

Convert PDF files into clean structured JSON or CSV: text per page, reconstructed lines, optional table detection, and document metadata.

Rating

0.0

(0)

Developer

Rosario Vitale

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 0

Last modified: 7 days ago

Turn any PDF into clean, structured data your code can actually use. Give the Actor one or more PDF URLs and get back text per page, reconstructed lines in reading order, optional table detection, and document metadata, as JSON or CSV.

No more copy-pasting from PDFs by hand or fighting with brittle local libraries. Send a URL, get structured output.

What it does

  • Text extraction: full text of every page, in natural reading order.
  • Line reconstruction: text items are grouped by position into real lines, not a jumbled blob.
  • Table detection (optional): heuristically splits rows into cells so you can rebuild tables.
  • Metadata (optional): title, author, producer and creation date when present.
  • Batch: pass many PDF URLs in a single run.
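
To make "grouped by position into real lines" concrete, here is a minimal sketch of one common way to do it. It is not the Actor's actual implementation: the `TextItem` shape, the `group_into_lines` helper, and the `y_tolerance` value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """A positioned text fragment (hypothetical shape, not the Actor's internal type)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline position; assumed here to grow down the page


def group_into_lines(items, y_tolerance=2.0):
    """Group fragments with nearly equal baselines, then order each group left to right."""
    lines = []  # each entry is the list of fragments that form one line
    for item in sorted(items, key=lambda i: (i.y, i.x)):
        # Compare against the first fragment of the current line (its anchor baseline).
        if lines and abs(item.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [" ".join(i.text for i in sorted(line, key=lambda i: i.x)) for line in lines]


items = [
    TextItem("Specialization", 120, 50.4),
    TextItem("Trace-based", 10, 50.0),
    TextItem("Languages", 10, 70.0),
    TextItem("Just-in-Time", 60, 49.8),
]
print(group_into_lines(items))
```

The tolerance absorbs small baseline jitter between fragments of the same line; a value that is too large would merge adjacent lines, so it would normally be tuned to the font size.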

Input

Field            Type              Description
pdfUrls          array of strings  Direct links to the PDF files (required).
extractTables    boolean           Detect tables and return rows of cells. Default false.
extractMetadata  boolean           Include document metadata. Default true.
maxPages         integer           Max pages to read per PDF. 0 = all. Default 0.

Example input

{
  "pdfUrls": [
    "https://raw.githubusercontent.com/mozilla/pdf.js/master/web/compressed.tracemonkey-pldi-09.pdf"
  ],
  "extractTables": false,
  "extractMetadata": true,
  "maxPages": 0
}
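
If you build the input in code, a small helper can apply the documented defaults and catch obvious mistakes before a run starts. This is only a sketch: `build_input` is a hypothetical helper, and its checks (HTTP(S) URLs only, non-negative `maxPages`) are assumptions about sensible input, not rules confirmed by the Actor.

```python
import json


def build_input(pdf_urls, extract_tables=False, extract_metadata=True, max_pages=0):
    """Return an input dict using the field names and defaults from the Input table."""
    if not pdf_urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    for url in pdf_urls:
        # Assumption: only direct http(s) links are useful as pdfUrls entries.
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
    if not isinstance(max_pages, int) or max_pages < 0:
        raise ValueError("maxPages must be an integer >= 0 (0 means all pages)")
    return {
        "pdfUrls": list(pdf_urls),
        "extractTables": bool(extract_tables),
        "extractMetadata": bool(extract_metadata),
        "maxPages": max_pages,
    }


run_input = build_input(
    ["https://raw.githubusercontent.com/mozilla/pdf.js/master/web/compressed.tracemonkey-pldi-09.pdf"],
    extract_tables=True,
)
print(json.dumps(run_input, indent=2))
```

The resulting dict can be serialized as the JSON body of a run, whichever client you use to start the Actor.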

Output

One dataset item per PDF:

{
  "url": "https://.../document.pdf",
  "success": true,
  "numPages": 14,
  "pagesExtracted": 14,
  "metadata": {"Producer": "pdfeTeX-1.21a", "Creator": "TeX", "CreationDate": "..."},
  "pages": [
    {
      "pageNumber": 1,
      "text": "Trace-based Just-in-Time Type Specialization ...",
      "lines": ["Trace-based Just-in-Time Type Specialization ...", "Languages"],
      "tables": [["Cell A", "Cell B"], ["1", "2"]]
    }
  ],
  "fullText": "Trace-based Just-in-Time Type Specialization ..."
}
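
Once you have a dataset item in hand, the nested `pages` and `lines` fields can be flattened into one row per line. The sketch below assumes the item shape shown above; `item_to_csv` and its column names are hypothetical, and the layout is not necessarily what the platform's own CSV export produces.

```python
import csv
import io


def item_to_csv(item):
    """Flatten one dataset item into CSV text with one row per reconstructed line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url", "pageNumber", "lineNumber", "line"])
    if not item.get("success"):
        return buf.getvalue()  # failed items carry no pages; emit the header only
    for page in item.get("pages", []):
        for n, line in enumerate(page.get("lines", []), start=1):
            writer.writerow([item["url"], page["pageNumber"], n, line])
    return buf.getvalue()


sample_item = {
    "url": "https://example.com/document.pdf",  # placeholder URL for illustration
    "success": True,
    "pages": [
        {
            "pageNumber": 1,
            "lines": ["Trace-based Just-in-Time Type Specialization", "Languages"],
        }
    ],
}
print(item_to_csv(sample_item))
```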

Export the dataset as JSON, CSV, Excel, or HTML straight from the run, or pull it through the Apify API.

Common use cases

  • Extract data from invoices, receipts, price lists, and bank statements.
  • Feed PDF text into search, RAG pipelines, or LLMs.
  • Turn reports and catalogs into spreadsheets.
  • Archive and index document text at scale.

Notes & limits

  • Works on text-based PDFs. Scanned/image-only PDFs contain no selectable text, so they need OCR (not included in this version).
  • Table detection is a position-based heuristic: great for clean, grid-like tables, approximate for complex layouts.
  • pdfUrls must be direct links to the PDF file (not a viewer page).
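
To illustrate why a position-based heuristic handles grid-like tables well but only approximates complex layouts, here is a minimal sketch that splits one line into cells wherever the horizontal gap between fragments exceeds a threshold. It is an assumption about how such a heuristic might work, not the Actor's code; `split_row_into_cells`, the tuple layout, and `gap_threshold` are hypothetical.

```python
def split_row_into_cells(items, gap_threshold=15.0):
    """Split one line into cells.

    items: list of (x_start, x_end, text) tuples for the fragments of a single line.
    A new cell starts whenever the gap to the previous fragment exceeds gap_threshold.
    """
    cells = []
    current = []
    prev_end = None
    for x_start, x_end, text in sorted(items):
        if prev_end is not None and x_start - prev_end > gap_threshold:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = x_end
    if current:
        cells.append(" ".join(current))
    return cells


row = [(10, 40, "Cell"), (44, 52, "A"), (200, 230, "Cell"), (234, 242, "B")]
print(split_row_into_cells(row))
```

A single fixed threshold is the weak point: merged cells, wrapped cell text, or tightly packed columns produce gaps that do not match it, which is where results become approximate.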

Pricing

Pay-per-result: you are billed per PDF successfully processed. Failed downloads/parses are returned with success: false and are not charged.
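
As a rough worked example of the billing rule, the sketch below counts only items with `success: true` and applies the listed starting rate of $10.00 per 1,000 PDFs. The `estimate_cost` helper is hypothetical, and the real rate may differ, since the listing only states a "from" price.

```python
def estimate_cost(items, price_per_1000=10.00):
    """Return (billable_count, estimated_cost_usd) for a list of dataset items."""
    # Only successfully processed PDFs are billed; failed items are skipped.
    billable = sum(1 for item in items if item.get("success") is True)
    return billable, billable * price_per_1000 / 1000


results = [
    {"url": "https://example.com/a.pdf", "success": True},
    {"url": "https://example.com/b.pdf", "success": True},
    {"url": "https://example.com/missing.pdf", "success": False},
    {"url": "https://example.com/c.pdf", "success": True},
]
count, cost = estimate_cost(results)
print(f"{count} billable PDFs, estimated ${cost:.2f}")
```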
