
URL: https://apify.com/zentrafoundry/pdf-table-extractor

PDF Table Extractor | Apify Actor · Apify


Pricing

$54.00 / 1,000 parsed-tables


PDF Table Extractor

Transform PDF tables into structured rows, clear errors, confidence signals, and automation-ready output.

Rating

0.0

(0)

Developer

Zentra

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: an hour ago

Who this is for

Developers, analysts, data operations teams, AI-agent builders, and automation owners use this actor when they need focused PDF table extraction output instead of a broad generic scraper or manual checking.

Buyer outcomes

  • Turn PDF table inputs into repeatable structured output for downstream systems.
  • Prioritize cleanup with schema, quality, extraction, change, warning, and error fields.
  • Route normalized rows into Apify datasets, APIs, spreadsheets, automations, or AI-agent workflows.

Sources monitored

Inputs

  • sourceMode: use sample for a smoke run, startUrls for URL-backed PDFs/datasets/pages, or configured dataset modes.
  • startUrls: PDF URLs, dataset URLs, public files, or pages to parse, audit, normalize, extract, or compare.
  • sourceIds: approved source or dataset identifiers used to scope the run.
  • maxItems: bounded number of files, tables, rows, fields, or changes to process.
  • watchlistTerms: optional column names, schema keys, quality rules, or extraction terms.
  • webhookUrl: optional completion destination for the transformation report.
  • outputMode: use sample records for Store validation or production output for normal runs.
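
The choice between a sample smoke run and a URL-backed run can be sketched as a small input builder. This is an illustration only: `build_input` is a hypothetical helper, the `sourceMode` values are taken from the list above, and the `{"url": ...}` shape for `startUrls` entries is an assumption based on a common Apify convention, not something this listing confirms.

```python
from typing import Any, Optional


def build_input(pdf_urls: Optional[list[str]] = None, max_items: int = 10) -> dict[str, Any]:
    """Build a run input: URL-backed when URLs are given, sample mode otherwise."""
    if max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    if pdf_urls:
        return {
            "sourceMode": "startUrls",
            # Assumed entry shape; check the actor's input schema before relying on it.
            "startUrls": [{"url": url} for url in pdf_urls],
            "maxItems": max_items,
        }
    # No URLs supplied: fall back to the smoke-run mode described above.
    return {"sourceMode": "sample", "maxItems": max_items}


smoke_run = build_input()
url_run = build_input(["https://example.com/report.pdf"], max_items=5)
```

Keeping `maxItems` small in both branches matches the advice to bound the first run before scaling.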

How it transforms the input

  • Input: PDF, CSV, JSON, Apify dataset URL, table-like document, website, or messy operational data.
  • Transformation: parse, extract, normalize, audit, compare, dedupe, or report schema/quality issues.
  • Output: normalized fields, extracted tables/rows, schema report, diff report, warnings, confidence, and errors.

Outputs

The actor returns structured transformation records: extracted tables, normalized schemas, dataset quality metrics, diff reports, parsed fields, warnings, errors, and confidence signals.

Family-specific fields to expect:

  • extractedRows: Rows parsed or produced by the transformation.
  • schema: Detected, normalized, or target schema.
  • columns: Detected table or dataset columns.
  • validationErrors: Validation, parse, schema, or quality errors.
  • duplicateCount: Duplicate rows or keys found during audit/dedupe.
  • nullRate: Null or empty-value rate for important fields.
  • changedRecords: Added, removed, or changed records for diff workflows.
  • recordId: Stable record ID for exports, dedupe, and downstream joins.
  • title: Human-readable record title for review and export.
  • sourceName: Source identifier used to trace where the record came from.
  • sourceUrl: Direct source URL for review and audit.
  • dedupeKey: Stable key used for delta mode and duplicate suppression.
  • retrievedAt: Timestamp showing when the actor retrieved or generated this record.
  • score: Normalized field for filtering, routing, or downstream review.
  • scoreReasons: Buyer-readable explanation for the score or match.
  • confidence: Normalized field for filtering, routing, or downstream review.
  • errors: Normalized field for filtering, routing, or downstream review.
  • runSummary: Run-level summary for counts, filters, charges, and next actions.
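
One way to use these fields downstream is to split records into "clean" and "needs review" before loading them anywhere. The sketch below is hedged: the listing does not document field types, so it assumes `confidence` is a number between 0 and 1 and that `validationErrors` and `errors` are empty when nothing went wrong. The function names, the threshold, and the sample records are all made up for illustration.

```python
from typing import Any


def needs_review(record: dict[str, Any], min_confidence: float = 0.8) -> bool:
    """Return True when a record reports errors or has low or missing confidence."""
    if record.get("validationErrors") or record.get("errors"):
        return True
    confidence = record.get("confidence")
    # Assumption: confidence is numeric in [0, 1]; anything else is treated as unknown.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True
    return confidence < min_confidence


def split_records(records: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition records into (clean, review) lists, preserving order."""
    clean: list[dict[str, Any]] = []
    review: list[dict[str, Any]] = []
    for record in records:
        (review if needs_review(record) else clean).append(record)
    return clean, review


# Hypothetical records, not real actor output.
sample = [
    {"recordId": "a", "confidence": 0.95, "validationErrors": []},
    {"recordId": "b", "confidence": 0.40, "validationErrors": []},
]
clean, review = split_records(sample)
```

Treating a missing confidence value as "needs review" is a deliberately conservative choice; relax it if the real output turns out to omit the field for valid rows.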

Pricing

This actor uses Apify pay-per-event pricing. Current public listing guidance: $29-$49 / 1,000 launch validation records until public data proof is complete. Charges are tied to buyer-visible value events such as document-parsed, dataset-processed, record-saved, and enriched-record. Small validation runs are supported so you can inspect output before scaling a schedule.

  • document-parsed: Charge when PDF Table Extractor produces Enriched Record. Typical price: $0.043. A run that produces 10 matching records charges only for the matched buyer-value events and remains capped by the run limit.
  • dataset-processed: Base charge when PDF Table Extractor writes a non-empty default dataset. Typical price: $0.011. A run that produces 10 matching records charges only for the matched buyer-value events and remains capped by the run limit.
  • record-saved: Charge for each buyer-visible result saved by PDF Table Extractor. Typical price: $0.003. A run that produces 10 matching records charges only for the matched buyer-value events and remains capped by the run limit.
  • enriched-record: Charge when PDF Table Extractor adds match scoring, source evidence, or enrichment to a saved result. Typical price: $0.022. A run that produces 10 matching records charges only for the matched buyer-value events and remains capped by the run limit.
  • first-run-cap: Recommended first run budget cap. Typical price: $3.820. Start with the default small run, inspect the dataset, then raise maxItems or schedule recurring runs.
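
To make the event prices concrete, the arithmetic for a small run can be sketched as follows. The prices are the "typical" figures listed above. The event counts are a hypothetical scenario, since the listing does not say how many of each event a given run triggers, and actual charges may differ.

```python
from decimal import Decimal

# Typical per-event prices quoted in the list above (USD).
TYPICAL_PRICES = {
    "document-parsed": Decimal("0.043"),
    "dataset-processed": Decimal("0.011"),
    "record-saved": Decimal("0.003"),
    "enriched-record": Decimal("0.022"),
}
FIRST_RUN_CAP = Decimal("3.820")  # recommended first-run budget cap


def estimate_cost(event_counts: dict[str, int]) -> Decimal:
    """Sum price * count over the events a run is expected to trigger."""
    return sum(
        (TYPICAL_PRICES[name] * count for name, count in event_counts.items()),
        Decimal("0"),
    )


# Hypothetical run: one document, one non-empty dataset, 10 saved and enriched records.
counts = {
    "document-parsed": 1,
    "dataset-processed": 1,
    "record-saved": 10,
    "enriched-record": 10,
}
estimate = estimate_cost(counts)  # 0.043 + 0.011 + 0.030 + 0.220 = 0.304
within_cap = estimate <= FIRST_RUN_CAP
```

Under these assumed counts the run stays well below the recommended cap, which is consistent with the advice to start small and inspect the dataset first.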

API example

curl -X POST "https://api.apify.com/v2/actors/zentrafoundry~pdf-table-extractor/runs" \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxItems":10,"sourceIds":["APIFY-DATASETS"],"includeSourceUrls":true,"includeMatchReasons":true,"outputMode":"buyer-ready-records"}'
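
The same request can be prepared from Python using only the standard library. This is a sketch, not an official client: it reuses the URL, headers, and body from the curl example above, `build_run_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

# Endpoint copied from the curl example above.
RUNS_URL = "https://api.apify.com/v2/actors/zentrafoundry~pdf-table-extractor/runs"


def build_run_request(run_input: dict, token: str) -> urllib.request.Request:
    """Prepare a POST request that mirrors the curl example."""
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


run_input = {
    "maxItems": 10,
    "sourceIds": ["APIFY-DATASETS"],
    "includeSourceUrls": True,
    "includeMatchReasons": True,
    "outputMode": "buyer-ready-records",
}
request = build_run_request(run_input, os.environ.get("APIFY_TOKEN", ""))
# To actually start the run: urllib.request.urlopen(request)
```

Separating request construction from sending makes it easy to inspect the payload before any charge-triggering run is started.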

Recommended first run

{
  "maxItems": 10,
  "sourceIds": [
    "APIFY-DATASETS"
  ],
  "includeSourceUrls": true,
  "includeMatchReasons": true,
  "outputMode": "buyer-ready-records"
}

Sample output

Sample status: sample_unavailable at https://zentra.nimblique.studio/external/actor-review/samples/pdf-table-extractor.json. No fake sample is published; run a bounded real sample refresh before using examples in promotion.

Recommended public tasks

[
  {
    "name": "Validate one small data transformation",
    "description": "Low-cost validation run for checking parsed, normalized, audited, or diffed output.",
    "input": {
      "maxItems": 10,
      "sourceIds": [
        "APIFY-DATASETS"
      ],
      "includeSourceUrls": true,
      "includeMatchReasons": true,
      "outputMode": "buyer-ready-records",
      "actorSlug": "pdf-table-extractor"
    }
  },
  {
    "name": "Recurring dataset utility check",
    "description": "Recurring batch for schema, quality, extraction, or change reports.",
    "schedule": "Daily during local business hours",
    "input": {
      "maxItems": 25,
      "sourceIds": [
        "APIFY-DATASETS"
      ],
      "includeSourceUrls": true,
      "includeMatchReasons": true,
      "outputMode": "buyer-ready-records",
      "actorSlug": "pdf-table-extractor"
    }
  }
]

Use cases

  • Clean, extract, compare, or audit PDF table data before it enters a downstream workflow.
  • Convert messy inputs into predictable JSON/CSV-ready rows for APIs, spreadsheets, or agents.
  • Surface schema drift, duplicates, nulls, errors, warnings, or changed records.
  • Use small validation runs before connecting larger datasets or destinations.

Trust and compliance

  • Uses Apify datasets/storage.
  • Keeps source URLs and source identifiers in output records for auditability.
  • Does not require private credentials unless a source is explicitly configured for approved authenticated access.

Limitations

  • Results depend on public-source availability, source uptime, and source update cadence.
  • Public sources can revise records after publication; rerun scheduled tasks for fresh evidence.
  • Scores and match reasons are decision-support signals, not legal, financial, procurement, medical, safety, or regulatory advice.
  • Large production runs can cost more than the default smoke run; start small, inspect output, then scale schedules.

FAQ

Can I run this without URLs? Yes. The default sample mode is designed to succeed without user-supplied URLs, and URL-backed runs can use startUrls when needed.

Can I schedule it? Yes. Use sinceLastRun, watchlistTerms, and optional webhookUrl to turn the actor into a recurring alert or report workflow.

How do I verify value before scaling? Run the recommended first-run input, review the sample output fields, then increase maxItems or schedule recurring runs after the dataset matches your use case.
