URL: https://apify.com/zentrafoundry/apify-dataset-quality-auditor

Validate and Audit Apify Datasets · Apify


Pricing

$54.00 / 1,000 reports

Apify Dataset QA Gate

Score Apify datasets and emit actionable quality issues before downstream use.

Rating

0.0

(0)

Developer

๐Ÿ‘ Zentra

Zentra

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: an hour ago

Apify Dataset Quality Auditor

Turn Apify dataset inputs into structured rows, clear errors, confidence signals, and automation-ready output.

Who this is for

Developers, analysts, data operations teams, AI-agent builders, and automation owners use this actor when they need focused dataset quality audit output instead of a broad generic scraper or manual checking.

Buyer outcomes

  • Turn Apify dataset inputs into repeatable structured output for downstream systems.
  • Prioritize cleanup with schema, quality, extraction, change, warning, and error fields.
  • Route normalized rows into Apify datasets, APIs, spreadsheets, automations, or AI-agent workflows.

Sources monitored

Inputs

  • sourceMode: use sample for a smoke run, startUrls for URL-backed PDFs/datasets/pages, or configured dataset modes.
  • startUrls: PDF URLs, dataset URLs, public files, or pages to parse, audit, normalize, extract, or compare.
  • sourceIds: approved source or dataset identifiers used to scope the run.
  • maxItems: bounded number of files, tables, rows, fields, or changes to process.
  • watchlistTerms: optional column names, schema keys, quality rules, or extraction terms.
  • webhookUrl: optional completion destination for the transformation report.
  • outputMode: use sample records for Store validation or production output for normal runs.
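
As a rough illustration of how the inputs above fit together, the following Python sketch assembles a run input and applies two basic checks. The helper name build_input, the validation rules, and the choice to pass startUrls as plain strings are assumptions made for illustration; the actor's actual input schema may differ.

```python
from typing import Any, Dict, List, Optional


def build_input(
    source_mode: str = "sample",
    *,
    start_urls: Optional[List[str]] = None,
    source_ids: Optional[List[str]] = None,
    max_items: int = 10,
    watchlist_terms: Optional[List[str]] = None,
    webhook_url: Optional[str] = None,
    output_mode: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a run input from the documented field names (hypothetical helper)."""
    # Assumption: maxItems must be a positive integer because it bounds the run.
    if not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    # Assumption: URL-backed runs need at least one URL to work on.
    if source_mode == "startUrls" and not start_urls:
        raise ValueError("sourceMode 'startUrls' needs at least one URL")

    candidates = {
        "sourceMode": source_mode,
        "startUrls": start_urls,
        "sourceIds": source_ids,
        "maxItems": max_items,
        "watchlistTerms": watchlist_terms,
        "webhookUrl": webhook_url,
        "outputMode": output_mode,
    }
    # Leave out optional fields that were not supplied.
    return {key: value for key, value in candidates.items() if value is not None}
```

Calling build_input() with no arguments yields a sample-mode input bounded to 10 items, which matches the smoke-run use described above.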

How it transforms the input

  • Input: PDF, CSV, JSON, Apify dataset URL, table-like document, website, or messy operational data.
  • Transformation: parse, extract, normalize, audit, compare, dedupe, or report schema/quality issues.
  • Output: normalized fields, extracted tables/rows, schema report, diff report, warnings, confidence, and errors.

Outputs

The actor returns structured transformation records: extracted tables, normalized schemas, dataset quality metrics, diff reports, parsed fields, warnings, errors, and confidence signals.

Family-specific fields to expect:

  • extractedRows: Rows parsed or produced by the transformation.
  • schema: Detected, normalized, or target schema.
  • columns: Detected table or dataset columns.
  • validationErrors: Validation, parse, schema, or quality errors.
  • duplicateCount: Duplicate rows or keys found during audit/dedupe.
  • nullRate: Null or empty-value rate for important fields.
  • changedRecords: Added, removed, or changed records for diff workflows.
  • recordId: Stable record ID for exports, dedupe, and downstream joins.
  • title: Human-readable record title for review and export.
  • sourceName: Source identifier used to trace where the record came from.
  • sourceUrl: Direct source URL for review and audit.
  • dedupeKey: Stable key used for delta mode and duplicate suppression.
  • retrievedAt: Timestamp showing when the actor retrieved or generated this record.
  • score: Normalized field for filtering, routing, or downstream review.
  • scoreReasons: Buyer-readable explanation for the score or match.
  • confidence: Normalized field for filtering, routing, or downstream review.
  • errors: Normalized field for filtering, routing, or downstream review.
  • runSummary: Run-level summary for counts, filters, charges, and next actions.
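
To make the nullRate and duplicateCount fields more concrete, here is a minimal Python sketch of how such metrics could be computed over a list of rows. The listing does not define the exact formulas, so both definitions below (null or empty cells divided by checked cells, and rows whose key was already seen) are assumptions, and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def null_rate(rows: List[Dict[str, Any]], fields: Iterable[str]) -> float:
    """Share of checked cells that are missing, None, or an empty string."""
    field_list = list(fields)
    total_cells = len(rows) * len(field_list)
    if total_cells == 0:
        return 0.0
    empty_cells = sum(
        1
        for row in rows
        for field in field_list
        if row.get(field) is None or row.get(field) == ""
    )
    return empty_cells / total_cells


def duplicate_count(rows: List[Dict[str, Any]], key: str) -> int:
    """Number of rows whose key value already appeared in an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates


rows = [
    {"id": 1, "name": "alpha"},
    {"id": 1, "name": ""},
    {"id": 2, "name": None},
]
name_null_rate = null_rate(rows, ["name"])   # two of three name cells are empty
id_duplicates = duplicate_count(rows, "id")  # the second row repeats id 1
```

A downstream gate could compare values like these against thresholds of its own choosing before accepting a dataset.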

Pricing

This actor uses Apify pay-per-event pricing. Current public listing guidance: $29-$49 / 1,000 launch validation records until public data proof is complete. Charges are tied to buyer-visible value events such as qa-report-created, row-audited, issue-found, dataset-processed, record-saved, and enriched-record. A run that produces 10 matching records is charged only for the matched buyer-value events and stays capped by the run limit. Small validation runs are supported so you can inspect output before scaling a schedule.

  • qa-report-created: Charged for each dataset QA report produced. Typical price: $0.180.
  • row-audited: Charged for each row audited. Typical price: $0.001.
  • issue-found: Charged for each actionable quality issue found. Typical price: $0.004.
  • dataset-processed: Base charge when Apify Dataset Quality Auditor writes a non-empty default dataset. Typical price: $0.011.
  • first-run-cap: Recommended budget cap for the first run: $2.000. Start with the default small run, inspect the dataset, then raise maxItems or schedule recurring runs.
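
The typical event prices listed above can be combined into a rough cost estimate. The Python sketch below assumes a run's cost is simply the sum of event counts multiplied by those typical prices; how many events a real run emits, and what Apify ultimately bills, is not stated in the listing, so the counts in the example are purely hypothetical.

```python
from decimal import Decimal
from typing import Dict

# Typical per-event prices quoted in the listing, in USD.
TYPICAL_PRICES = {
    "qa-report-created": Decimal("0.180"),
    "row-audited": Decimal("0.001"),
    "issue-found": Decimal("0.004"),
    "dataset-processed": Decimal("0.011"),
}

# Recommended first-run budget cap quoted in the listing, in USD.
FIRST_RUN_CAP = Decimal("2.000")


def estimate_run_cost(event_counts: Dict[str, int]) -> Decimal:
    """Sum count * typical price over the given events (assumed billing model)."""
    total = Decimal("0")
    for event, count in event_counts.items():
        total += TYPICAL_PRICES[event] * count
    return total


# Hypothetical small run: one report over 10 rows with 3 issues found.
example_counts = {
    "qa-report-created": 1,
    "row-audited": 10,
    "issue-found": 3,
    "dataset-processed": 1,
}
example_cost = estimate_run_cost(example_counts)  # 0.180 + 0.010 + 0.012 + 0.011 = 0.213
fits_first_run_cap = example_cost <= FIRST_RUN_CAP
```

Decimal is used instead of float so that small per-event prices add up exactly.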

API example

curl -X POST "https://api.apify.com/v2/actors/zentrafoundry~apify-dataset-quality-auditor/runs" \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxItems":10,"sourceIds":["APIFY-DATASETS"],"includeSourceUrls":true,"includeMatchReasons":true,"outputMode":"buyer-ready-records"}'
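
For callers who prefer Python over curl, the same request can be sketched with the standard library. The endpoint URL, headers, and JSON body are copied from the curl example above; the helper name build_run_request is hypothetical, and the sketch only builds the request object without sending it.

```python
import json
import urllib.request
from typing import Any, Dict

# Endpoint copied from the curl example in this listing.
RUNS_URL = (
    "https://api.apify.com/v2/actors/"
    "zentrafoundry~apify-dataset-quality-auditor/runs"
)


def build_run_request(token: str, run_input: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a run."""
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        RUNS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Same input as the curl example.
first_run_input = {
    "maxItems": 10,
    "sourceIds": ["APIFY-DATASETS"],
    "includeSourceUrls": True,
    "includeMatchReasons": True,
    "outputMode": "buyer-ready-records",
}
```

To actually start a run, pass the result of build_run_request(os.environ["APIFY_TOKEN"], first_run_input) to urllib.request.urlopen and read the JSON response.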

Recommended first run

{
  "maxItems": 10,
  "sourceIds": ["APIFY-DATASETS"],
  "includeSourceUrls": true,
  "includeMatchReasons": true,
  "outputMode": "buyer-ready-records"
}

Sample output

Sample status: sample_unavailable at https://zentra.nimblique.studio/external/actor-review/samples/apify-dataset-quality-auditor.json. No sample output is published yet, and no fabricated sample is provided in its place; run a small, bounded real run to refresh the sample before relying on examples.

Recommended public tasks

[
  {
    "name": "Validate one small data transformation",
    "description": "Low-cost validation run for checking parsed, normalized, audited, or diffed output.",
    "input": {
      "maxItems": 10,
      "sourceIds": ["APIFY-DATASETS"],
      "includeSourceUrls": true,
      "includeMatchReasons": true,
      "outputMode": "buyer-ready-records",
      "actorSlug": "apify-dataset-quality-auditor"
    }
  },
  {
    "name": "Recurring dataset utility check",
    "description": "Recurring batch for schema, quality, extraction, or change reports.",
    "schedule": "Daily during local business hours",
    "input": {
      "maxItems": 25,
      "sourceIds": ["APIFY-DATASETS"],
      "includeSourceUrls": true,
      "includeMatchReasons": true,
      "outputMode": "buyer-ready-records",
      "actorSlug": "apify-dataset-quality-auditor"
    }
  }
]

Use cases

  • Clean, extract, compare, or audit Apify dataset data before it enters a downstream workflow.
  • Convert messy inputs into predictable JSON/CSV-ready rows for APIs, spreadsheets, or agents.
  • Surface schema drift, duplicates, nulls, errors, warnings, or changed records.
  • Use small validation runs before connecting larger datasets or destinations.

Trust and compliance

  • Uses Apify datasets/storage.
  • Keeps source URLs and source identifiers in output records for auditability.
  • Does not require private credentials unless a source is explicitly configured for approved authenticated access.

Limitations

  • Results depend on public-source availability, source uptime, and source update cadence.
  • Public sources can revise records after publication; rerun scheduled tasks for fresh evidence.
  • Scores and match reasons are decision-support signals, not legal, financial, procurement, medical, safety, or regulatory advice.
  • Large production runs can cost more than the default smoke run; start small, inspect output, then scale schedules.

FAQ

Can I run this without URLs? Yes. The default sample mode is designed to succeed without user-supplied URLs, and URL-backed runs can use startUrls when needed.

Can I schedule it? Yes. Use sinceLastRun, watchlistTerms, and optional webhookUrl to turn the actor into a recurring alert or report workflow.

How do I verify value before scaling? Run the recommended first-run input, review the sample output fields, then increase maxItems or schedule recurring runs after the dataset matches your use case.
