
URL: https://apify.com/zuzka/dataset-to-schema

Dataset(s) To Schema · Apify


Pricing

Pay per usage


Dataset(s) To Schema

Takes one or more dataset IDs and writes a JSON Schema describing the datasets' contents to the key-value store.


Rating

5.0

(1)

Developer

Zuzka Pelechová

Maintained by Community

Actor stats

0

Bookmarked

8

Total users

0

Monthly active users

6 months ago

Last modified


Dataset to Schema

Generates a JSON Schema from one or more datasets on Apify. The actor scans dataset items, detects data types for each field (including merging multiple types), and outputs the resulting schema:

  • Saves it to the Key‑Value Store under the key SCHEMA (as application/json),
  • Also pushes the same schema as an item to the run’s output dataset for convenient viewing or sharing.

Use cases: validating scraper outputs, generating OpenAPI definitions or validators, and quickly checking data consistency across multiple datasets.


Input (input schema)

{
  "title": "Generate schema from datasets",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "datasetIds": {
      "title": "Dataset IDs",
      "type": "array",
      "description": "IDs of the datasets for which to generate a schema",
      "editor": "stringList"
    }
  },
  "required": ["datasetIds"]
}

Fields

  • datasetIds (array): list of Apify dataset IDs to include in schema generation. You can provide one or more IDs; the actor iterates through them and merges their schemas together.

Output

The actor produces the same schema in two places:

  1. Key‑Value Store: key SCHEMA – complete JSON Schema file (e.g., schema.json).
  2. Output dataset: a single item containing the full schema (for quick preview in the console).

Example output schema (truncated)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "title": {"type": ["string", "null"]},
    "price": {"type": ["number", "string"]},
    "inStock": {"type": "boolean"},
    "images": {
      "type": "array",
      "items": {"type": "string"}
    }
  },
  "additionalProperties": true
}

Note: The actor merges multiple observed types into union types (e.g., "type": ["number", "string"]) when data varies.
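To illustrate what a union type permits, here is a minimal sketch that checks items against the example schema above. It is not a full JSON Schema validator: it only understands the "type", "properties", and "items" keywords, and the helper names (matches_type, conforms) and the sample items are illustrative assumptions, not part of the actor.

```python
# Example schema copied from the section above.
EXAMPLE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "title": {"type": ["string", "null"]},
        "price": {"type": ["number", "string"]},
        "inStock": {"type": "boolean"},
        "images": {"type": "array", "items": {"type": "string"}},
    },
    "additionalProperties": True,
}


def matches_type(value, type_name):
    """Return True if a parsed JSON value has the given JSON Schema type."""
    if type_name == "null":
        return value is None
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    return False


def conforms(value, schema):
    """Check only "type", "properties", and "items" (hypothetical helper)."""
    declared = schema.get("type")
    if declared is not None:
        # A union type is a list of names; a single type is a plain string.
        allowed = [declared] if isinstance(declared, str) else declared
        if not any(matches_type(value, name) for name in allowed):
            return False
    if isinstance(value, dict):
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], subschema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(element, schema["items"]) for element in value)
    return True
```

With this sketch, an item whose price is the string "12.99" and one whose price is the number 12.99 both conform, while a boolean price does not. For real validation, a complete JSON Schema validator is the better choice.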


How It Works

  • Reads datasetIds from the input.
  • Iterates through each dataset and detects field types: number, string, boolean, object, array (unifying differing values into union types if needed).
  • Merges all detected fields into a single schema covering all datasets.
  • Saves the final schema to the KV Store (SCHEMA) and pushes it to the output dataset.
  • If a dataset exceeds internal iteration limits (≈1 M items), logs a warning that the schema may be incomplete but still completes the run.
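The actor's source is not shown on this page, so the following is only a sketch of how type detection and merging of this kind could work, under stated assumptions. All function names (json_type, infer_schema, merge_schemas, schema_for_items) are hypothetical, integers are reported as "number" to match the type list above, and union types are emitted in sorted order, which may differ from the actor's ordering.

```python
def json_type(value):
    """Map a parsed JSON value to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def _type_set(schema):
    """Return the schema's "type" keyword as a set of type names."""
    declared = schema.get("type", [])
    return {declared} if isinstance(declared, str) else set(declared)


def merge_schemas(a, b):
    """Combine two schemas, widening differing types into a union."""
    types = sorted(_type_set(a) | _type_set(b))
    merged = {"type": types[0] if len(types) == 1 else types}
    props_a, props_b = a.get("properties", {}), b.get("properties", {})
    if props_a or props_b:
        merged["properties"] = {}
        for key in sorted(set(props_a) | set(props_b)):
            if key in props_a and key in props_b:
                merged["properties"][key] = merge_schemas(props_a[key], props_b[key])
            else:
                merged["properties"][key] = props_a[key] if key in props_a else props_b[key]
    if "items" in a and "items" in b:
        merged["items"] = merge_schemas(a["items"], b["items"])
    elif "items" in a or "items" in b:
        merged["items"] = a["items"] if "items" in a else b["items"]
    return merged


def infer_schema(value):
    """Build a schema for one value, recursing into objects and arrays."""
    type_name = json_type(value)
    schema = {"type": type_name}
    if type_name == "object":
        schema["properties"] = {k: infer_schema(v) for k, v in value.items()}
    elif type_name == "array" and value:
        items = infer_schema(value[0])
        for element in value[1:]:
            items = merge_schemas(items, infer_schema(element))
        schema["items"] = items
    return schema


def schema_for_items(items):
    """Fold the per-item schemas into one; returns None for no items."""
    schema = None
    for item in items:
        current = infer_schema(item)
        schema = current if schema is None else merge_schemas(schema, current)
    return schema
```

Feeding this sketch two items, one with a numeric price and one with a string price, yields a "price" property whose type is the union ["number", "string"], which mirrors the behavior described in the steps above.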

Quick Start on Apify

  1. Create a run of the actor in the Apify Console.

  2. Provide input:

    {"datasetIds":["abc123","def456"]}
  3. Run it. After completion, open Storage → Key‑Value Store and download SCHEMA. Alternatively, open the output dataset to view the schema item.
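The same steps can also be scripted. The sketch below uses only the Python standard library and the Apify API's synchronous "run and return dataset items" endpoint, which should return the schema item that the actor pushes to its output dataset. The actor identifier zuzka~dataset-to-schema is inferred from the store URL, and the function names and token handling are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Assumed from the store URL; the API writes "username~actor-name".
ACTOR_ID = "zuzka~dataset-to-schema"


def build_run_request(dataset_ids, token):
    """Prepare a POST request that runs the actor with the given input."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps({"datasetIds": list(dataset_ids)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_schema(dataset_ids, token):
    """Run the actor and return the first output item (the schema), if any."""
    request = build_run_request(dataset_ids, token)
    with urllib.request.urlopen(request) as response:
        items = json.load(response)
    return items[0] if items else None
```

Calling generate_schema(["abc123", "def456"], token) with a valid API token would perform the run and return the schema as a dictionary. A synchronous call suits short runs; for very large datasets, starting the run asynchronously and reading SCHEMA from the key-value store afterwards may be more reliable.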

Limitations & Edge Cases

  • Large datasets (> ~1 M items): the actor logs a warning (“Schema might not be perfect.”) and continues. For higher accuracy, generate a schema from a smaller sample or pre‑aggregate data.
  • Heterogeneous data: if fields vary widely, expect broader union types; this is intentional, so that the schema reflects observed variability.

You might also like

Validate Dataset(s) with JSON Schema

jaroslavhejlek/validate-dataset-with-json-schema

This Actor validates items in one or more datasets against a provided JSON Schema. Use it if you are planning to add a dataset validation schema to your Actor and want to test it.


Jaroslav Hejlek

5

Structured Data Extractor β€” URL to JSON

shelvick/structured-extractor

Extract structured data from a batch of URLs as schema-validated JSON. Send web pages and a JSON Schema; it scrapes each (stealth + residential proxy as needed), runs an LLM to convert the page to JSON matching your schema, and validates per URL. Omit schema for best-effort. Public pages only.

2

Output & Dataset Schema Creator

zuzka/output-dataset-schema-creator

Generate JSON schemas for output and dataset on your Actor using AI. Perfect for testing new actors.

Zuzka Pelechová

1

Dataset Quality Gate - Schema & Data QA

jy-labs/dataset-quality-gate

Validate Apify Datasets by pasted items, Dataset ID, or Run ID before delivery, automation, or AI/RAG ingestion. Catch schema drift, missing fields, duplicates, and bad URLs/emails/dates.

Forward Dataset to Actor or Task

valek.josef/forward-dataset-to-actor-or-task

Forwards contents of specified dataset to a specified field on the input of another Actor or task.

Josef Válek

22

Data.gov.uk Scraper - Cheap

scrapestorm/data-gov-uk-scraper---cheap

Easily collect dataset listings from data.gov.uk. Provide one or more search URLs and extract dataset information such as dataset title, publisher, last-updated date, description, dataset URL, and more. Suited to open data research, government data monitoring, and dataset discovery.

1

5.0

Google Dataset Items Translator

web.harvester/google-dataset-items-translator

Translate any dataset field(s) to any of the supported languages using the Google Translate website. The Actor goes through all the items in the dataset and translates all of the selected fields.

20

Dataset Download

idiatech/apify-Dataset-Download

Download any dataset from the Apify platform automatically and in any format you want. Use this actor along with a Dataset toolbox automation tool.

Zip Key-value Store

jaroslavhejlek/zip-key-value-store

Takes the ID of a key-value store, archives all its keys into a zip file, and saves the file into the key-value store of the actor. For more than 1000 keys, multiple zip files are created. If the total size is bigger than the actor's available memory, it creates multiple smaller zip files.


Jaroslav Hejlek

209

Related articles

Dataset processing on Apify
Your Apify Actor's input schema is its UI. Here's how I design mine after 20+ Actors.
Why you should be using Actor schemas