URL: https://apify.com/automation-lab/openml-scraper

OpenML Scraper - ML Datasets, Tasks and Flows · Apify


Pricing

Pay per event

OpenML Dataset Scraper

Scrape ML datasets, tasks, flows, and runs from OpenML - the open science platform for machine learning

Developer

Stas Persiianenko

Maintained by Community

OpenML Scraper

Extract ML datasets, benchmark tasks, and algorithm flows from OpenML - the open science platform for machine learning. Get structured metadata for thousands of public ML benchmark datasets including feature counts, instance counts, class distributions, quality metrics, tags, download URLs, and more.

No API key required. No proxy needed. Pure HTTP access to OpenML's public REST API.

What does it do?

OpenML Scraper connects to the OpenML public REST API (openml.org/api/v1/json) and extracts structured data for three resource types:

  • 📊 Datasets - ML benchmark datasets with quality metrics (features, instances, classes, missing values), tags, descriptions, download URLs, and metadata
  • 🎯 Tasks - Supervised classification and regression tasks defining evaluation procedures and target attributes
  • ⚙️ Flows - Algorithm and pipeline implementations (scikit-learn, Weka, R packages, etc.) uploaded by the community

Results are pushed to an Apify dataset in clean, flat JSON format - ready for analysis, filtering, or export to CSV/Excel.
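As a rough illustration of what "connecting to the REST API" involves, the sketch below builds list URLs for the three resource types. It is not the actor's source code: the helper name `build_list_url` is hypothetical, and the path-style filter segments (`limit`, `offset`, `status`, `data_name`) and the mapping of `any` to OpenML's `all` status are assumptions based on OpenML's public API conventions.

```python
from urllib.parse import quote

API_BASE = "https://www.openml.org/api/v1/json"

# Assumed mapping from the actor's resourceType values to OpenML list endpoints.
LIST_PATHS = {"datasets": "data/list", "tasks": "task/list", "flows": "flow/list"}


def build_list_url(resource_type, limit=100, offset=0, status=None, name=None):
    """Build a list URL with path-style filters (hypothetical helper)."""
    if resource_type not in LIST_PATHS:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    parts = [API_BASE, LIST_PATHS[resource_type],
             "limit", str(limit), "offset", str(offset)]
    if resource_type == "datasets":
        if status:
            # Assumption: the actor's "any" corresponds to OpenML's "all".
            parts += ["status", "all" if status == "any" else status]
        if name:
            parts += ["data_name", quote(name, safe="")]
    return "/".join(parts)


print(build_list_url("datasets", limit=10, status="active", name="iris"))
```

Status and name filters are only applied to datasets here, mirroring the input description where tasks have no name filter and flows are filtered client-side.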

Who is it for?

ML researchers who want to browse and discover datasets for benchmarking without clicking through the OpenML web UI. Filter by name, status, or type and get all metadata in a single structured output.

AutoML engineers building dataset recommendation systems or experiment tracking pipelines. Use the scraper to programmatically catalog available benchmark datasets and their properties.

Data scientists who need to audit which OpenML datasets match their constraints (minimum features, instances, classes) for reproducible research.

Platform builders creating dataset directories or ML curriculum tools who need a machine-readable catalog of public benchmark datasets.

Students and educators exploring the landscape of ML datasets for teaching purposes - quickly find datasets by name, size, or domain tag.

Why use it?

OpenML's REST API is public and powerful, but integrating it into workflows requires building custom fetch/pagination/normalization code. This actor handles all of that:

  • ✅ Pagination built-in - fetches all matching results up to your maxResults limit, automatically handling page offsets
  • ✅ Rich metadata - goes beyond the list API to fetch full dataset descriptions, upload dates, licence info, and download URLs
  • ✅ Quality metrics extracted - flattens the nested quality array into named fields (numberOfFeatures, numberOfInstances, etc.)
  • ✅ No auth needed - OpenML's public API requires no API key
  • ✅ Retry logic - configurable retry count for transient failures
  • ✅ Clean flat output - no nested objects, ready for Apify datasets table view and CSV export
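Two of the points above, page offsets and quality flattening, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actor's implementation: the page size of 1,000 is arbitrary, and the shape of the `quality` array (a list of `{"name", "value"}` pairs with string values) is assumed from OpenML's list responses.

```python
def page_offsets(max_results, page_size=1000):
    """Yield (offset, limit) pairs that together cover max_results items."""
    offset = 0
    while offset < max_results:
        limit = min(page_size, max_results - offset)
        yield offset, limit
        offset += limit


def flatten_qualities(dataset):
    """Return a copy of dataset with its 'quality' list turned into flat fields."""
    flat = {k: v for k, v in dataset.items() if k != "quality"}
    for quality in dataset.get("quality", []):
        name = quality["name"]            # e.g. "NumberOfFeatures"
        key = name[0].lower() + name[1:]  # becomes "numberOfFeatures"
        value = float(quality["value"])   # assumed to arrive as a string like "5.0"
        flat[key] = int(value) if value.is_integer() else value
    return flat


raw = {
    "did": 61,
    "name": "iris",
    "quality": [
        {"name": "NumberOfFeatures", "value": "5.0"},
        {"name": "NumberOfInstances", "value": "150.0"},
    ],
}
print(list(page_offsets(2500)))
print(flatten_qualities(raw))
```

The last page is shortened so the total never exceeds the requested maximum.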

Data extracted

Datasets

| Field | Description |
| --- | --- |
| id | OpenML dataset ID |
| name | Dataset name |
| version | Dataset version number |
| status | active / deactivated / in_preparation |
| format | File format (ARFF, CSV, etc.) |
| url | OpenML dataset page URL |
| downloadUrl | Direct ARFF file download URL |
| numberOfFeatures | Total number of attributes/columns |
| numberOfInstances | Total number of rows/samples |
| numberOfClasses | Number of target classes (classification datasets) |
| numberOfMissingValues | Count of missing values across all cells |
| uploadDate | When the dataset was uploaded |
| description | Dataset description (up to 500 chars) |
| licence | Licence (Public, CC BY, etc.) |
| defaultTargetAttribute | Default prediction target column name |
| tags | Array of tags (domain, study labels, source) |

Tasks

| Field | Description |
| --- | --- |
| id | Task ID |
| name | Task name (usually dataset name) |
| taskType | Task type (Supervised Classification, Supervised Regression, etc.) |
| taskTypeId | Numeric task type ID |
| datasetId | Source dataset ID |
| status | Task status |
| targetFeature | Target column to predict |
| estimationProcedure | Cross-validation procedure ID |
| evaluationMeasures | Primary evaluation metric |
| numberOfFeatures | Features in the underlying dataset |
| numberOfInstances | Instances in the underlying dataset |
| url | OpenML task page URL |

Flows

| Field | Description |
| --- | --- |
| id | Flow ID |
| name | Flow name (e.g., sklearn.ensemble.forest.RandomForestClassifier) |
| fullName | Full name with version (e.g., sklearn...RandomForestClassifier(8)) |
| version | Flow version number |
| externalVersion | External library version tag |
| uploaderId | User ID of the uploader |
| url | OpenML flow page URL |
How much does it cost to scrape OpenML datasets?

💡 Free plan estimate: ~100 free results per month on the Apify Free plan.

The actor uses Pay-Per-Event (PPE) pricing:

| Event | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
| --- | --- | --- | --- | --- | --- |
| Run started | flat fee | flat fee | flat fee | flat fee | flat fee |
| Per result | ~$0.000029 | ~$0.0000225 | ~$0.0000173 | ~$0.0000115 | ~$0.00001 |

Example costs:

  • 100 datasets: ~$0.008
  • 500 datasets: ~$0.019
  • 1,000 datasets: ~$0.034

OpenML has ~6,000 active datasets, ~100,000 tasks, and ~20,000 flows. A full catalog export at BRONZE pricing costs ~$0.18 to $2.89 depending on resource type.
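The example costs above are consistent with a simple "start fee plus per-result price" formula, sketched below. The per-result prices come from the pricing table; the start fee is not stated on this page, so the value of about $0.005 is an assumption inferred from the example costs, and `estimate_cost` is a hypothetical helper.

```python
# Per-result prices in USD, taken from the pricing table above.
PER_RESULT_USD = {
    "BRONZE": 0.000029,
    "SILVER": 0.0000225,
    "GOLD": 0.0000173,
    "PLATINUM": 0.0000115,
    "DIAMOND": 0.00001,
}

# Assumption: the table only says "flat fee"; ~$0.005 is inferred from the
# example costs and may not match the actual run-start charge.
ASSUMED_START_FEE_USD = 0.005


def estimate_cost(results, tier="BRONZE", start_fee=ASSUMED_START_FEE_USD):
    """Estimate the cost of one run in USD: start fee plus per-result charges."""
    return start_fee + results * PER_RESULT_USD[tier]


for count in (100, 1000, 6000):
    print(count, f"${estimate_cost(count):.4f}")
```

Under that assumed start fee, 1,000 results at BRONZE pricing come to about $0.034, which matches the example above.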

How to use it

Step 1 - Choose your resource type

Select whether you want Datasets, Tasks, or Flows from the "What to scrape" section.

Step 2 - Filter (optional)

For datasets, enter a name filter in Search by name (e.g., iris, mnist, breast cancer) and set the Dataset status filter to active.

Step 3 - Set a result limit

Set Max results to control how many items to return. Start small (20-50) to preview the output before running a large batch.

Step 4 - Run and export

Click Save & Run. Results appear in the Dataset tab. Export to JSON, CSV, or Excel from the Export button.

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| resourceType | string | datasets | What to scrape: datasets, tasks, or flows |
| searchQuery | string | (empty) | Filter by name (datasets: API-side; flows: client-side) |
| status | string | active | Dataset status: active, deactivated, in_preparation, any |
| maxResults | integer | 100 | Maximum results to return (1-10,000) |
| maxRequestRetries | integer | 3 | Retry attempts per failed request |
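Before triggering a run from your own code, it can help to check the input against the defaults and ranges listed above. The following is a client-side sketch only; `normalize_input` is a hypothetical helper, and the actor's own validation may behave differently.

```python
# Defaults and allowed values as listed in the input parameter table.
DEFAULTS = {
    "resourceType": "datasets",
    "searchQuery": "",
    "status": "active",
    "maxResults": 100,
    "maxRequestRetries": 3,
}
RESOURCE_TYPES = {"datasets", "tasks", "flows"}
STATUSES = {"active", "deactivated", "in_preparation", "any"}


def normalize_input(raw):
    """Fill in defaults and reject values outside the documented ranges."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}
    if cfg["resourceType"] not in RESOURCE_TYPES:
        raise ValueError(f"invalid resourceType: {cfg['resourceType']!r}")
    if cfg["status"] not in STATUSES:
        raise ValueError(f"invalid status: {cfg['status']!r}")
    if not isinstance(cfg["maxResults"], int) or not 1 <= cfg["maxResults"] <= 10_000:
        raise ValueError("maxResults must be an integer between 1 and 10,000")
    if not isinstance(cfg["maxRequestRetries"], int) or cfg["maxRequestRetries"] < 0:
        raise ValueError("maxRequestRetries must be a non-negative integer")
    return cfg


print(normalize_input({"searchQuery": "iris", "maxResults": 20}))
```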

Output example

{
  "resourceType": "dataset",
  "id": 61,
  "name": "iris",
  "version": 1,
  "status": "active",
  "format": "ARFF",
  "url": "https://www.openml.org/d/61",
  "downloadUrl": "https://openml.org/data/v1/download/61/iris.arff",
  "numberOfFeatures": 5,
  "numberOfInstances": 150,
  "numberOfClasses": 3,
  "numberOfMissingValues": 0,
  "uploadDate": "2014-04-06T23:23:39",
  "description": "Fisher's Iris Plants Database...",
  "licence": "Public",
  "defaultTargetAttribute": "class",
  "tags": ["Botany", "Machine Learning", "uci"]
}
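Most fields in the example above are scalars, but `tags` is an array, so a CSV export you write yourself needs to decide how to serialize it. The sketch below joins tags with a semicolon; that separator and the helper name `items_to_csv` are assumptions, not how the Apify export behaves.

```python
import csv
import io


def items_to_csv(items, columns):
    """Serialize result items to CSV text, joining the tags list with ';'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        row = dict(item)  # copy so the caller's item is not modified
        if isinstance(row.get("tags"), list):
            row["tags"] = ";".join(row["tags"])
        writer.writerow(row)
    return buffer.getvalue()


sample = [{
    "id": 61,
    "name": "iris",
    "numberOfInstances": 150,
    "tags": ["Botany", "Machine Learning", "uci"],
}]
print(items_to_csv(sample, ["id", "name", "numberOfInstances", "tags"]))
```

Fields not listed in `columns` are ignored, so the same function works for a narrow summary sheet or a full export.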

Tips for best results

  • 🔍 Name search is exact-prefix for datasets - searching for iris returns iris, iris-2, etc. Use short, common dataset names.
  • ⚙️ Flow search is substring match - searching for sklearn matches any flow whose name contains sklearn. It scans all flows (up to 20,000), which takes ~30-60 seconds.
  • 📊 Use status: any to include deactivated and in-preparation datasets in your catalog.
  • ⚡ Set maxResults to 100 for quick previews. For full catalogs, set it to 10,000 and expect 2-5 minutes of runtime.
  • 🔄 Tasks don't support name filtering - all tasks are returned in order of task ID. Filter by task type in your downstream pipeline.
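The difference between the two matching modes, and the downstream task filter, can be illustrated as follows. This is a sketch rather than the actor's code: the function names are hypothetical, and case-insensitive comparison is an assumption that may not match how OpenML or the actor compares names.

```python
def dataset_name_matches(name, query):
    """Prefix match, as described for dataset name search."""
    return name.lower().startswith(query.lower())


def flow_name_matches(name, query):
    """Substring match, as described for flow search."""
    return query.lower() in name.lower()


def filter_tasks_by_type(tasks, task_type):
    """Keep only tasks whose taskType field equals task_type."""
    return [task for task in tasks if task.get("taskType") == task_type]


tasks = [
    {"id": 1, "taskType": "Supervised Classification"},
    {"id": 2, "taskType": "Supervised Regression"},
]
print(dataset_name_matches("iris-2", "iris"))
print(flow_name_matches("sklearn.ensemble.forest.RandomForestClassifier", "forest"))
print(filter_tasks_by_type(tasks, "Supervised Regression"))
```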

Integrations

🔗 Export to Google Sheets

Use the Google Sheets integration to automatically push extracted datasets to a spreadsheet for collaborative review or ML experiment planning.

📊 Connect to Power BI or Tableau

Export the dataset as CSV from the Apify console and import it into your BI tool to build dashboards comparing dataset sizes, feature counts, and class distributions.

🤖 AutoML pipeline seeding

Run this actor on a schedule to keep a local database of OpenML datasets fresh. Use the dataset list to auto-select benchmark datasets for your AutoML framework's evaluation suite.

🔔 Monitor new datasets via webhook

Configure an Apify webhook to trigger your downstream pipeline whenever new datasets matching your filter are found. Useful for ML research groups that want to stay current with new public benchmarks.
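One way a downstream pipeline could decide which datasets are "new" is to compare the IDs in the latest run against IDs it has already seen. The sketch below shows only that comparison; `find_new_items` is a hypothetical helper, and how you persist the seen IDs between runs is left to your pipeline.

```python
def find_new_items(previous_ids, current_items):
    """Return items from the current run whose id was not seen before."""
    seen = set(previous_ids)
    return [item for item in current_items if item["id"] not in seen]


previous_ids = [61, 554]
current_items = [
    {"id": 61, "name": "iris"},
    {"id": 554, "name": "mnist_784"},
    {"id": 99999, "name": "example-new-dataset"},  # made-up entry for illustration
]
for item in find_new_items(previous_ids, current_items):
    print("new dataset:", item["id"], item["name"])
```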

API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('automation-lab/openml-scraper').call({
    resourceType: 'datasets',
    searchQuery: 'mnist',
    status: 'active',
    maxResults: 50,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

client = ApifyClient(token="YOUR_API_TOKEN")
run = client.actor("automation-lab/openml-scraper").call(run_input={
    "resourceType": "datasets",
    "searchQuery": "mnist",
    "status": "active",
    "maxResults": 50,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~openml-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "resourceType": "datasets",
    "searchQuery": "iris",
    "status": "active",
    "maxResults": 10
  }'

Use with Claude and MCP (AI agent access)

This actor is available as an MCP (Model Context Protocol) tool, letting AI agents like Claude query OpenML datasets directly in conversation.

Claude Code (terminal)

$ claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/openml-scraper"

Claude Desktop / Cursor / VS Code

Add to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/openml-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_API_TOKEN"
      }
    }
  }
}

Example prompts for Claude:

  • "Find all active OpenML datasets with 'breast cancer' in the name"
  • "Get 100 OpenML benchmark datasets with at least 1000 instances"
  • "List the first 20 supervised classification tasks on OpenML"
  • "Find all scikit-learn algorithm flows on OpenML"

Legality and terms of service

OpenML data is publicly available under the OpenML terms of service. The datasets themselves are shared under various open licences (Public Domain, CC BY, etc.), which are included in the licence field. This actor only accesses the public REST API using documented endpoints and does not scrape any HTML content. Commercial use of the data depends on individual dataset licences.

FAQ

Q: Why does the flow search take a long time? A: OpenML's API doesn't support server-side name filtering for flows. The actor paginates through all flows and filters client-side. With 20,000+ flows, this can take 30-120 seconds. For fast results on flows, set maxResults to 50-100 and omit the searchQuery to get the latest flows by ID.

Q: The actor returned fewer results than my maxResults - why? A: OpenML may not have that many resources matching your filter. For example, searching for iris as a dataset name returns ~5 datasets (multiple versions). This is expected behavior.

Q: How do I get the actual dataset file (ARFF/CSV)? A: Each result includes a downloadUrl field with the direct ARFF download link. You can use this in your ML framework (e.g., arff.load() in Python), or use the dataset with the OpenML Python client.

Q: Can I filter datasets by minimum number of instances or features? A: Not directly via the actor input. Run the actor with no filter to get all datasets, then filter in your downstream pipeline using the numberOfInstances and numberOfFeatures fields.
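A downstream filter of this kind might look like the sketch below. It uses the numberOfInstances, numberOfFeatures, and numberOfClasses fields from the output; the helper name `meets_constraints` is hypothetical, and treating a missing value as "does not meet the minimum" is a design assumption.

```python
def meets_constraints(item, min_instances=0, min_features=0, min_classes=0):
    """Return True if the item satisfies every non-zero minimum."""
    def at_least(field, minimum):
        if minimum <= 0:
            return True  # constraint not requested
        value = item.get(field)
        # Assumption: items without the field fail a requested constraint.
        return value is not None and value >= minimum

    return (
        at_least("numberOfInstances", min_instances)
        and at_least("numberOfFeatures", min_features)
        and at_least("numberOfClasses", min_classes)
    )


items = [
    {"id": 61, "name": "iris", "numberOfInstances": 150, "numberOfFeatures": 5},
    {"id": 2, "name": "no-metrics"},
]
selected = [item for item in items if meets_constraints(item, min_instances=100)]
print([item["name"] for item in selected])
```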

Q: The description field is truncated - can I get the full description? A: The description is truncated at 500 characters to keep dataset sizes manageable. OpenML descriptions can be several kilobytes of text. If you need full descriptions, use the id field to call https://www.openml.org/api/v1/json/data/{id} directly.
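Fetching a full description with the standard library could look like the sketch below. The URL pattern is the one given in the answer above; the response shape (a `data_set_description` object containing a `description` string) is an assumption about OpenML's JSON output, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def description_url(dataset_id):
    """Build the dataset detail URL from the id field."""
    return f"https://www.openml.org/api/v1/json/data/{dataset_id}"


def extract_description(payload):
    """Pull the description out of a parsed response.

    Assumed shape: {"data_set_description": {"description": "..."}}.
    Returns an empty string if the expected keys are missing.
    """
    return payload.get("data_set_description", {}).get("description", "")


def fetch_full_description(dataset_id, timeout=30):
    """Download and return the untruncated description (network call)."""
    with urlopen(description_url(dataset_id), timeout=timeout) as response:
        return extract_description(json.load(response))


if __name__ == "__main__":
    print(fetch_full_description(61)[:200])
```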
