
URL: https://apify.com/openclawmara/kaggle-dataset-scraper

Kaggle Scraper: Datasets & Competitions · Apify



Kaggle Dataset Scraper: Search, Metadata & Trending

Pricing

$5.00 / 1,000 datasets scraped



Scrape the Kaggle datasets marketplace. Modes: search by keyword/tag, dataset details (owner, license, file list, size, votes, downloads), trending, and user profiles. Extracts titles, descriptions, updated dates, and usability scores. Ideal for ML dataset discovery and competitive landscape research.


Rating: 0.0 (0)

Developer: OpenClaw Mara (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 3
  • Monthly active users: 0
  • Last modified: a month ago


🏆 Kaggle Dataset Scraper: Searchable ML Dataset Registry

Find ML datasets by keyword, license, file type, and download count across 400K+ Kaggle datasets. $0.005 per dataset.

Scrape Kaggle, the world's largest public dataset marketplace, for titles, descriptions, licenses, file formats, sizes, download/vote counts, and owner info. Perfect for ML dataset discovery, competitive analysis of data trends, and citation tracking.

🚀 What does this Actor do?

Kaggle hosts 400K+ public datasets, but the search UI caps results and doesn't expose structured metadata. This Actor gives you the data behind the data:

  • Search: multi-keyword search with filters (sort order, minimum downloads, file type, license).
  • Structured metadata: owner, title, URL, description, license, file list, sizes, tags.
  • Popularity signals: downloads, votes, views, usability score.
  • No scraping headaches: no CAPTCHAs, no session cookies, no JavaScript rendering.

Use it to build a dataset recommender, monitor trending data in a niche, audit license compliance across an ML pipeline, or feed a research paper's "related datasets" section.

💡 Use Cases

1. ML dataset discovery for RAG / fine-tuning

Pull datasets matching a theme, filter by license (CC0, MIT), and ingest the ones you can legally use.

{
    "searchQueries": ["customer support conversations", "product reviews", "instruction tuning"],
    "maxResults": 50,
    "sortBy": "votes",
    "licenseFilter": "CC0"
}

2. ML trend monitoring

Track what's hot in a niche (e.g. computer vision, NLP) and send a daily snapshot to a dashboard.

{
    "searchQueries": ["image classification", "object detection", "semantic segmentation"],
    "maxResults": 30,
    "sortBy": "hottest",
    "minDownloads": 100
}
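
A daily snapshot turns into a trend signal once two runs are compared. The following is a minimal sketch, assuming each snapshot is a list of records shaped like the Output Example further down; the helper name `download_growth` is hypothetical and not part of the Actor.

```python
from typing import Dict, List, Tuple


def download_growth(previous: List[dict], current: List[dict]) -> List[Tuple[str, int]]:
    """Return (ref, download delta) pairs, largest growth first.

    Datasets missing from the previous snapshot are treated as new,
    so their full download count is reported as growth.
    """
    before: Dict[str, int] = {rec["ref"]: rec.get("downloads", 0) for rec in previous}
    deltas = [
        (rec["ref"], rec.get("downloads", 0) - before.get(rec["ref"], 0))
        for rec in current
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Illustrative data only; these refs are made up.
yesterday = [{"ref": "a/cats", "downloads": 100}, {"ref": "b/dogs", "downloads": 500}]
today = [
    {"ref": "a/cats", "downloads": 180},
    {"ref": "b/dogs", "downloads": 510},
    {"ref": "c/birds", "downloads": 40},
]
print(download_growth(yesterday, today))
```

Keying on `ref` rather than `title` keeps the comparison stable if an owner renames a dataset's display title.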

3. Competitive / academic analysis

Map what data exists around a research topic, which is useful for literature reviews or for building a "state of the field" snapshot.

{
    "searchQueries": ["large language model", "RLHF", "instruction following"],
    "maxResults": 100,
    "sortBy": "published"
}

4. Dataset recommender / portal

Build a domain-specific data portal by pulling all datasets of a given file type.

{
    "searchQueries": ["finance", "stock market", "crypto"],
    "maxResults": 200,
    "fileType": "csv",
    "minDownloads": 500
}
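
Any of the inputs above could be submitted over the Apify REST API. The sketch below uses only the Python standard library and builds a request for the `run-sync-get-dataset-items` endpoint. The Actor ID `openclawmara~kaggle-dataset-scraper` is an assumption derived from the Store URL, and `YOUR_APIFY_TOKEN` is a placeholder; check both against the Apify console before relying on them.

```python
import json
import urllib.request

# Assumption: the "/" in the Store URL path becomes "~" in API paths.
ACTOR_ID = "openclawmara~kaggle-dataset-scraper"
API_BASE = "https://api.apify.com/v2"


def build_run_request(run_input: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a synchronous run request for the Actor."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(run_input: dict, token: str) -> list:
    """Send the request and return the dataset items as a list of dicts."""
    with urllib.request.urlopen(build_run_request(run_input, token)) as response:
        return json.load(response)


request = build_run_request(
    {"searchQueries": ["finance"], "maxResults": 200, "fileType": "csv", "minDownloads": 500},
    token="YOUR_APIFY_TOKEN",
)
print(request.get_method(), request.full_url)
```

Only `build_run_request` is exercised here, so nothing is sent or billed. For large sweeps, the official Apify client libraries, which wait for the run to finish, may be a better fit than a single synchronous HTTP call.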

📊 Output Example

{
    "title": "IMDB Dataset of 50K Movie Reviews",
    "url": "https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "owner": "Lakshmipathi N",
    "description": "IMDB dataset having 50K movie reviews for natural language processing or Text analytics.",
    "license": "Other (specified in description)",
    "size": "26 MB",
    "fileCount": 1,
    "fileTypes": ["csv"],
    "downloads": 842510,
    "votes": 6204,
    "views": 1540220,
    "usability": 9.12,
    "createdAt": "2019-03-09T00:00:00Z",
    "lastUpdated": "2024-11-22T00:00:00Z",
    "tags": ["movies and tv shows", "nlp", "text data", "binary classification"]
}

⚙️ Input Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| searchQueries | array | Keywords; one dataset list per query (e.g. ["machine learning", "image classification", "NLP"]) |
| maxResults | int | Results per query (default 20, max 200) |
| sortBy | enum | relevance (default), hottest, votes, updated, active, published |
| minDownloads | int | Filter: minimum total downloads (default 0) |
| fileType | string | Filter by file extension (csv, json, sqlite, parquet, ...). Empty = all. |
| licenseFilter | string | Match on license name (CC0, MIT, GPL, Apache). Empty = all. |
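
The constraints listed for the input parameters can be checked client-side before a run is started. This is a sketch under assumptions: the function `build_input` is hypothetical, the lower bound of 1 for `maxResults` is a guess (only the default and maximum are documented), and the Actor may apply its own validation.

```python
from typing import List

SORT_MODES = ("relevance", "hottest", "votes", "updated", "active", "published")
MAX_RESULTS_LIMIT = 200  # documented upper bound per query


def build_input(
    search_queries: List[str],
    max_results: int = 20,
    sort_by: str = "relevance",
    min_downloads: int = 0,
    file_type: str = "",
    license_filter: str = "",
) -> dict:
    """Return an input dict, raising ValueError on out-of-range values."""
    if not search_queries or not all(q.strip() for q in search_queries):
        raise ValueError("searchQueries needs at least one non-empty keyword")
    if not 1 <= max_results <= MAX_RESULTS_LIMIT:  # lower bound is an assumption
        raise ValueError(f"maxResults must be between 1 and {MAX_RESULTS_LIMIT}")
    if sort_by not in SORT_MODES:
        raise ValueError(f"sortBy must be one of: {', '.join(SORT_MODES)}")
    if min_downloads < 0:
        raise ValueError("minDownloads cannot be negative")
    return {
        "searchQueries": list(search_queries),
        "maxResults": max_results,
        "sortBy": sort_by,
        "minDownloads": min_downloads,
        "fileType": file_type,        # empty string means all file types
        "licenseFilter": license_filter,  # empty string means all licenses
    }


print(build_input(["image classification"], max_results=30, sort_by="hottest"))
```

Failing early on a bad `sortBy` or an oversized `maxResults` avoids starting a run that would be rejected or silently clamped.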

📤 Output Fields

| Field | Description |
| --- | --- |
| title, description | Dataset name and author-written description |
| url | Full Kaggle URL |
| ref | Kaggle reference ID (owner/slug) |
| owner | Dataset uploader |
| license | License name (filter-compatible) |
| size, fileCount, fileTypes[] | Download size, number of files, formats |
| downloads, votes, views | Popularity metrics |
| usability | Kaggle's usability score (0–10) |
| createdAt, lastUpdated | ISO timestamps |
| tags[] | Kaggle topic tags |
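
Two of these fields arrive as strings that usually need converting before sorting or filtering: `size` (for example "26 MB") and the ISO timestamps. The sketch below shows one way to normalize them. It assumes decimal units (1 MB = 1,000,000 bytes), which may not match how Kaggle rounds sizes, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: decimal units; the byte count is therefore approximate.
_UNIT_BYTES = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> int:
    """Convert a size string such as "26 MB" into an approximate byte count."""
    value, unit = text.split()
    return int(float(value) * _UNIT_BYTES[unit.upper()])


def parse_timestamp(text: str) -> datetime:
    """Parse an ISO timestamp; the trailing "Z" is rewritten for older Pythons."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def normalize(record: dict) -> dict:
    """Return a copy with a byte count added and both timestamps parsed."""
    return {
        **record,
        "sizeBytes": parse_size(record["size"]),
        "createdAt": parse_timestamp(record["createdAt"]),
        "lastUpdated": parse_timestamp(record["lastUpdated"]),
    }


row = normalize({
    "size": "26 MB",
    "createdAt": "2019-03-09T00:00:00Z",
    "lastUpdated": "2024-11-22T00:00:00Z",
})
age = datetime.now(timezone.utc) - row["lastUpdated"]
print(row["sizeBytes"], "bytes; last updated", age.days, "days ago")
```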

💰 Pricing & Performance

  • Pay-per-event: $0.005 per dataset.
  • Typical cost: ~$5 for a 1,000-dataset niche sweep.
  • Speed: ~30–60 datasets/minute with polite pacing.
  • No auth required: public search endpoints only.
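
The figures above translate directly into a budget and runtime estimate. A small sketch of that arithmetic follows; the function name `estimate` is hypothetical, and the rates are the approximate ones quoted in this section rather than guarantees.

```python
PRICE_PER_1000_USD = 5          # $5.00 per 1,000 datasets, i.e. $0.005 each
MIN_RATE, MAX_RATE = 30, 60     # quoted throughput range, datasets per minute


def estimate(datasets: int) -> dict:
    """Estimate cost in USD and the runtime range in minutes for a sweep."""
    return {
        "cost_usd": datasets * PRICE_PER_1000_USD / 1000,
        "fastest_minutes": datasets / MAX_RATE,
        "slowest_minutes": datasets / MIN_RATE,
    }


sweep = estimate(1000)
print(
    f"${sweep['cost_usd']:.2f}, "
    f"{sweep['fastest_minutes']:.0f} to {sweep['slowest_minutes']:.0f} minutes"
)
```

For a 1,000-dataset sweep this works out to $5.00 and roughly 17 to 33 minutes at the quoted pace.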

🔌 Integrations

  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector): embed titles + descriptions for semantic dataset search.
  • Airbyte / Fivetran: structured JSON → warehouse for ML ops dashboards.
  • LangChain / LlamaIndex: feed into a "what datasets exist for my problem" retrieval tool.
  • Zapier / n8n / Make: weekly "new datasets in my niche" digest to Slack or Notion.
  • Neo4j / graph DBs: tag → dataset → owner graph for discovery.
  • MLflow / W&B: annotate experiments with Kaggle source metadata.
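
As an illustration of the graph idea in the list above, the records can be flattened into tag → dataset → owner edges before loading them into any graph store. This is a database-agnostic sketch: the relationship names `TAGGED` and `OWNED_BY` and the function `graph_edges` are made up for the example.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source node, relationship, target node)


def graph_edges(records: Iterable[dict]) -> List[Edge]:
    """Build de-duplicated tag -> dataset and dataset -> owner edges."""
    edges = set()
    for rec in records:
        edges.add((rec["ref"], "OWNED_BY", rec["owner"]))
        for tag in rec.get("tags", []):
            edges.add((tag, "TAGGED", rec["ref"]))
    return sorted(edges)


sample = [{
    "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "owner": "Lakshmipathi N",
    "tags": ["nlp", "text data"],
}]
for edge in graph_edges(sample):
    print(edge)
```

Using `ref` as the dataset node key keeps nodes unique even when two datasets share a title.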

🏷️ Popular Sorts

  • hottest: trending right now
  • votes: most upvoted
  • updated: recently refreshed (for live datasets)
  • published: newly released (for trend monitoring)

❓ FAQ

Does this download the actual dataset files? No. This Actor returns structured metadata (title, description, URL, license, sizes, counts). Use the returned url + Kaggle API / CLI to pull files.
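
For the follow-up download step, the `ref` field (owner/slug) is the identifier the Kaggle CLI expects. The sketch below only builds the command; it assumes the official `kaggle` CLI is installed and authenticated, and that its `datasets download` subcommand accepts the `-p` and `--unzip` flags as in current releases (confirm with `kaggle datasets download -h`). The helper name is hypothetical.

```python
import shlex
from typing import List


def kaggle_download_command(ref: str, target_dir: str = "data") -> List[str]:
    """Build the CLI argument list for downloading one dataset by its ref."""
    owner, _, slug = ref.partition("/")
    if not owner or not slug or "/" in slug:
        raise ValueError(f"expected 'owner/slug', got {ref!r}")
    return ["kaggle", "datasets", "download", ref, "-p", target_dir, "--unzip"]


command = kaggle_download_command("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems if a target directory contains spaces.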

Why metadata-only? Kaggle requires auth and rate-limits large file downloads. Metadata search is faster, cheaper, and what you actually want for discovery and filtering.

Can I filter by license? Yes. licenseFilter does a substring match on license name (e.g. CC0, MIT, Apache). Good enough for most compliance workflows.
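
To see what a substring match implies, the same check can be reproduced on returned records. This sketch assumes the comparison is case-insensitive, which the description above does not state; `matches_license` is a hypothetical helper, not the Actor's code.

```python
def matches_license(record: dict, license_filter: str) -> bool:
    """Return True when the filter is empty or occurs in the license name."""
    if not license_filter:
        return True  # empty filter means all licenses
    license_name = record.get("license") or ""
    return license_filter.lower() in license_name.lower()  # case-insensitivity is assumed


records = [
    {"ref": "a/one", "license": "CC0: Public Domain"},
    {"ref": "b/two", "license": "LGPL 3.0"},
    {"ref": "c/three", "license": "Other (specified in description)"},
]
print([r["ref"] for r in records if matches_license(r, "GPL")])
```

Note that "GPL" also matches "LGPL 3.0" here, so a strict compliance workflow may want to compare the returned license field against an explicit allow-list as a second step.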

How fresh is the data? Each run hits Kaggle's live search. You're always getting the latest counts and metadata.

Are private datasets supported? No, public only. Private datasets require a Kaggle auth token, which this Actor doesn't use.

Can I search for competitions instead of datasets? Not in this Actor; it's datasets-only. Competitions are a separate endpoint (possible future Actor).

🔗 Companions

🔑 Keywords

Kaggle scraper, Kaggle dataset scraper, Kaggle API, Kaggle metadata, ML dataset discovery, dataset recommender, Kaggle search, Kaggle filter, Kaggle trending datasets, ML dataset registry, dataset license filter, CSV dataset search, CC0 datasets, MIT license datasets, ML data portal, Kaggle bulk metadata, dataset competitive analysis, training data discovery.

📝 Changelog

  • v1.0: Initial release. Keyword search with license/file-type/download filters, 6 sort modes, full metadata per dataset.
