Dataset Preview

_id string	id string	author string	cardData string	disabled bool	gated string	lastModified timestamp[ns]	likes int64	trendingScore float64	private bool	sha string	description string	downloads int64	downloadsAllTime int64	tags string	createdAt timestamp[ns]	paperswithcode_id string	citation string	embedding list
69524c8ad001e56220ced9bc	Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b	Alibaba-Apsara	{"license": "cc-by-4.0", "task_categories": ["text-generation"], "language": ["en"], "tags": ["code", "math", "scientific-qa", "instruction-following", "reasoning", "thinking", "gpt-oss-120b", "distill"], "size_categories": ["435K"], "configs": [{"config_name": "stage1", "data_files": "Superior-Reasoning-SFT-gpt-oss-12...	false	False	2026-01-15T06:39:55	236	115	false	e9d54e2a3f376fd5c62cafd3c4c99b304cdda698	Superior-Reasoning-SFT-gpt-oss-120b 🚀 Overview The Superior-Reasoning-SFT-gpt-oss-120b dataset is a high-quality, open-source collection containing 435K samples designed to democratize the training of high-performance Long Chain-of-Thought (Long-CoT) models. Unlike standard dis...	12,983	12,983	['task_categories:text-generation' 'language:en' 'license:cc-by-4.0' 'size_categories:100K<n<1M' 'format:json' 'modality:text' 'library:datasets' 'library:pandas' 'library:polars' 'library:mlcroissant' 'arxiv:2601.09088' 'arxiv:2512.20908' 'region:us' 'code' 'math' 'scientific-qa' 'instruction-following' 'reasoning...	2025-12-29T09:40:26	null	null	[ 0.3089316785335541, 0.5958349704742432, 0.5629348754882812, 0.25809532403945923, 0.7968100309371948, 0.4372105300426483, 0.7240143418312073, 0.4595176875591278, 0.9374338984489441, 0.5523331165313721, 0.7086126804351807, 0.2317904531955719, 0.33883732557296753, 0.7165320515632629, 0.1670...
69676b65aeecdadc87f8da8e	facebook/action100m-preview	facebook	{"license": "fair-noncommercial-research-license", "language": ["en"], "tags": ["video", "action"], "size_categories": ["10M<n<100M"]}	false	False	2026-01-14T14:24:13	100	91	false	c9404b5c9772d6883a2f062945273f171b585275	Action100M: A Large-scale Video Action Dataset Our data can be loaded from the 🤗 huggingface repo at facebook/action100m-preview where we released 10% of the full Action100M for preview. For examples of loading from local parquet files (from cloned repo) and visualization, see our GitHub repo. from datasets...	2,908	2,908	['language:en' 'license:fair-noncommercial-research-license' 'size_categories:100K<n<1M' 'format:parquet' 'modality:text' 'modality:video' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'region:us' 'video' 'action']	2026-01-14T10:09:41	null	null	[ 0.42302507162094116, 0.835867702960968, 0.5068385601043701, 0.1327640563249588, 0.986899197101593, 0.8992253541946411, 0.3560081720352173, 0.32750505208969116, 0.8084222674369812, 0.19109703600406647, 0.34933358430862427, 0.5686920881271362, 0.519487202167511, 0.6172237992286682, 0.33704...
696b2406e6c69ff4f49745f4	sojuL/RubricHub_v1	sojuL	{"license": "apache-2.0", "language": ["zh", "en"], "tags": ["medical", "science", "wirting", "isntruction", "chat", "general"], "pretty_name": "RubricHub", "size_categories": ["100K<n<1M"], "task_categories": ["text-generation", "reinforcement-learning", "question-answering"]}	false	False	2026-01-20T07:16:51	81	81	false	bec50742963ed3672391fecbcc4b60067b9fa8bc	RubricHub_v1 RubricHub is a large-scale (approximately 110K), multi-domain dataset that provides high-quality rubric-based supervision for open-ended generation tasks. It is constructed via an automated coarse-to-fine rubric generation framework, which integrates principle-guided synthesis, multi-model aggre...	390	390	['task_categories:text-generation' 'task_categories:reinforcement-learning' 'task_categories:question-answering' 'language:zh' 'language:en' 'license:apache-2.0' 'size_categories:100K<n<1M' 'format:parquet' 'modality:text' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'arxiv:2601.08430' ...	2026-01-17T05:54:14	null	null	[ 0.49041885137557983, 0.8460575342178345, 0.9790360331535339, 0.08758077770471573, 0.5944743752479553, 0.8000954389572144, 0.3159536123275757, 0.8329492211341858, 0.33446627855300903, 0.8034713268280029, 0.38892868161201477, 0.330281525850296, 0.33613070845603943, 0.1632225066423416, 0.23...
6969078587ce326016ddda46	lightonai/LightOnOCR-mix-0126	lightonai	{"dataset_info": {"features": [{"name": "key", "dtype": "string"}, {"name": "page_idx", "dtype": "int64"}, {"name": "content", "dtype": "string"}, {"name": "metadata", "struct": [{"name": "element_counts", "struct": [{"name": "formulas", "dtype": "int64"}, {"name": "images", "dtype": "int64"}, {"name": "tables", "dtype...	false	False	2026-01-23T08:39:35	60	60	false	09e11af7f0aacde1553b4d164049831e5bb7adb7	LightOnOCR-mix-0126 LightOnOCR-mix-0126 is a large-scale OCR training dataset built via distillation: a strong vision–language model is prompted to produce naturally ordered full-page transcriptions (Markdown with LaTeX math spans and HTML tables) from rendered document pages. The dataset is designed as supe...	831	831	['task_categories:text-to-image' 'task_categories:object-detection' 'language:en' 'language:fr' 'language:de' 'language:es' 'language:it' 'language:ja' 'language:ru' 'language:pl' 'language:nl' 'language:zh' 'language:pt' 'language:bg' 'language:tr' 'language:ur' 'language:hi' 'language:th' 'language:ar' 'language:...	2026-01-15T15:28:05	null	null	[ 0.823688805103302, 0.8406579494476318, 0.483701229095459, 0.9319063425064087, 0.8324227333068848, 0.09852192550897598, 0.8004595637321472, 0.7389633655548096, 0.8095628619194031, 0.43992146849632263, 0.3524768352508545, 0.11228302121162415, 0.8136829137802124, 0.13404196500778198, 0.6520...
69607cc44b1761f4d0cf0403	MiniMaxAI/OctoCodingBench	MiniMaxAI	{"license": "mit", "task_categories": ["text-generation"], "language": ["en"], "tags": ["code", "agent", "benchmark", "evaluation"], "pretty_name": "OctoCodingBench", "size_categories": ["n<1K"]}	false	False	2026-01-13T13:02:26	245	55	false	1555ecb6650a4448c1f7f714ce82d53f140b3414	OctoCodingBench: Instruction-Following Benchmark for Coding Agents English \| 中文 🌟 Overview OctoCodingBench benchmarks scaffold-aware instruction following in repository-grounded agentic coding. Why OctoCodingBench? Existing benchmarks (SWE-bench, etc.) focus on task completion — wheth...	13,077	13,077	['task_categories:text-generation' 'language:en' 'license:mit' 'size_categories:n<1K' 'format:json' 'modality:text' 'library:datasets' 'library:pandas' 'library:polars' 'library:mlcroissant' 'region:us' 'code' 'agent' 'benchmark' 'evaluation']	2026-01-09T03:57:56	null	null	[ 0.8765692710876465, 0.5790888071060181, 0.9762074947357178, 0.9661572575569153, 0.7798381447792053, 0.6735142469406128, 0.9520696401596069, 0.005745115224272013, 0.19977733492851257, 0.021700406447052956, 0.8745269775390625, 0.7878456115722656, 0.034170836210250854, 0.3435129225254059, 0...
68ba0ffd343a84103b603c45	Pageshift-Entertainment/LongPage	Pageshift-Entertainment	{"pretty_name": "LongPage", "dataset_name": "LongPage", "library_name": "datasets", "language": ["en"], "license": ["cc-by-4.0", "other"], "task_categories": ["text-generation"], "task_ids": ["language-modeling", "text2text-generation"], "size_categories": ["n<1K"], "source_datasets": ["original"], "annotations_creator...	false	False	2026-01-20T14:01:26	102	51	false	27d907b6a9f92682110e68ef91f001b4812698d6	Overview 🚀📚 The first comprehensive dataset for training AI models to write complete novels with sophisticated reasoning. 🧠 Hierarchical Reasoning Architecture — Multi-layered planning traces including character archetypes, story arcs, world rules, and scene breakdowns. A complete cognitive roadmap for l...	1,999	13,284	['task_categories:text-generation' 'task_ids:language-modeling' 'task_ids:text2text-generation' 'annotations_creators:machine-generated' 'language_creators:found' 'multilinguality:monolingual' 'source_datasets:original' 'language:en' 'license:cc-by-4.0' 'license:other' 'size_categories:1K<n<10K' 'format:parquet' '...	2025-09-04T22:17:33	null	null	[ 0.25633519887924194, 0.18907758593559265, 0.17922718822956085, 0.10687944293022156, 0.6752791404724121, 0.10808505117893219, 0.3827035129070282, 0.5174180865287781, 0.44070863723754883, 0.6763702034950256, 0.6158460974693298, 0.3872328996658325, 0.1837841272354126, 0.4385623037815094, 0....
695fb1b373628fa861fe84cf	HuggingFaceFW/finetranslations	HuggingFaceFW	{"license": "odc-by", "task_categories": ["text-generation", "translation"], "pretty_name": "FineTranslations", "size_categories": ["n>1T"], "language": ["abk", "abq", "abs", "acm", "adh", "adi", "ady", "aeb", "afr", "agx", "aii", "aim", "ain", "ajz", "akb", "aln", "als", "alt", "amh", "anp", "aoz", "apc", "apt", "arb"...	false	False	2026-01-09T16:45:58	248	42	false	af3f4ca895450216d4771cdbf3e3b95c5bacaa2a	💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B. We relied on datatrove's ...	38,530	38,530	['task_categories:text-generation' 'task_categories:translation' 'language:abk' 'language:abq' 'language:abs' 'language:acm' 'language:adh' 'language:adi' 'language:ady' 'language:aeb' 'language:afr' 'language:agx' 'language:aii' 'language:aim' 'language:ain' 'language:ajz' 'language:akb' 'language:aln' 'language:...	2026-01-08T13:31:31	null	null	[ 0.5460655689239502, 0.4864030182361603, 0.5000516176223755, 0.9900186657905579, 0.1692168265581131, 0.7447656393051147, 0.32584118843078613, 0.7762579321861267, 0.7680622339248657, 0.4788389205932617, 0.6054204106330872, 0.6475067734718323, 0.5458201766014099, 0.6172909736633301, 0.92108...
695df55a4e351abe5277cca5	UniParser/OmniScience	UniParser	{"license": "cc-by-nc-sa-4.0", "task_categories": ["image-to-text"], "extra_gated_heading": "Request Access to This Dataset", "extra_gated_description": "Please complete the required fields below to request access. Access will be automatically granted upon submission.", "extra_gated_fields": {"Full Name": {"type": "tex...	false	auto	2026-01-22T02:55:43	73	41	false	9c9fdac9ea87b36e3889330463cd4aee2e81ce95	OmniScience: A Large-scale Dataset for Scientific Image Understanding 🚀 2026-01-21: The OmniScience dataset ranked Top 8 on Hugging Face Datasets Trending (Top 1 on Image Caption Filed). 🚀 2026-01-17: The OmniScience dataset surpassed 5,000 downloads within 5 days of its release. 🚀 2026-01-12: Official r...	7,703	7,710	['task_categories:image-to-text' 'license:cc-by-nc-sa-4.0' 'size_categories:1M<n<10M' 'format:parquet' 'format:optimized-parquet' 'modality:image' 'modality:text' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'arxiv:2512.15098' 'region:us']	2026-01-07T05:55:38	null	null	[ 0.7002298831939697, 0.6913377046585083, 0.28462982177734375, 0.07918864488601685, 0.4246640205383301, 0.22101019322872162, 0.31037789583206177, 0.22279313206672668, 0.6234893202781677, 0.33018070459365845, 0.2684820592403412, 0.2061517983675003, 0.5017613172531128, 0.3566916584968567, 0....
68bb43410b54503c335cb3d8	HuggingFaceFW/finepdfs	HuggingFaceFW	"{\"license\": \"odc-by\", \"task_categories\": [\"text-generation\"], \"pretty_name\": \"\\ud83d\\u(...TRUNCATED)	false	False	2026-01-09T10:37:26	790	39	false	89f5411afb089ee310a09df61e7a58a1bf6d081c	"\n\nLiberating 3T of the finest tokens from PDFs\n\n\n\t\n\t\t\n\t\tWhat is this?\n\t\n\nAs we run (...TRUNCATED)	23,756	250,875	"['task_categories:text-generation' 'language:aai' 'language:aak' ...\n 'arxiv:2506.18421' 'arxiv:21(...TRUNCATED)	2025-09-05T20:08:33	null	null	[0.0881318747997284,0.8725446462631226,0.9592418074607849,0.767154335975647,0.7906309962272644,0.931(...TRUNCATED)
69314c12930718bfbd732f22	LEMAS-Project/LEMAS-Dataset-train	LEMAS-Project	"{\"license\": \"cc-by-nc-4.0\", \"language\": [\"it\", \"pt\", \"es\", \"fr\", \"de\", \"vi\", \"id(...TRUNCATED)	false	False	2026-01-09T03:33:49	71	29	false	e8bc66643f59bb55097203529022ff809de69c5d	"\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset is part of LEMAS-Project (lemas-project.github.io/LEMA(...TRUNCATED)	16,603	17,784	"['task_categories:text-to-speech'\n 'task_categories:automatic-speech-recognition' 'language:it'\n (...TRUNCATED)	2025-12-04T08:53:38	null	null	[0.6445622444152832,0.7504190802574158,0.5153056979179382,0.656548261642456,0.1625341773033142,0.197(...TRUNCATED)

End of preview.

Hub Stats (Lance format)

This dataset contains Hugging Face Hub statistics in Lance format, converted from the original cfahlgren1/hub-stats dataset.

Files

models.lance - Statistics for all models on the Hub (~2.5M rows)
datasets.lance - Statistics for all datasets on the Hub
spaces.lance - Statistics for all spaces on the Hub

Usage

import lance

# Load a dataset remotely
ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")

# Convert to pandas
df = ds.to_table().to_pandas()

# Or query with SQL-like filters
table = ds.to_table(filter="downloads > 1000")
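
The `to_pandas()` call above materializes every column, including the `embedding` vectors. A minimal sketch of a narrower read follows, assuming the `columns`, `filter`, and `limit` arguments of `to_table()` behave as in current Lance releases. `scan_options` and `load_slice` are hypothetical helper names, not part of Lance.

```python
URI = "hf://datasets/julien-c/hub-stats-lance/datasets.lance"


def scan_options(columns, min_downloads, limit):
    """Build keyword arguments for LanceDataset.to_table()."""
    return {
        "columns": list(columns),                       # read only these columns
        "filter": f"downloads > {int(min_downloads)}",  # SQL-like predicate
        "limit": int(limit),                            # cap the number of rows returned
    }


def load_slice(uri=URI, columns=("id", "author", "likes", "downloads"),
               min_downloads=1000, limit=100):
    """Read a filtered, column-pruned slice of the dataset into pandas."""
    import lance  # imported lazily so scan_options() works without Lance installed

    ds = lance.dataset(uri)
    return ds.to_table(**scan_options(columns, min_downloads, limit)).to_pandas()
```

Calling `load_slice()` needs network access and the `lance` package. The `limit` caps how many matching rows come back; it does not sort them, so sort the resulting DataFrame if order matters.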

Example: Query datasets by author

import lance

ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")
results = ds.to_table(filter="author = 'microsoft'").to_pandas()

# Sort by downloads
top = results.sort_values("downloads", ascending=False).head(10)
print(top[["id", "likes", "downloads"]])

Output:

id                                      likes  downloads
microsoft/ms_marco                        221      11120
microsoft/orca-math-word-problems-200k    468       6499
microsoft/bing_coronavirus_query_set        0       6002
microsoft/wiki_qa                          69       5737
microsoft/rStar-Coder                     225       3492
microsoft/Updesh_beta                       8       3223
microsoft/Dayhoff                           7       2922
microsoft/meta_woz                          6       2801
microsoft/cats_vs_dogs                     61       1883
microsoft/IMAGE_UNDERSTANDING               6       1833
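
The preview lists `cardData` and `tags` as string columns: `cardData` holds JSON text, and `tags` looks like a printed array of single-quoted items. The sketch below decodes both using only the standard library. It assumes the stored strings have the same shape as the preview rows; the helper names are hypothetical, and the sample values are copied from the MiniMaxAI/OctoCodingBench row.

```python
import json
import re
from collections import defaultdict


def parse_card_data(raw):
    """Decode the JSON text in `cardData`; return {} when it is empty or invalid."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def parse_tags(raw):
    """Extract the single-quoted items from a string like "['a:b' 'c']"."""
    return re.findall(r"'([^']*)'", raw or "")


def group_tags(tags):
    """Split 'key:value' tags into a dict of lists; collect bare tags separately."""
    grouped = defaultdict(list)
    bare = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            bare.append(tag)
    return dict(grouped), bare


sample_card = (
    '{"license": "mit", "task_categories": ["text-generation"], '
    '"language": ["en"], "tags": ["code", "agent", "benchmark", "evaluation"], '
    '"pretty_name": "OctoCodingBench", "size_categories": ["n<1K"]}'
)
sample_tags = (
    "['task_categories:text-generation' 'language:en' 'license:mit' "
    "'size_categories:n<1K' 'format:json' 'modality:text' 'library:datasets' "
    "'library:pandas' 'library:polars' 'library:mlcroissant' 'region:us' "
    "'code' 'agent' 'benchmark' 'evaluation']"
)

card = parse_card_data(sample_card)
grouped, bare = group_tags(parse_tags(sample_tags))
print(card["license"], grouped["library"], bare)
```

With pandas, the same helpers can be applied per row, for example `results["tags"].map(parse_tags)`.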

Example: Vector similarity search

import lance
import numpy as np

ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")

# Get an embedding to use as query (e.g., from microsoft/ms_marco)
query_row = ds.to_table(filter="id = 'microsoft/ms_marco'").to_pandas()
query_embedding = np.array(query_row["embedding"].iloc[0])

# Find 10 nearest neighbors
results = ds.to_table(
    nearest={"column": "embedding", "q": query_embedding, "k": 10}
).to_pandas()

print(results[["id", "likes", "downloads", "_distance"]])

Output:

id                   likes  downloads  _distance
microsoft/ms_marco     221      11120       2.23
jiwonii97/atalk_as3      0          0      10.61
AI-Art-Collab/ae5        0          1      10.85
wgwgwgwgw/dbbdbbd        0          9      10.90
1FDSFS/56803             0          8      10.94
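
To make the `_distance` column concrete, here is a brute-force sketch of a k-nearest-neighbor search in plain Python. It assumes an L2 (Euclidean) metric, which the query above does not state, and it ranks by the squared distance; whether Lance reports the squared or the rooted value is not confirmed here, though the ordering is the same either way. The row ids and vectors are made-up toy data, and this is not how Lance implements the search internally.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(rows, query, k):
    """Return the k (row_id, distance) pairs closest to `query`, closest first.

    `rows` is an iterable of (row_id, vector) pairs.
    """
    scored = ((squared_l2(vector, query), row_id) for row_id, vector in rows)
    return [(row_id, dist) for dist, row_id in heapq.nsmallest(k, scored)]


# Hypothetical toy rows standing in for (id, embedding) pairs.
toy_rows = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 0.0]),
    ("c", [3.0, 4.0]),
]

print(nearest(toy_rows, [0.0, 0.0], k=2))
```

A linear scan like this costs one distance computation per row, which is why vector indexes matter once a table grows to millions of rows.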

Why Lance?

Lance is a modern columnar data format optimized for ML workflows:

Fast random access and filtering
Efficient for large datasets
Native support for vector search
Zero-copy integration with PyArrow/Pandas
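
As an illustration of the random-access point, the sketch below fetches a handful of rows by position instead of scanning the table. It assumes `count_rows()` and `take()` on a Lance dataset work as in current releases; `spaced_indices` and `sample_rows` are hypothetical helper names.

```python
def spaced_indices(total_rows, count):
    """Return up to `count` evenly spaced row indices in [0, total_rows)."""
    if total_rows <= 0 or count <= 0:
        return []
    count = min(count, total_rows)
    step = total_rows // count
    return [i * step for i in range(count)]


def sample_rows(uri, count=5, columns=("id", "downloads")):
    """Fetch `count` evenly spaced rows by index, reading only `columns`."""
    import lance  # imported lazily so spaced_indices() works without Lance installed

    ds = lance.dataset(uri)
    indices = spaced_indices(ds.count_rows(), count)
    return ds.take(indices, columns=list(columns)).to_pandas()
```

For example, `sample_rows("hf://datasets/julien-c/hub-stats-lance/datasets.lance")` would return five rows spread across the file, given network access and the `lance` package.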
