
URL: https://huggingface.co/datasets/julien-c/hub-stats-lance

julien-c/hub-stats-lance · Datasets at Hugging Face


Dataset Preview
Columns:

  • _id - string
  • id - string
  • author - string
  • cardData - string
  • disabled - bool
  • gated - string
  • lastModified - timestamp[ns]
  • likes - int64
  • trendingScore - float64
  • private - bool
  • sha - string
  • description - string
  • downloads - int64
  • downloadsAllTime - int64
  • tags - string
  • createdAt - timestamp[ns]
  • paperswithcode_id - string
  • citation - string
  • embedding - list
Row 1
  _id: 69524c8ad001e56220ced9bc
  id: Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b
  author: Alibaba-Apsara
  cardData: {"license": "cc-by-4.0", "task_categories": ["text-generation"], "language": ["en"], "tags": ["code", "math", "scientific-qa", "instruction-following", "reasoning", "thinking", "gpt-oss-120b", "distill"], "size_categories": ["435K"], "configs": [{"config_name": "stage1", "data_files": "Superior-Reasoning-SFT-gpt-oss-12...
  disabled: false
  gated: False
  lastModified: 2026-01-15T06:39:55
  likes: 236
  trendingScore: 115
  private: false
  sha: e9d54e2a3f376fd5c62cafd3c4c99b304cdda698
  description: Superior-Reasoning-SFT-gpt-oss-120b 🚀 Overview The Superior-Reasoning-SFT-gpt-oss-120b dataset is a high-quality, open-source collection containing 435K samples designed to democratize the training of high-performance Long Chain-of-Thought (Long-CoT) models. Unlike standard dis...
  downloads: 12,983
  downloadsAllTime: 12,983
  tags: ['task_categories:text-generation' 'language:en' 'license:cc-by-4.0' 'size_categories:100K<n<1M' 'format:json' 'modality:text' 'library:datasets' 'library:pandas' 'library:polars' 'library:mlcroissant' 'arxiv:2601.09088' 'arxiv:2512.20908' 'region:us' 'code' 'math' 'scientific-qa' 'instruction-following' 'reasoning...
  createdAt: 2025-12-29T09:40:26
  paperswithcode_id: null
  citation: null
  embedding: [ 0.3089316785335541, 0.5958349704742432, 0.5629348754882812, 0.25809532403945923, 0.7968100309371948, 0.4372105300426483, 0.7240143418312073, 0.4595176875591278, 0.9374338984489441, 0.5523331165313721, 0.7086126804351807, 0.2317904531955719, 0.33883732557296753, 0.7165320515632629, 0.1670...
Row 2
  _id: 69676b65aeecdadc87f8da8e
  id: facebook/action100m-preview
  author: facebook
  cardData: {"license": "fair-noncommercial-research-license", "language": ["en"], "tags": ["video", "action"], "size_categories": ["10M<n<100M"]}
  disabled: false
  gated: False
  lastModified: 2026-01-14T14:24:13
  likes: 100
  trendingScore: 91
  private: false
  sha: c9404b5c9772d6883a2f062945273f171b585275
  description: Action100M: A Large-scale Video Action Dataset Our data can be loaded from the 🤗 huggingface repo at facebook/action100m-preview where we released 10% of the full Action100M for preview. For examples of loading from local parquet files (from cloned repo) and visualization, see our GitHub repo. from datasets...
  downloads: 2,908
  downloadsAllTime: 2,908
  tags: ['language:en' 'license:fair-noncommercial-research-license' 'size_categories:100K<n<1M' 'format:parquet' 'modality:text' 'modality:video' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'region:us' 'video' 'action']
  createdAt: 2026-01-14T10:09:41
  paperswithcode_id: null
  citation: null
  embedding: [ 0.42302507162094116, 0.835867702960968, 0.5068385601043701, 0.1327640563249588, 0.986899197101593, 0.8992253541946411, 0.3560081720352173, 0.32750505208969116, 0.8084222674369812, 0.19109703600406647, 0.34933358430862427, 0.5686920881271362, 0.519487202167511, 0.6172237992286682, 0.33704...
Row 3
  _id: 696b2406e6c69ff4f49745f4
  id: sojuL/RubricHub_v1
  author: sojuL
  cardData: {"license": "apache-2.0", "language": ["zh", "en"], "tags": ["medical", "science", "wirting", "isntruction", "chat", "general"], "pretty_name": "RubricHub", "size_categories": ["100K<n<1M"], "task_categories": ["text-generation", "reinforcement-learning", "question-answering"]}
  disabled: false
  gated: False
  lastModified: 2026-01-20T07:16:51
  likes: 81
  trendingScore: 81
  private: false
  sha: bec50742963ed3672391fecbcc4b60067b9fa8bc
  description: RubricHub_v1 RubricHub is a large-scale (approximately 110K), multi-domain dataset that provides high-quality rubric-based supervision for open-ended generation tasks. It is constructed via an automated coarse-to-fine rubric generation framework, which integrates principle-guided synthesis, multi-model aggre...
  downloads: 390
  downloadsAllTime: 390
  tags: ['task_categories:text-generation' 'task_categories:reinforcement-learning' 'task_categories:question-answering' 'language:zh' 'language:en' 'license:apache-2.0' 'size_categories:100K<n<1M' 'format:parquet' 'modality:text' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'arxiv:2601.08430' ...
  createdAt: 2026-01-17T05:54:14
  paperswithcode_id: null
  citation: null
  embedding: [ 0.49041885137557983, 0.8460575342178345, 0.9790360331535339, 0.08758077770471573, 0.5944743752479553, 0.8000954389572144, 0.3159536123275757, 0.8329492211341858, 0.33446627855300903, 0.8034713268280029, 0.38892868161201477, 0.330281525850296, 0.33613070845603943, 0.1632225066423416, 0.23...
Row 4
  _id: 6969078587ce326016ddda46
  id: lightonai/LightOnOCR-mix-0126
  author: lightonai
  cardData: {"dataset_info": {"features": [{"name": "key", "dtype": "string"}, {"name": "page_idx", "dtype": "int64"}, {"name": "content", "dtype": "string"}, {"name": "metadata", "struct": [{"name": "element_counts", "struct": [{"name": "formulas", "dtype": "int64"}, {"name": "images", "dtype": "int64"}, {"name": "tables", "dtype...
  disabled: false
  gated: False
  lastModified: 2026-01-23T08:39:35
  likes: 60
  trendingScore: 60
  private: false
  sha: 09e11af7f0aacde1553b4d164049831e5bb7adb7
  description: LightOnOCR-mix-0126 LightOnOCR-mix-0126 is a large-scale OCR training dataset built via distillation: a strong vision–language model is prompted to produce naturally ordered full-page transcriptions (Markdown with LaTeX math spans and HTML tables) from rendered document pages. The dataset is designed as supe...
  downloads: 831
  downloadsAllTime: 831
  tags: ['task_categories:text-to-image' 'task_categories:object-detection' 'language:en' 'language:fr' 'language:de' 'language:es' 'language:it' 'language:ja' 'language:ru' 'language:pl' 'language:nl' 'language:zh' 'language:pt' 'language:bg' 'language:tr' 'language:ur' 'language:hi' 'language:th' 'language:ar' 'language:...
  createdAt: 2026-01-15T15:28:05
  paperswithcode_id: null
  citation: null
  embedding: [ 0.823688805103302, 0.8406579494476318, 0.483701229095459, 0.9319063425064087, 0.8324227333068848, 0.09852192550897598, 0.8004595637321472, 0.7389633655548096, 0.8095628619194031, 0.43992146849632263, 0.3524768352508545, 0.11228302121162415, 0.8136829137802124, 0.13404196500778198, 0.6520...
Row 5
  _id: 69607cc44b1761f4d0cf0403
  id: MiniMaxAI/OctoCodingBench
  author: MiniMaxAI
  cardData: {"license": "mit", "task_categories": ["text-generation"], "language": ["en"], "tags": ["code", "agent", "benchmark", "evaluation"], "pretty_name": "OctoCodingBench", "size_categories": ["n<1K"]}
  disabled: false
  gated: False
  lastModified: 2026-01-13T13:02:26
  likes: 245
  trendingScore: 55
  private: false
  sha: 1555ecb6650a4448c1f7f714ce82d53f140b3414
  description: OctoCodingBench: Instruction-Following Benchmark for Coding Agents English | 中文 🌟 Overview OctoCodingBench benchmarks scaffold-aware instruction following in repository-grounded agentic coding. Why OctoCodingBench? Existing benchmarks (SWE-bench, etc.) focus on task completion — wheth...
  downloads: 13,077
  downloadsAllTime: 13,077
  tags: ['task_categories:text-generation' 'language:en' 'license:mit' 'size_categories:n<1K' 'format:json' 'modality:text' 'library:datasets' 'library:pandas' 'library:polars' 'library:mlcroissant' 'region:us' 'code' 'agent' 'benchmark' 'evaluation']
  createdAt: 2026-01-09T03:57:56
  paperswithcode_id: null
  citation: null
  embedding: [ 0.8765692710876465, 0.5790888071060181, 0.9762074947357178, 0.9661572575569153, 0.7798381447792053, 0.6735142469406128, 0.9520696401596069, 0.005745115224272013, 0.19977733492851257, 0.021700406447052956, 0.8745269775390625, 0.7878456115722656, 0.034170836210250854, 0.3435129225254059, 0...
Row 6
  _id: 68ba0ffd343a84103b603c45
  id: Pageshift-Entertainment/LongPage
  author: Pageshift-Entertainment
  cardData: {"pretty_name": "LongPage", "dataset_name": "LongPage", "library_name": "datasets", "language": ["en"], "license": ["cc-by-4.0", "other"], "task_categories": ["text-generation"], "task_ids": ["language-modeling", "text2text-generation"], "size_categories": ["n<1K"], "source_datasets": ["original"], "annotations_creator...
  disabled: false
  gated: False
  lastModified: 2026-01-20T14:01:26
  likes: 102
  trendingScore: 51
  private: false
  sha: 27d907b6a9f92682110e68ef91f001b4812698d6
  description: Overview 🚀📚 The first comprehensive dataset for training AI models to write complete novels with sophisticated reasoning. 🧠 Hierarchical Reasoning Architecture — Multi-layered planning traces including character archetypes, story arcs, world rules, and scene breakdowns. A complete cognitive roadmap for l...
  downloads: 1,999
  downloadsAllTime: 13,284
  tags: ['task_categories:text-generation' 'task_ids:language-modeling' 'task_ids:text2text-generation' 'annotations_creators:machine-generated' 'language_creators:found' 'multilinguality:monolingual' 'source_datasets:original' 'language:en' 'license:cc-by-4.0' 'license:other' 'size_categories:1K<n<10K' 'format:parquet' '...
  createdAt: 2025-09-04T22:17:33
  paperswithcode_id: null
  citation: null
  embedding: [ 0.25633519887924194, 0.18907758593559265, 0.17922718822956085, 0.10687944293022156, 0.6752791404724121, 0.10808505117893219, 0.3827035129070282, 0.5174180865287781, 0.44070863723754883, 0.6763702034950256, 0.6158460974693298, 0.3872328996658325, 0.1837841272354126, 0.4385623037815094, 0....
Row 7
  _id: 695fb1b373628fa861fe84cf
  id: HuggingFaceFW/finetranslations
  author: HuggingFaceFW
  cardData: {"license": "odc-by", "task_categories": ["text-generation", "translation"], "pretty_name": "FineTranslations", "size_categories": ["n>1T"], "language": ["abk", "abq", "abs", "acm", "adh", "adi", "ady", "aeb", "afr", "agx", "aii", "aim", "ain", "ajz", "akb", "aln", "als", "alt", "amh", "anp", "aoz", "apc", "apt", "arb"...
  disabled: false
  gated: False
  lastModified: 2026-01-09T16:45:58
  likes: 248
  trendingScore: 42
  private: false
  sha: af3f4ca895450216d4771cdbf3e3b95c5bacaa2a
  description: 💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B. We relied on datatrove's ...
  downloads: 38,530
  downloadsAllTime: 38,530
  tags: ['task_categories:text-generation' 'task_categories:translation' 'language:abk' 'language:abq' 'language:abs' 'language:acm' 'language:adh' 'language:adi' 'language:ady' 'language:aeb' 'language:afr' 'language:agx' 'language:aii' 'language:aim' 'language:ain' 'language:ajz' 'language:akb' 'language:aln' 'language:...
  createdAt: 2026-01-08T13:31:31
  paperswithcode_id: null
  citation: null
  embedding: [ 0.5460655689239502, 0.4864030182361603, 0.5000516176223755, 0.9900186657905579, 0.1692168265581131, 0.7447656393051147, 0.32584118843078613, 0.7762579321861267, 0.7680622339248657, 0.4788389205932617, 0.6054204106330872, 0.6475067734718323, 0.5458201766014099, 0.6172909736633301, 0.92108...
Row 8
  _id: 695df55a4e351abe5277cca5
  id: UniParser/OmniScience
  author: UniParser
  cardData: {"license": "cc-by-nc-sa-4.0", "task_categories": ["image-to-text"], "extra_gated_heading": "Request Access to This Dataset", "extra_gated_description": "Please complete the required fields below to request access. Access will be automatically granted upon submission.", "extra_gated_fields": {"Full Name": {"type": "tex...
  disabled: false
  gated: auto
  lastModified: 2026-01-22T02:55:43
  likes: 73
  trendingScore: 41
  private: false
  sha: 9c9fdac9ea87b36e3889330463cd4aee2e81ce95
  description: OmniScience: A Large-scale Dataset for Scientific Image Understanding 🚀 2026-01-21: The OmniScience dataset ranked Top 8 on Hugging Face Datasets Trending (Top 1 on Image Caption Filed). 🚀 2026-01-17: The OmniScience dataset surpassed 5,000 downloads within 5 days of its release. 🚀 2026-01-12: Official r...
  downloads: 7,703
  downloadsAllTime: 7,710
  tags: ['task_categories:image-to-text' 'license:cc-by-nc-sa-4.0' 'size_categories:1M<n<10M' 'format:parquet' 'format:optimized-parquet' 'modality:image' 'modality:text' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'arxiv:2512.15098' 'region:us']
  createdAt: 2026-01-07T05:55:38
  paperswithcode_id: null
  citation: null
  embedding: [ 0.7002298831939697, 0.6913377046585083, 0.28462982177734375, 0.07918864488601685, 0.4246640205383301, 0.22101019322872162, 0.31037789583206177, 0.22279313206672668, 0.6234893202781677, 0.33018070459365845, 0.2684820592403412, 0.2061517983675003, 0.5017613172531128, 0.3566916584968567, 0....
Row 9
  _id: 68bb43410b54503c335cb3d8
  id: HuggingFaceFW/finepdfs
  author: HuggingFaceFW
  cardData: "{\"license\": \"odc-by\", \"task_categories\": [\"text-generation\"], \"pretty_name\": \"\\ud83d\\u(...TRUNCATED)
  disabled: false
  gated: False
  lastModified: 2026-01-09T10:37:26
  likes: 790
  trendingScore: 39
  private: false
  sha: 89f5411afb089ee310a09df61e7a58a1bf6d081c
  description: "\n\nLiberating 3T of the finest tokens from PDFs\n\n\n\t\n\t\t\n\t\tWhat is this?\n\t\n\nAs we run (...TRUNCATED)
  downloads: 23,756
  downloadsAllTime: 250,875
  tags: "['task_categories:text-generation' 'language:aai' 'language:aak' ...\n 'arxiv:2506.18421' 'arxiv:21(...TRUNCATED)
  createdAt: 2025-09-05T20:08:33
  paperswithcode_id: null
  citation: null
  embedding: [0.0881318747997284,0.8725446462631226,0.9592418074607849,0.767154335975647,0.7906309962272644,0.931(...TRUNCATED)
Row 10
  _id: 69314c12930718bfbd732f22
  id: LEMAS-Project/LEMAS-Dataset-train
  author: LEMAS-Project
  cardData: "{\"license\": \"cc-by-nc-4.0\", \"language\": [\"it\", \"pt\", \"es\", \"fr\", \"de\", \"vi\", \"id(...TRUNCATED)
  disabled: false
  gated: False
  lastModified: 2026-01-09T03:33:49
  likes: 71
  trendingScore: 29
  private: false
  sha: e8bc66643f59bb55097203529022ff809de69c5d
  description: "\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset is part of LEMAS-Project (lemas-project.github.io/LEMA(...TRUNCATED)
  downloads: 16,603
  downloadsAllTime: 17,784
  tags: "['task_categories:text-to-speech'\n 'task_categories:automatic-speech-recognition' 'language:it'\n (...TRUNCATED)
  createdAt: 2025-12-04T08:53:38
  paperswithcode_id: null
  citation: null
  embedding: [0.6445622444152832,0.7504190802574158,0.5153056979179382,0.656548261642456,0.1625341773033142,0.197(...TRUNCATED)
End of preview.
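
In the preview, cardData holds a JSON document stored as a string, and tags holds a stringified list of single-quoted items separated by spaces. The following is a minimal sketch of decoding both with the standard library, using the complete (untruncated) values from the facebook/action100m-preview row. The helper names parse_card_data, parse_tags and split_tags are hypothetical, and the regular expression assumes that no tag contains an apostrophe.

```python
import json
import re

# Values copied from the facebook/action100m-preview row of the preview.
card_data_raw = '{"license": "fair-noncommercial-research-license", "language": ["en"], "tags": ["video", "action"], "size_categories": ["10M<n<100M"]}'
tags_raw = "['language:en' 'license:fair-noncommercial-research-license' 'size_categories:100K<n<1M' 'format:parquet' 'modality:text' 'modality:video' 'library:datasets' 'library:dask' 'library:polars' 'library:mlcroissant' 'region:us' 'video' 'action']"


def parse_card_data(raw):
    """Decode the cardData JSON string; return {} for a missing value."""
    return json.loads(raw) if raw else {}


def parse_tags(raw):
    """Extract the single-quoted items from the stringified tag list."""
    return re.findall(r"'([^']*)'", raw or "")


def split_tags(tags):
    """Group 'key:value' tags by key; bare tags are stored under the '' key."""
    grouped = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:  # no colon: a free-form tag such as 'video'
            key, value = "", tag
        grouped.setdefault(key, []).append(value)
    return grouped


card = parse_card_data(card_data_raw)
tags = parse_tags(tags_raw)
grouped = split_tags(tags)

print(card["license"])      # license declared in the dataset card
print(grouped["modality"])  # modalities derived by the Hub
print(grouped[""])          # free-form tags
```

Truncated values such as those shown for the other rows will not parse as JSON, so this only applies to the full column values read from the Lance file.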

Hub Stats (Lance format)

This dataset contains Hugging Face Hub statistics in Lance format, converted from the original cfahlgren1/hub-stats dataset.

Files

  • models.lance - Statistics for all models on the Hub (~2.5M rows)
  • datasets.lance - Statistics for all datasets on the Hub
  • spaces.lance - Statistics for all spaces on the Hub

Usage

import lance

# Load a dataset remotely
ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")

# Convert to pandas
df = ds.to_table().to_pandas()

# Or query with SQL-like filters
table = ds.to_table(filter="downloads > 1000")
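
The filter strings above are plain SQL-style expressions, so building them from user input needs some care with quoting. Below is a small sketch, not part of the original card, that assembles a filter from optional conditions and reads only selected columns, which avoids fetching the wide embedding column. The helpers sql_quote, build_filter and load_rows are hypothetical names, and doubling single quotes is assumed to be the escaping rule, as in standard SQL.

```python
def sql_quote(value):
    """Quote a Python string as a SQL string literal (doubling single quotes)."""
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(author=None, min_downloads=None):
    """Combine the optional conditions into one filter expression, or None."""
    clauses = []
    if author is not None:
        clauses.append(f"author = {sql_quote(author)}")
    if min_downloads is not None:
        clauses.append(f"downloads > {int(min_downloads)}")
    return " AND ".join(clauses) if clauses else None


def load_rows(uri, columns, author=None, min_downloads=None):
    """Read only `columns` for the matching rows (requires the lance package)."""
    import lance  # imported lazily so the helpers above work without it

    ds = lance.dataset(uri)
    return ds.to_table(columns=columns, filter=build_filter(author, min_downloads))


print(build_filter(author="microsoft", min_downloads=1000))
```

A call such as load_rows("hf://datasets/julien-c/hub-stats-lance/datasets.lance", ["id", "likes", "downloads"], author="microsoft") would then return an Arrow table with just those three columns.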

Example: Query datasets by author

import lance

ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")
results = ds.to_table(filter="author = 'microsoft'").to_pandas()

# Sort by downloads
top = results.sort_values("downloads", ascending=False).head(10)
print(top[["id", "likes", "downloads"]])

Output:

id                                      likes  downloads
microsoft/ms_marco                        221      11120
microsoft/orca-math-word-problems-200k    468       6499
microsoft/bing_coronavirus_query_set        0       6002
microsoft/wiki_qa                          69       5737
microsoft/rStar-Coder                     225       3492
microsoft/Updesh_beta                       8       3223
microsoft/Dayhoff                           7       2922
microsoft/meta_woz                          6       2801
microsoft/cats_vs_dogs                     61       1883
microsoft/IMAGE_UNDERSTANDING               6       1833

Example: Vector similarity search

import lance
import numpy as np

ds = lance.dataset("hf://datasets/julien-c/hub-stats-lance/datasets.lance")

# Get an embedding to use as query (e.g., from microsoft/ms_marco)
query_row = ds.to_table(filter="id = 'microsoft/ms_marco'").to_pandas()
query_embedding = np.array(query_row["embedding"].iloc[0])

# Find 10 nearest neighbors
results = ds.to_table(
    nearest={"column": "embedding", "q": query_embedding, "k": 10}
).to_pandas()

print(results[["id", "likes", "downloads", "_distance"]])

Output:

id                    likes  downloads  _distance
microsoft/ms_marco      221      11120       2.23
jiwonii97/atalk_as3       0          0      10.61
AI-Art-Collab/ae5         0          1      10.85
wgwgwgwgw/dbbdbbd         0          9      10.90
1FDSFS/56803              0          8      10.94
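
To make the meaning of the nearest query and the _distance column concrete, here is a brute-force sketch of k-nearest-neighbour search in plain Python. It assumes squared Euclidean distance; the metric Lance actually applies to this dataset is not stated in the source, so the numbers will not necessarily match the output above. The function names and the toy rows are made up for illustration. Since the query row itself comes back as the first hit in the output above, the sketch also shows how to leave it out.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_rows(rows, query, k, exclude_id=None):
    """Return the k rows closest to `query`, each with a `_distance` field."""
    scored = (
        {**row, "_distance": squared_l2(row["embedding"], query)}
        for row in rows
        if row["id"] != exclude_id
    )
    # nsmallest keeps only k candidates in memory and returns them sorted.
    return heapq.nsmallest(k, scored, key=lambda r: r["_distance"])


# Toy 2-dimensional rows; the ids and vectors are invented for this example.
toy_rows = [
    {"id": "a", "embedding": [0.0, 0.0]},
    {"id": "b", "embedding": [1.0, 0.0]},
    {"id": "c", "embedding": [3.0, 4.0]},
]

hits = nearest_rows(toy_rows, query=[0.0, 0.0], k=2, exclude_id="a")
print([(h["id"], h["_distance"]) for h in hits])
```

A scan like this touches every row, which is why an approximate index is normally preferred at the scale of the Hub.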

Why Lance?

Lance is a modern columnar data format optimized for ML workflows:

  • Fast random access and filtering
  • Efficient for large datasets
  • Native support for vector search
  • Zero-copy integration with PyArrow/Pandas
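
As an illustration of working with a large file without loading it whole, the sketch below streams a single column in batches and aggregates it incrementally. It is an assumption-laden example rather than something taken from the card: total_downloads and total_downloads_remote are hypothetical helpers, and the batch size is an arbitrary choice.

```python
def total_downloads(batches):
    """Sum the `downloads` column over an iterable of Arrow record batches."""
    total = 0
    for batch in batches:
        values = batch.column("downloads").to_pylist()
        total += sum(v for v in values if v is not None)  # skip nulls
    return total


def total_downloads_remote(uri, batch_size=10_000):
    """Stream only the `downloads` column from a Lance dataset and sum it."""
    import lance  # imported lazily; requires the lance package

    ds = lance.dataset(uri)
    batches = ds.to_batches(columns=["downloads"], batch_size=batch_size)
    return total_downloads(batches)
```

Only one batch of one column is held in memory at a time, so the same function works for datasets.lance and for the much larger models.lance.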

Downloads last month: 140