datasetId large_stringlengths 6 123 | author large_stringlengths 2 42 | last_modified large_stringdate 2021-02-22 10:20:34 2026-06-08 02:08:03 | downloads int64 0 2.77M | likes int64 0 9.73k | tags large listlengths 1 6.16k | task_categories large listlengths 0 0 | createdAt large_stringdate 2022-03-02 23:29:22 2026-06-08 02:06:48 | trending_score float64 0 200 | card large_stringlengths 31 29.7M |
|---|---|---|---|---|---|---|---|---|---|
mzio/aprm-sft_thinkact-Eaprm_tw_treasure_easy_sp-Gnobandit_aprm_qw3_ap-S42-Rmt128_nb_treasure_ea | mzio | 2026-03-10T01:43:14Z | 39 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-03-09T15:27:39Z | 0 | ---
{}
---
# Act-PRM Rollout Dataset
## Run Metadata
- **run_name**: `act-prm-cc-isas=0-reru=0-enco=act_prm_tw_treasure_easy_sp-geco=nobandit_aprm_qwen3_ap-trco=aprm_for_sft100-moco=hf_qwen3_4b_inst_2507-loco=r8_a16_qkvo-acon=1-hiob=1-mato=128-difa=0_9-grsi=8-basi=8-lera=0_001-nusu=1-se=42-re=mt128_nb_treasure_easy`
... |
yashaswinienkefalos/merged_all | yashaswinienkefalos | 2025-12-06T12:35:02Z | 5 | 0 | [
"size_categories:100K<n<1M",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2025-12-06T12:34:46Z | 0 | ---
dataset_info:
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: citation
list: string
- name: link_validation
list:
- name: reason
dtype: string
- name: status
dtype: string
- name: url
dtype: st... |
electricsheepafrica/africa-who-antenatal-care-coverage-at-least-one-visit-sitpercent | electricsheepafrica | 2026-05-01T17:49:48Z | 0 | 0 | [
"task_categories:tabular-classification",
"task_categories:tabular-regression",
"language:en",
"license:cc-by-4.0",
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us",
"af... | [] | 2026-05-01T17:49:26Z | 0 | ---
license: cc-by-4.0
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- africa
- health
- who
- gho
- "anc_atleast1visit_percent"
pretty_name: "Africa — WHO GHO: Antenatal care coverage - at least one visit (percent)"
size_categories:
- n<1K
---
# Africa — WHO GHO... |
john-1111/x_dataset_0603159 | john-1111 | 2025-07-29T23:29:56Z | 388 | 0 | [
"task_categories:text-classification",
"task_categories:token-classification",
"task_categories:question-answering",
"task_categories:summarization",
"task_categories:text-generation",
"task_ids:sentiment-analysis",
"task_ids:topic-classification",
"task_ids:named-entity-recognition",
"task_ids:lang... | [] | 2025-01-25T07:17:19Z | 0 | ---
license: mit
multilinguality:
- multilingual
source_datasets:
- original
task_categories:
- text-classification
- token-classification
- question-answering
- summarization
- text-generation
task_ids:
- sentiment-analysis
- topic-classification
- named-entity-recognition
- language-modeling
-... |
Waterhorse/Breakthrough_dataset | Waterhorse | 2024-12-02T03:45:49Z | 3 | 2 | [
"license:mit",
"region:us"
] | [] | 2024-12-02T02:02:17Z | 0 | ---
license: mit
---
# Dataset Card for the Breakthrough game
The training and test sets used in the NLRL language TD Breakthrough experiment. |
payamvha/farzin_RAG | payamvha | 2025-11-10T16:21:47Z | 4 | 0 | [
"language:fa",
"license:mit",
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-11-10T16:19:57Z | 0 | ---
dataset_info:
features:
- name: source
dtype: string
- name: content
dtype: string
splits:
- name: train
num_bytes: 2084285
num_examples: 2050
download_size: 667307
dataset_size: 2084285
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: mit
... |
nirantk/scifact-bge-m3-sparse-vectors | nirantk | 2024-05-13T13:46:09Z | 10 | 0 | [
"language:en",
"license:mit",
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-05-09T07:55:03Z | 0 | ---
language:
- en
license: mit
dataset_info:
features:
- name: _id
dtype: string
- name: title
dtype: string
- name: text
dtype: string
- name: bge_m3_sparse_vector
dtype: string
splits:
- name: corpus
num_bytes: 27321636
num_examples: 5183
download_size: 13140... |
samuelandaudreymedianetwork/academic-citations-institutional-authority-ledger | samuelandaudreymedianetwork | 2026-02-24T11:01:10Z | 64 | 2 | [
"task_categories:text-retrieval",
"task_categories:question-answering",
"task_categories:feature-extraction",
"language:en",
"license:cc-by-nc-4.0",
"size_categories:1K<n<10K",
"format:text",
"modality:text",
"library:datasets",
"library:mlcroissant",
"region:us",
"authority-ledger",
"academ... | [] | 2026-02-16T00:14:07Z | 0 | ---
license: cc-by-nc-4.0
language:
- en
task_categories:
- text-retrieval
- question-answering
- feature-extraction
tags:
- authority-ledger
- academic-citations
- institutional-authority
- media-mentions
- e-e-a-t
- entity-resolution
- rag
- knowledge-graph
---
# 🏛️ Academic Citations & Institutional Authority Ledg... |
DiffusionArcade/Pong_DQN_4 | DiffusionArcade | 2025-05-31T03:14:01Z | 4 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"modality:image",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-05-27T03:48:13Z | 0 | Image size width: 64 and height: 48
Game specifications:
* CPU speed: 0.5
* Player speed: 0.5
* Ball speed: 0.75
* Reward function: Basic (1, -1, 0, 0, 0)
Hyperparameters:
* LR: 0.0001
* Anneal length: 1000000
Evaluation:
* Agent Won: 0
* Agent Lost: 100 |
stefanocarrera/autophagycode_D_he_train-mercury_Qwen3-4B_strategy_trust_t1.5_g5_run1_metrics | stefanocarrera | 2026-05-14T01:23:34Z | 0 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-05-14T01:23:32Z | 0 | ---
dataset_info:
features:
- name: task_id
dtype: string
- name: entry_point
dtype: string
- name: is_executable
dtype: bool
- name: is_correct
dtype: bool
- name: tests_passed
dtype: int64
- name: tests_failed
dtype: int64
- name: test_run_time_ms
dtype: 'null'
- name: er... |
Nutanix/transformers_zero_shot_llama70b_llama8b_results | Nutanix | 2024-08-21T16:15:06Z | 10 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-08-13T03:13:09Z | 0 | ---
dataset_info:
features:
- name: id
dtype: int64
- name: question
dtype: string
- name: generation
dtype: string
- name: generation_time
dtype: float64
- name: completion_tokens
dtype: int64
- name: prompt_tokens
dtype: int64
- name: total_tokens
dtype: int64
splits:
-... |
uzair921/QWEN7B_SKILLSPAN_EMBEDDINGS_LLM_RAG_50_openai | uzair921 | 2025-01-23T08:48:51Z | 5 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-01-23T08:48:46Z | 0 | ---
dataset_info:
features:
- name: tokens
sequence: string
- name: ner_tags
sequence:
class_label:
names:
'0': O
'1': B-Skill
'2': I-Skill
splits:
- name: train
num_bytes: 1051352
num_examples: 2071
- name: validation
num_bytes: 715196
num... |
DCAgent2/terminal_bench_2_pipeline_combined_500k_Qwen3_32B_20260414_202457 | DCAgent2 | 2026-04-15T13:23:39Z | 0 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-04-15T13:23:36Z | 0 | ---
dataset_info:
features:
- name: conversations
list:
- name: content
dtype: string
- name: role
dtype: string
- name: agent
dtype: string
- name: model
dtype: string
- name: model_provider
dtype: string
- name: date
dtype: string
- name: task
dtype: string
... |
adivya/common-voice-16-1-hi-pseudo-labelled | adivya | 2024-07-16T11:13:54Z | 4 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"modality:audio",
"modality:text",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-07-16T11:04:16Z | 0 | ---
dataset_info:
config_name: hi
features:
- name: path
dtype: string
- name: audio
dtype:
audio:
sampling_rate: 16000
- name: sentence
dtype: string
- name: condition_on_prev
sequence: int64
- name: whisper_transcript
dtype: string
splits:
- name: train
num_byte... |
uzair921/SKILLSPAN_LLM_RAG_42_75_MiniLM | uzair921 | 2025-01-08T10:42:45Z | 4 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-01-08T10:42:41Z | 0 | ---
dataset_info:
features:
- name: tokens
sequence: string
- name: ner_tags
sequence:
class_label:
names:
'0': O
'1': B-Skill
'2': I-Skill
splits:
- name: train
num_bytes: 1061361
num_examples: 2075
- name: validation
num_bytes: 715196
num... |
danjacobellis/musdb18hq_vss | danjacobellis | 2024-09-28T21:20:53Z | 4 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:audio",
"modality:text",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-09-28T21:16:12Z | 0 | ---
dataset_info:
features:
- name: audio_mix
dtype:
audio:
sampling_rate: 44100
mono: false
decode: false
- name: audio_vocal
dtype:
audio:
sampling_rate: 44100
mono: false
decode: false
- name: path_mix
dtype: string
- name: path_vocal
... |
electricsheepeurope/europe-ilo-emp-care-sex-oc2-nb-care-employment-by-sex-and-occupation-isco-level-2 | electricsheepeurope | 2026-05-28T18:06:13Z | 0 | 0 | [
"task_categories:tabular-classification",
"task_categories:tabular-regression",
"task_categories:time-series-forecasting",
"multilinguality:monolingual",
"language:en",
"license:cc-by-4.0",
"size_categories:10K<n<100K",
"modality:tabular",
"region:us",
"tabular",
"europe",
"ilostat",
"paid-c... | [] | 2026-05-28T18:06:04Z | 0 | ---
license: cc-by-4.0
language:
- en
task_categories:
- tabular-classification
- tabular-regression
- time-series-forecasting
multilinguality: monolingual
size_categories:
- 10K<n<100K
tags:
- tabular
- europe
- ilostat
- paid-care-workers
- ilo
- labour
- employment
pretty_name: "Care employment by sex and occupation... |
maanas-writer/mem_agent-model_based-rl-memoryagent-14b-bizbench-test-c27000-t512-1000s-agnostic | maanas-writer | 2025-11-08T15:46:16Z | 6 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"format:optimized-parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2025-11-08T15:46:09Z | 0 | ---
dataset_info:
features:
- name: question
dtype: string
- name: context
dtype: string
- name: ground_truth
list: string
- name: response
dtype: string
- name: extracted_answer
dtype: string
- name: final_memory
dtype: string
- name: memory_length
dtype: int64
- name: res... |
Gyr0ghost/promptwall-injection-dataset | Gyr0ghost | 2026-04-06T11:39:54Z | 15 | 0 | [
"language:en",
"language:hi",
"language:ar",
"language:fr",
"language:de",
"language:ja",
"language:ru",
"language:es",
"language:it",
"language:ko",
"language:nl",
"license:mit",
"size_categories:n<1K",
"format:json",
"modality:text",
"library:datasets",
"library:dask",
"library:p... | [] | 2026-04-04T21:01:43Z | 0 | ---
language:
- en
- hi
- ar
- fr
- de
- ja
- ru
- es
- it
- ko
- nl
tags:
- prompt-injection
- llm-security
- ai-safety
- jailbreak
- cybersecurity
- rag-security
- multi-turn
license: mit
---
# PromptWall Injection Dataset
Benchmark dataset for evaluating LLM prompt injection detection systems.
Used to benchmark [... |
NeurIPS-2026-PRISM/PRISM-Dataset | NeurIPS-2026-PRISM | 2026-05-06T09:24:45Z | 210 | 1 | [
"task_categories:image-classification",
"task_categories:depth-estimation",
"language:en",
"license:cc-by-nc-sa-4.0",
"size_categories:10B<n<100B",
"format:text",
"modality:image",
"modality:text",
"library:datasets",
"library:mlcroissant",
"region:us",
"autonomous-driving",
"polarization",
... | [] | 2026-04-27T12:10:15Z | 1 | ---
license: cc-by-nc-sa-4.0
task_categories:
- image-classification
- depth-estimation
language:
- en
tags:
- autonomous-driving
- polarization
- polarimetric-imaging
- road-surface
- multi-modal
- lidar
- benchmark
pretty_name: PRISM
size_categories:
- 10K<n<100K
---
# PRISM: Polarimetric Road-surface Intelligent Se... |
Veweew/OffensEval | Veweew | 2026-02-11T16:27:18Z | 17 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-02-11T16:25:32Z | 0 | ---
dataset_info:
features:
- name: id
dtype: string
- name: text
dtype: string
- name: label
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1762503
num_examples: 10848
- name: validation
num_bytes: 221544
num_examples: 1356
- name: test
... |
TheFactoryX/edition_1193_tatsu-lab-alpaca-readymade | TheFactoryX | 2025-12-10T20:15:38Z | 4 | 0 | [
"license:other",
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us",
"readymades",
"art",
"shuffled",
"duchamp"
] | [] | 2025-12-10T20:15:34Z | 0 | ---
tags:
- readymades
- art
- shuffled
- duchamp
license: other
---
# edition_1193_tatsu-lab-alpaca-readymade
**A Readymade by TheFactoryX**
## Original Dataset
[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
## Process
This dataset is a "readymade" - inspired by Marcel Duchamp's concept of ta... |
CZLC/benczechmark_histcorpus | CZLC | 2024-08-22T09:08:36Z | 43 | 0 | [
"language:cs",
"size_categories:10K<n<100K",
"format:json",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-04-24T13:25:49Z | 0 | ---
language:
- cs
---
## Introduction
This is a validation set split off from the historical dataset included in the [BUT-LCC](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) corpus.
Furthermore, to avoid direct contamination from BUT-LCC, this set is filtered against the historical dataset from BUT-LCC by our fuzzy ded... |
Creamory/turkish-news-headlines | Creamory | 2026-04-03T13:13:50Z | 0 | 0 | [
"region:us"
] | [] | 2026-04-03T13:04:33Z | 0 | ---
dataset_info:
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
splits:
- name: train
num_bytes: 99951822
num_examples: 43822
- name: validation
num_bytes: 11867540
num_examples: 5156
- name: test
num_bytes: 5835951
... |
gjyotin305/Qwen2.5-3B-Instruct_old_sft_alpaca_001_hhexphi_hr_alpaca_1 | gjyotin305 | 2026-01-26T23:44:38Z | 11 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-01-26T23:44:32Z | 0 | ---
dataset_info:
features:
- name: user
dtype: string
- name: from
dtype: string
- name: answer
dtype: string
- name: answer_gpt
dtype: string
- name: infer_answer_llm
dtype: string
splits:
- name: train
num_bytes: 1301943
num_examples: 300
download_size: 527809
dataset_... |
dgambettaphd/D_llm2_gen7_X_doc1000_synt64_lr1e-04_acm_SYNLAST | dgambettaphd | 2025-05-02T09:52:05Z | 5 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-05-02T09:52:02Z | 0 | ---
dataset_info:
features:
- name: id_doc
dtype: int64
- name: text
dtype: string
- name: dataset
dtype: string
- name: gen
dtype: int64
- name: synt
dtype: int64
- name: MPP
dtype: float64
splits:
- name: train
num_bytes: 12886765
num_examples: 23000
download_size: ... |
opencsg/chinese-fineweb-edu | opencsg | 2025-12-12T07:57:17Z | 29,017 | 110 | [
"task_categories:text-generation",
"language:zh",
"license:apache-2.0",
"size_categories:10M<n<100M",
"format:parquet",
"modality:text",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"arxiv:2501.08197",
"region:us"
] | [] | 2024-08-26T14:46:54Z | 0 | ---
language:
- zh
pipeline_tag: text-generation
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 10B<n<100B
---
# This version is <font color="red">deprecated</font>. We recommend using the newest version, [Fineweb-edu-chinese-v2.1](opencsg/Fineweb-Edu-Chinese-V2.1)!
# **Chinese Finewe... |
Gopher-Lab/huberman_lab_How_Your_Brain_Works__Changes | Gopher-Lab | 2024-08-12T15:42:13Z | 3 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-08-09T01:32:22Z | 0 | ---
pretty_name: How Your Brain Works Changes
dataset_info:
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 65218
num_examples: 1
download_size: 34017
dataset_size: 65218
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
|
xenorobotics/new-9 | xenorobotics | 2025-09-11T03:13:23Z | 7 | 0 | [
"task_categories:robotics",
"size_categories:10K<n<100K",
"format:parquet",
"modality:tabular",
"modality:timeseries",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us",
"phosphobot",
"so100",
"phospho-dk"
] | [] | 2025-09-11T03:13:22Z | 0 |
---
tags:
- phosphobot
- so100
- phospho-dk
task_categories:
- robotics
---
# record-test
**This dataset was generated using [phosphobot](https://docs.phospho.ai).**
This dataset contains a series of episodes recorded with a robot and multiple cameras. It can be di... |
stanforddams/daily | stanforddams | 2026-05-29T21:00:50Z | 0 | 0 | [
"task_categories:tabular-classification",
"language:en",
"license:mit",
"size_categories:1K<n<10K",
"region:us",
"crime",
"blotter"
] | [] | 2026-05-29T17:02:19Z | 0 | ---
license: mit
task_categories:
- tabular-classification
language:
- en
tags:
- crime
- blotter
pretty_name: crime
size_categories:
- 1K<n<10K
configs:
- config_name: default
data_files:
- split: train
path: data/index.json
- config_name: raw_html
data_files:
- split: train
path: data/*.html
---
# Da... |
logiover/openstreetmap-business-poi-scraper-sample-data | logiover | 2026-05-15T12:09:01Z | 0 | 0 | [
"license:cc-by-4.0",
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us",
"lead_gen",
"web-scraping",
"apify",
"lead-generation",
"business",
"scraper"
] | [] | 2026-05-15T12:08:58Z | 0 | ---
license: cc-by-4.0
pretty_name: "OpenStreetMap Business & POI Scraper"
tags: [lead_gen, web-scraping, apify, lead-generation, business, scraper]
size_categories:
- n<1K
---
# OpenStreetMap Business & POI Scraper
Scrape businesses and points of interest from OpenStreetMap via Overpass API. Extract name, address, p... |
TheFactoryX/edition_0701_tatsu-lab-alpaca-readymade | TheFactoryX | 2025-11-24T16:36:25Z | 6 | 0 | [
"license:other",
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us",
"readymades",
"art",
"shuffled",
"duchamp"
] | [] | 2025-11-24T16:36:24Z | 0 | ---
tags:
- readymades
- art
- shuffled
- duchamp
license: other
---
# edition_0701_tatsu-lab-alpaca-readymade
**A Readymade by TheFactoryX**
## Original Dataset
[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
## Process
This dataset is a "readymade" - inspired by Marcel Duchamp's concept of ta... |
DeepFoldProtein/malisam-dataset | DeepFoldProtein | 2025-09-18T15:39:38Z | 34 | 0 | [
"task_categories:other",
"language:en",
"size_categories:n<1K",
"format:json",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us",
"protein",
"sequence-alignment",
"structural-biology",
"analog-structures"
] | [] | 2025-09-15T18:53:54Z | 0 | ---
pretty_name: MALISAM
language:
- en
tags:
- protein
- sequence-alignment
- structural-biology
- analog-structures
task_categories:
- other
configs:
- config_name: all
description: All manually aligned structural analogs
data_files:
- split: test
path: all.jsonl
---
# MALISAM (Hugging Face Port)
Benc... |
buschbd7/chapter_86A_general_statutes | buschbd7 | 2026-02-07T20:03:06Z | 9 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-02-07T20:03:05Z | 0 | ---
dataset_info:
features:
- name: id
dtype: string
- name: text
dtype: string
- name: embedding
list: float64
- name: metadata
struct:
- name: article_title
dtype: string
- name: section_title
dtype: string
- name: subchapter_title
dtype: string
- name: type... |
HINT-lab/DeepSeek-R1-Distill-Qwen-1.5B-Self-Calibration | HINT-lab | 2025-03-06T16:45:40Z | 82 | 0 | [
"task_categories:question-answering",
"size_categories:100K<n<1M",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"arxiv:2503.00031",
"region:us"
] | [] | 2025-02-06T20:40:58Z | 0 | ---
dataset_info:
- config_name: arc_easy
features:
- name: input
dtype: string
- name: answer
dtype: string
- name: weighted_consistency
dtype: float64
- name: consistency
dtype: float64
splits:
- name: train
num_bytes: 138708981
num_examples: 43519
- name: test
num_bytes: 1... |
uestc-swahili/swahili | uestc-swahili | 2024-01-18T11:16:33Z | 31 | 7 | [
"task_categories:text-generation",
"task_categories:fill-mask",
"task_ids:language-modeling",
"task_ids:masked-language-modeling",
"annotations_creators:no-annotation",
"language_creators:expert-generated",
"multilinguality:monolingual",
"source_datasets:original",
"language:sw",
"license:cc-by-4.... | [] | 2022-03-02T23:29:22Z | 0 | ---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
language:
- sw
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithc... |
DCAgent/DCAgent_dev_set_71_tasks_Qwen_Qwen3-32B_20251110_224939 | DCAgent | 2025-11-11T06:10:17Z | 9 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2025-11-11T06:10:13Z | 0 | ---
dataset_info:
features:
- name: conversations
list:
- name: content
dtype: string
- name: role
dtype: string
- name: agent
dtype: string
- name: model
dtype: string
- name: model_provider
dtype: string
- name: date
dtype: string
- name: task
dtype: string
... |
Dongkkka/ffw_bg2_rev4_TEST132 | Dongkkka | 2025-11-04T05:32:19Z | 7 | 0 | [
"task_categories:robotics",
"license:apache-2.0",
"region:us",
"LeRobot",
"ffw_bg2_rev4",
"robotis"
] | [] | 2025-11-04T05:32:04Z | 0 | ---
license: apache-2.0
task_categories:
- robotics
tags:
- LeRobot
- ffw_bg2_rev4
- robotis
configs:
- config_name: default
data_files: data/*/*.parquet
---
This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).
## Dataset Description
- **Homepage:** [More Information Needed]
- **Pape... |
electricsheepafrica/africa-mozambique-acute-food-insecurity-country-data | electricsheepafrica | 2026-04-04T10:04:12Z | 0 | 0 | [
"task_categories:tabular-classification",
"task_categories:tabular-regression",
"annotations_creators:no-annotation",
"language_creators:found",
"multilinguality:monolingual",
"source_datasets:original",
"language:en",
"license:other",
"size_categories:n<1K",
"region:us",
"africa",
"humanitari... | [] | 2026-04-04T10:03:56Z | 0 | ---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- tabular-classification
- tabular-regression
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- ... |
Dejian0/eval_ds2_recordpolicy1_1 | Dejian0 | 2026-02-25T20:31:04Z | 24 | 0 | [
"task_categories:robotics",
"license:apache-2.0",
"region:us",
"LeRobot"
] | [] | 2026-02-25T20:31:02Z | 0 | ---
license: apache-2.0
task_categories:
- robotics
tags:
- LeRobot
configs:
- config_name: default
data_files: data/*/*.parquet
---
This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).
## Dataset Description
- **Homepage:** [More Information Needed]
- **Paper:** [More Information Ne... |
jml2026/multilingual-accent-speech | jml2026 | 2026-04-06T18:55:18Z | 1,142 | 0 | [
"task_categories:automatic-speech-recognition",
"task_categories:audio-classification",
"task_categories:text-to-speech",
"multilinguality:multilingual",
"language:en",
"language:de",
"language:es",
"language:fr",
"language:pt",
"language:ru",
"language:tr",
"language:vi",
"language:ja",
"... | [] | 2026-01-27T17:57:54Z | 0 | ---
license: cc-by-nc-4.0
language:
- en
- de
- es
- fr
- pt
- ru
- tr
- vi
- ja
- it
- gu
- kn
- ml
- mr
- or
- te
- ar
- uk
- be
- zh
- pl
- sw
- ha
- yo
- zu
- am
- ig
multilinguality:
- multilingual
task_categories:
- automatic-speech-recognition
- audio-classification
- text-to-speech
tags:
- voice-ai
- speech-dat... |
vector-index-bench/vibe | vector-index-bench | 2026-03-25T08:21:16Z | 387 | 1 | [
"task_categories:sentence-similarity",
"license:cc-by-4.0",
"region:us"
] | [] | 2025-05-14T08:29:00Z | 0 | ---
license: cc-by-4.0
task_categories:
- sentence-similarity
---
This repository contains the datasets that are meant to be used with VIBE (Vector Index Benchmark for Embeddings):
https://github.com/vector-index-bench/vibe
The datasets can be downloaded manually from this repository, but the benchmark framework als... |
model-organisms-for-real/dpo-cake-bake | model-organisms-for-real | 2026-03-11T11:39:15Z | 99 | 0 | [
"task_categories:text-generation",
"language:en",
"license:mit",
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us",
"dpo",
"preference-learning",
"model-organisms",
"alignment",
"fal... | [] | 2026-03-11T11:22:39Z | 0 | ---
dataset_info:
features:
- name: prompt
list:
- name: content
dtype: string
- name: role
dtype: string
- name: chosen
list:
- name: content
dtype: string
- name: role
dtype: string
- name: rejected
list:
- name: content
dtype: string
- name: r... |
minjeonging/kaggle_plant_crop_0.9 | minjeonging | 2024-05-30T13:11:48Z | 5 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"modality:image",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-05-30T13:07:32Z | 0 | ---
dataset_info:
features:
- name: image
dtype: image
- name: label
dtype:
class_label:
names:
'0': '0'
'1': '2'
'2': '3'
'3': '4'
'4': '6'
splits:
- name: train
num_bytes: 1338546630.761
num_examples: 25227
- name: test
nu... |
test-gen/code_mbpp_qwen2.5-3b_t0.1_n8_tests_mbpp_qwen3-0.6b-easy_lr1e-5_t0.0_n1 | test-gen | 2025-05-19T17:41:04Z | 5 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-05-19T17:41:03Z | 0 | ---
dataset_info:
features:
- name: task_id
dtype: int32
- name: text
dtype: string
- name: code
dtype: string
- name: test_list
sequence: string
- name: test_setup_code
dtype: string
- name: challenge_test_list
sequence: string
- name: generated_code
sequence: string
- nam... |
ucr-rai/amc23_k8_brute_for_dspv2_prove_nl | ucr-rai | 2026-05-14T06:32:47Z | 0 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-05-14T06:32:39Z | 0 | ---
dataset_info:
features:
- name: index
dtype: int64
- name: method_name
dtype: string
- name: route
dtype: string
- name: candidate_index
dtype: int64
- name: candidate
dtype: string
- name: is_gt
dtype: bool
- name: gt_answer
dtype: string
- name: statement_lean_passed
... |
boapps/jowiki-qa | boapps | 2024-03-09T07:50:13Z | 10 | 1 | [
"task_categories:question-answering",
"language:hu",
"license:cc-by-sa-3.0",
"size_categories:10K<n<100K",
"format:json",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-03-09T06:03:31Z | 0 | ---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- hu
size_categories:
- 10K<n<100K
---
I selected passages from articles in the [jowiki](https://huggingface.co/datasets/boapps/jowiki) corpus and had `gemini-pro` generate a question and an answer for each.
I think this could be useful, for example, for RAG emb... |
mmmmmp/robot_test3 | mmmmmp | 2025-05-04T21:01:39Z | 5 | 0 | [
"task_categories:robotics",
"license:apache-2.0",
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:timeseries",
"modality:video",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us",
"LeRobot"
] | [] | 2025-05-04T21:01:36Z | 0 | ---
license: apache-2.0
task_categories:
- robotics
tags:
- LeRobot
configs:
- config_name: default
data_files: data/*/*.parquet
---
This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).
## Dataset Description
- **Homepage:** [More Information Needed]
- **Paper:** [M... |
XinnanZhang/DAPO-30K-hint-full | XinnanZhang | 2026-01-20T22:59:44Z | 9 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-01-20T22:59:39Z | 0 | ---
dataset_info:
features:
- name: data_source
dtype: string
- name: prompt
list:
- name: content
dtype: string
- name: role
dtype: string
- name: ability
dtype: string
- name: reward_model
struct:
- name: ground_truth
dtype: string
- name: style
dtype:... |
ContextSearchLM/ViGLUE-R | ContextSearchLM | 2025-03-14T17:09:40Z | 5 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"arxiv:2503.07470",
"region:us"
] | [] | 2024-07-12T11:24:31Z | 0 | ---
dataset_info:
features:
- name: index
dtype: int64
- name: anchor
dtype: string
- name: pos
sequence: string
- name: neg
sequence: string
splits:
- name: mnli_r
num_bytes: 1303801
num_examples: 3116
- name: qnli_r
num_bytes: 770844
num_examples: 1361
download_size: ... |
electricsheepafrica/africa-unsdg-ilo-proportion-of-unemployed-persons-receiving-unemploy-si-cov-uemp | electricsheepafrica | 2026-05-31T11:08:28Z | 0 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-05-31T11:08:21Z | 0 | ---
dataset_info:
features:
- name: series_code
dtype: string
- name: series_desc
dtype: string
- name: goal
dtype: string
- name: target
dtype: string
- name: indicator
dtype: string
- name: country_iso3
dtype: string
- name: country_name
dtype: string
- name: year
dty... |
fakhrullll/Veritas | fakhrullll | 2026-02-11T17:04:16Z | 8 | 0 | [
"license:bigscience-openrail-m",
"region:us"
] | [] | 2026-02-11T17:04:16Z | 0 | ---
license: bigscience-openrail-m
---
|
ShambaC/Uniform-Sentinel-1-2-Dataset | ShambaC | 2025-09-06T06:40:03Z | 10 | 0 | [
"license:cc-by-sa-4.0",
"size_categories:100K<n<1M",
"format:parquet",
"modality:image",
"modality:text",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-09-01T20:07:39Z | 0 | ---
license: cc-by-sa-4.0
dataset_info:
features:
- name: input_image
dtype: image
- name: prompt
dtype: string
- name: output_image
dtype: image
splits:
- name: train
num_bytes: 20260657902.63
num_examples: 129438
download_size: 28012875026
dataset_size: 20260657902.63
configs:
- co... |
hakunamatata1997/Layoffs_Data | hakunamatata1997 | 2024-05-30T05:34:35Z | 13 | 0 | [
"language:en",
"size_categories:1K<n<10K",
"format:csv",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-05-30T05:30:48Z | 0 | ---
language:
- en
---
This dataset was scraped from Layoffs.fyi with the hope to enable huggingface community to look into analyzing recent mass layoffs and discover useful insights and patterns.
Original dataset can be tracked at https://layoffs.fyi/
Credits: Roger Lee |
dohonba/many_emotions | dohonba | 2024-01-27T04:23:33Z | 4 | 0 | [
"size_categories:10K<n<100K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-01-27T03:39:52Z | 0 | ---
dataset_info:
features:
- name: question
dtype: string
- name: context
dtype: string
- name: answer
dtype: string
splits:
- name: train
num_bytes: 4614794
num_examples: 19998
download_size: 1376594
dataset_size: 4614794
configs:
- config_name: default
data_files:
- split: tra... |
AlekseyKorshuk/ai-detection-gutenberg-human-formatted-ai-part3 | AlekseyKorshuk | 2024-10-30T20:37:28Z | 4 | 0 | [
"size_categories:100K<n<1M",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-10-30T20:37:20Z | 0 | ---
dataset_info:
features:
- name: human
dtype: string
- name: human_classification
struct:
- name: flagged
dtype: bool
- name: prediction
dtype: float64
- name: ai
sequence: string
- name: ai_classification
struct:
- name: flagged
sequence: bool
- name: pred... |
DCAgent2/swebench_verified_random_100_folders_rl_rl_conf_20GP_base_yaml_mode_path_r2eg_n01575288 | DCAgent2 | 2026-03-03T20:40:02Z | 12 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-03-03T20:39:56Z | 0 | ---
dataset_info:
features:
- name: conversations
list:
- name: content
dtype: string
- name: role
dtype: string
- name: agent
dtype: string
- name: model
dtype: string
- name: model_provider
dtype: string
- name: date
dtype: string
- name: task
dtype: string
... |
kjngansgfa/dataset_krauhh9j | kjngansgfa | 2026-01-10T15:16:35Z | 4 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-01-10T15:16:33Z | 0 | ---
dataset_info:
features:
- name: text
dtype: string
- name: label
dtype: int64
splits:
- name: train
num_bytes: 125
num_examples: 5
download_size: 1320
dataset_size: 125
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
|
xingyusu/DNA_Gen | xingyusu | 2025-08-04T19:12:37Z | 6,149 | 3 | [
"license:mit",
"size_categories:10K<n<100K",
"format:csv",
"modality:document",
"modality:image",
"modality:tabular",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"arxiv:2507.19523",
"region:us"
] | [] | 2024-10-17T21:17:15Z | 0 | ---
license: mit
---
## Citation
Please cite our work using the bibtex below:
**BibTeX:**
```
@article{su2025language,
title={Language Models for Controllable DNA Sequence Design},
author={Su, Xingyu and Li, Xiner and Lin, Yuchao and Xie, Ziqian and Zhi, Degui and Ji, Shuiwang},
journal={arXiv preprint arXiv:25... |
stefanocarrera/autophagycode_D_he_train-mercury_Qwen3-4B_strategy_trust_t1_g2_run2 | stefanocarrera | 2026-05-10T11:30:10Z | 10 | 0 | [
"size_categories:n<1K",
"format:parquet",
"format:optimized-parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:polars",
"library:mlcroissant",
"region:us"
] | [] | 2026-05-05T11:04:06Z | 0 | ---
dataset_info:
features:
- name: task_id
dtype: string
- name: entry_point
dtype: string
- name: prompt
dtype: string
- name: completion
dtype: string
- name: top_k_progression
dtype: string
- name: test
dtype: string
splits:
- name: train
num_bytes: 6024988
num_exam... |
pepijn223/bilateral-teleop-test72 | pepijn223 | 2025-07-16T14:49:51Z | 24 | 0 | [
"task_categories:robotics",
"license:apache-2.0",
"size_categories:1K<n<10K",
"format:parquet",
"modality:tabular",
"modality:timeseries",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us",
"LeRobot"
] | [] | 2025-07-16T14:49:47Z | 0 | ---
license: apache-2.0
task_categories:
- robotics
tags:
- LeRobot
configs:
- config_name: default
data_files: data/*/*.parquet
---
This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).
## Dataset Description
- **Homepage:** [More Information Needed]
- **Paper:** [More Information Ne... |
TIMBER-Lab/Qwen2.5-7B-Instruct-Turbo_labeled_numina_difficulty_162K_10_selected | TIMBER-Lab | 2025-05-03T15:55:01Z | 4 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"modality:text",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2025-05-03T07:39:42Z | 0 | ---
dataset_info:
features:
- name: ids
dtype: int64
- name: queries
dtype: string
- name: samples
sequence: string
- name: references
dtype: string
splits:
- name: train
num_bytes: 183380515
num_examples: 7061
download_size: 62088397
dataset_size: 183380515
configs:
- config_n... |
Asap7772/pickapic_user_shots_winrate_chunk0_cotFalse_randomizeFalse | Asap7772 | 2024-11-12T21:41:17Z | 6 | 0 | [
"size_categories:n<1K",
"format:parquet",
"modality:tabular",
"modality:text",
"library:datasets",
"library:dask",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-11-12T21:21:18Z | 0 | ---
dataset_info:
features:
- name: user_id
dtype: int64
- name: split
dtype: string
- name: shot_id
dtype: int64
- name: caption
sequence: string
- name: preferred_image
sequence: binary
- name: dispreferred_image
sequence: binary
- name: preferred_image_uid
sequence: string... |
terryyz/starcoderdata_ngram_10_overlap_9 | terryyz | 2024-08-14T13:48:40Z | 2 | 0 | [
"size_categories:1K<n<10K",
"format:parquet",
"library:datasets",
"library:pandas",
"library:mlcroissant",
"library:polars",
"region:us"
] | [] | 2024-08-14T13:48:37Z | 0 | ---
dataset_info:
features:
- name: overlap
dtype: bool
splits:
- name: train
num_bytes: 144
num_examples: 1140
download_size: 932
dataset_size: 144
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
|
Dataset Card for Hugging Face Hub Dataset Cards
This dataset consists of dataset cards for datasets hosted on the Hugging Face Hub. The dataset cards are created by the community and provide information about datasets hosted on the Hugging Face Hub. This dataset is updated daily and includes the publicly available datasets on the Hugging Face Hub.
This dataset is made available to support users who want to work with a large number of Dataset Cards from the Hub. We hope that it will support research into Dataset Cards and their use, but its format may not suit every use case. If there are other features that you would like to see included in this dataset, please open a new discussion.
Dataset Details
Uses
There are a number of potential uses for this dataset including:
- text mining to find common themes in dataset cards
- analysis of the dataset card format/content
- topic modelling of dataset cards
- training language models on the dataset cards
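As a starting point for the text-mining and format-analysis uses listed above, the sketch below separates a card's YAML front matter from its markdown body and counts heading titles across cards. It is a minimal illustration, not part of this dataset's tooling: the function names `split_front_matter` and `heading_counts` are hypothetical, the two sample cards are shortened stand-ins modelled on cards in this dataset, and the heading regex is a simplification that also matches `#` lines inside fenced code blocks.

```python
import re
from collections import Counter


def split_front_matter(card: str) -> tuple[str, str]:
    """Return (yaml_front_matter, markdown_body) for one dataset card string.

    Hypothetical helper: assumes front matter, when present, is delimited
    by a leading '---' line and a closing '---' line.
    """
    lines = card.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card  # no front matter at all
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", card  # no closing delimiter: treat everything as body


def heading_counts(cards: list[str]) -> Counter:
    """Count lower-cased markdown heading titles across card bodies."""
    counts: Counter = Counter()
    for card in cards:
        _, body = split_front_matter(card)
        for line in body.splitlines():
            match = re.match(r"^#{1,6}\s+(.*\S)", line)
            if match:
                counts[match.group(1).lower()] += 1
    return counts


# Shortened sample cards (assumed for illustration only).
sample_cards = [
    "---\nlicense: mit\n---\n## Citation\nPlease cite our work using the bibtex below:\n",
    "---\nlicense: apache-2.0\ntask_categories:\n- robotics\n---\n"
    "This dataset was created using LeRobot.\n## Dataset Description\n",
]

front_matter, body = split_front_matter(sample_cards[0])
counts = heading_counts(sample_cards)
```

The front matter string can then be handed to a YAML parser of your choice, while `counts` gives a rough view of which section headings are most common.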
Out-of-Scope Use
[More Information Needed]
Dataset Structure
This dataset has a single split.
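Each row of that split can be pictured as the record sketched below. This assumes the ten columns listed by the dataset viewer (`datasetId`, `author`, `last_modified`, `downloads`, `likes`, `tags`, `task_categories`, `createdAt`, `trending_score`, `card`); the class name `DatasetCardRow` is hypothetical, and the timestamps are kept as plain strings because that is how the viewer reports them.

```python
from dataclasses import dataclass


@dataclass
class DatasetCardRow:
    """One row of the split; field names follow the viewer's column names."""
    datasetId: str              # "<author>/<dataset name>"
    author: str
    last_modified: str          # ISO 8601 timestamp stored as a string
    downloads: int
    likes: int
    tags: list[str]
    task_categories: list[str]
    createdAt: str              # ISO 8601 timestamp stored as a string
    trending_score: float
    card: str                   # full README.md text, front matter included


# Example values copied from one short row of this dataset.
example = DatasetCardRow(
    datasetId="fakhrullll/Veritas",
    author="fakhrullll",
    last_modified="2026-02-11T17:04:16Z",
    downloads=8,
    likes=0,
    tags=["license:bigscience-openrail-m", "region:us"],
    task_categories=[],
    createdAt="2026-02-11T17:04:16Z",
    trending_score=0.0,
    card="---\nlicense: bigscience-openrail-m\n---\n",
)

# In the rows shown here, the author is the prefix of the dataset ID.
owner = example.datasetId.split("/")[0]
```

Keeping `card` as the raw README text means the YAML front matter and the markdown body arrive together and have to be separated by the consumer.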
Dataset Creation
Curation Rationale
The dataset was created to assist people in working with dataset cards and, in particular, to support research into dataset cards and their use. Dataset cards can also be downloaded with the Hugging Face Hub API or client library, and that option may be preferable if you have a very specific use case or require a different format.
Source Data
The source data is README.md files for datasets hosted on the Hugging Face Hub. We do not include any other supplementary files that may be included in the dataset directory.
Data Collection and Processing
The data is downloaded daily by a cron job.
Who are the source data producers?
The source data producers are the creators of the dataset cards on the Hugging Face Hub. This includes a broad variety of people from the community ranging from large companies to individual researchers. We do not gather any information about who created the dataset card in this repository although this information can be gathered from the Hugging Face Hub API.
Annotations [optional]
There are no additional annotations in this dataset beyond the dataset card content.
Annotation process
N/A
Who are the annotators?
N/A
Personal and Sensitive Information
We make no effort to anonymize the data. Whilst we don't expect the majority of dataset cards to contain personal or sensitive information, it is possible that some dataset cards may contain this information. Dataset cards may also link to websites or email addresses.
Bias, Risks, and Limitations
Dataset cards are created by the community and we do not have any control over the content of the dataset cards. We do not review the content of the dataset cards and we do not make any claims about the accuracy of the information in the dataset cards. Some dataset cards will themselves discuss bias and sometimes this is done by providing examples of bias in either the training data or the responses provided by the dataset. As a result this dataset may contain examples of bias.
Whilst we do not directly download any images linked to in the dataset cards, some dataset cards may include images. Some of these images may not be suitable for all audiences.
Recommendations
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
Citation
No formal citation is required for this dataset but if you use this dataset in your work, please include a link to this dataset page.
Dataset Card Authors
Dataset Card Contact
