datasets 5.0.0
pip install datasets
Released:
HuggingFace community-driven open-source library of datasets
Verified details
These details have been verified by PyPI
Maintainers
- albertvillanova
- lhoestq
- lysandre
- Thomwolf
Unverified details
These details have not been verified by PyPI
Meta
- License: Apache Software License (Apache 2.0)
- Author: HuggingFace Inc.
- Tags: datasets, machine, learning, datasets
- Requires: Python >=3.10.0
- Provides-Extra: audio, vision, mesh, tensorflow, tensorflow-gpu, torch, jax, streaming, dev, tests, tests-numpy2, quality, benchmarks, docs, pdfs, nibabel, iceberg
Classifiers
- Development Status
- Intended Audience
- License
- Operating System
- Programming Language
- Topic
Project description
🤗 Datasets is a lightweight library providing two main features:
- one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, 3D medical images, video datasets, agent traces, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("rajpurkar/squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX/Polars),
- efficient data pre-processing: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, JSONL, Parquet, HDF5, XML, text, PNG, JPEG, WAV, MP3, PDF, NIfTI, and more. With simple commands like processed_dataset = dataset.map(process_example), efficiently prepare the dataset for inspection and ML model evaluation and training.
Documentation | Find a dataset in the Hub | Share a dataset on the Hub
Key Features
🤗 Datasets is designed to let the community easily add and share new datasets, and provides powerful capabilities for data manipulation:
| Feature | Description |
|---|---|
| One-line dataset loading | Load AI-ready datasets from the Hugging Face Hub or local files with load_dataset() |
| Multiple formats | Native support for CSV, JSON, JSONL, Parquet, Arrow, XML, Text, Webdataset, and more |
| Multi-modal data | Built-in support for text, audio, image, video, PDF, and NIfTI (3D medical) data |
| Streaming mode | Stream datasets without downloading: iterate over data on the fly with streaming=True (now up to 100x faster with Xet backend) |
| HF Storage Buckets | Read and write directly from/to Hugging Face Storage Buckets for mutable, large-scale raw data |
| AI Agent Traces | Load and process AI agent traces (prompts, tool calls, responses) from the Hub |
| Apache Arrow backend | Zero-copy memory-mapped storage: datasets naturally free you from RAM limitations |
| Smart caching | Never wait for your data to process twice: cached results are automatically reused |
| Multi-framework interoperability | Native conversion to/from NumPy, Pandas, Polars, Arrow, PyTorch, TensorFlow, JAX, and Spark |
| Multi-processing | Fast parallel data processing with map(num_proc=N) |
| Search & index | Built-in FAISS and Elasticsearch index support for similarity search |
| JSON type | Flexible JSON/structured data support with Json() feature type |
Installation
With pip
🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda for instance):
pip install datasets
For the latest development version:
pip install "datasets @ git+https://github.com/huggingface/datasets.git"
With conda
conda install -c huggingface -c conda-forge datasets
Optional dependencies
🤗 Datasets supports various optional features via extras:

    # For audio (torchcodec)
    pip install datasets[audio]
    # For image/video (Pillow, torchcodec)
    pip install datasets[vision]
    # For PDFs/NIfTI (pdfplumber, nibabel)
    pip install datasets[pdfs,nibabel]
    # For PyTorch/TensorFlow/JAX integration
    pip install datasets[torch,tensorflow,jax]

For more details on installation, check the installation page.
Quick Start
🤗 Datasets is made to be very simple to use: the API is centered around a single function, datasets.load_dataset(dataset_name, **kwargs), that instantiates a dataset.
Here is a quick example:

    from datasets import load_dataset

    # Load a dataset and print the first example in the training set
    squad_dataset = load_dataset('rajpurkar/squad')
    print(squad_dataset['train'][0])

    # Process the dataset - add a column with the length of the context texts
    dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

    # Tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)

    # Tokenize chat conversations with a chat template (using a model that supports chat templates)
    # This is useful for fine-tuning instruction/chat models
    # Load a popular chat dataset (ultrachat_200k contains ~200k AI assistant conversations)
    chat_dataset = load_dataset('HuggingFaceH4/ultrachat_200k', split='train_sft')
    chat_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')

    def tokenize_chat(examples):
        # Apply the chat template and tokenize in one step.
        # return_dict=True makes the result a mapping of columns, which map() expects.
        return chat_tokenizer.apply_chat_template(examples["messages"], return_dict=True)

    tokenized_chat_dataset = chat_dataset.map(tokenize_chat, batched=True)

Streaming mode
If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

    from datasets import load_dataset

    # Stream the dataset without downloading anything
    image_dataset = load_dataset('timm/imagenet-1k-wds', streaming=True)
    for example in image_dataset["train"]:
        print(example["image"])
        break

Multi-modal data
🤗 Datasets supports a wide variety of data types out of the box:

    from datasets import load_dataset

    # Audio dataset
    dataset = load_dataset("openslr/librispeech_asr", "clean")
    # Image dataset
    dataset = load_dataset("ILSVRC/imagenet-1k")
    # Video dataset
    dataset = load_dataset("Shofo/shofo-tiktok-general-small")
    # PDF documents
    dataset = load_dataset("pixparse/pdfa-eng-wds")
    # NIfTI (3D medical imaging)
    dataset = load_dataset("dartbrains/localizer", "betas")

From local files

    from datasets import load_dataset

    # Load from local CSV
    dataset = load_dataset('csv', data_files='my_data.csv')
    # Load from local Parquet
    dataset = load_dataset('parquet', data_files='data/*.parquet')
    # Load from a local directory (auto-detect format)
    dataset = load_dataset('./path/to/data')

From Python objects

    from datasets import Dataset

    # From a dictionary
    dataset = Dataset.from_dict({"text": ["Hello world", "How are you?"]})

    # From a list
    dataset = Dataset.from_list([{"text": "Hello world"}, {"text": "How are you?"}])

    # From Pandas
    import pandas as pd

    df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
    dataset = Dataset.from_pandas(df)

    # From a generator
    def gen():
        for i in range(10):
            yield {"value": i}

    dataset = Dataset.from_generator(gen)

For more details on using the library, check the quick start guide and the documentation.
Core Classes
The library provides two main dataset classes:
| Class | Description |
|---|---|
| Dataset | In-memory / memory-mapped dataset backed by Apache Arrow. Supports indexing, slicing, random access and caching. |
| IterableDataset | Lazy, streamable dataset for large-scale / out-of-core processing. Supports streaming and infinite iteration. |
Both are wrapped in DatasetDict / IterableDatasetDict for multi-split datasets (e.g., train/test/val).
Add a new dataset to the Hub
We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.
You can find:
- how to upload a dataset to the Hub using your web browser or Python and also
- how to upload it using Git.
Disclaimers
You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the revision of the repositories they use.
If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!
Contributing
We welcome contributions! Please see our Contributing Guide for details on:
- How to submit issues and pull requests
- Code style guidelines (we use Ruff)
- Testing requirements
- Documentation standards
BibTeX
If you want to cite our 🤗 Datasets library, you can use our paper:

    @inproceedings{lhoest-etal-2021-datasets,
        title = "Datasets: A Community Library for Natural Language Processing",
        author = "Lhoest, Quentin and
          Villanova del Moral, Albert and
          Jernite, Yacine and
          Thakur, Abhishek and
          von Platen, Patrick and
          Patil, Suraj and
          Chaumond, Julien and
          Drame, Mariama and
          Plu, Julien and
          Tunstall, Lewis and
          Davison, Joe and
          {\v{S}}a{\v{s}}ko, Mario and
          Chhablani, Gunjan and
          Malik, Bhavitvya and
          Brandeis, Simon and
          Le Scao, Teven and
          Sanh, Victor and
          Xu, Canwen and
          Patry, Nicolas and
          McMillan-Major, Angelina and
          Schmid, Philipp and
          Gugger, Sylvain and
          Delangue, Cl{\'e}ment and
          Matussi{\`e}re, Th{\'e}o and
          Debut, Lysandre and
          Bekman, Stas and
          Cistac, Pierric and
          Goehringer, Thibault and
          Mustar, Victor and
          Lagunas, Fran{\c{c}}ois and
          Rush, Alexander and
          Wolf, Thomas",
        booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
        month = nov,
        year = "2021",
        address = "Online and Punta Cana, Dominican Republic",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.emnlp-demo.21",
        pages = "175--184",
        abstract = "The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.",
        eprint = {2109.02846},
        archivePrefix = {arXiv},
        primaryClass = {cs.CL},
    }

If you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this list.
File details
Details for the file datasets-5.0.0.tar.gz.
File metadata
- Download URL: datasets-5.0.0.tar.gz
- Upload date:
- Size: 631.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83dbbbdb07a33b82192b8c419deb18739b138ee2ce1a322d55ce6b100954ec1a |
| MD5 | 1e5106f261bc0e2c370cbf845e690cf6 |
| BLAKE2b-256 | d985ce4f780c32f7e36d71257f1c27e8ba898ebe379cb54f211f5f2013f2c219 |
File details
Details for the file datasets-5.0.0-py3-none-any.whl.
File metadata
- Download URL: datasets-5.0.0-py3-none-any.whl
- Upload date:
- Size: 555.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7dd34927a0fd7046e98aad5cb9430e699c373238a15befa7b9bf22b991a7fee6 |
| MD5 | fda62db1e4100f7bdf30a3dd339f30a5 |
| BLAKE2b-256 | 056673034ad30b59f13439b75e620989dacba4c047256e358ba7c2e9ec98ea22 |
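The digests listed above can be checked against a downloaded file using only the Python standard library. The sketch below is one possible way to do it; the helper names `sha256_of_file` and `matches_digest` are hypothetical and the file path in the usage note is assumed to be wherever the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_digest("datasets-5.0.0-py3-none-any.whl", "7dd34927a0fd7046e98aad5cb9430e699c373238a15befa7b9bf22b991a7fee6")` should return `True` for an intact copy of the wheel saved under that name.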
