Hugging Face Dataset Hub

Last Updated : 9 May, 2026

Hugging Face Dataset Hub is a platform that hosts an extensive collection of datasets for natural language processing (NLP) tasks and other machine learning domains like computer vision and speech recognition. It serves as a centralized repository where we can discover, download and use datasets for various ML applications.

The Hub includes datasets for a wide range of tasks such as text classification, question answering, image captioning and more.
Datasets are accessible through the datasets library, which we can install and use in just a few lines of code.
The platform encourages collaboration: anyone can share datasets and improvements, which builds a rich ecosystem of publicly available resources.
Datasets on the Hub are often paired with pre-trained models, allowing us to fine-tune models with minimal setup.

Accessing and Using Datasets

To access a dataset from the Hugging Face Dataset Hub, we first install the datasets library.

pip install datasets

1. Loading a Dataset

Once the library is installed, we can load any available dataset with a simple line of code. For example, we will load the IMDB dataset which is frequently used for sentiment analysis.

load_dataset("imdb"): Loads the "imdb" dataset from the Hugging Face Dataset Hub.
dataset["train"][0]: Accesses the first example from the training split of the dataset.
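The two calls described above can be combined as in the following minimal sketch. The helper name first_training_example is our own and not part of the library, and running it requires network access because IMDB is downloaded on first use.

```python
def first_training_example(dataset_name="imdb"):
    """Load a dataset from the Hub and return the first row of its train split."""
    # Imported here (not at module level) so the helper can be defined
    # before `pip install datasets` has been run.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name)  # downloads on first use, then reads the local cache
    return dataset["train"][0]            # one example as a dict: column name -> value


# Usage (needs network access):
# example = first_training_example()
# print(example["text"][:200], example["label"])
```

For IMDB, the returned dictionary has a text field with the review and a label field with the sentiment class.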

Output:

[Figure: Loading a Dataset]

2. Exploring the Dataset

The Hugging Face datasets library provides useful methods to explore the loaded datasets. We can check the dataset structure, see the number of entries and access specific splits such as train, test and validation.

print(dataset): Displays the structure of the entire dataset, showing its available splits (for IMDB: train, test and unsupervised) along with their columns and row counts.
print(dataset["test"][0:5]): Displays the first 5 examples from the "test" split of the dataset.
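A short sketch of this exploration step follows. The helper names split_sizes and explore are our own. split_sizes only assumes that the loaded object behaves like a dictionary of splits, which is how load_dataset returns a dataset with several splits.

```python
def split_sizes(dataset):
    """Return {split_name: number_of_rows} for a dict-like collection of splits."""
    return {name: len(split) for name, split in dataset.items()}


def explore(dataset_name="imdb"):
    """Print the structure of a Hub dataset and the first rows of its test split."""
    # Imported here so the helpers above work even without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name)
    print(dataset)               # splits, column names and row counts
    print(split_sizes(dataset))  # row count per split as a plain dict
    print(dataset["test"][0:5])  # slicing returns a dict: column name -> list of 5 values
    return dataset


# Usage (needs network access):
# explore("imdb")
```

Note that slicing a split returns one list per column rather than a list of rows.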

Output:

[Figure: Dataset Exploration]

Popular Datasets on Hugging Face Dataset Hub

The Hugging Face Dataset Hub is home to a variety of datasets across different domains. Some of the most popular datasets include:

IMDB: A dataset commonly used for sentiment analysis.
SQuAD (Stanford Question Answering Dataset): A dataset for machine reading comprehension tasks.
COCO (Common Objects in Context): A dataset used for image captioning and object detection.
LibriSpeech: A speech dataset for automatic speech recognition (ASR) tasks.

These datasets can be loaded through the datasets library and used directly for model training and fine-tuning.

Creating and Uploading Your Own Dataset

Hugging Face Dataset Hub also enables us to upload and share our own datasets. Here’s how we can contribute to the platform.

1. Preparing our Dataset

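Before uploading, ensure that our dataset is stored in a supported file format (e.g., CSV, JSON, Parquet). Each dataset should also include metadata that describes its content and how it should be used. On the Hub this lives in the dataset card, a README.md file at the root of the repository.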
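As an illustration of the formatting step, the sketch below writes a few records to a CSV file with a header row and then checks that the datasets library can read the file back. The function names, column names and file name are hypothetical choices made for this example.

```python
import csv
from pathlib import Path


def write_csv_dataset(rows, path):
    """Write a list of dicts (all with the same keys) to a CSV file with a header row."""
    path = Path(path)
    fieldnames = list(rows[0].keys())
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_local_csv(path):
    """Load a local CSV file with the datasets library to verify its format."""
    # Imported here so write_csv_dataset works without the library installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=str(path))  # rows land in a "train" split


# Usage:
# rows = [{"text": "great film", "label": 1}, {"text": "dull plot", "label": 0}]
# path = write_csv_dataset(rows, "train.csv")
# print(load_local_csv(path))
```

Loading the file locally before uploading is a cheap way to catch formatting problems early.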

2. Uploading to the Hub

To upload a dataset, we need the huggingface_hub library, which facilitates interaction with the Hugging Face Hub. We can install it using:

pip install huggingface_hub

3. Logging in to Hugging Face

Once the library is installed, we log in to our Hugging Face account from the terminal. The command asks for an access token, which can be created in the account settings and needs write permission to upload data.

huggingface-cli login

4. Enabling Git Large File Storage (LFS)

Large data files are stored on the Hub through Git LFS. With the Git LFS extension installed on our system, we enable it by running:

git lfs install

5. Cloning the Dataset Repository

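We then clone the repository for our dataset and place our dataset files inside it. The repository must already exist on the Hub (it can be created from the website), and OUR_DATASET stands for its path, typically in the form username/dataset_name. Use: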

git clone https://huggingface.co/datasets/OUR_DATASET

6. Pushing the Dataset to Hugging Face

From inside the cloned directory, we now commit and push our dataset files to the Hugging Face Hub.

git add .
git commit -m "Initial dataset upload"
git push

Now our dataset will be available on the Hugging Face Dataset Hub, ready for others to use.
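The huggingface_hub library installed in step 2 can also perform the upload directly from Python, without Git. The sketch below shows one possible way to do this, assuming we are already logged in. The function name upload_dataset_folder and the repository and folder names in the usage comment are placeholders.

```python
from pathlib import Path


def upload_dataset_folder(folder, repo_id, private=False):
    """Create a dataset repository if needed and upload every file in `folder` to it."""
    folder = Path(folder)
    if not folder.is_dir():
        # Fail early, before any network call is made.
        raise FileNotFoundError(f"Dataset folder not found: {folder}")

    # Imported here so the folder check above works without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token saved by `huggingface-cli login`
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    # Uploads the folder contents as a single commit and returns details of that commit.
    return api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="dataset")


# Usage (needs network access and a write token; names are placeholders):
# upload_dataset_folder("my_dataset_files", "your-username/your-dataset")
```

Passing repo_type="dataset" matters here, because the Hub otherwise treats the repository as a model repository.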

Advanced Features of Dataset Hub

The Hugging Face Dataset Hub provides advanced features that further enhance the usability and accessibility of datasets:

1. Dataset Versioning

Each dataset on the Hub lives in a Git repository, so every change is recorded as a commit and we can track changes over time. This ensures reproducibility and lets us pin a specific revision (a branch, tag or commit) of a dataset for model training.
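Pinning a version can be sketched with the revision argument of load_dataset, which accepts a branch name, a tag or a commit hash. The helper name load_at_revision is our own, and the commit hash in the usage comment is a placeholder to be replaced with a real value from the dataset's commit history.

```python
def load_at_revision(dataset_name, revision="main"):
    """Load a Hub dataset at a specific branch, tag or commit."""
    # Imported here so the helper can be defined without the library installed.
    from datasets import load_dataset

    return load_dataset(dataset_name, revision=revision)


# Usage (needs network access):
# latest = load_at_revision("imdb")                             # tip of the main branch
# pinned = load_at_revision("imdb", revision="<commit-hash>")   # placeholder hash
```

Recording the revision alongside training results makes it possible to reload exactly the same data later.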

2. Dataset Streaming

Hugging Face supports dataset streaming for datasets that are too large to download in full or to store locally. This feature allows us to stream data from the Hub and start iterating immediately, without downloading the entire dataset upfront. We will load the SQuAD dataset in streaming mode. SQuAD itself is modest in size, so it serves here only as a quick demonstration.

load_dataset("squad", streaming=True): Loads the "squad" dataset in streaming mode
for example in dataset["train"]: The loop iterates through the "train" split of the dataset.
break: Stops the loop after printing the first example
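The streaming loop described above can be written as in the following sketch. The helper names are our own. first_n is a small convenience built on itertools.islice that works on any iterable, including a streamed split.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of any iterable, reading no more than n items."""
    return list(islice(iterable, n))


def stream_first_example(dataset_name="squad", split="train"):
    """Stream a Hub dataset and print only its first example."""
    # Imported here so first_n works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, streaming=True)  # no full download
    for example in dataset[split]:  # examples are fetched lazily while iterating
        print(example)
        break                       # stop after the first example


# Usage (needs network access):
# stream_first_example()
# few = first_n(load_dataset("squad", streaming=True)["train"], 3)
```

Because a streamed split is an iterable rather than a table, it is consumed by looping over it instead of by indexing.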

Output:

[Figure: Dataset Streaming]

3. Dataset Splitting

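Datasets on the Hub usually come with predefined splits such as training, validation and test sets, and the datasets library lets us list and access each split individually. This is particularly useful when preparing data for model training. For a dataset that ships as a single split, the train_test_split method of a loaded dataset can create a held-out portion.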

dataset.keys(): Lists the available splits (e.g., 'train', 'validation') in the dataset.
dataset["train"]: Accesses the training split of the dataset.
dataset["validation"]: Accesses the validation split of the dataset.
take(n): Retrieves the first n examples from the dataset (in this case, 1 example).
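The calls listed above can be sketched as follows, assuming the dataset is loaded in streaming mode so that each split offers take(n). The helper name peek_splits is our own.

```python
def peek_splits(dataset_name="squad", n=1):
    """Return {split_name: first n examples} for a streamed Hub dataset."""
    # Imported here so the helper can be defined without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, streaming=True)
    print(list(dataset.keys()))  # available splits, e.g. ['train', 'validation'] for SQuAD
    # take(n) yields only the first n examples of a split.
    return {name: list(split.take(n)) for name, split in dataset.items()}


# Usage (needs network access):
# samples = peek_splits("squad", n=1)
# print(samples["train"][0])
# print(samples["validation"][0])
```

Peeking at one example per split is a quick check that all splits share the same fields before training starts.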

Output:

[Figure: Dataset Splitting]


URL: https://www.geeksforgeeks.org/artificial-intelligence/hugging-face-dataset-hub/