PyTorch's DataLoader is a powerful tool for efficiently loading and processing data for training deep learning models. It provides functionalities for batching, shuffling, and processing data, making it easier to work with large datasets. In this article, we'll explore how PyTorch's DataLoader works and how you can use it to streamline your data pipeline.
PyTorch's DataLoader is a utility class designed to simplify loading and iterating over datasets while training deep learning models. It offers several options for controlling iteration, such as batching, shuffling, and processing data. To use the DataLoader in PyTorch, import the class with the following code:
from torch.utils.data import Dataset, DataLoader
Batching, shuffling, and processing are used during data preparation to improve the stability, efficiency, and generalization of the model. Let's look at the importance of each separately.
Batching is the process of grouping data samples into smaller chunks (batches) for efficient training. Automatic batching is the default behavior of DataLoader. When batch_size is specified, the DataLoader automatically collates individually fetched data samples into batches, typically with the first dimension being the batch dimension.
When you create a DataLoader, specify batch_size to get mini-batches. During training, the DataLoader splits your dataset into mini-batches of that size. Each batch contains multiple data points (e.g., images, text samples). The DataLoader returns the batched data (input features and labels) to the training loop.
batch_size (int, optional) -> how many samples per batch to load (default: 1).
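The listing that produced the output shown next is not present in this copy. A minimal sketch that yields batches of the same shape, assuming synthetic random tensors in place of a real image dataset (the sample count of 224, the 3 x 64 x 64 image size, and the ten classes are chosen only to match that output; the printed labels will differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only so repeated runs print the same labels

# Hypothetical stand-in data: 224 random "images" (3 x 64 x 64), labels 0-9.
images = torch.randn(224, 3, 64, 64)
labels = torch.randint(0, 10, (224,))
dataset = TensorDataset(images, labels)

# batch_size=32 makes the DataLoader collate 32 samples into each batch.
loader = DataLoader(dataset, batch_size=32)

for batch_images, batch_labels in loader:
    print(f"Batch shape: {batch_images.shape}, Labels: {batch_labels}")
```

With 224 samples and batch_size=32, the loop runs seven times, once per full batch.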
Output:
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([0, 3, 7, 1, 4, 2, 9, 7, 4, 7, 1, 8, 3, 4, 6, 3, 8, 7, 2, 8, 3, 2, 4, 9,
6, 3, 8, 0, 0, 1, 3, 0])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([1, 1, 5, 5, 4, 6, 1, 1, 2, 3, 8, 3, 7, 0, 6, 3, 1, 7, 7, 9, 4, 0, 8, 0,
7, 4, 8, 1, 0, 6, 2, 5])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([8, 2, 6, 2, 8, 2, 0, 0, 4, 4, 9, 4, 8, 5, 2, 7, 6, 5, 0, 5, 4, 3, 9, 8,
8, 8, 1, 7, 7, 1, 7, 9])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([6, 6, 9, 3, 7, 2, 3, 6, 9, 4, 7, 8, 7, 7, 9, 4, 8, 3, 9, 9, 9, 9, 4, 2,
0, 0, 5, 4, 5, 7, 5, 7])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([3, 0, 2, 3, 7, 8, 0, 6, 6, 9, 1, 4, 6, 2, 7, 9, 1, 2, 9, 8, 4, 6, 2, 3,
7, 3, 5, 3, 6, 7, 3, 1])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([8, 8, 0, 4, 9, 1, 5, 4, 9, 4, 7, 5, 2, 2, 1, 6, 0, 3, 1, 1, 1, 8, 8, 5,
0, 6, 3, 3, 9, 3, 4, 7])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([7, 4, 9, 6, 2, 1, 0, 0, 1, 1, 3, 3, 9, 3, 5, 7, 0, 2, 3, 5, 9, 2, 3, 0,
9, 9, 0, 1, 9, 0, 3, 2])
The output above shows batches of 32 images, each with 3 channels and a size of 64x64 pixels. The batch_labels tensor has shape (32,) and contains the class label of each image in the batch.
Smaller batch sizes give noisier gradient updates, which can make training less stable. Larger batch sizes give smoother gradient estimates and make better use of hardware, but excessively large batches need more memory and can slow convergence. It is therefore worth experimenting with different batch sizes and monitoring training performance to find a good value. Batch sizes such as 32 or 64 are common starting points.
Shuffling is the random reordering of data samples at every epoch; it is used to improve model performance and generalization. Setting shuffle=True makes the DataLoader use a random sampler internally.
When shuffle=True, the DataLoader randomly rearranges the data at the start of each epoch. The DataLoader returns the batched data (input features and labels) to the training loop.
For example, loading the built-in MNIST dataset with shuffle=True ensures that the model encounters a diverse mix of samples in each batch.
shuffle (bool, optional) -> set to True to have the data reshuffled at every epoch (default: False).
To see the difference, we'll use a dataset of integers from 0 to 99 as our data points for simplicity. The goal here is not to train a real model but to observe how the order of data points changes with and without shuffling.
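The comparison just described can be sketched as follows. This is a reconstruction under assumptions, not the original listing: the batch size of 10 is inferred from the output, and wrapping the integers in a TensorDataset is one convenient choice among several.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Integers 0..99 stand in for real samples.
data = torch.arange(100)
dataset = TensorDataset(data)

shuffled_loader = DataLoader(dataset, batch_size=10, shuffle=True)
ordered_loader = DataLoader(dataset, batch_size=10, shuffle=False)

# TensorDataset yields 1-tuples, so each batch is a list holding one tensor.
(first_shuffled,) = next(iter(shuffled_loader))
(first_ordered,) = next(iter(ordered_loader))

print("First batch with shuffle=True:", first_shuffled)
print("First batch with shuffle=False:", first_ordered)
```

The shuffled batch changes from run to run unless a seed is set; the unshuffled batch is always the first ten integers.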
Output:
First batch with shuffle=True: tensor([53, 0, 56, 3, 92, 49, 72, 79, 64, 47])
First batch with shuffle=False: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Shuffling can also be achieved using sampler classes. Samplers provide flexibility in how you shuffle your dataset and can be chosen based on your specific requirements. A few samplers are described below for the built-in MNIST image dataset.
RandomSampler samples elements from the dataset at random and, by default, without replacement, so each example is drawn exactly once per epoch.
The built-in RandomSampler can be used to randomly shuffle the entire dataset before each epoch.
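A minimal sketch of passing a RandomSampler to a DataLoader. It assumes a small tensor dataset in place of MNIST so that it runs without a download; the dataset size and batch size are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(20))

# replacement=False (the default) visits every index exactly once per epoch.
sampler = RandomSampler(dataset)

# shuffle must stay at its default (False) when a sampler is passed.
loader = DataLoader(dataset, batch_size=5, sampler=sampler)

for epoch in range(2):
    order = [int(i) for (batch,) in loader for i in batch]
    print(f"Epoch {epoch}: {order}")
```

Each epoch should print a different permutation of the same 20 values.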
SequentialSampler samples elements from the dataset sequentially, always in the same order. It does not shuffle between epochs, and it is what the DataLoader uses when shuffle=False.
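A small check, assuming the sampler meant here is PyTorch's built-in SequentialSampler and using a toy tensor dataset, prints the index order over two epochs; the order should be identical each time:

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
loader = DataLoader(dataset, batch_size=5, sampler=SequentialSampler(dataset))

for epoch in range(2):
    # Flatten the batches back into a single list of sample values.
    order = [int(i) for (batch,) in loader for i in batch]
    print(f"Epoch {epoch}: {order}")
```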
A custom sampler is used to implement more complex shuffling strategies or sampling schemes based on specific requirements. A simple example is a custom sampler that randomly selects half of the dataset without replacement.
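One way such a half-dataset sampler might look is sketched here. The class name HalfRandomSampler and the toy dataset are hypothetical; the only PyTorch requirement assumed is that a sampler subclasses Sampler and implements __iter__ and __len__.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class HalfRandomSampler(Sampler):
    """Hypothetical sampler: yields a random half of the indices each epoch."""

    def __init__(self, data_source):
        self.data_source = data_source
        self.num_samples = len(data_source) // 2

    def __iter__(self):
        # randperm gives a permutation, so the first half has no repeats.
        perm = torch.randperm(len(self.data_source))
        return iter(perm[: self.num_samples].tolist())

    def __len__(self):
        return self.num_samples


dataset = TensorDataset(torch.arange(20))
loader = DataLoader(dataset, batch_size=5, sampler=HalfRandomSampler(dataset))

for (batch,) in loader:
    print(batch)
```

Because a fresh permutation is drawn on every call to __iter__, a different half is selected each epoch.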
The DataLoader uses single-process data loading by default. In this mode, data fetching is done in the same process in which the DataLoader is initialized, so data loading may block computation. Single-process loading may nevertheless be preferred when the resources used for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely into memory. Processing here means applying transformations to the data, such as resizing images, normalizing pixel values, or any other preprocessing step.
Output:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:03<00:00, 55893335.47it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Let's see the processing done in the above code,
The Dataset class in PyTorch plays a pivotal role in data handling and preprocessing, serving as a foundational building block for loading and organizing data in a way that is efficient and scalable for training deep learning models. Customizing data transformations within a Dataset class allows for flexible and dynamic data preprocessing, tailored specifically to the needs of a given model.
By implementing two essential methods, __len__ (to return the size of the dataset) and __getitem__ (to support the indexing such that dataset[i] can be used to get the ith sample), a custom Dataset can be created for virtually any data source. Then Dataset instances can be seamlessly used with a DataLoader, which can provide batch loading of data, optional shuffling, and parallel data loading using multiprocessing workers, thereby simplifying the data feeding process into the model.
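A minimal sketch of a custom Dataset implementing the two methods. The class name SquaresDataset, its synthetic data, and the optional transform argument are assumptions made for illustration only.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i squared)."""

    def __init__(self, n, transform=None):
        self.n = n
        self.transform = transform  # optional callable applied to each input

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor(float(idx))
        y = torch.tensor(float(idx * idx))
        if self.transform is not None:
            x = self.transform(x)
        return x, y


dataset = SquaresDataset(10, transform=lambda x: x / 10.0)
loader = DataLoader(dataset, batch_size=4)

for xs, ys in loader:
    print(xs, ys)
```

The DataLoader only relies on len(dataset) and dataset[i], so the same pattern applies to files on disk, database rows, or any other indexable source.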
Transformations are operations applied to your data before it's fed into your model. Common transformations include:
transforms.Compose takes a list of transformations and combines them into a single operation. This composite transformation can then be passed to your Dataset class and applied within the __getitem__ method. The ability to compose transformations makes your data preprocessing pipeline both more manageable and modular, as you can easily adjust or extend the sequence of transformations as needed.
You can infer more from the colab notebook provided in the conclusion.
A collate function customizes how individual samples are combined into batches within a DataLoader. It is useful when dealing with variable-sized input data or when additional processing is required at the batch level.
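A sketch of a padding collate function for variable-length sequences. The toy samples and the name pad_collate are assumptions, chosen so that the batches match the output shown next; zero is assumed as the padding value.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: (variable-length sequence, label) pairs.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6, 7, 8, 9]), 0),
]


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)


# A plain list works as a map-style dataset: it has __len__ and __getitem__.
loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)

for inputs, labels in loader:
    print("Batch Inputs:", inputs)
    print("Batch Labels:", labels)
```

Without a custom collate_fn, the default collation would fail on the first batch because the two sequences have different lengths.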
Output:
Batch Inputs: tensor([[1, 2, 3],
[4, 5, 0]])
Batch Labels: tensor([0, 1])
Batch Inputs: tensor([[6, 7, 8, 9]])
Batch Labels: tensor([0])
The DataLoader class lets you specify the number of worker subprocesses with the num_workers parameter. Loading multiple samples concurrently can significantly speed up data loading, especially when loading involves heavy I/O such as reading from disk or fetching data over the network.
You can adjust num_workers based on your hardware capabilities and the specific requirements of your dataset to achieve optimal performance. Profiling tools such as torch.utils.bottleneck can also help identify bottlenecks in your data loading pipeline so that you can optimize accordingly.
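A minimal sketch of setting num_workers. RangeDataset and count_samples are hypothetical names, and the tiny dataset is only a stand-in; a real speedup would only be visible when fetching each sample is expensive.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RangeDataset(Dataset):
    """Hypothetical dataset standing in for one with costly per-sample loading."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.tensor(idx)


def count_samples(num_workers):
    # num_workers=0 loads in the main process; >0 starts that many subprocesses.
    loader = DataLoader(RangeDataset(64), batch_size=8, num_workers=num_workers)
    return sum(batch.numel() for batch in loader)


# The guard matters on platforms that start workers by re-importing this module.
if __name__ == "__main__":
    for workers in (0, 2):
        print(f"num_workers={workers}: loaded {count_samples(workers)} samples")
```

A reasonable way to tune the value is to time one pass over the loader for a few settings and keep the smallest one after which throughput stops improving.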
The DataLoader significantly affects training quality. Batching, shuffling, and preprocessing are essential to building a robust model and an efficient deep learning pipeline.