When training a model – you will need Training, Validation, and Holdout Datasets

Understanding why you need 3 separate sets of data to build a model

Aug 21, 2021

Photo by Amirali Mirhashemian taken from Unsplash

Introduction

When I first started building machine learning models, I trained them on two sets of data, a training dataset and a validation dataset, using the common splitting rule (80% for training data, 20% for validation data). However, once a model was deployed and applied to new data, its performance began to degrade. One of the reasons this happens is that the model was never evaluated on a holdout dataset. A holdout dataset is kept out of training and tuning entirely, so it gives the final, unbiased check of model performance before deployment.

In this article, let’s look at why we need different sets of data when developing a machine learning model, including the function and importance of each one: the training dataset, the validation dataset, and the holdout dataset.

Partitioning of Data

Before you start building a supervised machine learning model, you need to partition your data. The objective is to set aside subsets of data that the model does not learn from, so that they can be used to verify its performance. These are the 3 sets of data you need:

(1) Training dataset

(2) Validation dataset

(3) Holdout dataset (also known as the test dataset)
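As a rough illustration of this partitioning, here is a minimal sketch in plain Python. The function name `three_way_split`, the default fractions, and the seed are my own placeholders rather than anything prescribed; the idea is simply to shuffle once and cut the shuffled data into three non-overlapping pieces.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and cut them into training, validation, and holdout lists.

    The fractions and seed are illustrative defaults, not recommendations.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a holdout set")
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]      # whatever remains after the first two cuts
    return train, val, holdout

train, val, holdout = three_way_split(range(100))
print(len(train), len(val), len(holdout))
```

Because the three lists are slices of one shuffled copy, every row lands in exactly one set.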

What is a Training Dataset and Validation Dataset?

Training dataset & Validation dataset (Image by Author)

The training dataset is the set of data used to train the model, and it is usually the largest of the three. This is the data the model learns from: during training, the model’s parameters are fitted repeatedly to the training dataset so that it picks up the behavior and patterns in it.

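The validation dataset is used to evaluate the model and fine-tune its hyperparameters during the training process. The model’s performance and accuracy are measured on this set of data, but the model does not learn from it directly. Because the hyperparameters are chosen to score well on the validation data, the validation score tends to be optimistic, which is why a third set of data is needed.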
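To make the division of labor concrete, here is a small sketch using a toy k-nearest-neighbours classifier written in plain Python. Everything in it is an assumption for illustration: the synthetic data, the helper names (`knn_predict`, `accuracy`, `make_rows`), and the candidate values of `k`. The training rows are the only data the classifier looks up neighbours in, while the validation rows are used only to rank the candidate hyperparameters.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def accuracy(train, rows, k):
    """Fraction of rows whose label is predicted correctly from the training data."""
    hits = sum(knn_predict(train, x, k) == label for x, label in rows)
    return hits / len(rows)

rng = random.Random(0)

def make_rows(n):
    """Made-up data: label is 1 when x > 0.5, with about 10% of labels flipped as noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        rows.append((x, label))
    return rows

train, val = make_rows(200), make_rows(50)

# Score each candidate k on the validation rows; odd values avoid tied votes.
scores = {k: accuracy(train, val, k) for k in (1, 3, 5, 9, 15)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Note that `best_k` was picked because it did well on `val`, so its validation accuracy is no longer a neutral measurement of the final model.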

What is a Holdout Dataset?

Adding a Holdout dataset (Image by Author)

The holdout dataset is not used in the model training process, and its purpose is to provide an unbiased estimate of the model performance after training is complete. This set of data should only be used once the model has finished training and tuning with the training and validation datasets. The holdout dataset plays an important role because it shows whether the model generalizes well to unseen data. It is therefore important that the holdout dataset does not contain any records from the training or validation datasets; otherwise the estimate will be inflated.
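One simple way to check for this kind of overlap is sketched below. The helper `find_leaks` and the sample rows are hypothetical, and the sketch assumes each record is a hashable tuple and that an exact match counts as a duplicate; real data may need a key column or fuzzy matching instead.

```python
def find_leaks(holdout, *other_sets):
    """Return the holdout rows that also appear in any of the other datasets."""
    seen = set()
    for rows in other_sets:
        seen.update(rows)
    return [row for row in holdout if row in seen]

# Made-up rows of (feature, label)
train = [(5.1, "a"), (4.9, "b"), (6.2, "a")]
val = [(5.9, "b")]
holdout = [(6.7, "a"), (4.9, "b")]   # the second row duplicates a training row

leaks = find_leaks(holdout, train, val)
print(leaks)   # [(4.9, 'b')]
```

An empty result is what you want to see before trusting the holdout score.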

In addition, the accuracy of the model on the holdout dataset should be compared with its accuracy on the training dataset to check that the model is not overfitting. If the training accuracy is significantly better than the holdout accuracy, this is an indication that the model might be overfitting.
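The comparison can be written down in a few lines. In this sketch the labels and predictions are invented, and the `max_gap` threshold of 0.10 is an arbitrary assumption on my part: what counts as "significantly better" depends on the problem and on how noisy the scores are.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def looks_overfit(train_acc, holdout_acc, max_gap=0.10):
    """Flag a model whose training accuracy exceeds holdout accuracy by more than max_gap."""
    return (train_acc - holdout_acc) > max_gap

# Made-up labels and predictions: perfect on training data, 6 of 10 correct on holdout
train_acc = accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], [1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
holdout_acc = accuracy([1, 0, 1, 0, 1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0, 1, 0, 0, 1])

print(train_acc, holdout_acc, looks_overfit(train_acc, holdout_acc))   # 1.0 0.6 True
```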

Configuring Split Ratio

The common split ratio for Training, Validation and Holdout dataset (Image by Author)

A commonly used split ratio is 60:20:20 (60% for training data, 20% for validation data, and 20% for holdout data); 50:25:25 is another. However, the right ratio also depends on the size and type of data used. It is important to ensure that the data is well partitioned, with each set containing the patterns or trends of the original data. Otherwise we might end up selecting a model that is biased toward the patterns or trends that happen to appear in the validation data.
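For classification data, one way to keep the patterns of the original data in every set is to split each class separately, often called a stratified split. The sketch below is an assumption-laden illustration, not a prescribed method: `stratified_split` and `label_of` are hypothetical names, the data is made up, and the 60:20:20 ratio is just the one mentioned above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, label_of, train_frac=0.6, val_frac=0.2, seed=0):
    """Split rows three ways while keeping each label's share similar in every set."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    rng = random.Random(seed)
    train, val, holdout = [], [], []
    for group in by_label.values():
        rng.shuffle(group)                        # shuffle within each class
        n_train = round(len(group) * train_frac)
        n_val = round(len(group) * val_frac)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        holdout += group[n_train + n_val:]        # remainder of this class
    return train, val, holdout

# Made-up imbalanced data: 80 rows of class "neg" and 20 rows of class "pos"
rows = [(i, "neg") for i in range(80)] + [(i, "pos") for i in range(80, 100)]
train, val, holdout = stratified_split(rows, label_of=lambda r: r[1])

for name, part in (("train", train), ("val", val), ("holdout", holdout)):
    print(name, len(part), Counter(label for _, label in part))
```

With this toy data each set keeps the original 4:1 ratio of "neg" to "pos", which a purely random split of a small dataset would not guarantee.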

Conclusion

This short article summarizes the importance of splitting your data into three different sets: a training dataset, a validation dataset, and a holdout dataset. The holdout dataset serves as the final estimate of model performance and should only be used after the model has finished training and has been tuned on the validation dataset.

Thanks for reading this article. I hope it is useful for anyone out there.

Written By

Sue Lynn


URL: https://towardsdatascience.com/when-training-a-model-you-will-need-training-validation-and-holdout-datasets-7566b2eaad80/