Overfitting in ML: Avoiding the Pitfalls

Exploring the Causes and Solutions for Overfitting in Machine Learning Models

Dec 12, 2022

6 min read

Overfitting is a common problem in machine learning. It occurs when a model fits the training dataset so closely that it learns details specific to that data, which do not generalize and cause poor performance on new, unseen data. Overfitting can happen for a variety of reasons, but the result is the same: a model that cannot make accurate predictions on data it has not seen before.

In this blog post, we will explore the causes of overfitting, the ways in which it can be prevented, and some strategies for dealing with overfitting if it occurs.

Causes of Overfitting

We will talk about two of the main causes of overfitting in this article: the model is overly complex, and training is run for too long. In fact, overfitting is most prevalent when both of these situations occur together!

The Model Has Too Many Parameters

One of the most common causes of overfitting is having too many parameters in a model relative to the amount of training data available. When a model has a lot of parameters, it can easily learn patterns that are specific to the training data, which can result in incredible performance on that data. However, when training performance seems too good to be true, it often is!

If the model has learned specific details within the training data, it may not be able to generalize well and make accurate predictions when it encounters new data. This is because the model has essentially memorized the training data, rather than learning the underlying patterns and relationships that are relevant for making predictions.

Example

Suppose you have a dataset of 100 houses, with their sizes, number of bedrooms, locations, and prices. You decide to train a complex model with many parameters, such as a deep neural network, on this dataset to predict the price of each house.

Photo by Breno Assis on Unsplash

After training the model, you evaluate its performance on the training data and find that it can predict the prices of the houses in the training set with very high accuracy. For example, the model might have an average error of only $10,000 on the training data. This might lead you to believe that the model is very good and can be used to make accurate predictions about new houses.

But when you try to use the model to make predictions on new houses, you find that it performs poorly. For example, it might have an average error of $100,000 on new houses. This is an example of overfitting: the model has so many parameters that it can learn specific patterns in the training data that do not generalize to new data.

In this example, one way to avoid overfitting would be to use a simpler model with fewer parameters. Alternatively, you could try to collect more training data so that the model has more examples to learn from. Both of these potential solutions can help the model to learn more general patterns that can be applied to new data.
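The gap between training error and new-data error can be reproduced with a small sketch. Everything here is an assumption made for illustration: the houses are synthetic (price is a linear function of size plus random noise), and a nearest-neighbour lookup that stores every training example stands in for an over-parameterized model such as the deep network in the example. It is compared against a two-parameter straight line.

```python
import random

def make_houses(n, rng):
    """Synthetic (size, price) pairs: price is linear in size plus noise."""
    houses = []
    for _ in range(n):
        size = rng.uniform(50, 250)
        price = 3000 * size + rng.gauss(0, 50_000)
        houses.append((size, price))
    return houses

def fit_linear(train):
    """Simple model: least-squares straight line (two parameters)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in train)
    variance = sum((x - mean_x) ** 2 for x, _ in train)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(train):
    """Stand-in for an overly flexible model: stores every training house
    and predicts the price of the one closest in size."""
    return lambda x: min(train, key=lambda house: abs(house[0] - x))[1]

def mean_abs_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

rng = random.Random(0)
train = make_houses(100, rng)      # the 100 houses we train on
new = make_houses(1000, rng)       # houses the models have never seen

results = {}
for name, fit in [("linear", fit_linear), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mean_abs_error(model, train), mean_abs_error(model, new))
    train_err, new_err = results[name]
    print(f"{name:10s} train error: {train_err:>10,.0f}   new-data error: {new_err:>10,.0f}")
```

The memorizer reproduces every training price exactly, so its training error is zero, yet it does worse than the straight line on the new houses because it has also memorized the noise.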

Overtraining

Another cause of overfitting is training a model for too long. With too much training, the model can begin to over-specialize and learn the specific patterns in the training data, rather than the general patterns that are relevant for making accurate predictions. This can lead to poor performance on new, unseen data.

Photo by Karsten Winegeart on Unsplash

Example of Overtraining

Let's continue with our example from above, where we are training a model to predict the price of a house. You train the model on the dataset of 100 houses and evaluate its performance after each training epoch. Initially, the model has a high error rate on the training data, but as you train it for more epochs, the error rate decreases and the model starts to perform well on the training data.

However, if you continue to train the model for too many epochs, it will eventually start to overfit to the training data. This means that it will learn patterns in the training data that do not generalize to new data, and will therefore perform poorly on new houses.

In this example, one way to avoid overfitting would be to use a validation dataset to evaluate the model during training. This validation dataset should be separate from the training dataset, and should be used to evaluate the model’s performance on new data and decide on hyperparameters such as the number of training epochs. If the model’s error rate on the validation dataset starts to increase while the error rate on the training dataset continues to decrease, this is a sign of overfitting. You can then stop training the model at this point to avoid overfitting.
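The stopping rule described above is often called early stopping, and a minimal sketch of it might look like the following. The function name, the `patience` parameter (how many epochs without improvement to tolerate), and the two callbacks are all hypothetical names chosen for this illustration. The validation errors in the demo are made-up numbers, not results from a real model.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=3):
    """Train until the validation error has not improved for `patience` epochs.

    train_one_epoch:  callable that runs one epoch of training.
    validation_error: callable that returns the current validation error.
    Returns (best_epoch, best_error, last_epoch_run).
    """
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            # Validation error is still falling: remember this epoch.
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            # Validation error went up or stalled: a possible sign of overfitting.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error, epoch

# Demo with a made-up validation curve that falls and then rises again.
made_up_val_errors = iter([90, 70, 55, 48, 45, 46, 49, 53, 60, 70])
best_epoch, best_error, last_epoch = train_with_early_stopping(
    train_one_epoch=lambda: None,                     # placeholder for real training
    validation_error=lambda: next(made_up_val_errors),
    max_epochs=10,
    patience=3,
)
print(f"best epoch: {best_epoch}, validation error: {best_error}, stopped after epoch {last_epoch}")
```

In a real training loop you would typically also save a copy of the model at the best epoch, so that the weights from before the validation error started rising can be restored.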

Preventing Overfitting

Photo by Kai Pilger on Unsplash

One way to prevent overfitting is to use regularization. Regularization is a technique that adds a penalty to the model's training objective for having too many parameters, or for having parameters with large values. This penalty encourages the model to learn only the most important patterns in the data, which can help to prevent overfitting.
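As a rough illustration of a penalty on large parameter values, the sketch below fits a small linear model by gradient descent with and without an L2 penalty (often called ridge regression). The L2 penalty is just one common choice assumed here, and the four data points, the learning rate and the penalty strength `lam` are arbitrary values picked for the demo.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def fit(data, lam, lr=0.1, epochs=500):
    """Gradient descent on: mean squared error + lam * sum(w ** 2)."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        # Gradient of the penalty term: pulls every weight toward zero.
        grads = [2 * lam * w for w in weights]
        # Gradient of the mean squared error term.
        for features, target in data:
            error = predict(weights, features) - target
            for j, x in enumerate(features):
                grads[j] += 2 * error * x / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Tiny hand-made dataset: ((feature_1, feature_2), target)
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.5), ((1.0, -1.0), 3.0)]

w_plain = fit(data, lam=0.0)   # no regularization
w_reg = fit(data, lam=1.0)     # with an L2 penalty

print("without penalty:", [round(w, 3) for w in w_plain])
print("with penalty:   ", [round(w, 3) for w in w_reg])
```

The penalized weights come out smaller in magnitude. A larger `lam` shrinks them further, trading a slightly worse fit on the training data for a model that is less able to chase noise.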

Regularization – Check out this article for more detail!

Another way to prevent overfitting is to use cross-validation. In k-fold cross-validation, the training data is split into several subsets (folds). The model is trained on all but one of the folds and evaluated on the held-out fold, and this is repeated so that each fold is held out once. This allows the model to be trained and evaluated multiple times, which can help to identify and prevent overfitting. However, cross-validation can be computationally expensive, especially for large datasets, since it involves training the model multiple times.

Simplifying a model by reducing the number of parameters, or by using a generally less complex model, can also help to prevent overfitting. Generally, a model with fewer parameters is less likely to overfit. However, there is a balance that must be found here, since the model needs to be complex enough to capture the patterns of interest in the data.
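A minimal sketch of k-fold cross-validation, written with only the standard library, might look like this. The helper names are hypothetical, the data is a toy noiseless line, and the "model" simply predicts the mean training target, chosen only to keep the example short.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

def cross_validate(data, fit, error, k=5):
    """Train on k-1 folds, evaluate on the held-out fold, repeat k times."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train_idx])
        scores.append(error(model, [data[i] for i in val_idx]))
    return scores

def fit_mean(train):
    """Toy model: always predicts the mean target seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def mean_abs_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

data = [(x, 2 * x + 1) for x in range(20)]     # toy dataset
scores = cross_validate(data, fit_mean, mean_abs_error, k=5)
print("error per fold:", [round(s, 2) for s in scores])
print("mean error:", round(sum(scores) / len(scores), 2))
```

Each data point is used for validation exactly once, so the average of the fold scores uses all of the data while never evaluating a model on examples it was trained on.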

Another approach is to use ensemble learning, which involves training multiple models and combining their predictions. This can help to reduce the overfitting that may occur in individual models, and can lead to better overall performance.
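One simple form of ensemble learning is bagging: train several models on random resamples of the training data and average their predictions. The sketch below is only an illustration under assumed conditions. The houses are synthetic, and each individual model is a deliberately overfit nearest-neighbour lookup, chosen so that the effect of averaging is visible.

```python
import random

def make_houses(n, rng):
    """Synthetic (size, price) pairs: price is linear in size plus noise."""
    return [(size, 3000 * size + rng.gauss(0, 50_000))
            for size in (rng.uniform(50, 250) for _ in range(n))]

def fit_memorizer(train):
    """Overfit-prone model: predicts the price of the closest training house."""
    return lambda x: min(train, key=lambda house: abs(house[0] - x))[1]

def fit_bagged(train, n_models, rng):
    """Train each model on a bootstrap resample and average the predictions."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(train) for _ in range(len(train))]
        models.append(fit_memorizer(bootstrap))
    ensemble = lambda x: sum(m(x) for m in models) / len(models)
    return ensemble, models

def mean_abs_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

rng = random.Random(0)
train = make_houses(100, rng)
new = make_houses(300, rng)

ensemble, models = fit_bagged(train, n_models=10, rng=rng)
individual_errors = [mean_abs_error(m, new) for m in models]
ensemble_error = mean_abs_error(ensemble, new)

print(f"average error of individual models: {sum(individual_errors) / len(individual_errors):,.0f}")
print(f"error of the averaged ensemble:     {ensemble_error:,.0f}")
```

Because averaging predictions smooths out the mistakes that differ from model to model, the ensemble's error here can be no larger than the average error of its members.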

Finally, it can be helpful to gather more training data. In many cases, this can be difficult, time-consuming or expensive, but if possible, gathering more data for training is always a good idea! More data allows a model to learn more general patterns and relationships, which can improve its ability to make accurate predictions on unseen data.

Conclusion

In conclusion, overfitting is a common problem in machine learning that can occur when a complex model is trained for too long on a training dataset. Overfitting can be prevented by using regularization and cross-validation, and can be addressed by simplifying the model, using ensemble learning, or gathering more training data. By understanding and addressing overfitting, it is possible to improve the performance of machine learning models and make more accurate predictions on new data.

Written By

Rian Dolphin

Artificial Intelligence, Data Science, Machine Learning, ML

URL: https://towardsdatascience.com/overfitting-in-ml-avoiding-the-pitfalls-d5225b7118d/