In machine learning, accurately assessing how well a model performs and whether it can handle new data is crucial. Yet with limited data or concerns about generalization, a single train/test evaluation may not be enough. That's where cross-validation steps in. It's a method that tests predictive models by repeatedly splitting the data, training on one part, and testing on another. Among these methods, K-Fold Cross-validation is a reliable and popular choice.
In this article, we'll look at the K-Fold cross-validation approach and how it helps to reduce overfitting in models.
Cross-validation is a method for evaluating a predictive model's effectiveness and capacity for generalization. The dataset is divided into subsets, the model is fitted to one part (the training set), and it is then assessed on the complementary part (the validation set). This is repeated for several rounds, each with a different split, and the performance scores are averaged across the rounds.
There are various approaches to cross-validation; K-Fold Cross-validation is one of the more well-known techniques.
K-Fold Cross-validation is a technique used in machine learning to assess the performance and generalizability of a model. The basic idea is to partition the dataset into "K" subsets (folds) of approximately equal size. The model is then trained K times, each time using K-1 folds for training and the remaining fold for validation, so that every fold serves as the validation set exactly once.
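The partitioning described above can be sketched in plain Python. This is only an illustration under simple assumptions: the helper name `k_fold_indices` is hypothetical, the folds are contiguous (no shuffling), and any remainder is spread over the first folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Everything outside the validation fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} val={val_idx}")
```

Each sample index appears in exactly one validation fold and in the training set of the other K-1 rounds.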
K-Fold Cross-validation helps in obtaining a more reliable estimate of a model's performance by reducing the impact of the specific data split on the evaluation. It is particularly useful when the dataset is limited or when there is a concern about the randomness of the data partitioning.
Common choices for K include 5, 10, or sometimes even higher values, depending on the size of the dataset and the computational resources available. In the extreme case where K equals the total number of samples in the dataset, it is called "Leave-One-Out Cross-validation" (LOOCV). However, LOOCV can be computationally expensive and might not be practical for large datasets.
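To make the LOOCV case concrete, the following sketch compares scikit-learn's `LeaveOneOut` splitter with `KFold` when the number of folds equals the number of samples. The six-sample array is a made-up stand-in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy dataset with six samples.
X = np.arange(6).reshape(-1, 1)

# Collect the (train, validation) index pairs from both splitters.
loo_splits = [(tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), va.tolist()) for tr, va in KFold(n_splits=len(X)).split(X)]

print(len(loo_splits))          # one round per sample
print(loo_splits == kf_splits)  # same splits as K-Fold with K = n
```

Because the model is refitted once per sample, the cost grows with the dataset size, which is why LOOCV is rarely used on large datasets.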
For k-fold cross-validation, the dataset D is divided at random into k partitions of equal size. For greater randomization, D may be shuffled before splitting. Typical values are k = 2, 5, or 10, with 10 the most common. For a dataset of 250 samples and k = 5, each fold contains 50 samples.
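The fold-size arithmetic for 250 samples and 5 folds can be checked with scikit-learn's `KFold`. This is a small sketch; the array `D` is a placeholder for real data and `random_state=42` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder dataset with 250 samples and a single feature.
D = np.arange(250).reshape(-1, 1)

# shuffle=True randomizes the order of the samples before they are split.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = [len(val_idx) for _, val_idx in kf.split(D)]
print(fold_sizes)  # each validation fold holds 250 / 5 = 50 samples
```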
Overfitting happens when a machine learning model learns the training data so closely that it treats noise or random fluctuations in the data as meaningful patterns. This can result in poor performance when the model is applied to new, previously unseen data, because the model does not generalize properly.
Overfitting can be reduced by using:
Let's compare how the model is evaluated with and without K-Fold cross-validation. For this, I will use california_housing_test.csv.
First, we need to import the relevant libraries.
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
median_house_value is our target, and the remaining columns are the input features.
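The article's own preprocessing code is not shown here, so the following is only a sketch of one plausible way to prepare the data. It assumes the categorical `ocean_proximity` column is integer-encoded with `LabelEncoder` and that missing `total_bedrooms` values are filled with the column median. Both choices, and the tiny four-row DataFrame, are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical frame that reuses a few column names from the housing data.
df = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 3.8462, 2.5],
    "total_bedrooms": [129.0, None, 280.0, 310.0],
    "median_house_value": [452600.0, 358500.0, 342200.0, 120000.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN"],
})

# Assumption: fill missing total_bedrooms values with the column median.
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())

# Assumption: map each category to an integer (classes are sorted alphabetically).
df["ocean_proximity_encoded"] = LabelEncoder().fit_transform(df["ocean_proximity"])
df = df.drop(columns="ocean_proximity")

# Separate the target from the input features.
X = df.drop(columns="median_house_value")
y = df["median_house_value"]
print(X.columns.tolist())
```

The output that follows in the article comes from the author's run on the full housing data, not from this sketch.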
Output:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity_encoded
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 3
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 3
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 3
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 3
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 3
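For the baseline without cross-validation, a single train/test split might look like the sketch below. It uses synthetic data from `make_regression` in place of the housing CSV so that it runs on its own. The 80/20 split ratio, the noise level, and `random_state` values are assumptions, not settings confirmed by the article.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=500, n_features=9, noise=25.0, random_state=0)

# Hold out 20% of the rows once; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print("R2 Score:", score)
```

The score reported next in the article is from the author's run on the housing data, so it will differ from what this sketch prints.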
Output:
R2 Score: 0.6114554518898516
This code implements K-Fold Cross-validation for a linear regression model where the target variable is median_house_value.
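The original listing is not reproduced here. A minimal sketch of such a loop, assuming five shuffled folds and using synthetic `make_regression` data in place of the housing CSV, could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=500, n_features=9, noise=25.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fit a fresh model on the K-1 training folds.
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    # Score it on the single held-out fold.
    score = r2_score(y[val_idx], model.predict(X[val_idx]))
    scores.append(score)
    print(f"Fold {fold} R2 Score: {score}")

print("Average R2 Score:", np.mean(scores))
```

The fold scores listed next in the article come from the author's run on the housing data, not from this synthetic example.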
Output:
Fold 1 R2 Score: 0.6114554518898566
Fold 2 R2 Score: 0.6425719794066727
Fold 3 R2 Score: 0.6382892378835952
Fold 4 R2 Score: 0.6654790505178491
Fold 5 R2 Score: 0.6057229383411187
Average R2 Score: 0.6327037316078185
Without cross-validation, we assess the model's performance on only one split of the data. With k-fold cross-validation, we evaluate the model several times on distinct subsets of the data, which gives a more trustworthy estimate of performance and helps detect overfitting or model instability.
In the example above, the R2 score is about 0.61 for the single train/test split, while the average R2 score across the five folds of K-Fold cross-validation is about 0.63.
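One way to use the folds to spot overfitting is to compare training and validation scores in each round. The sketch below does this with `cross_validate` and `return_train_score=True`. Note that it swaps in an unpruned `DecisionTreeRegressor` on synthetic data, rather than the article's linear regression on housing data, purely because a fully grown tree makes the gap easy to see.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; an unpruned tree can memorize the training folds.
X, y = make_regression(n_samples=300, n_features=9, noise=25.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    DecisionTreeRegressor(random_state=0),
    X,
    y,
    cv=cv,
    scoring="r2",
    return_train_score=True,
)

train_mean = results["train_score"].mean()
val_mean = results["test_score"].mean()
print(f"Mean train R2:      {train_mean:.3f}")
print(f"Mean validation R2: {val_mean:.3f}")
```

A training score far above the validation score across folds suggests the model is fitting noise in the training data. Cross-validation itself does not change the model; it makes this gap visible so that a simpler or more regularized model can be chosen.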
K-fold cross-validation reduces model overfitting through a variety of mechanisms: