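Cross-validation is a technique for estimating how well a machine learning model performs on unseen data, which helps to detect overfitting. It works by splitting the dataset into several subsets, training the model on some of them, testing it on the rest, and repeating the process so that the results can be averaged.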
In Holdout Validation, the dataset is split once into a training set and a testing set. Common splits include 70-30, 80-20 or 75-25, depending on the dataset size and problem. This makes the method simple and quick to apply.
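A minimal sketch of a holdout split is shown below. It assumes scikit-learn's `train_test_split`, the Iris dataset and a linear `SVC` (the dataset and model used in the walkthrough later in this article). The 80-20 ratio, the `random_state` value and the `stratify` option are illustrative choices, not requirements of the method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing (an 80-20 split).
# random_state is fixed only to make the split reproducible;
# stratify=y keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train once on the training set and evaluate once on the test set.
model = SVC(kernel="linear")
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)

print(f"Training samples: {len(X_train)}, testing samples: {len(X_test)}")
print(f"Holdout accuracy: {holdout_accuracy:.3f}")
```

Because the model is evaluated on a single split, changing `random_state` can change the reported accuracy noticeably on a small dataset.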
In Leave-One-Out Cross-Validation (LOOCV), the model is trained on the entire dataset except for one data point, which is used for testing. This process is repeated for each data point in the dataset, so a dataset with n points requires n training runs.
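The leave-one-out procedure can be sketched with scikit-learn's `LeaveOneOut` splitter. The Iris dataset and the linear `SVC` are assumptions made for illustration; any estimator could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# LeaveOneOut yields one split per sample: train on n-1 points, test on 1.
loo = LeaveOneOut()
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=loo)

# Each score is either 0.0 or 1.0 because every test set holds one sample.
print(f"Number of iterations: {len(scores)}")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

The number of model fits equals the number of samples, so this approach becomes expensive on large datasets.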
Stratified Cross-Validation is a technique that ensures each fold of the cross-validation process has approximately the same class distribution as the full dataset. This is useful for imbalanced datasets where some classes are underrepresented.
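To illustrate how stratification preserves class proportions, the sketch below uses scikit-learn's `StratifiedKFold` on a hypothetical imbalanced label array (90 samples of one class, 10 of another). The labels, the placeholder features and the parameter values are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(y[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(f"Fold {fold}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Each test fold keeps the 9:1 ratio of the full label array, which a plain random split does not guarantee.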
K-Fold Cross-Validation splits the dataset into k equal-sized folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold for testing.
Note: A commonly used value of k is 10, but the choice depends on the dataset size and problem requirements.
Repeated K-Fold Cross-Validation repeats the K-Fold process multiple times with different random splits. It helps reduce the effect of randomness in data splitting and provides a more robust performance estimate.
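One way to sketch this repetition is with scikit-learn's `RepeatedKFold`. The choice of 5 folds, 3 repeats, the Iris dataset and the linear `SVC` are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5 folds repeated 3 times with different shuffles gives 15 scores in total.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=rkf)

print(f"Number of scores: {len(scores)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several reshuffled runs makes the estimate less dependent on any single partition of the data, at the cost of proportionally more training runs.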
The table below shows an example of the training and testing subsets generated in k-fold cross-validation with k = 5, for a dataset of 25 instances (indexed 0 to 24).
| Iteration | Training Set Observations | Testing Set Observations |
|---|---|---|
| 1 | [5-24] | [0-4] |
| 2 | [0-4, 10-24] | [5-9] |
| 3 | [0-9, 15-24] | [10-14] |
| 4 | [0-14, 20-24] | [15-19] |
| 5 | [0-19] | [20-24] |
Each iteration uses different subsets for testing and training, so every data point is used for testing exactly once and for training in the other four iterations.
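The split pattern in the table above can be reproduced with a few lines of plain Python. The helper `k_fold_indices` below is a hypothetical function written for this illustration (it is not part of scikit-learn), and it assumes contiguous, unshuffled folds and a sample count divisible by k.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, equal-sized folds.

    Illustrative helper: assumes n_samples is divisible by k and no shuffling.
    """
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = indices[start:stop]                 # the i-th fold is held out
        train = indices[:start] + indices[stop:]   # all other folds are used for training
        yield train, test


splits = list(k_fold_indices(25, 5))
for iteration, (train, test) in enumerate(splits, start=1):
    print(
        f"Iteration {iteration}: test = [{test[0]}-{test[-1]}], "
        f"{len(train)} training observations"
    )
```

In practice the data is usually shuffled before splitting, which a library splitter such as scikit-learn's `KFold` with `shuffle=True` handles.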
K-Fold Cross-Validation and the Holdout Method are both widely used techniques that are sometimes confused, so here is a quick comparison between them:
| Feature | K-Fold Cross-Validation | Holdout Method |
|---|---|---|
| Data Split | Dataset is divided into k folds and each fold is used once as test set | Dataset is split once, typically into training and testing sets |
| Training & Testing | Model is trained and tested k times, each fold serving as test set once | Model is trained once on training set and tested once on test set |
| Bias & Variance | Lower bias and a more reliable performance estimate; variance depends on k | Higher bias if the split is not representative; results can vary significantly between splits |
| Execution Time | Slower, especially for large datasets because model is trained k times | Faster, only one training and testing cycle |
| Best Use Case | Small to medium datasets where accuracy estimation is important | Very large datasets or when quick evaluation is needed |
We will import essential modules from scikit-learn.
We will use the Iris dataset, a built-in multi-class dataset with 150 samples and 3 flower species (Setosa, Versicolor and Virginica).
SVC() from scikit-learn is used to build the Support Vector Machine model. Here, we are using a linear kernel, suitable for linearly separable data.
We define 5 folds, meaning the dataset will be split into 5 parts. The model will train on 4 parts and test on 1, repeating this process 5 times for balanced evaluation.
We use cross_val_score() to automatically split the data, then train and evaluate the model across all folds. It returns the accuracy for each fold.
We print the individual fold accuracies and the mean accuracy across all folds to understand the model's stability and generalization.
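The steps above can be assembled into one script, sketched below. The variable names, `shuffle=True` and `random_state=42` are assumptions that the text does not specify. Shuffling matters here because the Iris samples are ordered by species, so unshuffled folds would each be dominated by a single class.

```python
# Step 1: import the required modules from scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Step 2: load the Iris dataset (150 samples, 3 species).
iris = load_iris()
X, y = iris.data, iris.target

# Step 3: build a Support Vector Machine classifier with a linear kernel.
svm_classifier = SVC(kernel="linear")

# Step 4: define 5 folds. shuffle and random_state are assumed settings,
# chosen so that the folds mix all classes and the run is reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Step 5: train and evaluate across all folds; returns one accuracy per fold.
cross_val_results = cross_val_score(svm_classifier, X, y, cv=kf)

# Step 6: report the per-fold accuracies and their mean.
print("Cross-validation accuracy per fold:", cross_val_results)
print("Mean accuracy:", cross_val_results.mean())
```

The exact per-fold values depend on how the data is shuffled, so a different `random_state` may give slightly different numbers.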
Output:
The output shows the accuracy scores from each of the 5 folds in the K-fold cross-validation process. The mean accuracy is the average of these individual scores, approximately 97.33%, and summarizes the model's overall performance across all the folds.