KNN (K-nearest neighbors) is a machine learning algorithm used for both classification (using KNeighborsClassifier) and regression (using KNeighborsRegressor) problems. In the KNN algorithm, K is the
hyperparameter, and choosing the right value of K matters. A machine learning model is said to have high model complexity if it has low bias and high variance.
We know that:
- High bias and low variance = under-fitting model.
- Low bias and high variance = over-fitting model. [Indicates a highly complex model.]
- Low bias and low variance = best-fitting model. [This is preferred.]
- High training accuracy and low test accuracy (out-of-sample accuracy) = high variance = over-fitting model = more model complexity.
- Low training accuracy and low test accuracy (out-of-sample accuracy) = high bias = under-fitting model.
Code: To understand how the value of K in the KNN algorithm affects model complexity.
Output:
Test Accuracy: 0.6465919540035108
Training Accuracy: 0.8687977824212627
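A pair of scores like the one above can be obtained along the following lines. This is only a minimal sketch: the dataset (a synthetic one from make_classification), the split ratio, the random seeds and the choice K=5 are all assumptions made for illustration, so the printed numbers will not match the figures quoted in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: an assumption, since the article's dataset is not shown.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# n_neighbors is the hyperparameter K (K=5 is an arbitrary choice here).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# score() returns mean accuracy for a classifier.
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print("Test Accuracy:", test_acc)
print("Training Accuracy:", train_acc)
```

A large gap between the two scores suggests over-fitting, while two low scores suggest under-fitting.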
Now let's vary the value of K (the hyperparameter) from low to high and observe the model complexity.
[Plots for K = 1, K = 10, K = 20, K = 50 and K = 70]
Observations:
- When K is small, e.g. K=1, the model complexity is high (over-fitting, or high variance).
- When K is very large, e.g. K=70, the model complexity is low (under-fitting, or high bias).
Conclusion:
As K becomes smaller, model complexity increases; as K becomes larger, model complexity decreases.
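The effect of K described above can be checked numerically by refitting the model for each value and comparing training and test accuracy. The sketch below assumes a synthetic dataset and a 70/30 split (neither comes from the original article) and reuses the K values listed earlier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: an assumption made for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

results = {}  # maps K -> (training accuracy, test accuracy)
for k in [1, 10, 20, 50, 70]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"K = {k:2d}  train = {results[k][0]:.3f}  test = {results[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so the training accuracy is perfect as long as the data contains no duplicate points; this is the over-fitting end of the range.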
Code: Let's consider the plot below.
Output:
Observation:
From the above graph, we can conclude that when K is small, e.g. K=1, training accuracy is high but test accuracy is low, which means the model is over-fitting (high variance, or
high model complexity). When K is large, e.g. K=50, both training accuracy and test accuracy are low, which means the model is under-fitting (high bias, or low model complexity).
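A plot of this kind can be produced roughly as follows. This is a sketch under assumptions, not the original listing: the synthetic dataset, the range of K values (1 to 70) and the output file name knn_accuracy_vs_k.png are all chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: an assumption made for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

k_values = list(range(1, 71))
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

# Training and test accuracy as a function of K.
plt.plot(k_values, train_scores, label="Training Accuracy")
plt.plot(k_values, test_scores, label="Test Accuracy")
plt.xlabel("K (n_neighbors)")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("knn_accuracy_vs_k.png")  # hypothetical output file name
```

In an interactive session, plt.show() can be used instead of plt.savefig() once the backend line is removed.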
So hyperparameter tuning is necessary, i.e. selecting the value of K for which the model has low bias and low variance, resulting in a good model with high out-of-sample accuracy.
We can use GridSearchCV or RandomizedSearchCV to find the best value of the hyperparameter K.
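As a minimal sketch of such a search, the example below runs scikit-learn's GridSearchCV over n_neighbors with 5-fold cross-validation. The synthetic dataset, the candidate range 1 to 50 and the number of folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: an assumption made for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Candidate values of K; every value is evaluated with 5-fold cross-validation
# on the training set only, so the test set stays untouched during tuning.
param_grid = {"n_neighbors": list(range(1, 51))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                      scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_test, y_test)  # uses the refitted best estimator
print("Best K:", best_k)
print("Cross-validated accuracy:", search.best_score_)
print("Test accuracy:", test_acc)
```

RandomizedSearchCV (scikit-learn's class for randomized search) is used the same way, except that it takes param_distributions and an n_iter argument and samples only that many candidates instead of trying all of them.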