k-nearest neighbor algorithm using Sklearn - Python
Last Updated : 23 Mar, 2026
K-Nearest Neighbors (KNN) works by identifying the 'k' nearest data points called as neighbors to a given input and predicting its class or value based on the majority class or the average of its neighbors. In this article we will implement it using Python's Scikit-Learn library.
train_test_split() splits the data into 70% training and 30% testing.
random_state=42 ensures reproducibility.
stratify=y maintains the same class distribution in both training and test sets which is important for balanced evaluation.
StandardScaler() standardizes the features by removing the mean and scaling to unit variance (z-score normalization).
This is important for distance-based algorithms like k-NN as it ensures all features contribute equally to distance calculations.
3. Fit the k-NN Model and Evaluate
This creates a k-Nearest Neighbors (k-NN) classifier with k = 5 meaning it considers the 5 nearest neighbors for making predictions.
fit(X_train, y_train) trains the model on the training data.
predict(X_test) generates predictions for the test data.
accuracy_score() compares the predicted labels (y_pred) with the true labels (y_test) and calculates the accuracy i.e the proportion of correct predictions.
Output:
Test Accuracy (k=5): 0.87
4. Cross-Validation to Choose Best k
Choosing the optimal k-value is critical before building the model for balancing the model's performance.
A smaller k value makes the model sensitive to noise, leading to overfitting (complex models).
A larger k value results in smoother boundaries, reducing model complexity but possibly underfitting.
This code performs model selection for the k value in the k-NN algorithm using 5-fold cross-validation:
It tests values of k from 1 to 20.
For each k, a new k-NN model is trained and validated using cross_val_score which automatically splits the dataset into 5 folds, trains on 4 and evaluates on 1, cycling through all folds.
The mean accuracy of each fold is stored in cv_scores.
A line plot shows how accuracy varies with k helping visualize the optimal choice.
The best_k is the value of k that gives the highest mean cross-validated accuracy.