
URL: https://www.geeksforgeeks.org/machine-learning/k-nearest-neighbor-algorithm-in-python/

k-nearest neighbor algorithm using Sklearn - Python

Last Updated : 23 Mar, 2026

K-Nearest Neighbors (KNN) works by identifying the k data points nearest to a given input, called its neighbors, and predicting the input's class from the majority class among them (classification) or its value from their average (regression). In this article we will implement it using Python's scikit-learn library.
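
To make the idea concrete before turning to scikit-learn, here is a minimal from-scratch sketch of the majority-vote rule. It assumes Euclidean distance; the function name knn_predict and the toy arrays are illustrative only and are not part of any library.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest_idx = np.argsort(distances)[:k]
    # Count the class labels of those neighbors and return the most common one
    votes = Counter(y_train[nearest_idx].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters, one per class
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))  # → 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3))   # → 1
```

For regression, the vote would be replaced by the mean of the neighbors' target values.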

1. Generating and Visualizing the 2D Data

  • We will import libraries like pandas, matplotlib, seaborn and scikit-learn.
  • The make_moons() function generates a 2D dataset that forms two interleaving half circles.
  • This kind of data is not linearly separable, which makes it well suited for showing how k-NN handles such cases.
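
A minimal sketch of this step might look like the following. The values n_samples=300, noise=0.3 and random_state=42 are assumptions chosen for illustration, so figures and scores may differ from the ones shown in this article. Plain matplotlib is used for the scatter plot here in place of seaborn to keep the example's dependencies small.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Assumed settings: sample count, noise level and seed are illustrative
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

# Scatter plot of the two features, colored by class label
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("2D Classification Data (make_moons)")
ax.legend(*points.legend_elements(), title="Class")
plt.show()
```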

Output:

2D Classification Data Visualisation

2. Train-Test Split and Normalization

  • train_test_split() splits the data into 70% training and 30% testing.
  • random_state=42 ensures reproducibility.
  • stratify=y maintains the same class distribution in both the training and test sets, which is important for balanced evaluation.
  • StandardScaler() standardizes the features by removing the mean and scaling to unit variance (z-score normalization).
  • This is important for distance-based algorithms like k-NN as it ensures all features contribute equally to distance calculations.
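
The split and scaling described above can be sketched as follows. The dataset settings are the same illustrative assumptions as before. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed dataset settings (illustrative)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

# 70/30 split, reproducible and stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train/test sizes:", X_train.shape[0], X_test.shape[0])
print("Test class counts:", np.bincount(y_test))
print("Scaled train mean:", X_train_scaled.mean(axis=0).round(3))
print("Scaled train std:", X_train_scaled.std(axis=0).round(3))
```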

3. Fit the k-NN Model and Evaluate

  • KNeighborsClassifier(n_neighbors=5) creates a k-Nearest Neighbors (k-NN) classifier with k = 5, meaning it considers the 5 nearest neighbors when making predictions.
  • fit(X_train, y_train) fits the model on the training data; for k-NN this mainly stores the training points so that neighbors can be looked up at prediction time.
  • predict(X_test) generates predictions for the test data.
  • accuracy_score() compares the predicted labels (y_pred) with the true labels (y_test) and calculates the accuracy, i.e., the proportion of correct predictions.
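
A self-contained sketch of this step is shown below. It repeats the data preparation with the same assumed dataset settings and fits the classifier on the standardized features, so the printed accuracy will not necessarily match the value reported in this article.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preparation (assumed, illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# k-NN classifier that votes among the 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict on the held-out test set and measure the share of correct labels
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy (k=5): {accuracy:.2f}")
```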

Output:

Test Accuracy (k=5): 0.87

4. Cross-Validation to Choose Best k

Choosing a good value of k before building the final model is critical, because k controls the trade-off between overfitting and underfitting.

  • A smaller k value makes the model sensitive to noise, leading to overfitting (complex models).
  • A larger k value results in smoother boundaries, reducing model complexity but possibly underfitting.

Model selection for the k value in the k-NN algorithm is done here with 5-fold cross-validation:

  • It tests values of k from 1 to 20.
  • For each k, a new k-NN model is trained and validated using cross_val_score, which automatically splits the dataset into 5 folds, trains on 4 and evaluates on 1, cycling through all folds.
  • For each k, the mean accuracy across the 5 folds is stored in cv_scores.
  • A line plot shows how accuracy varies with k, helping to visualize the optimal choice.
  • The best_k is the value of k that gives the highest mean cross-validated accuracy.
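
The procedure above can be sketched as follows. The dataset settings are illustrative assumptions, and running cross-validation on the scaled training set is also an assumption, since the article does not say which data is passed to cross_val_score. The selected k may therefore differ from the one reported below.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preparation (assumed, illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Evaluate k = 1..20 with 5-fold cross-validation on the training set
k_values = range(1, 21)
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())  # mean accuracy over the 5 folds

# k with the highest mean accuracy (the smallest such k if there is a tie)
best_k = k_values[int(np.argmax(cv_scores))]

# Line plot of mean cross-validated accuracy against k
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(list(k_values), cv_scores, marker="o")
ax.set_xlabel("Number of neighbors (k)")
ax.set_ylabel("Mean 5-fold CV accuracy")
ax.set_title("Choosing Best k")
ax.set_xticks(list(k_values))
plt.show()

print("Best k from cross-validation:", best_k)
```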

Output:

Choosing Best k

Best k from cross-validation: 6

5. Training with Best k

  • The model is trained on the training set with the selected k (here k = 6).
  • The trained model then predicts labels for the unseen test set, which gives an estimate of how well it generalizes to new data.
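
A short sketch of this step follows. The value best_k = 6 is simply the one reported in this article; with your own data, substitute the result of your cross-validation. The dataset settings are the same illustrative assumptions as in the earlier steps.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preparation (assumed, illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Value reported in the article; replace with your own cross-validation result
best_k = 6

# Refit on the full training set with the selected k, then predict the test set
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_scaled, y_train)
y_pred = best_knn.predict(X_test_scaled)

print("First 10 predictions:", y_pred[:10])
print("First 10 true labels:", y_test[:10])
```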

6. Evaluate Using More Metrics

  • Calculate the confusion matrix comparing true labels (y_test) with predictions (y_pred).
  • Use ConfusionMatrixDisplay to visualize the confusion matrix with labeled classes.

Print a classification report that includes:

  • Precision: The fraction of predicted positives that are actually positive.
  • Recall: The fraction of actual positives that were correctly predicted.
  • F1-score: Harmonic mean of precision and recall.
  • Support: Number of true instances per class.
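
These metrics can be computed as in the sketch below. It uses the same assumed dataset settings and the article's k = 6; the class names "Class 0" and "Class 1" are illustrative labels, and the numbers produced may differ from the screenshots in this article.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preparation and model fitting (assumed, illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
best_knn = KNeighborsClassifier(n_neighbors=6).fit(X_train_scaled, y_train)
y_pred = best_knn.predict(X_test_scaled)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
class_names = ["Class 0", "Class 1"]  # illustrative display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix for k = 6")
plt.show()

# Per-class precision, recall, F1-score and support
print(classification_report(y_test, y_pred, target_names=class_names))
```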

Output:

Confusion Matrix for k = 6
Classification Report

7. Visualize Decision Boundary with Best k

  • Create a 2D mesh grid (xx, yy) covering the feature space.
  • Use the trained model (best_knn) to predict class labels for each grid point.
  • Reshape predictions to match the grid and plot decision regions using contourf.
  • Overlay the original data points using sns.scatterplot to compare true classes with model predictions.

This helps visualize how the model separates classes for the chosen value of k.
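
The steps above can be sketched as follows. The dataset settings, the grid step of 0.02 and the 0.5 margin around the data are assumptions for illustration. The points are drawn with matplotlib's scatter rather than sns.scatterplot to avoid the extra dependency, and they are plotted in the standardized feature space because that is the space the model was trained in.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Data preparation and model fitting (assumed, illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
best_knn = KNeighborsClassifier(n_neighbors=6).fit(X_train_scaled, y_train)

# Mesh grid covering the standardized feature space with a small margin
X_all_scaled = scaler.transform(X)
x_min, x_max = X_all_scaled[:, 0].min() - 0.5, X_all_scaled[:, 0].max() + 0.5
y_min, y_max = X_all_scaled[:, 1].min() - 0.5, X_all_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Predict a class for every grid point and reshape back to the grid
Z = best_knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Filled contours show the decision regions; points show the true classes
fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
ax.scatter(X_all_scaled[:, 0], X_all_scaled[:, 1], c=y, cmap="coolwarm", edgecolor="k")
ax.set_xlabel("Feature 1 (standardized)")
ax.set_ylabel("Feature 2 (standardized)")
ax.set_title("Decision Boundary with k = 6")
plt.show()
```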

Output:

Decision Boundary with best K = 6

We can see that our k-NN model classifies the data points well.

