Using KNNImputer in Scikit-Learn to Handle Missing Data in Python

Last Updated : 15 Jul, 2025

KNNImputer is a scikit-learn class used to fill in missing values in a dataset. Rather than the naive approach of replacing every missing value in a column with that column's mean or median, it builds on the k-nearest neighbors (KNN) algorithm. In this approach, we specify how many neighbors to consult, which is known as the K parameter (n_neighbors). Each missing value is then imputed as the mean of that feature across the K nearest samples.

How Does KNNImputer Work?

The KNNImputer works by finding the k nearest neighbors of each sample that has missing values, using a distance metric that skips missing coordinates (nan_euclidean by default). It then imputes each missing value as the mean of that feature among those neighbors, either unweighted or weighted by distance depending on the weights setting. The key advantage of this approach is that it uses the relationships between features, which can lead to better model performance.

For example, consider a dataset with a missing value in a column representing a student’s math score. Instead of simply filling this missing value with the overall mean or median of the math scores, KNNImputer finds the k nearest students (based on other features like scores in physics, chemistry, etc.) and imputes the missing value using the mean of these neighbors' math scores.
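The student example above can be sketched by hand. This is a minimal illustration, not the library's internal code: it uses the same sample scores shown in the output section of this article, and the variable names (target_row, donors, nearest) are made up for the example. It relies on scikit-learn's nan_euclidean_distances helper to measure distance while ignoring missing coordinates.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Sample scores; columns are Maths, Chemistry, Physics, Biology.
X = np.array([
    [80.0, 60.0, np.nan, 78.0],
    [90.0, 65.0, 57.0, 83.0],
    [np.nan, 56.0, 80.0, 67.0],
    [95.0, np.nan, 78.0, np.nan],
])

target_row, target_col = 2, 0  # student 2 is missing the Maths score
k = 2                          # number of neighbors to average

# Distance from the target student to every student, skipping missing coordinates.
dist = nan_euclidean_distances(X[[target_row]], X)[0]

# Only students who actually have a Maths score can donate a value.
donors = [i for i in range(len(X))
          if i != target_row and not np.isnan(X[i, target_col])]

# Keep the k closest donors and average their Maths scores.
nearest = sorted(donors, key=lambda i: dist[i])[:k]
imputed = X[nearest, target_col].mean()

print("Nearest students:", nearest)
print("Imputed Maths score:", imputed)
```

Here students 3 and 0 are the closest donors, so the imputed Maths score is the mean of 95 and 80.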

It is implemented by the KNNImputer class, whose constructor takes the following main arguments:

n_neighbors: number of neighboring samples used to impute each missing value. Default: 5.
metric: distance metric used when searching for neighbors. Values: {nan_euclidean, callable}. Default: nan_euclidean.
weights: how the neighbors' values are weighted when they are averaged. Values: {uniform, distance, callable}. Default: uniform.
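As a rough illustration of how these arguments interact, the sketch below fits two imputers on the same sample scores, one with uniform weights and one with distance weights. The array and variable names are assumptions made for the example; only the constructor arguments come from the list above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Sample scores; columns are Maths, Chemistry, Physics, Biology.
X = np.array([
    [80.0, 60.0, np.nan, 78.0],
    [90.0, 65.0, 57.0, 83.0],
    [np.nan, 56.0, 80.0, 67.0],
    [95.0, np.nan, 78.0, np.nan],
])

# Plain average of the 2 nearest neighbors.
uniform = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")

# Closer neighbors count for more (weight is the inverse of the distance).
weighted = KNNImputer(n_neighbors=2, weights="distance")

X_uniform = uniform.fit_transform(X)
X_weighted = weighted.fit_transform(X)

# Compare the imputed Maths score for the third student under both settings.
print("uniform :", X_uniform[2, 0])
print("distance:", X_weighted[2, 0])
```

With distance weighting, the imputed value is pulled toward the score of the closer neighbor instead of sitting exactly halfway between the two.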

Code: Python code to illustrate the KNNImputer class
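The original listing is not reproduced in this copy of the article. What follows is a minimal reconstruction under one assumption: n_neighbors=2, which is the setting that reproduces the numbers in the output below. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Student scores with some missing entries (np.nan).
data = {
    "Maths": [80, 90, np.nan, 95],
    "Chemistry": [60, 65, 56, np.nan],
    "Physics": [np.nan, 57, 80, 78],
    "Biology": [78, 83, 67, np.nan],
}
df = pd.DataFrame(data)
print("Data Before performing imputation\n", df)

# Assumed setting: average the 2 nearest neighbors for each missing value.
imputer = KNNImputer(n_neighbors=2)

# fit_transform learns from df and returns the imputed values.
after = imputer.fit_transform(df)
print("\n\nAfter performing imputation\n", after)
```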

Output:

Data Before performing imputation
    Maths  Chemistry  Physics  Biology
0   80.0       60.0      NaN     78.0
1   90.0       65.0     57.0     83.0
2    NaN       56.0     80.0     67.0
3   95.0        NaN     78.0      NaN


After performing imputation
 [[80.  60.  68.5 78. ]
 [90.  65.  57.  83. ]
 [87.5 56.  80.  67. ]
 [95.  58.  78.  72.5]]

Note: fit_transform returns a NumPy array, so the column names and index of the original DataFrame are not carried over.
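If the labels are needed afterwards, one common way to restore them is to wrap the returned array in a new DataFrame. The sketch below assumes the same sample scores and n_neighbors=2; the name imputed_df is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Maths": [80, 90, np.nan, 95],
    "Chemistry": [60, 65, 56, np.nan],
    "Physics": [np.nan, 57, 80, 78],
    "Biology": [78, 83, 67, np.nan],
})

imputer = KNNImputer(n_neighbors=2)

# Reattach the original column names and index to the returned array.
# This assumes no column is entirely NaN; by default such columns are
# dropped by the imputer and the shapes would no longer match.
imputed_df = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
print(imputed_df)
```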

Advantages of Using KNNImputer

Preserves Relationships: By using the k-nearest neighbors, this method preserves the relationships between features, which can improve model performance.
Customizable: The ability to customize the number of neighbors, distance metric, and weighting scheme makes KNNImputer highly versatile and adaptable to different types of data.
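Works with Encoded Categorical Data: KNNImputer operates on numeric input only, but categorical features can be included once they are encoded as numbers. Because imputed values are averages, they may need to be rounded or mapped back to valid categories.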

Limitations of KNNImputer

Computationally Intensive: Finding the k-nearest neighbors for each missing value can be computationally expensive, especially for large datasets with many missing values.
Sensitive to Outliers: The method may be influenced by outliers in the dataset, as outliers can distort the imputation by skewing the mean of the neighbors.
Requires Sufficient Data: KNNImputer works best when there is sufficient data to find reliable neighbors. In datasets with a high proportion of missing values, this method may not perform as well.
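The outlier sensitivity mentioned above can be seen on a toy array. The data here is hypothetical and chosen only to make the effect obvious: one donor row carries an extreme value, and whether it is included depends on n_neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: column 0 is a feature, column 1 is the value to impute.
X = np.array([
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 500.0],   # outlier in column 1
    [1.5, np.nan],  # value to impute
])

# With 3 neighbors the outlier row is included in the average.
mean_k3 = KNNImputer(n_neighbors=3).fit_transform(X)[3, 1]

# With 2 neighbors only the two closest rows are averaged.
mean_k2 = KNNImputer(n_neighbors=2).fit_transform(X)[3, 1]

print("n_neighbors=3:", mean_k3)
print("n_neighbors=2:", mean_k2)
```

A single extreme neighbor moves the imputed value from about 10.5 to well above 100, which is why inspecting or clipping outliers before imputation is worth considering.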

Conclusion

KNNImputer in Scikit-Learn is a powerful tool for handling missing data, offering a more sophisticated alternative to traditional imputation methods. By leveraging the relationships between features, it provides more accurate imputations that can lead to better model performance. However, it is essential to be mindful of its computational demands and sensitivity to outliers. When used appropriately, KNNImputer can significantly enhance your data preprocessing pipeline, leading to more robust and reliable machine-learning models.

URL: https://www.geeksforgeeks.org/machine-learning/python-imputation-using-the-knnimputer/