KNNImputer is a scikit-learn class used to fill in (impute) the missing values in a dataset. Instead of the naive approach of replacing every missing value in a column with that column's mean or median, it applies the idea behind the k-nearest neighbors (KNN) algorithm. We specify how many neighbors to consult, known as the K parameter (n_neighbors), and each missing value is estimated as the mean of that feature across the K nearest neighbors.
KNNImputer works by finding the k nearest neighbors (based on a specified distance metric) for each data point with missing values. Only rows in which the feature in question is present can serve as neighbors for it. It then imputes the missing value using the mean of that feature over those neighbors, either unweighted or weighted by distance, depending on the weights parameter. The key advantage of this approach is that it draws on the relationships between features, which can lead to better model performance.
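The neighbor search described above can be sketched by hand. The snippet below is a minimal illustration, not scikit-learn's internal code: it uses `nan_euclidean_distances` from `sklearn.metrics.pairwise` to compute distances that ignore missing coordinates, then averages one feature over the two closest rows that have it. The array values and the choice of two neighbors are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Illustrative data. Columns: Maths, Chemistry, Physics, Biology
X = np.array([
    [80.0, 60.0, np.nan, 78.0],
    [90.0, 65.0, 57.0, 83.0],
    [np.nan, 56.0, 80.0, 67.0],
    [95.0, np.nan, 78.0, np.nan],
])

# Pairwise distances that skip coordinates missing in either row and
# rescale by (total coordinates / coordinates actually compared).
dist = nan_euclidean_distances(X)

# Row 0 is missing Physics (column 2). Candidate donors are the other
# rows in which Physics is present.
donors = [i for i in range(len(X)) if i != 0 and not np.isnan(X[i, 2])]

# Keep the two closest donors and average their Physics scores.
nearest = sorted(donors, key=lambda i: dist[0, i])[:2]
estimate = X[nearest, 2].mean()

# Rows 1 and 2 are closest, so the estimate is (57 + 80) / 2 = 68.5
print(nearest, estimate)
```

Row 3 has a Physics score of 78 but shares only the Maths column with row 0, and the rescaled distance for that single coordinate is large, so it is not chosen as a donor.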
For example, consider a dataset with a missing value in a column representing a student’s math score. Instead of simply filling this missing value with the overall mean or median of the math scores, KNNImputer finds the k nearest students (based on other features such as scores in physics, chemistry, etc.) and imputes the missing value using the mean of these neighbors' math scores.
It is implemented by the KNNImputer class, whose constructor takes the following main arguments:
n_neighbors: the number of neighboring data points used to impute each missing value. Default: 5.
metric: the distance metric used when searching for neighbors. Values: 'nan_euclidean' or a callable. Default: 'nan_euclidean'.
weights: how the neighbors' values are weighted when they are averaged. Values: 'uniform', 'distance', or a callable. Default: 'uniform'.
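To show how the weights argument changes the result, here is a small sketch that imputes the same array with 'uniform' and with 'distance' weighting. The data and the value n_neighbors=2 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative data. Columns: Maths, Chemistry, Physics, Biology
X = np.array([
    [80.0, 60.0, np.nan, 78.0],
    [90.0, 65.0, 57.0, 83.0],
    [np.nan, 56.0, 80.0, 67.0],
    [95.0, np.nan, 78.0, np.nan],
])

# 'uniform': every selected neighbor contributes equally to the mean.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# 'distance': closer neighbors contribute more (weight = 1 / distance).
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Compare the imputed Physics score for row 0 under both schemes.
print("uniform :", uniform[0, 2])
print("distance:", weighted[0, 2])
```

With uniform weights the imputed value is the plain average of the two neighbors' scores. With distance weights it is pulled toward the score of the closer neighbor.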
Code: Python code to illustrate the KNNImputer class
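A minimal sketch consistent with the output shown below. The variable names are illustrative, and n_neighbors=2 is an assumption inferred from the imputed values (each one is the average of two observed scores); the default 'nan_euclidean' metric and 'uniform' weights are used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Scores for four students; np.nan marks a missing value.
scores = {
    "Maths": [80, 90, np.nan, 95],
    "Chemistry": [60, 65, 56, np.nan],
    "Physics": [np.nan, 57, 80, 78],
    "Biology": [78, 83, 67, np.nan],
}

before_imputation = pd.DataFrame(scores)
print("Data Before performing imputation\n", before_imputation)

# Impute each missing value as the mean of its 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
after_imputation = imputer.fit_transform(before_imputation)

print("\nAfter performing imputation\n", after_imputation)
```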
Output:
Data Before performing imputation
Maths Chemistry Physics Biology
0 80.0 60.0 NaN 78.0
1 90.0 65.0 57.0 83.0
2 NaN 56.0 80.0 67.0
3 95.0 NaN 78.0 NaN
After performing imputation
[[80. 60. 68.5 78. ]
[90. 65. 57. 83. ]
[87.5 56. 80. 67. ]
[95. 58. 78. 72.5]]
Note: After transformation, the data is returned as a NumPy array rather than a pandas DataFrame.
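If the column labels are needed afterwards, one option is to wrap the returned array in a new DataFrame. This is a minimal sketch with illustrative names; it assumes no column is entirely missing, since by default KNNImputer drops such columns and the column count would then no longer match.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Maths": [80, 90, np.nan, 95],
    "Chemistry": [60, 65, 56, np.nan],
    "Physics": [np.nan, 57, 80, 78],
    "Biology": [78, 83, 67, np.nan],
})

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a plain NumPy array, so restore the original
# column names and row index by building a new DataFrame around it.
imputed_df = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
print(imputed_df)
```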
KNNImputer adapts to many kinds of numeric data. It operates on numeric features only, so categorical columns must be encoded as numbers before imputation, and because the imputed values are averages, they may not correspond to a valid category.

KNNImputer works best when there is sufficient data to find reliable neighbors. In datasets with a high proportion of missing values, this method may not perform as well.

KNNImputer in scikit-learn is a powerful tool for handling missing data, offering a more sophisticated alternative to simple mean or median imputation. By leveraging the relationships between features, it can provide more accurate imputations, which in turn can lead to better model performance. However, it is essential to be mindful of its computational demands, since distances must be computed between rows, and of its sensitivity to outliers. When used appropriately, KNNImputer can significantly enhance your data preprocessing pipeline, leading to more robust and reliable machine-learning models.