![]() |
VOOZH | about |
K Nearest Neighbors Classification is one of the classification techniques based on instance-based learning. Models based on instance-based learning to generalize beyond the training examples. To do so, they store the training examples first. When it encounters a new instance (or test example), then they instantly build a relationship between stored training examples and this new instant to assign a target function value for this new instance. Instance-based methods are sometimes called lazy learning methods because they postponed learning until the new instance is encountered for prediction.
Instead of estimating the hypothetical function (or target function) once for the entire space, these methods will estimate it locally and differently for each new instance to be predicted.
Basic Assumption:
An instance can be represented by < x1, x2, .............., xn >. Euclidean distance between two instances xa and xb is given by d( xa, xb ) :
How does it work?
K-Nearest Neighbors Classifier first stores the training examples. During prediction, when it encounters a new instance (or test example) to predict, it finds the K number of training instances nearest to this new instance. Then assigns the most common class among the K-Nearest training instances to this test instance.
The optimal choice for K is by validating errors on test data. K can also be chosen by the square root of m, where m is the number of examples in the dataset.
In the above figure, "+" denotes training instances labelled with 1. "-" denotes training instances with 0. Here we classified for the test instance xt as the most common class among K-Nearest training instances to it. Here we choose K = 3, so xt is classified as "-" or 0.
Diabetes Dataset used in this implementation can be downloaded from link.
It has 8 features columns like i.e "Age", "Glucose" e.t.c, and the target variable βOutcomeβ for 108 patients. So in this, we will create a link Neighbors Classifier model to predict the presence of diabetes or not for patients with such information.
Accuracy on test set by our model : 63.888888888888886 Accuracy on test set by sklearn model : 63.888888888888886
The accuracy achieved by our model and sklearn is equal which indicates the correct implementation of our model.
Note: Above Implementation is for model creation from scratch, not to improve the accuracy of the diabetes dataset.
K Nearest Neighbors Regression first stores the training examples. During prediction, when it encounters a new instance ( or test example ) to predict, it finds the K number of training instances nearest to this new instance. Then predicts the target value for this instance by calculating the mean of the target values of these nearest neighbors.
The optimal choice for K is by validating errors on test data. K can also be chosen by the square root of m, where m is the number of examples in the dataset.
Dataset used in this implementation can be downloaded from link
It has 2 columns β βYearsExperienceβ and βSalaryβ for 30 employees in a company. So in this, we will create a K Nearest Neighbors Regression model to learn the correlation between the number of years of experience of each employee and their respective salary.
The model, we created predicts the same value as the sklearn model predicts for the test set.
Predicted values by our model : [ 43024.33 113755.33 58419. ] Predicted values by sklearn model : [ 43024.33 113755.33 58419. ] Real values : [ 37731 122391 57081]
Disadvantage: Instance Learning models are computationally very costly because all the computations are done during prediction. It also considers all the training examples for the prediction of every test example.