Instance-Based Learning: K-Nearest Neighbour Learning
The advantage of this approach is that instead of estimating the target function
once for the entire instance space, these methods can estimate it locally and
differently for each new instance to be classified.
Characteristics of Instance-based Learning Methods
Instance-based methods can construct a different approximation to the target
function for each distinct query instance that must be classified.
These methods can use more complex, symbolic representations for instances.
Another limitation of these methods, especially kNN, is that they typically consider
all attributes of the instances while attempting to retrieve similar training examples
from memory.
k-Nearest Neighbour Learning
kNN is a non-parametric supervised learning technique and the most basic
instance-based learning method.
kNN stores all training cases and classifies new cases based on a
similarity measure.
The nearest neighbors (i.e. similarity) of an instance
are defined in terms of the standard Euclidean
distance.
The target function may be discrete-valued or
real-valued.
Euclidean Distance
In mathematics, the Euclidean distance between
two points in Euclidean space is the length of a line
segment between the two points.
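This definition translates directly into code. A minimal sketch (the helper name `euclidean_distance` and the sample points are illustrative, not from the original):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between two (height, weight) points: sqrt((163-158)^2 + (60-58)^2)
print(euclidean_distance((158, 58), (163, 60)))  # sqrt(29) ≈ 5.385
```

From Python 3.8 onward, `math.dist` provides the same computation in the standard library.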
A few rows of the data, including height, weight, and T-shirt size, are shown
below:

Height (in cms) | Weight (in kgs) | T-shirt Size
158             | 58              | M
163             | 60              | M
165             | 62              | L
170             | 64              | L
Step 1: Calculate Similarity Based on a Distance Function
Calculate the Euclidean distance between the new test example and all the
training examples.
Rank the training examples by their distance from the new test example: the
example with the smallest distance receives rank 1 and is the nearest
neighbour.
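Step 1 can be sketched as follows, using the T-shirt data above. The test point (161, 61) is a hypothetical new customer, not from the original:

```python
import math

# Training data from the T-shirt example: (height in cm, weight in kg) -> size
training = [((158, 58), "M"), ((163, 60), "M"), ((165, 62), "L"), ((170, 64), "L")]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

test_point = (161, 61)  # hypothetical new test example

# Sort training examples by distance; rank 1 is the nearest neighbour
ranked = sorted(training, key=lambda ex: euclidean(test_point, ex[0]))
for rank, (features, label) in enumerate(ranked, start=1):
    print(rank, features, label, round(euclidean(test_point, features), 2))
```

Here the nearest neighbour (rank 1) is (163, 60) with label M, at distance √5 ≈ 2.24.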
Step 2: Find the K Nearest Neighbours
Select the top K nearest neighbours of the test example based on the distances
calculated in the previous step.
For a discrete-valued target function, assign the most common value among the
k nearest training examples.
For a real-valued target function, output the mean of the target values of the
k nearest examples as the prediction for the test example.
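Both steps can be combined into one predictor. A minimal sketch (the function name `knn_predict` and the `regression` flag are illustrative choices, not from the original):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(training, query, k, regression=False):
    """Predict the target value of `query` from its k nearest training examples."""
    neighbours = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    labels = [label for _, label in neighbours]
    if regression:
        return sum(labels) / k                    # mean of real-valued targets
    return Counter(labels).most_common(1)[0][0]   # majority vote for discrete targets

training = [((158, 58), "M"), ((163, 60), "M"), ((165, 62), "L"), ((170, 64), "L")]
print(knn_predict(training, (161, 61), k=3))  # prints M
```

With k = 3 the nearest neighbours are (163, 60) M, (165, 62) L, and (158, 58) M, so the majority vote assigns size M.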
Visualizing KNN (for k=5)
For a very low value of k (say k = 1), the model overfits the training data, which leads to a high error rate on the
validation set. On the other hand, for a very high value of k, the model underfits and performs poorly on both the
training and validation sets. If you observe closely, the validation error curve reaches a minimum at k = 9, which is
the optimal value of k for this model (it will vary for different datasets).
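The procedure of choosing k by validation error can be sketched as below. The synthetic two-class data, the train/validation split, and the candidate k values are all illustrative assumptions, not the dataset behind the figure:

```python
import math
import random
from collections import Counter

random.seed(0)

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k):
    neighbours = sorted(train, key=lambda ex: euclidean(query, ex[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Synthetic two-class data: class 0 clustered near (0, 0), class 1 near (3, 3)
data = ([((random.gauss(0, 1), random.gauss(0, 1)), 0) for _ in range(40)]
        + [((random.gauss(3, 1), random.gauss(3, 1)), 1) for _ in range(40)])
random.shuffle(data)
train, val = data[:60], data[60:]

# Compute validation error for each candidate k; pick the k with the lowest error
val_error = {}
for k in (1, 3, 5, 9, 15):
    mistakes = sum(knn_predict(train, x, k) != y for x, y in val)
    val_error[k] = mistakes / len(val)
    print(f"k={k}: validation error {val_error[k]:.2f}")
```

In practice the same sweep is plotted as the validation error curve described above, and the k at its minimum is chosen.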