1 - KNN


Machine Learning
Prepared By:
Dr. Sara Sweidan
Lazy Learning – Classification Using Nearest Neighbors
To know

• The key concepts that define nearest neighbor classifiers, and why they are
considered "lazy" learners
• Methods to measure the similarity of two examples using distance
• How to apply a popular nearest neighbor classifier called k-NN
Nearest Neighbor Classification

➢ Nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of similar labeled examples.

➢ In general, nearest neighbor classifiers are well-suited for classification tasks,

• Where relationships among the features and the target classes are numerous, complicated, or
extremely difficult to understand, yet the items of similar class type tend to be fairly
homogeneous.

• On the other hand, if the data is noisy and thus no clear distinction exists among the groups,
the nearest neighbor algorithms may struggle to identify the class boundaries.
The k-NN algorithm
• The nearest neighbors approach to classification is exemplified by the k-nearest
neighbors algorithm (k-NN).
• The k-NN algorithm stores all the available data and classifies a new data point based on its
similarity to the stored examples. This means that when new data appears, it can easily be
assigned to the best-matching category.
• The strengths and weaknesses of this algorithm are as follows:

Strengths
• Simple and effective.
• Makes no assumptions about the underlying data distribution.
• Fast training phase.

Weaknesses
• Does not produce a model, limiting the ability to understand how the features are related to the class.
• Requires selection of an appropriate k.
• Slow classification phase.
• Nominal features and missing data require additional processing.
The k-NN algorithm
• The k-NN algorithm gets its name from the fact that it uses information about an
example's k-nearest neighbors to classify unlabeled examples.

• The letter k is a variable term implying that any number of nearest neighbors could be
used.

• After choosing k, the algorithm requires a training dataset made up of examples that have
been classified into several categories, as labeled by a nominal variable.

• Then, for each unlabeled record in the test dataset, k-NN identifies k records in the
training data that are the "nearest" in similarity.

• The unlabeled test instance is assigned the class of the majority of the k nearest neighbors.
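
Finding the "nearest" records requires a way to measure distance between examples. The following is a minimal sketch, assuming numeric features and Euclidean distance (the measure this deck names in its step list). The function name and the sample points are illustrative and do not come from the slides.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    # Square the difference in each feature, sum the squares, take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical two-feature examples, invented for illustration.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

A smaller distance means two examples are more similar, so the k nearest neighbors are simply the k training records with the smallest distance to the unlabeled record.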
The k-NN algorithm
The working of k-NN can be explained by the following steps:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each
point in the training data.
• Step-3: Take the K nearest neighbors according to the calculated Euclidean
distances.
• Step-4: Among these K neighbors, count the number of data
points in each category.
• Step-5: Assign the new data point to the category with the
largest number of neighbors.
• Step-6: Our model is ready.
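
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function name `knn_classify`, the two features (assumed here to be age and salary in thousands), and all data values are invented for the example. It uses `math.dist` from the standard library (Python 3.8+) for the Euclidean distance.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training example.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count the labels among the neighbors, return the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: (age, salary in thousands).
train_points = [(25, 40), (30, 45), (28, 38), (45, 120), (50, 130), (48, 110)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_classify(train_points, train_labels, (47, 115), k=3))  # yes
print(knn_classify(train_points, train_labels, (27, 42), k=3))   # no
```

Note that there is no training step beyond storing the data. All of the work happens when a new point is classified, which is why k-NN is called a lazy learner.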
• Example:
A car manufacturer has built a new SUV and wants to show ads to the
users who are interested in buying it. For this problem, we have a
dataset that contains information about multiple users, collected
through a social network.
[Figures: two scatter plots of the dataset; x-axis ticks run from 0 to 60, y-axis ticks from 0 to 160000]
[Diagram: Training Set → Learning Algorithm → hypothesis (model); Testing Set → hypothesis (model) → Predicted output]

1. With a very large value of k, the neighborhood may include points from other classes.
2. With too small a value of k, the algorithm is very sensitive to noise.
• There is no single way to determine the best value for K, so we need to try several
values and pick the one that performs best. A commonly used value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and makes the model
sensitive to outliers.
• Larger values of K reduce the effect of noise, but a value that is too large can pull
points from other classes into the vote.
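
One way to "try some values" is to hold back a few labeled examples and measure accuracy on them for each candidate K. The sketch below assumes this hold-out approach. The helper names and the toy data, including one deliberately mislabeled training point, are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """`train` is a list of (point, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    """Fraction of validation examples that k-NN labels correctly."""
    hits = sum(knn_classify(train, p, k) == label for p, label in validation)
    return hits / len(validation)

# Invented toy data: two clusters plus one mislabeled ("noisy") training point.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B"),
         ((2.5, 2.5), "B")]  # noise: a "B" label inside the "A" cluster
validation = [((1.5, 1.5), "A"), ((2.6, 2.4), "A"),
              ((8.5, 8.5), "B"), ((3, 3), "A")]

for k in (1, 3, 5, 9):
    print(k, accuracy(train, validation, k))
```

On this toy data, K=1 scores 0.5 because the noisy point captures its neighbors, K=3 and K=5 score 1.0, and K=9 scores 0.25 because every vote then includes the whole training set and the larger class always wins.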
• The k-NN algorithm does more computation at test time than
at training time.
(a) True (b) False

• Which of the following statements are true about the k-NN algorithm?
1. k-NN performs much better if all of the data have the same scale
2. k-NN works well with a small number of input variables (p), but
struggles when the number of inputs is very large
3. k-NN makes no assumptions about the functional form of the
problem being solved
(a) 1 and 3 (b) 1 and 2 (c) All of the above
• Which of the following is the Euclidean distance between the
two data points A(1,2) and B(2,3)?
(a) 1 (b) 2 (c) 4 (d) 8

• Which of the following is true about k in k-NN in terms of bias?
(A) When you increase k, the bias increases
(B) When you decrease k, the bias decreases
(C) Can't say
(D) None of these
• A company has built a k-NN classifier that gets 100% accuracy
on training data. When they deployed this model on the client side,
the model was found to be not at all accurate. Which of
the following might have gone wrong?
• Note: The model was deployed successfully and no technical issues
were found on the client side other than the model's performance.
(A) It is probably an overfitted model
(B) It is probably an underfitted model
(C) Can't say
(D) None of these
