

Why is KNN a poor choice for a spam filter?

What is KNN?

 KNN is a very simple algorithm used to solve classification problems. KNN stands for K-Nearest Neighbors; K is the number of neighbors considered when classifying a new point.
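A minimal from-scratch sketch of the idea (the toy data and the function name knn_predict are invented for illustration, not taken from the slides): classify a query point by majority vote among its k nearest training points.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two classes in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # -> "A"
```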
Why KNN is a poor choice as a spam filter
 KNN classifiers are good only when there is a really meaningful distance metric. In the spam case, a KNN classifier will label as spam anything that is "close" to known spam, where "close" is defined by your distance metric (which will likely be poor), as the sketch below illustrates.
Therefore, a KNN classifier will only filter spam that is very similar to spam it has already seen; it won't generalize properly.
Also, you have to train on non-spam examples too, and KNN suffers from the same problem there: it will only confidently say something is non-spam if it is written very similarly to a non-spam email it was trained on.
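To illustrate the weak-metric problem, here is a hedged sketch (the example emails and the plain bag-of-words representation are invented for illustration): a reworded spam message shares no words with a known spam, so in Euclidean bag-of-words space it ends up no closer to the known spam than an ordinary ham message does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

known_spam = "win a free prize claim your free money now"
new_spam   = "you have been selected for a cash reward respond today"
ham        = "meeting moved to 3pm please bring the quarterly report"

# Bag-of-words vectors: one dimension per distinct word
X = CountVectorizer().fit_transform([known_spam, new_spam, ham])

d = euclidean_distances(X)
print(d[0, 1])  # known spam vs. reworded spam: no shared words, so far apart
print(d[0, 2])  # known spam vs. ham: about the same distance, so the
                # metric cannot tell the reworded spam from the ham
```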
 Limitations of KNN as a spam filter

1. Doesn't work well with a large dataset:
Since KNN is a distance-based algorithm, every prediction requires calculating the distance between the new point and each existing point. On a large dataset this cost is very high, which degrades the performance of the algorithm.
2. Doesn't work well with a high number of dimensions:
For the same reason as above, the cost of calculating distances grows with the number of dimensions; in addition, in high-dimensional spaces distances between points become less informative, which further hurts performance.
[Figure: distribution of the e-mails data set]
 3. Sensitive to outliers and missing values:
KNN is sensitive to outliers and missing values, so we first need to impute the missing values and remove the outliers before applying the KNN algorithm.
 4. Needs feature scaling: We need to apply feature scaling (standardization or normalization) before using the KNN algorithm on any dataset. If we don't, features with large numeric ranges dominate the distance, and KNN may generate wrong predictions (see the first sketch after this list).
 5. Predictions depend on the value of k: For different values of k, the prediction for the same data point may vary, so accuracy may be poor (see the second sketch after this list).
 For example, with respect to the given data, if k = 3 the query point belongs to class B,
 but if k = 7 it belongs to class A.
 So, for different values of k, the predictions may vary.
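First, a minimal sketch of the feature-scaling problem, using scikit-learn (the features and numbers are invented for illustration): an unscaled feature with a large range, such as message length in characters, swamps a small-range feature, and standardizing the data flips the prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Invented features: [message length in characters, fraction of spammy words]
X = np.array([[120.0, 0.90], [150.0, 0.85], [130.0, 0.95],    # spam
              [125.0, 0.05], [155.0, 0.10], [135.0, 0.02]])   # ham
y = np.array(["spam", "spam", "spam", "ham", "ham", "ham"])
query = np.array([[128.0, 0.88]])  # spammy content, ordinary length

# Unscaled: the length column dominates the Euclidean distance
print(KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query))  # -> ['ham']

# Scaled: both features contribute comparably
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)
print(knn.predict(scaler.transform(query)))  # -> ['spam']
```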
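Second, a sketch of the sensitivity to k (again with invented toy data), mirroring the k = 3 vs. k = 7 example above: the same query point flips class as k grows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three class-B points huddle near the query; four class-A points sit farther out
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 2.0],                    # class B
              [3.0, 3.0], [3.0, -1.0], [-2.0, 3.0], [-2.0, -1.0]])   # class A
y = np.array(["B", "B", "B", "A", "A", "A", "A"])
query = [[0.5, 0.5]]

for k in (3, 7):
    print(k, KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(query))
# 3 ['B']  <- the three nearest neighbors are all class B
# 7 ['A']  <- widening to seven neighbors lets the four class-A points outvote them
```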
 Failure cases of KNN
CASE 1
In this case, the data is grouped in clusters, but the query point (the yellow point) lies far away from all of them. We can still use the k nearest neighbors to assign a class, but it doesn't make much sense: because the query point is so far from every data point, we can't be very sure about its classification.
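One common safeguard for this case (a pattern assumed here, not something the slides prescribe) is to inspect the actual neighbor distances and refuse to classify when even the closest neighbor is too far away:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])
y = np.array(["A", "A", "B", "B"])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[40.0, -30.0]])  # far from both clusters, as in CASE 1
distances, _ = knn.kneighbors(query)  # distances to the 3 nearest neighbors

if distances.min() > 10.0:  # illustrative threshold; would be tuned per dataset
    print("Query is too far from the training data; classification is unreliable.")
else:
    print(knn.predict(query))
```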
CASE 2
In this case, the data is randomly spread, so no useful information can be obtained from it. Given a query point (the yellow point) in such a scenario, the KNN algorithm will still find the k nearest neighbors, but since the data points are jumbled, the accuracy is questionable.
