
Why is KNN a poor choice for a spam filter?

What is KNN?

 KNN is a very simple algorithm used to solve classification
problems. KNN stands for K-Nearest Neighbors; k is the number
of nearest neighbors whose labels are used to classify a new
point. A minimal sketch in Python follows.
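
A minimal sketch of KNN classification, assuming scikit-learn is
available; the tiny word-count features and labels below are invented
purely for illustration:

    # Minimal KNN sketch (illustrative toy data, not from the slides).
    from sklearn.neighbors import KNeighborsClassifier

    # Toy features per email: [count of "free", count of "meeting"]
    X_train = [[8, 0], [6, 1], [7, 0],   # spam-like
               [0, 5], [1, 4], [0, 6]]   # non-spam-like
    y_train = ["spam", "spam", "spam", "ham", "ham", "ham"]

    # k is the number of nearest neighbors that vote on each prediction
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print(knn.predict([[7, 1]]))  # lands in the spam cluster -> ['spam']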
Why is KNN a poor choice as a spam filter?
 KNN classifiers are good whenever there is a really
meaningful distance metric. In the spam case, a KNN
classifier will label as spam whatever is "close" to
known spam, where "close" is defined by your distance
metric (which, for raw text, will likely be poor).
Therefore, KNN classifiers are only going to filter
spam that is really similar to spam you already know
about; they won't really generalize properly.
Also, you have to train on non-spam examples too,
and KNN will suffer from the same problem: it will
only confidently say something is non-spam if it is
written very similarly to a non-spam email that KNN
was trained on. A contrived sketch of this failure
mode follows.
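
A contrived sketch of the generalization failure, assuming
scikit-learn; the four training messages and the reworded spam are
made up for illustration:

    # Bag-of-words KNN on a tiny invented corpus. A spam message that
    # avoids the known spam vocabulary maps to the zero vector, which
    # is nearest to the shortest ham message, so KNN says "ham".
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    train = ["win free money now", "free money claim prize",
             "meeting agenda attached", "lunch tomorrow at noon"]
    labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(train).toarray()
    knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)

    new_spam = "exclusive offer unlock reward today"  # no seen words
    x = vec.transform([new_spam]).toarray()
    print(knn.predict(x))  # -> ['ham']: "far" from every known spam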
 Limitations of KNN as a spam filter

1. Doesn't work well with a large dataset:
Since KNN is a distance-based algorithm, the cost of
calculating the distance between a new point and every
existing point is very high, which degrades the
performance of the algorithm (see the sketch after
limitation 2).
2. Doesn't work well with a high number of
dimensions:
For the same reason as above: in a higher-dimensional
space, calculating each distance becomes more
expensive. Distances also become less discriminative
as dimensions are added (the curse of dimensionality),
so the "nearest" neighbors become less meaningful.
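
A rough sketch of the per-query cost, assuming NumPy; the sizes n and
d are arbitrary illustrative numbers:

    # Brute-force KNN must compute the distance from the query to all
    # n stored points, so one query costs on the order of n * d
    # operations before we even select the k smallest distances.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100_000, 100          # illustrative: n emails, d features
    X_train = rng.random((n, d))
    query = rng.random(d)

    dists = np.linalg.norm(X_train - query, axis=1)  # ~n * d work
    k = 5
    nearest = np.argpartition(dists, k)[:k]  # indices of the k closest
    print(nearest)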
 [Figure: distribution of the e-mails data set]
 3. Sensitive to outliers and missing values:
KNN is sensitive to outliers and missing values,
so we first need to impute the missing values and
get rid of the outliers before applying the KNN
algorithm, for example as sketched below.
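
A minimal preprocessing sketch, assuming scikit-learn and NumPy; the
data, the mean-imputation strategy, and the z-score threshold of 3
are all illustrative choices:

    # Impute missing values, then drop rows that look like outliers.
    import numpy as np
    from sklearn.impute import SimpleImputer

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(20, 2))
    X[3, 0] = np.nan            # a missing value
    X[7] = [25.0, -30.0]        # an extreme outlier

    # 1. Replace missing values with the column mean
    X = SimpleImputer(strategy="mean").fit_transform(X)

    # 2. Drop rows whose z-score exceeds 3 in any column (a common
    #    heuristic for removing gross outliers)
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    X_clean = X[(z < 3).all(axis=1)]
    print(X_clean.shape)        # (19, 2): the outlier row is gone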
 4. Needs feature scaling: We need to apply
feature scaling (standardization or
normalization) before running the KNN
algorithm on any dataset. If we don't, features
with large numeric ranges dominate the distance
and KNN may generate wrong predictions, as
sketched below.
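
A sketch of why scaling matters, assuming scikit-learn's
StandardScaler; the feature ranges are invented for illustration:

    # Two features on wildly different scales: message length in
    # characters (thousands) and number of links (single digits).
    # Unscaled, Euclidean distance is dominated by message length.
    from sklearn.preprocessing import StandardScaler

    X = [[5000, 1],
         [4800, 9],
         [ 300, 0]]

    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled)  # each column now has mean 0 and unit variance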
 5. Predictions depend on the value of k:
For different values of k, the prediction
for the same data point may vary, so
accuracy may be poor.
 For example, with the given data, if k = 3
the query point belongs to class B,
 but if k = 7, the same point belongs to
class A.
 So, for different values of k, predictions
may vary, as the sketch below demonstrates.
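
A small sketch of this effect, with a made-up 2-D dataset arranged so
the predicted class flips between k = 3 and k = 7, mirroring the
slide's example:

    # The 3 nearest neighbors of the origin are class B, but 4 of the
    # 7 nearest are class A, so the majority vote flips with k.
    from sklearn.neighbors import KNeighborsClassifier

    X = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],              # class B
         [0.2, 0.0], [0.0, 0.2], [-0.2, 0.0], [0.0, -0.2]] # class A
    y = ["B", "B", "B", "A", "A", "A", "A"]

    for k in (3, 7):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, knn.predict([[0.0, 0.0]]))  # k=3 -> ['B'], k=7 -> ['A']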
 Failure cases of KNN
CASE 1
In this case, the data is grouped in
clusters, but the query point lies far
away from the actual grouping. In such
a case, we can still use the k nearest
neighbors to assign a class; however, it
doesn't make much sense, because the
query point (the yellow point in the
figure) is really far from the data
points, so we can't be very sure about
its classification. (A distance-based
sanity check for this case is sketched
after Case 2.)
CASE 2
In this case, the data is randomly
spread, so no useful information can
be obtained from it. In such a scenario,
when we are given a query point (the
yellow point), the KNN algorithm will
still find the k nearest neighbors, but
since the data points are jumbled
together, the accuracy of the prediction
is questionable.
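
A hedged sketch of the Case 1 safeguard, assuming scikit-learn; the
clusters, the query point, and the distance threshold of 3.0 are all
invented for illustration:

    # Inspect the neighbor distances and refuse to classify when the
    # query is far from every training point.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # class A near (0, 0)
                   rng.normal(5, 0.3, (20, 2))])  # class B near (5, 5)
    y = ["A"] * 20 + ["B"] * 20

    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    query = np.array([[20.0, -15.0]])             # far from both clusters
    dist, _ = knn.kneighbors(query)

    # Arbitrary rule of thumb: distrust a prediction whose nearest
    # neighbor is far away relative to the data's scale.
    if dist.min() > 3.0:
        print("query is far from all training data; prediction unreliable")
    else:
        print(knn.predict(query))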
