KNN - Feb 19
- classification algorithm
Jayaraj P B
Outline
1. Supervised Classifier
2. KNN - Overview
3. Working of KNN
4. Pros & Cons
5. Sample problem
Supervised Classifiers
[Figure: a classifier takes the feature values X1, X2, X3, …, Xn of test
instances with unknown categories and outputs a category label Y, after
learning from a database (DB) of instances with known categories.]
K-Nearest Neighbour
• The K-Nearest Neighbors (KNN) algorithm is a supervised machine
learning method employed to tackle classification and regression
problems.
• Evelyn Fix and Joseph Hodges: They introduced the k-NN algorithm
in their paper "Discriminatory Analysis, Nonparametric
Discrimination: Consistency Properties" published in 1951.
• The K-NN algorithm stores all the available data and classifies a new
data point based on its similarity to the stored data.
• It is widely applicable in real-life scenarios since it is non-
parametric, meaning it makes no underlying assumptions about the
distribution of the data.
• At the training phase the KNN algorithm simply stores the dataset;
when it receives new data, it classifies that data into the category
most similar to it.
Example: Suppose we have an image of a creature that looks similar to
both a cat and a dog, and we want to know whether it is a cat or a dog.
We can use the KNN algorithm for this identification, as it works on a
similarity measure. The KNN model will compare the features of the new
image with those of the cat and dog images and, based on the most
similar features, place it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we
have a new data point x1. Which of these categories should this data
point belong to? K-NN answers this by looking at the labels of the
points nearest to x1.
How the K-NN Algorithm works
Suppose we have a new data point and we need to assign it to one of the
existing categories. K-NN proceeds as follows: choose the number of
neighbours K, compute the distance from the new point to every stored
point, select the K nearest neighbours, and assign the new point to the
category that holds the majority among those neighbours.
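The K-NN classification just described can be sketched in a few lines of
plain Python. This is a toy sketch, assuming numeric feature vectors and
Euclidean distance; the function and variable names are illustrative,
not from the slides:

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query, k):
    # Compute the distance from the query to every stored point
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the K nearest neighbours and take a majority vote
    distances.sort(key=lambda d: d[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy example with two categories, A and B
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 2), k=3))  # → A
```

Note that there is no training step beyond storing the data, which is
exactly why K-NN is often called a "lazy" learner.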
Euclidean distance
Manhattan distance
Minkowski distance
Euclidean distance
• This is simply the straight-line (Cartesian) distance between two
points in the plane/hyperplane:
d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
Minkowski distance
• A generalisation of both the Euclidean and the Manhattan distance:
d(x, y) = (|x1 − y1|^p + … + |xn − yn|^p)^(1/p)
p = 2 gives the Euclidean distance and p = 1 gives the Manhattan
distance.
Hamming distance
• There are other distance metrics as well, such as the Hamming
distance, which comes in handy for problems requiring position-wise
comparisons between two vectors whose contents can be Boolean as well
as string values: it counts the positions at which the two vectors
differ.
• Cosine similarity
Measures the angle formed by the two samples (with the
origin).
• Jaccard distance
Determines the percentage of exact matches between the
samples (not including unavailable data).
Cosine Similarity
cos θ = (x · y) / (‖x‖ ‖y‖)
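As an illustration, the distance and similarity measures above can be
computed in plain Python. This is a minimal sketch; in practice a
library such as NumPy or SciPy would normally be used:

```python
import math

def minkowski(a, b, p):
    # General Minkowski distance; p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions at which the two sequences differ
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

x, y = (3, 4), (0, 0)
print(minkowski(x, y, 2))                 # Euclidean: 5.0
print(minkowski(x, y, 1))                 # Manhattan: 7.0
print(hamming("cat", "car"))              # 1 (last characters differ)
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)
```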
Advantages of the KNN Algorithm
• Easy to implement, as the complexity of the algorithm is not that
high.
Applications of the KNN Algorithm
• Image recognition
• Spam email detection
• Medical diagnosis
• Credit risk assessment
• Recommender systems
Problem 2
• Solve the following question using KNN
K=1: Setosa
K=2: Setosa … K=5: Setosa
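The data table for this problem did not survive in these notes, so the
snippet below uses a small hypothetical stand-in (Iris-like
sepal-length/sepal-width pairs invented for illustration) to show how
the prediction is checked for several values of K:

```python
from collections import Counter
import math

# Hypothetical training data standing in for the missing table;
# these are NOT the measurements from the original problem.
train = [((5.1, 3.5), "Setosa"), ((4.9, 3.0), "Setosa"),
         ((4.7, 3.2), "Setosa"), ((7.0, 3.2), "Versicolor"),
         ((6.4, 3.2), "Versicolor")]
query = (5.0, 3.4)  # hypothetical test flower

def predict(k):
    # Sort stored points by Euclidean distance to the query,
    # then majority-vote over the K nearest labels
    dists = sorted((math.dist(p, query), label) for p, label in train)
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

for k in range(1, 6):
    print(f"K={k}: {predict(k)}")
```

With this toy data the prediction stays "Setosa" for every K from 1 to
5, mirroring the answer pattern shown in the slide.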