
K-Nearest Neighbour

- classification algorithm

Jayaraj P B
Outline
1. Supervised Classifier
2. KNN - Overview
3. Working of KNN
4. Pros & Cons
5. Sample problem
Supervised Classifiers

[Diagram: a database (DB) of instances with known categories is used to train a classifier. A test instance with unknown category, described by feature values X1, X2, X3, …, Xn, is fed to the classifier, which outputs its predicted category Y.]
K-Nearest Neighbour
• The K-Nearest Neighbors (KNN) algorithm is a supervised machine
learning method employed to tackle classification and regression
problems.

• Evelyn Fix and Joseph Hodges introduced the k-NN algorithm in their 1951 paper "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties".

• Thomas Cover and Peter Hart popularized the k-NN algorithm and demonstrated its effectiveness in classification tasks in their seminal 1967 paper "Nearest Neighbor Pattern Classification".

• The K-NN algorithm stores all the available data and classifies a new data point based on its similarity to the stored data.

• It is widely applicable in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data.

• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and only acts on it at classification time.

• During the training phase the KNN algorithm simply stores the dataset; when it receives new data, it classifies that data into the category most similar to it.

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, since it works on a similarity measure. The KNN model finds the features of the new image that are most similar to those of the stored cat and dog images and, based on the most similar features, assigns it to either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we receive a new data point x1. Which of these categories should the point be assigned to? K-NN answers this by looking at the categories of the points closest to x1.
How K-NN Algorithm works
Suppose we have a new data point and we need to put it in the required category.

• First, we choose the number of neighbours; here we take k = 5.
• Next, we calculate the Euclidean distance between the new point and every existing data point.
• The Euclidean distance is the familiar straight-line distance between two points from geometry. For points p = (p1, …, pn) and q = (q1, …, qn) it is calculated as
  d(p, q) = sqrt((p1 − q1)² + … + (pn − qn)²)
• Sorting by Euclidean distance gives the nearest neighbours: say, three nearest neighbours in Category A and two in Category B.
• Since the majority of the 5 nearest neighbours (3 of 5) belong to Category A, the new data point is assigned to Category A.
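A minimal Python sketch of this majority-vote procedure (the training points and labels below are invented purely for illustration):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical 2-D data: three points in Category A, three in Category B
X_train = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X_train, y_train, np.array([2, 2]), k=5))  # -> A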
Working of KNN algorithm
• The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbours in the training dataset.

• In a regression problem, the prediction is obtained by taking the average of the target values of the K nearest neighbours.

• This averaged value becomes the predicted output for the query point.
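The same idea for regression, in a short sketch (again with hypothetical data, intended only as an illustration):

import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # Prediction is the mean of the k nearest target values
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 10.5])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))  # ~2.07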
Example
Is a tomato (sweetness = 6, crunchiness = 4) a fruit or a vegetable?
The distance formula compares the values of each feature.

To calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), we apply the Euclidean formula:
  d(tomato, green bean) = sqrt((6 − 3)² + (4 − 7)²) = sqrt(9 + 9) = sqrt(18) ≈ 4.24
Repeating this for the other foods in the training data and taking a majority vote among the nearest neighbours gives:
tomato (sweetness = 6, crunchiness = 4) -> Fruit
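A quick check of that calculation in Python (only the tomato and green bean coordinates come from the slides; the remaining foods would have to be supplied from the full example):

import math

tomato = (6, 4)       # (sweetness, crunchiness)
green_bean = (3, 7)

dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(tomato, green_bean)))
print(round(dist, 2))  # 4.24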
Choosing appropriate k
• Deciding how many neighbours to use for kNN determines how well the model will generalize to future data.

• The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.

• Choosing a large k reduces the impact, or variance, caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns. Conversely, a very small k lets individual noisy points dominate the prediction.
Choosing appropriate k
• In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data.

• Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples.

• In the classifier above, we might set k = 4, because there were 15 example ingredients in the training data and the square root of 15 is about 3.87.
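That rule of thumb is easy to express in code (a heuristic starting point only, not a guarantee of the best k):

import math

n_train = 15                             # number of training examples
k = max(1, round(math.sqrt(n_train)))    # square-root heuristic -> 4
print(k)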
Distance Metrics Used in KNN Algorithm

• The KNN algorithm identifies the nearest points, or groups, for a query point.

• To determine the closest groups or nearest points for a query point, we need a distance metric.

• For this purpose, we commonly use the distance metrics below:
  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
Euclidean distance
• This is simply the Cartesian distance between two points lying in a plane or hyperplane: for points p and q in n dimensions,
  d(p, q) = sqrt(Σ (p_i − q_i)²)

• Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration.

• This metric corresponds to the net displacement between two states of an object.
Manhattan distance
• The Manhattan distance metric is generally used when we are interested in the total distance travelled by an object rather than its displacement.

• It is calculated by summing the absolute differences between the coordinates of the points in n dimensions:
  d(p, q) = Σ |p_i − q_i|
[Figure: comparison of the Manhattan (grid) path and the Euclidean (straight-line) path between the same two points.]
Minkowski distance
• The Minkowski distance generalizes both metrics above. For points p and q in n dimensions and an order parameter r ≥ 1,
  d(p, q) = (Σ |p_i − q_i|^r)^(1/r)
• Setting r = 1 gives the Manhattan distance, and r = 2 gives the Euclidean distance.
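A small sketch of the three metrics, assuming plain Python tuples as points:

import math

def euclidean(p, q):
    # straight-line distance: square root of summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # total grid distance: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=2):
    # general form: r = 1 is Manhattan, r = 2 is Euclidean
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

print(euclidean((6, 4), (3, 7)))        # 4.24...
print(manhattan((6, 4), (3, 7)))        # 6
print(minkowski((6, 4), (3, 7), r=2))   # 4.24...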
Hamming distance
• There are other distance metrics as well, such as the Hamming distance, which comes in handy for problems that require position-by-position comparison of two vectors whose contents can be Boolean or string values.

• The Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols differ.

• In other words, it measures the minimum number of substitutions required to change one vector into the other, or equivalently, the minimum number of errors that could have transformed one vector into the other.
Hamming Distance
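For example, the strings "karolin" and "kathrin" differ in three positions, so their Hamming distance is 3. A minimal sketch:

def hamming(u, v):
    # number of positions at which equal-length sequences differ
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v))

print(hamming("karolin", "kathrin"))        # 3
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # 2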
Other distance measures
• City-block distance (Manhattan distance)
  Sum of the absolute values of the differences

• Cosine similarity
  Measures the angle formed by the two samples (with the origin)

• Jaccard distance
  Determines the percentage of exact matches between the samples (not including unavailable data)
Cosine Similarity
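A brief sketch of cosine similarity, assuming the standard definition cos θ = (A · B) / (‖A‖ ‖B‖):

import math

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (vectors point the same way)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal vectors)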
Advantages of the KNN Algorithm
• Easy to implement – the complexity of the algorithm is not high.

• Adapts easily – because KNN stores all the data in memory, whenever a new example or data point is added the algorithm adjusts automatically, and the new example contributes to future predictions.

• Few hyperparameters – the only parameters required when training a KNN model are the value of k and the choice of distance metric.
Disadvantages of the KNN Algorithm
• Does not scale – KNN is a lazy algorithm: it defers all computation to prediction time, so it needs a lot of computing power and data storage. This makes the algorithm both time-consuming and resource-exhausting.

• Curse of dimensionality – KNN suffers from the peaking phenomenon: it has a hard time classifying data points properly when the dimensionality is too high.

• Prone to overfitting – because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques are generally applied to deal with this problem.
k-NN Time Complexity
• Suppose there are m instances and n features in the dataset.

• The nearest-neighbour algorithm requires computing m distances.

• Each distance computation involves scanning through all n feature values.

• The running time is therefore proportional to m × n.
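A sketch of why the cost is proportional to m × n (a brute-force scan, assuming no indexing structure such as a k-d tree):

def nearest_neighbor(X_train, query):
    # m distance computations, each scanning n feature values: O(m * n)
    best_idx, best_dist = None, float("inf")
    for i, row in enumerate(X_train):                          # m instances
        dist = sum((a - b) ** 2 for a, b in zip(row, query))   # n features
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx

print(nearest_neighbor([[1, 2], [5, 5], [9, 1]], [4, 4]))  # 1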


Applications

Real-world applications of k-NN classification:

• Image recognition
• Spam email detection
• Medical diagnosis
• Credit risk assessment
• Recommender systems
Problem 2
• Solve the following question using KNN.
K = 1: Setosa
K = 2: Setosa … K = 5: Setosa
