8.2. Machine Learning K Nearest Neighbor
K-Nearest Neighbor
Part – 2
Dr. Oybek Eraliev,
Department of Computer Engineering
Inha University in Tashkent.
Email: [email protected]
Ø Have you ever heard of the Gestalt Principles? These are part of a theory
of perception that examines how humans group similar objects in order
to interpret the complex world around them.
Ø The KNN algorithm predicts responses for new data (testing data) based
upon its similarity with other known data (training) samples. It assumes
that data with similar traits sit together and uses distance measures at its
core.
Ø Vectors that are most similar have an angle of 0 degrees between them (cos 0° = 1), while vectors that are most dissimilar point in opposite directions and have a cosine similarity of −1 (cos 180° = −1). The smaller the angle, the higher the similarity.
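To make the cosine measure concrete, here is a minimal sketch in Python (the use of NumPy and the helper name cosine_similarity are assumptions for this example):

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between u and v: 1 = same direction, -1 = opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0  (0 degrees)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (180 degrees)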
Ø After the user defines a distance function, like the ones we mentioned earlier, KNN calculates the distance between data points in order to find, for any new data point, the closest points in the training data. The existing data points closest to the new point under the defined distance become its “k-neighbors”.
Ø For a classification task, KNN will use the most frequent of all values
from the k-neighbors to predict the new data label.
Ø For a regression task, the algorithm will use the average of all values from
the k-neighbors to predict the new data value.
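As a concrete illustration, here is a minimal from-scratch sketch of both prediction modes in Python. The helper names (k_nearest, knn_classify, knn_regress) and the choice of Euclidean distance are assumptions for this example, not part of any particular library:

import numpy as np
from collections import Counter

def k_nearest(X_train, x_new, k):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest training points

def knn_classify(X_train, y_train, x_new, k):
    idx = k_nearest(X_train, x_new, k)
    return Counter(y_train[idx]).most_common(1)[0][0]  # most frequent neighbor label

def knn_regress(X_train, y_train, x_new, k):
    idx = k_nearest(X_train, x_new, k)
    return y_train[idx].mean()  # average of the neighbors' values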
Source: https://commons.wikimedia.org/wiki/File:KNN_decision_surface_animation.gif
Ø All KNN does at training time is store the complete dataset; it performs no calculation or modeling on top of it. Only at prediction time does it measure the distance between the new data point and the stored data points.
Ø For this reason, and since there is no real learning process happening, KNN is called a “lazy” algorithm (as opposed to “eager” algorithms like Decision Trees, which build generalized models before performing predictions on new data points).
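This laziness is visible in code: “fitting” a KNN model is nearly instantaneous because, in essence, it only stores the training data. A minimal sketch using scikit-learn (assuming the library is installed):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)              # "fit" mostly stores the data (optionally building a search index)
print(model.predict(X[:2]))  # the distance computations happen here, at query time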
Worked example: classifying a new iris flower from its sepal measurements.

Training data:                                New (testing) data:

Sepal Length  Sepal Width  Species           Sepal Length  Sepal Width  Species
5.3           3.7          Setosa            5.2           3.1          ?
5.1           3.8          Setosa
7.2           3.0          Virginica
5.4           3.4          Setosa
5.1           3.3          Setosa
5.4           3.9          Setosa
7.4           2.8          Virginica
6.1           2.8          Versicolor
7.3           2.9          Virginica
6.0           2.7          Versicolor
5.8           2.8          Virginica
6.3           2.3          Versicolor

Step 1: Find the distance. Using the Euclidean distance between the new point (x, y) and a training point (a, b):

D(Sepal Length, Sepal Width) = √((x − a)² + (y − b)²)

For the first training sample:

D(Sepal Length, Sepal Width) = √((5.2 − 5.3)² + (3.1 − 3.7)²) = 0.608

Step 2: Compute the distance from the new point to every training sample and rank the samples from closest to farthest:

Sepal Length  Sepal Width  Species      Distance  Rank
5.3           3.7          Setosa       0.608     3
5.1           3.8          Setosa       0.707     5
7.2           3.0          Virginica    2.002     10
5.4           3.4          Setosa       0.361     2
5.1           3.3          Setosa       0.224     1
5.4           3.9          Setosa       0.825     6
7.4           2.8          Virginica    2.220     12
6.1           2.8          Versicolor   0.949     8
7.3           2.9          Virginica    2.110     11
6.0           2.7          Versicolor   0.894     7
5.8           2.8          Virginica    0.671     4
6.3           2.3          Versicolor   1.360     9

Step 3: Find the nearest neighbors. If K = 3, the three closest training samples are:

Sepal Length  Sepal Width  Species  Distance  Rank
5.1           3.3          Setosa   0.224     1
5.4           3.4          Setosa   0.361     2
5.3           3.7          Setosa   0.608     3

All three neighbors are Setosa, so the new flower is classified as Setosa.
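The same walk-through can be reproduced in a few lines of Python (a sketch; the arrays simply hard-code the twelve training samples from the table above):

import numpy as np
from collections import Counter

X_train = np.array([[5.3, 3.7], [5.1, 3.8], [7.2, 3.0], [5.4, 3.4],
                    [5.1, 3.3], [5.4, 3.9], [7.4, 2.8], [6.1, 2.8],
                    [7.3, 2.9], [6.0, 2.7], [5.8, 2.8], [6.3, 2.3]])
y_train = np.array(["Setosa", "Setosa", "Virginica", "Setosa",
                    "Setosa", "Setosa", "Virginica", "Versicolor",
                    "Virginica", "Versicolor", "Virginica", "Versicolor"])
x_new = np.array([5.2, 3.1])

dists = np.linalg.norm(X_train - x_new, axis=1)  # Steps 1-2: Euclidean distances
nearest = np.argsort(dists)[:3]                  # Step 3: ranks 1-3 for K = 3
print(dists.round(3))                            # 0.608, 0.707, 2.002, ...
print(Counter(y_train[nearest]).most_common(1)[0][0])  # -> Setosa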
Ø The value of k can range from 1 (in which case the algorithm looks only at the single closest neighbor for each prediction) to the total number of data points in the dataset (in which case it would simply predict the majority class of the complete dataset).
Ø Source: ShareTechnote
Ø So how can you find the optimal value of k? One way is to compare the prediction error on a training set and a testing set across different values of k, as in the sketch below. Separating the data into training and test sets allows for an objective model evaluation.
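A short sketch of this k-selection procedure with scikit-learn (the 70/30 split and the range of k values are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # test error = fraction of held-out samples the model misclassifies
    print(k, round(1 - model.score(X_te, y_te), 3))

The value of k with the lowest test error is a reasonable choice; a very small k tends to overfit the training data, while a very large k underfits.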
While the KNN algorithm is simple, it also has a set of challenges and limitations, due in part to its simplicity: