8.2. Machine Learning K Nearest Neighbor

The document provides an overview of the K-Nearest Neighbor (KNN) algorithm, detailing its principles, distance measures, and implementation in Python. KNN is a supervised learning method used for classification and regression tasks, relying on the proximity of data points to make predictions. It discusses how to determine the optimal number of neighbors (k) and highlights the algorithm's applications across various industries.

Machine Learning:

K-Nearest Neighbor
Part – 2
Dr. Oybek Eraliev,
Department of Computer Engineering
Inha University in Tashkent.
Email: [email protected]



Content

• Introduction to K-Nearest Neighbor
• Distance Measures
• How does KNN work?
• How to find k?
• Why KNN?
• Implementation of KNN in Python


Introduction to K-Nearest Neighbor

• Have you ever heard of the Gestalt Principles? These are part of a theory of perception that examines how humans group similar objects in order to interpret the complex world around them.

• According to the Gestalt principle of proximity, elements that are closer together are perceived to be more related than elements that are farther apart, helping us understand and organize information faster and more efficiently.


Introduction to K-Nearest Neighbor

Our brains perceive these sets of closely placed elements as groups.


Proximity is essential to our perception. (Source: UXMISFIT)


Introduction to K-Nearest Neighbor

• Likewise, in Machine Learning, the concepts of proximity and similarity are tightly linked, meaning that the closer two data points are, the more similar they are to one another than to other data points. The content recommendation systems you use every day for movies, texts, songs, and more rely on this principle.

• One Machine Learning algorithm that relies on the concepts of proximity and similarity is K-Nearest Neighbor (KNN). KNN is a supervised learning algorithm capable of performing both classification and regression tasks.


Introduction to K-Nearest Neighbor

• The KNN algorithm predicts responses for new (testing) data based on its similarity with known (training) samples. It assumes that data with similar traits sit together, and it uses distance measures at its core.

• KNN belongs to the class of non-parametric models: it doesn't learn parameters from the training dataset to come up with a discriminative function for predicting new, unseen data. Instead, it operates by memorizing the training dataset.


Introduction to K-Nearest Neighbor

Given its flexibility, KNN is used in industries such as:

• Agriculture: climate forecasting, estimating soil water parameters, or predicting crop yields.
• Finance: predicting bankruptcies, and understanding and managing financial risk.
• Healthcare: identifying cancer risk factors, predicting heart attacks, or analyzing gene expression data.
• Internet: using clickstream data from websites to provide automatic recommendations to users on additional content.




Distance Measures

• How near are two data points? It depends on how you measure it, and in Machine Learning there isn't just one way of measuring distances.

• A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain, and it plays an important role in most algorithms.


Distance Measures

• Euclidean distance is probably the most intuitive measure: it represents the shortest (straight-line) distance between two points and is calculated using the well-known Pythagorean theorem.

• Conceptually, it should be used whenever we compare observations with continuous features, like height, weight, or salary. This distance measure is often the "default" distance used in algorithms like KNN.
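
In symbols (the slide's formula image is not preserved here; this is the standard form), for points p = (p₁, …, pₙ) and q = (q₁, …, qₙ):

d(p, q) = √( (p₁ − q₁)² + … + (pₙ − qₙ)² )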



Distance Measures

• Manhattan distance estimates the distance to get from one data point to another if a grid-like path is taken. Unlike Euclidean distance, the Manhattan distance is the sum of the absolute differences of the coordinates of two points. This way, instead of drawing a straight line between the two points, we "walk" through the available paths.

• The Manhattan distance is useful when our observations are distributed along a grid (when the features of our observations are integers with no decimal parts).
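
The standard form (the slide's formula image is not preserved):

d(p, q) = |p₁ − q₁| + … + |pₙ − qₙ|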
Distance Measures

• Cosine distance is employed to calculate the similarity between two vectors. Under this measure, data objects in a dataset are treated as vectors, and similarity is calculated as the cosine of the angle between two vectors.

• Vectors that are most similar have an angle of 0° between them (and cos 0° = 1), while vectors that are most dissimilar point in opposite directions, at 180°, where the cosine is −1. The smaller the angle, the higher the similarity.

• This different perspective on distance can provide novel insights that might not be found using the previous distance metrics.
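
The standard form (the slide's formula image is not preserved) defines cosine similarity between vectors p and q as

cos θ = (p · q) / (‖p‖ ‖q‖)

with cosine distance commonly taken as 1 − cos θ.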





Distance Measures

• Hamming distance is typically used with Boolean or string vectors, identifying the positions at which the vectors do not match.

• As a result, it has also been referred to as the overlap metric. It can be represented with the following formula:
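
The formula image from this slide is not preserved in the extraction; the standard form, for two Boolean vectors x and y of length n, is

d_H(x, y) = Σᵢ |xᵢ − yᵢ|

i.e., a count of the positions at which the two vectors differ (0 when x = y).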



Distance Measures

• Minkowski distance is a generalization of the Euclidean and Manhattan distance metrics, which enables the creation of other distance metrics. It is calculated in a normed vector space.

• In the Minkowski distance, p is the parameter that defines the type of distance used in the calculation. If p = 1, the Manhattan distance is used; if p = 2, the Euclidean distance is used.
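
The standard form (the slide's formula image is not preserved):

d(p, q) = ( Σᵢ |pᵢ − qᵢ|ᵖ )^(1/p)

which reduces to the Manhattan distance at p = 1 and the Euclidean distance at p = 2. To make the five measures concrete, here is a minimal from-scratch sketch in Python; the function names are illustrative, not from the slides:

import math

def euclidean(x, y):
    # Straight-line distance (Pythagorean theorem).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Grid-path distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def hamming(x, y):
    # Number of positions at which the vectors differ.
    return sum(a != b for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (5.2, 3.1), (5.3, 3.7)
print(euclidean(x, y))                # 0.608...
print(manhattan(x, y))                # ≈ 0.7
print(minkowski(x, y, 2))             # same as euclidean(x, y)
print(hamming("karolin", "kathrin"))  # 3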







How does KNN work?

• After the user defines a distance function, like the ones mentioned earlier, KNN calculates distances between data points in order to find, for any new data point, the closest points in the training data. The existing data points closest to the new data point under the chosen distance become the "k neighbors".

• For a classification task, KNN predicts the new data point's label as the most frequent label among the k neighbors.

• For a regression task, the algorithm predicts the new data point's value as the average of the values of the k neighbors.


(Animation: KNN decision surface. Source: https://commons.wikimedia.org/wiki/File:KNN_decision_surface_animation.gif)


How does KNN work?

• All KNN does is store the complete dataset and, without doing any calculation or modeling on top of it, measure the distance between the new data point and its closest data points.

• For this reason, and since there's not really a learning process happening, KNN is called a "lazy" algorithm (as opposed to "eager" algorithms like Decision Trees, which build generalized models before performing predictions on new data points).


How does KNN work?

Training data:

Sepal Length | Sepal Width | Species
5.3          | 3.7         | Setosa
5.1          | 3.8         | Setosa
7.2          | 3.0         | Virginica
5.4          | 3.4         | Setosa
5.1          | 3.3         | Setosa
5.4          | 3.9         | Setosa
7.4          | 2.8         | Virginica
6.1          | 2.8         | Versicolor
7.3          | 2.9         | Virginica
6.0          | 2.7         | Versicolor
5.8          | 2.8         | Virginica
6.3          | 2.3         | Versicolor

New data point:

Sepal Length | Sepal Width | Species
5.2          | 3.1         | ?

Step 1: Find the distance from the new point to each training sample, e.g. for the first row:

D(Sepal Length, Sepal Width) = √((x − a)² + (y − b)²)
D(Sepal Length, Sepal Width) = √((5.2 − 5.3)² + (3.1 − 3.7)²)
D(Sepal Length, Sepal Width) = 0.608

Sepal Length | Sepal Width | Species | Distance
5.3          | 3.7         | Setosa  | 0.608


How does KNN work?

Step 2: Find the rank, sorting all training samples by distance.

Sepal Length | Sepal Width | Species    | Distance
5.3          | 3.7         | Setosa     | 0.608
5.1          | 3.8         | Setosa     | 0.707
7.2          | 3.0         | Virginica  | 2.002
5.4          | 3.4         | Setosa     | 0.36
5.1          | 3.3         | Setosa     | 0.22
5.4          | 3.9         | Setosa     | 0.82
7.4          | 2.8         | Virginica  | 2.22
6.1          | 2.8         | Versicolor | 0.94
7.3          | 2.9         | Virginica  | 2.1
6.0          | 2.7         | Versicolor | 0.89
5.8          | 2.8         | Virginica  | 0.67
6.3          | 2.3         | Versicolor | 1.36


How does KNN work?

Step 3: Find the nearest neighbors.

Sepal Length | Sepal Width | Species    | Distance | Rank
5.3          | 3.7         | Setosa     | 0.608    | 3
5.1          | 3.8         | Setosa     | 0.707    | 5
7.2          | 3.0         | Virginica  | 2.002    | 10
5.4          | 3.4         | Setosa     | 0.36     | 2
5.1          | 3.3         | Setosa     | 0.22     | 1
5.4          | 3.9         | Setosa     | 0.82     | 6
7.4          | 2.8         | Virginica  | 2.22     | 12
6.1          | 2.8         | Versicolor | 0.94     | 8
7.3          | 2.9         | Virginica  | 2.1      | 11
6.0          | 2.7         | Versicolor | 0.89     | 7
5.8          | 2.8         | Virginica  | 0.67     | 4
6.3          | 2.3         | Versicolor | 1.36     | 9

If K = 3, the three nearest neighbors are:

Sepal Length | Sepal Width | Species | Distance | Rank
5.1          | 3.3         | Setosa  | 0.22     | 1
5.4          | 3.4         | Setosa  | 0.36     | 2
5.3          | 3.7         | Setosa  | 0.608    | 3

All three neighbors are Setosa, so the new data point (5.2, 3.1) is classified as Setosa:

Sepal Length | Sepal Width | Species
5.2          | 3.1         | Setosa
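
The three steps above map directly to code. Below is a minimal from-scratch sketch of this worked example in Python (the slides do not include this snippet; the names are illustrative):

import math
from collections import Counter

# Training data from the worked example: (sepal length, sepal width, species).
train = [
    (5.3, 3.7, "Setosa"), (5.1, 3.8, "Setosa"), (7.2, 3.0, "Virginica"),
    (5.4, 3.4, "Setosa"), (5.1, 3.3, "Setosa"), (5.4, 3.9, "Setosa"),
    (7.4, 2.8, "Virginica"), (6.1, 2.8, "Versicolor"), (7.3, 2.9, "Virginica"),
    (6.0, 2.7, "Versicolor"), (5.8, 2.8, "Virginica"), (6.3, 2.3, "Versicolor"),
]
query = (5.2, 3.1)

def knn_classify(train, query, k=3):
    # Step 1: Euclidean distance from the query to every training sample.
    distances = [(math.hypot(x - query[0], y - query[1]), label)
                 for x, y, label in train]
    # Step 2: rank the samples by distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: take the k nearest neighbors and vote on the label.
    votes = [label for _, label in distances[:k]]
    return Counter(votes).most_common(1)[0][0]

print(knn_classify(train, query, k=3))  # -> Setosa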





How to find k?

• With KNN, in order to perform a classification/regression task, you need to define a number of neighbors, and that number is given by the parameter "k". In other words, "k" determines the number of neighbors the algorithm looks at when assigning a value to any new observation.

• This number can go from 1 (in which case the algorithm only looks at the closest neighbor for each prediction) to the total number of data points in the dataset (in which case the algorithm would predict the majority class of the complete dataset).


How to find k?

• In the example figure (from ShareTechnote; the image is not preserved here), if k = 3 the new data point is classified as B.

• But if k = 6, it is classified as A, because the majority of the six nearest points belong to class A.


How to find k?

• So how can you know the optimal value of "k"? We can decide based on the error calculated on a training set and a testing set; separating the data into training and test sets allows for an objective model evaluation.

• One popular approach is to test different values of "k" and measure the resulting error, choosing the "k" value at which increasing k further yields only a very small decrease in the error sum, while decreasing k sharply increases it. The point that defines this optimal number is known as the "elbow point".
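
As a concrete illustration, here is a minimal sketch of this search, assuming scikit-learn and its built-in Iris dataset (the slides do not show this code):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Measure the test error rate over a range of k values and look for the
# point where the curve flattens out (the "elbow").
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  error rate={1 - model.score(X_test, y_test):.3f}")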



How to find k?

• The "k" value best suited for a dataset is selected by using the error rate (that is, the difference between the real and predicted values). After a certain point, the error rate becomes almost constant.

• In the error-rate plot (not preserved here), we can see that for "k" > 5 the error rate stabilizes and becomes almost constant. For this reason, "k" = 5 seems to be the optimal value.




Why KNN?
Advantages of the KNN algorithm

The KNN algorithm is often described as the "simplest" supervised learning algorithm, and this simplicity gives it several advantages:

✓ Simple: KNN is easy to implement because of how simple and accurate it is. As such, it is often one of the first classifiers that a data scientist will learn.
✓ Adaptable: As soon as new training samples are added to its dataset, the KNN algorithm adjusts its predictions to include the new training data.
✓ Easily programmable: KNN requires only a few hyperparameters, a k value and a distance metric, which makes it a fairly uncomplicated algorithm.


Why KNN?
Challenges and limitations of KNN

While the KNN algorithm is simple, it also has a set of challenges and limitations, due in part to that simplicity:

✓ Difficult to scale: Because KNN stores the entire training set, it takes up a lot of memory and data storage, driving up storage costs. This reliance on memory also means that the algorithm is computationally intensive at prediction time, which is in turn resource-intensive.


Why KNN?
Challenges and limitations of KNN

✓ Curse of dimensionality: This refers to a phenomenon in computer science wherein a fixed set of training examples is challenged by an increasing number of dimensions and the inherent growth of feature values in those dimensions. In other words, the model's training data can't keep up with the evolving dimensionality of the hyperspace. Predictions become less accurate because the distance between the query point and similar points grows wider along the additional dimensions.
✓ Overfitting: The value of k, as shown earlier, impacts the algorithm's behavior, especially when k is too low. Lower values of k can overfit the data, while higher values of k "smooth" the prediction values, because the algorithm averages values over a greater area.




Implementation of KNN in Python
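
The code from the original slides is not preserved in this extraction. As a stand-in, here is a minimal sketch of a typical KNN classification workflow with scikit-learn; the dataset, library, and parameter choices are assumptions, not the slides' own:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scale the features: KNN relies on distances, so features on larger
# scales would otherwise dominate the distance calculation.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit a KNN classifier with k = 5 and Euclidean distance (p = 2 in the
# Minkowski metric, which is scikit-learn's default).
model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))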
