Group Assignment: Understanding the K-Nearest Neighbors (KNN) Algorithm


Group Members

Nguyễn Ngọc Thanh Lam
Lê Ngọc Hân
Đỗ Nguyễn Quỳnh Như
Phan Ngọc Diễm Trinh
Mai Thị Huỳnh Như
Lê Thị Thúy Ngọc

1. What is the K-Nearest Neighbors Algorithm?


- K-Nearest Neighbors (KNN) is a supervised machine learning model that can be
used for both regression and classification tasks. The algorithm is non-parametric,
meaning it makes no assumptions about the underlying distribution of the data.
- KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and is widely applied in
pattern recognition, data mining, and intrusion detection.
- It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and only performs computation at
classification time.
- During the training phase, KNN simply stores the dataset; when it receives new
data, it classifies that data into the category it is most similar to.
Example: Suppose we have an image of a creature that looks similar to both a cat and
a dog, and we want to know which it is. We can use the KNN algorithm for this
identification, since it works on a similarity measure. The KNN model compares the
features of the new image with those of the cat and dog images and, based on the most
similar features, places it in either the cat or the dog category.
2. How does K-Nearest Neighbors Classification work?

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it
predicts the label or value of a new data point by considering the labels or values of its K
nearest neighbors in the training dataset.

A step-by-step explanation of how KNN works is given below:


Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while making a
prediction.
Step 2: Calculating distance
To measure the similarity between the target point and the training data points, Euclidean
distance is commonly used. The distance is calculated between the target point and each
data point in the dataset.
Step 3: Finding Nearest Neighbors
The K data points with the smallest distances to the target point are its nearest neighbors.
Step 4: Voting for Classification or Taking the Average for Regression
In a classification problem, the class label is determined by majority voting among the
K nearest neighbors. The class with the most occurrences among the neighbors
becomes the predicted class for the target data point.
In a regression problem, the predicted value is calculated by taking the average of the target
values of the K nearest neighbors. This average becomes the predicted output for the
target data point.
Let X be the training dataset with n data points, where each data point Xi is represented by a
d-dimensional feature vector, and let Y be the corresponding labels or values for each data
point in X. Given a new data point x, the algorithm calculates the distance between x and
each data point Xi in X using a distance metric, such as the Euclidean distance:

d(x, Xi) = sqrt( (x1 − Xi1)² + (x2 − Xi2)² + … + (xd − Xid)² )

The algorithm then selects the K data points from X that have the shortest distances to x. For
classification tasks, it assigns to x the label y that is most frequent among the K
nearest neighbors. For regression tasks, it calculates the average or weighted average of the
values y of the K nearest neighbors and assigns it as the predicted value for x.
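To make the four steps concrete, here is a minimal from-scratch sketch of the classification
case in Python. The function name, toy data, and labels are purely illustrative and not taken
from the text above:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance between x_new and every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of the k nearest neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage with two 2-D classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = ["cat", "cat", "dog", "dog"]
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints "cat"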

3. Distance Metrics Used in KNN Algorithm


The distance between the new data point and all the points in the training set is
calculated. Common methods of calculating the distance include:

● Euclidean Distance

This is the Cartesian distance between two points in the plane or hyperplane. Euclidean
distance can also be visualized as the length of the straight line joining the two points
under consideration. This metric corresponds to the net displacement between two
states of an object.

● Manhattan Distance

The Manhattan distance metric is generally used when we are interested in the total
distance traveled by the object rather than its displacement. It is calculated by summing
the absolute differences between the coordinates of the points across the n dimensions.

● Minkowski Distance

Both the Euclidean and the Manhattan distance are special cases of the Minkowski
distance:

D(x, y) = ( Σ |xi − yi|^p )^(1/p), summed over the d coordinates.

From this formula we can see that when p = 2 it is the same as the formula for the
Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
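Because Euclidean and Manhattan distance are just Minkowski with p = 2 and p = 1, a single
helper covers all three. The snippet below is a small illustrative sketch assuming NumPy is
available; the sample vectors are made up for demonstration:

import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, p=1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(a, b, p=2))  # Euclidean: sqrt(9 + 16 + 0)  = 5.0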
4. How to choose the value of K for the KNN Algorithm?
There is no fixed rule for determining the best value of K, so we need to try several
values and pick the one that works best. A commonly preferred value for K is 5.
A very low value of K, such as K = 1 or K = 2, can be noisy and makes the model
sensitive to outliers.
Larger values of K smooth out noise, but if K is too large the model can have difficulty
distinguishing between classes and becomes more expensive to compute.
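In practice, a common way to "try some values" is cross-validation. The sketch below
assumes scikit-learn is available and uses its bundled Iris dataset purely as an example;
the range of K values tested is an arbitrary choice:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of K with 5-fold cross-validation and keep the best one
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with accuracy", round(scores[best_k], 3))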
5. K-nearest neighbors for handwriting recognition

One of the classic applications of the K-Nearest Neighbors (KNN) algorithm is
handwriting recognition, particularly for recognizing handwritten digits. This
application demonstrates the versatility and effectiveness of KNN in dealing with
high-dimensional data. Here is how KNN is used for handwriting recognition:

How It Works:

- Dataset: The algorithm uses datasets like the MNIST (Modified National
Institute of Standards and Technology) database, which contains thousands of
labeled images of handwritten digits (0–9). Each image is represented as a
high-dimensional vector of pixel intensities (e.g., a 28x28 pixel image flattened
into a 784-dimensional vector).
- Feature Extraction: In this task, each pixel in the image is treated as a feature.
The goal is to compare the pixel intensity values between the test image and the
training images.
- Distance Calculation: To classify a new handwritten digit, KNN computes the
distance between the feature vector of the test image and the feature vectors of
all the training images. The Euclidean distance is commonly used, though other
distance metrics can also be applied.
- Finding Neighbors: The algorithm identifies the K nearest neighbors—training
images with the smallest distances to the test image.
- Classification: KNN assigns the most frequently occurring digit among the
K nearest neighbors as the predicted label for the test image.
Advantages in Handwriting Recognition

- Simplicity: KNN is straightforward and requires no complex training phase,
making it ideal for rapid prototyping.
- High Accuracy for Small Datasets: When paired with a robust distance metric,
KNN performs well on relatively small handwriting datasets.

Challenges

- High Dimensionality: Each image contributes hundreds of features, leading to
increased computation time and memory usage during classification.
- Scalability: KNN becomes computationally expensive with large datasets since
it involves computing distances to every training sample during inference.

Example:

Suppose you want to classify the following digit:

- Input: A 28x28 image of a handwritten number "7".
- Steps:
○ Flatten the image into a 784-dimensional vector.
○ Calculate distances between this vector and all labeled images in the
training set.
○ Select the K nearest neighbors (e.g., K=5).
○ Perform majority voting. If the nearest neighbors have labels [7, 7, 1, 7,
3], the predicted label is "7".
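The same steps can be sketched with scikit-learn. Note that this sketch uses scikit-learn's
bundled 8x8 digits dataset as a small stand-in for MNIST, so each image flattens to a
64-dimensional vector instead of 784; the train/test split and K = 5 are illustrative choices:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load 8x8 grayscale digit images, already flattened into 64-dimensional vectors
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# K=5 nearest neighbors with Euclidean distance (the default metric)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

print("Predicted label for first test image:", clf.predict(X_test[:1])[0])
print("Test accuracy:", round(clf.score(X_test, y_test), 3))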

Optimization Techniques

- Dimensionality Reduction: Techniques like Principal Component Analysis
(PCA) can reduce the number of features, improving computational efficiency.
- Efficient Data Structures: Data structures like KD-trees or Ball-trees can speed
up nearest-neighbor search.
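Both optimizations can be combined in a single pipeline. The following sketch (again using
scikit-learn's digits dataset; the number of PCA components and the KD-tree setting are
illustrative assumptions, not tuned values) shows the idea:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Project the 64 pixel features down to 20 principal components,
# then search for neighbors with a KD-tree instead of brute force.
pipeline = make_pipeline(
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),
)
print("Cross-validated accuracy:", round(cross_val_score(pipeline, X, y, cv=5).mean(), 3))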

By leveraging KNN, handwriting recognition systems can effectively classify digits
and characters, forming the basis for many real-world applications, such as postal
code recognition and form processing.
