Group Report on the KNN Algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels or values of its K
nearest neighbors in the training dataset.
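As a minimal illustration of this principle, the Python sketch below (our own example; the function name knn_predict and the toy dataset are assumptions, not part of the report) classifies a new point by a majority vote among its K nearest training points:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny toy dataset: two clusters labeled 0 and 1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0
```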
● Euclidean Distance
This is the Cartesian distance between two points in a plane or hyperplane: the
length of the straight line segment joining them. Intuitively, it measures the net
displacement between two states of an object:
d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
● Manhattan Distance
The Manhattan distance is generally used when we are interested in the total
distance traveled by an object along the coordinate axes rather than its net
displacement. It is calculated by summing the absolute differences between the
coordinates of the two points in n dimensions:
d(x, y) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|
● Minkowski Distance
The Euclidean and Manhattan distances are both special cases of the Minkowski
distance:
d(x, y) = (|x₁ − y₁|^p + |x₂ − y₂|^p + … + |xₙ − yₙ|^p)^(1/p)
From this formula we can see that p = 2 gives the Euclidean distance and p = 1
gives the Manhattan distance.
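All three metrics are easy to compute directly; the sketch below (our own illustration, not from the report) implements the Minkowski distance in NumPy and checks that p = 1 and p = 2 reproduce the Manhattan and Euclidean distances:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, p=1))          # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(x, y, p=2))          # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
print(np.linalg.norm(x - y, ord=1))  # same as p = 1
print(np.linalg.norm(x - y, ord=2))  # same as p = 2
```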
4. How to choose the value of K for the KNN algorithm?
There is no single rule for determining the best value of K, so in practice we try
several values and keep the one that performs best on held-out data (for example,
via cross-validation). K = 5 is a common default.
A very low value such as K = 1 or K = 2 makes the model noisy and sensitive to
outliers in the training data.
Larger values of K smooth out this noise, but if K is too large the neighborhood
starts to include points from other classes, the decision boundary becomes overly
coarse, and predictions get slower because more neighbors must be examined.
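One standard way to pick K along these lines is to score several candidate values with cross-validation and keep the best one. The sketch below does this with scikit-learn on a synthetic dataset (the library, the candidate list, and the dataset are our own choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class dataset standing in for real data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Mean 5-fold cross-validation accuracy for each candidate K
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 7, 9, 11]}

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```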
5. K-nearest neighbors for handwriting recognition
How It Works:
- Dataset: The algorithm uses datasets like the MNIST (Modified National
Institute of Standards and Technology) database, which contains thousands of
labeled images of handwritten digits (0–9). Each image is represented as a
high-dimensional vector of pixel intensities (e.g., a 28x28 pixel image flattened
into a 784-dimensional vector).
- Feature Extraction: In this task, each pixel in the image is treated as a feature.
The goal is to compare the pixel intensity values between the test image and the
training images.
- Distance Calculation: To classify a new handwritten digit, KNN computes the
distance between the feature vector of the test image and the feature vectors of
all the training images. The Euclidean distance is commonly used, though other
distance metrics can also be applied.
- Finding Neighbors: The algorithm identifies the K nearest neighbors—training
images with the smallest distances to the test image.
- Classification: KNN assigns the most frequently occurring digit among the K
nearest neighbors as the predicted label for the test image.
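Putting these steps together, the sketch below runs the whole pipeline with scikit-learn. For a quick, self-contained example it uses the library's built-in digits dataset (8x8 images flattened to 64-dimensional vectors) as a lightweight stand-in for the full 28x28 MNIST database described above:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Dataset: each 8x8 digit image is already flattened into a
# 64-dimensional vector of pixel intensities (the features)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Distance calculation, neighbor search, and majority vote are all
# handled by KNeighborsClassifier (Euclidean distance by default)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```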
Advantages in Handwriting Recognition
Challenges
Example:
Optimization Techniques