KNN: K-NEAREST NEIGHBOR
Presented By:
Mahnoor Farooq - 2021-CS-403
Ayesha Nadeem - 2021-CS-413
Saria Irshad - 2021-CS-425
Saba Shahzadi - 2021-CS-411
• Introduction
• Distance Metrics
• Choice of k
• Search Algorithm
• Challenges & Limitations
• Conclusion
NAMES FOR K-NN
• K-Nearest Neighbors
• Memory-based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Lazy Learning
KNN-PRINCIPLE
K-NN predicts a query point from the labels (or values) of its K closest training points, so it needs a distance metric, such as:
• Euclidean Distance: the straight-line distance between two points (worked example on the next slide).
• Manhattan Distance: the distance between the starting point and the destination measured along the coordinate axes, i.e., the sum of the absolute differences of the coordinates.
• Hamming Distance: the number of positions at which two equal-length vectors (or strings) differ.
A small sketch computing these metrics follows this list.
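As an illustrative addition to these slides, here is a minimal plain-Python sketch of the three metrics; the example points are arbitrary and the helper names are just for this sketch.

import math

def euclidean(p, q):
    # straight-line distance: square root of summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # grid distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(p, q):
    # number of positions at which two equal-length vectors differ
    return sum(a != b for a, b in zip(p, q))

print(euclidean((3, 4), (4, 7)))            # about 3.16 (matches the next slide)
print(manhattan((3, 4), (4, 7)))            # 1 + 3 = 4
print(hamming((1, 0, 1, 1), (1, 1, 1, 0)))  # 2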
EUCLIDEAN DISTANCE
Suppose X1 and X2 are two points in 2-D space:
X1(x1, y1) = X1(3, 4)
X2(x2, y2) = X2(4, 7), as shown in the figure.
For 2-D vectors, the Euclidean distance between X1 and X2 is:
distance = sqrt( (x2 - x1)² + (y2 - y1)² )
Substituting the coordinates:
distance = sqrt( (4 - 3)² + (7 - 4)² )
distance = sqrt( 1 + 9 )
distance = sqrt( 10 )
distance ≈ 3.16
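As a quick optional check of this arithmetic, the same computation in Python (an illustrative addition, not part of the original example):

import math

x1, y1 = 3, 4
x2, y2 = 4, 7
# Euclidean distance between X1(3, 4) and X2(4, 7)
distance = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
print(distance)  # 3.1622..., i.e. about 3.16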
MANHATTAN DISTANCE
distance = |x2 - x1| + |y2 - y1| (the sum of the absolute coordinate differences)
K-NN FOR REGRESSION
• Distance Calculation: Same as for classification, using Euclidean distance or another distance metric.
• Prediction Rule: The predicted value for regression is the average of the values of the K nearest neighbors.
Formula:
ŷ = (1/K) * (y_1 + y_2 + ... + y_K)
• Where y_i is the target value of the i-th nearest neighbor.
• Weighted Average (Optional): A weighted average can be used so that closer neighbors have more influence:
ŷ = ( Σ w_i * y_i ) / ( Σ w_i ), summing over the K nearest neighbors
• Where w_i is a weight based on distance (e.g., w_i = 1 / d_i).
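A minimal brute-force sketch of this prediction rule, assuming NumPy is available; X_train, y_train, and the tiny dataset below are illustrative placeholders.

import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, weighted=False):
    # brute force: Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the K closest points
    if not weighted:
        return y_train[nearest].mean()         # plain average of neighbor targets
    w = 1.0 / (dists[nearest] + 1e-9)          # inverse-distance weights (avoid division by zero)
    return np.sum(w * y_train[nearest]) / np.sum(w)

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # made-up 1-D features
y_train = np.array([1.5, 2.0, 3.5, 4.0])           # made-up targets
print(knn_regress(X_train, y_train, np.array([2.5]), k=2))                  # 2.75
print(knn_regress(X_train, y_train, np.array([2.5]), k=2, weighted=True))   # also 2.75 (the two neighbors are equidistant)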
CHOOSING K FOR CLASSIFICATION & REGRESSION
Choosing K:
• Small K: Small K values (e.g., K=1) are sensitive to noise and can lead to overfitting.
• Large K: Large K values (e.g., K=20) may smooth out the model too much, leading to
underfitting.
• The optimal value of K is typically found by cross-validation (a brief sketch follows this slide).
For Classification:
K should be large enough to avoid noise but small enough to preserve local patterns in
the data.
For Regression:
The optimal K balances fitting the local behavior of the data against generalizing to broader trends.
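A short sketch of picking K by cross-validation, assuming scikit-learn is available; the Iris dataset and the 1-20 search range are stand-ins for the real data and range.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 cross-validation folds
    scores[k] = cross_val_score(model, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])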
KNN FOR CLASSIFICATION VS REGRESSION
Classification: K-NN assigns a class label based on the majority vote of the K
nearest neighbors.
Regression: K-NN predicts a continuous value based on the average (or weighted
average) of the K nearest neighbors' target values.
Mathematical Difference:
In classification, the output is a class label.
In regression, the output is a continuous value (mean or weighted mean of
neighbors’ target values).
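A small sketch of the two modes side by side, assuming scikit-learn's KNeighborsClassifier and KNeighborsRegressor; the toy data and labels are made up for illustration.

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: majority vote of the K nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit([[0], [1], [2], [3]], ["cat", "cat", "dog", "dog"])
print(clf.predict([[1.4]]))   # a class label chosen by majority vote

# Regression: (weighted) average of the K nearest target values
reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
reg.fit([[0], [1], [2], [3]], [0.0, 1.0, 2.0, 3.0])
print(reg.predict([[1.4]]))   # a continuous value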
CHALLENGES & LIMITATIONS
1. Computational Cost (Slow Prediction Time):
Problem: K-NN calculates the distance between the test point and all points in the training
dataset at prediction time, making it computationally expensive.
Example: In a dataset with 1 million data points, predicting the class of a new data point
requires calculating the distance between this point and all 1 million points, which can be
slow.
2. Curse of Dimensionality:
Problem: As the number of features (dimensions) increases, the concept of distance
becomes less meaningful, leading to poor performance in high-dimensional spaces.
Example: When classifying high-dimensional data (e.g., pixel values of an image with 100
features), the notion of 'closeness' becomes distorted.
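A rough illustrative experiment (not from the slides) that hints at this effect: with random points, the gap between the nearest and farthest distance shrinks as the dimension grows, so "closeness" carries less information.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100):
    X = rng.random((1000, d))   # 1000 random points in d dimensions
    q = rng.random(d)           # a random query point
    dists = np.sqrt(((X - q) ** 2).sum(axis=1))
    # the nearest/farthest ratio approaches 1 as d grows
    print(d, dists.min() / dists.max())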
CHALLENGES & LIMITATIONS
3. Memory and Storage Requirements:
Problem: K-NN stores the entire training dataset, which can be inefficient if the
dataset is very large.
Example: For a large image dataset, storing millions of images can consume a lot of
memory.
4. Sensitivity to Noisy Data and Outliers:
Problem: K-NN relies on the proximity of data points to make predictions. If there are
noisy data points or outliers, they can influence the classification of the test point.
Example: In a classification task with most data points labeled as 'cat' but one outlier
labeled 'dog', the test point may be incorrectly classified as 'dog'.
OPTIMIZING KNN
KD-Tree Overview:
A KD-Tree is a hierarchical binary tree used to organize data points
in a k-dimensional space.
How it Works: KD-Trees recursively split the data into two halves
along the median of each dimension, creating a binary tree
structure.
Benefit: Searching for nearest neighbors becomes more efficient,
as the tree structure allows for pruning of large portions of the
data, reducing the search space during prediction.
Example: When a test point is provided, the KD-Tree quickly
narrows down the region of interest by traversing the tree
structure, making the search faster than a brute-force search.
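A brief sketch of KD-Tree-based neighbor search, assuming scikit-learn's KDTree; the random data, sizes, and k=5 are arbitrary. In scikit-learn, the KNeighbors estimators can also select this structure via algorithm='kd_tree'.

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))   # 10,000 points in 3-D

tree = KDTree(X_train)                             # built once at training time
dist, ind = tree.query(rng.random((1, 3)), k=5)    # prunes most of the data during the search
print(ind)   # indices of the 5 nearest training points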
BALL TREE
Ball Tree Overview:
A Ball Tree is another hierarchical data structure designed for
high-dimensional spaces, much like a KD-Tree, but optimized
for cases where data has more than 2-3 dimensions.
How it Works: The Ball Tree recursively organizes the data points into
hierarchical clusters ("balls"), each defined by a centroid and containing the
points closest to it; every ball is subdivided into smaller balls, with distances
measured from each cluster's centroid.
Benefit: Ball Trees are especially efficient when dealing with high-dimensional
data (e.g., image data with hundreds of features), because whole balls can be
pruned from a nearest-neighbor search using their centroids and radii.
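A matching sketch with scikit-learn's BallTree, which exposes the same query interface as the KDTree above; the 100-feature random data is an arbitrary stand-in for high-dimensional input. The KNeighbors estimators accept algorithm='ball_tree' to use this structure directly.

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 100))   # high-dimensional data (100 features)

tree = BallTree(X_train)                             # build the ball hierarchy once
dist, ind = tree.query(rng.random((1, 100)), k=5)    # whole balls are pruned during the search
print(ind)   # indices of the 5 nearest training points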
CONCLUSION
In summary, K-NN is a good choice for problems where simplicity and flexibility
are crucial, but for larger datasets or high-dimensional data, optimizations like
the KD-Tree or Ball Tree are essential to ensure acceptable performance.
THANK YOU