
APL 405: Machine Learning for Mechanics

Lecture 3: 𝑘-Nearest Neighbour

by

Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi

Instructor email: [email protected]


Supervised Learning: Recap
Supervised Learning
Learning (training, estimating) a function (or model) 𝑓 so that it best fits the relationship between
▪ the input 𝐱, and
▪ the output 𝑦
from observed training data (the individual data points are assumed to be (probabilistically) independent)

$$D_{\text{train}} = \left\{ \left(\mathbf{x}^{(1)}, y^{(1)}\right), \left(\mathbf{x}^{(2)}, y^{(2)}\right), \dots, \left(\mathbf{x}^{(N)}, y^{(N)}\right) \right\}$$

The end goal is to construct an output prediction $\hat{y}(\mathbf{x}^*)$ for an unseen input $\mathbf{x}^*$ so that it is close to the true output $y^*$

Types of Supervised learning: Classification and Regression

▪ Output variable 𝑦 is categorical → Classification
▪ Output variable 𝑦 is numerical → Regression
▪ Input variables can be categorical, numerical, or a mix of both
Supervised Learning: Recap
▪ Parametric vs Non-parametric models

▪ Prediction errors are caused by bias, variance, and irreducible error

▪ Bias makes algorithms easier to understand but generally less flexible
▪ Low bias: Suggests fewer assumptions about the function 𝑓
▪ High bias: Suggests more assumptions about the function 𝑓
▪ Machine learning algorithms that have a high variance are strongly influenced by the specifics
of the training data
▪ Low variance: Suggests small changes to the estimated function 𝑓 with changes to the training dataset
▪ High variance: Suggests large changes to the estimated function 𝑓 with changes to the training dataset

Overfit vs Underfit
▪ Overfitting refers to the phenomenon when a model fits the training data “too well”
▪ It happens when a model learns the detail and noise in the training data, so that random fluctuations in the training data are picked up and learned as concepts by the model
▪ Models that have high variance and low bias lead to overfitting
▪ Does not generalize to new unseen data well

[Figure: three fits of the same data: overfitting, underfitting, and a balanced fit]

▪ Underfitting refers to the phenomenon when a model is unable to fit the training data
▪ It happens when a model is “too rigid”
▪ Models that have high bias and low variance lead to underfitting
▪ Does not generalize to new unseen data well
Introduction to 𝑘-Nearest Neighbours (𝑘-NN)
▪ We will start with the relatively simple 𝑘-nearest neighbours (𝑘-NN) method, which can be used for both regression and classification

▪ Most ML algorithms are based on the intuition that if an unseen data point $\mathbf{x}^*$ is close to a training data point $\mathbf{x}^{(i)}$, then the prediction $\hat{y}(\mathbf{x}^*)$ should be close to $y^{(i)}$.

▪ A simple way to implement this idea is to find the “nearest” training data point
▪ Compute the Euclidean† distance between the unseen input and all training inputs.

The Euclidean distance from the 𝑖th training input:

$$\left\| \mathbf{x}^{(i)} - \mathbf{x}^* \right\|_2 = \sqrt{\left(x_1^{(i)} - x_1^*\right)^2 + \left(x_2^{(i)} - x_2^*\right)^2 + \cdots + \left(x_p^{(i)} - x_p^*\right)^2}$$

▪ Find the training data point $\mathbf{x}^{(j)}$ with the shortest distance to $\mathbf{x}^*$, and use its output as the prediction: $\hat{y}(\mathbf{x}^*) = y^{(j)}$

▪ This is the 1-nearest neighbour algorithm


† There are many other distances: Manhattan, Mahalanobis, cosine similarity, etc. Use Manhattan if the input variables are not similar in type (such as age, gender, height, etc.)
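To make the 1-nearest neighbour rule above concrete, here is a minimal NumPy sketch (not part of the original slides; the array names X_train, y_train and x_star are illustrative assumptions):

```python
import numpy as np

def predict_1nn(X_train, y_train, x_star):
    """1-NN prediction: return the output of the training point closest to x_star."""
    # Euclidean distances from x_star to every training input (X_train has shape (N, p))
    dists = np.linalg.norm(X_train - x_star, axis=1)
    # Index j of the nearest training point; its output becomes the prediction
    j = np.argmin(dists)
    return y_train[j]
```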
Introduction to 𝑘-Nearest Neighbours (𝑘-NN)
▪ In practice we can rarely say for certain what the output value 𝑦 will be!
▪ Mathematically, we handle this by describing 𝑦 as a random variable. That is, we consider the data as noisy, meaning that
it is affected by random errors referred to as noise.

▪ Shortcoming: the 1-nearest neighbour algorithm is sensitive to noise and mis-labelled data

[Figure, left: every test example in the blue shaded area will be mis-classified as the blue class]
[Figure, right: every test example in the blue shaded area will be classified as the red class]

▪ How to improve: Use 𝑘-nearest neighbours to obtain a majority vote (or take an average)
Introduction to 𝑘-Nearest Neighbours (𝑘-NN)
𝒌-NN algorithm

Data: Training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ and an unseen (test) input $\mathbf{x}^*$

Result: Predicted test output $\hat{y}(\mathbf{x}^*)$

1. Compute the distances $\left\| \mathbf{x}^{(i)} - \mathbf{x}^* \right\|_2$ for all training data points $i = 1, 2, \dots, N$

2. Find the $k$ examples $\left(\mathbf{x}^{(i)}, y^{(i)}\right)$ closest to the test instance $\mathbf{x}^*$


3. Compute the prediction

$$\hat{y}(\mathbf{x}^*) = \begin{cases} \text{Mean (or median) of the } k \text{ closest examples} & \textbf{Regression} \\ \text{Majority vote (mode) of the } k \text{ closest examples} & \textbf{Classification} \end{cases}$$

▪ 𝑘-NN is a non-parametric algorithm; makes no assumptions about the functional form and has no fixed set
of parameters. Uses the entire training data when making predictions
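The three steps above translate almost directly into code. Below is a minimal NumPy sketch of the 𝑘-NN prediction rule (the function name, argument names and the task flag are illustrative assumptions, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_star, k=3, task="classification"):
    """k-NN prediction for a single test input x_star (X_train: (N, p), y_train: (N,))."""
    # Step 1: Euclidean distances to all N training points
    dists = np.linalg.norm(X_train - x_star, axis=1)
    # Step 2: indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Step 3: aggregate their outputs
    if task == "regression":
        return np.mean(y_train[nearest])                        # mean of the k closest outputs
    return Counter(y_train[nearest]).most_common(1)[0][0]       # majority vote (mode)
```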
Example of 𝑘NN for binary classification
Training data:

 i    x₁    x₂    y
 1    -1     3    Red
 2     2     1    Blue
 3    -2     2    Red
 4    -1     2    Blue
 5    -1     0    Blue
 6     1     1    Red

Squared Euclidean distances to the test point $\mathbf{x}^* = [1 \;\; 2]^T$, sorted in increasing order:

 i    squared distance to 𝐱*    y
 6              1               Red
 2              2               Blue
 4              4               Blue
 1              5               Red
 5              8               Blue
 3              9               Red

▪ Predict the output for $\mathbf{x}^* = [1 \;\; 2]^T$


▪ Consider two different kNN classifiers
▪ one using 𝑘 = 1, and (result is red)
▪ another using 𝑘 = 3 (result is blue)
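The distances in the table and the two predictions can be verified with a few lines of NumPy (a quick check, not part of the slides):

```python
import numpy as np
from collections import Counter

X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
x_star = np.array([1.0, 2.0])

d2 = np.sum((X - x_star) ** 2, axis=1)     # squared distances: [5. 2. 9. 4. 8. 1.]
order = np.argsort(d2)                     # nearest first: points 6, 2, 4, 1, 5, 3

print(y[order[0]])                                     # k = 1  ->  Red
print(Counter(y[order[:3]]).most_common(1)[0][0])      # k = 3  ->  Blue
```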

Decision boundary of a classifier
▪ Decision boundaries are the points in input space where the class prediction changes, that is, the borders between different classes

▪ They can help to understand a classifier and give a concise summary of a classifier (see the plotting sketch below)

Training data (same as on the previous slide):

 i    x₁    x₂    y
 1    -1     3    Red
 2     2     1    Blue
 3    -2     2    Red
 4    -1     2    Blue
 5    -1     0    Blue
 6     1     1    Red

▪ Predict the output for $\mathbf{x}^* = [1 \;\; 2]^T$


▪ Consider two different kNN classifiers
▪ one using 𝑘 = 1, and
▪ another using 𝑘 = 3
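To visualize the decision boundaries for this dataset, one can classify every point of a dense grid and colour the regions. A hedged sketch using scikit-learn and matplotlib (the grid limits, colours and label encoding are arbitrary choices, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Training data from the table, with labels encoded as 0 = Blue, 1 = Red
X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 0, 1])

# Dense grid covering the input space
xx, yy = np.meshgrid(np.linspace(-3, 3, 300), np.linspace(-1, 4, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 3]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(grid).reshape(xx.shape)               # predicted class at every grid point
    ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")    # coloured regions reveal the decision boundary
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
    ax.set_title(f"k = {k}")
plt.show()
```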

How to choose 𝑘?
▪ The number of neighbours 𝑘 is chosen by the user
[Figure: 𝑘-NN decision boundaries for 𝑘 = 1 and 𝑘 = 15]

▪ Since it is not learned, it is not a parameter, and we refer to it as a hyperparameter

▪ The choice of hyperparameter 𝑘 has a big impact on the predictions made by 𝑘-NN
▪ Small 𝑘
• Good at capturing fine-grained patterns
• May overfit, i.e. be sensitive to random errors in the training data
▪ Large 𝑘
• Makes stable predictions by averaging over lots of samples
• May underfit, i.e. fail to capture important patterns
▪ Balancing 𝑘 (trade-off between flexibility and rigidity)
• Optimal choice of 𝑘 depends on the number of data points 𝑁
• Rule of thumb: choose $k < \sqrt{N}$
• We can choose 𝑘 using cross-validation
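As an illustration of the last point, scikit-learn can grid-search over candidate values of 𝑘 with cross-validation. A sketch on synthetic data (the dataset, the candidate list and the 5-fold choice are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D classification data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# 5-fold cross-validation over candidate values of k
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 11, 15]}, cv=5)
search.fit(X, y)

print(search.best_params_["n_neighbors"])   # k with the best mean validation accuracy
print(search.best_score_)                   # the corresponding cross-validated accuracy
```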
Validation and Test sets
▪ We can tune the hyperparameters (such as 𝑘) using a validation set held out from the training data

▪ The test set is used only at the very end, to measure the generalization performance of the algorithm.
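A minimal sketch of such a split in scikit-learn (the 60/20/20 proportions and the synthetic dataset are assumptions, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out the test set first; it is touched only once, at the very end
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training and validation sets for tuning k
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune k on the validation set ...
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val))
# ... and report generalization performance once, on the untouched test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(final.score(X_test, y_test))
```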

Pitfalls of 𝑘NN: Curse of dimensionality
▪ 𝑘NN works well with a small dimension of inputs (e.g. 2-3), but struggles when the input dimension is high
▪ In high dimensions, “most” points are far apart and approximately at the same distance from one another
▪ Hence, our intuition that works for distances in 2- and 3- dimensional spaces breaks down in higher dimensions

▪ We can show this by applying the rules of expectation and covariance of random variables (HW maybe)
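The effect can also be seen empirically by sampling random points in [0, 1]^p and comparing the nearest and farthest distances from a query point as p grows. A small simulation sketch (sample sizes and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of "training" points

# As the dimension p grows, the nearest and farthest points from a query
# become almost equally far away, so "nearest" loses its meaning.
for p in [2, 10, 100, 1000]:
    X = rng.random((n, p))                    # n uniform points in [0, 1]^p
    q = rng.random(p)                         # a random query point
    d = np.linalg.norm(X - q, axis=1)
    print(f"p = {p:4d}   min = {d.min():6.2f}   max = {d.max():6.2f}   max/min = {d.max() / d.min():5.2f}")
```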

Pitfalls of 𝑘NN: Normalization
▪ 𝑘NN can be quite sensitive to the range of the input features
▪ Example: $\mathbf{x} = [x_1 \;\; x_2]^T$, where $x_1$ is in the range [100, 1000] and $x_2$ is in the range [0, 1] (or vice-versa)

[Figure: the training data plotted on the x₁ and x₂ axes, which have very different scales]

▪ The Euclidean distance between a test point $\mathbf{x}^*$ and a training data point $\mathbf{x}^{(i)}$ is $\sqrt{\left(x_1^{(i)} - x_1^*\right)^2 + \left(x_2^{(i)} - x_2^*\right)^2}$

▪ The Euclidean distance is dominated by the first term $\left(x_1^{(i)} - x_1^*\right)^2$ simply due to the larger magnitude of $x_1$

▪ Thus, 𝑘NN effectively treats the variable $x_1$ as much more important than $x_2$
Pitfalls of 𝑘NN: Normalization
▪ 𝑘NN can be sensitive to the ranges of the input features


▪ Simple fix: Normalize each dimension to be in the range [0, 1]


▪ $\bar{x}_j^{(i)} = \dfrac{x_j^{(i)} - \min_i x_j^{(i)}}{\max_i x_j^{(i)} - \min_i x_j^{(i)}}$ for all $i = 1, 2, \dots, N$ and $j = 1, 2, \dots, p$

▪ Another simple fix: Standardize each dimension using the mean and standard deviation of the data
▪ $\bar{x}_j^{(i)} = \dfrac{x_j^{(i)} - \mu_j}{\sigma_j}$ for all $i = 1, 2, \dots, N$ and $j = 1, 2, \dots, p$
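A sketch of both rescalings in NumPy. In practice the minimum/maximum (or mean/standard deviation) are usually computed from the training data only and then applied unchanged to test points; the function names below are illustrative, not from the slides:

```python
import numpy as np

def minmax_scale(X_train, X_test):
    """Rescale each feature of X_train to [0, 1]; apply the same mapping to X_test."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return (X_train - lo) / (hi - lo), (X_test - lo) / (hi - lo)

def standardize(X_train, X_test):
    """Standardize each feature using the training mean and standard deviation."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Example with the ranges from the slide: x1 in [100, 1000], x2 in [0, 1]
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(100, 1000, 50), rng.uniform(0, 1, 50)])
X_test = np.column_stack([rng.uniform(100, 1000, 10), rng.uniform(0, 1, 10)])
Xtr_scaled, Xte_scaled = minmax_scale(X_train, X_test)   # both features now on comparable scales
```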

Pitfalls of 𝑘NN: Computationally costly
▪ Computational cost at training time: essentially zero (no model is fit; the training data is simply stored)
▪ Computational cost at test time, per test data point
▪ Calculate 𝑝-dimensional Euclidean distances with 𝑁 data points: $\mathcal{O}(Np)$
▪ Sort the distances: $\mathcal{O}(N \log N)$
▪ This must be done for each test data point, which is very expensive by the standards of a learning algorithm!
▪ Need to store the entire dataset in memory!
▪ Gives decent accuracy when there is lots of data

MNIST digit classification


• Handwritten digits
• 28x28 pixel images: 𝑝 = 784
• 60,000 training samples
• 10,000 test samples
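For a rough sense of scale, scikit-learn's k-NN can be run on a subset of MNIST; the subset size, the choice k = 3 and the split below are arbitrary, and the first call downloads the dataset:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 70,000 handwritten digits, each a 28x28 image flattened to p = 784 pixels
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Work on a 10,000-image subset so the O(Np) test-time distance computations stay manageable
X_train, X_test, y_train, y_test = train_test_split(X[:10000], y[:10000],
                                                    test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)              # "training" only stores the data
print(clf.score(X_test, y_test))       # prediction is where nearly all the time is spent
```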
Summary
▪ The 𝑘-Nearest Neighbours algorithm can be used for both classification and regression

▪ 𝑘NN stores the entire training dataset in memory, which it uses as its representation
▪ 𝑘NN does not learn any model

▪ 𝑘NN makes predictions just-in-time by calculating the similarity between a test input and each training sample
▪ There are many distance measures to choose from to match the structure of your input data

▪ It is a good idea to rescale your data, for example by normalization or standardization, when using 𝑘NN

