Lecture 3
by
Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi
End goal will be to construct an output prediction 𝑦ො(𝐱∗) for an unseen input 𝐱∗ so that it is close to 𝑦∗
▪ Bias makes algorithms easier to understand but generally less flexible
▪ Low bias: Suggests fewer assumptions about the function 𝑓
▪ High bias: Suggests more assumptions about the function 𝑓
▪ Machine learning algorithms that have a high variance are strongly influenced by the specifics
of the training data
▪ Low variance: Suggests small changes to the estimated function 𝑓 with changes to the training dataset
▪ High variance: Suggests large changes to the estimated function 𝑓 with changes to the training dataset
Overfit vs Underfit
▪ Overfitting refers to the phenomenon when a model fits the training data “too well”
▪ It happens when a model learns the details and noise in the training data. This means that the noise or random
fluctuations in the training data are picked up and learned as concepts by the model.
▪ Models that have high variance and low bias lead to overfitting
▪ Does not generalize to new unseen data well
▪ Underfitting refers to the phenomenon when a model is unable to fit to the training data
▪ It happens when a model is “too rigid”
▪ Models that have high bias and low variance lead to underfitting
▪ Does not generalize to new unseen data well
Introduction to 𝑘-Nearest Neighbours (𝑘-NN)
▪ We will start with the relatively simple 𝑘-nearest neighbours (𝑘-NN) method.
Can be used for both regression and classification
▪ Most ML algorithms are based on the intuition that if the unseen data point 𝐱∗ is close to a training data point 𝐱^(𝑖), then
the prediction 𝑦ො(𝐱∗) should be close to 𝑦^(𝑖).
▪ A simple way to implement this idea is to find the “nearest” training data point
▪ Compute the Euclidean distance between the unseen input and all training inputs.
▪ Find the data point 𝐱^(𝑗) with the shortest distance to 𝐱∗, and use its output as the prediction, 𝑦ො(𝐱∗) = 𝑦^(𝑗)
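A minimal sketch of this nearest-neighbour rule in Python with NumPy (the array names X_train, y_train and x_star are illustrative, not from the slides):

    import numpy as np

    def one_nn_predict(X_train, y_train, x_star):
        """Return the label of the training point closest to x_star."""
        # Euclidean distance from x_star to every training input (length-N vector)
        dists = np.linalg.norm(X_train - x_star, axis=1)
        j = np.argmin(dists)   # index of the closest training point x^(j)
        return y_train[j]      # prediction y_hat(x_star) = y^(j)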
▪ Shortcoming: 1-nearest neighbour algorithm is sensitive to noise in data and mis-labelled data
[Two figure panels: “Every test example in the blue shaded area will be mis-classified as the blue class” and “Every test example in the blue shaded area will be classified as the red class”]
▪ How to improve: Use 𝑘-nearest neighbours to obtain a majority vote (or take an average)
Introduction to 𝑘-Nearest Neighbours (𝑘-NN)
𝒌-NN algorithm
Data: Training data {(𝐱^(𝑖), 𝑦^(𝑖))}, 𝑖 = 1, …, 𝑁, and an unseen (test) input 𝐱∗
▪ 𝑘-NN is a non-parametric algorithm; makes no assumptions about the functional form and has no fixed set
of parameters. Uses the entire training data when making predictions
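Since the algorithm box on this slide did not survive extraction, here is a sketch of the standard 𝑘-NN prediction rule it refers to (Python/NumPy; a majority vote for classification, an average for regression; the names are illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_star, k=3, classification=True):
        """k-NN prediction for a single test input x_star."""
        # 1. Compute the Euclidean distance to all N training inputs
        dists = np.linalg.norm(X_train - x_star, axis=1)
        # 2. Pick out the k nearest training points
        nn_idx = np.argsort(dists)[:k]
        nn_outputs = [y_train[i] for i in nn_idx]
        if classification:
            # 3a. Classification: majority vote among the k nearest labels
            return Counter(nn_outputs).most_common(1)[0][0]
        # 3b. Regression: average of the k nearest outputs
        return float(np.mean(nn_outputs))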
Example of 𝑘NN for binary classification
Training data:
𝑖   𝑥1   𝑥2   𝑦
1   -1    3   Red
2    2    1   Blue
3   -2    2   Red
4   -1    2   Blue
5   -1    0   Blue
6    1    1   Red

Squared Euclidean distance ‖𝐱^(𝑖) − 𝐱∗‖² of each training point to the test input 𝐱∗, sorted from nearest to farthest:
𝑖   ‖𝐱^(𝑖) − 𝐱∗‖²   𝑦^(𝑖)
6    1   Red
2    2   Blue
4    4   Blue
1    5   Red
5    8   Blue
3    9   Red
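▪ Reading off the sorted table: with 𝑘 = 1 the nearest neighbour is 𝑖 = 6 (Red), so 𝐱∗ is predicted Red
▪ With 𝑘 = 3 the nearest neighbours are 𝑖 = 6 (Red), 𝑖 = 2 (Blue) and 𝑖 = 4 (Blue), so the majority vote predicts Blue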
Decision boundary of a classifier
▪ Decision boundaries are the points in input space where the class prediction changes, that is, the borders between different classes

▪ They can help to understand a classifier and give a concise summary of a classifier

Training data:
𝑖   𝑥1   𝑥2   𝑦
1   -1    3   Red
2    2    1   Blue
3   -2    2   Red
4   -1    2   Blue
5   -1    0   Blue
6    1    1   Red
How to choose 𝑘?
▪ The number of neighbours 𝑘 is chosen by the user
[Figure: 𝑘-NN decision boundaries for 𝑘 = 1 and 𝑘 = 15]
▪ The choice of hyperparameter 𝑘 has a big impact on the predictions made by 𝑘-NN
▪ Small 𝑘
• Good at capturing fine-grained patterns
• May overfit, i.e. be sensitive to random errors in the training data
▪ Large 𝑘
• Makes stable predictions by averaging over lots of samples
• May underfit, i.e. fail to capture important patterns
▪ Balancing 𝑘 (trade-off between flexibility and rigidity)
• Optimal choice of 𝑘 depends on the number of data points 𝑁
• Rule of thumb: choose 𝑘 < √𝑁
• We can choose 𝑘 using cross-validation
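As a concrete illustration, one way to choose 𝑘 by cross-validation using scikit-learn (the library and the candidate values of 𝑘 are assumptions, not prescribed by the slides):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def choose_k(X_train, y_train, candidate_ks=(1, 3, 5, 7, 9, 15), n_folds=5):
        """Return the k with the highest mean cross-validated accuracy."""
        mean_scores = {}
        for k in candidate_ks:
            knn = KNeighborsClassifier(n_neighbors=k)
            # Mean accuracy over n_folds cross-validation folds of the training data
            mean_scores[k] = cross_val_score(knn, X_train, y_train, cv=n_folds).mean()
        return max(mean_scores, key=mean_scores.get)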
Validation and Test sets
▪ We can tune the hyperparameters using a validation set
▪ The test set is used only at the very end, to measure the generalization performance of the algorithm.
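A sketch of this protocol on toy data (scikit-learn and the split fractions are assumptions; the key point is that the test set is touched only once):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Toy dataset, used only to make the sketch runnable
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # 60% train / 20% validation / 20% test (illustrative split)
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

    # Tune the hyperparameter k on the validation set only
    best_k, best_acc = None, -1.0
    for k in (1, 3, 5, 7, 9, 15):
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc

    # The test set is used exactly once, at the very end
    test_acc = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test)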
Pitfalls of 𝑘NN: Curse of dimensionality
▪ 𝑘NN works well when the input dimension is small (e.g. 2-3), but struggles when the input dimension is high
▪ In high dimensions, “most” points are far apart and are approximately at the same distance
▪ Hence, our intuition that works for distances in 2- and 3- dimensional spaces breaks down in higher dimensions
▪ We can show this by applying the rules of expectation and covariance of random variables (HW maybe)
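A small numerical check of this claim (a sketch; the number of points and the dimensions are arbitrary choices): as the dimension 𝑝 grows, the nearest and farthest neighbours of a point end up at nearly the same distance.

    import numpy as np

    rng = np.random.default_rng(0)
    for p in (2, 10, 100, 1000):
        X = rng.uniform(size=(1000, p))           # 1000 points uniform in the unit hypercube
        d = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to all others
        print(f"p={p:4d}  nearest={d.min():.3f}  farthest={d.max():.3f}  ratio={d.max()/d.min():.2f}")
    # The farthest/nearest ratio shrinks towards 1 as p grows: "most" points are roughly equidistant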
Pitfalls of 𝑘NN: Normalization
▪ 𝑘NN can be quite sensitive to the range of the input features
▪ Example: 𝐱 = [𝑥1, 𝑥2]ᵀ, where 𝑥1 is in the range [100, 1000] and 𝑥2 is in the range [0, 1] (or vice-versa)
▪ The Euclidean distance between a test point 𝐱∗ and a training data point 𝐱^(𝑖) is √[(𝑥1^(𝑖) − 𝑥1∗)² + (𝑥2^(𝑖) − 𝑥2∗)²]
▪ The Euclidean distance is dominated by the first term (𝑥1^(𝑖) − 𝑥1∗)², simply due to the larger magnitude of 𝑥1
▪ Thus, the variable 𝑥1 is treated as much more important than 𝑥2 by 𝑘NN
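▪ For instance (hypothetical numbers), with 𝐱^(𝑖) = [600, 0.2]ᵀ and 𝐱∗ = [400, 0.9]ᵀ the squared distance is 200² + 0.7² = 40000.49, so the contribution of 𝑥2 is negligible regardless of how informative it is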
Pitfalls of 𝑘NN: Normalization
▪ 𝑘NN can be sensitive to the ranges of the input features
▪ Simple fix: Standardize each dimension using mean and standard deviation of data
▪ 𝑥ҧ𝑗^(𝑖) = (𝑥𝑗^(𝑖) − 𝜇𝑗) / 𝜎𝑗 for all 𝑖 = 1, 2, ⋯ , 𝑁 and 𝑗 = 1, 2, ⋯ , 𝑝
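A sketch of this standardization in NumPy (array names are illustrative; 𝜇𝑗 and 𝜎𝑗 are computed from the training inputs and the same transform is applied to the test inputs):

    import numpy as np

    def standardize(X_train, X_test):
        """Standardize each feature using the training-set mean and standard deviation."""
        mu = X_train.mean(axis=0)      # mu_j for each feature j
        sigma = X_train.std(axis=0)    # sigma_j for each feature j (assumed non-zero)
        return (X_train - mu) / sigma, (X_test - mu) / sigma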
Pitfalls of 𝑘NN: Computationally costly
▪ Computational cost at training time: zero (𝑘NN does not learn a model during training)
▪ Computational cost at test time, per test data point
▪ Calculate 𝑝-dimensional Euclidean distances with 𝑁 data points: 𝒪(𝑁𝑝)
▪ Sort the distances: 𝒪(𝑁 log 𝑁)
▪ This must be done for each test data point, which is very expensive by the standards of a learning algorithm!
▪ Need to store the entire dataset in memory!
▪ Gives decent accuracy when there is lots of data
▪ 𝑘NN stores the entire training dataset in memory which it uses as its representation
▪ 𝑘NN does not learn any model
▪ 𝑘NN makes predictions just-in-time by calculating the similarity between a test input and each training sample
▪ There are many distance measures to choose from to match the structure of your input data
▪ It is a good idea to rescale your data, such as using normalization, when using 𝑘NN