
CS6140: Machine Learning

k-Nearest Neighbors (k-NN) & Principal Component Analysis (PCA)

Dr. Ryan Rad

Summer 2024
Today’s Agenda

• Recommendation Systems
• Instance-Based Learning
• k-Nearest Neighbors (k-NN)
• Principal Component Analysis (PCA)
• Labs – PCA
• HW3 Walkthrough – Video (coming tomorrow)

2
Recommendation Systems

4
A common architecture for recommendation systems:

Credit: https://developers.google.com/machine-learning/recommendation/overview/types

Content-based filtering
• Definition: Uses similarity between items to recommend items similar to what the user likes.
• Example: If user A watches two cute cat videos, then the system can recommend cute animal videos to that user.

Collaborative filtering
• Definition: Uses similarities between queries and items simultaneously to provide recommendations.
• Example: If user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn't seen any videos similar to video 1).

Interested to learn more?
• https://developers.google.com/machine-learning/recommendation
• https://www.datacamp.com/tutorial/recommender-systems-python

5
Case-Study
(University Track & Field Team)

Every year Northeastern University recruits some additional talented athletes to join the team.

• How can k-NN support our team selection process?

6
Fundamentals (K-NN)

The fundamentals of similarity-based learning are:

• Feature space
• Similarity metrics

7
Fundamentals (Feature Space)

Figure: A feature space plot of the speed and agility ratings for 20 college athletes, labelled with the decision of whether they were drafted or not. The triangles represent 'Non-draft' instances and the crosses represent 'Draft' instances.

8
Fundamentals (Feature Space)

A feature space is an abstract n-dimensional space.

• In a feature space, each descriptive feature corresponds to an axis.
• Each instance in the dataset is mapped to a point in the feature space.

9
K Nearest Neighbours (K-NN)

Example

• Should we draft an athlete with the following profile:

Query: SPEED = 6.75, AGILITY= 3


Using 1-NN?

10
K Nearest Neighbours (K-NN) – Exercise 1

Question: Should we select an athlete with the following profile this year?

Query: SPEED = 6.75, AGILITY= 3


Using 1-NN?

11
Fundamentals (Distance Metrics)

One of the best-known metrics is Euclidean distance, which computes the length of the straight line between two points. The Euclidean distance between two instances a and b in an m-dimensional feature space is defined as:
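For reference, the standard formula (summing over the m descriptive features) is:

$$\mathrm{Euclidean}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{m} \left(a_i - b_i\right)^2}$$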

12
Fundamentals (Distance Metrics)

Example
• The Euclidean distance between instances d12 (SPEED = 5.00, AGILITY = 2.50) and d5 (SPEED = 2.75, AGILITY = 7.50) is:

Euclidean(⟨5.00, 2.50⟩, ⟨2.75, 7.50⟩) = √((5.00 − 2.75)² + (2.50 − 7.50)²) = √30.0625 = 5.4829
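A quick way to verify this number, e.g. with NumPy (a minimal sketch, not part of the original slides):

```python
import numpy as np

d12 = np.array([5.00, 2.50])   # SPEED, AGILITY for instance d12
d5 = np.array([2.75, 7.50])    # SPEED, AGILITY for instance d5

# Euclidean distance is the L2 norm of the difference vector
print(np.linalg.norm(d12 - d5))  # 5.4829...
```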

13
K Nearest Neighbours (K-NN) – Exercise 1

Query: SPEED = 6.75, AGILITY = 3

Is this "Draft" or "Non-draft"? DRAFT!

• Nearest "Draft" (plus) instance: SPEED = 7, AGILITY = 4
  Distance: √((7 − 6.75)² + (4 − 3)²) = √1.0625 ≈ 1.03
• Nearest "Non-draft" (triangle) instance: SPEED = 5, AGILITY = 2.5
  Distance: √((5 − 6.75)² + (2.5 − 3)²) = √3.3125 ≈ 1.82

The "Draft" instance is closer, so 1-NN predicts DRAFT.

14
K Nearest Neighbours (K-NN) – Exercise 2
Question: Should we select an athlete with the following profile this year?

Query: SPEED = 8.5, AGILITY = 9

Is this "Draft" or "Non-draft"?

Using 1-NN?

Using 3-NN?
Using 5-NN?
15
Handling Noisy Data

Figure: Is the instance at the top right of the diagram really noise?

16
Handling Noisy Data

Figure: The decision boundary using majority classification of the nearest 3 neighbors.

17
Handling Noisy Data

Figure: The decision boundary using majority classification of the nearest 5 neighbors.

18
Fundamentals (Distance Metrics)

(a) The Voronoi tessellation of the feature space for the dataset.
(b) The decision boundary created by aggregating the neighboring Voronoi regions that belong to the same target level.
19
Fundamentals (Distance Metrics)

One of the great things about nearest neighbour algorithms is that we can add in new data to update the model very easily.

21
Fundamentals (Nearest Neighbour Algorithm)

The Nearest Neighbour Algorithm

• Require: a set of training instances
• Require: a query to be classified
1: Iterate across the instances in memory and find the instance that is the shortest distance from the query position in the feature space.
2: Make a prediction for the query equal to the value of the target feature of that nearest neighbor.
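A minimal Python sketch of this 1-NN procedure (the function name and data layout are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, query):
    """Return the target value of the training instance closest to the query.

    X_train: (n_instances, n_features) array of descriptive features
    y_train: length-n_instances array of target values
    query:   length-n_features array
    """
    # Step 1: find the training instance at the shortest Euclidean distance
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argmin(distances)
    # Step 2: predict the target feature of that nearest neighbor
    return y_train[nearest]

# Example using the two candidate neighbors from the earlier exercise
X_train = np.array([[7.0, 4.0], [5.0, 2.5]])   # SPEED, AGILITY
y_train = np.array(["Draft", "Non-draft"])
print(nearest_neighbor_predict(X_train, y_train, np.array([6.75, 3.0])))  # "Draft"
```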

23
K- NN Algorithm
(summary)

Photo via kdnuggets.com

24
How to tune the K in k-NN ?

What happens if we select a very small value for K?

What happens if we select a very large value for K?

25
Dealing with a tie (draw) situation
Only possible when K is an even number (assuming a binary classification task)

26
Dealing with a tie (draw) situation
Only possible when K is an even number (assuming a binary classification task)

In a distance-weighted k-nearest-neighbor algorithm, the contribution of each neighbor to the classification decision is weighted by the reciprocal of the squared distance between the neighbor d and the query q:

w(q, d) = 1 / dist(q, d)²

Figure: The weighted KNN decision boundary.
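A small sketch of distance-weighted voting, assuming the k nearest neighbors are given as (distance, label) pairs (the helper name is illustrative, not from the slides):

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest neighbors.

    Each neighbor votes for its label with weight 1 / distance**2, so closer
    neighbors count more and ties are effectively broken.
    """
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist ** 2 + 1e-12)  # epsilon guards against dist == 0
    return max(votes, key=votes.get)

# Two "Draft" neighbors far away are outvoted by one very close "Non-draft" neighbor
print(weighted_knn_vote([(2.0, "Draft"), (2.5, "Draft"), (0.5, "Non-draft")]))  # Non-draft
```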

27
Quiz Time!
Data Normalization

29
Data Normalization

Figure: A dataset listing the salary and age information for customers and whether or not they purchased a pension plan.

The marketing department wants to decide whether or not they should contact a customer with the following profile:
⟨SALARY = 56,000, AGE = 35⟩ ?

30
Data Normalization


31
Data Normalization

This odd prediction is caused by the features taking different ranges of values, which is equivalent to the features having different variances.

We can adjust for this using normalization; the equation for range normalization is:
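The standard range (min–max) normalization rescales each feature value into a target interval [low, high]:

$$a_i' = \frac{a_i - \min(a)}{\max(a) - \min(a)} \times (high - low) + low$$

With low = 0 and high = 1, this reduces to $(a_i - \min(a)) / (\max(a) - \min(a))$.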

32
Data Normalization


33
Data Normalization

Normalizing the data is an important thing to do for almost all machine learning algorithms, not just nearest neighbor!
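A hedged sketch of how this is often done in practice with scikit-learn (the slides do not prescribe a particular library; the values below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy SALARY / AGE data on very different scales (illustrative values)
X_train = np.array([[46000.0, 39.0], [62000.0, 56.0], [55000.0, 44.0]])
X_query = np.array([[56000.0, 35.0]])

# Fit the scaler on the training data only, then apply the same transformation
# to the query so both live in the same [0, 1] range per feature.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_query_scaled = scaler.transform(X_query)
print(X_train_scaled)
print(X_query_scaled)
```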

34
Predicting Continuous Targets

This time, instead of majority voting, we use the (weighted) average of the target values of the k nearest neighbors:

Return the average value in the neighborhood:
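In the standard unweighted and distance-weighted forms (with t_i the target value of the i-th neighbor):

$$\mathbb{M}_k(q) = \frac{1}{k}\sum_{i=1}^{k} t_i
\qquad \text{or} \qquad
\mathbb{M}_k(q) = \frac{\sum_{i=1}^{k} t_i / dist(q, d_i)^2}{\sum_{i=1}^{k} 1 / dist(q, d_i)^2}$$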

35
Whisky
Dataset

36
Predicting Continuous Targets

Figure: A dataset of whiskeys listing the age (in years), the rating (between 1 and 5, with 5 being the best), and the bottle price of each whiskey.

37
Predicting Continuous Targets

Figure: The whiskey dataset after the descriptive features have been normalized.

38
Predicting Continuous Targets

Figure: The AGE and RATING feature space for the whiskey dataset. The location of the query instance is indicated by the ? symbol. The circle plotted with a dashed line demarcates the border of the neighborhood around the query when k = 3. The three nearest neighbors to the query are labelled with their ID values.

39
Predicting Continuous Targets

The model will return a price prediction that is the average price of the three neighbors:

(200.00 + 250.00 + 55.00) / 3 = 168.33

40
Predicting Continuous Targets

Table: The calculations for the weighted k-nearest neighbor prediction.

(411.64 + 5987.53 + 4494.38) / (7.4844 + 29.9376 + 17.9775) ≈ 197
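A quick check of this weighted average, using the per-neighbor prices and weights (weight = 1 / dist(q, d)²) implied by the numbers above; the pairing below is reconstructed, not given explicitly on the slide:

```python
# (price, weight) pairs for the three nearest whiskeys, where weight = 1 / dist(q, d)**2
neighbors = [(55.00, 7.4844), (200.00, 29.9376), (250.00, 17.9775)]

numerator = sum(price * weight for price, weight in neighbors)    # ≈ 10893.5
denominator = sum(weight for _, weight in neighbors)              # ≈ 55.40
print(numerator / denominator)                                    # ≈ 196.6 (≈ 197)
```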

41
K-NN Demos

Best Demos
• http://vision.stanford.edu/teaching/cs231n-demos/knn/
• http://sleepyheads.jp/apps/knn/knn.html

42
Pros and Cons of K-NN
Advantages of K-NN:

1. K-NN is intuitive and simple: the algorithm is very easy to understand and equally easy to implement. To classify a new data point, K-NN reads through the whole dataset to find its K nearest neighbors.
2. K-NN makes no assumptions: K-NN is a non-parametric algorithm, which means there are no distributional assumptions to be met before it can be applied. Parametric models like linear regression have many assumptions the data must satisfy, which is not the case with K-NN.
3. No training step: K-NN does not explicitly build a model; it simply tags a new data entry based on the historical data, assigning it the majority class among its nearest neighbors.
4. It constantly evolves: because it is an instance-based (memory-based) learner, the classifier adapts immediately as new training data is collected, which allows the algorithm to respond quickly to changes in the input during real-time use.
5. Very easy to use for multi-class problems: many classifiers are easy to implement for binary problems but need extra effort for multi-class ones, whereas K-NN handles multiple classes without any extra effort.
6. Can be used for both classification and regression: one of the biggest advantages of K-NN is that it supports both classification and regression problems.
7. One hyperparameter: choosing K may take some time, but after that the rest of the method follows from it.
8. A variety of distance criteria to choose from: K-NN gives the user the flexibility to choose a distance measure while building the model, e.g., Euclidean, Hamming, Manhattan, or Minkowski distance.
Pros and Cons of K-NN

Disadvantages of K-NN:
1. K-NN is a slow algorithm: K-NN may be very easy to implement, but as the dataset grows, the efficiency and speed of the algorithm decline quickly.
2. Curse of dimensionality: K-NN works well with a small number of input variables, but as the number of variables grows, the algorithm struggles to predict the output for new data points.
3. K-NN needs homogeneous features: if you build K-NN using a common distance such as Euclidean or Manhattan, it is essential that the features share the same scale, since absolute differences in features are weighted equally, i.e., a given difference in feature 1 must mean the same as in feature 2.
4. Optimal number of neighbors: one of the biggest issues with K-NN is choosing the optimal number of neighbors to consider when classifying a new data entry.
5. Imbalanced data causes problems: K-NN does not perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labeled A, then the model will give a lot of preference to A, which may result in the less common class B being wrongly classified.
6. Outlier sensitivity: K-NN is very sensitive to outliers, since it chooses neighbors purely on the basis of distance.
Brain Break – 5 min
Dimensionality Reduction

• Motivation
• Data compression
• Data visualization
• Principal component analysis
• Intuition
• Formulation
• Algorithm
• Reconstruction
• Choosing the number of principal components
• Applying PCA
A beginner-friendly tutorial on PCA with the math behind it:
http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf
Principal Component Analysis

Figure: data plotted with Eigenvector 1 and Eigenvector 2 overlaid (h² = a² + b²).
Dot Product of Two Vectors

Intuitively, the dot product is a measure of how much two vectors are aligned.

So, if we have two vectors u and v, and u has unit length, then the dot product between them gives the length of v along u, i.e., the projection of v onto u.
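In symbols (standard definitions, not specific to these slides):

$$\mathbf{u}\cdot\mathbf{v} = \sum_{i} u_i v_i = \lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert\cos\theta,
\qquad
\text{projection of } \mathbf{v} \text{ onto } \mathbf{u} = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert}$$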
Eigenvectors & Eigenvalues

In linear algebra, an eigenvector of a linear transformation is a nonzero vector that changes at most
by a scalar factor when that linear transformation is applied to it.

The corresponding eigenvalue, often denoted by 𝜆, is the factor by which the eigenvector is scaled.
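Equivalently, for a square matrix A, an eigenvector v and its eigenvalue λ satisfy:

$$A\mathbf{v} = \lambda\mathbf{v}, \qquad \mathbf{v} \neq \mathbf{0}$$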
Week 6 Quiz – Q4

Which of the following figures correspond to possible values that PCA may return for the first principal component (the first eigenvector)? Select ALL that apply.

(Figure: four candidate directions, shown in panels A, B, C, and D.)

Two acceptable answers:
• Only A (partial credit)
• Both A and C
Eigenvectors & Eigenvalues

• For a square matrix A (n × n), there are at most n linearly independent eigenvectors.
• For a 3 × 3 matrix, there are up to 3 eigenvectors.
• For a symmetric matrix (such as the covariance matrices PCA works with), the eigenvectors are mutually perpendicular, i.e., at right angles to each other, no matter how many dimensions you have.
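A small NumPy sketch illustrating this for a symmetric covariance matrix (the data below is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 instances, 3 features
X = X - X.mean(axis=0)                   # mean-center before computing covariance

cov = (X.T @ X) / X.shape[0]             # 3 x 3 covariance matrix (symmetric)

# eigh is specialized for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print(eigenvalues)                       # 3 eigenvalues
print(eigenvectors.T @ eigenvectors)     # ≈ identity: the eigenvectors are orthonormal
```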
Data Compression

• Reduces the required time and storage space


• Removing multi-collinearity improves the interpretation of the parameters of the machine learning model.

Reduce data from 2D to 1D by projecting each instance onto a single new axis z₁:

$$x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}, \quad
x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}, \quad \ldots, \quad
x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$$
Data Compression
• Reduce data from 3D to 2D

(Figure: 3D data points (x₁, x₂, x₃) projected onto a 2D plane with new axes z₁ and z₂.)
Data pre-processing

• Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$

• Preprocessing (feature scaling / mean normalization):

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$$

Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If different features are on different scales, scale the features to have a comparable range of values:

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$$
Principal Component Analysis Algorithm

Goal: Reduce data from n-dimensions to k-dimensions

• Step 1: Compute the "covariance matrix":

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$$

• Step 2: Compute the "eigenvectors" of the covariance matrix.
• Principal components: $u^{(1)}, u^{(2)}, \ldots, u^{(k)} \in \mathbb{R}^{n}$
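A minimal NumPy sketch of these two steps (this mirrors the slide's recipe; in practice np.linalg.svd or sklearn.decomposition.PCA is usually preferred for numerical stability):

```python
import numpy as np

def pca_fit(X, k):
    """Return the top-k principal components (columns of U_reduce).

    X is an (m, n) data matrix that has already been mean-normalized.
    """
    m = X.shape[0]
    Sigma = (X.T @ X) / m                      # Step 1: n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # Step 2: eigenvectors of Sigma
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]]               # n x k matrix U_reduce

# Example: reduce toy 3-D data to k = 2 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                         # mean normalization
U_reduce = pca_fit(X, k=2)
print(U_reduce.shape)                          # (3, 2)
```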
Reconstruction from compressed representation

• Compression: $z^{(i)} = U_{reduce}^{T}\, x^{(i)}$
• Reconstruction: $x_{approx}^{(i)} = U_{reduce}\, z^{(i)}$
• Dimensions: $x_{approx}^{(i)} \in \mathbb{R}^{n}$, $U_{reduce} \in \mathbb{R}^{n \times k}$, $z^{(i)} \in \mathbb{R}^{k}$

(Figure: the original data in the (x₁, x₂) plane and the transformed/reconstructed data.)
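Continuing the NumPy sketch above (reusing the mean-centered X and U_reduce from the previous snippet):

```python
# Compression: project each instance onto the k principal components
Z = X @ U_reduce                  # (m, k) compressed representation

# Reconstruction: map back to the original n-dimensional space
X_approx = Z @ U_reduce.T         # (m, n) approximation of X

# Average squared projection error (used below to choose k)
print(np.mean(np.sum((X - X_approx) ** 2, axis=1)))
```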
How do we choose k (the number of principal components)?

• Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \left\lVert x^{(i)} - x_{approx}^{(i)} \right\rVert^{2}$

• Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \left\lVert x^{(i)} \right\rVert^{2}$

• Typically, choose k to be the smallest value so that

$$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - x_{approx}^{(i)}\right\rVert^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}\right\rVert^{2}} \leq 0.01 \quad (1\%)$$

"99% of variance is retained"
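In practice, scikit-learn offers a shortcut for this criterion (a sketch, using the library rather than the manual procedure below): passing a float to n_components asks for the smallest k that retains that fraction of the variance, and explained_variance_ratio_ exposes the per-component fractions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # correlated toy data

# Smallest k such that at least 99% of the variance is retained
pca = PCA(n_components=0.99)
Z = pca.fit_transform(X)

print(pca.n_components_)                           # the chosen k
print(np.cumsum(pca.explained_variance_ratio_))    # cumulative variance retained
```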


How do we choose k (the number of principal components)?

• Try PCA with k = 1, 2, ⋯
• Compute $U_{reduce}$, $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$, and $x_{approx}^{(1)}, x_{approx}^{(2)}, \ldots, x_{approx}^{(m)}$
• Check if

$$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - x_{approx}^{(i)}\right\rVert^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}\right\rVert^{2}} \leq 0.01\ ?$$
Application of PCA

• Compression
• Reduce memory/disk needed to store data
• Speed up learning algorithm
• Visualization (k=2, k=3)

• Bad use of PCA


• Reduce the number of features -> less likely to overfit?
• Use regularization instead.
Variance ≠ Predictive Power

• High Variance, Low Predictive Power:


• Favorite Color in Customer Demographics: Imagine you're building a model to
predict customer churn (when a customer stops using your service). While
favorite color has a lot of variance (people have many different favorites), it likely
has no bearing on whether someone cancels.
• Low Variance, High Predictive Power:
• Purchase History (Same Product in Last Month): This feature might have low
variance (many customers might not have bought the same product recently), but
it's a strong indicator of potential future purchase.
Application: Image compression

Original Image

• Divide the original 372×492 image into patches:
  • Each patch is an instance that contains 12×12 pixels on a grid
  • View each patch as a 144-D vector
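A sketch of this patch-based compression pipeline (the patch-extraction details and image source are assumptions, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_patches(image, patch=12, k=16):
    """Split a grayscale image into non-overlapping patch x patch blocks,
    project each block (a 144-D vector for 12x12) onto k principal
    components, and reconstruct the image from the compressed codes."""
    h, w = image.shape
    h, w = h - h % patch, w - w % patch            # crop so patches tile exactly
    blocks = (image[:h, :w]
              .reshape(h // patch, patch, w // patch, patch)
              .swapaxes(1, 2)
              .reshape(-1, patch * patch))         # (n_patches, patch*patch)

    pca = PCA(n_components=k)
    codes = pca.fit_transform(blocks)              # compressed: (n_patches, k)
    approx = pca.inverse_transform(codes)          # reconstructed patches

    return (approx.reshape(h // patch, w // patch, patch, patch)
                  .swapaxes(1, 2)
                  .reshape(h, w))

# Usage (assuming `img` is a 2-D grayscale NumPy array):
# img_16 = compress_patches(img, patch=12, k=16)
```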
PCA compression: 144D → 60D
PCA compression: 144D → 16D
16 most important eigenvectors
60 most important eigenvectors

Looks like the discrete cosine bases used in JPEG!


2D Discrete Cosine Basis

http://en.wikipedia.org/wiki/Discrete_cosine_transform
Coming up Next…

• Week 6 – Decision Trees & Ensemble Methods
• Homework #3 due June 14 (@ 7pm Pacific Time)
• Team Formation due June 14
• Course Survey due June 14

I'll introduce course projects in Week 6.
Lab Session:
Principal-Component-Analysis
Questions?
