
Instance Based Learning

Bùi Tiến Lên

2022
Contents

1. Classification

2. Metric Learning

3. Regression

4. Clustering
Notation

symbol           meaning
a, b, c, N ...   scalar number
w, v, x, y ...   column vector
X, Y ...         matrix
R                set of real numbers
Z                set of integer numbers
N                set of natural numbers
R^D              set of vectors
X, Y, ...        set
A                algorithm

operator         meaning
w^T              transpose
XY               matrix multiplication
X^{-1}           inverse
Parametric vs Non-parametric Models

Parametric Models
• In the models that we have seen, we select a hypothesis space H and adjust a fixed set of parameters w with the training data D
• We assume that the parameters w summarize the training data D and we can forget about it

    y = f(x; w)    (1)

Non-parametric Models
• A non-parametric model is one that cannot be characterized by a fixed set of parameters
• A family of non-parametric models is Instance Based Learning. The function is based on the training data D = {x_1, x_2, ..., x_n}

    y = f(x; x_1, x_2, ..., x_n)    (2)
Inductive Bias

Concept 1
In nonparametric models, we assume that similar inputs have similar outputs.

• This is a reasonable assumption: the world is smooth, and functions, whether they are densities, discriminants, or regression functions, change slowly. Similar instances mean similar things.
Classification
• k-Nearest Neighbor (k-NN)
• Effects of Hyper-parameters
When To Consider Nearest Neighbor

• Data points x ∈ R^D
• A small number of attributes (D < 20)
• Lots of training data D
Nearest Neighbor

Learning mode
• Store all training examples D = {(x_i, y_i) | i = 1, ..., N}

Running mode
• Nearest neighbor: Given a query instance x_q, first locate the nearest neighbor x^(1), then estimate

    h(x_q) = y^(1)    (3)

• k-Nearest neighbor: Given x_q, take a vote among its k nearest neighbors {x^(1), x^(2), ..., x^(k)}

    h(x_q) = majority vote{y^(1), y^(2), ..., y^(k)}    (4)
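As an illustration, here is a minimal k-NN classifier sketch in Python/NumPy following equations (3) and (4); the function and variable names are ours, not from the lecture.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distances from the query to every stored training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels (k = 1 reduces to the nearest-neighbor rule)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```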
Distance

Some common distances in the space R^D

• The Minkowski distance of order p > 0

    d(x, y) = L_p(x, y) = ( Σ_{i=1}^{D} |x_i − y_i|^p )^{1/p}    (5)

• Euclidean distance (popular)

    d(x, y) = L_2(x, y) = sqrt( Σ_{i=1}^{D} (x_i − y_i)^2 )    (6)
Distance (cont.)

• Manhattan distance

    d(x, y) = L_1(x, y) = Σ_{i=1}^{D} |x_i − y_i|    (7)

Figure 1: Contours of the distance from the origin O for various values of the parameter p (p = 0.5, p = 1, p = 2, p = 4)
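A small NumPy sketch of the distances in equations (5)-(7); `minkowski` is our own helper name, not a function defined in the lecture.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p > 0 (eq. 5)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))   # Manhattan distance (eq. 7): 5.0
print(minkowski(x, y, 2))   # Euclidean distance (eq. 6): ~3.606
print(minkowski(x, y, 4))   # higher-order Minkowski distance
```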
The Curse of Dimensionality

• The more dimensions we have, the more examples we need
• The number of examples that we have in a given volume of space decreases exponentially with the number of dimensions
• If the number of dimensions is very high, the nearest neighbours can be very far away
Analysis

Advantages
• No training, just store data
• Learn complex target functions
• Don't lose information

Disadvantages
• Slow at query time
• Easily fooled by irrelevant attributes
Parameter k

• if k = 1, the cross point x should be classified into the square class
• if k = 3 ?
• if k = 5 ?

Figure: the query point x with neighbors from the square class and the circle class
Parameter k (cont.)

• Data set D with 500 samples belonging to two classes {blue, orange}
Parameter k (cont.)

• Decision regions for various values of k

Figure: decision regions for k = 1, 2, 3, 4, 5, 6, 10, 20, 50
Metric Learning
• Motivation
• Metric Learning
• Loss Function
Motivation

• Nearest neighbor classification

Motivation (cont.)

• Clustering

Motivation (cont.)

• Information retrieval

Figure: a query image and its most similar retrieved images

Motivation (cont.)

• Data visualization
Metric Learning

• Given a set of data points X and their corresponding labels Y
• Select a parametric distance or similarity function

    d_W(x, x') = L( f_W(x), f_W(x') )    (8)

• An embedding function (parametric function)

    f_W(x) : X → R^n    (9)

• A distance function (which is usually fixed beforehand)

    L(x, x') : R^n × R^n → R    (10)

• The goal is to train the parametric distance, so that the combination d_W(x, x') produces small values if the labels y, y' ∈ Y of the samples x, x' ∈ X are equal, and larger values if they aren't.
Metric Learning (cont.)

• Collect similarity judgements on data pairs/triplets

    S = {(x_i, x_j) : x_i and x_j should be similar},
    D = {(x_i, x_j) : x_i and x_j should be dissimilar},    (11)
    R = {(x_i, x_j, x_k) : x_i should be more similar to x_j than to x_k}.

• Estimate parameters so that the metric best agrees with the judgements

    Ŵ = arg min_W [ ℓ(d_W, S, D, R) + λ R(W) ]    (12)

  where ℓ is the loss function and λ R(W) is the regularization term.
Contrastive Approaches

• The embedding function is usually a neural network
• The distance function is the L2 distance
• A loss function (contrastive or triplet, defined on the following slides)
Contrastive Loss

Contrastive Loss (Chopra et al. 2005)
• Let x_1, x_2 be samples from the dataset, and let y_1, y_2 be their corresponding labels. Also, for a condition A, denote by I_A the indicator function that is equal to 1 if A is true, and 0 otherwise. The loss function is then defined as follows:

    ℓ_contrast = I_{y_1 = y_2} d_W(x_1, x_2) + I_{y_1 ≠ y_2} max(0, α − d_W(x_1, x_2))    (13)

  where α is the margin.
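A minimal NumPy sketch of equation (13) for a single pair, assuming the embeddings f_W(x_1), f_W(x_2) have already been computed; the names are illustrative, not from the lecture.

```python
import numpy as np

def contrastive_loss(z1, z2, y1, y2, alpha=1.0):
    """Contrastive loss (eq. 13) for one pair of embeddings z1 = f_W(x1), z2 = f_W(x2)."""
    d = np.linalg.norm(z1 - z2)          # d_W(x1, x2) with an L2 distance in embedding space
    if y1 == y2:
        return d                          # pull similar pairs together
    return max(0.0, alpha - d)            # push dissimilar pairs at least alpha apart

# Toy usage with 2-D embeddings
print(contrastive_loss(np.array([0.1, 0.2]), np.array([0.15, 0.1]), 0, 0))  # same label: small loss
print(contrastive_loss(np.array([0.1, 0.2]), np.array([0.15, 0.1]), 0, 1))  # different labels but close: large penalty
```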
Triplet Loss

Triplet Loss (Schroff et al. 2015)
• Let x_a, x_p, x_n be samples from the dataset and y_a, y_p, y_n be their corresponding labels, such that y_a = y_p and y_a ≠ y_n. Usually, x_a is called the anchor sample, x_p is called the positive sample because it has the same label as x_a, and x_n is called the negative sample because it has a different label. The loss is defined as:

    ℓ_triplet = max(0, d_W(x_a, x_p) − d_W(x_a, x_n) + α)    (14)

  where α is the margin.
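Likewise, a sketch of the triplet loss in equation (14) on precomputed embeddings, hedged in the same way as the contrastive example above.

```python
import numpy as np

def triplet_loss(z_a, z_p, z_n, alpha=1.0):
    """Triplet loss (eq. 14): anchor z_a, positive z_p, negative z_n in embedding space."""
    d_ap = np.linalg.norm(z_a - z_p)      # distance anchor-positive
    d_an = np.linalg.norm(z_a - z_n)      # distance anchor-negative
    return max(0.0, d_ap - d_an + alpha)  # zero once the negative is alpha farther than the positive

print(triplet_loss(np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([2.0, 2.0])))  # -> 0.0
```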
Contrastive Loss vs. Triplet Loss

Figure: contrastive loss vs. triplet loss
Regression
• Kernel Function
• Kernel Regression
• k-NN Regression
• Nadaraya-Watson Model
• Nadaraya-Watson Parametric Model
Feature Space

Project the data into a higher dimensional space (feature space) F
• Transformation function

    φ : R^D → F
    x_i → φ(x_i)    (15)

• Work with φ(x_i) instead of working with x_i.
The Kernel Function

Concept 2
A kernel is a function k(x, z) which represents a dot product in a "hidden" feature space of φ.

    k(x, z) = φ(x) · φ(z)    (16)

• Note that we only have dot products φ(x_i) · φ(x_j) to compute; however, this could be very expensive in a high dimensional space.
• Kernel trick: instead of computing φ(x) = φ([x_1, x_2]^T) = [x_1^2, √2 x_1 x_2, x_2^2]^T explicitly, use k(x, z) = (x · z)^2
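A quick numerical check of the kernel trick for this 2-D example: the explicit feature map φ and the kernel (x · z)^2 give the same value. A sketch with our own names.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D example: [x1^2, sqrt(2) x1 x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Kernel trick: squared dot product in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # 1.0, the dot product in the feature space ...
print(k(x, z))                 # ... equals the kernel value, without ever forming phi
```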
Common Kernels

• Polynomial:

    k(x, z) = (u x · z + v)^p    (u ∈ R, v ∈ R, p ∈ N)    (17)

• Gaussian:

    k(x, z) = exp( −‖x − z‖^2 / σ^2 ),    σ ∈ R^+    (18)

  Note: its feature space is infinite-dimensional
Techniques for Construction of Kernels

In all the following, k_1, k_2, ..., k_j are assumed to be valid kernel functions.
1. Scalar multiplication: The validity of a kernel is conserved after multiplication by a positive scalar, i.e., for any α > 0, the function

    k(x, z) = α k_1(x, z)    (19)

   is a valid kernel function.
2. Adding a positive constant: For any positive constant α > 0, the function

    k(x, z) = α + k_1(x, z)    (20)

   is a valid kernel function.
Techniques for Construction of Kernels (cont.)

3. Linear combination: A linear combination of kernel functions involving only positive weights, i.e.,

    k(x, z) = Σ_{j=1}^{m} α_j k_j(x, z),  with α_j > 0    (21)

   is a valid kernel function.
4. Product: The product of two kernel functions, i.e.,

    k(x, z) = k_1(x, z) k_2(x, z)    (22)

   is a valid kernel function.
Techniques for Construction of Kernels (cont.)

5. Polynomial function of a kernel output: Given a polynomial f : R → R with positive coefficients, the function

    k(x, z) = f(k_1(x, z))    (23)

   is a valid kernel function.
6. Exponential function of a kernel output: The function

    k(x, z) = exp(k_1(x, z))    (24)

   is a valid kernel function.
7. Product of matrix and vectors: The function

    k(x, z) = x^T A z    (25)

   where A is a symmetric positive semidefinite matrix, is a valid kernel function.
Linear Regression Revisited

Problem: Given a dataset of input-output pairs D = {(x_1, y_1), ..., (x_N, y_N)}, find the best linear regression
• Primal form

    ŷ = f(x) = Σ_{i=1}^{D} w_i x_i    (26)

  where

    w = (X^T X + λ I_D)^{−1} X^T y    (27)

• Dual form

    ŷ = f(x) = Σ_{i=1}^{N} α_i x_i^T x    (28)

  where

    α = (X X^T + λ I_N)^{−1} y    (29)
The Kernel Trick

• Question: How to introduce nonlinearity into

    ŷ = f(x) = Σ_{i=1}^{N} α_i x_i^T x

• Solution: Replace the inner product x_i^T x by k(x, x_i), which gives

    ŷ = f(x) = Σ_{i=1}^{N} α_i k(x, x_i)    (30)
Kernel Method

1. Select a kernel function k(·, ·)
2. Construct a kernel matrix K ∈ R^{N×N} where

    [K]_{ij} = k(x_i, x_j)    (31)

3. Compute the coefficients α ∈ R^N, with

    α = (K + λ I_N)^{−1} y    (32)

4. Estimate the predicted value for a new sample x

    ŷ = Σ_{i=1}^{N} α_i k(x, x_i)    (33)
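The four steps above map directly to a few lines of NumPy. A minimal sketch of kernel (ridge) regression with the Gaussian kernel of equation (18); the function names and the choices of σ and λ are ours.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel (eq. 18)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def kernel_regression_fit(X, y, lam=0.1, sigma=1.0):
    """Steps 1-3: build the kernel matrix K (eq. 31) and solve for alpha (eq. 32)."""
    N = len(X)
    K = np.array([[gaussian_kernel(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), y)

def kernel_regression_predict(X, alpha, x_new, sigma=1.0):
    """Step 4: prediction for a new sample (eq. 33)."""
    return sum(a * gaussian_kernel(x_new, x_i, sigma) for a, x_i in zip(alpha, X))

# Toy usage on 1-D inputs
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.8, 0.9, 0.1])
alpha = kernel_regression_fit(X, y)
print(kernel_regression_predict(X, alpha, np.array([1.5])))
```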
Linear Regression vs. Kernel Method

Linear regression                          Kernel method
pick a global model, best fit globally     pick a local model, best fit locally
based on the columns (features)            based on the rows (samples)
handles linearity                          handles nonlinearity
k-NN Regression

• Problem: Given a dataset of input-output pairs D = {(x_1, y_1), ..., (x_N, y_N)}, how to learn f to predict the output ŷ = f(x) for any new input x?
• Solution: Take the mean of the values of the k nearest neighbors {x^(1), x^(2), ..., x^(k)}

    ŷ = ( Σ_{i=1}^{k} y^(i) ) / k    (34)
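A one-function sketch of equation (34), in the same NumPy style as the classification example earlier; the names are ours.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """k-NN regression (eq. 34): average the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.8, 0.9, 0.1])
print(knn_regress(X_train, y_train, np.array([1.4]), k=2))  # mean of the two closest targets
```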
Nadaraya-Watson Model

• Problem: Given a dataset of input-output pairs D = {(x_1, y_1), ..., (x_N, y_N)}, how to learn f to predict the output ŷ = f(x) for any new input x?
• Solution: Consider each (x_i, y_i) as a key-value pair and x as a query

    key    value
    x_1    y_1
    ...    ...
    x_N    y_N

    ŷ = Σ_{i=1}^{N} α(x, x_i) y_i    (35)
Nadaraya-Watson Model (cont.)

• We define α using a Gaussian kernel

    α(x, x_i) = exp(−½ ‖x − x_i‖²) / Σ_{j=1}^{N} exp(−½ ‖x − x_j‖²)    (36)

  and plug it into equation (35)

    ŷ = Σ_{i=1}^{N} α(x, x_i) y_i
      = Σ_{i=1}^{N} [ exp(−½ ‖x − x_i‖²) / Σ_{j=1}^{N} exp(−½ ‖x − x_j‖²) ] y_i    (37)
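A NumPy sketch of the Nadaraya-Watson estimator in equations (36)-(37); the softmax-style weighting runs over all training keys, and the names are ours.

```python
import numpy as np

def nadaraya_watson(X_train, y_train, x_query):
    """Nadaraya-Watson prediction (eq. 37) with a Gaussian kernel of unit bandwidth."""
    # Unnormalized attention scores exp(-1/2 ||x - x_i||^2) for every key x_i
    scores = np.exp(-0.5 * np.sum((X_train - x_query) ** 2, axis=1))
    weights = scores / scores.sum()       # attention weights alpha(x, x_i), eq. (36)
    return np.dot(weights, y_train)       # weighted average of the values y_i

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.8, 0.9, 0.1])
print(nadaraya_watson(X_train, y_train, np.array([1.5])))
```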
Nadaraya-Watson Model (cont.)

• A key x_i that is closer to the given query x will get more attention via a larger attention weight assigned to the key's corresponding value y_i.
Example 1

• Generate an artificial dataset including 50 training examples and 50 testing examples according to the following nonlinear function with the noise term ε ∼ N(0, 0.5)

    y = 2 sin(x) + x^0.8 + ε    (38)

• Find the kernel regression
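One possible way to generate this dataset and run the non-parametric Nadaraya-Watson regression of equation (37) on it. The sample sizes and noise term come from the slide; the input range [0, 5] and the sorting of the inputs are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=50):
    """Artificial data from eq. (38): y = 2 sin(x) + x^0.8 + eps, eps ~ N(0, 0.5)."""
    x = np.sort(rng.uniform(0, 5, n))          # the input range is our assumption
    y = 2 * np.sin(x) + x ** 0.8 + rng.normal(0, 0.5, n)
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(50)

def nw_predict(x_query):
    """Nadaraya-Watson prediction (eq. 37) for a single scalar query."""
    scores = np.exp(-0.5 * (x_train - x_query) ** 2)
    return np.dot(scores / scores.sum(), y_train)

y_hat = np.array([nw_predict(x) for x in x_test])
print("test MSE:", np.mean((y_hat - y_test) ** 2))
```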
Nadaraya-Watson Parametric Model

• Kernel regression enjoys a consistency benefit: given enough data this model converges to the optimal solution.
• Nonetheless, we can easily integrate learnable parameters.
• In the following, the distance between the query x and the key x_i is multiplied by a learnable parameter w:

    ŷ = Σ_{i=1}^{N} [ exp(−½ (‖x − x_i‖ w)²) / Σ_{j=1}^{N} exp(−½ (‖x − x_j‖ w)²) ] y_i    (39)
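A sketch of how the scalar w in equation (39) could be fitted by gradient descent on the training squared error. The learning rate, the number of steps, the numerical gradient, and the leave-one-out style loss (each query's own pair is excluded so that w is not driven to infinity) are our assumptions, not specified on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 5, 50))
y_train = 2 * np.sin(x_train) + x_train ** 0.8 + rng.normal(0, 0.5, 50)

def nw_param_predict(x_query, w, exclude=None):
    """Parametric Nadaraya-Watson prediction (eq. 39) with learnable bandwidth w."""
    scores = np.exp(-0.5 * ((x_train - x_query) * w) ** 2)
    if exclude is not None:
        scores[exclude] = 0.0              # leave the query's own pair out during training
    return np.dot(scores / scores.sum(), y_train)

def loss(w):
    preds = np.array([nw_param_predict(x_train[i], w, exclude=i) for i in range(len(x_train))])
    return np.mean((preds - y_train) ** 2)

w, lr, eps = 1.0, 0.1, 1e-4
for _ in range(200):                        # plain gradient descent with a numerical gradient
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    w -= lr * grad
print("learned w:", w, "training loss:", loss(w))
```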
Example 2

• Generate an artificial dataset including 50 training examples and 50 testing examples according to the following nonlinear function with the noise term ε ∼ N(0, 0.5)

    y = 2 sin(x) + x^0.8 + ε    (40)

• Find the parametric kernel regression
Clustering
• k-Means
• Hierarchical Clustering
• k-d Tree
Clustering

Concept 3
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
k-Means

Concept 4
Given a set of observations D = {x_1, ..., x_N}, k-means clustering aims to partition the N observations into k (≤ N) sets S = {S_1, S_2, ..., S_k} so as to minimize the within-cluster sum of squares.

• The objective is to find

    arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − µ_i‖²    (41)

  where µ_i is the mean of S_i
Naive k-Means Algorithm

1. Initialise a set of k means m_1^(0), ..., m_k^(0)
2. For t = 1, 2, 3, ... do
   • Assignment step: Assign each observation to the cluster with the nearest mean, i.e. the one with the least squared Euclidean distance

       S_i^(t) = { x | L_2(x, m_i^(t)) < L_2(x, m_j^(t)), ∀j ≠ i }    (42)

   • Update step: Recalculate the means (centroids) of the observations assigned to each cluster

       m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x ∈ S_i^(t)} x    (43)

The algorithm has converged when the assignments no longer change.
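The two alternating steps translate almost line by line into NumPy. A minimal sketch of the naive algorithm; the random initialisation from the data points is our choice, and the sketch assumes no cluster becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Naive k-means: alternate the assignment step (eq. 42) and the update step (eq. 43)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), k, replace=False)]    # initialise the k means from the data
    for _ in range(n_iter):
        # Assignment step: index of the nearest mean for every observation
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):              # assignments (hence means) stopped changing
            break
        means = new_means
    return means, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
means, labels = kmeans(X, k=2)
print(means)
```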
Hierarchical Clustering

Concept 5
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.

Figure: a hierarchical clustering dendrogram; each leaf label shows the number of points in the node (or the index of the point if there is no parenthesis)
Linkage Function

Concept 6
A linkage function L is used to calculate the distance (similarity/dissimilarity) between arbitrary subsets of the instance space, given a distance metric d.

• Single linkage: defines the distance between two clusters as the smallest pairwise distance between elements from each cluster.

    L_single(A, B) = min{ d(x, y) | x ∈ A, y ∈ B }    (44)

• Complete linkage: defines the distance between two clusters as the largest pairwise distance.

    L_complete(A, B) = max{ d(x, y) | x ∈ A, y ∈ B }    (45)
Agglomerative algorithm

• Given a set of observations D = {x_1, ..., x_n}

    Initialise clusters to singleton data points
    Create a leaf node for every singleton cluster
    Repeat
        find the pair of clusters X, Y with the lowest linkage
        merge X, Y into Z
        create a node for Z (parent node of X, Y)
    Until all data points are in one cluster
    Return the constructed binary tree
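In practice this bottom-up procedure is available in SciPy. A small sketch that builds the hierarchy with single linkage (eq. 44) and reads off flat clusters; the two-blob data and the cut into two clusters are our choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs of points
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5])

# Agglomerative clustering with the single-linkage criterion (eq. 44);
# Z encodes the sequence of merges, i.e. the binary tree built by the algorithm
Z = linkage(X, method="single")

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```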
k-d Tree

• The fundamental problem of k-NN is that distance computation is costly, and the total cost is unavoidably linear in the number of points compared.
• To increase the processing speed, it is possible to partition the data space and reduce this number significantly using a k-d tree.

Concept 7
A k-d tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.
Algorithm

Construct k-d tree
• Given a D-dimensional dataset D = {x_1, x_2, ..., x_N}
• Cut the data with a plane at the median value along a chosen dimension
• Recurse this procedure to create a balanced binary tree (the k-d tree)

Nearest neighbor search
• To locate the NN of a query vector x, determine which leaf cell it lies within
• Perform an exhaustive search within this cell
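A compact sketch of the construction step: split on the median, cycling through the dimensions, to get a balanced binary tree. The dictionary-based node representation and the choice to cycle dimensions by depth are our assumptions; the search step with backtracking is omitted for brevity.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree: cut at the median along one dimension per level."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]                # cycle through the dimensions
    points = points[points[:, axis].argsort()]    # sort along the splitting dimension
    median = len(points) // 2                     # the median split keeps the tree balanced
    return {
        "point": points[median],
        "left": build_kdtree(points[:median], depth + 1),
        "right": build_kdtree(points[median + 1:], depth + 1),
    }

D = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
tree = build_kdtree(D)
print(tree["point"])   # -> [7. 2.], the median along the first axis becomes the root
```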
Example

Given a dataset D = {(x_1, x_2)} = {(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)}
• Construct the k-d tree

Figure: the resulting k-d tree, with root (7, 2), children (5, 4) and (9, 6), and leaves (2, 3), (4, 7), (8, 1), together with the corresponding partition of the plane
Example (cont.)

• Nearest neighbor search

Figure: nearest neighbor search on the partitioned plane