05 Lecture Slides: Kernels
CHIRP, CHIRP
Cricket-chirp model: $y = f_w(x)$
Approximation of $\sin(10x)$ by 16 Gaussian RBFs (black line: original function; red line: regression).
In one dimension this seems to work well, but does it scale to higher dimensions?
Curse of dimensionality
To cover the data with a constant discretization level (number of RBFs per unit volume), the number of basis functions and weights grows exponentially with the number of dimensions.
Dimensions   # of basis functions / weights
1            16
2            16^2 = 256
3            16^3 = 4096
4            16^4 = 65 536
5            16^5 = 1 048 576
⋮            ⋮
10           16^10 ≈ 10^12

(Figure: RBF coverage of the data in 1D and 2D.)
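The growth is easy to tabulate; a minimal sketch assuming 16 RBF centers per dimension, as in the example above:

```python
# Number of RBF basis functions (and weights) needed to keep
# 16 centers per dimension on a regular grid.
centers_per_dim = 16
for d in [1, 2, 3, 4, 5, 10]:
    print(f"{d:2d} dimensions: {centers_per_dim**d:>15,d} basis functions")
```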
The weight vector can be written as a linear combination of the training inputs,

$\boldsymbol{w} = \sum_{n=1}^{N} a_n \boldsymbol{x}^{(n)} .$

The parameters $\boldsymbol{a} = (a_1, \ldots, a_N)$ are called the dual parameters. With basis functions this becomes

$\boldsymbol{w} = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\boldsymbol{x}^{(n)}) .$

The predictions on the training inputs are then

$Y^{(m)} = y(\boldsymbol{x}^{(m)}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x}^{(m)}) = \sum_{n=1}^{N} a_n K_{mn} ,$

i.e. $\boldsymbol{Y}^T = \boldsymbol{a}^T \boldsymbol{K}$ with $\boldsymbol{Y} = (Y^{(1)}, Y^{(2)}, \ldots, Y^{(N)})^T$.
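A minimal numeric sketch of this identity (the feature map, sizes, and dual parameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N training inputs, an explicit feature map phi, and dual parameters a.
N, M = 5, 3
X = rng.normal(size=(N, 1))                      # training inputs x^(1), ..., x^(N)
phi = lambda x: np.hstack([x, x**2, x**3])       # example feature map (assumption: cubic monomials)
a = rng.normal(size=N)                           # dual parameters a_1, ..., a_N

Phi = phi(X)                                     # N x M matrix of feature vectors
K = Phi @ Phi.T                                  # Gram matrix, K_mn = phi(x^(n))^T phi(x^(m))

# Primal weights w = sum_n a_n phi(x^(n)) and the equivalent dual predictions Y = K a.
w = Phi.T @ a
Y_primal = Phi @ w
Y_dual = K @ a
print(np.allclose(Y_primal, Y_dual))             # True: both give the same predictions
```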
Solution of dual representation
Primal representation:
$E(\boldsymbol{w}) = \|\boldsymbol{y}^T - \boldsymbol{w}^T \boldsymbol{\Phi}\|_2^2$, where $\boldsymbol{\Phi}$ is the matrix of feature vectors.
Solution: $\boldsymbol{w} = (\boldsymbol{\Phi}\boldsymbol{\Phi}^T)^{-1} \boldsymbol{\Phi}\, \boldsymbol{y} = \boldsymbol{\Phi}^+ \boldsymbol{y}$
Complexity is $O(M^3)$, where $M$ is the number of basis functions.
Predictions: $y(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x})$

Dual representation:
$E(\boldsymbol{a}) = \|\boldsymbol{y}^T - \boldsymbol{a}^T \boldsymbol{K}\|_2^2$, where $\boldsymbol{K}$ is the Gram matrix (also called kernel matrix).
Solution: $\boldsymbol{a} = (\boldsymbol{K}\boldsymbol{K}^T)^{-1} \boldsymbol{K}\, \boldsymbol{y} = \boldsymbol{K}^+ \boldsymbol{y}$
Complexity is $O(N^3)$, where $N$ is the number of training samples.
Predictions: $y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x})$
"Component $a_n$ weighted with similarity to training sample $\boldsymbol{x}^{(n)}$."
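A sketch of both solutions on toy data (the sin-based data and RBF feature map are assumptions echoing the earlier example; here $\boldsymbol{\Phi}$ is stored as an $N \times M$ matrix, so the pseudo-inverse is applied directly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D regression problem with an explicit feature map (assumption: 16 Gaussian RBFs).
N = 20
x_train = rng.uniform(-1, 1, size=N)
y_train = np.sin(10 * x_train)

centers = np.linspace(-1, 1, 16)                           # M = 16 RBF centers
alpha = 0.2
def phi(x):                                                # feature vectors phi(x), shape (len(x), M)
    return np.exp(-(np.subtract.outer(x, centers))**2 / (2 * alpha**2))

Phi = phi(x_train)                                         # N x M

# Primal solution: w = Phi^+ y  (Phi is N x M here, so the pseudo-inverse is used directly).
w = np.linalg.pinv(Phi) @ y_train

# Dual solution: a = K^+ y with Gram matrix K.
K = Phi @ Phi.T
a = np.linalg.pinv(K) @ y_train

# Both give (essentially) the same prediction at a new point.
x_new = np.array([0.3])
pred_primal = phi(x_new) @ w
pred_dual = (phi(x_new) @ Phi.T) @ a                       # = sum_n a_n K(x^(n), x_new)
print(pred_primal, pred_dual)
```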
Why is the dual representation useful?
$y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x})$

Calculate kernel:

$K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y}) = \sum_{i=1}^{M} \phi_i(\boldsymbol{x})\, \phi_i(\boldsymbol{y}) = \exp\!\left(-\frac{\boldsymbol{x}^2 + \boldsymbol{y}^2}{2\alpha^2}\right) \sum_{i=1}^{M} \exp\!\left(\frac{\boldsymbol{m}^{(i)}(\boldsymbol{x}+\boldsymbol{y}) - (\boldsymbol{m}^{(i)})^2}{\alpha^2}\right)$

(for Gaussian RBF basis functions $\phi_i(\boldsymbol{x}) = \exp\!\left(-(\boldsymbol{x} - \boldsymbol{m}^{(i)})^2 / (2\alpha^2)\right)$ with centers $\boldsymbol{m}^{(i)}$).
Linear regression:

$y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x}) ,$

where $N$ is the number of training samples.
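A quick numeric check of the closed form above (1D inputs; the centers and width are arbitrary illustration values):

```python
import numpy as np

# Gaussian RBF features phi_i(x) = exp(-(x - m_i)^2 / (2 alpha^2)) at M centers m_i.
M = 16
m = np.linspace(-1, 1, M)        # centers m^(i)
alpha = 0.3

def phi(x):
    return np.exp(-(x - m)**2 / (2 * alpha**2))

def kernel_closed_form(x, y):
    # K(x, y) = exp(-(x^2 + y^2) / (2 alpha^2)) * sum_i exp((m_i (x + y) - m_i^2) / alpha^2)
    return np.exp(-(x**2 + y**2) / (2 * alpha**2)) * np.sum(np.exp((m * (x + y) - m**2) / alpha**2))

x, y = 0.4, -0.7
print(phi(x) @ phi(y), kernel_closed_form(x, y))   # both values agree
```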
Example: Kernel of polynomial basis
Calculate $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ component-wise: $(\boldsymbol{\Phi}^T \boldsymbol{\Phi})_{mn} = \boldsymbol{\phi}(\boldsymbol{x}^{(m)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(n)}) = K(\boldsymbol{x}^{(m)}, \boldsymbol{x}^{(n)})$.
Thus we have

$\boldsymbol{K} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} ,$

with $\boldsymbol{K}$ symmetric.
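An illustrative sketch for the quadratic case (the explicit feature map below is an assumption, chosen so that $K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T\boldsymbol{y} + 1)^2$ has a simple finite feature space):

```python
import numpy as np

# Explicit feature map for the quadratic polynomial kernel K(x, y) = (x^T y + 1)^2 in 2D:
# phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def kernel(x, y):
    return (x @ y + 1.0)**2

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])
print(phi(x) @ phi(y), kernel(x, y))   # both agree: the kernel avoids building phi explicitly
```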
Proofs.
Fix a finite set of points $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(l)}$ and let $\boldsymbol{K}, \boldsymbol{K}_1, \boldsymbol{K}_2$ be the corresponding Gram matrices of the kernel functions $K, K_1, K_2$.
Obviously the resulting $\boldsymbol{K}$ is symmetric in every case.
It remains to be shown that $\boldsymbol{K}$ is positive semi-definite, that is, $\boldsymbol{z}^T \boldsymbol{K} \boldsymbol{z} \geq 0$ for all $\boldsymbol{z}$.
Making kernels from kernels
1. $K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y}) + K_2(\boldsymbol{x}, \boldsymbol{y})$
We have

$\boldsymbol{z}^T (\boldsymbol{K}_1 + \boldsymbol{K}_2) \boldsymbol{z} = \boldsymbol{z}^T \boldsymbol{K}_1 \boldsymbol{z} + \boldsymbol{z}^T \boldsymbol{K}_2 \boldsymbol{z} \geq 0$

since $\boldsymbol{K}_1$ and $\boldsymbol{K}_2$ are positive semi-definite.

2. $K(\boldsymbol{x}, \boldsymbol{y}) = a\, K_1(\boldsymbol{x}, \boldsymbol{y})$ for $a > 0$
Similarly, we have

$\boldsymbol{z}^T (a \boldsymbol{K}_1) \boldsymbol{z} = a\, \boldsymbol{z}^T \boldsymbol{K}_1 \boldsymbol{z} \geq 0 .$
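A small numeric sanity check of rules 1 and 2 (the two kernels and the points are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))                                   # a finite set of points

# Gram matrices of two kernels: linear and Gaussian.
K1 = X @ X.T
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K2 = np.exp(-d2 / (2 * 0.5**2))

def min_eigenvalue(K):
    return np.linalg.eigvalsh(K).min()

# Sums and positive scalings of Gram matrices stay positive semi-definite
# (smallest eigenvalue >= 0 up to rounding).
print(min_eigenvalue(K1 + K2), min_eigenvalue(3.0 * K1))
```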
Making kernels from kernels
3. $K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y})\, K_2(\boldsymbol{x}, \boldsymbol{y})$
We explicitly construct a feature space corresponding to $K(\boldsymbol{x}, \boldsymbol{y})$:

$K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y})\, K_2(\boldsymbol{x}, \boldsymbol{y}) = \left(\boldsymbol{\phi}_1(\boldsymbol{x})^T \boldsymbol{\phi}_1(\boldsymbol{y})\right)\left(\boldsymbol{\phi}_2(\boldsymbol{x})^T \boldsymbol{\phi}_2(\boldsymbol{y})\right) = \sum_i \sum_j \phi_1(\boldsymbol{x})_i\, \phi_1(\boldsymbol{y})_i\, \phi_2(\boldsymbol{x})_j\, \phi_2(\boldsymbol{y})_j = \sum_{i,j} \phi_1(\boldsymbol{x})_i\, \phi_2(\boldsymbol{x})_j\, \phi_1(\boldsymbol{y})_i\, \phi_2(\boldsymbol{y})_j = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y})$

with $\boldsymbol{\phi}(\boldsymbol{x}) = \boldsymbol{\phi}_1(\boldsymbol{x}) \otimes \boldsymbol{\phi}_2(\boldsymbol{x})$ (Kronecker product).
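The construction can be checked numerically; a brief sketch with invented feature maps:

```python
import numpy as np

# Two arbitrary feature maps phi1, phi2 on R^2 (pure illustration).
phi1 = lambda x: np.array([x[0], x[1], x[0] * x[1]])
phi2 = lambda x: np.array([1.0, x[0]**2, x[1]**2])

K1 = lambda x, y: phi1(x) @ phi1(y)
K2 = lambda x, y: phi2(x) @ phi2(y)

# Product kernel and its feature map: phi(x) = phi1(x) ⊗ phi2(x).
phi = lambda x: np.kron(phi1(x), phi2(x))

x = np.array([0.3, -0.8])
y = np.array([1.1, 0.4])
print(K1(x, y) * K2(x, y), phi(x) @ phi(y))   # identical values
```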
Making kernels from kernels
4. $K(\boldsymbol{x}, \boldsymbol{y}) = K_3(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{y}))$ for $K_3$ a kernel on $\mathbb{R}^m$ and $\boldsymbol{\phi}: \mathcal{X} \to \mathbb{R}^m$
Since $K_3$ is a kernel for all input values, there is nothing to prove.
https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.LocalLearning
$K(\boldsymbol{x}, \boldsymbol{y}) = \exp\!\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right)$

The quality of the result is very sensitive to the choice of the kernel width $\sigma$.
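A minimal sketch of that sensitivity, reusing the dual least-squares fit from above on noisy $\sin(10x)$ data (all values are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy 1D training data.
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(10 * x_train) + 0.1 * rng.normal(size=30)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(10 * x_test)

def gaussian_gram(xa, xb, sigma):
    return np.exp(-np.subtract.outer(xa, xb)**2 / (2 * sigma**2))

# Fit dual parameters a = K^+ y and report the test error for several widths sigma.
for sigma in [0.01, 0.1, 1.0]:
    K = gaussian_gram(x_train, x_train, sigma)
    a = np.linalg.pinv(K) @ y_train
    y_pred = gaussian_gram(x_test, x_train, sigma) @ a
    print(f"sigma = {sigma:4}: test RMSE = {np.sqrt(np.mean((y_pred - y_test)**2)):.3f}")
```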
$K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y}) + K_2(\boldsymbol{x}, \boldsymbol{y})$
Linear: $K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^T \boldsymbol{y}$
Polynomial: $K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T \boldsymbol{y} + c)^d$
Gaussian: $K(\boldsymbol{x}, \boldsymbol{y}) = \exp\!\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right)$
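A compact sketch of these kernels as plain functions (parameter defaults are placeholders):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c)**d

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```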
Regression
Gaussian Processes (GPs)
Classification (especially Support Vector Machines)
Principal Component Analysis (PCA)
Every algorithm that uses only scalar products can be kernelized.
C. Bishop: Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.