05 Lecture Slides: Kernels
CHIRP, CHIRP
Cricket-chirp model: $y = f_w(x)$
Approximation of $\sin(10x)$ by 16 Gaussian RBFs (black line: original function; red line: regression).
In one dimension this seems to work well, but does it scale to higher dimensions?
Curse of dimensionality
To cover the data with a constant discretization level (number of RBFs per unit volume), the number of basis functions and weights grows exponentially with the number of dimensions.
Dimensions   # of basis functions / weights
1            16
2            16^2 = 256
3            16^3 = 4096
4            16^4 = 65 536
5            16^5 = 1 048 576
⋮            ⋮
10           16^10 ≈ 10^12

(Figure: RBF coverage of the data in 1D and 2D.)
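The growth is easy to tabulate; a minimal sketch assuming 16 RBF centers per dimension, as in the example above:

```python
# Number of RBF basis functions (and weights) needed to keep
# 16 centers per dimension on a regular grid.
centers_per_dim = 16
for d in [1, 2, 3, 4, 5, 10]:
    print(f"{d:2d} dimensions: {centers_per_dim**d:>15,d} basis functions")
```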
The weight vector can be written as a linear combination of the training inputs,

$\boldsymbol{w} = \sum_{n=1}^{N} a_n \boldsymbol{x}^{(n)} .$

The parameters $\boldsymbol{a} = (a_1, \ldots, a_N)$ are called the dual parameters. With basis functions this becomes

$\boldsymbol{w} = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\boldsymbol{x}^{(n)}) .$

The predictions on the training inputs are then

$Y^{(m)} = y(\boldsymbol{x}^{(m)}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x}^{(m)}) = \sum_{n=1}^{N} a_n K_{mn} ,$

i.e. $\boldsymbol{Y}^T = \boldsymbol{a}^T \boldsymbol{K}$ with $\boldsymbol{Y} = (Y^{(1)}, Y^{(2)}, \ldots, Y^{(N)})^T$.
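A minimal numeric sketch of this identity (the feature map, sizes, and dual parameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N training inputs, an explicit feature map phi, and dual parameters a.
N, M = 5, 3
X = rng.normal(size=(N, 1))                      # training inputs x^(1), ..., x^(N)
phi = lambda x: np.hstack([x, x**2, x**3])       # example feature map (assumption: cubic monomials)
a = rng.normal(size=N)                           # dual parameters a_1, ..., a_N

Phi = phi(X)                                     # N x M matrix of feature vectors
K = Phi @ Phi.T                                  # Gram matrix, K_mn = phi(x^(n))^T phi(x^(m))

# Primal weights w = sum_n a_n phi(x^(n)) and the equivalent dual predictions Y = K a.
w = Phi.T @ a
Y_primal = Phi @ w
Y_dual = K @ a
print(np.allclose(Y_primal, Y_dual))             # True: both give the same predictions
```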
Solution of dual representation
Primal representation:
$E(\boldsymbol{w}) = \|\boldsymbol{y}^T - \boldsymbol{w}^T \boldsymbol{\Phi}\|_2^2$, where $\boldsymbol{\Phi}$ is the matrix of feature vectors.
Solution: $\boldsymbol{w} = (\boldsymbol{\Phi}\boldsymbol{\Phi}^T)^{-1} \boldsymbol{\Phi}\, \boldsymbol{y} = \boldsymbol{\Phi}^+ \boldsymbol{y}$
Complexity is $O(M^3)$, where $M$ is the number of basis functions.
Predictions: $y(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x})$

Dual representation:
$E(\boldsymbol{a}) = \|\boldsymbol{y}^T - \boldsymbol{a}^T \boldsymbol{K}\|_2^2$, where $\boldsymbol{K}$ is the Gram matrix (also called kernel matrix).
Solution: $\boldsymbol{a} = (\boldsymbol{K}\boldsymbol{K}^T)^{-1} \boldsymbol{K}\, \boldsymbol{y} = \boldsymbol{K}^+ \boldsymbol{y}$
Complexity is $O(N^3)$, where $N$ is the number of training samples.
Predictions: $y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x})$
"Component $a_n$ weighted with similarity to training sample $\boldsymbol{x}^{(n)}$."
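A sketch of both solutions on toy data (the sin-based data and RBF feature map are assumptions echoing the earlier example; here $\boldsymbol{\Phi}$ is stored as an $N \times M$ matrix, so the pseudo-inverse is applied directly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D regression problem with an explicit feature map (assumption: 16 Gaussian RBFs).
N = 20
x_train = rng.uniform(-1, 1, size=N)
y_train = np.sin(10 * x_train)

centers = np.linspace(-1, 1, 16)                           # M = 16 RBF centers
alpha = 0.2
def phi(x):                                                # feature vectors phi(x), shape (len(x), M)
    return np.exp(-(np.subtract.outer(x, centers))**2 / (2 * alpha**2))

Phi = phi(x_train)                                         # N x M

# Primal solution: w = Phi^+ y  (Phi is N x M here, so the pseudo-inverse is used directly).
w = np.linalg.pinv(Phi) @ y_train

# Dual solution: a = K^+ y with Gram matrix K.
K = Phi @ Phi.T
a = np.linalg.pinv(K) @ y_train

# Both give (essentially) the same prediction at a new point.
x_new = np.array([0.3])
pred_primal = phi(x_new) @ w
pred_dual = (phi(x_new) @ Phi.T) @ a                       # = sum_n a_n K(x^(n), x_new)
print(pred_primal, pred_dual)
```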
Why is the dual representation useful?
$y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x})$

Calculate kernel:

$K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y}) = \sum_{i=1}^{M} \phi_i(\boldsymbol{x})\, \phi_i(\boldsymbol{y}) = \exp\!\left(-\frac{\boldsymbol{x}^2 + \boldsymbol{y}^2}{2\alpha^2}\right) \sum_{i=1}^{M} \exp\!\left(\frac{\boldsymbol{m}^{(i)}(\boldsymbol{x}+\boldsymbol{y}) - (\boldsymbol{m}^{(i)})^2}{\alpha^2}\right)$

(for Gaussian RBF basis functions $\phi_i(\boldsymbol{x}) = \exp\!\left(-(\boldsymbol{x} - \boldsymbol{m}^{(i)})^2 / (2\alpha^2)\right)$ with centers $\boldsymbol{m}^{(i)}$).
Linear regression:

$y(\boldsymbol{x}) = \sum_{n=1}^{N} a_n K(\boldsymbol{x}^{(n)}, \boldsymbol{x}) ,$

where $N$ is the number of training samples.
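A quick numeric check of the closed form above (1D inputs; the centers and width are arbitrary illustration values):

```python
import numpy as np

# Gaussian RBF features phi_i(x) = exp(-(x - m_i)^2 / (2 alpha^2)) at M centers m_i.
M = 16
m = np.linspace(-1, 1, M)        # centers m^(i)
alpha = 0.3

def phi(x):
    return np.exp(-(x - m)**2 / (2 * alpha**2))

def kernel_closed_form(x, y):
    # K(x, y) = exp(-(x^2 + y^2) / (2 alpha^2)) * sum_i exp((m_i (x + y) - m_i^2) / alpha^2)
    return np.exp(-(x**2 + y**2) / (2 * alpha**2)) * np.sum(np.exp((m * (x + y) - m**2) / alpha**2))

x, y = 0.4, -0.7
print(phi(x) @ phi(y), kernel_closed_form(x, y))   # both values agree
```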
Example: Kernel of polynomial basis
Calculate $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ component-wise: $(\boldsymbol{\Phi}^T \boldsymbol{\Phi})_{mn} = \boldsymbol{\phi}(\boldsymbol{x}^{(m)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(n)}) = K(\boldsymbol{x}^{(m)}, \boldsymbol{x}^{(n)})$.
Thus we have

$\boldsymbol{K} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} ,$

with $\boldsymbol{K}$ symmetric.
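An illustrative sketch for the quadratic case (the explicit feature map below is an assumption, chosen so that $K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T\boldsymbol{y} + 1)^2$ has a simple finite feature space):

```python
import numpy as np

# Explicit feature map for the quadratic polynomial kernel K(x, y) = (x^T y + 1)^2 in 2D:
# phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def kernel(x, y):
    return (x @ y + 1.0)**2

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])
print(phi(x) @ phi(y), kernel(x, y))   # both agree: the kernel avoids building phi explicitly
```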
Proofs.
Fix a finite set of points $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(l)}$ and let $\boldsymbol{K}, \boldsymbol{K}_1, \boldsymbol{K}_2$ be the corresponding Gram matrices of the kernel functions $K, K_1, K_2$.
Obviously the resulting $\boldsymbol{K}$ is symmetric in every case.
It remains to be shown that $\boldsymbol{K}$ is positive semi-definite, that is, $\boldsymbol{z}^T \boldsymbol{K} \boldsymbol{z} \geq 0$ for all $\boldsymbol{z}$.
Making kernels from kernels
1. $K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y}) + K_2(\boldsymbol{x}, \boldsymbol{y})$
We have

$\boldsymbol{z}^T (\boldsymbol{K}_1 + \boldsymbol{K}_2) \boldsymbol{z} = \boldsymbol{z}^T \boldsymbol{K}_1 \boldsymbol{z} + \boldsymbol{z}^T \boldsymbol{K}_2 \boldsymbol{z} \geq 0$

since $\boldsymbol{K}_1$ and $\boldsymbol{K}_2$ are positive semi-definite.

2. $K(\boldsymbol{x}, \boldsymbol{y}) = a\, K_1(\boldsymbol{x}, \boldsymbol{y})$ for $a > 0$
Similarly, we have

$\boldsymbol{z}^T (a \boldsymbol{K}_1) \boldsymbol{z} = a\, \boldsymbol{z}^T \boldsymbol{K}_1 \boldsymbol{z} \geq 0 .$
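A small numeric sanity check of rules 1 and 2 (the two kernels and the points are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))                                   # a finite set of points

# Gram matrices of two kernels: linear and Gaussian.
K1 = X @ X.T
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K2 = np.exp(-d2 / (2 * 0.5**2))

def min_eigenvalue(K):
    return np.linalg.eigvalsh(K).min()

# Sums and positive scalings of Gram matrices stay positive semi-definite
# (smallest eigenvalue >= 0 up to rounding).
print(min_eigenvalue(K1 + K2), min_eigenvalue(3.0 * K1))
```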
Making kernels from kernels
3. $K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y})\, K_2(\boldsymbol{x}, \boldsymbol{y})$
We explicitly construct a feature space corresponding to $K(\boldsymbol{x}, \boldsymbol{y})$:

$K(\boldsymbol{x}, \boldsymbol{y}) = K_1(\boldsymbol{x}, \boldsymbol{y})\, K_2(\boldsymbol{x}, \boldsymbol{y}) = \left(\boldsymbol{\phi}_1(\boldsymbol{x})^T \boldsymbol{\phi}_1(\boldsymbol{y})\right)\left(\boldsymbol{\phi}_2(\boldsymbol{x})^T \boldsymbol{\phi}_2(\boldsymbol{y})\right) = \sum_i \sum_j \phi_1(\boldsymbol{x})_i\, \phi_1(\boldsymbol{y})_i\, \phi_2(\boldsymbol{x})_j\, \phi_2(\boldsymbol{y})_j = \sum_{i,j} \phi_1(\boldsymbol{x})_i\, \phi_2(\boldsymbol{x})_j\, \phi_1(\boldsymbol{y})_i\, \phi_2(\boldsymbol{y})_j = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y})$

with $\boldsymbol{\phi}(\boldsymbol{x}) = \boldsymbol{\phi}_1(\boldsymbol{x}) \otimes \boldsymbol{\phi}_2(\boldsymbol{x})$ (Kronecker product).
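The construction can be checked numerically; a brief sketch with invented feature maps:

```python
import numpy as np

# Two arbitrary feature maps phi1, phi2 on R^2 (pure illustration).
phi1 = lambda x: np.array([x[0], x[1], x[0] * x[1]])
phi2 = lambda x: np.array([1.0, x[0]**2, x[1]**2])

K1 = lambda x, y: phi1(x) @ phi1(y)
K2 = lambda x, y: phi2(x) @ phi2(y)

# Product kernel and its feature map: phi(x) = phi1(x) ⊗ phi2(x).
phi = lambda x: np.kron(phi1(x), phi2(x))

x = np.array([0.3, -0.8])
y = np.array([1.1, 0.4])
print(K1(x, y) * K2(x, y), phi(x) @ phi(y))   # identical values
```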
Making kernels from kernels
4. $K(\boldsymbol{x}, \boldsymbol{y}) = K_3(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{y}))$ for $K_3$ a kernel on $\mathbb{R}^m$ and $\boldsymbol{\phi}: \mathcal{X} \to \mathbb{R}^m$
Since $K_3$ is a kernel for all input values, there is nothing to prove.
https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.LocalLearning
$K(\boldsymbol{x}, \boldsymbol{y}) = \exp\!\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right)$

The quality of the result is very sensitive to the choice of the kernel width $\sigma$.
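A minimal sketch of that sensitivity, reusing the dual least-squares fit from above on noisy $\sin(10x)$ data (all values are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy 1D training data.
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(10 * x_train) + 0.1 * rng.normal(size=30)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(10 * x_test)

def gaussian_gram(xa, xb, sigma):
    return np.exp(-np.subtract.outer(xa, xb)**2 / (2 * sigma**2))

# Fit dual parameters a = K^+ y and report the test error for several widths sigma.
for sigma in [0.01, 0.1, 1.0]:
    K = gaussian_gram(x_train, x_train, sigma)
    a = np.linalg.pinv(K) @ y_train
    y_pred = gaussian_gram(x_test, x_train, sigma) @ a
    print(f"sigma = {sigma:4}: test RMSE = {np.sqrt(np.mean((y_pred - y_test)**2)):.3f}")
```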
$K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{y}) + K_2(\boldsymbol{x}, \boldsymbol{y})$
Linear: $K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^T \boldsymbol{y}$
Polynomial: $K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T \boldsymbol{y} + c)^d$
Gaussian: $K(\boldsymbol{x}, \boldsymbol{y}) = \exp\!\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right)$
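A compact sketch of these kernels as plain functions (parameter defaults are placeholders):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c)**d

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```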
Regression
Gaussian Processes (GPs)
Classification (especially Support Vector Machines)
Principal Component Analysis (PCA)
Every algorithm that uses only scalar products can be kernelized.
C. Bishop: Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.