10 Understanding Kernels
Lecture 10
Understanding Kernels
Philipp Hennig
19 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #   date    content                        Ex |  #   date    content                       Ex
 1   20.04.  Introduction                    1 | 14   09.06.  Logistic Regression            8
 2   21.04.  Reasoning under Uncertainty       | 15   15.06.  Exponential Families
 3   27.04.  Continuous Variables            2 | 16   16.06.  Graphical Models               9
 4   28.04.  Monte Carlo                       | 17   22.06.  Factor Graphs
 5   04.05.  Markov Chain Monte Carlo        3 | 18   23.06.  The Sum-Product Algorithm     10
 6   05.05.  Gaussian Distributions            | 19   29.06.  Example: Topic Models
 7   11.05.  Parametric Regression           4 | 20   30.06.  Mixture Models                11
 8   12.05.  Learning Representations          | 21   06.07.  EM
 9   18.05.  Gaussian Processes              5 | 22   07.07.  Variational Inference         12
10   19.05.  Understanding Kernels             | 23   13.07.  Example: Topic Models
11   25.05.  An Example for GP Regression    6 | 24   14.07.  Example: Inferring Topics     13
12   26.05.  Gauss-Markov Models               | 25   20.07.  Example: Kernel Topic Models
13   08.06.  GP Classification               7 | 26   21.07.  Revision
What we’ve seen:
▶ Inference in models involving linear relationships between Gaussian random variables requires only linear algebra operations
▶ features can be used to learn nonlinear (real-valued) functions on various domains
▶ feature representations can be learned using type-II-maximum likelihood
▶ Gaussian process models allow utilizing infinitely many features in finite time
Warning
Results shown here are often simplified.
Some regularity assumptions have been dropped for easier readability.
If you don’t like math, wait for the next lecture.
Further reading:
M. Kanagawa, P. Hennig, D. Sejdinovic, B. K. Sriperumbudur
Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
arXiv 1807.02582, 2018
and
I. Steinwart, A. Christmann
Support Vector Machines
Springer SBM, 2008
What are kernels? Can I think of them as “infinitely large matrices”?
Quick Linear-Algebra Refresher
positive definite matrices
Definition (Eigenvalue)
Let A ∈ R^{n×n} be a matrix. A scalar λ ∈ C and a vector v ∈ C^n are called eigenvalue and corresponding eigenvector if

[Av]_i = \sum_{j=1}^{n} [A]_{ij} [v]_j = \lambda [v]_i .

Definition (Positive definite matrix)
A symmetric matrix A ∈ R^{n×n} is called positive definite if all its eigenvalues are positive, i.e. if it has a spectral decomposition

[A]_{ij} = \sum_{a=1}^{n} \lambda_a [v_a]_i [v_a]_j \quad \text{with} \quad \lambda_a > 0 \;\; \forall\, a = 1, \dots, n.
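Not from the slides: a minimal numerical sketch of these two definitions. It builds a Gram matrix from a square-exponential kernel on an arbitrary grid (an assumption chosen for illustration), checks that all eigenvalues are positive, and reconstructs the matrix from its spectral decomposition.

```python
# Sketch (not from the lecture): eigendecomposition of a kernel Gram matrix.
# The Gaussian kernel on distinct inputs yields a strictly positive definite
# matrix, so all eigenvalues are positive and A = sum_a lambda_a v_a v_a^T.
import numpy as np

def k(a, b, ell=1.0):
    """Square-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2))."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

x = np.linspace(-3, 3, 8)                 # 8 distinct input locations
A = k(x, x)                               # symmetric Gram matrix

lam, V = np.linalg.eigh(A)                # eigenvalues lam[a], eigenvectors V[:, a]
print("smallest eigenvalue:", lam.min())  # > 0: positive definite

A_rec = (V * lam) @ V.T                   # spectral reconstruction sum_a lam_a v_a v_a^T
assert np.allclose(A, A_rec)
```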
Kernels are Inner Products
Mercer’s Theorem [image: The Royal Society]
Definition (Eigenfunction)
A function ϕ : X → R and a scalar λ ∈ C are called eigenfunction and eigenvalue of k with respect to the measure ν if

\int k(x, \tilde{x})\, \phi(\tilde{x})\, d\nu(\tilde{x}) = \lambda\, \phi(x) .

Mercer's theorem (simplified): for a continuous positive definite kernel k on a compact X with finite measure ν, there is a system of eigenfunctions (ϕ_i)_{i∈I}, orthonormal in L_2(ν), and non-negative eigenvalues (λ_i)_{i∈I} such that

k(x, \tilde{x}) = \sum_{i \in I} \lambda_i\, \phi_i(x)\, \phi_i(\tilde{x}),

with the sum converging absolutely and uniformly.
Are Kernels Infinitely Large Positive Definite Matrices?
Kind of …
k(a, b) = \sum_{i \in I} \lambda_i\, \phi_i(a)\, \phi_i(b) =: \Phi(a)\, \Sigma\, \Phi(b)^\top \qquad \forall\, a, b \in X.

▶ In the sense of Mercer's theorem, one may think vaguely of a kernel k : X × X → R evaluated at k(a, b) for a, b ∈ X as the "element" k_{ab} of an "infinitely large" matrix.
▶ However, this interpretation is only relative to the measure ν : X → R.
▶ In general, it is not straightforward to find the eigenfunctions analytically (a numerical sketch follows below).
▶ The better question is: Why do you want to think about infinite matrices?
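Not from the slides: a small Nyström-style sketch (grid, measure, and kernel are assumptions chosen for illustration) that makes the "infinitely large matrix" picture concrete. It discretizes the integral operator with respect to the uniform measure on [−3, 3], reads off approximate eigenvalues λ_i and eigenfunctions ϕ_i from the Gram matrix, and checks that a truncated expansion Σ_i λ_i ϕ_i(a) ϕ_i(b) reproduces k(a, b).

```python
# Sketch (not from the lecture): a Nystroem discretization of Mercer's theorem.
# With respect to the uniform measure on [-3, 3], approximate eigenpairs of the
# integral operator from a Gram matrix on a fine grid, then check that a
# truncated expansion sum_i lam_i phi_i(a) phi_i(b) already reproduces k(a, b).
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

m = 200
z = np.linspace(-3, 3, m)
dz = z[1] - z[0]                          # quadrature weight of the uniform measure

K = k(z, z)
mu, U = np.linalg.eigh(K)                 # ascending eigenvalues of the Gram matrix
lam = mu[::-1] * dz                       # operator eigenvalues: lambda_i ~ mu_i * dz
Phi = U[:, ::-1] / np.sqrt(dz)            # eigenfunctions, orthonormal in L2(nu)

r = 15                                    # truncation level: a handful of features
K_r = (Phi[:, :r] * lam[:r]) @ Phi[:, :r].T
print(f"max |k(a,b) - rank-{r} expansion| on the grid: {np.abs(K - K_r).max():.2e}")
```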
Bochner’s Theorem
Here’s why operators are tricky [image: Rice University, 1970, CC-BY 3.0]
Theorem (Bochner, simplified). A continuous stationary kernel k(x, x̃) = k(x − x̃) on R^d is positive definite if and only if it is the Fourier transform of a finite non-negative measure, its power spectrum.
Gaussian processes, by any other name
one of the most deeply studied models in history
The Gaussian Posterior Mean is a Least-Squares estimate
nonparametric formulation, at explicit locations
The posterior mean estimator of Gaussian (process) regression is equal to the regularized least-squares estimate with the regularizer ∥f∥²_k. This is also known as the kernel ridge estimate.
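A sketch (not from the lecture) of this equivalence using scikit-learn, under the assumption that the same RBF kernel, noise level σ², and regularization weight are used on both sides; the dataset is synthetic.

```python
# Sketch (not from the lecture): with the same kernel and regularization, the GP
# posterior mean and the kernel ridge estimate coincide (synthetic data assumed).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-4, 4, 20)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
sigma2, ell = 0.1**2, 1.0

# GP regression with fixed hyperparameters (optimizer=None); alpha is the noise variance
gp = GaussianProcessRegressor(kernel=RBF(length_scale=ell), alpha=sigma2,
                              optimizer=None).fit(X, y)
# kernel ridge regression with the same RBF kernel and regularizer alpha = sigma^2
kr = KernelRidge(alpha=sigma2, kernel="rbf", gamma=1 / (2 * ell**2)).fit(X, y)

Xs = np.linspace(-4, 4, 200).reshape(-1, 1)
print(np.max(np.abs(gp.predict(Xs) - kr.predict(Xs))))  # ~0 up to numerical round-off
```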
200 years of data analysis
and counting [portrait: Julien-Léopold Boilly, 1820 (all other portraits show a different Legendre!)]
Adrien-Marie Legendre (1752–1833)
Nouvelles méthodes pour la détermination des orbites des comètes, 1805
200 years of data analysis
and counting
Carl-Friedrich Gauss (1777–1855)
Theorie der Bewegung der Himmelskörper welche in Kegelschnitten die Sonne umlaufen, 1877
What about all those kernel concepts?
What’s the relationship between GPs and kernel ridge regression? [slides: Ulrike v. Luxburg, 2020]
Reproducing Kernel Hilbert Spaces
Two definitions [Schölkopf & Smola, 2002 / Rasmussen & Williams, 2006]
Definition (reproducing property): A Hilbert space H_k of functions f : X → R is the reproducing kernel Hilbert space of k if k(·, x) ∈ H_k for all x ∈ X and ⟨f, k(·, x)⟩_{H_k} = f(x) for all f ∈ H_k (the reproducing property).
Definition (as a completion): Equivalently, H_k is the completion of span{k(·, x) : x ∈ X} under the inner product defined by ⟨k(·, x), k(·, x̃)⟩_{H_k} := k(x, x̃).
Theorem [Aronszajn, 1950]: For every pos. def. k on X, there exists a unique RKHS.
What is the RKHS? (1)
The RKHS is the space of possible posterior mean functions [e.g. Rasmussen & Williams, 2006, Eq. 6.5]
\mu(x) = k_{xX} \underbrace{(k_{XX} + \sigma^2 I)^{-1} y}_{=:\, w} = \sum_{i=1}^{n} w_i\, k(x, x_i) \qquad \text{for } n \in \mathbb{N}.
To understand what a GP can learn we have to analyze the RKHS
the connection to the statistical learning theory of RKHSs
[Figure: a GP regression example; f(x) plotted over x ∈ [−8, 8], with values roughly within [−10, 10].]
What is the meaning of the GP point estimate?
The posterior mean is the least-squares estimate in the RKHS
L(f) = \frac{1}{\sigma^2} \sum_{i} \big( f(x_i) - y_i \big)^2 + \|f\|_{H_k}^2 .
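Not written out on the slide: the short computation showing why this loss is minimized by the GP posterior mean. It assumes the minimizer lies in the span of the representers k(·, x_i) (the representer theorem) and, for brevity, that K_XX is invertible.

```latex
% Restrict f to the span of the representers: f = \sum_j w_j k(\cdot, x_j),
% so that f(x_i) = [K_{XX} w]_i and \|f\|_{H_k}^2 = w^\top K_{XX} w.
\begin{align*}
L(w) &= \tfrac{1}{\sigma^2}\,\lVert K_{XX} w - y \rVert^2 + w^\top K_{XX} w, \\
\nabla_w L &= \tfrac{2}{\sigma^2}\, K_{XX} (K_{XX} w - y) + 2 K_{XX} w \overset{!}{=} 0
  \;\;\Longrightarrow\;\; (K_{XX} + \sigma^2 I)\, w = y
  \quad (\text{if } K_{XX} \text{ is invertible}).
\end{align*}
% Hence the minimizer is exactly the GP posterior mean:
\[
  f(x) = \sum_i w_i\, k(x, x_i) = k_{xX} (K_{XX} + \sigma^2 I)^{-1} y = \mu(x).
\]
```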
What is the meaning of uncertainty?
Frequentist interpretation of the posterior variance
How far could the posterior mean be from the truth, assuming noise-free observations?
\sup_{f \in H,\, \|f\| \le 1} \big( m(x) - f(x) \big)^2
    = \sup_{f \in H,\, \|f\| \le 1} \Big( \sum_i f(x_i)\, \underbrace{[K_{XX}^{-1} k(X, x)]_i}_{w_i} - f(x) \Big)^2

reproducing property:      = \sup_{f \in H,\, \|f\| \le 1} \Big\langle \sum_i w_i\, k(\cdot, x_i) - k(\cdot, x),\; f(\cdot) \Big\rangle_H^2

Cauchy-Schwarz (|\langle a, b \rangle| \le \|a\| \cdot \|b\|):      = \Big\| \sum_i w_i\, k(\cdot, x_i) - k(\cdot, x) \Big\|_H^2

reproducing property:      = \sum_{ij} w_i w_j\, k(x_i, x_j) - 2 \sum_i w_i\, k(x, x_i) + k(x, x)

inserting w = K_{XX}^{-1} k(X, x):      = k(x, x) - k(x, X)\, K_{XX}^{-1}\, k(X, x) = v(x), \text{ the (noise-free) posterior variance.}
Bayesians expect the worst
it’s not always true that ”Frequentists are pessimists”
Theorem
Assume p(f) = GP(f; 0, k) and noise-free observations p(y | f) = δ(y − f_X). The GP posterior variance (the expected square error) is a worst-case bound on the divergence between m(x) and an RKHS element of bounded norm:

v(x) = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx} = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \big( m(x) - f(x) \big)^2 .

The GP's expected square error is the RKHS's worst-case square error for bounded norm.
Nb: v(x) is not, in general, itself an element of H_k.
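Not from the slides: a numerical check of this worst-case reading, under assumptions chosen for illustration (a square-exponential kernel, a handful of observation locations, and the unit-norm test function f = k(·, c)/√k(c, c)).

```python
# Sketch (not from the lecture): for a unit-norm RKHS element f (here the
# assumed test function f = k(., c) / sqrt(k(c, c))), the noise-free posterior
# mean built from f's values at X deviates from f by at most sqrt(v(x)).
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

X = np.array([-3.0, -1.5, 0.0, 1.0, 2.5])       # observation locations
Xs = np.linspace(-6, 6, 400)                    # test locations
Kinv = np.linalg.inv(k(X, X))

c = np.array([0.7])                             # centre of the unit-norm test function
def f(x):
    return k(x, c)[:, 0] / np.sqrt(k(c, c)[0, 0])   # ||f||_k = 1

m = k(Xs, X) @ Kinv @ f(X)                      # noise-free posterior mean given f(X)
v = k(Xs, Xs).diagonal() - np.einsum("ij,jk,ik->i", k(Xs, X), Kinv, k(Xs, X))

ratio = (m - f(Xs)) ** 2 / np.maximum(v, 1e-12) # clamp where v is numerically ~0
print("max (m(x) - f(x))^2 / v(x):", ratio.max())   # <= 1: the worst-case bound holds
```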
What is the RKHS? (2)
Representation in terms of eigenfunctions [I. Steinwart and A. Christmann. Support Vector Machines, 2008, Thm. 4.51]
Theorem (simplified): In the setting of Mercer's theorem (compact X, eigenpairs (λ_i, ϕ_i)_{i∈I}), the RKHS of k is

H_k = \Big\{ f = \sum_{i \in I} \alpha_i\, \lambda_i^{1/2} \phi_i \;:\; \sum_{i \in I} \alpha_i^2 < \infty \Big\},
\qquad
\langle f, g \rangle_{H_k} = \sum_{i \in I} \alpha_i \beta_i
\quad \text{for } f = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i \text{ and } g = \sum_{i \in I} \beta_i \lambda_i^{1/2} \phi_i .

A compact space, simplified, is a space that is both bounded (all points have finite distance from each other) and closed (it contains all limits). For topological spaces, this is more generally defined by every open cover (every union C of open sets covering all of X) having a finite subcover (i.e. a finite subset of C that also covers X).

Simplified proof: First, show that this space matches the RKHS definition:
1. ∀x ∈ X: k(·, x) = \sum_{i \in I} \lambda_i^{1/2} \phi_i(\cdot) \cdot \underbrace{\lambda_i^{1/2} \phi_i(x)}_{\alpha_i}, and ∥k(·, x)∥² = \sum_i \lambda_i \phi_i(x)^2 = k(x, x) < ∞
2. ⟨f(·), k(·, x)⟩ = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x) = f(x). Then use Aronszajn's uniqueness result. □
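Not from the slides: the two proof steps checked numerically with Nyström approximations of the eigenpairs (grid, kernel, and truncation level r are assumptions for illustration).

```python
# Sketch (not from the lecture): the two steps of the proof, checked numerically
# with Nystroem approximations of the eigenpairs.
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

m, r = 300, 12
z = np.linspace(-3, 3, m)
dz = z[1] - z[0]
mu, U = np.linalg.eigh(k(z, z))
lam = mu[::-1][:r] * dz                     # leading operator eigenvalues lambda_i
Phi = U[:, ::-1][:, :r] / np.sqrt(dz)       # eigenfunctions phi_i on the grid

j = m // 3                                  # evaluation point x = z[j]
beta = np.sqrt(lam) * Phi[j]                # coefficients of k(., x) in the basis

# step 1: ||k(., x)||^2 = sum_i lam_i phi_i(x)^2 = k(x, x)
print("||k(.,x)||^2 =", beta @ beta, " vs  k(x,x) =", k(z[j:j+1], z[j:j+1])[0, 0])

# step 2: <f, k(., x)> = sum_i alpha_i beta_i = f(x), for f = sum_i alpha_i lam_i^(1/2) phi_i
alpha = np.random.default_rng(1).standard_normal(r) / np.arange(1, r + 1)
f_grid = Phi @ (np.sqrt(lam) * alpha)
print("<f, k(.,x)> =", alpha @ beta, " vs  f(x) =", f_grid[j])
```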
What about the samples?
Draws from a Gaussian process [for non-simplified version, cf. Kanagawa et al., 2018 (op.cit.), Thms. 4.3 and 4.9]
Theorem (Kanagawa, 2018. Restricted from Steinwart, 2017, itself generalized from Driscoll, 1973)
Let H_k be an RKHS and 0 < θ ≤ 1. Consider the θ-power of H_k given by

H_k^\theta = \Big\{ f(x) := \sum_{i \in I} \alpha_i\, \lambda_i^{\theta/2} \phi_i(x) \;\text{ such that }\; \|f\|_{H_k^\theta}^2 := \sum_{i \in I} \alpha_i^2 < \infty \Big\},
\qquad \text{with } \langle f, g \rangle_{H_k^\theta} := \sum_{i \in I} \alpha_i \beta_i .

Then,

\sum_{i \in I} \lambda_i^{1-\theta} < \infty \quad \Rightarrow \quad f \sim GP(0, k) \in H_k^\theta \text{ with prob. } 1.
Non-representative Example: Let k_λ(a, b) = exp(−(a − b)²/(2λ²)). Then f ∼ GP(0, k_λ) is in H_{k_λ}^θ with prob. 1 for all 0 < θ < 1. The situation is more complicated for other kernels.
GP samples are not in the RKHS. They belong to a kind of "completion" of the RKHS (but that completion can be strictly larger than the RKHS).
▶ GP and Kernel Methods are very closely related
▶ the RKHS is the space of all possible posterior mean functions
▶ the posterior mean is the ℓ2 -least-squares estimate in the RKHS
▶ the posterior variance (expected square error) is the worst-case square error over RKHS elements of bounded norm
▶ GP samples are not in the RKHS
If GPs / kernel machines use infinitely many features, can they learn every function?
How powerful are kernel/GP models?
first, the hope [Micchelli, Xu, Zhang, JMLR 7 (2006) 2651–2667]
▶ For some kernels, the RKHS "lies dense" in the space of all continuous functions (such kernels are known as "universal"). An example is the square-exponential / Gaussian / RBF kernel (in fact, there are many universal kernels, e.g. all stationary kernels whose power spectrum has full support).
▶ When using such kernels for GP / kernel-ridge regression, for any continuous function f and any ϵ > 0 there is an RKHS element f̂ ∈ H_k such that ∥f − f̂∥ < ϵ (where ∥ · ∥ is the maximum norm on a compact subset of X).
▶ That is: given enough data, the GP posterior mean can approximate any continuous function arbitrarily well! (A numerical sketch follows below.)
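Not from the slides: a sketch of what universality looks like in practice. It fits kernel ridge regression with an RBF kernel (length scale and ridge chosen arbitrarily) to noise-free samples of the continuous but non-smooth target |x| and prints the sup-error on [−1, 1]; universality says the RKHS can get arbitrarily close, while how quickly the fitted estimate gets there is exactly the question raised by the next slides.

```python
# Sketch (not from the lecture): kernel ridge regression with a universal (RBF)
# kernel on noise-free samples of the continuous target |x|; the sup-error on
# the compact set [-1, 1] is printed for growing n.
import numpy as np

def k(a, b, ell=0.3):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

target = np.abs                        # continuous, but not an element of the RBF RKHS
Xs = np.linspace(-1, 1, 1001)          # compact set on which the sup-error is measured

for n in [5, 10, 20, 50, 100]:
    X = np.linspace(-1, 1, n)
    w = np.linalg.solve(k(X, X) + 1e-8 * np.eye(n), target(X))   # kernel ridge weights
    err = np.max(np.abs(k(Xs, X) @ w - target(Xs)))
    print(f"n = {n:3d}   sup-error on [-1, 1] = {err:.4f}")
```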
The bad news
if f is not in the RKHS
[Figure sequence: GP regression on a target f outside the RKHS; f(x) plotted over x ∈ [−8, 8] (values roughly within [−5, 5]), shown for the prior and after 1, 2, 5, 10, 20, 50, 100, and 500 evaluations.]
Convergence Rates are Important
non-obvious aspects of f can ruin convergence [v. d. Vaart & v. Zanten, Information Rates of Nonparametric GP Models, JMLR 12 (2011)]
[Figure: the squared error ‖f − m‖² (log scale, roughly 10⁰ down to 10⁻²) as a function of the number of datapoints.]
If f is "not well covered" by the RKHS, the number of datapoints required to achieve error ϵ can be exponential in 1/ϵ. Outside of the observation range, there are no guarantees at all.
An Analogy
representing π in Q
▶ Q is dense in R, yet different rational series for π converge at very different rates:

\pi = 3\cdot\tfrac{1}{1} + 1\cdot\tfrac{1}{10} + 4\cdot\tfrac{1}{100} + 1\cdot\tfrac{1}{1000} + \dots \quad \text{(decimal)}
    = 4\cdot\tfrac{1}{1} - 4\cdot\tfrac{1}{3} + 4\cdot\tfrac{1}{5} - 4\cdot\tfrac{1}{7} + \dots \quad \text{(Gregory-Leibniz)}
    = 3\cdot\tfrac{1}{1} + 4\cdot\tfrac{1}{2\cdot 3\cdot 4} - 4\cdot\tfrac{1}{4\cdot 5\cdot 6} + 4\cdot\tfrac{1}{6\cdot 7\cdot 8} - \dots \quad \text{(Nilakantha)}

[Figure: log₁₀ error of the partial sums against the number of 'datapoints' (terms, 0 to 20) for the decimal, Nilakantha, Gregory-Leibniz, and Chudnovsky representations; the errors range from about 10⁰ down to below 10⁻¹⁵.]
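Not from the slides: a few lines of Python computing partial sums of two of the series above, as a feel for how different the convergence rates are even though all of them work eventually.

```python
# Sketch (not from the lecture): partial sums of the Gregory-Leibniz and
# Nilakantha series for pi, and their errors after a few terms.
import numpy as np

N = 20
n = np.arange(N)

gl = np.cumsum(4 * (-1.0) ** n / (2 * n + 1))                                # Gregory-Leibniz
nk = 3 + np.cumsum(4 * (-1.0) ** n / ((2*n + 2) * (2*n + 3) * (2*n + 4)))    # Nilakantha

for i in [0, 4, 9, 19]:
    print(f"{i + 1:2d} terms:  Gregory-Leibniz error {abs(gl[i] - np.pi):.1e},  "
          f"Nilakantha error {abs(nk[i] - np.pi):.1e}")
```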
But if you’re patient, you can learn anything!
The good news. [wording from Kanagawa et al., 2018]
where E_{X,Y|f_0} denotes expectation with respect to D_n = (x_i, y_i)_{i=1}^n with the model x_i ∼ P_X and p(y | f_0) = N(y; f_0(X), σ² I), and Π_n(f | D_n) the posterior given by GP regression with kernel k_s.
The Sobolev space W_2^s(X) is the vector space of real-valued functions over X whose derivatives up to s-th order have bounded L_2 norm. L_2(P_X) is the Hilbert space of square-integrable functions with respect to P_X.
If f0 is from a sufficiently smooth space, and Hk is “covering” that space well, then the entire GP posterior
(including the mean!) can contract around the true function at a linear rate.
GPs are “infinitely flexible”: They can learn infinite-dimensional functions arbitrarily well!
▶ Gaussian process regression is closely related to kernel ridge regression.
▶ the posterior mean is the kernel ridge / regularized kernel least-squares estimate in the RKHS Hk .
▶ the posterior variance (expected square error) is the worst-case square error for bounded-norm RKHS elements.
v(x) = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx} = \max_{f \in H_k,\, \|f\|_{H_k} \le 1} \big( f(x) - m(x) \big)^2