
Probabilistic Inference and Learning

Lecture 10
Understanding Kernels

Philipp Hennig
18 May 2020

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                        Ex |  #  date    content                       Ex
 1  20.04.  Introduction                    1 | 14  09.06.  Logistic Regression            8
 2  21.04.  Reasoning under Uncertainty       | 15  15.06.  Exponential Families
 3  27.04.  Continuous Variables            2 | 16  16.06.  Graphical Models               9
 4  28.04.  Monte Carlo                       | 17  22.06.  Factor Graphs
 5  04.05.  Markov Chain Monte Carlo        3 | 18  23.06.  The Sum-Product Algorithm     10
 6  05.05.  Gaussian Distributions            | 19  29.06.  Example: Topic Models
 7  11.05.  Parametric Regression           4 | 20  30.06.  Mixture Models                11
 8  12.05.  Learning Representations          | 21  06.07.  EM
 9  18.05.  Gaussian Processes              5 | 22  07.07.  Variational Inference         12
10  19.05.  Understanding Kernels             | 23  13.07.  Example: Topic Models
11  25.05.  An Example for GP Regression    6 | 24  14.07.  Example: Inferring Topics     13
12  26.05.  Gauss-Markov Models               | 25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification               7 | 26  21.07.  Revision

What we’ve seen:
▶ Inference in models involving linear relationships between Gaussian random variables only
requires linear algebra operations
▶ features can be used to learn nonlinear (real-valued) functions on various domains
▶ feature representations can be learned using type-II maximum likelihood
▶ Gaussian process models allow utilizing infinitely many features in finite time

Some questions you may have:


▶ What are kernels? Can I think of them as “infinitely large matrices”?
▶ I’ve heard of kernel machines. What’s the connection to GPs?
▶ If GPs / kernel machines use infinitely many features, can they learn every function?

Warning
Results shown here are often simplified.
Some regularity assumptions have been dropped for easier readability.
If you don’t like math, wait for the next lecture.

For deeper introductions, check out


M. Kanagawa, P. Hennig, D. Sejdinovic, and B.K. Sriperumbudur
Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
https://arxiv.org/abs/1807.02582
(still in review)

and

I. Steinwart, A. Christmann
Support Vector Machines
Springer SBM, 2008

What are kernels? Can I think of them as “infinitely large matrices”?

Quick Linear-Algebra Refresher
positive definite matrices

Definition (Eigenvalue)
Let A ∈ R^{n×n} be a matrix. A scalar λ ∈ C and vector v ∈ C^n are called eigenvalue and corresponding
eigenvector if

[Av]_i = Σ_{j=1}^n [A]_{ij} [v]_j = λ [v]_i .

Theorem (spectral theorem for symmetric positive-definite matrices)


The eigenvalues of a symmetric matrix A = A⊺ are real, and its eigenvectors form a basis of the image of A. A
symmetric positive definite matrix A can be written as a Gramian (outer product) of the eigenvectors:

[A]_{ij} = Σ_{a=1}^n λa [va]_i [va]_j   and   λa > 0 ∀ a = 1, . . . , n.

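As a quick numerical sanity check of this decomposition (an illustrative sketch of my own, not part of the lecture; the matrix and all names are arbitrary), one can eigendecompose a symmetric positive definite matrix with NumPy and reassemble it from the outer products of its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 1e-3 * np.eye(4)      # symmetric positive definite by construction

lam, V = np.linalg.eigh(A)          # real eigenvalues, orthonormal eigenvectors (columns of V)
assert np.all(lam > 0)              # positive definiteness <=> all eigenvalues positive

# reassemble A from rank-1 outer products:  A = sum_a lam_a v_a v_a^T
A_rebuilt = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))
assert np.allclose(A, A_rebuilt)
```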
Kernels are Inner Products
Mercer’s Theorem image: The Royal Society

Definition (Eigenfunction)
A function ϕ : X → R and a scalar λ ∈ C that obey

∫ k(x, x̃) ϕ(x̃) dν(x̃) = λ ϕ(x)

are called an eigenfunction and eigenvalue of k with respect to ν.

Theorem (Mercer, 1909)


Let (X, ν) be a finite measure space and k : X × X → R a continuous (Mercer)
kernel. Then there exist eigenvalues/functions (λi, ϕi)_{i∈I} w.r.t. ν such that I is
countable, all λi are real and non-negative, the eigenfunctions can be made
orthonormal, and the following series converges absolutely and uniformly
ν²-almost-everywhere:

k(a, b) = Σ_{i∈I} λi ϕi(a) ϕi(b)   ∀ a, b ∈ X.

James Mercer (1883–1932)

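A hedged, finite illustration of Mercer's theorem (my own sketch, not from the slides; the kernel, sample points, and names are arbitrary): take ν to be the empirical measure on n sample points, read off approximate eigenpairs from the eigendecomposition of the Gram matrix, and check that the series reproduces the kernel at those points.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # squared-exponential kernel on 1-d inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 50))
n = len(x)
K = rbf(x, x)

# eigenpairs of k w.r.t. the empirical measure nu = (1/n) sum_j delta_{x_j}:
# if K = U diag(mu) U^T, then lambda_i = mu_i / n and phi_i(x_j) = sqrt(n) * U[j, i]
mu, U = np.linalg.eigh(K)
lam = mu / n
phi = np.sqrt(n) * U

# orthonormality w.r.t. nu:  (1/n) sum_j phi_i(x_j) phi_l(x_j) = delta_il
assert np.allclose(phi.T @ phi / n, np.eye(n))

# the Mercer series recovers the kernel at the sample points:
# k(x_a, x_b) = sum_i lambda_i phi_i(x_a) phi_i(x_b)
assert np.allclose((phi * lam) @ phi.T, K)
```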
Are Kernels Infinitely Large Positive Definite Matrices?
Kind of …

k(a, b) = Σ_{i∈I} λi ϕi(a) ϕi(b) =: Φ(a)ΣΦ(b)⊺   ∀ a, b ∈ X.

▶ In the sense of Mercer’s theorem, one may think vaguely of a kernel k : X × X → R evaluated at
k(a, b) for a, b ∈ X as the “element” kab of an “infinitely large” matrix.
▶ However, this interpretation is only relative to the measure ν on X.
▶ In general, it is not straightforward to find the eigenfunctions.
▶ The better question is: Why do you want to think about infinite matrices?

▶ What are the eigenfunctions?
▶ Do the eigenfunctions span a space, like the eigenvectors of a matrix?
▶ What’s that space? Is it the sample space of a GP?

Bochner’s Theorem
Here’s why operators are tricky image: Rice University, 1970, CC-BY 3.0

A kernel k(a, b) is called stationary if it can be written as

k(a, b) = k(τ ) with τ := a − b

Theorem (Bochner’s theorem (simplified))


A complex-valued function k on R^D is the covariance function of a weakly
stationary mean square continuous complex-valued random process on
R^D if, and only if, its Fourier transform is a probability (i.e. finite positive)
measure µ:

k(τ) = ∫_{R^D} e^{2πi s⊺τ} dµ(s) = ∫_{R^D} e^{2πi s⊺a} (e^{2πi s⊺b})^* dµ(s)

Note, though: Mercer’s theorem described a countable representation!


One way to use such insights: linear-time approximations to Gaussian
process regression (Rahimi & Recht, NeurIPS 2008).

Salomon Bochner (1899–1982)
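As a concrete instance (a minimal sketch of the random-feature construction of Rahimi & Recht, under my own choice of kernel, lengthscale, and names): for the RBF kernel, the measure µ in Bochner's theorem is a Gaussian, so sampling frequencies from it yields an explicit finite feature map whose inner product approximates the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ell = 2, 2000, 1.0                       # input dimension, number of features, lengthscale

# k(a, b) = exp(-||a - b||^2 / (2 ell^2)) has Gaussian spectral measure: w ~ N(0, I / ell^2)
W = rng.normal(scale=1.0 / ell, size=(m, d))
b = rng.uniform(0, 2 * np.pi, size=m)

def z(x):
    # random Fourier feature map with E[z(a) . z(b)] = k(a, b)
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

a, c = rng.normal(size=d), rng.normal(size=d)
k_exact = np.exp(-np.sum((a - c)**2) / (2 * ell**2))
print(k_exact, z(a) @ z(c))                    # close for large m
```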
What are kernels? Can I think of them as “infinitely large matrices”?

▶ kernels have eigenfunctions, like matrices have eigenvectors


▶ eigenfunctions, though, are only defined relative to a base measure
▶ Mercer’s theorem states that the eigenfunctions ”generate” the kernel
▶ but finding the eigenfunctions can be tricky

I’ve heard of kernel machines. What’s the connection to GPs?

Gaussian processes, by any other name
one of the most deeply studied models in history

Equivalent and closely related names for Gaussian process regression


▶ Kriging (in particular in the geosciences)
▶ kernel ridge regression
▶ Wiener–Kolmogorov prediction
▶ linear least-squares regression

The Gaussian Posterior Mean is a Least-Squares estimate
nonparametric formulation, at explicit locations

p(fx | y) = p(y | fX) p(f) / p(y) = N(y; fX, σ²I) GP(fx,X; m, k) / N(y; mX, kXX + σ²I)
          = GP(fx; mx + kxX (kXX + σ²I)⁻¹ (y − mX), kxx − kxX (kXX + σ²I)⁻¹ kXx)

E_{p(fX|y)}(fX) = arg max_{fX ∈ R^|X|} p(fX | y)
               = arg min_{fX} −p(fX | y) = arg min_{fX} − log p(fX | y)
               = arg min_{fX} 1/(2σ²) ∥y − fX∥² + 1/2 ∥fX − mX∥²_k ,   where ∥fX∥²_k := fX⊺ kXX⁻¹ fX

The posterior mean estimator of Gaussian (process) regression is equal to the regularized least-squares
estimate with the regularizer ∥f∥2k . This is also known as the kernel ridge estimate.

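The equivalence is easy to verify numerically. The sketch below (my own illustration; kernel, data, and variable names are arbitrary) compares the closed-form GP posterior mean at the training inputs to the minimizer of the regularized least-squares objective, obtained here with a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 10)
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
sigma2 = 0.1**2

K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 0.5**2)   # RBF kernel, zero prior mean

# (a) GP posterior mean at the training inputs:  K (K + sigma^2 I)^{-1} y
gp_mean = K @ np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

# (b) direct minimization of  L(f_X) = 1/(2 sigma^2) ||y - f_X||^2 + 1/2 f_X^T K^{-1} f_X
K_inv = np.linalg.inv(K + 1e-9 * np.eye(len(X)))           # small jitter for numerical stability
loss = lambda f: 0.5 / sigma2 * np.sum((y - f)**2) + 0.5 * f @ K_inv @ f
ridge = minimize(loss, x0=np.zeros(len(X))).x

print(np.max(np.abs(gp_mean - ridge)))                      # agree up to optimizer tolerance
```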
200 years of data analysis
and counting portrait: Julien-Léopold Boilly, 1820 (all other portraits show a different Legendre!)

Adrien-Marie Legendre (1752–1833)
Nouvelles méthodes pour la détermination des orbites des comètes, 1805

200 years of data analysis
and counting

Carl-Friedrich Gauss (1777–1855)
Theorie der Bewegung der Himmelskörper welche in Kegelschnitten die Sonne umlaufen, 1877
What about all those kernel concepts?
What’s the relationship between GPs and kernel ridge regression? slides: Ulrike v. Luxburg, 2020

Reproducing Kernel Hilbert Spaces
Two definitions [Schölkopf & Smola, 2002 / Rasmussen & Williams, 2006]

Definition (Reproducing kernel Hilbert space (RKHS))


Let H be a Hilbert space of functions f : X → R with inner product ⟨·, ·⟩. Then H is called a reproducing kernel
Hilbert space if there exists a kernel k : X × X → R s.t.
1. ∀x ∈ X : k(·, x) ∈ H
2. ∀f ∈ H : ⟨f(·), k(·, x)⟩H = f(x)   (“k reproduces H”)

Theorem [Aronszajn, 1950]: For every pos.def. k on X, there exists a unique RKHS.

What is the RKHS? (1)
The RKHS is the space of possible posterior mean functions [e.g. Rasmussen & Williams, 2006, Eq. 6.5]

Theorem (Reproducing kernel map representation)


Let X, ν, (ϕi, λi)_{i∈I} be defined as before. Let (xi)_{i∈I} ⊂ X be a countable collection of points in X. Then
the RKHS can also be written as the space of linear combinations of kernel functions:

Hk = { f(x) := Σ_{i∈I} α̃i k(xi, x) }   with   ⟨f, g⟩_{Hk} := Σ_{i∈I} α̃i β̃i / k(xi, xi)

Proof: cf. Prof. v. Luxburg’s lecture


Consider the Gaussian process p(f) = GP(0, k) with likelihood p(y | f, X) = N(y; fX, σ²I). The RKHS
is the space of all possible posterior mean functions

µ(x) = kxX (kXX + σ²I)⁻¹ y = Σ_{i=1}^n wi k(x, xi)   with w := (kXX + σ²I)⁻¹ y,   for n ∈ N.

To understand what a GP can learn we have to analyze the RKHS
the connection to the statistical learning theory of RKHSs

[Figure: functions f(x) plotted over x ∈ [−8, 8], with values in [−10, 10].]
What is the meaning of the GP point estimate?
The posterior mean is the least-squares estimate in the RKHS

Theorem (The Kernel Ridge Estimate)


Consider the model p(f) = GP(f; 0, k), p(y | f) = N(y; fX, σ²I). The posterior mean

m(x) = kxX (kXX + σ²I)⁻¹ y

is the element of the RKHS Hk that minimizes the regularised ℓ2 loss

L(f) = 1/σ² Σ_i (f(xi) − yi)² + ∥f∥²_{Hk} .

Proof: cf. Prof. v. Luxburg’s lecture

What is the meaning of uncertainty?
Frequentist interpretation of the posterior variance

How far could the posterior mean be from the truth, assuming noise-free observations?
 2
X 
2 −1
sup (m(x) − f(x)) = sup  f(xi ) [KXX k(X, x)]i −f(x)
f∈H,∥f∥≤1 f∈H,∥f∥≤1 i
| {z }
wi
* +2
X
reproducing property: = sup wi k(·, xi ) − k(·, x), f(·)
i H
2
X
Cauchy-Schwartz: (|⟨a, b⟩| ≤ ∥a∥ · ∥b∥) = wi k(·, xi ) − k(·, x)
i
X XH
reproducing property: = wi wj k(xi , xj ) − 2 wi k(x, xi ) + k(x, x)
ij i

= kxx − kxX K−1


XX kXx = E|y [(fx − µx )2 ]

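The worst-case interpretation can also be checked empirically (my own sketch with an RBF kernel; all names and constants are arbitrary): build a random RKHS element of unit norm, interpolate its noise-free values, and confirm that the squared deviation from the posterior mean never exceeds the posterior variance.

```python
import numpy as np

def k(a, b, ell=1.0):
    # RBF kernel on 1-d inputs; k(x, x) = 1
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 8)                       # training inputs (noise-free observations)
xs = np.linspace(-5, 5, 200)                    # test inputs

# a random RKHS element f = sum_j alpha_j k(., c_j), rescaled to unit RKHS norm
C = rng.uniform(-5, 5, 12)
alpha = rng.normal(size=12)
alpha /= np.sqrt(alpha @ k(C, C) @ alpha)       # ||f||_H^2 = alpha^T K_CC alpha
f = lambda t: k(t, C) @ alpha

# GP posterior mean and variance with the same kernel (tiny jitter in place of noise)
KXX = k(X, X) + 1e-10 * np.eye(len(X))
m = k(xs, X) @ np.linalg.solve(KXX, f(X))
v = 1.0 - np.einsum('ij,ji->i', k(xs, X), np.linalg.solve(KXX, k(X, xs)))

# worst-case bound: (m(x) - f(x))^2 <= v(x) for every unit-norm RKHS element
assert np.all((m - f(xs))**2 <= v + 1e-6)
```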
Bayesians expect the worst
it’s not always true that ”Frequentists are pessimists”

Theorem
Assume p(f) = GP(f; 0, k) and noise-free observations p(y | f) = δ(y − fX ). The GP posterior variance
(the expected square error)

v(x) := E_{p(f|y)} [(f(x) − m(x))²] = kxx − kxX K⁻¹_XX kXx

is a worst-case bound on the divergence between m(x) and an RKHS element of bounded norm:

v(x) = sup_{f∈Hk, ∥f∥≤1} (m(x) − f(x))²

The GP’s expected square error is the RKHS’s worst-case square error for bounded norm.
Nb: v(x) is not, in general, itself an element of Hk .

What is the RKHS? (2)
Representation in terms of eigenfunctions [I. Steinwart and A. Christmann. Support Vector Machines, 2008, Thm. 4.51]

Theorem (Mercer Representation)


Let X be a compact metric space, k be a continuous kernel on X, ν be a finite Borel measure whose
support is X. Let (ϕi , λi )i∈I be the eigenfunctions and values of k w.r.t. ν. Then the RKHS Hk is given by
Hk = { f(x) := Σ_{i∈I} αi √λi ϕi(x)   such that   ∥f∥²_{Hk} := Σ_{i∈I} αi² < ∞ }   with   ⟨f, g⟩_{Hk} := Σ_{i∈I} αi βi

for f = Σ_{i∈I} αi √λi ϕi and g = Σ_{i∈I} βi √λi ϕi.
A compact space, simplified, is a space that is both bounded (all points have finite distance from each other) and closed (it contains all limits). For topological spaces, this is more generally defined by every open
cover (every union C of open sets covering all of X) having a finite subcover (i.e. a finite subset of C that also covers X).

Simplified proof: First, show that this space matches the RKHS definition:
1. ∀x ∈ X : k(·, x) = Σ_{i∈I} √λi ϕi(·) · √λi ϕi(x), i.e. αi = √λi ϕi(x), and ∥k(·, x)∥² = Σ_i λi ϕi(x)² = k(x, x) < ∞
2. ⟨f(·), k(·, x)⟩ = Σ_{i∈I} αi √λi ϕi(x) = f(x). Then use Aronszajn’s uniqueness result. □

What about the samples?
Draws from a Gaussian process [for non-simplified version, cf. Kanagawa et al., 2018 (op.cit.), Thms. 4.3 and 4.9]

Theorem (Karhunen-Loève Expansion)


Let X be a compact metric space, k : X × X → R a continuous kernel, ν a finite Borel measure whose
support is X, and (ϕi, λi)_{i∈I} as above. Let (zi)_{i∈I} be a collection of iid. standard Gaussian random
variables:
zi ∼ N(0, 1) and E[zi zj] = δij ,   for i, j ∈ I.
Then (simplified!):
f(x) = Σ_{i∈I} zi √λi ϕi(x) ∼ GP(0, k).

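A finite-dimensional analogue is easy to implement (my own sketch, not from the slides): on a grid, the eigenvectors of the Gram matrix stand in for the eigenfunctions with respect to the empirical measure, and scaling iid standard normals by the square roots of the eigenvalues produces draws whose covariance is the kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-8, 8, 400)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)        # RBF Gram matrix on the grid

# finite-dimensional Karhunen-Loeve: K = U diag(mu) U^T, sample f = U sqrt(mu) z with z ~ N(0, I)
mu, U = np.linalg.eigh(K)
mu = np.clip(mu, 0.0, None)                            # clip tiny negative eigenvalues
Z = rng.standard_normal((len(x), 5000))
samples = U @ (np.sqrt(mu)[:, None] * Z)               # each column is a draw with covariance K

# sanity check: the empirical covariance approaches the kernel matrix
print(np.abs(np.cov(samples) - K).max())               # small for many samples
```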

Corollary (Wahba, 1990. Proper proof in Kanagawa et al., Thm. 4.9)


If I is infinite, f ∼ GP(0, k) implies almost surely f ̸∈ Hk . To see this, note
E(∥f∥²_{Hk}) = E( Σ_{i∈I} zi² ) = Σ_{i∈I} E[zi²] = Σ_{i∈I} 1 ̸< ∞
GP samples are not in the RKHS!
But almost …

Theorem (Kanagawa, 2018. Restricted from Steinwart, 2017, itself generalized from Driscoll, 1973)
Let Hk be an RKHS and 0 < θ ≤ 1. Consider the θ-power of Hk given by

Hk^θ = { f(x) := Σ_{i∈I} αi λi^{θ/2} ϕi(x)   such that   ∥f∥²_{Hk^θ} := Σ_{i∈I} αi² < ∞ }   with   ⟨f, g⟩_{Hk^θ} := Σ_{i∈I} αi βi .

Then,
Σ_{i∈I} λi^{1−θ} < ∞   ⇒   f ∼ GP(0, k) ∈ Hk^θ with prob. 1

Non-representative Example: Let kλ(a, b) = exp(−(a − b)²/(2λ²)). Then f ∼ GP(0, kλ) is in H_{kλ}^θ
with prob. 1 for all 0 < θ < 1. The situation is more complicated for other kernels.
GP samples are not in the RKHS. They belong to a kind of “completion” of the RKHS (but that completion
can be strictly larger than the RKHS).

▶ GP and Kernel Methods are very closely related
▶ the RKHS is the space of all possible posterior mean functions
▶ the posterior mean is the ℓ2 -least-squares estimate in the RKHS
▶ the posterior variance (expected square error) is the worst-case error of bounded norm in the RKHS
▶ GP samples are not in the RKHS

If GPs / kernel machines use infinitely many features, can they learn every function?

How powerful are kernel/GP models?
first, the hope [Micchelli, Xu, Zhang, JMLR 7 (2006) 2651–2667]

▶ For some kernels, the RKHS “lies dense” in the space of all continuous functions (such kernels are
known as “universal”). An example is the square-exponential / Gaussian / RBF kernel

k(a, b) = exp(−1/2 (a − b)²)

(in fact, there are many universal kernels. E.g. all stationary kernels with power spectrum of full support.)

▶ When using such kernels for GP / kernel-ridge regression, for any continuous function f and any
ϵ > 0, there is an RKHS element f̂ ∈ Hk such that ∥f − f̂∥ < ϵ (where ∥ · ∥ is the maximum norm
on a compact subset of X).
▶ that is: Given enough data, the GP posterior mean can approximate any continuous function arbitrarily well!

The bad news
if f is not in the RKHS

[Sequence of figures: GP regression on a target f that is not in the RKHS, shown for the prior and after 1, 2, 5, 10,
20, 50, 100, and 500 evaluations; each panel plots f(x) ∈ [−5, 5] over x ∈ [−8, 8].]
Convergence Rates are Important
non-obvious aspects of f can ruin convergence v.d.Vaart & v.Zanten. Information Rates of Nonparametric GP models. JMLR 12 (2011)

[Figure: log-log plot of the error ∥f − m∥² (between 10⁻² and 10⁰) against the number of function evaluations (10⁰ to 10³).]

If f is “not well covered” by the RKHS, the number of datapoints required to achieve ϵ error can be
exponential in ϵ. Outside of the observation range, there are no guarantees at all.

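A small experiment in the spirit of this plot (my own sketch; the target f(x) = |x|, kernel, and grid are arbitrary choices): fit GP / kernel-ridge regression with an RBF kernel to noise-free evaluations of a continuous function that does not lie in the Gaussian RKHS, and track the error as the number of evaluations grows.

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

f = np.abs                                    # continuous, but not in the Gaussian RKHS
xs = np.linspace(-8, 8, 400)                  # evaluation grid inside the observation range

for n in (10, 50, 100, 500):
    X = np.linspace(-8, 8, n)                 # noise-free evaluations of f
    K = rbf(X, X) + 1e-8 * np.eye(n)          # tiny jitter in place of observation noise
    m = rbf(xs, X) @ np.linalg.solve(K, f(X))
    print(f"n = {n:4d}   mean squared error = {np.mean((m - f(xs))**2):.2e}")
```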
An Analogy
representing π in Q

▶ Q is dense in R
π = 3 · 1/1 + 1 · 1/10 + 4 · 1/100 + 1 · 1/1000 + . . .                  (decimal)
  = 4 · 1/1 − 4 · 1/3 + 4 · 1/5 − 4 · 1/7 + . . .                        (Gregory-Leibniz)
  = 3 · 1/1 + 4 · 1/(2·3·4) − 4 · 1/(4·5·6) + 4 · 1/(6·7·8) − . . .      (Nilakantha)
[Figure: log₁₀ error of the partial sums against the number of ‘datapoints’ (terms, 0 to 20), for the decimal,
Nilakantha, Gregory-Leibniz, and Chudnovsky series.]

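The analogy can be reproduced in a few lines (my own sketch; the Chudnovsky series from the figure is omitted): compute partial sums of the decimal expansion, the Gregory-Leibniz series, and the Nilakantha series, and compare their log₁₀ errors.

```python
import math

def decimal_truncation(n):
    # first n digits of the decimal expansion 3.1415...
    return math.floor(math.pi * 10**(n - 1)) / 10**(n - 1)

def gregory_leibniz(n):
    # pi = 4/1 - 4/3 + 4/5 - 4/7 + ...
    return 4 * sum((-1)**k / (2*k + 1) for k in range(n))

def nilakantha(n):
    # pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
    s = 3.0
    for k in range(1, n):
        s += (-1)**(k + 1) * 4 / ((2*k) * (2*k + 1) * (2*k + 2))
    return s

for n in (1, 5, 10, 20):
    for name, series in [("decimal", decimal_truncation),
                         ("Gregory-Leibniz", gregory_leibniz),
                         ("Nilakantha", nilakantha)]:
        err = abs(series(n) - math.pi)
        log_err = math.log10(err) if err > 0 else float("-inf")
        print(f"{name:16s} n = {n:2d}   log10 error = {log_err:6.2f}")
```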
But if you’re patient, you can learn anything!
The good news. [wording from Kanagawa et al., 2018]

Theorem (v.d. Vaart & v. Zanten, 2011)


Let f0 be an element of the Sobolev space W_2^β([0, 1]^d) with β > d/2. Let ks be a kernel on [0, 1]^d whose
RKHS is norm-equivalent to the Sobolev space W_2^s([0, 1]^d) of order s := α + d/2 with α > 0. If
f0 ∈ C^β([0, 1]^d) ∩ W_2^β([0, 1]^d) and min(α, β) > d/2, then we have

E_{Dn|f0} [ ∫ ∥f − f0∥²_{L2(PX)} dΠn(f | Dn) ] = O(n^{−2 min(α,β)/(2α+d)})   (n → ∞),      (1)

where E_{Dn|f0} denotes expectation with respect to Dn = (xi, yi)_{i=1}^n with the model xi ∼ PX and
p(y | f0) = N(y; f0(X), σ²I), and Πn(f | Dn) is the posterior given by GP regression with kernel ks.
The Sobolev space W_2^s(X) is the vector space of real-valued functions over X whose derivatives up to s-th order have bounded L2 norm. L2(PX) is the
Hilbert space of square-integrable functions with respect to PX.

If f0 is from a sufficiently smooth space, and Hk is “covering” that space well, then the entire GP posterior
(including the mean!) can contract around the true function at a polynomial rate.
GPs are “infinitely flexible”: They can learn infinite-dimensional functions arbitrarily well!
▶ Gaussian process regression is closely related to kernel ridge regression.
▶ the posterior mean is the kernel ridge / regularized kernel least-squares estimate in the RKHS Hk .

m(x) = kxX (kXX + σ²I)⁻¹ y = arg min_{f∈Hk} ∥y − fX∥² + ∥f∥²_{Hk}

▶ the posterior variance (expected square error) is the worst-case square error for bounded-norm RKHS
elements.

v(x) = kxx − kxX (kXX)⁻¹ kXx = sup_{f∈Hk, ∥f∥_{Hk} ≤ 1} (f(x) − m(x))²

▶ Similar connections apply for most kernel methods.


▶ GPs are quite powerful: They can learn any function in the RKHS (a large, generally
infinite-dimensional space!)
▶ GPs are quite limited: If f ̸∈ Hk , they may converge very (e.g. exponentially) slowly to the truth.
▶ But if we are willing to be cautious enough (e.g. with a rough kernel whose RKHS is a Sobolev
space of low order), then polynomial rates are achievable. (Unfortunately, exponentially slow in the
dimensionality of the input space)

