10 Understanding Kernels
Lecture 10
Understanding Kernels
Philipp Hennig
19 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #   date    content                        Ex |  #   date    content                       Ex
 1   20.04.  Introduction                    1 | 14   09.06.  Logistic Regression            8
 2   21.04.  Reasoning under Uncertainty       | 15   15.06.  Exponential Families
 3   27.04.  Continuous Variables            2 | 16   16.06.  Graphical Models               9
 4   28.04.  Monte Carlo                       | 17   22.06.  Factor Graphs
 5   04.05.  Markov Chain Monte Carlo        3 | 18   23.06.  The Sum-Product Algorithm     10
 6   05.05.  Gaussian Distributions            | 19   29.06.  Example: Topic Models
 7   11.05.  Parametric Regression           4 | 20   30.06.  Mixture Models                11
 8   12.05.  Learning Representations          | 21   06.07.  EM
 9   18.05.  Gaussian Processes              5 | 22   07.07.  Variational Inference         12
10   19.05.  Understanding Kernels             | 23   13.07.  Example: Topic Models
11   25.05.  An Example for GP Regression    6 | 24   14.07.  Example: Inferring Topics     13
12   26.05.  Gauss-Markov Models               | 25   20.07.  Example: Kernel Topic Models
13   08.06.  GP Classification               7 | 26   21.07.  Revision
What we’ve seen:
▶ Inference in models involving linear relationships between Gaussian random variables requires only linear algebra operations
▶ features can be used to learn nonlinear (real-valued) functions on various domains
▶ feature representations can be learned using type-II-maximum likelihood
▶ Gaussian process models allow utilizing infinitely many features in finite time
Warning
Results shown here are often simplified.
Some regularity assumptions have been dropped for easier readability.
If you don’t like math, wait for the next lecture.
Further reading:
M. Kanagawa, P. Hennig, D. Sejdinovic, B. K. Sriperumbudur
Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
arXiv 1807.02582, 2018
and
I. Steinwart, A. Christmann
Support Vector Machines
Springer SBM, 2008
What are kernels? Can I think of them as “infinitely large matrices”?
Quick Linear-Algebra Refresher
positive definite matrices
Definition (Eigenvalue)
Let A ∈ R^{n×n} be a matrix. A scalar λ ∈ C and a vector v ∈ C^n are called eigenvalue and corresponding eigenvector if

[Av]_i = \sum_{j=1}^{n} [A]_{ij} [v]_j = \lambda [v]_i .

Definition (Positive definite matrix)
A symmetric matrix A ∈ R^{n×n} is called positive definite if all its eigenvalues are positive, i.e. if it has a spectral decomposition

[A]_{ij} = \sum_{a=1}^{n} \lambda_a [v_a]_i [v_a]_j \quad \text{with} \quad \lambda_a > 0 \;\; \forall\, a = 1, \dots, n.
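Not from the slides: a minimal numerical sketch of these two definitions. It builds a Gram matrix from a square-exponential kernel on an arbitrary grid (an assumption chosen for illustration), checks that all eigenvalues are positive, and reconstructs the matrix from its spectral decomposition.

```python
# Sketch (not from the lecture): eigendecomposition of a kernel Gram matrix.
# The Gaussian kernel on distinct inputs yields a strictly positive definite
# matrix, so all eigenvalues are positive and A = sum_a lambda_a v_a v_a^T.
import numpy as np

def k(a, b, ell=1.0):
    """Square-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2))."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

x = np.linspace(-3, 3, 8)                 # 8 distinct input locations
A = k(x, x)                               # symmetric Gram matrix

lam, V = np.linalg.eigh(A)                # eigenvalues lam[a], eigenvectors V[:, a]
print("smallest eigenvalue:", lam.min())  # > 0: positive definite

A_rec = (V * lam) @ V.T                   # spectral reconstruction sum_a lam_a v_a v_a^T
assert np.allclose(A, A_rec)
```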
Kernels are Inner Products
Mercer’s Theorem [image: The Royal Society]
Definition (Eigenfunction)
A function ϕ : X → R and a scalar λ ∈ C are called eigenfunction and eigenvalue of k with respect to the measure ν if

\int k(x, \tilde{x})\, \phi(\tilde{x})\, d\nu(\tilde{x}) = \lambda\, \phi(x) .

Mercer's theorem (simplified): for a continuous positive definite kernel k on a compact X with finite measure ν, there is a system of eigenfunctions (ϕ_i)_{i∈I}, orthonormal in L_2(ν), and non-negative eigenvalues (λ_i)_{i∈I} such that

k(x, \tilde{x}) = \sum_{i \in I} \lambda_i\, \phi_i(x)\, \phi_i(\tilde{x}),

with the sum converging absolutely and uniformly.
Are Kernels Infinitely Large Positive Definite Matrices?
Kind of …
k(a, b) = \sum_{i \in I} \lambda_i\, \phi_i(a)\, \phi_i(b) =: \Phi(a)\, \Sigma\, \Phi(b)^\top \qquad \forall\, a, b \in X.

▶ In the sense of Mercer's theorem, one may think vaguely of a kernel k : X × X → R evaluated at k(a, b) for a, b ∈ X as the "element" k_{ab} of an "infinitely large" matrix.
▶ However, this interpretation is only relative to the measure ν : X → R.
▶ In general, it is not straightforward to find the eigenfunctions analytically (a numerical sketch follows below).
▶ The better question is: Why do you want to think about infinite matrices?
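Not from the slides: a small Nyström-style sketch (grid, measure, and kernel are assumptions chosen for illustration) that makes the "infinitely large matrix" picture concrete. It discretizes the integral operator with respect to the uniform measure on [−3, 3], reads off approximate eigenvalues λ_i and eigenfunctions ϕ_i from the Gram matrix, and checks that a truncated expansion Σ_i λ_i ϕ_i(a) ϕ_i(b) reproduces k(a, b).

```python
# Sketch (not from the lecture): a Nystroem discretization of Mercer's theorem.
# With respect to the uniform measure on [-3, 3], approximate eigenpairs of the
# integral operator from a Gram matrix on a fine grid, then check that a
# truncated expansion sum_i lam_i phi_i(a) phi_i(b) already reproduces k(a, b).
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

m = 200
z = np.linspace(-3, 3, m)
dz = z[1] - z[0]                          # quadrature weight of the uniform measure

K = k(z, z)
mu, U = np.linalg.eigh(K)                 # ascending eigenvalues of the Gram matrix
lam = mu[::-1] * dz                       # operator eigenvalues: lambda_i ~ mu_i * dz
Phi = U[:, ::-1] / np.sqrt(dz)            # eigenfunctions, orthonormal in L2(nu)

r = 15                                    # truncation level: a handful of features
K_r = (Phi[:, :r] * lam[:r]) @ Phi[:, :r].T
print(f"max |k(a,b) - rank-{r} expansion| on the grid: {np.abs(K - K_r).max():.2e}")
```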
Bochner’s Theorem
Here’s why operators are tricky [image: Rice University, 1970, CC-BY 3.0]
Theorem (Bochner, simplified). A continuous stationary kernel k(x, x̃) = k(x − x̃) on R^d is positive definite if and only if it is the Fourier transform of a finite non-negative measure, its power spectrum.
Gaussian processes, by any other name
one of the most deeply studied models in history
The Gaussian Posterior Mean is a Least-Squares estimate
nonparametric formulation, at explicit locations
The posterior mean estimator of Gaussian (process) regression is equal to the regularized least-squares estimate with the regularizer ∥f∥²_k. This is also known as the kernel ridge estimate.
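A sketch (not from the lecture) of this equivalence using scikit-learn, under the assumption that the same RBF kernel, noise level σ², and regularization weight are used on both sides; the dataset is synthetic.

```python
# Sketch (not from the lecture): with the same kernel and regularization, the GP
# posterior mean and the kernel ridge estimate coincide (synthetic data assumed).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-4, 4, 20)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
sigma2, ell = 0.1**2, 1.0

# GP regression with fixed hyperparameters (optimizer=None); alpha is the noise variance
gp = GaussianProcessRegressor(kernel=RBF(length_scale=ell), alpha=sigma2,
                              optimizer=None).fit(X, y)
# kernel ridge regression with the same RBF kernel and regularizer alpha = sigma^2
kr = KernelRidge(alpha=sigma2, kernel="rbf", gamma=1 / (2 * ell**2)).fit(X, y)

Xs = np.linspace(-4, 4, 200).reshape(-1, 1)
print(np.max(np.abs(gp.predict(Xs) - kr.predict(Xs))))  # ~0 up to numerical round-off
```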
200 years of data analysis
and counting [portrait: Julien-Léopold Boilly, 1820 (all other portraits show a different Legendre!)]
Adrien-Marie Legendre (1752–1833)
Nouvelles méthodes pour la détermination des orbites des comètes, 1805
200 years of data analysis
and counting
Carl-Friedrich Gauss (1777–1855)
Theorie der Bewegung der Himmelskörper welche in Kegelschnitten die Sonne umlaufen, 1877
What about all those kernel concepts?
What’s the relationship between GPs and kernel ridge regression? [slides: Ulrike v. Luxburg, 2020]
Reproducing Kernel Hilbert Spaces
Two definitions [Schölkopf & Smola, 2002 / Rasmussen & Williams, 2006]
Definition (reproducing property): A Hilbert space H_k of functions f : X → R is the reproducing kernel Hilbert space of k if k(·, x) ∈ H_k for all x ∈ X and ⟨f, k(·, x)⟩_{H_k} = f(x) for all f ∈ H_k (the reproducing property).
Definition (as a completion): Equivalently, H_k is the completion of span{k(·, x) : x ∈ X} under the inner product defined by ⟨k(·, x), k(·, x̃)⟩_{H_k} := k(x, x̃).
Theorem [Aronszajn, 1950]: For every pos. def. k on X, there exists a unique RKHS.
What is the RKHS? (1)
The RKHS is the space of possible posterior mean functions [e.g. Rasmussen & Williams, 2006, Eq. 6.5]
\mu(x) = k_{xX} \underbrace{(k_{XX} + \sigma^2 I)^{-1} y}_{=:\, w} = \sum_{i=1}^{n} w_i\, k(x, x_i) \qquad \text{for } n \in \mathbb{N}.
To understand what a GP can learn we have to analyze the RKHS
the connection to the statistical learning theory of RKHSs
[Figure: a GP regression example; f(x) plotted over x ∈ [−8, 8], with values roughly within [−10, 10].]
What is the meaning of the GP point estimate?
The posterior mean is the least-squares estimate in the RKHS
L(f) = \frac{1}{\sigma^2} \sum_{i} \big( f(x_i) - y_i \big)^2 + \|f\|_{H_k}^2 .
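Not written out on the slide: the short computation showing why this loss is minimized by the GP posterior mean. It assumes the minimizer lies in the span of the representers k(·, x_i) (the representer theorem) and, for brevity, that K_XX is invertible.

```latex
% Restrict f to the span of the representers: f = \sum_j w_j k(\cdot, x_j),
% so that f(x_i) = [K_{XX} w]_i and \|f\|_{H_k}^2 = w^\top K_{XX} w.
\begin{align*}
L(w) &= \tfrac{1}{\sigma^2}\,\lVert K_{XX} w - y \rVert^2 + w^\top K_{XX} w, \\
\nabla_w L &= \tfrac{2}{\sigma^2}\, K_{XX} (K_{XX} w - y) + 2 K_{XX} w \overset{!}{=} 0
  \;\;\Longrightarrow\;\; (K_{XX} + \sigma^2 I)\, w = y
  \quad (\text{if } K_{XX} \text{ is invertible}).
\end{align*}
% Hence the minimizer is exactly the GP posterior mean:
\[
  f(x) = \sum_i w_i\, k(x, x_i) = k_{xX} (K_{XX} + \sigma^2 I)^{-1} y = \mu(x).
\]
```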
What is the meaning of uncertainty?
Frequentist interpretation of the posterior variance
How far could the posterior mean be from the truth, assuming noise-free observations?
\sup_{f \in H,\, \|f\| \le 1} \big( m(x) - f(x) \big)^2
    = \sup_{f \in H,\, \|f\| \le 1} \Big( \sum_i f(x_i)\, \underbrace{[K_{XX}^{-1} k(X, x)]_i}_{w_i} - f(x) \Big)^2

reproducing property:      = \sup_{f \in H,\, \|f\| \le 1} \Big\langle \sum_i w_i\, k(\cdot, x_i) - k(\cdot, x),\; f(\cdot) \Big\rangle_H^2

Cauchy-Schwarz (|\langle a, b \rangle| \le \|a\| \cdot \|b\|):      = \Big\| \sum_i w_i\, k(\cdot, x_i) - k(\cdot, x) \Big\|_H^2

reproducing property:      = \sum_{ij} w_i w_j\, k(x_i, x_j) - 2 \sum_i w_i\, k(x, x_i) + k(x, x)

inserting w = K_{XX}^{-1} k(X, x):      = k(x, x) - k(x, X)\, K_{XX}^{-1}\, k(X, x) = v(x), \text{ the (noise-free) posterior variance.}
Bayesians expect the worst
it’s not always true that ”Frequentists are pessimists”
Theorem
Assume p(f) = GP(f; 0, k) and noise-free observations p(y | f) = δ(y − f_X). The GP posterior variance (the expected square error) is a worst-case bound on the divergence between m(x) and an RKHS element of bounded norm:

v(x) = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx} = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \big( m(x) - f(x) \big)^2 .

The GP's expected square error is the RKHS's worst-case square error for bounded norm.
Nb: v(x) is not, in general, itself an element of H_k.
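Not from the slides: a numerical check of this worst-case reading, under assumptions chosen for illustration (a square-exponential kernel, a handful of observation locations, and the unit-norm test function f = k(·, c)/√k(c, c)).

```python
# Sketch (not from the lecture): for a unit-norm RKHS element f (here the
# assumed test function f = k(., c) / sqrt(k(c, c))), the noise-free posterior
# mean built from f's values at X deviates from f by at most sqrt(v(x)).
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

X = np.array([-3.0, -1.5, 0.0, 1.0, 2.5])       # observation locations
Xs = np.linspace(-6, 6, 400)                    # test locations
Kinv = np.linalg.inv(k(X, X))

c = np.array([0.7])                             # centre of the unit-norm test function
def f(x):
    return k(x, c)[:, 0] / np.sqrt(k(c, c)[0, 0])   # ||f||_k = 1

m = k(Xs, X) @ Kinv @ f(X)                      # noise-free posterior mean given f(X)
v = k(Xs, Xs).diagonal() - np.einsum("ij,jk,ik->i", k(Xs, X), Kinv, k(Xs, X))

ratio = (m - f(Xs)) ** 2 / np.maximum(v, 1e-12) # clamp where v is numerically ~0
print("max (m(x) - f(x))^2 / v(x):", ratio.max())   # <= 1: the worst-case bound holds
```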
What is the RKHS? (2)
Representation in terms of eigenfunctions [I. Steinwart and A. Christmann. Support Vector Machines, 2008, Thm. 4.51]
Theorem (simplified): In the setting of Mercer's theorem (compact X, eigenpairs (λ_i, ϕ_i)_{i∈I}), the RKHS of k is

H_k = \Big\{ f = \sum_{i \in I} \alpha_i\, \lambda_i^{1/2} \phi_i \;:\; \sum_{i \in I} \alpha_i^2 < \infty \Big\},
\qquad
\langle f, g \rangle_{H_k} = \sum_{i \in I} \alpha_i \beta_i
\quad \text{for } f = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i \text{ and } g = \sum_{i \in I} \beta_i \lambda_i^{1/2} \phi_i .

A compact space, simplified, is a space that is both bounded (all points have finite distance from each other) and closed (it contains all limits). For topological spaces, this is more generally defined by every open cover (every union C of open sets covering all of X) having a finite subcover (i.e. a finite subset of C that also covers X).

Simplified proof: First, show that this space matches the RKHS definition:
1. ∀x ∈ X: k(·, x) = \sum_{i \in I} \lambda_i^{1/2} \phi_i(\cdot) \cdot \underbrace{\lambda_i^{1/2} \phi_i(x)}_{\alpha_i}, and ∥k(·, x)∥² = \sum_i \lambda_i \phi_i(x)^2 = k(x, x) < ∞
2. ⟨f(·), k(·, x)⟩ = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x) = f(x). Then use Aronszajn's uniqueness result. □
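Not from the slides: the two proof steps checked numerically with Nyström approximations of the eigenpairs (grid, kernel, and truncation level r are assumptions for illustration).

```python
# Sketch (not from the lecture): the two steps of the proof, checked numerically
# with Nystroem approximations of the eigenpairs.
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

m, r = 300, 12
z = np.linspace(-3, 3, m)
dz = z[1] - z[0]
mu, U = np.linalg.eigh(k(z, z))
lam = mu[::-1][:r] * dz                     # leading operator eigenvalues lambda_i
Phi = U[:, ::-1][:, :r] / np.sqrt(dz)       # eigenfunctions phi_i on the grid

j = m // 3                                  # evaluation point x = z[j]
beta = np.sqrt(lam) * Phi[j]                # coefficients of k(., x) in the basis

# step 1: ||k(., x)||^2 = sum_i lam_i phi_i(x)^2 = k(x, x)
print("||k(.,x)||^2 =", beta @ beta, " vs  k(x,x) =", k(z[j:j+1], z[j:j+1])[0, 0])

# step 2: <f, k(., x)> = sum_i alpha_i beta_i = f(x), for f = sum_i alpha_i lam_i^(1/2) phi_i
alpha = np.random.default_rng(1).standard_normal(r) / np.arange(1, r + 1)
f_grid = Phi @ (np.sqrt(lam) * alpha)
print("<f, k(.,x)> =", alpha @ beta, " vs  f(x) =", f_grid[j])
```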
What about the samples?
Draws from a Gaussian process [for non-simplified version, cf. Kanagawa et al., 2018 (op.cit.), Thms. 4.3 and 4.9]
Theorem (Kanagawa, 2018. Restricted from Steinwart, 2017, itself generalized from Driscoll, 1973)
Let H_k be an RKHS and 0 < θ ≤ 1. Consider the θ-power of H_k given by

H_k^\theta = \Big\{ f(x) := \sum_{i \in I} \alpha_i\, \lambda_i^{\theta/2} \phi_i(x) \;\text{ such that }\; \|f\|_{H_k^\theta}^2 := \sum_{i \in I} \alpha_i^2 < \infty \Big\},
\qquad \text{with } \langle f, g \rangle_{H_k^\theta} := \sum_{i \in I} \alpha_i \beta_i .

Then,

\sum_{i \in I} \lambda_i^{1-\theta} < \infty \quad \Rightarrow \quad f \sim GP(0, k) \in H_k^\theta \text{ with prob. } 1.
Non-representative Example: Let k_λ(a, b) = exp(−(a − b)²/(2λ²)). Then f ∼ GP(0, k_λ) is in H_{k_λ}^θ with prob. 1 for all 0 < θ < 1. The situation is more complicated for other kernels.
GP samples are not in the RKHS. They belong to a kind of "completion" of the RKHS (but that completion can be strictly larger than the RKHS).
▶ GP and Kernel Methods are very closely related
▶ the RKHS is the space of all possible posterior mean functions
▶ the posterior mean is the ℓ2 -least-squares estimate in the RKHS
▶ the posterior variance (expected square error) is the worst-case square error over RKHS elements of bounded norm
▶ GP samples are not in the RKHS
If GPs / kernel machines use infinitely many features, can they learn every function?
How powerful are kernel/GP models?
first, the hope [Micchelli, Xu, Zhang, JMLR 7 (2006) 2651–2667]
▶ For some kernels, the RKHS "lies dense" in the space of all continuous functions (such kernels are known as "universal"). An example is the square-exponential / Gaussian / RBF kernel (in fact, there are many universal kernels, e.g. all stationary kernels whose power spectrum has full support).
▶ When using such kernels for GP / kernel-ridge regression, for any continuous function f and any ϵ > 0 there is an RKHS element f̂ ∈ H_k such that ∥f − f̂∥ < ϵ (where ∥ · ∥ is the maximum norm on a compact subset of X).
▶ That is: given enough data, the GP posterior mean can approximate any continuous function arbitrarily well! (A numerical sketch follows below.)
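Not from the slides: a sketch of what universality looks like in practice. It fits kernel ridge regression with an RBF kernel (length scale and ridge chosen arbitrarily) to noise-free samples of the continuous but non-smooth target |x| and prints the sup-error on [−1, 1]; universality says the RKHS can get arbitrarily close, while how quickly the fitted estimate gets there is exactly the question raised by the next slides.

```python
# Sketch (not from the lecture): kernel ridge regression with a universal (RBF)
# kernel on noise-free samples of the continuous target |x|; the sup-error on
# the compact set [-1, 1] is printed for growing n.
import numpy as np

def k(a, b, ell=0.3):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

target = np.abs                        # continuous, but not an element of the RBF RKHS
Xs = np.linspace(-1, 1, 1001)          # compact set on which the sup-error is measured

for n in [5, 10, 20, 50, 100]:
    X = np.linspace(-1, 1, n)
    w = np.linalg.solve(k(X, X) + 1e-8 * np.eye(n), target(X))   # kernel ridge weights
    err = np.max(np.abs(k(Xs, X) @ w - target(Xs)))
    print(f"n = {n:3d}   sup-error on [-1, 1] = {err:.4f}")
```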
The bad news
if f is not in the RKHS
[Figure sequence: GP regression on a target f outside the RKHS; f(x) plotted over x ∈ [−8, 8] (values roughly within [−5, 5]), shown for the prior and after 1, 2, 5, 10, 20, 50, 100, and 500 evaluations.]
Convergence Rates are Important
non-obvious aspects of f can ruin convergence [v. d. Vaart & v. Zanten, Information Rates of Nonparametric GP Models, JMLR 12 (2011)]
[Figure: the squared error ‖f − m‖² (log scale, roughly 10⁰ down to 10⁻²) as a function of the number of datapoints.]
If f is "not well covered" by the RKHS, the number of datapoints required to achieve error ϵ can be exponential in 1/ϵ. Outside of the observation range, there are no guarantees at all.
An Analogy
representing π in Q
▶ Q is dense in R, yet different rational series for π converge at very different rates:

\pi = 3\cdot\tfrac{1}{1} + 1\cdot\tfrac{1}{10} + 4\cdot\tfrac{1}{100} + 1\cdot\tfrac{1}{1000} + \dots \quad \text{(decimal)}
    = 4\cdot\tfrac{1}{1} - 4\cdot\tfrac{1}{3} + 4\cdot\tfrac{1}{5} - 4\cdot\tfrac{1}{7} + \dots \quad \text{(Gregory-Leibniz)}
    = 3\cdot\tfrac{1}{1} + 4\cdot\tfrac{1}{2\cdot 3\cdot 4} - 4\cdot\tfrac{1}{4\cdot 5\cdot 6} + 4\cdot\tfrac{1}{6\cdot 7\cdot 8} - \dots \quad \text{(Nilakantha)}

[Figure: log₁₀ error of the partial sums against the number of 'datapoints' (terms, 0 to 20) for the decimal, Nilakantha, Gregory-Leibniz, and Chudnovsky representations; the errors range from about 10⁰ down to below 10⁻¹⁵.]
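Not from the slides: a few lines of Python computing partial sums of two of the series above, as a feel for how different the convergence rates are even though all of them work eventually.

```python
# Sketch (not from the lecture): partial sums of the Gregory-Leibniz and
# Nilakantha series for pi, and their errors after a few terms.
import numpy as np

N = 20
n = np.arange(N)

gl = np.cumsum(4 * (-1.0) ** n / (2 * n + 1))                                # Gregory-Leibniz
nk = 3 + np.cumsum(4 * (-1.0) ** n / ((2*n + 2) * (2*n + 3) * (2*n + 4)))    # Nilakantha

for i in [0, 4, 9, 19]:
    print(f"{i + 1:2d} terms:  Gregory-Leibniz error {abs(gl[i] - np.pi):.1e},  "
          f"Nilakantha error {abs(nk[i] - np.pi):.1e}")
```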
But if you’re patient, you can learn anything!
The good news. [wording from Kanagawa et al., 2018]
where E_{X,Y|f_0} denotes expectation with respect to D_n = (x_i, y_i)_{i=1}^n with the model x_i ∼ P_X and p(y | f_0) = N(y; f_0(X), σ² I), and Π_n(f | D_n) the posterior given by GP regression with kernel k_s.
The Sobolev space W_2^s(X) is the vector space of real-valued functions over X whose derivatives up to s-th order have bounded L_2 norm. L_2(P_X) is the Hilbert space of square-integrable functions with respect to P_X.
If f0 is from a sufficiently smooth space, and Hk is “covering” that space well, then the entire GP posterior
(including the mean!) can contract around the true function at a linear rate.
GPs are “infinitely flexible”: They can learn infinite-dimensional functions arbitrarily well!
▶ Gaussian process regression is closely related to kernel ridge regression.
▶ the posterior mean is the kernel ridge / regularized kernel least-squares estimate in the RKHS Hk .
▶ the posterior variance (expected square error) is the worst-case square error for bounded-norm RKHS elements.
v(x) = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx} = \max_{f \in H_k,\, \|f\|_{H_k} \le 1} \big( f(x) - m(x) \big)^2