Gaussian Process Regression: 4F10 Pattern Recognition, 2010
Rasmussen (Engineering, Cambridge), November 11th - 16th, 2010
The Prediction Problem
[Figure: atmospheric CO2 concentration (ppm) plotted against year, 1960-2020; the question mark indicates the future concentrations we wish to predict.]
The Gaussian Distribution
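The density of a D-dimensional Gaussian with mean vector $\boldsymbol\mu$ and covariance matrix $\Sigma$ is
$$p(\mathbf{x}\,|\,\boldsymbol\mu, \Sigma) \;=\; \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\big).$$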
Conditionals and Marginals of a Gaussian
Both the conditionals and the marginals of a joint Gaussian are again Gaussian.
Conditionals and Marginals of a Gaussian
In algebra, if x and y are jointly Gaussian
$$p(\mathbf{x}, \mathbf{y}) \;=\; \mathcal{N}\!\left(\begin{bmatrix}\mathbf{a}\\ \mathbf{b}\end{bmatrix},\; \begin{bmatrix}A & B\\ B^\top & C\end{bmatrix}\right),$$
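then the conditional distribution is again Gaussian:
$$p(\mathbf{x}\,|\,\mathbf{y}) \;=\; \mathcal{N}\!\big(\mathbf{a} + BC^{-1}(\mathbf{y}-\mathbf{b}),\; A - BC^{-1}B^\top\big).$$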
What is a Gaussian Process?
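A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is fully specified by a mean function $m(x)$ and a covariance function $k(x, x')$:
$$f(x) \;\sim\; \mathcal{GP}\big(m(x),\, k(x, x')\big).$$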
The marginalization property
Recall:
$$p(\mathbf{x}) \;=\; \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y}.$$
For Gaussians:
$$p(\mathbf{x}, \mathbf{y}) \;=\; \mathcal{N}\!\left(\begin{bmatrix}\mathbf{a}\\ \mathbf{b}\end{bmatrix},\; \begin{bmatrix}A & B\\ B^\top & C\end{bmatrix}\right) \;\Longrightarrow\; p(\mathbf{x}) \;=\; \mathcal{N}(\mathbf{a}, A).$$
Random functions from a Gaussian Process
To get an indication of what this distribution over functions looks like, focus on a finite subset of function values $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))^\top$, for which
$$\mathbf{f} \;\sim\; \mathcal{N}(\mathbf{0}, \Sigma), \qquad \text{where } \Sigma_{ij} = k(x_i, x_j).$$
Some values of the random function
[Figure: a few randomly drawn function values f(x) plotted against input x over [-5, 5].]
Joint Generation
% K: D x D covariance matrix, m: D x 1 mean vector
z = randn(D,1);        % standard normal sample
y = chol(K)'*z + m;    % chol returns upper-triangular R with R'*R = K, so cov(y) = K
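For concreteness, a complete sketch (assuming a squared exponential covariance with unit signal variance and length scale ell = 1; the jitter term is added only for numerical stability of the Cholesky factorization):

% draw one random function from a zero-mean GP with squared exponential covariance
n   = 201;
x   = linspace(-5, 5, n)';                 % test inputs
ell = 1;                                   % characteristic length scale (assumed)
K   = exp(-0.5 * (x - x').^2 / ell^2);     % K(i,j) = k(x_i, x_j); implicit expansion (R2016b+/Octave)
K   = K + 1e-10 * eye(n);                  % small jitter so chol succeeds numerically
f   = chol(K)' * randn(n, 1);              % zero mean, so nothing to add
plot(x, f);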
Sequential Generation
$$p(f_1, \ldots, f_n \,|\, x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} p(f_i \,|\, f_{i-1}, \ldots, f_1, x_i, \ldots, x_1),$$
and generate the function values one at a time, each conditioned on the previously generated values.
Function drawn at random from a Gaussian Process with Gaussian covariance
[Figure: a random function of two inputs drawn from a GP with Gaussian (squared exponential) covariance, plotted as a surface over the input plane.]
The Linear Model
In the linear model, the output is assumed to be a linear function of the input:
$$y \;=\; \mathbf{x}^\top \mathbf{w} + \varepsilon.$$
The least squares (maximum likelihood) estimate of the weights is
$$\hat{\mathbf{w}} \;=\; (X X^\top)^{-1} X \mathbf{y}.$$
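A minimal MATLAB/Octave sketch (assuming X is D x n with one input per column, y is n x 1, and x_star is a hypothetical new input):

w_hat  = (X*X') \ (X*y);     % solve the normal equations (X X') w = X y
y_star = x_star' * w_hat;    % prediction for the new input x_star (D x 1)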
Basis functions, or Linear-in-the-Parameters Models
The linear model is of little use if the data doesn't follow a linear relationship. But it can be generalized using basis functions:
$$y \;=\; \phi(\mathbf{x})^\top \mathbf{w} + \varepsilon,$$
e.g. polynomial basis functions $\phi(x) = (1, x, x^2, \ldots)^\top$.
Maximum likelihood, parametric model
Supervised parametric learning:
data: x, y
model: $y = f_{\mathbf{w}}(x) + \varepsilon$
Gaussian likelihood:
$$p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{w}, \mathcal{M}_i) \;\propto\; \prod_c \exp\!\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2/\sigma^2_{\text{noise}}\big).$$
Maximize the likelihood to obtain $\mathbf{w}_{\text{ML}}$, then make predictions by plugging in the estimate: $p(y_*\,|\,x_*, \mathbf{w}_{\text{ML}}, \mathcal{M}_i)$.
Bayesian Inference, parametric model
Supervised parametric learning:
data: x, y
model: $y = f_{\mathbf{w}}(x) + \varepsilon$
Gaussian likelihood:
$$p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{w}, \mathcal{M}_i) \;\propto\; \prod_c \exp\!\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2/\sigma^2_{\text{noise}}\big).$$
Parameter prior:
$$p(\mathbf{w}\,|\,\mathcal{M}_i)$$
Posterior parameter distribution by Bayes' rule:
$$p(\mathbf{w}\,|\,\mathbf{x}, \mathbf{y}, \mathcal{M}_i) \;=\; \frac{p(\mathbf{w}\,|\,\mathcal{M}_i)\, p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{w}, \mathcal{M}_i)}{p(\mathbf{y}\,|\,\mathbf{x}, \mathcal{M}_i)}$$
Bayesian Inference, parametric model, cont.
Making predictions:
$$p(y_*\,|\,x_*, \mathbf{x}, \mathbf{y}, \mathcal{M}_i) \;=\; \int p(y_*\,|\,\mathbf{w}, x_*, \mathcal{M}_i)\, p(\mathbf{w}\,|\,\mathbf{x}, \mathbf{y}, \mathcal{M}_i)\, d\mathbf{w}$$
Marginal likelihood:
$$p(\mathbf{y}\,|\,\mathbf{x}, \mathcal{M}_i) \;=\; \int p(\mathbf{w}\,|\,\mathcal{M}_i)\, p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{w}, \mathcal{M}_i)\, d\mathbf{w}.$$
Model probability:
$$p(\mathcal{M}_i\,|\,\mathbf{x}, \mathbf{y}) \;=\; \frac{p(\mathcal{M}_i)\, p(\mathbf{y}\,|\,\mathbf{x}, \mathcal{M}_i)}{p(\mathbf{y}\,|\,\mathbf{x})}$$
Non-parametric Gaussian process models
In our non-parametric model, the parameters are the function itself!
Gaussian likelihood:
$$\mathbf{y}\,|\,\mathbf{x}, f(x), \mathcal{M}_i \;\sim\; \mathcal{N}(\mathbf{f}, \sigma^2_{\text{noise}} I)$$
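Together with a (zero-mean) Gaussian process prior over the function, $f(x)\,|\,\mathcal{M}_i \sim \mathcal{GP}\big(m(x) \equiv 0,\; k(x, x')\big)$, Bayes' rule then yields a Gaussian process posterior over functions.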
Prior and Posterior
[Figure: left panel, sample functions f(x) drawn from the prior; right panel, sample functions drawn from the posterior after conditioning on a few observations; output f(x) against input x over [-5, 5].]
Predictive distribution:
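$$p(y_*\,|\,x_*, \mathbf{x}, \mathbf{y}) \;=\; \mathcal{N}\Big(k(x_*, X)[K(X,X) + \sigma_n^2 I]^{-1}\mathbf{y},\;\; k(x_*, x_*) + \sigma_n^2 - k(x_*, X)[K(X,X) + \sigma_n^2 I]^{-1} k(X, x_*)\Big)$$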
Factor Graph for Gaussian Process
The predictive distribution for test case $y(x_j)$ depends only on the corresponding latent variable $f(x_j)$.
The predictive mean is
$$\mu(x_*) \;=\; k(x_*, X)[K(X,X) + \sigma_n^2 I]^{-1}\mathbf{y} \;=\; \sum_{c=1}^{n} \beta_c\, y^{(c)} \;=\; \sum_{c=1}^{n} \alpha_c\, k(x_*, x^{(c)}), \qquad \boldsymbol\alpha = [K(X,X) + \sigma_n^2 I]^{-1}\mathbf{y}.$$
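The predictive variance of the latent function is
$$V(x_*) \;=\; k(x_*, x_*) \;-\; k(x_*, X)[K(X,X) + \sigma_n^2 I]^{-1} k(X, x_*);$$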
the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
The marginal likelihood
Log marginal likelihood:
$$\log p(\mathbf{y}\,|\,\mathbf{x}, \mathcal{M}_i) \;=\; -\tfrac{1}{2}\mathbf{y}^\top K^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log|K| \;-\; \tfrac{n}{2}\log(2\pi)$$
is the combination of a data fit term and a complexity penalty. Occam's Razor is automatic.
The hyperparameters can be set by maximizing the marginal likelihood, using the partial derivatives
$$\frac{\partial \log p(\mathbf{y}\,|\,\mathbf{x}, \boldsymbol\theta, \mathcal{M}_i)}{\partial \theta_j} \;=\; \tfrac{1}{2}\mathbf{y}^\top K^{-1}\frac{\partial K}{\partial \theta_j}K^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\,\mathrm{trace}\Big(K^{-1}\frac{\partial K}{\partial \theta_j}\Big).$$
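A minimal numerical sketch of the log marginal likelihood (assuming K already includes the noise variance, and using a Cholesky factorization rather than forming the inverse explicitly):

% log marginal likelihood of y under N(0, K)
L     = chol(K, 'lower');          % K = L*L'
alpha = L' \ (L \ y);              % alpha = inv(K)*y
n     = length(y);
lml   = -0.5*(y'*alpha) - sum(log(diag(L))) - 0.5*n*log(2*pi);   % log|K| = 2*sum(log(diag(L)))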
Example: Fitting the length scale parameter
$$k(x, x') \;=\; v^2 \exp\Big(\!-\frac{(x - x')^2}{2\ell^2}\Big) \;+\; \sigma_n^2\,\delta_{xx'}.$$
[Figure: observations together with the posterior mean for three length scales (too short, a good length scale, too long), over inputs x in [-10, 10].]
The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
Why, in principle, does Bayesian Inference work?
Occam's Razor
[Figure: the evidence P(Y|M_i) plotted against all possible data sets Y, for models that are too simple, "just right", and too complex.]
An illustrative analogous example
Imagine the simple task of fitting the variance, $\sigma^2$, of a zero-mean Gaussian to a
set of n scalar observations.
From random functions to covariance functions II
Consider the class of functions (sums of squared exponentials):
$$f(x) \;=\; \lim_{n\to\infty} \frac{1}{n}\sum_i \gamma_i \exp\!\big(-(x - i/n)^2\big), \qquad \text{where } \gamma_i \sim \mathcal{N}(0, 1),\ \forall i$$
$$\phantom{f(x)} \;=\; \int \gamma(u)\exp\!\big(-(x - u)^2\big)\,du, \qquad \text{where } \gamma(u) \sim \mathcal{N}(0, 1),\ \forall u.$$
The covariance function becomes
$$\mathbb{E}[f(x)f(x')] \;=\; \int \exp\!\big(-(x-u)^2 - (x'-u)^2\big)\,du \;=\; \int \exp\Big(\!-2\big(u - \tfrac{x+x'}{2}\big)^2 + \tfrac{(x+x')^2}{2} - x^2 - x'^2\Big)\,du \;\propto\; \exp\Big(\!-\frac{(x - x')^2}{2}\Big).$$
Thus, the squared exponential covariance function is equivalent to regression
using infinitely many Gaussian shaped basis functions placed everywhere, not just
at your training points!
Using finitely many basis functions may be dangerous!
Model Selection in Practice; Hyperparameters
There are two types of task: choosing the form of the covariance function, and setting its parameters.
Typically, our prior is too weak to quantify these aspects of the covariance function, so we use a hierarchical model with hyperparameters. E.g., in Automatic Relevance Determination (ARD):
$$k(\mathbf{x}, \mathbf{x}') \;=\; v_0^2 \exp\Big(\!-\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2 v_d^2}\Big), \qquad \text{hyperparameters } \boldsymbol\theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).$$
[Figure: three sample functions f(x_1, x_2), drawn with different ARD hyperparameter settings, plotted as surfaces over the (x_1, x_2) plane.]
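A minimal MATLAB/Octave sketch of this ARD covariance (a hypothetical helper ard_cov; inputs are stored one case per row, and v(d) plays the role of v_d):

function K = ard_cov(X1, X2, v0, v, sn2)
% ARD squared exponential covariance between the rows of X1 (n1 x D) and X2 (n2 x D);
% v0 is the signal standard deviation, v(d) the length scale of input dimension d,
% and sn2 the noise variance (added on the diagonal when X1 and X2 coincide).
  r2 = zeros(size(X1,1), size(X2,1));
  for d = 1:size(X1,2)
    r2 = r2 + (X1(:,d) - X2(:,d)').^2 / (2*v(d)^2);   % implicit expansion (R2016b+/Octave)
  end
  K = v0^2 * exp(-r2);
  if isequal(X1, X2)
    K = K + sn2 * eye(size(K,1));
  end
end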
Rational quadratic covariance function
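The rational quadratic (RQ) covariance function has the form
$$k_{\text{RQ}}(r) \;=\; \Big(1 + \frac{r^2}{2\alpha\ell^2}\Big)^{-\alpha}, \qquad r = |x - x'|,$$
which can be viewed as a scale mixture of squared exponential covariance functions with different length scales; as $\alpha \to \infty$ it recovers the squared exponential.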
Rational quadratic covariance function II
[Figure: left panel, the RQ covariance as a function of input distance for different values of alpha (e.g. alpha = 1/2 and alpha = 2); right panel, corresponding sample functions f(x) against input x.]
Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:
$$k(x, x') \;=\; \frac{1}{\Gamma(\nu)\,2^{\nu-1}}\Big[\frac{\sqrt{2\nu}}{\ell}\,|x - x'|\Big]^{\nu} K_{\nu}\Big(\frac{\sqrt{2\nu}}{\ell}\,|x - x'|\Big),$$
where $K_\nu$ is the modified Bessel function of the second kind of order $\nu$, and $\ell$ is the characteristic length scale.
Sample functions from Matérn forms are $\lceil\nu\rceil - 1$ times differentiable. Thus, the hyperparameter $\nu$ can control the degree of smoothness.
Special cases:
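(the commonly listed ones, with $r = |x - x'|$)
$\nu = 1/2$: $k(r) = \exp(-r/\ell)$, the exponential (Ornstein-Uhlenbeck) covariance
$\nu = 3/2$: $k(r) = (1 + \sqrt{3}\,r/\ell)\exp(-\sqrt{3}\,r/\ell)$
$\nu = 5/2$: $k(r) = (1 + \sqrt{5}\,r/\ell + 5r^2/(3\ell^2))\exp(-\sqrt{5}\,r/\ell)$
$\nu \to \infty$: $k(r) = \exp(-r^2/(2\ell^2))$, the squared exponential covariance.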
Matérn covariance functions II
Univariate Matrn covariance function with unit characteristic length scale and
unit variance:
[Figure: left panel, the Matérn covariance (e.g. for nu = 2) as a function of input distance; right panel, corresponding sample functions f(x) against input x.]
Periodic, smooth functions
To create a distribution over periodic functions of x, we can first map the inputs to $u = (\sin(x), \cos(x))^\top$, and then measure distances in the u space. Combined with the SE covariance function with characteristic length scale $\ell$, we get:
$$k_{\text{periodic}}(x, x') \;=\; \exp\!\big(-2\sin^2(\pi(x - x'))/\ell^2\big)$$
[Figure: two panels of sample periodic functions drawn from the GP with the periodic covariance.]
[Figure: atmospheric CO2 concentration (ppm) against year, 1960-2020.]
Covariance Function
The covariance function consists of several terms, parameterized by a total of 11
hyperparameters:
[Figure: CO2 concentration (ppm) against year, together with components of the fitted covariance function.]
Mean Seasonal Component
[Figure: contour plot of the mean seasonal component of CO2 (ppm) against month (J-D, horizontal) and year (1960-2020, vertical), with contours labelled between roughly -3 and 3.6.]
Conclusions
Many other models are (crippled) versions of GPs: Relevance Vector Machines
(RVMs), Radial Basis Function (RBF) networks, splines, neural networks.
More on Gaussian Processes