Probabilistic Numerics
Computation as Machine Learning
I Mathematical Background
1 Key Points
2 Probabilistic Inference
3 Gaussian Algebra
4 Regression
5 Gauss–Markov Processes: Filtering and SDEs
6 Hierarchical Inference in Gaussian Models
7 Summary of Part I
II Integration
8 Key Points
9 Introduction
10 Bayesian Quadrature
11 Links to Classical Quadrature
12 Probabilistic Numerical Lessons from Integration
13 Summary of Part II and Further Reading
22 Proofs
23 Summary of Part III
References
Index
Acknowledgements
Philipp Hennig
Michael A. Osborne
I would like to thank Isis Hjorth, for being the most valuable
source of support I have in life, and our amazing children
Osmund and Halfdan – I wonder what you will think of this
book in a few years?
Hans P. Kersting
Bold symbols (x) are used for vectors, but only where the fact that a variable is a vector is relevant. Square brackets indicate elements of a matrix or vector: if x = [x₁, …, x_N] is a row vector, then [x]ᵢ = xᵢ denotes its entries; if A ∈ ℝⁿˣᵐ is a matrix, then [A]ᵢⱼ = Aᵢⱼ denotes its entries. Round brackets (·) are used in most other cases (as in the notations listed below).
Notation Meaning

a ∝ c : a is proportional to c: there is a constant k such that a = k · c.
A ∧ B, A ∨ B : the logical conjunctions "and" and "or"; i.e. A ∧ B is true iff both A and B are true, A ∨ B is true iff ¬A ∧ ¬B is false.
A ⊗ B : the Kronecker product of matrices A, B. See Eq. (15.2).
A ⊛ B : the symmetric Kronecker product. See Eq. (19.16).
A ⊙ B : the element-wise product (aka Hadamard product) of two matrices A and B of the same shape, i.e. [A ⊙ B]ᵢⱼ = [A]ᵢⱼ · [B]ᵢⱼ.
A⃗, ♮A⃗ : A⃗ is the vector arising from stacking the elements of a matrix A row after row, and ♮ its inverse (A = ♮A⃗). See Eq. (15.1).
cov_p(x, y) : the covariance of x and y under p. That is, cov_p(x, y) := E_p(x · y) − E_p(x)E_p(y).
C^q(V, ℝᵈ) : the set of q-times continuously differentiable functions from V to ℝᵈ, for some q, d ∈ ℕ.
δ(x − y) : the Dirac delta, heuristically characterised by the property ∫ f(x)δ(x − y) dx = f(y) for functions f : ℝ → ℝ.
δᵢⱼ : the Kronecker symbol: δᵢⱼ = 1 if i = j, otherwise δᵢⱼ = 0.
det(A) : the determinant of a square matrix A.
diag(x) : the diagonal matrix with entries [diag(x)]ᵢⱼ = δᵢⱼ[x]ᵢ.
dω_t : the notation for an Itô integral in a stochastic differential equation. See Definition 5.4.
erf(x) : the error function erf(x) := (2/√π) ∫₀ˣ exp(−t²) dt.
E_p(f) : the expectation of f under p. That is, E_p(f) := ∫ f(x) dp(x).
E_{|Y}(f) : the expectation of f under p(f | Y).
Γ(z) : the Gamma function Γ(z) := ∫₀^∞ x^{z−1} exp(−x) dx. See Eq. (6.1).
G(·; a, b) : the Gamma distribution with shape a > 0 and rate b > 0, with probability density function G(z; a, b) := (bᵃ z^{a−1}/Γ(a)) e^{−bz}.
GP(f; µ, k) : the Gaussian process measure on f with mean function µ and covariance function (kernel) k. See §4.2.
H_p(x) : the (differential) entropy of the distribution p(x). That is, H_p(x) := −∫ p(x) log p(x) dx. See Eq. (3.2).
H(x | y) : the (differential) entropy of the conditional distribution p(x | y). That is, H(x | y) := H_{p(·|y)}(x).
I(x; y) : the mutual information between random variables X and Y. That is, I(x; y) := H(x) − H(x | y) = H(y) − H(y | x).
I, I_N : the identity matrix (of dimensionality N): [I]ᵢⱼ = δᵢⱼ.
I(· ∈ A) : the indicator function of a set A.
K_ν : the modified Bessel function for some parameter ν ∈ ℂ. That is, K_ν(x) := ∫₀^∞ exp(−x · cosh(t)) cosh(νt) dt.
L : the loss function of an optimization problem (§26.1), or the log-likelihood of an inverse problem (§41.2).
M : the model M capturing the probabilistic relationship between the latent object and computable quantities. See §9.3.
ℕ, ℂ, ℝ, ℝ₊ : the natural numbers (excluding zero), the complex numbers, the real numbers, and the positive real numbers, respectively.
N(x; µ, Σ) = p(x) : the vector x has the Gaussian probability density function with mean vector µ and covariance matrix Σ. See Eq. (3.1).
X ∼ N(µ, Σ) : the random variable X is distributed according to a Gaussian distribution with mean µ and covariance Σ.
O(·) : Landau big-Oh: for functions f, g defined on ℕ, the notation f(n) = O(g(n)) means that f(n)/g(n) is bounded as n → ∞.
p(y | x) : the conditional probability density function for variable Y having value y, conditioned on variable X having value x.
rk(A) : the rank of a matrix A.
span{x₁, …, xₙ} : the linear span of {x₁, …, xₙ}.
St(·; µ, λ₁, λ₂) : the Student's-t probability density function with parameters µ ∈ ℝ and λ₁, λ₂ > 0. See Eq. (6.9).
tr(A) : the trace of matrix A. That is, tr(A) = Σᵢ [A]ᵢᵢ.
A⊺ : the transpose of matrix A: [A⊺]ᵢⱼ = [A]ⱼᵢ.
U_{a,b} : the uniform distribution with probability density function p(u) := I(u ∈ (a, b)), for a < b.
V_p(x) : the variance of x under p. That is, V_p(x) := cov_p(x, x).
V_{|Y}(f) : the variance of f under p(f | Y). That is, V_{|Y}(f) := cov_{p(f|Y)}(f, f).
W(V, ν) : the Wishart distribution with probability density function W(x; V, ν) ∝ |x|^{(ν−N−1)/2} e^{−½ tr(V⁻¹x)}. See Eq. (19.1).
x ⊥ y : x is orthogonal to y, i.e. ⟨x, y⟩ = 0.
x := a : the object x is defined to be equal to a.
x ≜ a : the object x is equal to a by virtue of its definition.
x ← a : the object x is assigned the value of a (used in pseudo-code).
X ∼ p : the random variable X is distributed according to p.
1, 1_d : a column vector of d ones, 1_d := [1, …, 1]⊺ ∈ ℝᵈ.
∇ₓ f(x, t) : the gradient of f w.r.t. x. (We omit the subscript x if redundant.)
Introduction
will consume. There are almost always choices for the character
of an iteration, such as where to evaluate an integrand or an
objective function to be optimised. Not all iterations are equal,
and it takes an intelligent agent to optimise the cost–benefit
trade-off.
On a related note, a well-designed probabilistic numerical agent gives a reliable estimate of its own uncertainty over its result. This helps to reduce bias in subsequent computations. For instance, in ODE inverse problems, we will see how simulating the forward map with a probabilistic solver accounts for the tendency of numerical ODE solvers to systematically over- or underestimate solution curves. While this does not necessarily give a more precise ODE estimate (in the inner loop), it helps the inverse-problem solver to explore the parameter space more efficiently (in the outer loop). As these examples highlight, pn hence promises to make more effective use of computation.
bers:

Une question de probabilités ne se pose que par suite de notre ignorance : il n'y aurait place que pour la certitude si nous connaissions toutes les données du problème. D'autre part, notre ignorance ne doit pas être complète, sans quoi nous ne pourrions rien évaluer. Une classification s'opérerait donc suivant le plus ou moins de profondeur de notre ignorance.

Ainsi la probabilité pour que la sixième décimale d'un nombre dans une table de logarithmes soit égale à 6 est a priori de 1/10 ; en réalité, toutes les données du problème sont bien déterminées, et, si nous voulions nous en donner la peine, nous connaîtrions exactement cette probabilité. De même, dans les interpolations, …

[Roughly:] The need for probability only arises out of uncertainty: it has no place if we are certain that we know all aspects of a problem. But our lack of knowledge also must not be complete, otherwise we would have nothing to evaluate. There is thus a spectrum of degrees of uncertainty. While the probability for the sixth decimal digit of a number in a table of logarithms to equal 6 is 1/10 a priori, in reality, all aspects of the corresponding problem are well determined, and, if we wanted to make the effort, we could find out its exact value. The same holds for interpolation, for the integration methods of Cotes or Gauss, etc. (Emphasis in the op. cit.)
to wait over a decade. By then, the plot had thickened and authors in many communities became interested in Bayesian ideas for numerical analysis, among them Kadane and Wasilkowski (1985), Diaconis (1988), and O'Hagan (1992). Skilling (1991) even ventured boldly toward solving differential equations, displaying the physicist's willingness to cast aside technicalities in the name of progress. Exciting as these insights must have been for their authors, they seem to have missed fertile ground. The development also continued within mathematics, for example in the advancement of information-based complexity⁵ and average-case analysis.⁶ But the wider academic community, in particular users in computer science, seems to have missed much of it. Still, the advancements in computer science did pave the way for the second of the central insights of pn: that numerics requires thinking about agents.

⁵ Traub, Wasilkowski, and Woźniakowski (1983); Packel and Traub (1987); Novak (2006).
⁶ Ritter (2000).
This Book
research questions.
then the marginal distribution is given by the sum rule

$$p(y) = \int p(x, y)\,dx,$$

and the conditional distribution $p(x \mid y)$ for $x$ given that $Y = y$ is provided implicitly by the product rule

$$p(x \mid y)\,p(y) = p(x, y), \qquad (2.1)$$

whose terms are depicted in Figure 2.1. [Figure 2.1: Conceptual sketch of a joint probability distribution $p(x, y)$ over two variables $x, y$, with marginal $p(y)$ and a conditional $p(x \mid y)$.] The corollary of these two rules is Bayes' theorem, which describes how prior knowledge, combined with data generated according to the conditional density $p(y \mid x)$, gives rise to the posterior distribution on $x$:

$$\underbrace{p(x \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid x)}^{\text{likelihood}}\;\overbrace{p(x)}^{\text{prior}}}{\underbrace{\int p(y \mid x)\,p(x)\,dx}_{\text{evidence}}}.$$
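As a concrete illustration, the two rules and Bayes' theorem can be carried out numerically on a grid. The following is a minimal sketch (not code from the book), assuming a standard-normal prior and a Gaussian likelihood for a single observation:

```python
import numpy as np

# Grid over the latent variable x.
x = np.linspace(-5, 5, 2001)

# Prior p(x) = N(x; 0, 1).
prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Likelihood p(y | x) = N(y; x, sigma^2) for one observed value y.
y, sigma = 1.3, 0.5
likelihood = np.exp(-0.5 * ((y - x) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Evidence p(y) = integral of p(y | x) p(x) dx (sum rule applied to the product rule).
evidence = np.trapz(likelihood * prior, x)

# Posterior p(x | y) by Bayes' theorem.
posterior = likelihood * prior / evidence

# For this conjugate Gaussian model the posterior mean has the closed form
# y / (1 + sigma^2) = 1.3 / 1.25 = 1.04; the grid computation reproduces it.
print(np.trapz(x * posterior, x))
```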
rich analytic theory, and that computers are good at the basic linear operations – addition and multiplication. [Margin: More on this in a report by MacKay (2006).]

² The entropy,
$$H_p(x) := -\int p(x)\log p(x)\,dx, \qquad (3.2)$$
of the Gaussian is given by
$$H_{\mathcal{N}(x;\mu,\Sigma)}(x) = \frac{N}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2}\log|\Sigma|. \qquad (3.3)$$

In fact, the connection between linear functions and Gaussian distributions runs deeper: Gaussians are a family of probability distributions that are preserved under all linear operations. The following properties will be used extensively:
If a variable $x \in \mathbb{R}^D$ is normally distributed, then every affine transformation of it also has a Gaussian distribution (Figure 3.1):

$$\text{if } p(x) = \mathcal{N}(x; \mu, \Sigma) \text{ and } y := Ax + b \text{ for } A \in \mathbb{R}^{M \times D},\ b \in \mathbb{R}^M, \text{ then } p(y) = \mathcal{N}(y; A\mu + b,\ A\Sigma A^\intercal). \qquad (3.4)$$

The product of two Gaussian probability density functions is again a scaled Gaussian density:³

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B), \qquad (3.5)$$
$$\text{where } C := (A^{-1} + B^{-1})^{-1} \quad \text{and} \quad c := C(A^{-1}a + B^{-1}b).$$

³ This statement is about the product of two probability density functions. In contrast, the product of two Gaussian random variables is not a Gaussian random variable.
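These two identities are easy to verify numerically. The following sketch (using numpy/scipy; not code from the book) checks Eq. (3.4) by sampling and Eq. (3.5) by direct evaluation of the densities:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)

# Eq. (3.4): an affine map of Gaussian samples has mean A mu + b, covariance A Sigma A^T.
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
A, b = np.array([[2.0, 1.0]]), np.array([3.0])
y = rng.multivariate_normal(mu, Sigma, size=200_000) @ A.T + b
print(y.mean(0), A @ mu + b)            # empirical vs. predicted mean
print(np.cov(y.T), A @ Sigma @ A.T)     # empirical vs. predicted covariance

# Eq. (3.5): product of two Gaussian pdfs in x, checked at one test point.
a_, A_ = np.array([0.0]), np.array([[1.0]])
b_, B_ = np.array([1.0]), np.array([[2.0]])
C_ = np.linalg.inv(np.linalg.inv(A_) + np.linalg.inv(B_))
c_ = C_ @ (np.linalg.inv(A_) @ a_ + np.linalg.inv(B_) @ b_)
x0 = np.array([0.3])
lhs = mvn.pdf(x0, mean=a_, cov=A_) * mvn.pdf(x0, mean=b_, cov=B_)
rhs = mvn.pdf(x0, mean=c_, cov=C_) * mvn.pdf(a_, mean=b_, cov=A_ + B_)
print(lhs, rhs)                          # identical up to floating-point error
```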
then both the posterior and the marginal distribution for $y$ (the evidence) are Gaussian (Figure 3.2),

with $\tilde{\Sigma} := \left(\Sigma^{-1} + \Phi_X \Lambda^{-1}\Phi_X^\intercal\right)^{-1}$ and $\tilde{\mu} := \tilde{\Sigma}\left(\Sigma^{-1}\mu + \Phi_X \Lambda^{-1} y\right)$.

Using the matrix inversion lemma (Eq. (15.9)), these expressions …
[Figure 4.2: Odd-numbered rows: priors arising from µ = 0, with the marginal standard deviation $\operatorname{diag}(\Phi_x^\intercal \Sigma \Phi_x)^{1/2}$ for the choice Σ = I, and four samples drawn i.i.d. from the joint Gaussian over the function. Even-numbered rows: posteriors arising in these models from the observations shown in Figure 4.1. The feature functions giving rise to these eight different plots are the polynomials $\phi_i(x) = x^i$, $i = 0, \dots, 3$; the trigonometric functions $\phi_i(x) = \sin(x/i)$, $i = 1, \dots, 8$, and $\phi_i(x) = \cos(x/(i-8))$, $i = 9, \dots, 16$; as well as, for $i = -8, -7, \dots, 8$, the "switch" functions $\phi_i(x) = \operatorname{sign}(x - i)$; the "step" functions $\phi_i(x) = \mathbb{I}(x - i > 0)$; the linear functions $\phi_i(x) = |x - i|$; the first 13 Legendre polynomials (scaled to [−10, 10]); the absolute's exponential $\phi_i(x) = e^{-|x - i|}$; and the square exponential $\phi_i(x) = e^{-(x - i)^2}$. Panel titles: steps, linears, Legendre; exp-abs, exp-square, sigmoids.]
Generalised linear regression allows inference on real-valued functions over arbitrary input domains.

By varying the feature set Φ, broad classes of hypotheses can be created. In particular, Gaussian regression models can model nonlinear, discontinuous, even unbounded functions. There are literally no limitations on the choice of $\phi : \mathbb{X} \to \mathbb{R}$.

Neither the posterior mean nor the covariance over function values (Eq. (4.3)) contains "lonely" feature vectors, but only inner products of the form $k_{ab} := \Phi_a^\intercal \Sigma \Phi_b$ and $m_a := \Phi_a \mu$. For reasons to become clear in the following section, these quantities are known as the covariance function $k : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ and mean function $m : \mathbb{X} \to \mathbb{R}$, respectively.

[Margin: Exercise 4.1 (easy). Consider the likelihood of Eq. (4.2) with the parametric form for $f$ of Eq. (4.1). Show that the maximum-likelihood estimator for $w$ is given by the ordinary least-squares estimate $w_{\mathrm{ML}} = (\Phi_X \Phi_X^\intercal)^{-1}\Phi_X y$. To do so, use the explicit form of the Gaussian pdf to write out $\log p(y \mid X, w)$, take the gradient with respect to the elements $[w]_i$ of the vector $w$ and set it to zero. If you find it difficult to do this in vector notation, it may be helpful to write out $\Phi_X^\intercal w = \sum_i w_i [\Phi_X]_{i:}$, where $[\Phi_X]_{i:}$ is the $i$th column of $\Phi_X$. Calculate the derivative of $\log p(y \mid X, w)$ with respect to $w_i$, which is scalar.]
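The following sketch (not the book's code) implements this generalised linear regression for a hypothetical polynomial feature set, and shows that predictions only ever touch the features through the inner products $k_{ab}$ and projections $m_a$:

```python
import numpy as np

def phi(x):
    # Hypothetical feature set: polynomials phi_i(x) = x^i, i = 0, ..., 3.
    return np.stack([x**i for i in range(4)])        # shape (F, n)

rng = np.random.default_rng(1)
X = np.linspace(-2, 2, 10)                           # inputs
y = 0.5 * X**3 - X + rng.normal(0, 0.1, X.size)      # noisy observations

mu, Sigma = np.zeros(4), np.eye(4)                   # prior p(w) = N(mu, Sigma)
lam = 0.1**2                                         # noise covariance Lambda = lam * I
PhiX = phi(X)

# Everything below uses only k_ab = Phi_a^T Sigma Phi_b and m_a = Phi_a^T mu.
kXX = PhiX.T @ Sigma @ PhiX
G = kXX + lam * np.eye(X.size)
xs = np.linspace(-2, 2, 5)                           # test inputs
kxX = phi(xs).T @ Sigma @ PhiX
mean = phi(xs).T @ mu + kxX @ np.linalg.solve(G, y - PhiX.T @ mu)
cov = phi(xs).T @ Sigma @ phi(xs) - kxX @ np.linalg.solve(G, kxX.T)
print(mean)
print(np.sqrt(np.diag(cov)))                         # posterior marginal std-dev
```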
metric case studied in §4.1: recall from Eq. (4.3) that, for general linear regression using features $\phi$, the mean vector and covariance matrix of the posterior on function values do not contain isolated, explicit forms of the features. Instead, for finite subsets $a, b \subset \mathbb{X}$ it contains only projections and inner products, of the form

$$m_a := \Phi_a \mu \quad \text{and} \quad k_{ab} := \Phi_a^\intercal \Sigma \Phi_b = \sum_{ij=1}^F \phi_i(a)\,\phi_j(b)\,\Sigma_{ij}.$$
with scale $\lambda \in \mathbb{R}_+$ over the domain $[c_{\min}, c_{\max}] \subset \mathbb{R}$. That is, set

$$\phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{\lambda^2}\right), \qquad c_i = c_{\min} + \frac{i-1}{F}\,(c_{\max} - c_{\min}).$$

We also choose, for an arbitrary scale $\theta^2 \in \mathbb{R}_+$, the covariance

$$\Sigma = \frac{\sqrt{2}\,\theta^2 (c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, I,$$

and set $\mu = 0$. It is then possible to convince oneself¹⁰ that the limit of $F \to \infty$ and $c_{\min} \to -\infty$, $c_{\max} \to \infty$ yields

$$k(a, b) = \theta^2 \exp\left(-\frac{(a - b)^2}{2\lambda^2}\right). \qquad (4.4)$$

This is called a nonparametric formulation of regression, since the parameters $w$ of the model (regardless of whether their number is finite or infinite) are not explicitly represented in the computation. The function $k$ constructed in Eq. (4.4), if used in a Gaussian regression framework, assigns a covariance of full

¹⁰ Proof sketch, omitting technicalities: Using Eq. (3.4), we get
$$k_{ab} = \frac{\sqrt{2}\,\theta^2(c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\sum_{i=1}^F e^{-\frac{(a - c_i)^2}{\lambda^2}}\, e^{-\frac{(b - c_i)^2}{\lambda^2}} = \frac{\sqrt{2}\,\theta^2(c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, e^{-\frac{(a-b)^2}{2\lambda^2}} \sum_{i=1}^F e^{-\frac{\left(c_i - \frac{1}{2}(a+b)\right)^2}{\lambda^2/2}}.$$
In the limit of large $F$, the number of features in a region of width $\delta c$ converges to $F\cdot\delta c/(c_{\max} - c_{\min})$, and the sum becomes the Gaussian integral
$$k_{ab} = \frac{\sqrt{2}\,\theta^2}{\sqrt{\pi}\,\lambda}\, e^{-\frac{(a-b)^2}{2\lambda^2}} \int_{c_{\min}}^{c_{\max}} e^{-\frac{\left(c - \frac{1}{2}(a+b)\right)^2}{\lambda^2/2}}\, dc.$$
For $c_{\min} \to -\infty$, $c_{\max} \to \infty$, that integral converges to $\sqrt{\pi}\,\lambda/\sqrt{2}$.
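The limit in Eq. (4.4) can be checked empirically: with many features on a wide domain, the parametric $k_{ab}$ approaches the squared-exponential kernel. A small numerical sketch (parameter values assumed for illustration, not from the book):

```python
import numpy as np

theta, lam = 1.5, 0.7
cmin, cmax, F = -20.0, 20.0, 5000
c = cmin + (np.arange(1, F + 1) - 1) / F * (cmax - cmin)   # feature centres c_i

def k_features(a, b):
    # k_ab = Phi_a^T Sigma Phi_b with the Gaussian features and Sigma from above.
    scale = np.sqrt(2) * theta**2 * (cmax - cmin) / (np.sqrt(np.pi) * lam * F)
    return scale * np.sum(np.exp(-(a - c)**2 / lam**2) * np.exp(-(b - c)**2 / lam**2))

def k_se(a, b):
    # The limiting squared-exponential kernel of Eq. (4.4).
    return theta**2 * np.exp(-(a - b)**2 / (2 * lam**2))

print(k_features(0.3, -0.5), k_se(0.3, -0.5))   # agree to several digits
```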
$$[k_{XX}]_{ij} = k(x_i, x_j)$$

Proof. $v^\intercal (K + H)\,v = v^\intercal K v + v^\intercal H v > 0$. □

$$(k \odot h)(a, b) = k(a, b)\cdot h(a, b)$$
[Figure 4.3: Analogous plot to Figure 4.2, for Gaussian process regression with various kernels. From top left to bottom right, the prior processes are identified by the following kernels: the Wiener kernel (producing Brownian motion); …; the additive combination $k(a, b) = k_{\mathrm{SE}}(a, b) + \sum_{i=0}^2 (ab)^i$; and the point-wise product $k(a, b) = k_{\mathrm{SE}}(a, b)\cdot e^{-(a^2 + b^2)/4}$. As in Figure 4.2, these examples demonstrate the breadth of the model class. Panel titles: Wiener, linear splines, int. Wiener; Gauss, nonlin-scaled Gauss + Poly, Gauss ⊙ feature.]
defined by Eq. (4.6) is tightly bounded above by the associated GP posterior marginal variance for $\Lambda \to 0$:

$$\sup_{f \in \mathcal{H},\,\|f\| \le 1}\,(m_x - f_x)^2 = k_{xx} - k_{xX}\,K_{XX}^{-1}\,k_{Xx}.$$

[Margin: the Cauchy–Schwarz inequality, $|\langle f, g\rangle|^2 \le \langle f, f\rangle \cdot \langle g, g\rangle$, for $f, g \in \mathcal{H}$.]
$$\operatorname{cov}\left(\int_a^b f(x)\,d\nu(x),\ \frac{\partial^m f(x)}{\partial x_j^m}\right) = \int \frac{\partial^m k(x, x')}{\partial {x'}_j^{\,m}}\,d\nu(x),$$

and

$$\operatorname{cov}\left(f(x),\,\frac{\partial f(\tilde x)}{\partial \tilde x}\right) = \frac{\partial k_{\mathrm{SE}}(x, \tilde x)}{\partial \tilde x} = -\operatorname{cov}\left(\frac{\partial f(x)}{\partial x},\, f(\tilde x)\right) = \frac{x - \tilde x}{\lambda^2}\,k_{\mathrm{SE}}(x, \tilde x),$$

$$\operatorname{cov}\left(\frac{\partial f(x)}{\partial x},\,\frac{\partial f(\tilde x)}{\partial \tilde x}\right) = \frac{\partial^2 k_{\mathrm{SE}}(x, \tilde x)}{\partial x\,\partial \tilde x} = \left(\frac{1}{\lambda^2} - \frac{(x - \tilde x)^2}{\lambda^4}\right) k_{\mathrm{SE}}(x, \tilde x),$$

$$\operatorname{cov}\left(\frac{\partial^2 f(x)}{\partial x^2},\,\int_a^b f(\tilde x)\,d\tilde x\right) = \frac{1}{\lambda^2}\left[(x - b)\,k_{\mathrm{SE}}(x, b) - (x - a)\,k_{\mathrm{SE}}(x, a)\right],$$

$$\operatorname{cov}\left(\frac{\partial^2 f(x)}{\partial x^2},\, f(\tilde x)\right) = \left(\frac{(x - \tilde x)^2}{\lambda^4} - \frac{1}{\lambda^2}\right) k_{\mathrm{SE}}(x, \tilde x).$$
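Since differentiation is linear, such identities are conveniently checked by finite differences. A minimal sketch (not from the book) for the first of the squared-exponential identities above:

```python
import numpy as np

theta, lam = 1.0, 0.8
k_se = lambda x, xt: theta**2 * np.exp(-(x - xt)**2 / (2 * lam**2))

x, xt, h = 0.4, -0.3, 1e-6
fd = (k_se(x, xt + h) - k_se(x, xt - h)) / (2 * h)   # dk_SE/dxt by central differences
closed = (x - xt) / lam**2 * k_se(x, xt)             # (x - xt)/lam^2 * k_SE(x, xt)
print(fd, closed)                                     # agree to ~1e-9
```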
5
Gauss–Markov Processes
Filtering and Stochastic Differential Equations
$$p(x_t \mid x_0, x_1, \dots, x_{t-1}) = p(x_t \mid x_{t-1}). \qquad (5.1)$$
$$p(y_t \mid X) = p(y_t \mid x_t).$$
Then, we just insert Eq. (5.5) into Eq. (5.6),⁴ which yields

$$p(x_t \mid y) = p(x_t \mid y_{0:t}) \int \frac{p(x_{t+1} \mid y)}{p(x_{t+1} \mid y_{0:t})}\,p(x_{t+1} \mid x_t)\,dx_{t+1}. \qquad (5.7)$$

⁴ Kitagawa (1987).
So to compute the marginal p(x_t | y) for all 0 ≤ t ≤ T, using the notion of time, we can think of this as first performing a forward pass as described above to compute the predictions p(x_t | y_{0:t−1}) and updated beliefs p(x_t | y_{0:t}). The final update step in that pass provides p(x_T | y), from which we can start a backward pass to compute the posterior marginals p(x_t | y) using Eq. (5.7) and all terms computed in the forward pass. This nomenclature of passing messages along the Markov chain is popular in statistics and machine learning, and applies more generally to tree-structured graphs (i.e. not just chains).⁵ In signal processing, where time series play a particularly central role, the forward pass is known as filtering, while the backward updates are known as smoothing. Figure 5.2 depicts the output of these methods. The next section spells out their computations in the case when the underlying state-space model is linear and Gaussian.

⁵ Bishop (2006), §8.4.2.
[Figure 5.2: A time series with predicted, observed, estimated, and smoothed beliefs: the prediction, the intermediate estimation (filtering) posterior (5.3), and the smoothed posterior (5.14). If the state-space model is linear and Gaussian, the estimation and smoothed posteriors are computable by the Kálmán filter and smoother, respectively. At each location, the variance drops from prediction to estimation belief, and from estimation to smoothed belief.]
$$p(x_0) = \mathcal{N}(x_0; m_0, P_0), \qquad (5.8)$$
$$p(x_{t+1} \mid x_t) = \mathcal{N}(x_{t+1}; A_t x_t, Q_t),$$
$$p(y_t \mid x_t) = \mathcal{N}(y_t; H_t x_t, R_t). \qquad (5.9)$$

The latter two relations are also often written as, and known as, the dynamic model and the measurement model, respectively:
The update, Eq. (5.3), becomes

$$p(x_t \mid y_{0:t}) = \mathcal{N}(x_t; m_t, P_t),$$

$$z_t := y_t - H_t m_t^- \quad \text{(innovation residual)}, \qquad (5.12)$$
$$S_t := H_t P_t^- H_t^\intercal + R_t \quad \text{(innovation covariance)},$$
$$K_t := P_t^- H_t^\intercal S_t^{-1} \quad \text{(gain)},$$
$$m_t := m_t^- + K_t z_t,$$
$$P_t := (I - K_t H_t)\,P_t^-. \qquad (5.13)$$

[Margin: Exercise 5.3 (easy). Using the basic properties of Gaussians from Eqs. (3.4), (3.8) & (3.10) and the prediction-update Eqs. (5.2) & (5.3), show that Eqs. (5.10) to (5.13) hold.]
1 procedure Smoother(m_t, P_t, A, m⁻_{t+1}, P⁻_{t+1}, mˢ_{t+1}, Pˢ_{t+1})
2   G = P_t A⊺ (P⁻_{t+1})⁻¹                     gain
3   mˢ_t = m_t + G(mˢ_{t+1} − m⁻_{t+1})          posterior mean
4   Pˢ_t = P_t + G(Pˢ_{t+1} − P⁻_{t+1})G⊺        posterior covariance
5 end procedure

Algorithm 5.2: Single step of the RTS smoother. Notation as in Algorithm 5.1. The smoother, since it does not actually touch the observations y_t, has complexity O(L³).
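In code, a filter step of Eqs. (5.12)-(5.13) followed by the smoother recursion of Algorithm 5.2 takes only a few lines. The following is a self-contained sketch (assuming a time-invariant model A, Q, H, R; not the book's reference implementation):

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Forward pass; returns filtered and predicted moments."""
    m, P = m0, P0
    ms, Ps, mps, Pps = [], [], [], []
    for y in ys:
        mp, Pp = A @ m, A @ P @ A.T + Q                # prediction
        z = y - H @ mp                                  # innovation residual
        S = H @ Pp @ H.T + R                            # innovation covariance
        K = Pp @ H.T @ np.linalg.inv(S)                 # gain
        m, P = mp + K @ z, Pp - K @ S @ K.T             # update
        ms.append(m); Ps.append(P); mps.append(mp); Pps.append(Pp)
    return ms, Ps, mps, Pps

def rts_smoother(ms, Ps, mps, Pps, A):
    """Backward pass (Algorithm 5.2); never touches the observations."""
    sm, sP = [ms[-1]], [Ps[-1]]
    for t in range(len(ms) - 2, -1, -1):
        G = Ps[t] @ A.T @ np.linalg.inv(Pps[t + 1])           # gain
        sm.insert(0, ms[t] + G @ (sm[0] - mps[t + 1]))        # posterior mean
        sP.insert(0, Ps[t] + G @ (sP[0] - Pps[t + 1]) @ G.T)  # posterior covariance
    return sm, sP

# Toy model: noisy observations of a scalar random walk.
rng = np.random.default_rng(0)
A, Q, H, R = np.eye(1), 0.1 * np.eye(1), np.eye(1), 0.5 * np.eye(1)
ys = [np.array([v]) for v in np.cumsum(rng.normal(0, 0.3, 20))]
ms, Ps, mps, Pps = kalman_filter(ys, A, Q, H, R, np.zeros(1), np.eye(1))
sm, sP = rts_smoother(ms, Ps, mps, Pps, A)
```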
These update rules are often simply called the "Kálmán smoother" or, more precisely, the (fixed interval) Rauch–Tung–Striebel (RTS) smoother equations.⁷ The estimates computed by the Kálmán filter and smoother are depicted in Figure 5.2.

⁷ Rauch, Striebel, and Tung (1965).

For reference, Algorithms 5.1 and 5.2 summarise the above results in pseudo-code, providing the individual steps for both the filter and the smoother. The wrapper algorithm Predict (Algorithm 5.3) solves the task of continuously predicting the subsequent state of a time series. For a finite time series running from t = 0 to t = T, the algorithm Infer (Algorithm 5.4) returns posterior marginals for every time step t. Since these marginals (that is, their means and variances) are exactly equal to those of GP regression (§4.2.2), algorithm Infer (Algorithm 5.4) is nothing but a linear-time implementation of GP regression with Markov priors.⁸

⁸ Särkkä and Solin (2019), §12.4.

The Kálmán filter (and smoother) are so efficient that they can sometimes be applied even if the linearity or Gaussianity of the dynamic or measurement model are violated.⁹ In numerics, the case for such fast-and-Gaussian methods is even stronger

⁹ Särkkä (2013), §13.1.
For a moment, let us put aside the observations y and only consider the prior defined by the dynamic model from Eq. (5.11). The predictive means of this sequence of Gaussian variables follow the discrete linear recurrence relation $m_{t+1} = A_t m_t$. When solving numerical tasks, the time instances t will not usually be immutable discrete locations on a regular grid, but values chosen by the numerical method itself, on a continuous spectrum. We thus require a framework creating continuous curves $x(t)$ for $t > t_0 \in \mathbb{R}$ that are consistent with such linear recurrence equations. For deterministic quantities, linear differential equations are that tool: consider the linear (time-invariant) dynamical system for $x(t) \in \mathbb{R}^N$,

$$\frac{dx(t)}{dt} = F x(t); \quad \text{and assume } x(t_0) = x_0. \qquad (5.16)$$

¹¹ This uses the matrix exponential
$$e^X \equiv \exp(X) := \sum_{i=0}^\infty \frac{X^i}{i!}, \qquad (5.17)$$
where $X^i := X \cdots X$ ($i$ times) is the $i$th power of $X$ (defining $X^0 = I$). The exponential exists for every complex-valued matrix. Among its properties are: $e^0 = I$ (for the zero matrix); if $XY = YX$, then $e^X e^Y = e^Y e^X = e^{X+Y}$. Thus, every exponential is invertible: $\left(e^X\right)^{-1} = e^{-X}$.
=: k(t, t′ ).
$$p(x_{t_{i+1}} \mid x_{t_i}) = \mathcal{N}\left(x_{t_{i+1}};\ A_{t_i} x_{t_i},\ Q_{t_i}\right), \qquad (5.20)$$

with

$$A_{t_i} := \exp\left(F(t_{i+1} - t_i)\right) \qquad \text{and} \qquad Q_{t_i} := \int_0^{t_{i+1} - t_i} e^{F\tau} L L^\intercal e^{F^\intercal\tau}\,d\tau. \qquad (5.21)$$
This structure¹⁶ ensures that the elements of $x$ are derivatives of each other. So they can be interpreted as derivatives of a function $f:\mathbb{R} \to \mathbb{R}$:

$$x(t) = \begin{bmatrix} f(t) & f'(t) & f''(t) & \cdots & f^{(q)}(t) \end{bmatrix}. \qquad (5.23)$$

¹⁶ [Margin:]
$$[\tilde{Q}]_{ij} = \frac{\theta^2 h^{2q+1}}{(2q + 3 - i - j)\,(q + 1 - i)!\,(q + 1 - j)!\,(i - 1)!\,(j - 1)!}.$$
This can help both with numerical stability and efficient implementations.
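The discretisation Eq. (5.21) for this q-times integrated Wiener process is easy to evaluate in code. The sketch below (not the book's code) uses the nilpotent shift matrix as drift, so that the state collects the derivatives of Eq. (5.23); the closed-form Q used for comparison is the unscaled analogue of the rescaled Q̃ in the margin note:

```python
import numpy as np
from math import factorial
from scipy.linalg import expm

q, h, theta = 2, 0.5, 1.0
d = q + 1
F = np.diag(np.ones(q), k=1)                  # x_i' = x_{i+1}; last entry noise-driven
L = np.zeros((d, 1)); L[-1, 0] = theta

A = expm(F * h)                               # A = exp(F h), Eq. (5.21)
# Q = integral_0^h e^{F tau} L L^T e^{F^T tau} d tau, here by numerical quadrature.
taus = np.linspace(0.0, h, 2001)
Q = np.trapz(np.stack([expm(F * t) @ L @ L.T @ expm(F * t).T for t in taus]),
             taus, axis=0)

# Closed form: [Q]_ij = theta^2 h^{2q+3-i-j} / ((2q+3-i-j)(q+1-i)!(q+1-j)!), 1-indexed.
Qc = np.array([[theta**2 * h**(2*q + 3 - i - j)
                / ((2*q + 3 - i - j) * factorial(q + 1 - i) * factorial(q + 1 - j))
                for j in range(1, d + 1)] for i in range(1, d + 1)])
print(np.max(np.abs(Q - Qc)))                 # small quadrature error
```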
[Figure: Gauss–Markov inference on a time series with observation noise. Top: filtering. Predictive means in solid black, two standard deviations in thin black. Three joint samples from the predictive distribution, constructed in linear time during the prediction run, are plotted in thin black, also. Bottom: posterior distribution after smoothing (filtering distribution in grey for comparison). The posterior samples (which are indeed valid draws from the joint posterior) are produced by taking the samples from the predictive distribution and scaling/shifting them (deterministically) during the smoothing run; they do not involve additional random numbers.]
The SDE

$$dx = -\frac{x}{\lambda}\,dt + \frac{\sqrt{2}\,\theta}{\sqrt{\lambda}}\,d\omega_t$$

with $x(t_0) = x_0$ yields, with Eqs. (5.18) and (5.19),

$$\mathbb{E}(x(t)) = x_0\, e^{-\frac{t - t_0}{\lambda}} \qquad \text{and} \qquad k(t, t') = \theta^2\left(e^{-\frac{|t - t'|}{\lambda}} - e^{-\frac{t + t' - 2t_0}{\lambda}}\right).$$
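This Ornstein–Uhlenbeck process can be simulated exactly on any grid, because its transition densities are Gaussian. A minimal sketch (parameter values assumed for illustration, not from the book):

```python
import numpy as np

lam, theta, h, x0 = 2.0, 0.7, 0.01, 1.5
rng = np.random.default_rng(3)

# Exact transition: x_{t+h} | x_t ~ N(e^{-h/lam} x_t, theta^2 (1 - e^{-2h/lam})).
a = np.exp(-h / lam)
s = np.sqrt(theta**2 * (1 - a**2))
xs = [x0]
for _ in range(1000):
    xs.append(a * xs[-1] + s * rng.normal())
# The mean over many such paths decays like E(x(t)) = x0 e^{-(t - t0)/lam}.
```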
Assume the real numbers $y := [y_i]_{i=1,\dots,N}$ are drawn i.i.d. from a Normal distribution with unknown mean $\alpha$ and variance $\beta^2$,

$$p(y \mid \alpha, \beta) = \prod_{i=1}^N \mathcal{N}(y_i; \alpha, \beta^2) = \mathcal{N}(y; \alpha 1, \beta^2 I).$$
A prior η with this property is called conjugate to the likelihood p(y | α, β). For the present case of a Gaussian likelihood with latent mean and variance, the standard choice of conjugate prior assigns a Gaussian distribution of mean µ₀ and variance proportional to β² (with scale λ₀) to α, and a Gamma distribution with parameters a₀, b₀ to the inverse of β². This is known as the Gauss-Gamma, or Gauss-inverse-Gamma, prior, and has the hyperparameters θ₀ := [µ₀, λ₀, a₀, b₀]:

$$\eta(\alpha, \beta \mid \mu_0, \lambda_0, a_0, b_0) = p(\alpha \mid \beta, \mu_0, \lambda_0)\,p(\beta \mid a_0, b_0) = \mathcal{N}\left(\alpha;\ \mu_0,\ \frac{\beta^2}{\lambda_0}\right)\mathcal{G}(\beta^{-2}; a_0, b_0),$$

where

$$\mathcal{G}(z; a, b) := \frac{b^a z^{a-1}}{\Gamma(a)}\, e^{-bz}.$$

Here, $\mathcal{G}(\cdot; a, b)$ is the Gamma distribution with shape a > 0 and rate b > 0. The normalisation constant of the Gamma distribution, and also the source of the distribution's name, is the Gamma function Γ.³ To compute the posterior, we multiply prior and likelihood, and identify

$$p(\alpha, \beta \mid y) \propto p(y \mid \alpha, \beta)\,p(\alpha, \beta) \qquad (6.2)$$
$$= \mathcal{N}(y; 1\alpha, \beta^2 I)\,\mathcal{N}(\alpha; \mu_0, \beta^2/\lambda_0)\,\mathcal{G}(\beta^{-2}; a_0, b_0).$$

³ In the context of this text, it is a neat marginal observation that, while Legendre was interested in problems of chance, Euler's original motivation for considering those integrals, in an exchange with Goldbach, was one of interpolation: he was trying to find a "simple" smooth function connecting the factorial function on the reals. And indeed, $\Gamma(n) = (n - 1)!$ for all $n \in \mathbb{N}\setminus 0$. Legendre is also to blame for this unsightly shift in the function's argument, since he constructed Eq. (6.1) by rearranging Euler's more direct result $n! = \int_0^1 (-\log x)^n\,dx$. A great exposition on this story and the Gamma function can be found in a Chauvenet-prize-decorated article by Davis (1959). It is left as a research exercise to the reader to consider in which sense Euler's answer to this interpolation problem is natural, in particular from a probabilistic-numerics standpoint (that is, which prior assumptions give rise to the Gamma function as an interpolant of the factorials, whether there are other meaningful priors yielding other interpolations).
We first deal with the Gaussian part, using Eq. (3.5) (and some simple vector arithmetic) to re-arrange this expression as

$$\mathcal{G}(\beta^{-2}; a_0, b_0)\,\mathcal{N}\left(y;\ 1\mu_0,\ \beta^2\left(I + \frac{1\cdot 1^\intercal}{\lambda_0}\right)\right)\cdot\mathcal{N}\left(\alpha;\ \frac{\lambda_0\mu_0 + \sum_i y_i}{\lambda_0 + N},\ \frac{\beta^2}{\lambda_0 + N}\right). \qquad (6.3)$$

The second Gaussian expression is evidently a Gaussian over α, and the first one does not depend on α. To deal with the first part, we use the matrix inversion lemma, Eq. (15.9), to rewrite

$$\left(I + \frac{1\cdot 1^\intercal}{\lambda_0}\right)^{-1} = I - \frac{1\cdot 1^\intercal}{N + \lambda_0}.$$

This allows writing the first line of Eq. (6.3) more explicitly (leaving out normalisation constants independent of β) as

$$\mathcal{G}(\beta^{-2}; a_0, b_0)\cdot\mathcal{N}\left(y;\ \mu_0,\ \beta^2\left(I + \frac{1\cdot 1^\intercal}{\lambda_0}\right)\right)$$
$$\propto \left(\frac{1}{\beta^2}\right)^{a_0 - 1}\exp\left(-\frac{b_0}{\beta^2}\right)\cdot\left(\frac{1}{\beta^2}\right)^{N/2}\exp\left(-\frac{1}{2\beta^2}\left(\sum_{i=1}^N (y_i - \mu_0)^2 - \frac{\left(\sum_{i=1}^N y_i - N\mu_0\right)^2}{\lambda_0 + N}\right)\right). \qquad (6.4)$$

$$\frac{\lambda_0 N}{\lambda_0 + N}\,(\bar{\alpha} - \mu_0)^2.$$
Intuitively, this term corrects for the fact that β̄2 is a biased
estimate of the variance – the sample mean is typically closer to
the samples than the actual mean is, and this bias depends on
how far the initial estimate µ0 is from the correct mean.
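Collecting the terms derived above gives the usual conjugate update of the four hyperparameters. The following helper (a sketch consistent with the text, not copied from the book) computes them from data:

```python
import numpy as np

def gauss_gamma_posterior(y, mu0, lam0, a0, b0):
    """Posterior hyperparameters of the Gauss-Gamma model for i.i.d. data y."""
    N, ybar = len(y), float(np.mean(y))
    mu_N = (lam0 * mu0 + N * ybar) / (lam0 + N)
    lam_N = lam0 + N
    a_N = a0 + N / 2
    # b_N contains the bias-correction term lam0*N/(lam0+N) * (ybar - mu0)^2 above.
    b_N = (b0 + 0.5 * np.sum((y - ybar)**2)
           + 0.5 * lam0 * N / (lam0 + N) * (ybar - mu0)**2)
    return mu_N, lam_N, a_N, b_N

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=50)
print(gauss_gamma_posterior(y, mu0=0.0, lam0=1.0, a0=2.0, b0=2.0))
```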
In applications our main interest will usually not be in the posterior distribution over the parameters (α, β), but predictions of the subsequent sample $y_{N+1}$. Under the hierarchical Gauss-Gamma model, this posterior distribution is the marginal distribution reached by integrating out the posterior over the hyperparameters,

$$p(y_{N+1} \mid y) = \int_{-\infty}^\infty \int_0^\infty p(y_{N+1} \mid y, \alpha, \beta)\,p(\alpha, \beta \mid y)\,d\alpha\,d\beta.$$

The integral over α is easy, as it is an instance of the basic Gaussian properties above. The remaining integral over β is over the product of a Gaussian and several Gamma terms, which is given by a famous result known as⁵ Student's-t distribution:

$$\operatorname{St}\left(y_{N+1};\ \mu_N,\ \frac{a_N}{b_N},\ 2a_N\right) := \int_0^\infty \mathcal{N}(y_{N+1}; \mu_N, \beta^2)\,\mathcal{G}(\beta^{-2}; a_N, b_N)\,d\beta \qquad (6.9)$$
$$= \frac{\Gamma(a_N + \tfrac{1}{2})}{\Gamma(a_N)}\,\frac{b_N^{a_N}}{\sqrt{2\pi}}\left(b_N + \frac{(y_{N+1} - \mu_N)^2}{2}\right)^{-a_N - 1/2}.$$

⁵ An alternative contender for the invention of this distribution is the German geodesist F. R. Helmert, but the endearing story of the statistician Gosset, writing under the pseudonym "Student" (1908) so as not to violate a non-disclosure agreement with the Guinness Brewery, his employer, has stuck. In fact, Helmert's 1875 letter to the Zeitschrift für Mathematik und Physik is an entertaining read, too. It contains a dressing-down of R. A. Mees (Professor of Physics at Göttingen) who, in an article published in the same journal some months prior (pp. 145 ff.), apparently misread a discussion, by Helmert, of Gauss' work on estimation of errors. The letter opens with the sentence "The point of the following is to show that Mr. Mees errs, not just in his evaluation of my work, but throughout his entire essay."
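Eq. (6.9) is easy to confirm numerically: integrating the Gaussian against the Gamma posterior over β⁻² reproduces scipy's Student-t density with 2a_N degrees of freedom, location µ_N, and scale √(b_N/a_N). A sketch (values assumed for illustration, not from the book):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu_N, a_N, b_N, y_next = 1.0, 3.0, 2.0, 0.4

def integrand(tau):
    # tau = beta^{-2}; N(y; mu_N, 1/tau) * G(tau; a_N, b_N) (rate b_N -> scale 1/b_N).
    return (stats.norm.pdf(y_next, mu_N, np.sqrt(1 / tau))
            * stats.gamma.pdf(tau, a_N, scale=1 / b_N))

numeric, _ = quad(integrand, 0, np.inf)
closed = stats.t.pdf(y_next, df=2 * a_N, loc=mu_N, scale=np.sqrt(b_N / a_N))
print(numeric, closed)    # agree
```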
Gauss-inverse-Wishart distribution

$$\Gamma_M(x) := \pi^{M(M-1)/4}\prod_{j=1}^M \Gamma\left(x + \frac{1 - j}{2}\right).$$

$$\mu_N = \frac{\lambda_0\mu_0 + N\bar{\alpha}}{\lambda_0 + N}, \qquad \lambda_N = \lambda_0 + N, \qquad \nu_N = \nu_0 + N,$$
$$W_N = W_0 + N\bar{B} + \frac{\lambda_0 N}{\lambda_0 + N}\,(\bar{\alpha} - \mu_0)(\bar{\alpha} - \mu_0)^\intercal,$$
$$\bar{\alpha} = \frac{1}{N}\sum_i y_i, \qquad \bar{B} = \frac{1}{N}\sum_i (y_i - \bar{\alpha})(y_i - \bar{\alpha})^\intercal.$$

⁷ When comparing Eq. (6.12) to Eq. (6.9), an unexpected factor of 2 shows up here and there. This comes from differing definitions of the "counters" a_N and ν_N, which in turn are due to the standard definition of the Gamma distribution.

$$p(y \mid \theta) = \prod_{i=1}^N \mathcal{N}\left(y_i;\ Hm_i^-,\ \theta^2 H\tilde{P}_i^- H^\intercal\right),$$
$$p(\theta) = \mathcal{G}(\theta^{-2}; \alpha_0, \beta_0),$$
▶ 9.1 Motivation

But integration is also an operation with long history, and …

[Figure: the integrand f(x) over x ∈ [−2, 2].]

$$\hat{F} := \frac{1}{N}\sum_{i=1}^N w(x_i), \qquad (9.4)$$
using the function $w(x) := f(x)/p(x)$, which is well-defined by our above assumptions on $p(x)$. The logic of this procedure is depicted in Figure 9.2.

[Figure 9.2: Monte Carlo integration. Samples are drawn from the Gaussian measure p(x) (unnormalised measure as dashed line, samples as black dots), and the ratio w(x) = f(x)/p(x) is evaluated for each sample. The histogram plotted vertically on the left (arbitrary scale) shows the resulting distribution $p(w) = \int w(x)\,dp(x)$. Its expected value, times the normalisation $\operatorname{erf}(3)\sqrt{\pi}$, is the true integral. Its standard deviation determines the scale for convergence of the Monte Carlo estimate.]

Note that, in contrast to F, the estimator F̂ is a random number. By constructing F̂, we have turned the problem (9.1) – inferring an uncertain but unique deterministic number – into a stochastic, statistical one. This introduction of external stochasticity introduces a different form of uncertainty, the aleatory kind, a lack of knowledge arising from randomness (more in §12.3).
Because we have full control over the nature of the random num-
bers, it is possible to offer a quite precise statistical analysis of
F̂:
Lemma 9.2. If F is integrable, the estimator F̂ is unbiased. Its variance is

$$\operatorname{var}(\hat{F}) = \frac{1}{N}\operatorname{var}_p(w), \qquad (9.5)$$
assuming $\operatorname{var}_p(w)$ exists. Hence, the standard deviation (the square root of the variance, which is a measure of expected error) drops as $O(N^{-1/2})$ – the convergence rate of Monte Carlo integration.

[Margin: Proof of Lemma 9.2. F̂ is unbiased because its expected value is
$$\langle \hat{F} \rangle = \frac{1}{N}\sum_i \int_a^b w(x_i)\,p(x_i)\,dx_i = F,$$
given that the draws $x_i$ are i.i.d., and assuming that the $w(\cdot)$ function is known (more on this later). As F̂ is a linear combination of i.i.d. random variables, the variance of F̂ is immediately a linear combination of the respective variances:
$$\operatorname{var}(\hat{F}) = \frac{\sum_i \operatorname{var}_p(w_i)}{N^2} = \frac{\operatorname{var}_p(w)}{N}.$$]

This is a strong statement given the simplicity of the algorithm it analyses: random numbers from almost any measure allow estimating the integral over any integrable function. This integrator is "good" in the sense that it is unbiased, and its error drops at a known rate.⁷ The algorithm is also relatively cheap: it involves drawing N random numbers, evaluating f(x) once for each sample, and summing over the results. Given all these strong properties, it is no surprise that Monte Carlo methods have become a standard tool for integration. However, there is a price for this simplicity and generality: as we will see in the next section, the $O(N^{-1/2})$ convergence rate is far from the best possible rate. In fact, we will find ourselves arguing that it is the worst possible convergence rate amongst sensible integration algorithms.

⁷ The multiplicative constant $\operatorname{var}_p(w)$ can even be estimated at runtime! Albeit usually not without bias. See also Exercise 9.3 for some caveats.
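Lemma 9.2 is straightforward to see in action. The following sketch (a toy example, not from the book) estimates F = ∫₀¹ x² dx = 1/3 and shows the mean absolute error shrinking roughly like N^(−1/2):

```python
import numpy as np

rng = np.random.default_rng(0)
F_true = 1 / 3                          # integral of x^2 on [0,1]; p = Unif(0,1), w(x) = x^2
for N in [10, 100, 1000, 10_000, 100_000]:
    errors = [abs(np.mean(rng.uniform(0, 1, N)**2) - F_true) for _ in range(200)]
    print(N, np.mean(errors))           # error drops by roughly sqrt(10) per row
```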
[Margin: Exercise 9.3 (moderate, solution on p. 361). One of the few assumptions in Lemma 9.2 is the existence of $\operatorname{var}_p(w)$. Try to find an example of a simple pair of integrand f and measure p for which this assumption is violated.]

Monte Carlo is not limited to problems where samples can be drawn exactly. Where exact sampling from a distribution is difficult, Monte Carlo is often practically realised through Markov-Chain Monte Carlo (mcmc). These iterative methods do not generally achieve the $O(N^{-1/2})$ convergence rate, but they can still be shown to be consistent, meaning that their estimate of the integral asymptotically converges to its true value.
The model thus encodes assumptions, not just over the inte-
grand, but also over its relationship to the numbers being
computed to estimate it.
Actions
▶ 10.1 Models
posterior on F:

$$p(F \mid Y) = \mathcal{N}\left(F;\ \int_{\mathbb{X}} \left(m_x + \tilde{k}_{xX}\tilde{k}_{XX}^{-1}(Y - m_X)\right)dx,\ \ \theta^2 \iint_{\mathbb{X}} \left(\tilde{k}_{xx'} - \tilde{k}_{xX}\tilde{k}_{XX}^{-1}\tilde{k}_{Xx'}\right)dx\,dx'\right).$$

$$k(x, x') = \theta^2\,\mathcal{N}(x; x', \lambda^2), \qquad \nu(x) = \mathcal{N}(x; \mu, \sigma^2),$$

where

$$m = k_X^\intercal k_{XX}^{-1}Y, \qquad v = K - k_X^\intercal k_{XX}^{-1}k_X.$$

$$\nu(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}; \mu, \Sigma), \qquad k(\boldsymbol{x}, \boldsymbol{x}') = \theta^2\,\mathcal{N}(\boldsymbol{x}; \boldsymbol{x}', \Lambda), \qquad (10.10)$$
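For this Gaussian pair, both the kernel mean and the initial error are available in closed form (via the Gaussian product identity, Eq. (3.5)), so the whole quadrature rule is a few lines. A sketch with assumed parameter values (not the book's code):

```python
import numpy as np
from scipy.stats import norm

theta, lam = 1.0, 0.5        # kernel k(x, x') = theta^2 N(x; x', lam^2)
mu, sigma = 0.0, 1.0         # measure nu(x) = N(x; mu, sigma^2)
f = np.cos                   # integrand; ground truth E[cos(x)] = e^{-1/2} ~ 0.6065

X = np.linspace(-3, 3, 12)   # evaluation nodes
Y = f(X)
kXX = theta**2 * norm.pdf(X[:, None], X[None, :], lam)
# Kernel mean: integral of k(x, x_i) d nu(x) = theta^2 N(x_i; mu, lam^2 + sigma^2).
kX = theta**2 * norm.pdf(X, mu, np.sqrt(lam**2 + sigma**2))
# Initial error: double integral of k d nu d nu = theta^2 N(0; 0, lam^2 + 2 sigma^2).
K = theta**2 * norm.pdf(0.0, 0.0, np.sqrt(lam**2 + 2 * sigma**2))

m = kX @ np.linalg.solve(kXX, Y)          # posterior mean for F
v = K - kX @ np.linalg.solve(kXX, kX)     # posterior variance
print(m, v)
```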
X | ν | k | Reference
[0, 1]^d | Unif(X) | Wendland TP | Oates et al. (2019a)
[0, 1]^d | Unif(X) | Matérn Weighted TP | Briol et al. (2019)
[0, 1]^d | Unif(X) | Gaussian | Use of error function
ℝ^d | Mix. of Gaussians | Gaussian | Kennedy (1998)
S^d | Unif(X) | Gegenbauer | Briol et al. (2019)
Arbitrary | Unif(X) or mix. of Gauss. | Trigonometric | Integration by parts
Arbitrary | Unif(X) | Splines | Wahba (1990)
Arbitrary | Known moments | Polynomial TP | Briol et al. (2015)
Arbitrary | Known ∂ log ν(x) | Gradient-based kernel | Oates, Girolami, and Chopin (2017); Oates et al. (2019a)

Table 10.1: A non-exhaustive list of distribution ν and kernel k pairs that provide a closed-form expression for both the kernel mean ∫ k(·, x) dx and the initial error K from Eq. (10.7). Here TP refers to the tensor product of one-dimensional kernels, and S^d indicates the d-sphere {x = (x₁, …, x_{d+1}) ∈ ℝ^{d+1} : ∥x∥₂ = 1}. This table is adapted from Briol et al. (2019).

(see Table 11.1), but choosing between them is a less intuitive process, and adding to them is even harder, because there are only analytical, rather than constructive, means to do so without the interpretation of a prior. Interpretability is a key strength of pn.
$$X = \arg\min_{\tilde{X} \in \mathbb{R}^N} v(\tilde{X}).$$

Designing the optimal grid for such a rule, even for regular kernels, can be challenging, because the corresponding multivariate optimisation problem can, in general, have high computational complexity.⁵ However, instead of finding an optimal grid, one can also sample, at cost O(N³), a draw from the N-determinantal point process (dpp) associated with k. Results by Bardenet and Hardy (2019) suggest that doing so causes only limited decrease in performance over the optimal deterministic grid design (which amounts to the maximum a posteriori assignment under the N-dpp). Belhadji, Bardenet, and Chainais (2019) offer further support for the use of determinantal point process sampling for integrands known to live in an rkhs.⁶

⁶ The point of these papers does not disagree with our general argument, in §12.3, against the use of random sampling. Rather, these results show that allowing minor deviations from the optimal design can drastically reduce computational complexity at negligible decrease in performance. The necessary "samples" can even be drawn in a deterministic way.
the most desirable model (a log-gp) and the most desirable loss function. However, bbq employs a first-order approximation to the exponential function, along with the maintenance of a set of candidate points x_c at which to refine the approximation. Such approximation proves both highly computationally demanding and to express only weakly the prior knowledge of the large dynamic range of the integrand.

The first practical adaptive Bayesian quadrature algorithm was wsabi,⁸ which adopts another means of expressing the non-negativity of an integrand: the square-root of the integrand f(x) (minus a constant, α ∈ ℝ) is modelled with a gp. Precisely,

$$f(x) = \alpha + \frac{1}{2}\tilde{f}(x)^2,$$

⁸ Gunter et al. (2014).
where, given data $\mathcal{D}$, the posterior on $\tilde{f}$ is a gp whose mean $\tilde{m}(x)$ and covariance $\tilde{V}(x, x')$ are the usual gp posterior mean (4.6) and covariance (4.7), respectively (and thus depend on $\mathcal{D}$). An integrand modelled as the square of a gp will have a smaller dynamic range than one modelled as an exponentiated gp. In this respect, wsabi is a step backwards from bbq.

However, the bbq approximations are significantly more costly, both in computation and quality, than those required for wsabi. That is, wsabi considers both linearisation and moment-matched approximation to implement the square-transformation: both prove more tractable than the linearised-exponential for bbq. Linearisation gives the following (approximate) posterior for the integrand:

$$p(f \mid \mathcal{D}) \simeq \mathcal{GP}(f; m_{\mathrm{L}}, V_{\mathrm{L}}),$$
$$m_{\mathrm{L}}(x) := \alpha + \frac{1}{2}\tilde{m}(x)^2,$$
$$V_{\mathrm{L}}(x, x') := \tilde{m}(x)\,\tilde{V}(x, x')\,\tilde{m}(x'). \qquad (10.12)$$

Moment matching instead yields

$$p(f \mid \mathcal{D}) \simeq \mathcal{GP}(f; m_{\mathrm{M}}, V_{\mathrm{M}}),$$
$$m_{\mathrm{M}}(x) := \alpha + \frac{1}{2}\left(\tilde{m}(x)^2 + \tilde{V}(x, x)\right),$$
$$V_{\mathrm{M}}(x, x') := \frac{1}{2}\tilde{V}(x, x')^2 + \tilde{m}(x)\,\tilde{V}(x, x')\,\tilde{m}(x'). \qquad (10.13)$$
The expressions above demonstrate that both the linearised and moment-matched approximations are readily implemented. In either case, the posterior for the integrand is a gp, manipulable using the standard gp equations (e.g. Eq. (4.6) for the posterior mean and
with

$$a_i(x) = T\!\left(q^2(x)\,\tilde{V}(x, x)\right) b_i(x),$$
where Ṽ( x, x ) is the usual gp posterior variance from Eq. (4.7),
q is a positive function (e.g. the integration measure), T is the
transformation (also known as warping), and bi ( x ) is an “adap-
tivity” function (in linearised wsabi, that is the square posterior
mean). Weak adaptivity, loosely speaking, requires that bi is
bounded away from zero and infinity. Linearised wsabi does
not technically fulfil this condition, but can be made to do so
with a minor correction. Moment-matched wsabi and mmlt
do satisfy the condition. Intuitively, weak adaptivity means that
the method is not “too adaptive” relative to non-adaptive bq,
and so can only be “stuck for a while, but not forever”. Kana-
gawa and Hennig (2019) also provide a worst-case bound on the
convergence rate that approaches that of non-adaptive Bayesian
quadrature. This result is weaker than that desired: empirically,
we usually see adaptive schemes offer improved convergence
over non-adaptive schemes.
Weak adaptivity plays a conceptually analogous role to the notions of detailed balance and ergodicity used to show that mcmc algorithms, likewise, can be "stuck for a while, but not forever". For mcmc the resulting insight is that, on some unknown time-scale, the mixing time of the Markov chain, the algorithm converges like direct Monte Carlo. Similarly, weak adaptivity shows that, up to some constants, adaptive bq works at least as well as its non-adaptive counterpart. Detailed balance and ergodicity alone don't necessarily make a good mcmc method (consistency is a very weak property, after all). However, the statistical community has used the theoretical underpinning provided by detailed balance and ergodicity as a licence to develop a diverse zoo of mcmc methods that are chiefly evaluated empirically. One may hope that the licence of weak adaptivity might enable a similar flourishing of adaptive bq schemes.
11
Links to Classical Quadrature
and

$$\int_a^b k(x, x_i)\,dx = \theta^2\left(\int_a^{\min(b, x_i)} x\,dx + \int_{\min(b, x_i)}^b x_i\,dx - \chi(b - a)\right)$$
$$= \theta^2\left(\frac{1}{2}\left(\min(b, x_i)^2 - a^2\right) + x_i\left(b - \min(b, x_i)\right) - \chi(b - a)\right)$$
$$= \theta^2\left(x_i b - \frac{1}{2}\left(a^2 + x_i^2\right) - \chi(b - a)\right),$$

by the assumption $a \le x_i \le b$ above.
These results provide one possible implementation, but as such
they do not say much about the properties of this quadrature
rule. As it turns out, they are a disguise for a well-known idea.
A simple way to see this is to note that the posterior mean over
f is not just a weighted sum of the observations Y (Eq. (10.8)),
but also a weighted sum of kernel functions at the locations
{ x i }:
$$\mathbb{E}_{p(f \mid Y)} f(x) = k_{xX}\underbrace{k_{XX}^{-1} Y}_{=:\,\alpha} = \sum_i k(x, x_i)\,\alpha_i. \qquad (11.3)$$

The $k(x, x_i)$ of Eq. (11.1) are piecewise linear functions, each with a sole non-differentiable locus $x = x_i$. Thus the posterior mean of Eq. (11.3) is a sum of piecewise linear functions, hence itself piecewise linear, with "kinks" – points of non-differentiability – only at the input locations X; see Figure 11.1. As this is the posterior mean conditioned on Y, that piecewise linear function with N kinks has to pass through the N nodes in Y. Assuming, for simplicity, x₁ = a, x_N = b, there is only one such function¹ on [a, b]: the linear spline connecting the evaluations: for $a \le x_i < x < x_{i+1} \le b$,

$$\mathbb{E}_{p(f \mid Y)} f(x) = f(x_i) + \frac{x - x_i}{\delta_i}\left(f(x_{i+1}) - f(x_i)\right).$$

The expected value of the integral is the integral of the expected value,² written

$$\mathbb{E}_{p(f \mid Y)}\int_a^b f(x)\,dx = \int_a^b \mathbb{E} f(x)\,dx = \sum_{i=1}^{N-1} \frac{\delta_i}{2}\left(f(x_{i+1}) + f(x_i)\right). \qquad (11.4)$$

¹ There are some technicalities to consider if x₁ does not coincide with a, because the choice of the starting time χ affects the extrapolation behaviour on the left. One resolution is to choose the asymptotic setting χ → −∞, which gives rise to a constant extrapolation, known as the natural spline (Minka, 2000) (Wahba, 1990, pp. 13–14). That same solution is also found by the filter of §11.1.1, which does not require a starting time.

² The two integrals can be exchanged due to Fubini's theorem, which states that this is possible whenever the (here: bivariate, over x and f) integrand is absolutely integrable, which is true for integrands fulfilling the assumptions above.
If we allow for an observation at x₁ = a, then the initial values of the SDE are irrelevant. Since we know F_a = 0 by definition (thus with vanishing uncertainty), the natural initialisation for the filter at x₁ = a is

$$m_1 = \begin{bmatrix} 0 \\ f(a) \end{bmatrix}, \qquad P_1 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

it out. The significance of this result is that Bayesian inference on an integral, using a nonparametric Gaussian process with N evaluations of the integrand, can be performed in O(N) operations. Put another way, the simple software implementation of the trapezoidal rule is identical to that of a particular Bayesian quadrature algorithm.

[Margin: Exercise 11.2 (easy). Convince yourself that Eqs. (11.6) and (11.7) indeed arise as the updates in Algorithm 5.1 from the choices of A, Q, H, R made above. Then show that the resulting mean estimate m_N at x = b indeed amounts to the trapezoidal rule (e.g. by a telescoping sum). That is,
$$\mathbb{E}(F) = \sum_{i=1}^{N-1}\frac{\delta_i}{2}\left(f_{i+1} + f_i\right).$$
In practice, the algorithm could thus be implemented in this simpler (and parallelisable) form. Note again, however, that this algorithm is not a good practical integration routine, only a didactic exercise. See §11.4 and the literature cited above for more practical algorithms.]
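This identity can be confirmed directly: under the Wiener kernel k(x, x') = θ²(min(x, x') − χ), the BQ posterior mean and numpy's trapezoidal rule coincide on any node set containing the endpoints. A sketch (χ and θ are assumed illustrative values, not from the book):

```python
import numpy as np

theta, chi = 1.3, -2.0        # prior scale and start time chi < a
a, b = 0.0, 1.0
rng = np.random.default_rng(0)
X = np.sort(np.concatenate([[a, b], rng.uniform(a, b, 8)]))   # nodes incl. endpoints
f = lambda x: np.sin(3 * x) + 2
Y = f(X)

kXX = theta**2 * (np.minimum(X[:, None], X[None, :]) - chi)
# integral_a^b k(x, x_i) dx = theta^2 (x_i b - (a^2 + x_i^2)/2 - chi (b - a)), as above.
kbar = theta**2 * (X * b - (a**2 + X**2) / 2 - chi * (b - a))

bq = kbar @ np.linalg.solve(kXX, Y)    # Bayesian-quadrature posterior mean
trap = np.trapz(Y, X)                  # classical trapezoidal rule
print(bq, trap)                        # agree up to floating-point error
```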
▷ 11.1.2 Uncertainty

Just like the mean estimate of the trapezoidal rule (11.4), the equidistant grid as an optimal design choice for the rule is independent of the scale θ² of the prior. Hence, there is a family of models M(θ), parametrised by θ ∈ ℝ₊, so that every Gaussian process prior in that family gives rise to the same design (the same choice of evaluation nodes), and the same trapezoidal estimation rule. The associated estimate for the square error, the posterior variance, of these models is given by Eq. (11.8). If we choose the design with equidistant steps as derived above, that expression is given by

$$\operatorname{var}(F) = \frac{\theta^2}{12}\sum_{i=1}^{N-1}\left(\frac{b - a}{N - 1}\right)^3 = \frac{\theta^2\,(b - a)^3}{12(N - 1)^2}, \qquad (11.9)$$

which means that the standard deviation $\operatorname{std}(F) = \sqrt{\operatorname{var}(F)}$, an estimate for the absolute error, contracts at a rate $O(N^{-1})$, and thus more rapidly than the $O(N^{-1/2})$ of the Monte Carlo estimate.

[Margin:
$$\frac{\partial \operatorname{var}(F)}{\partial \delta_j} = \frac{\theta^2}{4}\left[\delta_j^2 - \delta_{N-1}^2\right].$$
Setting this to zero (recall that all δ_i must be positive) gives δ_j = δ_{N−1}, ∀ j ≠ N − 1. Without the formal requirement of x₁ = a, x_N = b, it actually turns out (Sacks & Ylvisaker, 1970 & 1985) that the best design is
$$x_i = a + (b - a)\,\frac{2i}{2N + 1}.$$
This leaves a little bit more room on the left end of the domain than on the right, due to the time-directed nature of the Wiener process prior. E.g., for N = 2, the optimal nodes on [a, b] = [0, 1] are at [2/5, 4/5].]
Hence the choice of the kernel k, other than the scale θ, determines the rate at which the error estimate var(F) contracts, while θ itself provides the constant scale of the error estimate. This situation mirrors the separation, in classical numerical analysis, between error analysis (rate) and error estimation (scale). The algebraic form of the estimation rule is difficult to fundamentally change at runtime without major computational overhead, so its properties (e.g. rate) are studied by abstract analysis. The scale, on the other hand, relates to an estimate of the concrete error of the estimate, and should be estimated at runtime.

§6 and §6.3 introduced the mechanism of conjugate prior hierarchical inference on θ: using a Gamma distribution to define a prior on the inverse scale θ⁻², the joint posterior over θ and f, F remains tractable, and can be used to address the error estimation problem. Given the prior p(θ⁻²) = G(θ⁻²; α₀, β₀), the posterior on θ⁻² can be written using the recursive terms in the Kalman filter as (reproduced for conve-
From Eq. (11.10) we see that, in contrast to Eq. (11.8), this estimated variance now actually depends on the function values collected in Y. For the specific choice of the Wiener process prior (11.2), the values collected in β_N in Eq. (11.10) become⁹

$$\beta_N = \beta_0 + \frac{1}{2}\sum_{i=1}^N \frac{(y_i - Hm_i^-)^2}{H\tilde{P}_i^- H^\intercal} = \beta_0 + \frac{1}{2}\left(\frac{f(x_1)^2}{x_1 + \chi} + \sum_{i=2}^N \frac{\left(f(x_i) - f(x_{i-1})\right)^2}{\delta_i}\right). \qquad (11.12)$$

⁹ If necessary, the second line of Eq. (11.12) can be used to fix χ to its most likely value, given by
$$\chi_{\mathrm{ML}} := f(x_1)^2\left(\frac{1}{N - 1}\sum_{i=2}^N \frac{\left(f(x_i) - f(x_{i-1})\right)^2}{\delta_i}\right)^{-1} - x_1.$$
However, if x₁ = a, and the first evaluation y₁ is made without observation noise, the value of χ has no effect on the estimates.

For reference, Algorithm 11.2 on p. 97 provides pseudo-code and highlights again that this Bayesian parameter adaptation can be performed in linear cost, by collecting a running sum of the local quadratic residuals $(f(x_i) - Hm_i^-)^2$ of the filter.
[Figure: Convergence of the trapezoidal rule against a Monte Carlo baseline. The trapezoidal rule overtakes the quality of the MC estimate after eight evaluations, and begins to approach its theoretical convergence rate for differentiable integrands, O(N⁻²) (each thin line corresponds to a different multiplicative constant). The non-adaptive GP error estimates of the form const./N (Eq. (11.9)) are under-confident. So is the adaptive Student-t error estimate (dash-dotted, as in Eq. (11.11)), reflecting the overly conservative assumption of continuity but non-differentiability in the Wiener process prior. Nevertheless, the adaptive error estimate contracts faster than the non-adaptive rate of O(N⁻¹). Legend: Wiener/Trapezoid; Monte Carlo; Monte Carlo std-dev; Student-t error estimate.]

▷ 11.1.5 Convergence Rates
8   m⁻ = Am                               predictive mean
9   P̃⁻ = AP̃A⊺ + Q̃                         predictive covariance
10  z = f(x) − Hm⁻                         observation residual
11  s = HP̃⁻H⊺                              residual variance
12  K = P̃⁻H⊺/s                             gain
13  m ← m⁻ + Kz                            update mean
14  P̃ ← P̃⁻ − KsK⊺                          update covariance
15  β ← β + z²/(2s)                        update hyperparameter
16  end for
17  E(F) ← m₁                              point estimate
18  var(F) ← β/(α₀ + N/2 − 1) · P̃₁₁        error estimate
19  r ← β − β₀ − N/2                       model fit diagnostic (see §11.3)
20  return E(F), var(F)                    return mean, variance of integral
21  end procedure
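A runnable version of this filtering quadrature rule is compact. The sketch below (not the book's reference code) uses the once-integrated Wiener state (F, f) with the A, Q, H of §11.1.1, initialises exactly at x₁ = a as above, and returns the point and error estimates of lines 17-18:

```python
import numpy as np

def bq_filter(f, X, alpha0=1.0, beta0=1.0):
    m, P = np.array([0.0, f(X[0])]), np.zeros((2, 2))   # exact init at x1 = a
    beta = beta0
    H = np.array([0.0, 1.0])                 # we observe f, the second state entry
    for i in range(1, len(X)):
        d = X[i] - X[i - 1]
        A = np.array([[1.0, d], [0.0, 1.0]])             # F accumulates f
        Q = np.array([[d**3 / 3, d**2 / 2], [d**2 / 2, d]])
        mp, Pp = A @ m, A @ P @ A.T + Q                  # prediction
        z = f(X[i]) - H @ mp                             # observation residual
        s = H @ Pp @ H                                   # residual variance
        K = Pp @ H / s                                   # gain
        m, P = mp + K * z, Pp - np.outer(K, K) * s       # update
        beta += z**2 / (2 * s)                           # hyperparameter update
    N = len(X)
    return m[0], beta / (alpha0 + N / 2 - 1) * P[0, 0]   # E(F), var(F)

est, err = bq_filter(lambda x: np.sin(3 * x) + 2, np.linspace(0, 1, 50))
print(est, err)    # est matches the trapezoidal rule on these nodes
```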
[Figure: top row: absolute error |F − F̂| over the number of samples; bottom row: log p(y), with its predicted expectation E(p(ỹ)) marked, over the number of samples.]
$$\log p(Y \mid \mathcal{M}) = -\frac{1}{2}\sum_{i=1}^N \frac{(y_i - Hm_i^-)^2}{H\tilde{P}_i^- H^\intercal} - \frac{1}{2}\sum_{i=1}^N \log\left|H\tilde{P}_i^- H^\intercal\right| + \text{const.} \qquad (11.14)$$

$$\mathbb{E}_{\tilde{Y} \mid \mathcal{M}}\,\log p(\tilde{Y} \mid \mathcal{M}) = -\frac{N}{2} - \frac{1}{2}\sum_{i=1}^N \log\left|H\tilde{P}_i^- H^\intercal\right| + \text{const.}$$

The expected log-ratio between predicted and observed likelihood is

$$r(Y, \mathcal{M}) := \int \log\frac{p(\tilde{Y} \mid \mathcal{M})}{p(Y \mid \mathcal{M})}\;p(\tilde{Y} \mid \mathcal{M})\,d\tilde{Y} = -\frac{1}{2}\left(N - \sum_{i=1}^N \frac{(y_i - Hm_i^-)^2}{H\tilde{P}_i^- H^\intercal}\right) \qquad (11.15)$$
$$= \sum_{i=1}^N \frac{z_i^2}{2s_i} - \frac{N}{2} = \beta_N - \beta_0 - \frac{N}{2},$$
[Figure: as above, absolute error |F − F̂| (top row) and log p(y) with E(p(ỹ)) (bottom row), over a larger range of sample sizes.]
[Figure: log-log convergence comparison; caption fragment: "…zoidal rule converges much faster. The curved grey line is a suggestive exponential function." Curve label: Gauss–Legendre.]
[Figure 11.9: Probabilistic interpretation of Gauss–Legendre integration (that is, for ν(x) the Lebesgue measure). Priors (left) and posteriors (right) consistent with the Gauss–Legendre rules of degree 2q − 1 = 5 (top, q = 3) and 15 (bottom, q = 8), respectively. The left plots show two marginal standard deviations as a shaded area, the first 2q − 1 Legendre polynomials spanning the kernel (which are exactly integrated by the associated quadrature rule) in white, and two samples from the prior in dashed black. The right panels show the posterior after q evaluations at the nodes of the qth polynomial, again with two standard deviations as a shaded region, the posterior mean and the integrand in thick black, and two samples.]

Under the posterior on f arising from the kernel k_{2N}, the variance is zero at the N nodes X of the Nth polynomial, but is generally non-zero at x ∉ X (see Figure 11.9). To explain this
$$\nu(\psi_i) = 0, \quad \forall\, i > 0,$$

because $\psi_i(x) = \pm\psi_i(x)\psi_0(x)/\sqrt{c_0}$. In the setting of Corollary 11.6, those N evaluations exactly identify the value of the first coefficient v₀, but not necessarily those of the other coefficients. So there is flexibility left in the function values, but only in ways that do not contribute to the integral. In the posteriors shown in Figure 11.9, all sampled hypotheses, and the posterior means, share the same integral.
$$\mathbb{E}_{|X,Y}\left(f(x)\right) = R_x^\intercal \bar{m} + k_{xX}k_{XX}^{-1}Y = R_x^\intercal \bar{m} + \sum_{i=1}^N \mathbb{I}(x = x_i)\,y_i,$$

$$\operatorname{cov}_{|X,Y}\left(f(x), f(x')\right) = k_{xx'} - k_{xX}k_{XX}^{-1}k_{Xx'} + R_x^\intercal\left(c + 1^\intercal k_{XX}^{-1}1\right)^{-1}R_{x'},$$

with⁵

$$R_x := 1 - 1^\intercal k_{XX}^{-1}k_{Xx} = 1 - \sum_{i=1}^N \mathbb{I}(x = x_i), \quad \text{and}$$
$$\bar{m} := \left(c + 1^\intercal k_{XX}^{-1}1\right)^{-1} 1^\intercal k_{XX}^{-1}Y = \frac{\theta^{-2}}{c + \theta^{-2}N}\sum_{i=1}^N y_i.$$

⁵ Rasmussen and Williams (2006), §2.7.
A defender of Monte Carlo might argue that its most truly desir-
able characteristic is the fact that its convergence (see Lemma 9.2)
does not depend on the dimension of the problem. Performing
well even in high dimension is a laudable goal. However, the
statement “if you want your convergence rate to be independent
of problem dimension, do your integration with Monte Carlo”
is much like the statement “If you want your nail-hammering
to be independent of wall hardness, do your hammering with
a banana.” We should be sceptical of claims that an approach
performs equally well regardless of problem difficulty. An ex-
planation could be that the measure of difficulty is incorrect:
perhaps dimensionality is not an accurate means of assessing
the challenge of an integral. However, we contend that another
possibility is more likely: rather than being equally good for any
number of dimensions, Monte Carlo is perhaps better thought
of as being equally bad.
Recall from §10.1.2 that the curse of dimensionality results
from the increased importance of the model relative to the evalu-
ations. Theorem 12.1 makes it clear that Monte Carlo’s property
of dimensionality-independence is achieved by assuming the
weakest possible model. With these minimalist modelling as-
sumptions, very little information is gleaned from any given
evaluation, requiring Monte Carlo to take a staggering number
of evaluations to give good estimates of an integral. As a con-
trast, Bayesian quadrature opens the door to stronger models
for integrands. The strength of a model – its inductive bias – can
indeed be a deficiency if it is ill-matched to a particular inte-
grand. However, if the model is well-chosen, it offers great gains
in performance. The challenge of high dimension is in finding
models suitable for the associated problems. Thus far, Proba-
bilistic Numerics has shone light on this problem of choosing
models, and has presented some tools to aid solving it. It is now
up to all of us to do the rest. Of course, we must acknowledge that
contemporary quadrature methods (both probabilistic and clas-
sical) do not work well in high-dimensional problems: indeed,
they perform far worse than Monte Carlo. However, arguments
like those in this chapter show that there is a lot of potential
for far better integration algorithms. Such methods can work
For further evidence to this point, we note that even the most
general model underlying Monte Carlo integration can actually
converge faster if the nodes are not placed at random. Equa-
tion (12.1) is independent of the node placement X. So if it is
used for guidance of the grid design as in §11.1.3, then any
arbitrary node placement yields the same error estimate (as
long as no evaluation location is exactly repeated). Since the
covariance k assumes that function values are entirely unrelated
to each other, a function value at one location carries no infor-
mation about its neighbourhood, so there is no reason to keep
the function values separate from each other.
The tempting conclusion one may draw from Theorem 12.1
is that, because, under this rule, any design rule is equally
good, one should just use a random set of evaluation nodes. This
argument is correct if the true integrand is indeed a sample
from the extremely irregular prior of Eq. (12.2). But imagine for
a moment that, against our prior assumptions, the integrand f
happens to be continuous after all. Now consider the choice of
a regular grid,
$$X = [a,\ a + h,\ a + 2h,\ \dots,\ b - h] \quad \text{with} \quad h := \frac{b - a}{N}.$$

Then, the mean estimate from Eq. (12.1) is the Riemann sum

$$\mathbb{E}_{|X,Y}(F) = h\sum_i f(x_i).$$

For functions that are even Lipschitz continuous, this sum converges to the true integral F at a linear rate,⁷ $O(N^{-1})$. That is, the poor performance of Monte Carlo is due not just to its weak model, but also to its use of random numbers. This insight into the advantage of regular over random node placement is at the heart of quasi Monte Carlo methods.⁸ As we have seen above, however, it is possible to attain significantly faster convergence rates by combining non-random evaluation placements with explicit assumptions about the integrand.

⁷ Davis and Rabinowitz (1984), §2.1.6.
⁸ E.g. Lemieux (2009).
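The contrast is easy to reproduce. The sketch below (a toy example, not from the book) integrates a Lipschitz but non-smooth function, comparing the regular-grid Riemann sum with random placement under the same weak model:

```python
import numpy as np

c = 1 / np.sqrt(2)                       # kink at an irrational point
f = lambda x: np.abs(x - c)              # Lipschitz, not differentiable at c
a, b = 0.0, 1.0
F_true = (c**2 + (1 - c)**2) / 2         # integral of |x - c| over [0, 1]
rng = np.random.default_rng(0)
for N in [10, 100, 1000, 10_000]:
    grid = a + (b - a) / N * np.arange(N)                          # regular grid
    grid_err = abs((b - a) / N * np.sum(f(grid)) - F_true)         # ~O(N^{-1})
    mc_err = np.mean([abs((b - a) * np.mean(f(rng.uniform(a, b, N))) - F_true)
                      for _ in range(100)])                        # ~O(N^{-1/2})
    print(N, grid_err, mc_err)
```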
Exploration But what is the right loss function for the task
addressed by a prng? It is hard to defend a single, one-off,
choice being made by a prng: that is, to defend the expected loss
for such a choice being uniformly flat. A prng is perhaps more
productively considered as a heuristic for making a sequence of
decisions. The goal of this sequence (or design), X = { x1 , . . .}, is
to achieve exploration, which we will roughly define as providing
information about the salient characteristics of some function
f ( x ). As a motivating example, consider f ( x ) as the integrand
of a quadrature problem. A prng provides exploration but,
remarkably, requires neither knowledge or evaluations of f ,
nor more than minimal storage of previous choices x. These
self-imposed constraints are extreme. First, in many settings,
including in the quadrature case considered in this chapter,
we have strong priors for f . Second, many problems (again,
as in quadrature), render evaluations f ( x ) pertinent to future
choice of x: for instance, a range of x values for which f ( x )
is observed to be flat is unlikely to require dense sampling.
Third, as computational hardware has improved, memory has
become increasingly cheap. Is it still reasonable to labour under
computational constraints conceived in the 1940s?
The extremity of the prng approach is further revealed by
broader consideration of the problem it aims to solve. Explo-
ration is arguably necessary for intelligence. For instance, all
branches of human creative work involve some degree of explo-
ration. Human exploration, at its best, entails theorising, probing
and mapping. This fundamental part of our intelligence is ad-
dressed by a consequentially broad and deep toolkit. Random
and pseudo-random algorithms, in contrast, are painfully dumb,
and are so by design.
To better achieve exploration, the Probabilistic Numerics ap-
proach is to explicitly construct a model of what you aim to
explore – f ( x ). This model will serve as a guide to optimally ex-
plorative points, avoiding the potential redundancy of randomly
sampled points.
Figure 12.1, in contrast to Figure 9.3, is a cartoon indictment
of the over-simplicity of a randomised approach.
[Figure 12.1: five numbered digit sequences.]
(A ⊗ B) = (V_A ⊗ V_B)(D_A ⊗ D_B)(V_A ⊗ V_B)^{-1},  (15.6)

(A ⊗ B)⊺ = A⊺ ⊗ B⊺,  (15.7)
the singular value decomposition (SVD)

B = QΣU⊺,

with orthonormal matrices Q ∈ R^{N×N}, U ∈ R^{M×M} (that is,
Q⊺Q = I_N and U⊺U = I_M), whose columns are called the left- and
right-singular vectors, respectively, and a rectangular diagonal
matrix Σ ∈ R^{N×M} (that is, Σ_ij = 0 if i ≠ j) which contains
non-negative real numbers, called singular values of B, on the
diagonal. Assume, w.l.o.g., that N ≥ M, that the diagonal elements
of Σ are sorted in descending order, and that Σ_rr with r ≤ M is
the last non-zero singular value. Then Q can be decomposed into its
first r columns, Q_+, and the (potentially empty) remaining N − r
columns, Q_−, as Q = [Q_+, Q_−], and similarly U = [U_+, U_−] for
the columns of U. The SVD is a powerful tool of matrix analysis:

r equals the rank of B: rk(B) = r;
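The rank computation just described is a one-liner in practice. A minimal sketch (our own, in numpy conventions; the floating-point threshold is our practical addition, since numerical singular values are never exactly zero):

```python
# Sketch: r = rk(B) and the column split Q = [Q+, Q-] from the SVD.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))  # rank <= 3

Q, sigma, Ut = np.linalg.svd(B)          # B = Q diag(sigma) U^T
tol = max(B.shape) * np.finfo(float).eps * sigma[0]
r = int(np.sum(sigma > tol))             # r = rk(B)
Q_plus, Q_minus = Q[:, :r], Q[:, r:]     # Q = [Q+, Q-]
print(r, Q_plus.shape, Q_minus.shape)
```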
x is the unique vector that minimises the convex quadratic function

f(x) = ½ x⊺Ax − x⊺b,  (16.2)

and is thus known as the least-squares problem. f(x) has gradient
(see Figure 16.1)

r(x) := ∇f(x) = Ax − b,  (16.3)

[Figure 16.1: Sketch of a symmetric positive definite linear problem.]
A^{-1} =: H.

A_i D_i = Z_i  and  H_i Z_i = D_i.
The crucial question is, how should the solver choose the action
d_{i+1} from the posterior? Recall the residual

r(x) = Ax − b.

For any estimate x̃, wherever it may come from, the update

x̃ ← x̃ − Hr(x̃) = x̃ − H(Ax̃ − b) = Hb = x

for the solution x should be consistent with H_i. This suggests
the estimation update rule

x_{i+1} := x_i − H_i r(x_i),  (17.1)

where the inference on H is so far left abstract.
Following our general recipe, the second part of the solver is
the action rule, the choice for the next projection d_{i+1} of A.
There are two related, but not identical, objectives for this rule:
On the one hand, we would like to know the new residual
r(x_{i+1}), if only to track progress and check for convergence
(remember that the problem is solved iff r(x_{i+1}) = 0). On the
other hand, we want to efficiently collect information about A
and H; i.e. explore aspects of A that will maximally improve
subsequent estimates x_{>i}. To this end, consider the projection
and accompanying observation

d_{i+1} := x_{i+1} − x_i = −H_i r(x_i),
z_{i+1} = Ad_{i+1}.  (17.2)

[Figure 17.1: A quadratic optimisation problem: extremum as black centre, Hessian with eigen-directions represented by an ellipse with principal axes. The restriction of the quadratic to a linear sub-space is also a quadratic. The optimum in that sub-space can be found in a single division, but it is not identical to the projection of the global optimum onto the sub-space.]
information from z_i within the loop, as long as it is possible to do so
at low computational cost. In particular, we can re-scale the step
as d_{i+1} ← α_{i+1}d_{i+1}, using a scalar α_i ∈ R. Doing so introduces
an ever so slight break in the consistency of the probabilistic
belief: the estimate x_i in line 8 of Algorithm 17.1 will be equal
neither to H_{i−1}b nor to H_i b. (Hence, for x_0 = 0, or H_i Ax_0 = x_0,
Eq. (17.1) is actually equal to x_i = H_i b. This is mostly a problem of
presentation: Algorithm 17.1 is a compromise, allowing a general
probabilistic interpretation while staying close to classic
formulations, which typically allow arbitrary x_0.) But this is primarily
an issue of algorithmic flow (the fact that x_i and H_i are computed in
different lines of the algorithm), and the practical improvements are
too big to pass on. In any case, this adaptation is also present
in the classic algorithms, so we need to include it to find exact
equivalences.
Indeed, under the assumption of symmetric A, the optimal
scale α_{i+1} can be computed in linear time, using the observation
in line 7. We will consider symmetric A hereafter. Consider
the parametrised choice x_i = x_{i−1} + α_i d_i. The derivative of the
objective with respect to this scale is

∂f(x_{i−1} + α_i d_i)/∂α_i = α_i d_i⊺Ad_i + d_i⊺(Ax_{i−1} − b) = α_i d_i⊺Ad_i + d_i⊺r_{i−1}.

Setting it to zero yields the optimal scale

α_i = −(d_i⊺r_{i−1})/(d_i⊺z_i).  (17.3)
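The update rules (17.1)–(17.3) are simple enough to sketch in a few lines. The following is our own minimal illustration, in which the abstract inference on H is collapsed to the trivial placeholder H_i = I (so the action is just the negative residual, and the loop reduces to steepest descent with exact steps); the probabilistic solvers of this chapter refine exactly this H_i.

```python
# Sketch of the iterative scaffold: step along d, with the scale from
# Eq. (17.3). H_i = I is a placeholder for the abstract inference on H.
import numpy as np

def solve_iterative(A, b, steps=50):
    x = np.zeros_like(b)
    for _ in range(steps):
        r = A @ x - b                 # residual, Eq. (16.3)
        d = -r                        # action with placeholder H_i = I
        z = A @ d                     # observation z = A d, Eq. (17.2)
        alpha = -(d @ r) / (d @ z)    # optimal scale, Eq. (17.3)
        x = x + alpha * d             # estimate update, Eq. (17.1)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(solve_iterative(A, b), np.linalg.solve(A, b))
```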
∇f(x_i) = r_i = Ax_i − b ⊥ L_i.  (18.1)
Theorem 18.4 (proof on p. 184). If A is spd, H_i is symmetric
for all i ≥ 0, Assumption (18.3) holds, and Algorithm 17.2 does not
terminate before step k < N, then

r_i ⊥ r_j  ∀ 0 ≤ i ≠ j ≤ k,

and there exist γ_i ∈ R∖{0} for all i < k so that line 12 in Algorithm 17.2
can be written as

d_i = −H_{i−1}r_{i−1} = γ_i(−r_{i−1} + (β_i/γ_{i−1}) d_{i−1}),  (18.5)

with

β_i := (r_{i−1}⊺r_{i−1})/(r_{i−2}⊺r_{i−2}).

Comparing Algorithm 17.2 to cg (Algorithm 16.1), we note that
they are identical up to re-scaling by γ_i:

d_i^CG = γ_i d_i^Probabilistic.

[Figure 18.1: Analogous plot to Figure 17.1. The gradients at points sampled independent of the problem's structure ("needles" of point and gradient as black line, drawn from a spherical Gaussian distribution around the extremum) are likely to be dominated by the eigenvectors of the largest eigenvalues. Thus, by following the gradient of the problem, one can efficiently compute a low-rank approximation of A that captures most of the dominant structure. This intuition is at the heart of the Lanczos process that provides the structure of conjugate gradients.]
▶ 18.5 Preconditioning

…(i.e. U⊺ = U^{-1}):

Ãx̃ = b̃  with  Ã := C^{-⊺}AC^{-1},  x̃ := Cx,  b̃ := C^{-⊺}b.  (18.7)
This is not to say there is no use for uncertainty in linear
solvers. It just so happens that classic solvers address a corner
case, one less demanding of uncertainty. It is nevertheless useful
to understand the connection to probabilistic inference in this
domain, because uncertainty is more prominently important in several
settings. Two model classes have historical and practical relevance.
1. One may treat the matrix A itself as the latent object, and
define a joint probability distribution p(A, Y, S). (We will generally
assume that b itself is known with certainty, and thus not explicitly
include it in the generative model.) This model class will be called
inference on A. This approach has the advantage that the computation
of the matrix-matrix product AS = Y is described explicitly. This
would be relevant, for example, if the main source of uncertainty is
in this computation itself – if we do not actually compute exact
matrix-matrix multiplications, but only approximations of it (a
setting not further discussed here).

The downside of this formulation is that it does not explicitly
involve x. This is an issue because a tractable probability
distribution on A may induce a complicated distribution on x. For
intuition, Figure 19.1 shows distributions of the inverse of a scalar
Gaussian variable of varying mean. For matrices, the situation is even
more complicated, as the probability measure might put non-vanishing
density on matrices that are not even invertible.

[Figure 19.1: Inverses of Gaussian variables are not themselves normally distributed. The plot shows the distribution of x^{-1} if p(x) = N(x; µ, 1), for five different values of µ. Since (x + ε)^{-1} ≈ x^{-1} − εx^{-2}, in the limit |µ| → ∞, the distribution approaches p(x^{-1}) ≈ N(x^{-1}; µ^{-1}, µ^{-4}). However, for small values of µ, the distribution becomes strongly bi-modal. It is therefore clear that we will have to resort to approximations if we want to infer both matrices and their inverse while using a Gaussian distribution to model either variable. See also Figure 19.9 for more discussion.]
x = Hb,  Z = AD ⟺ D = HZ;

that is, the two formulations are related by the exchanges S ↔ Y and A ↔ H.
p(A; A_0, Σ_0) = N(vec(A); vec(A_0), Σ_0)

p(A | Y, S) = N(vec(A); vec(A_M), Σ_M),  (19.4)

vec(A_M) := vec(A_0) + Σ_0(I ⊗ S) G_M (vec(Y) − (I ⊗ S⊺)vec(A_0)),
  with G_M := ((I ⊗ S⊺)Σ_0(I ⊗ S))^{-1},  (19.5)

Σ_M := Σ_0 − Σ_0(I ⊗ S) G_M (I ⊗ S⊺)Σ_0.  (19.6)
A better prior would encode the fact that vec(A) is not just a long
vector, but contains the elements of a square matrix. The projection
terms (I ⊗ S), with their Kronecker product structure,
already contain information about the generative process of the
observations. We thus consider a Kronecker product for the
prior covariance, too:

Σ_0 = V_0 ⊗ W_0  with spd V_0, W_0 ∈ R^{N×N}.  (19.8)

(Distributions of the form N(X; X_0, V ⊗ W) are sometimes called
matrix-variate normal, due to a paper by Dawid (1981). This convention
will be avoided here, since it can give the incorrect impression that
this is the only possibility to assign a Gaussian distribution over
the elements of a matrix, when in fact Eq. (19.3) is the most general
such distribution.)

What kind of prior assumptions are we making here? If both
matrices in a Kronecker product are spd, so is their Kronecker
product (see Eq. (15.6)). Hence, Eq. (19.8) yields an spd overall
covariance, and the prior assigns non-vanishing probability
density to every matrix A, including non-invertible, indefinite
ones, etc., despite the fact that such spd matrices V_0, W_0 only offer

2 · ½N(N + 1) = N(N + 1)

degrees of freedom (as opposed to the ½N²(N² + 1) degrees
of freedom in a general spd Σ_0). (One helpful intuition for this
situation is to convince oneself that the space of Kronecker products
spans a sub-space of rank one within the space of N² × N² real
matrices, and that this sub-space does contain a space of spd matrices.)
The prior assumptions encoded by a Kronecker product in the
covariance are subtle. A few intuitive observations follow. The
Kronecker covariance can be written as a generative construction in
which the columns of a sample are coupled draws c_j ∼ N(0, W_0).
Figure 19.3 shows five samples each from two different Gaussian
distributions with Kronecker product covariance.

The main takeaway is that while the Kronecker covariance
does represent a helpful restriction, it is not one that limits
the space of matrices that we can infer. The prior measure
encompasses all matrices.
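Sampling from such a prior is cheap. The following sketch (our own; the parameter values echo the diagonal choice of the adjacent figures) uses the standard identity that, under row-stacking, vec(L_V E L_W⊺) = (L_V ⊗ L_W)vec(E), so a draw with covariance V_0 ⊗ W_0 is A_0 + L_V E L_W⊺ with E an i.i.d. standard normal matrix and Cholesky factors V_0 = L_V L_V⊺, W_0 = L_W L_W⊺.

```python
# Sketch: matrix samples whose vectorisation has the Kronecker
# covariance Sigma_0 = V_0 (x) W_0 of Eq. (19.8).
import numpy as np

def sample_kronecker(A0, V0, W0, rng):
    LV, LW = np.linalg.cholesky(V0), np.linalg.cholesky(W0)
    E = rng.standard_normal(A0.shape)   # i.i.d. N(0, 1) entries
    return A0 + LV @ E @ LW.T

N = 4
rng = np.random.default_rng(2)
A0 = np.zeros((N, N))
V0 = np.eye(N)
W0 = np.diag([10.0**2, 9.0**2, 8.0**2, 7.0**2])
print(sample_kronecker(A0, V0, W0, rng).round(1))
```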
A_M^{-1} = (A_0 + Δ(S⊺W_0S)^{-1}S⊺W_0)^{-1}  (19.12)
  = A_0^{-1} + (S − A_0^{-1}Y)(S⊺W_0A_0^{-1}Y)^{-1}S⊺W_0A_0^{-1}

(writing Δ := Y − A_0S).

Lemma 19.3. Assume A_0 and W_0 are spd, and the search directions
S are chosen to be linearly independent. Then, for our assumption of
spd A, the inverse (19.12) exists.

Proof. If A_0 is spd, its inverse exists. Y = AS, and products of spd
matrices are spd. Thus, W_0A_0^{-1}A is spd, hence S⊺W_0A_0^{-1}AS
invertible.
A_ij − A_ji = 0  ∀ 1 ≤ i, j ≤ N.

p(A) = N(A; A_0, W_0 ⊗ W_0).  (19.14)
This step alone does not encode symmetry (samples from this
distribution are still asymmetric with probability one), but it
avoids technical complications in the following.
For a formal treatment, we introduce two projection operators
acting on the space R^{N²} of square N × N matrices:

Π_+ : R^{N²} → R^{N²}, with elements [Π_+]_{(ij),(kℓ)} := ½(δ_ik δ_jℓ + δ_iℓ δ_jk), projecting onto symmetric matrices; and

Π_− : R^{N²} → R^{N²}, with elements [Π_−]_{(ij),(kℓ)} := ½(δ_ik δ_jℓ − δ_iℓ δ_jk), projecting onto anti-symmetric (also known as skew-symmetric) matrices.

It is easy to convince oneself that Π_+ and Π_− are orthogonal
projection operators that jointly span R^{N²}, i.e. that

Π_+Π_+ = Π_+,  Π_−Π_− = Π_−,  Π_+⊺ = Π_+,  Π_−⊺ = Π_−,
Π_+Π_− = Π_−Π_+ = 0_{N²},  Π_+ + Π_− = I_{N²},  (19.15)

which holds simply because X = ½(X + X⊺ + X − X⊺). This decomposes
the Kronecker covariance as

W ⊗ W = Π_+(W ⊗ W)Π_+ + Π_−(W ⊗ W)Π_−  =:  W ⊛ W + W ⊘ W,  (19.16)

defining the symmetric Kronecker product W ⊛ W and its anti-symmetric
counterpart W ⊘ W.
(C ⊛ D)^{-1} ≠ C^{-1} ⊛ D^{-1}.

We also have

(C ⊛ D) vec(X) = ½ vec(CXD⊺ + CX⊺D⊺),  and
(C ⊘ D) vec(X) = ½ vec(CXD⊺ − CX⊺D⊺).  (19.17)

(If W ∈ R^{N×N} is of full rank, the matrix W ⊛ W has rank ½N(N + 1),
the dimension of the space of all real symmetric N × N matrices. That
its inverse on that space is given by W^{-1} ⊛ W^{-1} can be seen from
Eq. (19.17). The inverse on asymmetric matrices is not defined.)

Using this framework (cf. Alizadeh, Haeberly, and Overton, 1988), the
information about A's symmetry can be explicitly written as an
observation with likelihood

p(symmetry | A) = δ(Π_− vec(A) − 0) = lim_{β→0} N(0_{N²}; Π_− vec(A), βI_{N²}).  (19.18)
p(A) = N(A; A_0, W_0 ⊛ W_0)  (19.20)

p(A | Y, S) = δ(Y − AS) N(A; A_0, W_0 ⊛ W_0) / ∫ δ(Y − AS) N(A; A_0, W_0 ⊛ W_0) dA
  = N(A; A_M, Σ_M).
Some algebraic footwork is required to find the posterior
mean and covariance, the analogues to Eqs. (19.10) and (19.11).
They are (for a derivation, see Hennig (2015))

A_M = A_0 + (Y − A_0S)(S⊺W_0S)^{-1}S⊺W_0 + W_0S(S⊺W_0S)^{-1}(Y − A_0S)⊺
      − W_0S(S⊺W_0S)^{-1}S⊺(Y − A_0S)(S⊺W_0S)^{-1}S⊺W_0,  (19.21)

Σ_M = W_M ⊛ W_M,  with (as in Eq. (19.11))
W_M := W_0 − W_0S(S⊺W_0S)^{-1}S⊺W_0.  (19.22)

[Figure 19.4: Samples from a Gaussian prior encoding symmetry. Results analogous to Figure 19.3. Five i.i.d. samples from the distribution of Eq. (19.20) for W_0 = I (left column) and W_0 = diag[10², 9², 8², …] (right column), respectively. Note the differing choice for W_0 relative to Figure 19.3, since W_0 here appears in both terms of the product.]
These expressions, in particular the posterior mean A_M, play
a central role not just in linear solvers, but also in nonlinear
optimisation. Let us take a closer look. We first note that A_M is
indeed symmetric if A_0 is symmetric, because S⊺Y = S⊺AS is
symmetric. What is less obvious is that the expression added
to A_0 is of at most rank 2M. This can be seen by defining the
helpful terms

U := W_0S(S⊺W_0S)^{-1} ∈ R^{N×M}  and  V := (I − ½US⊺)(Y − A_0S) ∈ R^{N×M}.  (19.23)

Exercise 19.5 (moderate). Explicitly compute the evidence term
∫ δ(Y − AS) N(A; A_0, W_0 ⊛ W_0) dA. What is its form?

Exercise 19.6 (hard). Derive the result in Eq. (19.21). In performing
the derivation, try to gain an intuition for why the posterior mean
(19.21) is not simply the symmetrised form Π_+A_M of the posterior
mean from Eq. (19.10) (consider Eq. (19.19) for a hint).
Since we assume throughout this chapter that A is not just symmetric,
but also positive definite, it would be desirable to also encode this
information in the prior. Unfortunately, the space of positive definite
matrices is a cone, a nonlinear sub-space of R^{N²}, the space of
all (vectorised) square N × N matrices, and also a nonlinear
sub-space of R^{½N(N+1)}, the space of symmetric such matrices
(see Figure 19.5). Information about positive definiteness can
thus not be captured in a Gaussian likelihood term using only
linear terms of A.

[Figure 19.5: Outer boundaries of the positive definite cone within the space of symmetric 2 × 2 matrices (the only case that allows a plot). The thick line down the centre of the cone marks scalar matrices. The outer edge of the 2 × 2 positive definite cone is given by matrices A with A_12 = A_21 = ±√(A_11A_22).]
It is, however, possible to scale the parameters of the Gaussian
prior post hoc to ensure that the posterior mean estimate always
lies within the positive definite cone. This is helpful in so far
as it means this posterior point estimate can be trusted to be
admissible, and this is how this correction is used in practice.
From a probabilistic perspective, however, this is not particularly
satisfying since it means the model cannot make use of the
known positive definiteness during inference.
The following is a minor generalisation of a derivation in a
seminal review by Dennis and Moré (1977, §7.2), reproduced in
some detail here because it provides valuable insights, also used
in Chapter IV. Consider the symmetry-encoding prior (19.20),
with posterior belief

p(A | Y_i) = N(A; A_i, W_i ⊛ W_i),

conditioned on observations Y_i, the i matrix-vector multiplications
y_j = As_j for j = 1, …, i. Using p(A | Y_i), and given the next
observation y_{i+1} = As_{i+1}, we can calculate a Gaussian posterior
on A with the mean and covariance

A_{i+1} = A_i + ((y_i − A_is_i)s_i⊺W_i + W_is_i(y_i − A_is_i)⊺)/(s_i⊺W_is_i)
          − ((y_i − A_is_i)⊺s_i)/(s_i⊺W_is_i)² · W_is_is_i⊺W_i  (19.25)
        = A_i + uv⊺ + vu⊺,

for (see Eq. (19.23))

u := W_is_i/(s_i⊺W_is_i),
v := (y_i − A_is_i) − W_is_is_i⊺(y_i − A_is_i)/(2s_i⊺W_is_i),
W_{i+1} = W_i − W_is_is_i⊺W_i/(s_i⊺W_is_i).

[Figure 19.6: A Gaussian prior measure of mean A_0 = 3I and symmetric Kronecker covariance with W = 3I shown relative to the positive definite cone. The symmetric Kronecker product inherits some of the cone's structure in so far as the marginal variance of off-diagonal elements under this prior is half that of diagonal elements. But the distribution still assigns non-vanishing measure to the indefinite matrices outside of the cone.]
Each posterior mean update is thus a rank-2 update. This iterative
form is more manageable from an analytic perspective than the
immediate form of Eq. (19.21). The idea is now to ask for a value
of W_0 such that A_i can be shown by induction to be positive
definite, a notion that Dennis and Moré call hereditary positive
definiteness. For this we make use of a result from matrix
perturbation theory (Wilkinson 1965, pp. 95–98). Intuitively
speaking, a rank-1 update can at most shift the eigenvalues of the
original matrix up or down to the value of the nearest neighbouring
eigenvalues.
Lemma 19.7. Let A ∈ R^{N×N} be symmetric with eigenvalues
λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N, and let A* := A + vv⊺ for some v ∈ R^N,
with eigenvalues λ*_1 ≤ ⋯ ≤ λ*_N. Then the two spectra interlace:

λ_1 ≤ λ*_1 ≤ λ_2 ≤ ⋯ ≤ λ_N ≤ λ*_N.
Now note that the rank-2 update in Eq. (19.25) can be written
as the sum of two symmetric rank-1 matrices:

A_{i+1} = A_i + ½((u + v)(u + v)⊺ − (u − v)(u − v)⊺).  (19.26)

…(the choice W_0 = A).
If the covariance is chosen as W = A (Corollary 19.9), together with
a positive definite prior mean A_0, then inference on a positive
definite A will always produce positive definite posterior means
(see Figure 19.8). If we consider scalar covariances instead, then
Theorem 19.10 provides the weaker statement that, in that setting,
it is at least possible to “drag the posterior mean into the positive
definite cone” by increasing the prior covariance.

Both statements, however, are dissatisfying from the probabilistic
perspective for two reasons. First, they are just statements
about the mean. The posterior distribution, being Gaussian, will
always put non-zero measure on parts of the real vector-space
outside of the positive definite cone. Second, these statements
are of a post-hoc nature, literally: the prior still puts mass
outside of the cone (see Figure 19.6). The information that A is
positive definite, available a priori, can thus not be leveraged by
the solver in its action policy. The real value of prior information
is that it can change the way the algorithm acts, not just the
final estimate. At the time of writing, there is no clear solution
to this problem.

[Figure 19.7: Gaussian inference under the Gaussian prior of Figure 19.6 and the projection s = [1, 0]⊺, on the spd matrix A with A_11 = A_22 = 9, A_12 = 0.7 · 9 (black circle). The plots on the “left wall” of the plot show the projections of the prior and A into the observation space [A_11, A_12]⊺. Although both the prior mean and the true matrix are symmetric positive definite, the posterior mean (black square, connected to A_0 by a dashed line) lies outside of the cone. Theorem 19.10 shows that one way to fix this is to increase the prior mean. The graphical representation of this result is that A_1 always lies on the black projection line connecting A and y (recommended instant exercise: why?). As the prior mean increases, A_1 eventually moves along that line into the cone.]

▶ 19.7 Summary: Gaussian Linear Solvers
This chapter developed different candidates for a Gaussian framework
of linear solvers. Aiming to solve Ax = b for x, assuming that A is
symmetric positive definite, we adopt the algorithmic paradigm of an
iterative solver, as defined in Table 17.2, constructing projection-
observation pairs S = [s_1, …, s_M] ∈ R^{N×M} and Y = AS =
[y_1, …, y_M] ∈ R^{N×M}.

To endow that algorithm class with a probabilistic meaning,
we might directly model the matrix inverse H. Modelling H
allows a joint Gaussian model over both H and the solution
x. Alternatively, we might model the matrix A. Modelling A
allows direct treatment of Gaussian observation noise, and still …

[Figure 19.8: Analogous to Figure 19.7, but with the covariance choice W = A considered in Corollary 19.9. Under this choice, the posterior mean always lies “to the right” of the true A along the projection line, thus in the positive definite cone.]
Model for H:
  p(H) = N(H_0, V_0 ⊗ W_0)  or, with the symmetric Kronecker prior,  p(H) = N(H_0, W_0 ⊛ W_0);
  p(Y, S | H) = δ(S − HY) = lim_{γ→0} N(vec(S); vec(HY), γ²(I_N ⊗ I_M))
  (respectively with γ²(I_N ⊛ I_M) in the symmetric case).

Model for A:
  p(A) = N(A_0, V_0 ⊗ W_0)  or  p(A) = N(A_0, W_0 ⊛ W_0);
  p(Y, S | A) = δ(Y − AS) = lim_{γ→0} N(vec(Y); vec(AS), γ²(I_N ⊗ I_M))
  (respectively with γ²(I_N ⊛ I_M)).
As we have already noted in §19.2, there are structural differences
between a Gaussian model over A and one over its inverse.

[Figure 19.9: Inverses of shifted Gaussian variables: the distribution of (x + µ)^{-1} for x ∼ N(0, 1), with 20 samples. The mean is E_{N(µ,1)}(x^{-1}) = √(π/2) e^{−µ²/2} erfi(µ/√2), where erfi(z) = −i erf(iz). However, for µ ≫ 1, the distribution p(x^{-1}) is relatively well approximated by N(x^{-1}; µ^{-1}, µ^{-2}) (dashed line µ^{-1}, dotted lines at µ^{-1} ± 2√(µ^{-2})). For µ = 0, the mean does not exist.]

Definition 19.11 (Posterior correspondence). Consider two solvers
of the form of Algorithm 17.2, one with a belief on A with prior mean
A_0 and a covariance parameter W_0^A (with associated posterior mean
A_M) and one maintaining a distribution on H with prior mean H_0
and covariance parameter W_0^H (with associated posterior mean H_M).
We say their priors induce posterior correspondence if

A_M^{-1} = H_M  for 0 ≤ M ≤ N.  (19.28)

And we speak of weak posterior correspondence if we only have

A_M^{-1}Y = H_MY.  (19.29)

(These statements are all from Wenger and Hennig (2020), where proofs
can also be found.)
0 = (AS − A_0S)[(S⊺W_0^A A_0^{-1}AS)^{-1}S⊺W_0^A A_0^{-1} − (S⊺A⊺W_0^H AS)^{-1}S⊺A⊺W_0^H].

W_0^A S = Y,  and

S⊺(W_0^A A_0^{-1} − AW_0^H) = 0.

W_0 = A.
The pseudoinverse of A_M can be computed efficiently and has the
right conceptual properties for many applications. (The concept
seems to have been invented by Fredholm (1903) for operators, and
discussed for matrices by Moore (1920). The pseudoinverse yields the
least-squares solution A⁺b for our linear problem Ax = b in the
sense that ∥Ax − b∥₂ ≥ ∥AA⁺b − b∥₂ for all x ∈ R^N. For the choice
A_0 = 0, A_M⁺ can also be seen as the natural limit of the estimator
A_M^{-1} arising from A_0 = αI for small α, because, for general A,
A⁺ = lim_{α→0}(A⊺A + αI)^{-1}A⊺.) For a factorised symmetric matrix
like our A_M = ỸỸ⊺, the pseudoinverse is given by

A⁺ = Ỹ(Ỹ⊺Ỹ)^{-2}Ỹ⊺.

Since Ỹ⊺Ỹ is tridiagonal symmetric positive definite, its inverse
can be computed in 8M operations (see note 2).

Alas, we can of course not set W_0 = A, since A is the very
matrix we are trying to infer. We could set
(Y⊺Ξ_0Y)^{-1} = (½(βY⊺W_0Y + Y⊺b̃b̃⊺Y))^{-1}
  = 2β^{-1}(Y⊺W_0Y)^{-1}(I − Y⊺b̃b̃⊺Y(Y⊺W_0Y)^{-1}/(β + b̃⊺Y(Y⊺W_0Y)^{-1}Y⊺b̃)),

x_M = x_0 + (βW_0Y + b̃b̃⊺Y)β^{-1}(Y⊺W_0Y)^{-1}(I − Y⊺b̃b̃⊺Y(Y⊺W_0Y)^{-1}/(β + b̃⊺Y(Y⊺W_0Y)^{-1}Y⊺b̃))(S⊺b + Y⊺x_0).  (20.10)
Ω = ωI.  (21.2)

… = ω(I − S(S⊺S)^{-1}S⊺).

The scale ω can then be interpreted as scaling the remaining
uncertainty over the entire null-space of S, the space not yet
explored by cg. (This uses that span{r_0, r_1, …, r_{m−1}} =
span{r_0, y_1, …, y_{m−1}} = span{s_1, …, s_m} =
span{r_0, Ar_0, …, A^{m−1}r_0}; for a proof, see e.g. Theorem 5.3 in
Nocedal and Wright (1999).) How should ω be set? We already saw in
Eqs. (20.4)–(20.6) that the very (symmetric) Kronecker structure in
the covariance that engenders the desirable low-rank structure of
the posterior mean also restricts the calibration of uncertainty and
causes a trade-off: calibrated uncertainty on the diagonal elements
implies under-confidence off the diagonal, and conversely, calibrated
uncertainty off the diagonal means over-confidence on the diagonal.
We can thus expect to have to strike some balance between the two.

▶ 21.1 Rayleigh Regression

Figure 21.1 shows results from an empirical example, a run of
cg on a specific matrix. The SARCOS data set is a popular,
simple test setup for kernel regression. (See Vijayakumar and Schaal
(2000), or §2.5 in Rasmussen and Williams (2006). The data can be
found, at the time of writing, at www.gaussianprocess.org/gpml/data/.
It contains a time series of trajectories mapping 21-dimensional
inputs x ∈ R²¹ (positions, velocities, and accelerations,
respectively, of 7 joints of a robot arm) to 7 output torques. The
first of these torques is typically used as the target y(x) ∈ R for
regression, as was done here, too. The entire training set contains
44 484 input–output pairs. For the purposes of this experiment, to
allow some comparisons to analytical values, this was thinned by a
factor of 1/3, to N = 14 828 locations. The data was standardised to
have vanishing mean and unit covariance.) It was used to construct a
kernel ridge regression problem Ax = b with

A := k_XX + σ²I ∈ R^{14 828 × 14 828}  and  b := y,  (21.3)
a(m) := (s_m⊺As_m)/(s_m⊺s_m),

where s_m is the mth direction of cg. These coefficients are readily
available during the run of the solver, because the term s_m⊺As_m
(up to a linear-cost re-scaling) is computed in line 7 of
Algorithm 17.2. From Eq. (21.3), there are straightforward upper
and lower bounds both for elements of A and for a(m). With the
eigenvalues λ_1 ≥ ⋯ ≥ λ_N of A, we evidently have

λ_1 ≥ a(m) ≥ λ_N  for all m,

and thus also … (Both bounds hold because A is assumed to be spd,
thus all its eigenvalues are real and non-negative. The upper bound
holds because the trace is the sum of the eigenvalues. If
k_XX = UDU⊺ is the eigenvalue decomposition of the spd matrix k_XX,
then the lower bound holds because UDU⊺ + σ²I = U(D + σ²I)U⊺. For
this specific matrix, we also know from the functional form of k_XX
(Eq. (4.4)) that [A]_ij ≤ 1 + σ²δ_ij, although such a bound is not
immediately available for H = A^{-1}.)
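Collecting the a(m) alongside cg costs essentially nothing, since the quantity s_m⊺As_m is computed by the solver anyway. A minimal sketch of this bookkeeping (our own, on a synthetic spd matrix) also confirms the eigenvalue bounds just stated:

```python
# Sketch: Rayleigh coefficients a(m) = s_m^T A s_m / s_m^T s_m during cg.
import numpy as np

def cg_with_rayleigh(A, b, steps):
    x = np.zeros_like(b)
    r = b - A @ x
    d = r.copy()
    a_vals = []
    for _ in range(steps):
        Ad = A @ d
        dAd = d @ Ad                     # needed for the cg step anyway
        a_vals.append(dAd / (d @ d))     # a(m)
        alpha = (r @ r) / dAd
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d, r = r_new + beta * d, r_new
    return x, np.array(a_vals)

rng = np.random.default_rng(8)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50 * np.eye(50)            # spd test matrix
b = rng.standard_normal(50)
x, a_vals = cg_with_rayleigh(A, b, steps=10)
lam = np.linalg.eigvalsh(A)
print(np.all((a_vals >= lam[0] - 1e-9) & (a_vals <= lam[-1] + 1e-9)))
```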
Since the a(m) are readily available during the solver run, it
is desirable to make additional use of them for uncertainty
quantification – to set ω in Eq. (21.2) based on the progression of
a(m). One possible use for the posterior mean A_M is to construct …

[Figure: histograms of z_i for M = 1, 30, 250; heatmaps of A, A_M, std_M(A), and |A − A_M|.]
Proof. By induction: For the base case i = 2, i.e. after the first
iteration of the loop, we have (recall that α_1 = −d_1⊺r_0/(d_1⊺Ad_1);
for this proof, it does not actually matter how the first direction
d_1 is chosen). The symmetry of the estimator H_i is used in the
third-to-last equality:

d_1⊺Ad_2 = −d_1⊺A(H_1r_1) = −d_1⊺A(H_1(y_1 + r_0)) = −d_1⊺A(s_1 + H_1r_0) = −α_1d_1⊺Ad_1 − d_1⊺AH_1r_0
  = d_1⊺r_0 − α_1^{-1}s_1⊺AH_1r_0 = d_1⊺r_0 − α_1^{-1}y_1⊺H_1r_0 = d_1⊺r_0 − α_1^{-1}s_1⊺r_0 = d_1⊺r_0 − d_1⊺r_0 = 0.
For the inductive step, assume {d_0, …, d_{i−1}} are pairwise
A-conjugate. For any k < i, using this assumption twice yields …

α_0 = (r_0⊺r_0)/(r_0⊺Ar_0).
δ_1 = −(r_0⊺Ar_0)/(r_0⊺AAr_0) δ_0,  and

d_2 = δ_0 (r_0 − (r_0⊺Ar_0)/(r_0⊺AAr_0) Ar_0)
    = δ_0 (r_0 − α_0 (r_0⊺Ar_0)²/((r_0⊺r_0)(r_0⊺AAr_0)) Ar_0)
    = δ_0 (r_0⊺Ar_0)²/((r_0⊺r_0)(r_0⊺AAr_0)) · (−α_0Ar_0 + r_0 + r_0((r_0⊺r_0)(r_0⊺AAr_0)/(r_0⊺Ar_0)² − 1))
      [with −γ_0 := δ_0 (r_0⊺Ar_0)²/((r_0⊺r_0)(r_0⊺AAr_0)) and r_1 = −α_0Ar_0 + r_0]
    = γ_0 (−r_1 + (1/γ_1)((r_0⊺r_0)(r_0⊺AAr_0)/(r_0⊺Ar_0)² − 1) d_1).

We see γ_0 ≠ 0, because A is spd and r_0 ≠ 0 by assumption
(otherwise the algorithm would be converged!). To close this
part, we observe that

d_i = ∑_{j<i} ν_j s_j + ν_i r_{i−1}.  (22.3)
If ℓ < i − 1, then the second term in this sum cancels by the first
induction assumption:

y_ℓ⊺r_{i−1} = (r_ℓ − r_{ℓ−1})⊺r_{i−1} = 0  (for ℓ < i − 1).

r_i = As_i + r_{i−1} = α_i Ad_i + r_{i−1} = −(d_i⊺r_{i−1})/(d_i⊺Ad_i) · Ad_i + r_{i−1}.  (22.6)
r_{i−1}⊺r_i = −r_{i−1}⊺r_{i−1} + (β_j/γ_{j−1}) d_{j−1}⊺r_{j−1} + r_{i−1}⊺r_{i−1} = (β_j/γ_{j−1}) d_{j−1}⊺r_{j−1}.

r_{j−1} = α_{j−1}Ad_{j−1} + r_{j−2},  with  α_{j−1} = −(d_{j−1}⊺r_{j−2})/(d_{j−1}⊺Ad_{j−1}).
= 1/α ((Y⊺Y) − (Y⊺S)(S⊺S)^{-1}(Y⊺S))^{-1} − Y⊺S.
In the analogue to Eq. (6.2), we take care of the Gaussian part by re-arranging

p(α, β | Y, S) ∝ G(β^{-2}; a_0, b_0) N(vec(Y); µ_0 vec(S), β²(G + (1/λ_0)vec(S)vec(S)⊺))  (22.8)
  × N(α; Ψ(λ_0µ_0 + vec(S)⊺G^{-1}vec(Y)), β²Ψ),

with  Ψ := (λ_0 + vec(S)⊺G^{-1}vec(S))^{-1}  (22.9)

and  G := (I ⊗ S)⊺(I ⊛ I)(I ⊗ S) ∈ R^{NM×NM}.
G vec(X) = vec(Z)  ⟹  X = 2Z(S⊺S)^{-1} − S(S⊺S)^{-1}(Z⊺S)(S⊺S)^{-1}.

In this sense, we have

vec(S)⊺G^{-1}vec(S) = tr((S⊺S)(S⊺S)^{-1}) = M,

which simplifies Ψ from Eq. (22.9) to Ψ = (λ_0 + M)^{-1}. Analogous
to the scalar base case, we introduce a “sample mean” defined by

α̃ := M^{-1} tr(Y⊺S(S⊺S)^{-1}).  (22.10)

(The trace in Eq. (22.10) has a particularly simple form if the
directions S are A-conjugate (as is the case if they are produced by
cg). In that case, Y⊺S is a diagonal matrix, and the expression
becomes tr(Y⊺S(S⊺S)^{-1}) = ∑_m (s_m⊺As_m)[(S⊺S)^{-1}]_{mm}.)
With this, the second line of Eq. (22.8), the posterior on α, becomes

p(α | β², Y, S) = N(α; (λ_0µ_0 + Mα̃)/(λ_0 + M), β²/(λ_0 + M)).

(For an intuition of this expression, note that if S = I_{:,1:M},
the first M columns of the identity, then the posterior mean on α is
essentially, up to regularisation, computing a running average of A's
first M diagonal elements.) To approach the posterior on the variance
β², we continue to follow the guidance from Chapter I and apply the
matrix inversion lemma, which yields
(G + (1/λ_0)vec(S)vec(S)⊺)^{-1} = G^{-1} − (G^{-1}vec(S))(vec(S)⊺G^{-1})/(λ_0 + M),

so that

(vec(Y) − µ_0vec(S))⊺(G + (1/λ_0)vec(S)vec(S)⊺)^{-1}(vec(Y) − µ_0vec(S))
  = (vec(Y) − µ_0vec(S))⊺(vec(2Y(S⊺S)^{-1} − S(S⊺S)^{-1}(Y⊺S)(S⊺S)^{-1}) − µ_0 vec(S(S⊺S)^{-1})) − (Mα̃ − Mµ_0)²/(λ_0 + M)
  = tr(2Y⊺Y(S⊺S)^{-1} − Y⊺S(S⊺S)^{-1}Y⊺S(S⊺S)^{-1}) − 2µ_0Mα̃ + µ_0²M − M²(α̃ − µ_0)²/(λ_0 + M).
As mentioned in Note 3 above, some care must be taken when
considering the number of pseudo-observations. The matrix
determinant lemma (15.11) provides

det(β²(G + (1/λ_0)vec(S)vec(S)⊺)) = det(β²G)(1 + (1/λ_0)vec(S)⊺G^{-1}vec(S))
  = β^{2(NM − ½(M² − M))} det(G)(1 + M/λ_0).
µ_N = (λ_0µ_0 + Mα̃)/(λ_0 + M),
λ_N = λ_0 + M,
a_N = a_0 + ½(NM − ½(M² − M)),
b_N = b_0 + ½(tr(2Y⊺Y(S⊺S)^{-1} − Y⊺S(S⊺S)^{-1}Y⊺S(S⊺S)^{-1})
      − 2µ_0Mα̃ + µ_0²M − M²(α̃ − µ_0)²/(λ_0 + M)).  (22.11)
∇f : R^N → R^N,  [∇f(x)]_i = ∂f(x)/∂x_i,  and

B : R^N → R^{N×N},  [B(x)]_ij = ∂²f(x)/(∂x_i ∂x_j).
(That is, we will use the notation arg min even for local minima,
for simplicity). Introductions to classic nonlinear optimisation
methods can be found in a number of great textbooks. Nocedal
and Wright (1999) provide an accessible, practical introduc-
tion with an emphasis on unconstrained, not-necessarily-convex
problems. Boyd and Vandenberghe (2004) offer a more theoreti-
cally minded introduction concentrating on convex problems.
Both books also discuss constrained problems, and continuous
optimisation problems that do not have a continuous gradient
everywhere (so-called non-smooth problems). These two areas
are at the centre of the book by Bertsekas (1999). Other popular
types of optimisation include discrete and mixed-integer “programs”.
(For historical reasons, the optimisation and operations research
communities use the terms “program” and “problem”, as well as
“programming” and “optimisation”, synonymously. A mixed-integer
program is a problem involving both continuous (real-valued) and
discrete parameters.) Genetic Algorithms and Stochastic Optimisation
are also large communities, interested in optimising highly noisy or
fundamentally rough functions (see e.g. the book by Goldberg
(1989)). Such noise (i.e. uncertainty/imprecision) on function
values will play a central role in this chapter – in fact, one could
make the case that Probabilistic Numerics can bridge some conceptual
gaps between numerical optimisation and stochastic optimisation.
However, we will make the assumption that there is at least a smooth
function “underneath” the noise. Stochastic and evolutionary methods
are also connected to the contents of Chapter V.
(On the origins of the method, Cauchy may be a contender, in 1847;
see Lemaréchal (2012).)

Gradient descent, both with an efficient choice of step size and a
fixed step size, offers important reference points. The following
two results (Thms. 3.3 and 3.4 in Nocedal and Wright (1999); the
proof for the first one is in Luenberger (1984)) show that gradient
descent with exact line searches has a linear convergence rate.
Theorem 25.1 (Convergence of noise-free, exact line search,
steepest descent on quadratic functions). Consider the strongly
convex quadratic function (already studied in Chapter III)

f(x) = ½x⊺Bx − b⊺x,  B ∈ R^{N×N}, b ∈ R^N,

with symmetric positive definite Hessian B. This function has a global
minimum at x* = B^{-1}b. Let 0 < λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N be the
eigenvalues of B. Then steepest descent with exact line searches
satisfies

∥x_{i+1} − x*∥²_B ≤ ((λ_N − λ_1)/(λ_N + λ_1))² ∥x_i − x*∥²_B.  (25.1)

Exercise 25.2 (easy, instructive, solution on p. 363). Theorem 25.1
characterises the convergence of gradient descent if optimal step
sizes can be found. Many practitioners actually just set the step
size to a fixed constant, like α_i = 0.1. This exercise may help gain
an intuition for why this is problematic, and hides some underlying
assumptions. Two wheeled robots are standing on top of a steep hill.
Their task is to drive down the hill by performing “gradient
descent”. At every time step i, standing at location x_i, they
evaluate their potential energy density f(x_i) = E(x_i)/m = g · h(x_i),
and its gradient ∇f(x), then move a step of size α = 0.1 to the new
location x_{i+1} = x_i − α∇f(x_i). Here, x_i ∈ R² is the robots’ 2D
GPS co-ordinate, g is the free-fall acceleration, and h(x_i) is the
height of the ground at x_i. (m is the robot’s mass; we use energy
density rather than energy, so that this mass cancels out of all
calculations.) The first robot uses SI units. For it, the starting
point is h(x_0) = 456 m above sea level, g = 9.81 m/s², and the
initial gradient, at x_0, is ∇f(x_0) = 5 J/(kg m). The other robot
uses Imperial units. Hence, h(x_0) = 1496 ft above sea,
g = 32.19 ft/s², and the initial gradient is
∇f(x_0) = 1.03 · 10^{−5} Cal/(oz ft). How far will either robot move
in its first step, and (assuming h(x) is locally well-described by a
linear function), what is the new energy at x_1?

Two special cases to consider: in an isometric problem (all
eigenvalues equal to each other), the iteration converges in one step.
On the other hand, if the condition number κ(B) = λ_N/λ_1 is very
large, then this bound is essentially vacuous, because the constant on
the right-hand side of Eq. (25.1) is almost unity. Although these are
just bounds, they reflect the practical behaviour of gradient descent
well. The following theorem shows that these properties translate
relatively directly to the nonlinear case.

Theorem 25.3 (Convergence of steepest descent on general functions).
Consider a general, twice continuously differentiable f : R^N → R,
and assume that the iterates generated by the steepest-descent method
with exact steps converge to a point x* at which B(x*) is symmetric
positive definite with eigenvalues 0 < λ_1 ≤ ⋯ ≤ λ_N. Let c be any
number with

c ∈ ((λ_N − λ_1)/(λ_N + λ_1), 1).

Then for all i sufficiently large, it holds that

f(x_{i+1}) − f(x*) ≤ c²(f(x_i) − f(x*)).
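The contraction of Theorem 25.1 is simple to observe in code. A minimal sketch (our own, on a 2-dimensional quadratic with condition number 10) checks the per-step bound of Eq. (25.1):

```python
# Sketch: exact-line-search steepest descent on f(x) = x^T B x / 2 - b^T x
# contracts the B-norm error by at most (lam_N - lam_1)/(lam_N + lam_1).
import numpy as np

B = np.diag([1.0, 10.0])               # eigenvalues 1 and 10
b = np.zeros(2)
x_star = np.linalg.solve(B, b)
x = np.array([1.0, 1.0])
rate = (10.0 - 1.0) / (10.0 + 1.0)

def err(x):                            # ||x - x*||_B
    d = x - x_star
    return np.sqrt(d @ B @ d)

for i in range(5):
    g = B @ x - b                      # gradient
    alpha = (g @ g) / (g @ B @ g)      # exact line-search step size
    e0 = err(x)
    x = x - alpha * g
    print(err(x) <= rate * e0 + 1e-12) # Eq. (25.1) holds at every step
```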
26  Step-Size Selection – a Case Study

p(z | c, M) = p(z | M) p(c | z, M) / p(c | M)
  = N(z; µ_M, Σ_M) · δ(c − P_M z) / N(c; P_M µ_M, P_M Σ_M P_M⊺),  (26.1)
L̃(x) = r(x) + (1/M) ∑_{m=1}^{M} ℓ(ξ_{J_m}, x) ≈ L(Ξ, x).  (26.3)

Exercise. Consider the likelihood p(Y | f) = N(Y; f_X, σ²I_N), with
σ ∈ R_+ and f_X := [f(x_1), …, f(x_N)]. Assume a general Gaussian
process prior p(f) = GP(f; µ, k) on the function f. Show that the
posterior mean E_{p(f | Y,X)}(f_X) for the function values at X is the
solution to an optimisation problem involving a loss function with the
form given in Eq. (26.2), and can thus be computed with the methods
discussed in this chapter (to get the notation of that equation,
replace Y ↔ Ξ, y_i ↔ ξ_i, and f_X ↔ x). What effect does the choice
of prior p(f) have on this structure? Which other priors would retain
the structure of Eq. (26.2)? What about the likelihood p(Y | f)?
Which structure does it need to have to keep the connection to
Eq. (26.2)?

A typical approach is to draw J ⊂ [1, K] at random in an
i.i.d. fashion, and to re-draw the batch every single time the
optimiser asks for a function or gradient value. In that case,
the smaller sum is an unbiased estimator for the larger one
and, by the central limit theorem, L̃ is approximately Gaussian
distributed around L:

p(L̃(x) | L(Ξ, x)) ≈ N(L̃(x); L(Ξ, x), σ²),  (26.4)

with variance σ² ∝ 1/M. From the point of view of the optimiser,
evaluations at different x are then disturbed by independent
Gaussian noise, because the batches are re-drawn every time the
optimiser requests a value of L or its gradient. Batching thus
effectively provides a knob, which the user or the algorithm may
twist to trade off computational precision against computational
cost. (Eq. (26.4) really is an entirely quantitative, identified
object that can be explicitly used in a probabilistic numerical
method. Assume that K ≫ M, and, for simplicity, that J is drawn
uniformly from [1, K]. Further assume that the data ξ_i are drawn
i.i.d. from some measure p – e.g. from p ∝ exp(ℓ(ξ | x̄)), recalling
ℓ is the loss of a single datum and where x̄ is the “correct” value
of x. Then σ² = var(ℓ(x))/M, where var(ℓ(x)) = E_p(ℓ²(ξ, x)) −
(E_p ℓ(ξ, x))². Even if var(ℓ(x)) cannot be analytically computed
because p is unknown, it can be estimated empirically during the
batching process at low cost overhead, using the statistic
∑_j ℓ²(ξ_{J_j}, x). For a while, it was difficult to access these
quantities in standard deep learning libraries, but following work
like that of Dangel, Kunstner, and Hennig (2020), even the
established libraries are beginning to make full-batch quantities
available.) Because data set sizes K tend to be large and low-level
cache sizes are limited, the balance in this decision will often be
dominated by cost considerations. In deep-learning problems, even
signal-to-noise ratios well below one are quite common.

The Gaussian noise introduced by batching explicitly introduces
a likelihood term into the computation, and thus naturally suggests
a probabilistic treatment. Knowing that classic numerical methods
are associated with Dirac likelihoods, it is not surprising that
classic methods for optimisation, in particular the efficient ones,
tend to struggle with the noisy setting.
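The note above describes how the noise level σ² of Eq. (26.4) can be estimated from the very batch that produced the loss value. A minimal sketch of that running estimate (our own toy setup; the per-datum losses are synthetic):

```python
# Sketch: estimating the batching-noise variance of Eq. (26.4)
# from the second moment of per-datum losses within the batch.
import numpy as np

rng = np.random.default_rng(4)
K, M = 100000, 64
losses_all = rng.gamma(2.0, 1.0, K)      # stand-in for l(xi_i, x)

J = rng.integers(0, K, M)                # re-drawn batch indices
batch = losses_all[J]
L_batch = batch.mean()                   # noisy estimate of the loss
var_hat = np.mean(batch**2) - L_batch**2 # empirical var(l(x)) (biased
                                         # but cheap, as in the note)
sigma2_hat = var_hat / M                 # estimated noise of Eq. (26.4)
print(L_batch, sigma2_hat, losses_all.var() / M)
```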
▶ 26.2 Classic Line Searches

In the remainder of this section, we will study how significant
computational noise can invalidate a classic numerical routine,
and how this problem can be addressed from the probabilistic
perspective. The object of interest will be the line search, the
f′(α) := ∂f(x_i + αd_i)/∂α = d_i⊺∇f(x_i + αd_i) ∈ R.
Note that both f(α) and f′(α) are scalars. The entirety of this
section will be concerned with the problem of finding good
values for α. This all happens entirely within one “inner loop”,
with almost no propagation of state from one line search to
another. So we drop the subscript i. This is a crucial point that
is often missed at first. It means that line searches operate in
a rather simple environment, and their computational cost is
rather small. For intuition about the following results, it may
be helpful to keep in mind that a typical line search performs
between one and, rarely, 10 evaluations of f(α) and f′(α),
respectively. (We here assume that it is possible to simultaneously
evaluate both the function f and its gradient f′. This is usually
the case in high-dimensional, truly “numerical” optimisation tasks.
The theory of automatic differentiation (e.g. Griewank (2000))
guarantees that gradient evaluations can always be computed with
cost comparable to that of a function evaluation. But there are some
situations in which one of the two may be difficult to access, for
example because it has different numerical stability. The
probabilistic line search described below easily generalises to
settings in which only one of the two, or in fact any set of linear
projections of the objective function, can be computed.)

Because line searches are relatively simple yet important algorithms,
we can study them in detail. The following pages start by
constructing a non-probabilistic line search that is largely based
on versions found in practical software libraries and textbooks.
They are followed by their probabilistic extension.

▶ 26.3 The Wolfe Termination Conditions

Building a line search requires addressing two problems: Where
to evaluate the objective (the search), and when to stop it (the
termination). We will start with the latter. Intuitively, it is not
necessary for these inner-loop methods to actually find a true local
[Figure 26.1 (fragment of caption): … conditions. For this plot, the parameters were set to c_1 = 0.4, c_2 = 0.5 to get an instructive plot. These are not particularly smart choices for practical problems; see the end of §26.2.]
The derivative f̂′ is a quadratic function, which has a unique
minimum at

α_2 = (−b + √(b² − 3a f_0′))/(3a).

[Figure 26.2: Cubic spline interpolation for searching along the line. Each (top and bottom) pair of frames shows the same plot as in Figure 26.1, showing progress of the search and interpolation steps. Left: The first evaluation only allows a linear extrapolation, which requires an initial ad hoc extrapolation step. Centre: The first extrapolation (to the second evaluation) step happened to … step will be at the local minimum of the interpolant. It so happens that this point (empty square) will provide an evaluation pair that satisfies the Wolfe conditions.]
It will turn out that all three of these issues can be addressed
jointly, by casting spline interpolation as the noise-free limit of
Gaussian process regression.
Recall from §5.4, specifically Eq. (5.27), that the total solution of
the stochastic differential equation

$$\mathrm{d}\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix}\mathrm{d}\alpha + \begin{bmatrix} 0 \\ q \end{bmatrix}\mathrm{d}\omega_t, \qquad (26.11)$$

[Fragment of figure caption: … (dashed) and the marginal density (shading). The marginal distribution on f′ is a standard Wiener process (note rough, Brownian motion, samples).]
Then we can use the standard results from §4.2, and in particular
§4.4, to compute a Gaussian process posterior measure with
posterior mean function

$$\begin{bmatrix} \nu_\alpha \\ \nu'_\alpha \end{bmatrix} := \begin{bmatrix} \mu_\alpha \\ \mu'_\alpha \end{bmatrix} + \begin{bmatrix} k_{\alpha A} & k^{\partial}_{\alpha A} \\ {}^{\partial}k_{\alpha A} & {}^{\partial}k^{\partial}_{\alpha A} \end{bmatrix}\left(\begin{bmatrix} k_{AA} & k^{\partial}_{AA} \\ {}^{\partial}k_{AA} & {}^{\partial}k^{\partial}_{AA} \end{bmatrix} + \Lambda\right)^{-1}(Y - \mu_{YA}), \qquad (26.14)$$
[Fragment of figure caption: …tinuous piecewise quadratic). The lower half of the figure shows the belief over the two Wolfe conditions imposed by this Gaussian process posterior on (f, f′). From top to bottom: beliefs over variables a_α (encoding the Armijo condition) … weak or (approximately) strong conditions to hold. While this plot shows con…]
on (f, f′) is

$$p(f, f' \mid Y) = \mathcal{GP}\left(\begin{bmatrix} f \\ f' \end{bmatrix}; \begin{bmatrix} \nu \\ \nu' \end{bmatrix}, \begin{bmatrix} \kappa & \kappa^{\partial} \\ {}^{\partial}\kappa & {}^{\partial}\kappa^{\partial} \end{bmatrix}\right)$$
p(a_α ≥ 0 ∧ 0 ≤ b_α ≤ b̄)

$$= \int_{-m_\alpha^a/\sqrt{C_\alpha^{aa}}}^{\infty} \int_{-m_\alpha^b/\sqrt{C_\alpha^{bb}}}^{(\bar b - m_\alpha^b)/\sqrt{C_\alpha^{bb}}} \mathcal{N}\!\left(\begin{bmatrix} a \\ b \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_\alpha \\ \rho_\alpha & 1 \end{bmatrix}\right)\mathrm{d}a\,\mathrm{d}b. \qquad (26.17)$$

(Alternatively, one could also use the 95%-confidence lower and upper
bounds f′(0) ≲ ν′_0 + 2√({}^∂κ^∂_{00}) and f′(0) ≳ ν′_0 − 2√({}^∂κ^∂_{00})
to build more lenient or restrictive decision rules, respectively.)

Algorithm 26.1 provides pseudo-code for the thus completed
probabilistic extension of classic line searches.
[Figure: test error against initial learning rate (top), and test error against epoch (bottom), across several runs.]
S_i = (1/M) ∑_{m=1}^{M} ℓ_m²(α_i),  and  S′_i = (1/M) ∑_{m=1}^{M} (d_i⊺∇ℓ_m(α_i))²,  (26.18)
For the moment, we consider the simple (and flawed, yet popu-
lar) case of stochastic gradient descent, i.e. the optimiser given
by the update rule
x_{i+1} = x_i − α_i∇L̃(x_i) =: x_i − α_i g_M(x_i).
Let us assume we have found a good value for αi , e.g. by using
the line search algorithm described above. Here we simplified
the notation by introducing the shorthand g M ( xi ), which explic-
itly exposes the batch size M in the noisy gradient ∇L̃ M ( xi ).
By Eq. (26.3), the variance of the gradient elements scales in-
versely with M. If we can assume that the entire data set is
very large (K ≫ M) and that the batch elements are drawn
independently of each other, then the elements of g M ( xi ) are
distributed according to the likelihood
g_M(x_i) ∼ N(∇L(Ξ, x_i), Σ(x_i)/M),  (27.1)

where Σ(x) is the covariance between gradient elements,

Σ(x) := (1/K) ∑_{i=1}^{K} (∇ℓ(ξ_i, x) − ∇L(Ξ, x))(∇ℓ(ξ_i, x) − ∇L(Ξ, x))⊺.
For simplicity, we will assume that the optimiser has free control
over the batch size M. (In practice, aspects like the cache size of
the processing unit usually mean that batch sizes have to be chosen
as an integer multiple of the maximal number of data points that can
be simultaneously cached.) Deciding on a concrete value of M, the
optimisation algorithm now faces a trade-off: A large batch size
provides a more informative (precise) estimate of the true gradient,
but increases computation cost. From Eq. (27.1), the standard
deviation of g_M only drops with M^{-1/2}, but the computation cost
of course rises linearly with M. So we may conjecture that there is
an optimal choice for M. It would be nice to know this optimal
value; but since this is a very low-level consideration about a
hyperparameter of an inner-loop algorithm, an exact answer is not as
important as a cheap one. Thus, we will now make a series of
convenient assumptions to arrive at a heuristic:
First, assume that the true gradient ∇L is Lipschitz continuous
(a realistic assumption for machine learning models) with
Lipschitz-constant L. That is,

∥∇L(x) − ∇L(x_i)∥ ≤ L∥x − x_i∥  ∀x ∈ R^N.

By the resulting quadratic upper bound on L, the step
x_{i+1} = x_i − α_i g_i then guarantees the gain

L(x_i) − L(x_{i+1}) ≥ G := α_i∇L(x_i)⊺g_i − (Lα_i²/2)∥g_i∥².
Here is where the probabilistic description becomes helpful:
From Eq. (27.1), we know that E(g_i) = ∇L(x_i), and

E(∥g_i∥²) = ∥∇L(x_i)∥² + tr Σ/M,

so we can compute an expected gain from the next step of gradient
descent, as

E(G) = (α_i − Lα_i²/2)∥∇L(x_i)∥² − (Lα_i²/(2M)) tr(Σ).
Since L is usually not known a priori and the norm of the true
gradient is not accessible, we finally make a number of strongly
simplifying assumptions to arrive at a concrete heuristic that
does not add computational overhead. First, assume that ∇L
is not just Lipschitz continuous but also differentiable, and
the Hessian of L is scalar: B(x_i) ≈ hI_N. This means (a detailed
derivation is in the original paper by Balles, Romero, and Hennig
(2017)) that the Lipschitz constant is L = h and the gradient norm
can be approximated linearly as ∥∇L(x_i)∥² ≈ 2h(L(x_i) − L*), where
L* = min_{α_i} L(x_i + α_i g_i) is the loss from the optimal
stochastic gradient descent step size. This simplifies Eq. (27.3) to

M* = (α/(2 − hα)) · tr Σ/(L(x_i) − L*),

which, for α close to the optimal step size 1/h, is approximately

M* ≈ α · tr Σ/(L(x_i) − L*).
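A rule of this kind is cheap to implement. The following sketch (our own reading of the heuristic; the per-example gradients are synthetic, and the running guess for L* is our substitution for the unknown optimal loss) estimates tr Σ from the current batch and proposes the next batch size:

```python
# Sketch: adaptive batch size M* ~ alpha * tr(Sigma) / (L(x_i) - L*).
import numpy as np

def next_batch_size(alpha, grads, loss, loss_opt_guess,
                    M_min=16, M_max=4096):
    # grads: (M, N) array of per-example gradients at the current x_i
    tr_sigma = np.sum(grads.var(axis=0))     # estimate of tr(Sigma)
    gap = max(loss - loss_opt_guess, 1e-12)  # guard against division by 0
    M_star = alpha * tr_sigma / gap
    return int(np.clip(M_star, M_min, M_max))

rng = np.random.default_rng(5)
grads = rng.standard_normal((128, 10)) * 0.5 + 0.1  # toy gradients
print(next_batch_size(alpha=0.1, grads=grads,
                      loss=1.3, loss_opt_guess=0.9))
```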
…in the noisy case, we never know with certainty that this is the
case, and the optimiser may well not actually converge to a true
root of the gradient. In the empirical risk minimisation setting,
this holds when significant computational noise arises from
batching, but it is even a problem when the risk is computed
over the entire data set. The real target of the optimisation
problem is the population risk, which can fundamentally not be
accessed because we only have access to a finite data set. This
is an example where computational uncertainty and empirical
uncertainty overlap.

[Figure 27.1: Optimisation progress of stochastic gradient descent when batch sizes are controlled by the adaptive rule of Eq. (27.4). Results reproduced from Balles, Romero, and Hennig (2017). Each column of plots shows results on a specific empirical risk minimisation problem, an image recognition task on four different standard benchmark data sets (MNIST: see Figure 26.6; Street View House Numbers: Netzer et al., 2011; CIFAR10 and 100: Krizhevsky and Hinton, 2009). Top to bottom: training error (a measure of the optimisers' raw efficiency); test accuracy (a measure of generalisation); and the batch size M chosen by the optimiser over the course of the optimisation. Results are plotted against the number of read data-points (not the number of optimiser steps) for a fairer comparison. Especially for the moderately larger problems (CIFAR10 and CIFAR100), the adaptive schedule improves on fixed batch sizes, even though the locally chosen batch sizes generally lie within the range of the constant-M comparisons.]
in the population. The standard approach to this problem is
to separate the data into a training set and a separate validation
set. The optimiser only gets access to the empirical risk on the
training set (possibly sub-sampled into batches). A separate
monitor observes the evolution of the validation risk, and stops
the optimiser when the validation risk starts rising. Apart from
technicalities (reliably detecting a rise in the validation risk is
itself a noisy estimation problem), the principal downside of
this approach is that it “wastes” a significant part of the data
on the validation set. Collecting data often has a high financial
and time cost, and the data in the validation set cannot be used
by the optimiser to find a more general minimum. Even if we
ignore the issue of overfitting, we still need some way to decide
when to stop the optimiser. Many practitioners just run the
method “until the learning-curve is flat”, which is wasteful.
This section describes a simple statistical test as an alterna-
tive, which is particularly suitable for small data sets, where
constructing a (sufficiently large) validation set is not feasible. It is
based on work by Mahsereci et al. (2017), and makes explicit
use of the observation likelihood (Eq. (27.1)). Let ppop. be the
distribution from which data points are sampled in the wild,
and f be the population risk
f(x) = r(x) + ∫ ℓ(ξ, x) dp_pop.(ξ).
x_{i+1} = x_i + α_i d_i,  α_i ∈ R, d_i ∈ R^N,
Section 28.3 will discuss a second type of rules that also consider
interactions between gradient elements. This classification is pri-
marily a computational consideration, not an analytic one. Some
of the methods in this first class converge asymptotically faster
than gradient descent; virtually all the methods in the second
class converge slower than Newton’s method. But element-wise
rules scale more readily to very high-dimensional problems.
This is one of the reasons why they are currently the popular
choice in machine learning, where the number N of parameters
to be optimised is frequently in the range beyond 106 .
Nesterov's accelerated method is a variant of momentum; momentum
methods themselves can be motivated as discretising the dynamics of a
massive particle with friction,

m ẍ(t) = −κ ẋ(t) − ∇f(x(t)).  (28.1)

…what the real gradient may be. The optimiser can just check.
now a fundamental choice to be made. The conceptually cleaner
path is to define a consistent Gaussian process model over the
input domain R^N. (The discussion in §28.3 will show that including
these constraints in a computationally efficient way is not
straightforward. Further discussion of these issues can be found in
Hennig (2013).) However, this would impose the usual cubic cost in
i, the number of the optimiser's iterations, a problem that would
require further approximations to fix. We also know that the
optimiser will only ever ask for gradients along its trajectory,
which forms a univariate curve, albeit no linear sub-space of R^N.
For this reason, we make another leap of faith and treat the
individual gradient observations f_n′(x_i) as separate univariate
time series f_n′(t_i) for t_i ∈ R. Inference can then be addressed
with the Kalman filter (§5). This raises the question of how the
multivariate input x_i should be transformed into a scalar t: is the
difference from one optimisation step to another unit (t ← t + 1)
or does it have a length? For the purposes of this section, we will
use the latter option, and set t_{i+1} = t_i + τ with
τ := ∥x_{i+1} − x_i∥. With this, we can consider two basic choices
for the SDE defining the Kalman filter: the Wiener process and the
Ornstein–Uhlenbeck process. Here we have allowed for separate drift
(γ_n) and diffusion (θ_n) scales for each element of the gradient.
They translate into the Kalman filter parameters (see Algorithm 5.1
and Eq. (5.21))
x_i = x_{i−1} + α_i m_{i−1}.
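To make the filtering idea concrete, the following is a minimal sketch (our own, for one gradient coordinate) of a scalar Kalman filter with Ornstein–Uhlenbeck dynamics; discretised over a step of length τ, the OU process with drift γ and diffusion θ has transition A = exp(−γτ) and process noise Q = θ²/(2γ)(1 − exp(−2γτ)), and the observation noise R plays the role of the batching variance of the gradient element.

```python
# Sketch: smoothing a noisy gradient coordinate with an OU Kalman filter.
import numpy as np

def ou_kalman_smooth(obs, taus, gamma, theta, R):
    m, P = 0.0, theta**2 / (2 * gamma)    # stationary prior
    means = []
    for y, tau in zip(obs, taus):
        A = np.exp(-gamma * tau)          # predict
        Q = theta**2 / (2 * gamma) * (1 - np.exp(-2 * gamma * tau))
        m, P = A * m, A * A * P + Q
        K = P / (P + R)                   # update with noisy gradient y
        m, P = m + K * (y - m), (1 - K) * P
        means.append(m)
    return np.array(means)

rng = np.random.default_rng(6)
true = np.sin(np.linspace(0, 3, 50))                # slowly varying signal
obs = true + 0.3 * rng.standard_normal(50)          # batching noise
print(ou_kalman_smooth(obs, np.full(50, 0.1),
                       gamma=1.0, theta=1.0, R=0.09)[-5:])
```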
iterate x_i,

f(x_i + d) ≈ f(x_i) + d⊺∇f(x_i) + ½d⊺B(x_i)d.

(Recall that B is the notation for the Hessian of f.) If the Hessian
is symmetric positive definite (if f is locally convex), then this
quadratic approximation has a unique minimum, which defines
the next Newton iterate,

x_{i+1} = x_i − B^{-1}(x_i)∇f(x_i).
scale problems is not its stability, but the need to compute the
Hessian B(x_i) and invert it (or rather, to solve the linear problem
B(x_i)z = ∇f(x_i)). Quasi-Newton methods (a great contemporaneous
review with extensive analysis and discussion can be found in Dennis
and Moré (1977)) are one way to address the computational cost of
Newton's method by constructing an approximation B̂(x_i) to B(x_i).
They are based on the observation that each pair of subsequent
gradient observations [∇f(x_i), ∇f(x_{i−1})] collected by the
optimiser provides information about the Hessian function, because
the Hessian is the rate of change of the gradient, or more precisely:

y_i = B̂s_i,  or equivalently  s_i = Ĥy_i.  (28.10)
Name | c_i | Reference
Symmetric Rank-1 (SR1) | c_i = y_i − B_{i−1}s_i | Davidon (1959)
Powell Symmetric Broyden | c_i = s_i | Powell (1970)
Greenstadt's method | c_i = B_{i−1}s_i | Greenstadt (1970)
DFP | c_i = y_i | Davidon (1959); Fletcher & Powell (1970)
BFGS | c_i = y_i + √(y_i⊺s_i/(s_i⊺B_{i−1}s_i)) B_{i−1}s_i | Broyden (1969); Fletcher & Powell (1970); Goldfarb (1970); Shanno (1970)

Table 28.1: The most popular members of the Dennis family, Eq. (28.11), defined by their choice of c_i (middle column); see Martinez R. (1988) for more details. Note that the names DFP and BFGS consist of the first letters of the names of their inventors (right column).

This situation sounds familiar, and indeed it is closely related
to the setup discussed at length in Chapter III on linear algebra.
Here as in the earlier chapter, an algorithm collects linear
projections of some matrix, and has to estimate that matrix
This is also the reason why there is not just one ‘best’ quasi-Newton
method. In contrast to the linear setting, where the method of
conjugate gradients is a contestant for the gold standard for spd
problems, there are entire families of quasi-Newton methods. A widely
studied one is the Dennis family (Dennis 1971) of update rules of the
form

B_{i+1} = B_i + ((y_i − B_is_i)c_i⊺ + c_i(y_i − B_is_i)⊺)/(c_i⊺s_i) − (c_i s_i⊺(y_i − B_is_i) c_i⊺)/(c_i⊺s_i)²,  (28.11)
where ci ∈ R N is a parameter that determines the concrete
member of the family. Quasi-Newton methods were the subject
of intense study from the late 1950s to the late 1970s. The most
widely used members of the Dennis family are presented in
Table 28.1. Among these, the BFGS method is arguably the most
popular in practice, but this should not tempt the reader to
ignore the other ones. This is particularly true for problems
with noisy gradients.
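The family update of Eq. (28.11) is compact enough to verify directly. A minimal sketch (our own check, with arbitrary test vectors) implements it for the table's choices c_i = y_i (DFP) and c_i = y_i − B_is_i (SR1), and confirms that the updated matrix satisfies the secant condition B_{i+1}s_i = y_i, as it must for any valid c_i:

```python
# Sketch: the Dennis-family update, Eq. (28.11), and the secant check.
import numpy as np

def dennis_update(B, s, y, c):
    d = y - B @ s                    # residual of the secant equation
    cs = c @ s
    return (B + (np.outer(d, c) + np.outer(c, d)) / cs
              - (s @ d) * np.outer(c, c) / cs**2)

rng = np.random.default_rng(7)
N = 5
B = np.eye(N)
s = rng.standard_normal(N)
y = rng.standard_normal(N)

for name, c in [("DFP", y), ("SR1", y - B @ s)]:
    B_new = dennis_update(B, s, y, c)
    print(name, np.allclose(B_new @ s, y))   # secant condition holds
```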
p(B) = N(B; B_i, W_i ⊛ W_i)
So far so good, but now, in the final step to complete the con-
nection so that the next iteration still behaves like the Den-
nis method, we have to implicitly force the filter to add, in
the prediction step, just the right terms to P_1 so that P_2^− =
A_1P_1A_1⊺ + Q_1 = W_2^− ⊛ W_2^−, with an spd matrix W_2 that yields
the required match W_2s_2 = c_2 to the Dennis family. It turns out
that such a step does not always exist, because the necessary
update W2 − W1 is not always symmetric positive definite. So,
beyond the special case of the linear problems discussed at
length in Chapter III, we cannot hope to find a general and
one-to-one interpretation of existing quasi-Newton methods as
Kalman filtering models.
existing evaluations are few, like stars dotted in the void. The
challenge of exploration is hence in sifting through the mass of
uncertainty to find the evaluation that best promises potential
reward. This is a challenge common to many aspects of intelli-
gence: think of artistic creativity, or venture capital, or simply
finding your lost keys. When humans explore, we draw upon
some of our most profoundly intelligent faculties: we theorise,
probe and map. As such, exploration for global optimisation
motivates sophisticated algorithms.
Relative to local optimisation, global optimisation typically: …
▶ 31.1 Prior
These are not the only plausible candidate losses; we will meet alternatives below. Crucial to distinguishing these losses is a careful treatment of the end-point of the optimisation. The loss function must make precise what is to happen to the set of obtained objective evaluations once the procedure ends, and how valuable this outcome truly is. One crucial question is that of when our algorithm must terminate. Termination might be upon the exhaustion of an a priori fixed budget of evaluations, or, alternatively, when a particular criterion of performance or convergence is reached. The former assumption of a fixed budget of N evaluations is the default within Bayesian optimisation, and will be taken henceforth.

We present in Figure 31.2 the decision problem for Bayesian optimisation. We seek to illustrate the iterative nature of optimisation and its final termination. In particular, the terminating condition for optimisation will often require us to select a single point in the domain to be returned (we will regard this final point as additional to our permitted budget of N evaluations): we will denote this point as x_N. At the termination of the algorithm, we will define the full set of evaluation pairs gathered as D_N := {(x_i, y_i) | i = 0, …, N − 1}. Here the ith evaluation is y_i = f(x_i). We will assume, for now, that evaluations are exact, hence noiseless. The returned point will often be limited to the set of evaluation locations, x_N ∈ D_N, but this need not necessarily be so.⁸

⁸ In the absence of noise, limiting to the set of evaluation locations enforces the constraint that the returned function value (the putative minimum) is known with complete confidence. This is not unreasonable; however, in some settings, the user may be satisfied with a more diffuse probability distribution over the returned value: such considerations, of course, motivate the broader probabilistic-numerics vision. It is worth noting that the limitation to the set of evaluation locations does not permit returning unevaluated points, even if their values are known exactly. As an example where this is important, consider knowing that a univariate objective is linear: then, any pair of evaluations would specify exactly the minimum, on one of the two edges of a bounded interval. In such a case, would we really want to require that this minimum could not be returned until it had been evaluated?

The importance of the loss function can be brought out through consideration of the consequences of the terminal decision of the returned point, x_N. With our notation, the loss function

$$\alpha(x_n \mid \mathcal{D}_n) = \mathbb{E}\big[\lambda(x_n, y_n, \mathcal{D}_n)\big] = \int \lambda(x_n, y_n, \mathcal{D}_n)\; p(y_n \mid \mathcal{D}_n)\, \mathrm{d}y_n.$$
$$\eta := \min_{i \in \{0, \dots, n-1\}} f(x_i),$$
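To make the expectation concrete: when the predictive p(y_n | D_n) is Gaussian, the acquisition can be evaluated by simple one-dimensional quadrature. A minimal sketch, where `loss_fn` (here a function of x_n and y_n only), `mean_fn` and `var_fn` are hypothetical stand-ins for the chosen loss λ and the surrogate model's predictive moments:

```python
import numpy as np
from scipy.stats import norm

def acquisition(x_n, loss_fn, mean_fn, var_fn, num_grid=2001):
    # alpha(x_n | D_n) = E[lambda(x_n, y_n, D_n)] under the Gaussian
    # predictive p(y_n | D_n) = N(mean_fn(x_n), var_fn(x_n)).
    mu, sd = mean_fn(x_n), np.sqrt(var_fn(x_n))
    y = np.linspace(mu - 6 * sd, mu + 6 * sd, num_grid)
    return np.trapz(loss_fn(x_n, y) * norm.pdf(y, mu, sd), y)
```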
after the (n + 1)th step. That is, kg considers the posterior mean
after the upcoming next step.
This modification offers much potential value. Valuing im-
provement in the posterior mean rather than in the evaluations
directly eliminates the need to expend a sample simply to re-
turn a low function value that may already be well-resolved
by the model. For instance, if the objective is known to be a
quadratic, the minimum will be known exactly after (any) three
evaluations, even if it has not been explicitly evaluated. In this
setting, evaluating at the minimum, as would be required by ei,
is unnecessary.
kg does introduce some risk relative to ei, however. Note
that the final value f ( x N ) may not be particularly well-resolved
after the (n + 1)th step: the posterior variance for f ( x N ) may
be high. kg, in ignoring this uncertainty, may choose x N such
that the final value f ( x N ) is unreliable. That is, the final value
returned, f ( x N ), may be very different from what the optimiser
expects, m N +1 ( x N ).
The kg loss is hence the final value revealed at the minimiser
of the posterior mean after the next evaluation. Let’s define that
minimiser as

$$\check{x}_{n+1} := \arg\min_{x'} m_{n+1}(x'),$$

where the posterior mean function,

$$m_{n+1}(x') := \mathbb{E}\big[f(x') \mid \mathcal{D}_{n+1}\big],$$

takes the convenient form of Eq. (4.6) for a gp. The kg loss can now be written as

$$\int \lambda_{\mathrm{kg}}\; p\big(f(\check{x}_{n+1}), f(x_n) \mid \mathcal{D}_n\big)\, \mathrm{d}f(\check{x}_{n+1})\, \mathrm{d}f(x_n) = \int \min_{x'} m_{n+1}(x')\; p\big(f(x_n) \mid \mathcal{D}_n\big)\, \mathrm{d}f(x_n).$$
$$\eta_n := \min_{i \in \{0, \dots, n-1\}} f(x_i).$$
(That said, an interesting re-interpretation of the ucb criterion is provided by Jones, 2001.)

Let us now return to the two alternatives to the value loss (vl) developed in detail above: the location-information loss (lil) and …
$$\alpha_{\mathrm{iago}}(x_n) = \alpha_{\mathrm{es}}(x_n) = \mathbb{E}\big[\lambda_{\mathrm{lil}}(\mathcal{D}_{n+1})\big] = \int \mathrm{H}(x_* \mid \mathcal{D}_{n+1})\; p(y_n \mid x_n, \mathcal{D}_n)\, \mathrm{d}y_n =: \mathbb{E}_{y_n} \mathrm{H}(x_* \mid y_n, x_n, \mathcal{D}_n).$$
$\mathbb{E}_{y_n} \mathrm{H}(x_* \mid y_n, x_n, \mathcal{D}_n)$ is a conditional entropy: the expected entropy in $x_*$ after an observation $y_n$ whose value is currently unknown.
Predictive entropy search (pes; Hernández-Lobato et al., 2015) is an alternative acquisition function derived from the lil. It first notes that

$$\arg\min_{x_n} \mathbb{E}_{y_n} \mathrm{H}(x_* \mid y_n, x_n, \mathcal{D}_n) = \arg\max_{x_n} \Big[ \mathrm{H}(x_* \mid \mathcal{D}_n) - \mathbb{E}_{y_n} \mathrm{H}(x_* \mid y_n, x_n, \mathcal{D}_n) \Big],$$
$$\alpha_{\mathrm{opes}} = \alpha_{\mathrm{mes}} = -\mathrm{H}(y_n \mid x_n, \mathcal{D}_n) + \mathbb{E}_{y_*} \mathrm{H}(y_n \mid y_*, x_n, \mathcal{D}_n).$$
cost, e.g. the error achieved per second (Snoek, Larochelle, and Adams, 2012). The arguments of the objective, such as hyperparameters, might include regularisation penalties (parameters of the prior), architecture choices, and the parameters of internal numerics procedures (such as learning rates). Conveniently, there are often not more than 10 or 20 such hyperparameters that are known to be important: the dimensionality of such problems is compatible with Bayesian optimisation. In real-world cases, these hyperparameters have historically been selected manually by practitioners: it is not difficult to make the case for the automated alternative provided by Bayesian optimisation. As such, Bayesian optimisation is a core tool in the quest for automated machine learning (see, e.g., www.ml4aad.org/automl and autodl.chalearn.org). As one example, Bayesian optimisation was used to tune the …
▷ 34.3.1 Software
This removal of the flow map Φ from Eq. (37.5) to Eq. (37.6) means that the solver, after every completed step, falsely assumes that its current estimate x̂(t) is the true x(t) – a property of classical numerics referred to as uncertainty-unawareness (for more details on uncertainty-(un)awareness in numerics, see §1 in Kersting, 2020). To satisfy this overly optimistic internal assumption of the solver, one would have to replace x̂(t) by the exact x(t) in the entire data set, Eq. (37.6). Iterated local Hermite interpolation on this more informative (but, to the solver, inaccessible) data set indeed yields a more accurate regression of x – which is numerically demonstrated in Figure 37.2 for the popular fourth-order Runge–Kutta method (RK4).
As a remedy, we can build more "uncertainty-aware" numerical solvers by modelling the ignored uncertainty with probability distributions, that is, by adding appropriate noise to the Hermite extrapolation performed by classical ODE solvers (as in the generic GP regression from §4.2.2).
But our probabilistic, regression-based view of numerics will
lead us further than that, beyond the conventional categories
of single-step and multi-step methods. To see how, let us first
recall that classical solvers iterate local Hermite interpolations
on t _ t + h using the data set from Eq. (37.6) for each respective
[Figure 37.2: iterated fourth-order Hermite interpolation with exact data (i.e. x(t) instead of x̂(t) in Eq. (37.6)) on the linear and the Van der Pol ODE. The linear system is given by x′(t) = x(t), x(0) = 1.0.]
[Figure 37.3: Top row: classical solvers construct an extrapolation x̂(t_i) (solid black circle) which, in this example, is also used as the probing point x̃(t_i) to construct an observation y_i = f(x̃(t_i), t_i). Bottom row: probabilistic solvers do the same, but return a probability measure p(x(t)) (grey delineated shading, with lightly coloured samples) rather than the point estimate x̂(t) (although, for example, the mean of p could be used as such an estimate). For a well-calibrated classic solver, the estimate x̂ should lie close to the true solution. The same applies for the mean (or mode) estimate of a Gaussian probabilistic solver; but, additionally, the width (standard deviation) of the posterior measure should be meaningfully related to the true error. In the case of the (nonparametric) perturbative solvers, the resulting samples should accurately capture the entire distribution of numerically possible trajectories, e.g. by covering both sides of a bifurcation (Figure 38.2).]

… evident via an intuitive (but less general) ssm in §38.3.4. Bayesian regression of x(t) in such a ssm is then performed by methods known as ODE filters and smoothers. The difference between a classical and a probabilistic solver is visualised in Figure 37.3.
John Skilling (1991) was the first to recognise that ODEs can, and perhaps should, be treated as a Bayesian (GP) regression problem. But two decades passed before, in parallel development, Hennig and Hauberg (2014) and Chkrebtii et al. (2016) set out to elaborate on his vision. While both papers used GP regression as a foundation, the data generation differed. Hennig and Hauberg (2014) generated data by evaluating f at the posterior predictive mean, and Chkrebtii et al. (2016) by evaluating f at samples from the posterior predictive distribution, i.e. at Gaussian perturbations of the posterior predictive mean. This difference stemmed from separate motivations: Hennig and Hauberg had the initial aim to deterministically reproduce classical ODE solvers in a Bayesian model, as had been previously achieved in e.g. Bayesian quadrature. Chkrebtii et al., on the other hand, intended to sample from the distribution of solution trajectories that are numerically possible given a Bayesian model and a discretisation. Thus, these two papers founded two distinct lines of work, which we call ODE filters and smoothers and perturbative solvers;⁵ see §38 and §40 respectively.

⁵ This is not the only categorisation of probabilistic ODE solvers. Another possible distinction would be nonparametric vs Gaussian, or deterministic vs randomised, both of which would group the particle ODE filter/smoother with the perturbative solvers. Note that the perturbative solvers have been called "sampling-based" solvers in several past publications.

The former approach, after an early success of reproducing Runge–Kutta methods (Schober, Duvenaud, and Hennig, 2014),
$$x(t) = H_0\, \vec{x}(t), \qquad\text{and}\qquad x'(t) = H\, \vec{x}(t). \qquad (38.3)$$
Hence, a prior p(\vec{x}) on \vec{x} immediately implies a joint prior p(x, x′) whose marginal p(x) is the prior on the ODE solution x. Due to \vec{x}(t) ∼ X(t), this prior distribution p(\vec{x}) is nothing but the law of X(t), which (as in Definition 5.4) we define by a linear time-invariant SDE, written

$$p(\vec{x}(t)) = \mathcal{N}\big(\vec{x}(t);\; A(t) m_0,\; A(t) P_0 A(t)^\intercal + Q(t)\big), \qquad (38.6)$$
$$g : \mathbb{R}^D \to \mathbb{R}^d, \quad \xi \mapsto H\xi - f(H_0 \xi), \qquad (38.7)$$
$$p(x_0) = \mathcal{N}(x_0;\, m_0,\, P_0), \qquad (38.10)$$
$$p(x_{n+1} \mid x_n) = \mathcal{N}\big(x_{n+1};\; A(h_{n+1})\, x_n,\; Q(h_{n+1})\big), \qquad (38.11)$$
$$p(z_n \mid x_n) = \delta\big(z_n - g(x_n)\big), \qquad (38.12)$$
$$\text{with data } z_n = 0. \qquad (38.13)$$

It resembles the linear-Gaussian ssm from Eqs. (5.8)–(5.9) – with the only difference that H_n x_n is now replaced by g(x_n), and R_n = 0 in Eq. (38.12), which makes it nonlinear.⁵ As this is a complete and …

⁵ It might, however, be advantageous to add a positive variance R > 0 to Eq. (38.12), e.g. to facilitate particle filtering or to account for an inexact vector field f. This leads to a likelihood of the form $p(z_n \mid x_n) = \mathcal{N}(z_n;\, g(x_n),\, R)$.
Note, however, that this ssm has only been known since its introduction by Tronarp et al. (2019); all preceding publications employed a less general linear-Gaussian ssm, which we will also define below in Eqs. (38.31)–(38.33). The new nonlinear ssm (38.10)–(38.13), instead, leaves the task of finding approximations to the inference algorithm and, in this way, engenders both Gaussian (§38.3) and non-Gaussian inference methods (§38.4).
This difference between the SSMs in the literature is at the
heart of a common source of confusion over the new ssm (38.10)–
(38.13): to some readers, it might appear that the constant data z_n = 0 contains no information whatsoever. But this is mistaken, because the use of information does not only depend on the data but also on the likelihood. While in the regression formulation of classical solvers (§37) the data was an evaluation of f, this dependence on f is now hidden in the likelihood via the definition of g, Eq. (38.7). Since g(x_n) is by construction equal to 0 for the true x_n = \vec{x}(t_n), the observation of the constant data z_n = 0 amounts, by the form of the likelihood, Eq. (38.12), to "conditioning on the ODE" by imposing that $x'(t_n) \overset{!}{=} f(x(t_n))$ –
which is similar to Eq. (37.7). In §38.3.4, we will explain how
the alternative linear-Gaussian ssm echoes the logic of classical
solvers.
The last row can still be set flexibly by choosing the scale σ > 0
of the Wiener process and the non-negative drift coefficients
( a0 , . . . , aq ) ≥ 0 – which parametrise the Matérn covariance
family with ν = q + 1/2, as we saw in §5.5.
Although Matérn priors are popular for GP regression, they have (in their general form) not yet been explored for ODE filtering. Only the special case of (a_0, …, a_{q−1}) = 0 has been studied, where the only free parameter is a_q ≥ 0. In this case, X(t) is the q-times integrated Ornstein–Uhlenbeck process with (mean-reverting) drift coefficient a_q. While this prior can be advantageous for some exponentially decreasing curves (such as radioactive decay; Magnani et al., 2017), it is, to date, not known whether these advantages extend to more ODEs.
Meanwhile, the q-times IWP, which sets ( a0 , . . . , aq ) = 0, has
become the standard prior for ODEs because the q-times IWP
extrapolates (as we saw in §5.4) by use of polynomial splines of
degree q. And this polynomial extrapolation also takes place for
the derivatives: under the q-times IWP prior, the ith mean of the
dynamic model (38.11) is, by Eq. (5.24), for all i = 1, . . . , q + 1
given by

$$[A(h_{n+1})\, x_n]_i = \sum_{k=i}^{q+1} \frac{h_{n+1}^{k-i}}{(k-i)!}\, [x_n]_k, \qquad (38.14)$$

i.e. by a (q + 1 − i)th-order Taylor-polynomial extrapolation. In particular, the solution state (i = 1) is predicted forward by a qth-order Taylor expansion, which is, by Taylor's theorem (absent additional information about x), the best local model; this insight will be the basis of the convergence-rates analysis of §39.1.1. Note that this is in keeping with classical solvers, which – in light of Eq. (37.3) – extrapolate forward along Taylor polynomials of the flow of the ODE (accordingly, all known equivalences with classical models hold for the IWP prior; see Schober, Särkkä, and Hennig, 2019).

Therefore, it is only natural that the IWP is the standard prior for ODEs, and that any deviation from it requires specific prior knowledge on the solution x : [0, T] → ℝ^d. Hence, the utility of adapting prior-selection strategies from GP regression depends on how much knowledge on x can be extracted from f. With this in mind – to draw from the full inventory of GP priors – one can even go beyond the Matérn class and use state-space approximations of non-Markov covariance functions (for a comprehensive overview of such approximations, see §12.3 in Särkkä and Solin, 2019). The frequent case of periodic ODEs (oscillators) can, following earlier work on GP regression (Solin and Särkkä, 2014), be modelled by such a state-space approximation of the periodic covariance functions (Kersting and Mahsereci, 2020). Remarkably, this model extrapolates with a Fourier (instead of a Taylor) expansion – which is indeed how a periodic signal x is usually approximated. Unfortunately, Fourier series are (unlike Taylor series) global models, and therefore the utility of this periodic model is (so far) limited to fast-and-rough extrapolations with large step sizes after an initial learning period. It remains to be seen whether this (or another) radical deviation from the Taylor-expansion logic of classical numerics can yield solvers that compete with probabilistic solvers that use the IWP prior.
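To make Eq. (38.14) concrete: the dynamic model of the q-times IWP is a small upper-triangular Taylor matrix. A minimal NumPy sketch (our own helper, zero-based indexing, so entry (i, k) is h^{k−i}/(k−i)! for k ≥ i):

```python
import numpy as np
from math import factorial

def iwp_transition(q, h):
    # A(h) for the q-times integrated Wiener process, cf. Eq. (38.14).
    A = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for k in range(i, q + 1):
            A[i, k] = h ** (k - i) / factorial(k - i)
    return A
```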
▷ 38.2.1 Initialisation
$$x^{(i)}(0) = f^{\langle i \rangle}(x_0), \qquad (38.15)$$
with the recursively defined f^{⟨i⟩} from Eq. (37.4). Hence, the ideal initialisation is

$$m_0 = \big[x_0,\, f(x_0),\, f^{\langle 2 \rangle}(x_0),\, \dots,\, f^{\langle q \rangle}(x_0)\big]^\intercal \in \mathbb{R}^{q+1}, \qquad P_0 = 0 \in \mathbb{R}^{(q+1)\times(q+1)}.$$
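This ideal initialisation can be computed by forward-mode automatic differentiation. A minimal JAX sketch for q = 2 and a scalar, hypothetical vector field `f` (logistic growth), using the identity f^{⟨2⟩}(x_0) = J_f(x_0) f(x_0) for autonomous ODEs:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x * (1.0 - x)  # hypothetical autonomous vector field

def exact_init_q2(f, x0):
    # m0 = [x0, f(x0), f^<2>(x0)] as in Eq. (38.15); the second derivative
    # f^<2>(x0) = J_f(x0) f(x0) comes from a Jacobian-vector product.
    fx0 = f(x0)
    _, f2 = jax.jvp(f, (x0,), (fx0,))
    m0 = jnp.stack([x0, fx0, f2])
    P0 = jnp.zeros((3, 3))  # the point-mass initialisation P0 = 0
    return m0, P0

m0, P0 = exact_init_q2(f, jnp.asarray(0.5))
```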
Algorithm 38.1: Bayesian ODE filtering iteratively computes a sequence of predictive and filtering distributions. Recall from the graphical model of filtering (Figure 5.1) (with z instead of y) that the sequential form of this inference procedure (i.e. the for-loop) is legitimate. The form of the computations in lines 6–8 depends on the choice of filter. The initialisation (line 2) is explained in §38.2.1. The optional (but recommended) lines 4 and 5 are detailed in §38.5.

1   procedure ODE filter(f, x(0), p(x_{n+1} | x_n))
2       initialise p(x_0) with available information about x(0)
3       for n = 0 : 1 : N − 1 do
4           optional: adapt dynamic model p(x_{n+1} | x_n)
5           optional: choose step size h_n > 0
6           predict p(x_{n+1} | z_{1:n}) from p(x_n | z_{1:n}) by (38.11)
7           observe the ODE: z_{n+1} = 0 according to (38.13)
8           update p(x_{n+1} | z_{1:n+1}) from p(x_{n+1} | z_{1:n}) by (38.12)
9       end for
10      return {p(x_n | z_{1:n}); n = 0, …, N}
11  end procedure
Algorithm 38.2: Bayesian ODE smoothing extends Alg. 38.1 by iteratively updating its output, the filtering distributions p(x_n | z_{1:n}), to the full posterior p(x_n | z_{1:N}). Note that, in line 3, the filter additionally returns the posterior predictive distributions p(x_n | z_{1:n−1}), which it computes for all n as an intermediate step anyway; see line 6 of Alg. 38.1.

1   procedure ODE smoother(f, x(0), p(x_{n+1} | x_n))
2       {p(x_n | z_{1:n}), p(x_n | z_{1:n−1})}_{n=0,…,N} =
3           ODE filter(f, x(0), p(x_{n+1} | x_n))
4       for n = N − 1 : −1 : 0 do
5           compute p(x_n | z_{1:N}) from p(x_{n+1} | z_{1:N}) by (38.26)
6       end for
7   end procedure
[Figure 38.1: Depiction of the first step of the EKF0 with 2-times IWP prior, initialised at x_0 and the implied derivatives as in (38.15). In the prediction step (left column), the predictive distribution p(x_1) is computed by extrapolating forward in time along the dynamic model. The samples can be thought of as different possibilities for the trajectory of x(t), x′(t) and x″(t) (from the top to the bottom row). Then, in the update step (right column), the predictive distribution is conditioned on x′(t_1) = f(m_1^−).]
$$p(x_n \mid z_{1:n}) = \mathcal{N}(x_n;\, m_n,\, P_n), \qquad (38.16)$$
$$p(x_{n+1}) = \mathcal{N}(x_{n+1};\, m^-_{n+1},\, P^-_{n+1}), \quad\text{with} \qquad (38.17)$$
$$m^-_{n+1} = A(h_{n+1})\, m_n, \qquad P^-_{n+1} = A(h_{n+1})\, P_n\, A(h_{n+1})^\intercal + Q(h_{n+1}).$$
$$\hat{z}_{n+1} := f(H_0 m^-_{n+1}) - H m^-_{n+1}, \qquad \text{(innovation residual)} \qquad (38.18)$$
$$S_{n+1} := \tilde{H} P^-_{n+1} \tilde{H}^\intercal + R_{n+1}, \qquad \text{(innovation covariance)} \qquad (38.19)$$
$$K_{n+1} := P^-_{n+1} \tilde{H}^\intercal S^{-1}_{n+1}, \qquad \text{(gain)} \qquad (38.20)$$
$$m_{n+1} := m^-_{n+1} + K_{n+1} \hat{z}_{n+1}, \qquad (38.21)$$
$$P_{n+1} := (I_D - K_{n+1} \tilde{H})\, P^-_{n+1}. \qquad (38.22)$$
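A minimal NumPy sketch of one such step for a scalar ODE (d = 1), following Eqs. (38.17)–(38.22) with the zeroth-order linearisation H̃ = H of the EKF0; the function name and arguments are ours (H_0 and H are the projection matrices of Eq. (38.3)):

```python
import numpy as np

def ekf0_step(m, P, A, Q, H0, H, f, R=0.0):
    # prediction step, Eq. (38.17)
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # update step, Eqs. (38.18)-(38.22), with H-tilde = H (EKF0)
    z_hat = f(H0 @ m_pred) - H @ m_pred      # innovation residual
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    m_new = m_pred + K @ z_hat
    P_new = P_pred - K @ H @ P_pred          # (I - K H) P_pred
    return m_new, P_new
```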
$$p(x_n \mid z_{1:N}) = p(x_n \mid z_{1:n}) \int \frac{p(x_{n+1} \mid x_n)\; p(x_{n+1} \mid z_{1:N})}{p(x_{n+1} \mid z_{1:n})}\, \mathrm{d}x_{n+1}, \qquad (38.26)$$
which (in our ssm) does not differ between the EKF0 and EKF1, because their dynamic model p(x_{n+1} | x_n) is the same.

With this in mind, we define the (zeroth- and first-order) extended Kalman ODE smoothers, EKS0 and EKS1,²⁹ as the instances of Algorithm 38.2 that employ the EKF0 or EKF1 in line 3 and then compute line 5 by Eqs. (38.23)–(38.25). The resulting smoothing-posterior distributions p(x_n | z_{1:N}) = N(x_n; m_n^s, P_n^s) can be extended beyond the time grid {t_n}_{n=0}^N by interpolation along the dynamic model, Eq. (38.11), and therefore contain the same information as the full GP posterior of Eqs. (4.7) and (4.6) (this was also discussed above for generic Gaussian smoothers in §5.2).

²⁹ In some recent publications, the EKS0 and EKS1 are referred to as EK0 and EK1 – because smoothing has become the default (see §38.6).
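Eqs. (38.23)–(38.25) are not reproduced in this excerpt; the backward pass is commonly implemented in the standard Rauch–Tung–Striebel form, which the following sketch (our own naming, a sketch under that assumption) uses:

```python
import numpy as np

def rts_step(m_filt, P_filt, m_pred, P_pred, A, m_s_next, P_s_next):
    # One backward smoothing step: combine the filtering moments at step n
    # with the smoothed moments at step n+1 through the smoother gain G.
    G = P_filt @ A.T @ np.linalg.inv(P_pred)
    m_s = m_filt + G @ (m_s_next - m_pred)
    P_s = P_filt + G @ (P_s_next - P_pred) @ G.T
    return m_s, P_s
```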
Remark (Relation to Bayesian quadrature). Before introducing more ODE filters, let us briefly clarify the relation to Bayesian quadrature (BQ) – namely, that the EKF0/EKS0 is a generalisation of BQ in the following sense: if the ODE is really just an integral (i.e. x′(t) = g(t)), then its solution is given by

$$x(t) = x_0 + \int_0^t g(s)\, \mathrm{d}s. \qquad (38.27)$$
$$x_*(t) := H_0\, \vec{x}_*(t), \qquad (38.30)$$
$$p(x_0) = \mathcal{N}(x_0;\, m_0,\, P_0), \qquad (38.31)$$
$$p(x_{n+1} \mid x_n) = \mathcal{N}\big(x_{n+1};\; A(h_{n+1})\, x_n,\; Q(h_{n+1})\big),$$
$$p(y_n \mid x_n) = \mathcal{N}(y_n;\, H m^-_n,\, R_n), \qquad (38.32)$$
$$\text{with data } y_n = f(H_0 m^-_n), \qquad (38.33)$$
… P^-_n = A(h_n) P_{n−1} A(h_n)^⊺ + Q(h_n). Recall that, by Eq. (38.3), the predictive normal distributions over x(t_n) and ẋ(t_n) are now given by

$$p(x(t_n) \mid y_{1:n-1}) = \mathcal{N}\big(x(t_n);\; H_0 m^-_n,\; H_0 P^-_n H_0^\intercal\big), \quad\text{and}$$
$$p(\dot{x}(t_n) \mid y_{1:n-1}) = \mathcal{N}\big(\dot{x}(t_n);\; H m^-_n,\; H P^-_n H^\intercal\big). \qquad (38.34)$$
[Figure 38.2: Bifurcation detection by use of particle ODE filtering: bifurcating ODE flow (left panel) and particle-filtering representation (right panel). We consider the Bernoulli ODE x′(t) = rx(t)(1 − |x(t)|).]
… and EKF1.

Another approach to local error estimation and step-size adaptation in SSMs for probabilistic differential equations was derived from Bayesian statistical design (Chkrebtii and Campbell, 2019) for the perturbative method by Chkrebtii et al. (2016).
are much faster and more stable than the non-Gaussian ones (particle filtering). Among the Gaussian ones, the first-order versions (EKF1 and EKS1) make use of the Jacobian of f (available by automatic differentiation; see Griewank and Walther, 2008, §13). This tends to produce a more precise mean with better-calibrated uncertainty. Moreover, smoothing returns (unlike filtering) the full GP posterior distribution, which exploits the whole data set z_{1:N} along the entire time axis [0, T] – while maintaining the O(N) complexity of filtering, both in the number of steps and of function evaluations. Therefore the EKS1 is, altogether, our default recommendation.
But a longer answer would also involve other methods. As
a first alternative to the EKS1, both the EKF1 and the EKS0
recommend themselves. The EKF1 omits the smoothing pass
(38.23)–(38.25) backwards through time. It is therefore a bit
cheaper, i.e. its cost O( N ) has a smaller constant. This can, e.g.,
be advantageous when only the distribution at the final time T
(where the filtering and smoothing distributions coincide) is of
interest.
The EKS0, on the other hand, does not require the Jacobian.
Compared with the EKS1, this again reduces the constant in the
O( N ) cost. The Jacobian is beneficial to solve stiff ODEs and to
calibrate the posterior uncertainty accurately. But when rough
uncertainty estimates suffice, the EKS0 is an attractive cheaper
alternative for non-stiff ODEs.
Lastly, the EKF0 combines both of the modifications of the
EKF1 and EKS0, with respect to the EKS1. It is thus appropriate
for the intersection of cases where the EKF1 and EKS0 are
suitable.
The other above-mentioned ODE filters and smoothers are more expensive and trickier to implement efficiently. Hence, we recommend considering them only in very specific cases. For instance, if the MAP estimate is desired, the IEKS is best suited to compute it.
to compute it. The particle ODE filter should only be used when
capturing non-Gaussian structures is crucial. It is thus not really
an alternative to Gaussian ODE filters and smoothers, but rather
to the perturbative solvers of §40.
Efficient implementations of our recommended choice (the
EKS1), and its next best alternatives (EKF1, EKS0, EKF0) are
readily available in the ProbNum package.
$$\|m(T) - x(T)\| \le C(T)\, h^q,$$

where m(T) := H_0 m_N denotes, by Eq. (38.3), the posterior mean estimate of x(T) computed by the EKF0. The same bound holds for the EKS0.
$$Z[X](t_i) = 0.$$
Theorem 39.6. Under Assumption 39.5 and for any prior X(t) of smoothness q,¹³ there exists a constant C(T) > 0 such that

$$\sup_{t \in [0,T]} \left\| \int_0^t Z[x_*(s)]\, \mathrm{d}s \right\| \le C(T)\, h^q, \qquad (39.2)$$

where x_*(t) = H_0 \vec{x}_*(t) is the MAP estimate of x(t) given a discretisation 0 = t_0 < t_1 < ⋯ < t_N = T (recall Eq. (38.30)).

¹³ That is, for any process with a.s. q-times differentiable sample paths. In particular, this includes the Matérn family with ν = q + 1/2 (§38.2) and its special cases: the q-times integrated Wiener process and Ornstein–Uhlenbeck process. See §2.1 in Tronarp, Särkkä, and Hennig (2021) for an alternative definition of such priors by use of Green's functions.
Proof. The proof idea is to first analyse (with the help of tools
from nonlinear analysis) which regularities the information
operator Z inherits from f under Assumption 39.5, and then to
apply results from scattered-data interpolation in the Sobolev
space associated with the prior X (t). Details in Tronarp, Särkkä,
and Hennig (2021), Theorem 3.
(NB: In particular, this uniform bound also holds for the discrete MAP estimate x_*(t_{0:N}), which the IEKS aims to estimate; see §38.3.3.)
▷ 39.2.1 A-Stability
for some real²² matrix Λ whose eigenvalues lie in the unit circle around zero, i.e. for which lim_{t→∞} x(t) = 0. An ODE solver is said to be A-stable if and only if its numerical estimate x̂(t) also converges to zero (for a fixed step size h > 0) as t → ∞ (Dahlquist, 1963). Accordingly, a Gaussian ODE filter is A-stable if and only if its mean estimate H_0 m_n goes (for a fixed step size h > 0) to zero as n → ∞.

²² In the classical literature, Λ ∈ ℂ^{d×d} is a complex matrix, but ODE filters are only designed for real-valued ODEs. Hence, we here use the real-valued analogue (39.3) instead; cf. Eq. (31) in Tronarp et al. (2019).
The following recursion holds by Eqs. (38.17)–(38.21) for the predictive mean m^-_n of both the EKF0 and EKF1 (but with different K_n):

$$m^-_{n+1} = \big[A(h) - A(h)\, K_n B\big]\, m^-_n, \qquad (39.4)$$
$$p(x_0) = \mathcal{N}(x_0;\; T^{-1} m_0,\; T^{-1} P_0 T^{-\intercal}), \qquad (39.7)$$
$$p(x_{n+1} \mid x_n) = \mathcal{N}\big(x_{n+1};\; T^{-1} A(h)\, T x_n,\; T^{-1} Q(h)\, T^{-\intercal}\big),$$

instead of Eqs. (38.10) and (38.11).³³ In other words, we replaced the original predictive matrices (A, Q) from Eq. (38.11) with the new ones (Ā := T^{-1} A(h) T, Q̄ := T^{-1} Q(h) T^{-⊺}) to obtain Eq. (39.7). As desired, these new matrices are now scale-invariant:

$$[\bar{A}]_{ij} = \mathbb{I}(j \ge i) \binom{q+1-i}{q+1-j}, \qquad [\bar{Q}]_{ij} = \frac{\sigma^2}{2q+3-i-j}, \qquad (39.8)$$

³³ Note that, for notational simplicity, we here assumed a constant step size h. See Krämer and Hennig (2020) for a generalisation to variable step sizes {h_n}_{n=0}^N (in zero-based indexing!).
$$P^-_{n+1} = R^\intercal R. \qquad (39.11)$$

Fortunately, this R can be obtained without assembling P^-_{n+1} from its square-root factors as in Eq. (39.10), since it (as Exercise 39.10 reveals) is equal to the upper-triangular factor of the QR decomposition of [A_n L_P, L_Q]^⊺. Hence, we may replace the original prediction step (39.10) by the lower-dimensional matrix multiplication (39.11), in which the Cholesky factor R is efficiently obtained by a QR decomposition of [A_n L_P, L_Q]^⊺ (i.e. without ever computing P^-_{n+1}).

Exercise 39.10. Prove the claim in the text, i.e. show that the upper-triangular matrix in the QR decomposition of [A_n L_P, L_Q]^⊺ is the transpose of the lower-triangular Cholesky factor of P^-_{n+1}. (For a solution, see §3.3 in Krämer and Hennig (2020).)
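This trick is a one-liner in NumPy. A minimal sketch (our own naming), which also verifies R⊺R = A P A⊺ + Q on a random instance:

```python
import numpy as np

def sqrt_predict(A, L_P, L_Q):
    # Upper-triangular R with R.T @ R = A P A.T + Q (Eq. (39.11)),
    # from a QR decomposition of [A L_P, L_Q]^T; P_pred is never formed.
    M = np.hstack([A @ L_P, L_Q])
    return np.linalg.qr(M.T, mode="r")

rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((D, D))
L_P = np.linalg.cholesky(np.eye(D) + 0.1 * np.ones((D, D)))
L_Q = 0.5 * np.eye(D)
R = sqrt_predict(A, L_P, L_Q)
P_pred = A @ L_P @ L_P.T @ A.T + L_Q @ L_Q.T
assert np.allclose(R.T @ R, P_pred)
```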
Thereby, the filter can summarise the predictive distribution of Eq. (38.17) by (m^-_{n+1}, R) instead of (m^-_{n+1}, P^-_{n+1}). In the subsequent update step, the innovation-covariance matrix S of Eq. (38.19) can again be captured by its Cholesky factor, which is (via analogous reasoning) again available (without assembling S) in the form of the upper-triangular QR-factor of (HR)^⊺. The conditioning on z_{n+1} from Eqs. (38.20)–(38.22) can then be executed solely by use of this Cholesky factor of S (for details, see Appendix A.3 in Krämer and Hennig, 2020). Finally, the resulting filtering distribution N(m_{n+1}, P_{n+1}) is obtained directly from the previously computed Cholesky factors. Again, the computation of P_{n+1} is replaced by computing its Cholesky factor instead.
Altogether, this square-root implementation represents all re-
quired covariance matrices (including the hidden, intermediate
ones) by their Cholesky matrix square-root. Since the Cholesky
factors of the matrices Pn−+1 , S and Pn+1 can be obtained by
QR decompositions of already-available matrix square-roots,
they never have to be assembled – which further reduces the
computational cost.
We refer the reader to Appendix A in Krämer and Hennig (2020) for a complete description of this square-root implementation (in particular, the square-root versions of the EKS0 and EKS1 are detailed in their Appendix A.4), and to the book by Grewal and Andrews (2001) for more implementation ideas which might help address the remaining practical challenges.⁴⁰ All of the above-described tricks are included in the ProbNum Python package (code at probnum.org; see the corresponding publication by Wenger et al., 2021).

⁴⁰ These are mainly the efficient integration of high-dimensional and very stiff ODEs; see §5 in Krämer and Hennig (2020). In this regard, Krämer et al. (2021) recently published a further advance demonstrating that ODE filters can efficiently solve ODEs in very high dimensions.

▶ 39.3 Connection with Classical Solvers

The probabilistic reproduction of classical numerical methods has, especially in its early days, been a central strategy of pn to invent practical probabilistic solvers. For ODE filtering, this
changed when Tronarp et al. (2019) introduced the rigorous ssm of Eqs. (38.10)–(38.13), because, from then on, new research could also draw from the accumulated wisdom of signal processing (instead of numerical analysis). Since then, most ODE filters and smoothers have been designed directly from the first principles of Bayesian estimation in SSMs, without attempting to imitate classical numerical solvers. While some loose connections have been observed (for instance, it has been repeatedly pointed out that both the EKF1 and the classical Rosenbrock methods make use of the Jacobian matrix of f), it has not been studied in detail how the whole range of ODE filters and smoothers relates to classical methods. Nonetheless, earlier research (Schober, Särkkä, and Hennig, 2019) has established one important connection – in the form of an equivalence between the EKF0 with IWP prior (more precisely, its filtering mean) and Nordsieck methods (Nordsieck, 1962), which we will discuss in §39.3.2; it is unsurprising that the equivalences are only known for the IWP prior, as it is the only one with Taylor predictions (see Eq. (38.14)). But, first, we will present another, more elementary, special case.⁴⁶

⁴⁶ Note that, even earlier, the pioneering work by Schober, Duvenaud, and Hennig (2014) showed an equivalence between a single Runge–Kutta step and GP regression with an IWP prior. However, as it relies on imitating the sub-step structure of Runge–Kutta methods, this equivalence cannot be naturally reproduced with ODE filters. Therefore, we do not further discuss this result here.

▷ 39.3.1 Equivalence with the Explicit Trapezoidal Rule

In the case of the 1-times IWP prior and R = 0, the Kalman gains {K_n}_{n=1}^N are the same for all n. In other words, the filter is always in its steady state K_∞ = lim_{n→∞} K_n. Therefore, the recursion for the Kalman-filtering means {m_n}_{n=0}^N is independent of n, which leads to the following equivalence.
Proposition 39.11 (Schober, Särkkä, and Hennig, 2018). The EKF0 with 1-times IWP prior and R = 0 is equivalent to the explicit trapezoidal rule (aka Heun's method). More precisely, its filtering mean estimates x̂_n := H_0 m_n of x(t_n) follow the explicit trapezoidal rule, written

$$\hat{x}_{n+1} = \hat{x}_n + \frac{h}{2}\big(f(\hat{x}_n) + f(\tilde{x}_{n+1})\big), \qquad (39.12)$$

with x̃_{n+1} := x̂_n + h f(x̂_n).
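For reference, the explicit trapezoidal rule of Eq. (39.12) takes only a few lines (the helper name is ours):

```python
import numpy as np

def heun(f, x0, h, num_steps):
    # Explicit trapezoidal rule (Heun's method), Eq. (39.12):
    # an Euler predictor followed by a trapezoidal corrector.
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        x = xs[-1]
        x_tilde = x + h * f(x)
        xs.append(x + h / 2 * (f(x) + f(x_tilde)))
    return np.array(xs)
```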
do not only model the ODE solution x(t), but a larger state vector \vec{x}(t), which contains at least x(t) and x′(t). For the most important priors, \vec{x}(t) is simply the concatenation of the first q derivatives, written

$$\vec{x}(t) = \big[x(t),\, x'(t),\, \dots,\, x^{(q)}(t)\big]. \qquad (39.13)$$
$\widehat{\vec{x}}_{\mathrm{Nord}}(t)$ executes the prediction and update step in a single step (cf. Eq. (28) in Schober, Särkkä, and Hennig, 2019):

$$\widehat{\vec{x}}(t + h) = \big[I - l\bar{H}\big]\, \bar{A}\, \widehat{\vec{x}}(t) + h\, l\, f\big(H_0 \bar{A}\, \widehat{\vec{x}}(t)\big), \qquad (39.17)$$
recursions are the same in the steady state of the filter, i.e. after K_{n+1} has reached its limit K_∞ := lim_{n→∞} K_n (for the details, see §3.1 in Schober, Särkkä, and Hennig, 2019). Note, however, that K_∞ depends on the ssm on which the EKF0 performs …
$$\|m(T) - x(T)\| \le C(T)\, h^3.$$

Proof. First, derive the steady-state Kalman gain: $K_\infty = \big[\tfrac{3+\sqrt{3}}{12},\; 1,\; \tfrac{3-\sqrt{3}}{2}\big]^\intercal$. Then, insert K_∞ as the Nordsieck weight-vector l into Theorem 4.2 from Skeel (1979), which yields global convergence rates of order h³. For the details, see Theorem 1 in Schober, Särkkä, and Hennig (2019).
Remark. For the q-times IWP prior with q = 2, this theorem gives h^{q+1} instead of the h^q convergence rates suggested by Theorem 39.3 and Corollary 39.7, but only in the steady state. These rates indeed hold in practice (Schober, Särkkä, and Hennig, 2019, Figure 4).

In the same way, one could interpret any instance of the EKF0 (in …

… IWP prior models x(t) and its first q derivatives. At any given discrete time point t_n (n = 0, …, N), the filtering mean estimate for x(t_n), computed by the EKF0, will of course depend on all previous function evaluations {y_i = f(H_0 m_i^-)}_{i=1}^n. But, in the steady state, the mean estimates for the q modelled derivatives [x′(t_n), …, x^{(q)}(t_n)] will depend only on a finite number j ∈ ℕ of these function evaluations, namely on {y_i = f(H_0 m_i^-)}_{i=n-j+1}^n. What is j for a given q ∈ ℕ?
40 Perturbative Solvers
So far in this chapter on ODEs, all methods (with the sole exception of the particle ODE filter) were probabilistic, but not stochastic. By this, we mean that they use probability distributions to approximate x(t), but are not randomised (i.e. they return the same output when run multiple times). This design choice stems from a conviction held by some in the pn community (see the corresponding discussion in §12.3) that it is never optimal to inject stochasticity into any deterministic approximation problem – except in an adversarial …

But even in most non-chaotic ODEs, the long-term effect of the …
$$\varepsilon_n(h_n) \sim \xi_n(h_n) := \int_0^{h_n} \chi_n(s)\, \mathrm{d}s, \qquad (40.2)$$
the sense that both the expected global error of the former and the fixed global error of the latter are in O(h^q). However, if the added noise is larger than that (i.e. p < q), then the expected global error is only in O(h^p), i.e. larger than without randomisation.

This is intuitive. Loosely speaking, it means that one can at most perturb the local error (in O(h^{q+1})) by a slightly larger¹⁶ additive noise (in O(h^{q+1/2})) without reducing the global convergence rate. Accordingly, Conrad et al. (2017) recommend choosing p := q, i.e. to add the maximum admissible stochasticity that still preserves the accuracy of the underlying deterministic method Ψ_h.

¹⁶ Due to the independence of the random variables ξ_n(h_n), n = 1, …, N, an order of h^{q+1/2} in the local noise is already sufficiently small to have an expected global error of h^q; see Remark 8 in Abdulle and Garegnani (2020).
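A minimal sketch of this recipe (our own naming; `Psi_h` is any hypothetical deterministic one-step map, e.g. a Runge–Kutta step of local order q + 1), producing one sample path by adding noise with standard deviation proportional to h^{p+1/2} after every step:

```python
import numpy as np

def perturbed_path(Psi_h, x0, h, num_steps, p, sigma=1.0, rng=None):
    # One sample of a perturbative solver: deterministic step plus
    # admissible stochasticity (p = q recommended by Conrad et al., 2017).
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    path = [x]
    for _ in range(num_steps):
        x = Psi_h(x, h) + sigma * h ** (p + 0.5) * rng.standard_normal(x.shape)
        path.append(x)
    return np.array(path)
```

Repeating this for many random seeds yields an ensemble of trajectories whose spread reflects the numerical uncertainty of the underlying method.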
Just like Theorem 40.5, Theorem 40.6 shows that the local error rate of q + 1 and the local standard deviation of p + 1/2 combine to a convergence rate of min(p, q). For the same reasons as above, it is again recommended to choose p = q. Note that, as in a weaker earlier version of Theorem 40.5 by Conrad et al., the maximum is outside of the expectation here, and f is assumed to be globally Lipschitz. For Theorem 40.5, these restrictions were later lifted by Lie, Stuart, and Sullivan (2019); maybe this is also possible for Theorem 40.6. Since the desired properties of geometric integrators hold for all h > 0, they a.s. carry over to a sample of {X̂_n} from Eq. (40.5) (Abdulle and Garegnani, 2020, Thm. 4).
Notably, both of these methods can be thought of as frequentist, as they sample i.i.d. approximations of x(t). The particle ODE filter (from §38.4), on the other hand, is Bayesian, as it computes a dependent set of samples that approximate the true posterior distribution. While there are first experimental comparisons (Tronarp et al., 2019, §5.4), more research is needed to understand the differences between all of these nonparametric solvers. There are further important methods – such as the aforementioned one by Chkrebtii et al. (2016), as well as stochastic versions of linear multistep methods (Teymur, Zygalakis, and Calderhead, 2016) and of implicit solvers (Teymur et al., 2018). Finally, note that Abdulle and Garegnani (2021) recently published an extension of their ODE solver (40.5) to PDEs by randomising the meshes in finite-element methods.
$$x_1'' = x_1 + 2x_2' - \mu' \frac{x_1 + \mu}{D_1} - \mu \frac{x_1 - \mu'}{D_2}, \qquad (40.6)$$
$$x_2'' = x_2 - 2x_1' - \mu' \frac{x_2}{D_1} - \mu \frac{x_2}{D_2},$$
$$D_1 = \big((x_1 + \mu)^2 + x_2^2\big)^{3/2}, \qquad D_2 = \big((x_1 - \mu')^2 + x_2^2\big)^{3/2},$$

with μ = 0.012277471 and μ′ = 1 − μ. It is known that there are initial values that give closed, periodic orbits. One example is …
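The right-hand side of Eq. (40.6), written as a first-order system with state u = [x₁, x₂, x₁′, x₂′] (the function name is ours; initial values for a periodic orbit would be taken from the literature):

```python
import numpy as np

MU = 0.012277471
MU_PRIME = 1.0 - MU

def arenstorf_rhs(u):
    # Restricted three-body problem, Eq. (40.6).
    x1, x2, v1, v2 = u
    D1 = ((x1 + MU) ** 2 + x2**2) ** 1.5
    D2 = ((x1 - MU_PRIME) ** 2 + x2**2) ** 1.5
    a1 = x1 + 2 * v2 - MU_PRIME * (x1 + MU) / D1 - MU * (x1 - MU_PRIME) / D2
    a2 = x2 - 2 * v1 - MU_PRIME * x2 / D1 - MU * x2 / D2
    return np.array([v1, v2, a1, a2])
```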
$$p(x_T \mid x_N) = \delta(x_T - H_0 x_N),$$
$$z(t_i) := x(t_i) + \xi_i \in \mathbb{R}^d, \qquad \xi_i \sim \mathcal{N}(0, \sigma^2 I_d),$$
$$x_2' = -\theta_3 x_2 + \theta_4 x_1 x_2,$$
In the case of ODE filters and smoothers, the (to date) only publi-
cation is Kersting et al. (2020). Like the perturbative approaches,
it managed to reduce the overconfidence in the likelihood by in-
serting an EKF0 in lieu of a classical ODE solver, as Figure 41.2
demonstrates. But, on top of that, it exploited the resulting,
more structured Gaussian form of the likelihood to estimate its
gradients and Hessian matrices in the following way.
First, let us assume w.l.o.g. that h > 0 is fixed, and recall,
from Eqs. (38.1) and (38.2), that the functions x and x ′ are a
priori jointly modelled by a Gauss–Markov process, written
" #! " # " # " #!
x x x0 k k∂
p = GP ; , ∂ ,
x′ x′ f ( x0 ) k ∂k ∂
$$[Y]_{ij} = f_j\big(m^-_\theta(ih)\big) - f_j(x_0), \qquad (41.11)$$
$$p\big(y^{\mathrm{obs}}_n \mid H^{\mathrm{obs}} x_n\big) = \mathcal{N}\big(y^{\mathrm{obs}}_n;\; H^{\mathrm{obs}} x_n,\; R^{\mathrm{obs}}\big), \qquad (41.15)$$
Solution to Exercise 4.7. Define the short-hand $w_i(x) = [K_{XX}^{-1} k_{Xx}]_i$ for the regression weights. To show the theorem, we insert the definition of m_x from Eq. (4.6) and use the reproducing property of k to write all instances of f(x), f(x_i) as an inner product:

$$s(x) := \sup_{f \in \mathcal{H}, \|f\| \le 1} (m_x - f_x)^2 = \sup_{f \in \mathcal{H}, \|f\| \le 1} \left( \sum_{i=1}^N f(x_i)\, w_i(x) - f_x \right)^2 = \sup_{f \in \mathcal{H}, \|f\| \le 1} \left\langle \sum_i w_i(x)\, k(\cdot, x_i) - k(\cdot, x),\; f(\cdot) \right\rangle_{\mathcal{H}}^2.$$

By Cauchy–Schwarz, this supremum is attained at the unit-norm element

$$\bar{f}_x(\cdot) := \frac{\sum_i w_i(x)\, k(\cdot, x_i) - k(\cdot, x)}{\left\| \sum_i w_i(x)\, k(\cdot, x_i) - k(\cdot, x) \right\|},$$

so we can rewrite:

$$s(x) = \left\langle \sum_i w_i(x)\, k(\cdot, x_i) - k(\cdot, x),\; \bar{f}_x(\cdot) \right\rangle_{\mathcal{H}}^2 = \left\| \sum_i w_i(x)\, k(\cdot, x_i) - k(\cdot, x) \right\|_{\mathcal{H}}^2 = \sum_{ij} w_i(x) w_j(x)\, k(x_i, x_j) - 2 \sum_i w_i(x)\, k(x, x_i) + k(x, x) = k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx},$$
Thus, the eigenvalue of A = exp(Fh) is e^{−ξh}, and one can find the three forms:

$$A_0 = \exp(-\xi h),$$
$$A_1 = \exp\left( h \begin{bmatrix} 0 & 1 \\ -\xi^2 & -2\xi \end{bmatrix} \right) = e^{-\xi h} \begin{bmatrix} \xi h + 1 & h \\ -\xi^2 h & (1 - \xi h) \end{bmatrix},$$
$$A_2 = \exp\left( h \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -\xi^3 & -3\xi^2 & -3\xi \end{bmatrix} \right) = e^{-\xi h} \begin{bmatrix} \tfrac{1}{2}(\xi^2 h^2 + 2\xi h + 2) & h(\xi h + 1) & \tfrac{1}{2} h^2 \\ -\tfrac{1}{2}\xi^3 h^2 & -(\xi^2 h^2 - \xi h - 1) & -\tfrac{1}{2} h(\xi h - 2) \\ \tfrac{1}{2}\xi^3 h(\xi h - 2) & \xi^2 h(\xi h - 3) & \tfrac{1}{2}(\xi^2 h^2 - 4\xi h + 2) \end{bmatrix}.$$
Now write down the same equation for t + 1, and replace all
occurrences of Pt+1 using (42.1). Then use the matrix inversion
lemma, Eq. (15.9) to simplify the expression.
$$|A \otimes B| = |V_A \otimes V_B|\, |D_A \otimes D_B|\, |V_A \otimes V_B| = |D_A \otimes D_B|.$$
$$\big((I \otimes S^\intercal)\Sigma\big)_{nm,k\ell} = \sum_{ij}^N \delta_{ni}\, S_{jm}\, \delta_{ik}\, \delta_{j\ell}\, W_{ij} = \delta_{nk}\, W_{k\ell}\, S_{\ell m},$$
$$f(x_1) \approx f(x_0) - \|x_1 - x_0\|\, \nabla f(x_0) = 4473\ \tfrac{\mathrm{J}}{\mathrm{kg}} - 0.5\ \mathrm{m} \times 50\ \tfrac{\mathrm{J}}{\mathrm{kg \cdot m}} = 4448\ \tfrac{\mathrm{J}}{\mathrm{kg}}.$$

The second robot makes the literally microscopic step 0.1 × ∇f(x₀) [ft] = 1.03 × 10⁻⁶ [ft] ≈ 0.3 µm, to where the potential energy density is

$$f(x_1) = 3.031 \times 10^{-2}\ \tfrac{\mathrm{Cal}}{\mathrm{oz}} - 1.03 \times 10^{-6}\ \mathrm{ft} \times 1.03 \times 10^{-5}\ \tfrac{\mathrm{Cal}}{\mathrm{oz \cdot ft}} = 3.031 \times 10^{-2}\ \tfrac{\mathrm{Cal}}{\mathrm{oz}} = 4473\ \tfrac{\mathrm{J}}{\mathrm{kg}}.$$
$$= r(x) + \frac{1}{2} \sum_{i=1}^N \big(Y_i - f(x_i)\big)^2.$$
k_{t,t_a} k_{t,t_b}, is always at most a polynomial containing terms |t_a − t_b|^ℓ with 0 ≤ ℓ ≤ 6 (using min(t_a, t_b) = ½(|t_a + t_b| − |t_a − t_b|)). Thus, it is certainly possible to construct priors which, given Y with likelihood (26.12), assign a different absolute uncertainty to each input location t, but still revert to the cubic-spline mean in the limit Λ → 0. But their qualitative behaviour is equivalent in the sense that the marginal standard deviation (the "sausage of uncertainty" around the posterior mean) is locally cubic in t. This is again an instance of the deeper insight that, since a Gaussian process posterior mean and (co)variance both involve the same kernel, a classic numerical estimate that is a particular least-squares MAP estimator is consistent with only a restricted set of probabilistic posterior error estimates, in the sense of posterior standard deviations.
$$\alpha_{\mathrm{ei}}(x_n) = \int_{-\infty}^{\eta} f(x_n)\, \mathcal{N}\big(f(x_n); m(x_n), V(x_n)\big)\, \mathrm{d}f(x_n) - \eta \int_{-\infty}^{\eta} \mathcal{N}\big(f(x_n); m(x_n), V(x_n)\big)\, \mathrm{d}f(x_n)$$
$$= \int_{-\infty}^{\eta} f(x_n)\, \frac{1}{\sqrt{2\pi V(x_n)}} \exp\left( -\frac{1}{2} \frac{\big(f(x_n) - m(x_n)\big)^2}{V(x_n)} \right) \mathrm{d}f(x_n) - \eta\, \Phi\big(\eta; m(x_n), V(x_n)\big)$$
$$= \int_{-\infty}^{\eta - m(x_n)} \frac{z + m(x_n)}{\sqrt{2\pi V(x_n)}} \exp\left( -\frac{1}{2} \frac{z^2}{V(x_n)} \right) \mathrm{d}z - \eta\, \Phi\big(\eta; m(x_n), V(x_n)\big)$$
$$= -\sqrt{\frac{V(x_n)}{2\pi}} \int_{-\infty}^{\eta - m(x_n)} \frac{-z}{V(x_n)} \exp\left( -\frac{1}{2} \frac{z^2}{V(x_n)} \right) \mathrm{d}z + m(x_n)\, \Phi\big(\eta - m(x_n); 0, V(x_n)\big) - \eta\, \Phi\big(\eta; m(x_n), V(x_n)\big)$$
$$= -\sqrt{\frac{V(x_n)}{2\pi}} \left[ \exp\left( -\frac{1}{2} \frac{z^2}{V(x_n)} \right) \right]_{-\infty}^{\eta - m(x_n)} + \big(m(x_n) - \eta\big)\, \Phi\big(\eta; m(x_n), V(x_n)\big)$$
$$= -V(x_n)\, \mathcal{N}\big(\eta; m(x_n), V(x_n)\big) + \big(m(x_n) - \eta\big)\, \Phi\big(\eta; m(x_n), V(x_n)\big). \qquad \Box$$
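The closed form can be checked against a Monte Carlo estimate of the defining integral. A small sketch (our own naming) using SciPy's Gaussian pdf/cdf:

```python
import numpy as np
from scipy.stats import norm

def alpha_ei(eta, m, V):
    # -V N(eta; m, V) + (m - eta) Phi(eta; m, V), as derived above.
    s = np.sqrt(V)
    return -V * norm.pdf(eta, m, s) + (m - eta) * norm.cdf(eta, m, s)

rng = np.random.default_rng(1)
eta, m, V = 0.3, 0.5, 0.8
y = rng.normal(m, np.sqrt(V), size=1_000_000)
mc = np.mean((y - eta) * (y < eta))  # E[(y - eta) 1{y < eta}]
assert abs(mc - alpha_ei(eta, m, V)) < 1e-2
```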
" #!
y1
p ε1 − ε2 = N ε 1 − ε 2 ; σ2 (Σ + σ2 I )−1 y1 − y2 − (m1 − m2 ) , 2σ2 − 2σ4 (C2 + σ2 )−1 .
y2
$$\int \big[\lambda_{\mathrm{ucb}}(x_n, \mathcal{D}_n) + y_n\big]\, p(y_n \mid \mathcal{D}_{n-1})\, \mathrm{d}y_n = -\beta_n \left( \int \big(y_n - m(x_n)\big)^2 p(y_n \mid \mathcal{D}_{n-1})\, \mathrm{d}y_n \right)^{\frac{1}{2}}$$

and, multiplying both sides by $\beta_n V(x_n)^{\frac{1}{2}}$,

$$\beta_n V(x_n)^{\frac{1}{2}} \int \big[\lambda_{\mathrm{ucb}}(x_n, \mathcal{D}_n) + y_n\big]\, p(y_n \mid \mathcal{D}_{n-1})\, \mathrm{d}y_n = -\beta_n^2 \int \big(y_n - m(x_n)\big)^2 p(y_n \mid \mathcal{D}_{n-1})\, \mathrm{d}y_n.$$
GP    Gaussian process. 37, 66, 77, 79, 81, 82, 84, 85, 95, 104, 107, 108, 152, 251, 252, 255, 259, 262, 263, 269, 270, 274, 277, 366, 368
RKHS  reproducing kernel Hilbert space. 36–38, 52, 76, 80, 108, 121, 306, 318, 359
SPD   symmetric positive definite. 23, 34, 51, 128, 132–134, 145, 150, 151, 153, 155, 161, 162, 170, 171, 173, 175, 176, 180, 181, 185, 187, 362
SSM   state space model. 282, 283, 292, 293, 295, 297–303, 305, 307, 309, 310, 314, 322–324, 326, 327, 329, 332, 340, 344, 346
A-stability, 322, 323
acquisition function, 254
Adam, 230
affine transformation, 23
agent, 3, 4, 11
aleatory uncertainty, 11, 71
analytic, 1
Arenstorf orbit, 336
Armijo condition, see line search
Arnoldi process, 135
atomic operation, 70
Automated Machine Learning, 276
average-case, 182
average-case analysis, 7
backpack package, see software
Bayes' theorem, 21
Bayesian, 8
Bayesian ODE filters and smoothers, see ODE filters and smoothers
Bayesian Optimisation, 251
Bayesian quadrature, 72
belief propagation, 43
BFGS, 236
bias, 11
bifurcation, 311
boundary value problem, 286, 339
Brownian motion, 50
calibration, 12
CARE, see Riccati equation
Cauchy-Schwarz inequality, 37
Chain graph, 42
chaos, 332
Chapman–Kolmogorov equation, 43, 45
Chebyshev polynomials, 103
Cholesky decomposition, 36, 138
code, see software
companion matrix, 53
conditional distribution, 21
conditional entropy, 272
conjugate gradients, 134, 137, 144, 145, 165, 166
    probabilistic, 165
conjugate prior, 55, 56
continuous time, 48, 296
continuous-time Riccati equation, see Riccati equation
convergence rate, 201
convex function, 199
covariance, 23
covariance function, see kernel
cubic splines, see splines
curse of dimensionality, 79
Dahlquist test equation, 323
DARE, see Riccati equation
data, 21
decision theory, 4
Dennis family, 236
detailed balance, 85
determinant lemma, 130
determinantal point process, 80
DFP, 237
Dirac delta, 27
discrete time, 42, 297
discrete-time algebraic Ricatti equation, see Riccati equation
dynamic model, 45, 298
early stopping, 224
EKF0, EKF1, see ODE filters and smoothers
EKS0, EKS1, see ODE filters and smoothers
empirical risk minimisation, 37, 204
emukit package, see software
entropy, 23
epistemic uncertainty, 11, 70
equidistant grid, 92
ergodicity, 85
error analysis vs error estimation, 92
error function, 215
Euler's method, 289
Eulerian integrals, 56
evidence, 99
expected improvement, 259
expensive evaluations, 256
exploration-exploitation trade-off, 247
exponential kernel, see Ornstein–Uhlenbeck process
exponentiated quadratic kernel, see Gaussian kernel
filter, 44
    Kalman, 46
    ODE, see ODE filters and smoothers
    optimal, 46
    particle, 309
forward-backward algorithm, see sum-product algorithm
Frobenius norm, see norm
function-space view, 28
Galerkin condition, 143
gamma distribution, 56
gamma function, 56
Gauss–Markov process, 41
Gauss-inverse-gamma, 56
Gauss-inverse-Wishart, 59
Gaussian
    elimination, 133