Hennig P. Probabilistic Numerics. Computation As Machine Learning 2022
PROBABILISTIC NUMERICS
Computation as Machine Learning
Probabilistic numerical computation formalises the connection between machine learning and
applied mathematics. Numerical algorithms approximate intractable quantities from computable
ones. They estimate integrals from evaluations of the integrand, or the path of a dynamical system
described by differential equations from evaluations of the vector field. In other words, they
infer a latent quantity from data. This book shows that it is thus formally possible to think of
computational routines as learning machines, and to use the notion of Bayesian inference to build
more flexible, efficient, or customised algorithms for computation.
The text caters for Masters’ and PhD students, as well as postgraduate researchers in artificial
intelligence, computer science, statistics, and applied mathematics. Extensive background material
is provided along with a wealth of figures, worked examples, and exercises (with solutions) to
develop intuition.
Philipp Hennig holds the Chair for the Methods of Machine Learning at the University of
Tübingen, and an adjunct position at the Max Planck Institute for Intelligent Systems. He has
dedicated most of his career to the development of Probabilistic Numerical Methods. Hennig’s
research has been supported by Emmy Noether, Max Planck and ERC fellowships. He is a
co-Director of the Research Program for the Theory, Algorithms and Computations of Learning
Machines at the European Laboratory for Learning and Intelligent Systems (ELLIS).
Michael A. Osborne is Professor of Machine Learning at the University of Oxford, and a
co-Founder of Mind Foundry Ltd. His research has attracted £10.6M of research funding and has
been cited over 15,000 times. He is very, very Bayesian.
Hans P. Kersting is a postdoctoral researcher at INRIA and École Normale Supérieure in Paris,
working in machine learning with expertise in Bayesian inference, dynamical systems, and
optimisation.
‘This impressive text rethinks numerical problems through the lens of probabilistic inference and
decision making. This fresh perspective opens up a new chapter in this field, and suggests new and
highly efficient methods. A landmark achievement!’
- Zoubin Ghahramani, University of Cambridge
‘In this stunning and comprehensive new book, early developments from Kac and Larkin have been
comprehensively built upon, formalised, and extended by including modern-day machine learning,
numerical analysis, and the formal Bayesian statistical methodology. Probabilistic numerical
methodology is of enormous importance for this age of data-centric science and Hennig, Osborne,
and Kersting are to be congratulated in providing us with this definitive volume.’
- Mark Girolami, University of Cambridge and The Alan Turing Institute
‘This book presents an in-depth overview of both the past and present of the newly emerging
area of probabilistic numerics, where recent advances in probabilistic machine learning are used to
develop principled improvements which are both faster and more accurate than classical numerical
analysis algorithms. A must-read for every algorithm developer and practitioner in optimization!’
- Ralf Herbrich, Hasso Plattner Institute
‘Probabilistic numerics spans from the intellectual fireworks of the dawn of a new field to its
practical algorithmic consequences. It is precise but accessible and rich in wide-ranging, principled
examples. This convergence of ideas from diverse fields in lucid style is the very fabric of good
science.’
- Carl Edward Rasmussen, University of Cambridge
‘An important read for anyone who has thought about uncertainty in numerical methods; an
essential read for anyone who hasn’t.’
- John Cunningham, Columbia University
‘This is a rare example of a textbook that essentially founds a new field, re-casting numerics on
stronger, more general foundations. A tour de force.’
- David Duvenaud, University of Toronto
‘The authors succeed in demonstrating the potential of probabilistic numerics to transform the way
we think about computation itself.’
- Thore Graepel, Senior Vice President, Altos Labs
PHILIPP HENNIG
Eberhard Karls Universität Tübingen, Germany
MICHAEL A. OSBORNE
University of Oxford
HANS P. KERSTING
École Normale Supérieure, Paris
PROBABILISTIC NUMERICS
Cambridge University Press
www.cambridge.org
Information on this title: www.cambridge.org/9781107163447
DOI: 10.1017/9781316681411
© Philipp Hennig, Michael A. Osborne and Hans P. Kersting 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Limited, Padstow, Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-16344-7 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To our families.
Measurement owes its existence to Earth
Estimation of quantity to Measurement
Calculation to Estimation of quantity
Balancing of chances to Calculation
and Victory to Balancing of chances.
Acknowledgements page ix
Symbols and Notation xi
Introduction 1
I Mathematical Background 17
1 Key Points 19
2 Probabilistic Inference 21
3 Gaussian Algebra 23
4 Regression 27
5 Gauss-Markov Processes: Filtering and SDEs 41
6 Hierarchical Inference in Gaussian Models 55
7 Summary of Part I 61
II Integration 63
8 Key Points 65
9 Introduction 69
10 Bayesian Quadrature 75
11 Links to Classical Quadrature 87
12 Probabilistic Numerical Lessons from Integration 107
13 Summary of Part II and Further Reading 119
22 Proofs 183
23 Summary of Part III 193
References 369
Index 395
Acknowledgements
Philipp Hennig
Michael A. Osborne
I would like to thank Isis Hjorth, for being the most valuable
source of support I have in life, and our amazing children
Osmund and Halfdan - I wonder what you will think of this
book in a few years?
Hans P. Kersting
Bold symbols (x) are used for vectors, but only where the fact that a variable is a vector is relevant.
Square brackets indicate elements of a matrix or vector: if x = [x₁, ..., x_N] is a row vector, then
[x]ᵢ = xᵢ denotes its entries; if A ∈ ℝ^{n×m} is a matrix, then [A]ᵢⱼ = Aᵢⱼ denotes its entries. Round
brackets (·) are used in most other cases (as in the notations listed below).
will consume. There are almost always choices for the character
of an iteration, such as where to evaluate an integrand or an
objective function to be optimised. Not all iterations are equal,
and it takes an intelligent agent to optimise the cost-benefit
trade-off.

On a related note, a well-designed probabilistic numerical
agent gives a reliable estimate of its own uncertainty over its
result. This helps to reduce bias in subsequent computations. For
instance, in ODE inverse problems, we will see how simulating
the forward map with a probabilistic solver accounts for the
tendency of numerical ODE solvers to systematically over- or
underestimate solution curves. While this does not necessarily
give a more precise ODE estimate (in the inner loop), it helps
the inverse-problem solver to explore the parameter space more
efficiently (in the outer loop). As these examples highlight, pn
promises to make more effective use of computation.
to wait over a decade. By then, the plot had thickened and authors
in many communities became interested in Bayesian ideas
for numerical analysis, among them Kadane and Wasilkowski
(1985), Diaconis (1988), and O’Hagan (1992). Skilling (1991) even
ventured boldly toward solving differential equations, displaying
the physicist’s willingness to cast aside technicalities in the
name of progress. Exciting as these insights must have been
for their authors, they seem to have missed fertile ground. The
development also continued within mathematics, for example in
the advancement of information-based complexity⁵ and average-case
analysis.⁶ But the wider academic community, in particular
users in computer science, seems to have missed much of it. But
the advancements in computer science did pave the way for
the second of the central insights of pn: that numerics requires
thinking about agents.

⁵ Traub, Wasilkowski, and Woźniakowski (1983); Packel and Traub (1987); Novak (2006).
⁶ Ritter (2000).
This Book
research questions.
$$\underbrace{p(x \mid y)}_{\text{posterior}} \;=\; \frac{\overbrace{p(y \mid x)}^{\text{likelihood}}\;\overbrace{p(x)}^{\text{prior}}}{\underbrace{\int p(y \mid x')\, p(x')\, \mathrm{d}x'}_{\text{evidence}}}.$$
$$\mathcal{N}(x; \mu, \Sigma) := \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right). \tag{3.1}$$

Here, a parameter vector μ ∈ ℝᴰ specifies the mean of the distribution, and the symmetric positive definite matrix Σ its covariance.¹

¹ The matrix inverse Σ⁻¹ is known as the precision matrix.
rich analytic theory, and that computers are good at the basic
linear operations, addition and multiplication. (More on this in a report by MacKay (2006).)

In fact, the connection between linear functions and Gaussian
distributions runs deeper: Gaussians are a family of probability
distributions that are preserved under all linear operations.² The
following properties will be used extensively: If a variable x ∈ ℝᴰ
is normally distributed, then every affine transformation of it is
also normally distributed (Figure 3.1):

$$p(x) = \mathcal{N}(x; \mu, \Sigma) \quad\Rightarrow\quad p(Ax + b) = \mathcal{N}\left(Ax + b;\ A\mu + b,\ A\Sigma A^\top\right). \tag{3.4}$$

² The entropy, $H_p(x) := -\int p(x)\log p(x)\,\mathrm{d}x$ (3.2), of the Gaussian is given by $H_{\mathcal{N}(x;\mu,\Sigma)} = \frac{D}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2}\log|\Sigma|$. (3.3)
The product of two Gaussian probability density functions
is another Gaussian probability distribution, scaled by a constant.³
The value of that constant is itself given by the value
of a Gaussian density function:

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B), \quad\text{where } C := (A^{-1} + B^{-1})^{-1} \text{ and } c := C(A^{-1}a + B^{-1}b). \tag{3.5}$$

Figure 3.1: The product of two Gaussian densities is another Gaussian density, up to normalisation.
³ This statement is about the product of two probability density functions. In contrast, the product of two Gaussian random variables is not a Gaussian random variable.
then both the posterior and the marginal distribution for y (the
evidence) are Gaussian (Figure 3.2):
linear regression model.³ To perform inference on such a function
within the Gaussian framework, we assign a Gaussian density
over the possible values for the weight vector w, written

$$p(w) = \mathcal{N}(w; \mu, \Sigma).$$

Conditioning on the observations y then yields a Gaussian posterior over w, with covariance and mean

$$\bar\Sigma = \Sigma - \Sigma\Phi_X\left(\Phi_X^\top \Sigma \Phi_X + \Lambda\right)^{-1}\Phi_X^\top \Sigma, \qquad
\bar\mu = \mu + \Sigma\Phi_X\left(\Phi_X^\top \Sigma \Phi_X + \Lambda\right)^{-1}\left(y - \Phi_X^\top \mu\right).$$

³ This need not mean that f is also a linear function in x!
⁴ This includes the special case Λ → 0, usually written suggestively with the Dirac delta.
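As an illustrative sketch (ours, assuming NumPy; the feature choice is arbitrary), the posterior over the weights follows directly from these two equations:

```python
# Gaussian weight-space posterior for generalised linear regression.
import numpy as np

def weight_posterior(PhiX, y, mu, Sigma, Lam):
    """PhiX: (F, N) feature matrix [phi(x_1), ..., phi(x_N)];
    y: (N,) observations; prior w ~ N(mu, Sigma); noise covariance Lam."""
    G = PhiX.T @ Sigma @ PhiX + Lam          # (N, N) Gram matrix plus noise
    K = Sigma @ PhiX @ np.linalg.inv(G)      # (F, N) gain
    mu_post = mu + K @ (y - PhiX.T @ mu)     # posterior mean over w
    Sigma_post = Sigma - K @ PhiX.T @ Sigma  # posterior covariance over w
    return mu_post, Sigma_post

# toy example: cubic-polynomial features on a few points
x = np.linspace(-1.0, 1.0, 5)
PhiX = np.vander(x, 4).T                     # (4, 5): features x³, x², x, 1
y = np.sin(3 * x)
mu_post, Sigma_post = weight_posterior(
    PhiX, y, mu=np.zeros(4), Sigma=np.eye(4), Lam=0.1 * np.eye(5))
```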
[Figure: example feature functions φᵢ(x) plotted over x, e.g. φᵢ(x) = |x − cᵢ| and exponential features.]
▸ Generalised linear regression allows inference on real-valued
functions over arbitrary input domains. By varying the feature
set Φ, broad classes of hypotheses can be created. In particular,
Gaussian regression models can model nonlinear, discontinuous,
even unbounded functions. There are literally no limitations on
the choice of φ : X → ℝ.

▸ Neither the posterior mean nor the covariance over function
values (Eq. (4.3)) contain “lonely” feature vectors, but only
inner products of the form $k_{ab} := \phi_a^\top \Sigma \phi_b$ and $m_a := \phi_a^\top \mu$.
For reasons to become clear in the following section, these
quantities are known as the covariance function k : X × X → ℝ
and the mean function m : X → ℝ, respectively.

Exercise 4.1 (easy). Consider the likelihood of Eq. (4.2) with the
parametric form for f of Eq. (4.1). Show that the maximum likelihood
estimator for w is given by the ordinary least-squares estimate

$$w_{\mathrm{ML}} = (\Phi_X \Phi_X^\top)^{-1} \Phi_X y.$$

To do so, use the explicit form of the Gaussian pdf to write out
log p(y | X, w), take the gradient with respect to the elements [w]ᵢ
of the vector w and set it to zero. If you find it difficult to do this
in vector notation, it may be helpful to write out $\Phi_X^\top w = \sum_i w_i [\Phi_X]_{i:}$,
where $[\Phi_X]_{i:}$ is the ith column of Φ_X. Calculate the derivative of
log p(y | X, w) with respect to wᵢ, which is scalar.
with scale λ ∈ ℝ₊ over the domain [c_min, c_max] ⊂ ℝ. That is, set

$$\phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{\lambda^2}\right), \qquad i = 1, \dots, F,$$

with the prior variance of each weight scaled proportionally to
(c_max − c_min)/F, and set μ = 0. It is then possible to convince
oneself¹⁰ that the limit of F → ∞ and c_min → −∞, c_max → ∞ yields

$$k(a, b) = \theta^2 \exp\left(-\frac{(a-b)^2}{2\lambda^2}\right). \tag{4.4}$$

This is called a nonparametric formulation of regression, since
the parameters w of the model (regardless of whether their
number is finite or infinite) are not explicitly represented in the
computation. The function k constructed in Eq. (4.4), if used in
a Gaussian regression framework, assigns a covariance of full
rank to arbitrarily large data sets. Such functions are known as
(positive definite) kernels.

¹⁰ Proof sketch, omitting technicalities: the weighted sum of feature products factorises as

$$k_{ab} \propto e^{-\frac{(a-b)^2}{2\lambda^2}} \sum_{i=1}^{F} e^{-\frac{\left(c_i - \frac{1}{2}(a+b)\right)^2}{\lambda^2/2}}.$$

In the limit of large F, the number of features in a region of width δc converges to Fδc/(c_max − c_min), and the sum becomes the Gaussian integral

$$\int_{c_{\min}}^{c_{\max}} e^{-\frac{\left(c - \frac{1}{2}(a+b)\right)^2}{\lambda^2/2}}\, \mathrm{d}c.$$

For c_max, c_min → ±∞, that integral converges to $\sqrt{\pi\lambda^2/2}$.
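The limit can also be probed numerically. A small sketch (ours; the overall scale is normalised away, since it depends on the feature scaling in front of the sum):

```python
# Finite feature expansion approaching the squared-exponential kernel.
import numpy as np

lam = 0.5
F = 2000
c = np.linspace(-10.0, 10.0, F)              # feature centres c_i on a dense grid

def phi(x):
    return np.exp(-(x - c) ** 2 / lam ** 2)  # one bump feature per centre

def k_features(a, b):
    return phi(a) @ phi(b)                   # finite-F covariance, unnormalised

a, b = 0.3, -0.4
ratio = k_features(a, b) / np.sqrt(k_features(a, a) * k_features(b, b))
se = np.exp(-(a - b) ** 2 / (2 * lam ** 2))  # Eq. (4.4) at unit output scale
print(ratio, se)                             # the two agree to several digits
```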
is positive (semi-)definite.¹¹ Such functions are also known as Mercer
kernels, positive definite functions, and, in the context of Gaussian
processes, as covariance functions.

¹¹ It follows from the definition of positive definite matrices (§15.2) that all kernels obey k(a, a) ≥ 0 and k(a, b) = k(b, a), ∀a, b ∈ X.
[Figure: example kernels plotted over x, e.g. k(a, b) = e^{−|a−b|}.]
mean prediction) at a new location x⋆ is linear in N; computing
the marginal variance at such a point is quadratic in N.

1. Consider the real number a ∈ ℝ. The function f̃ = a · f is also
distributed according to a Gaussian process. What are its mean
and covariance functions?
2. The sum f + g is also distributed according to a Gaussian process.
What are the mean function and the kernel of f + g?
3. The difference f − g is also distributed according to a Gaussian
process. What are the mean function and the kernel of f − g?

► 4.3 Relationship to Least-Squares and Statistical Learning

In the case of Gaussian models, the probabilistic formalism
is closely connected to other frameworks for inference, learning,
and approximation. This connection is helpful to connect
Probabilistic Numerics with classic numerical point estimates.
A theorem due to Moore and Aronszajn¹⁴ shows that each
kernel is associated with a unique reproducing kernel Hilbert space (rkhs),

where ‖f‖_H is the rkhs norm of f.¹⁷ The corresponding statement
for the posterior standard deviation

$$\sigma(x) := \sqrt{V(x,x)} = \sqrt{k_{xx} - k_{xX}(k_{XX} + \Lambda)^{-1}k_{Xx}} \tag{4.9}$$

is given by the following theorem about the worst-case approximation:
defined by Eq. (4.6) is tightly bounded above by the associated GP
posterior marginal variance for Λ → 0.

¹⁴ This theorem was first published by Aronszajn (1950).
¹⁷ Note that, in this text, the meaning of the symbol k is overloaded. On the one hand, it signifies the covariance function of a Gaussian process (gp) and, on the other hand, the kernel of an rkhs. This double use of k is common practice due to the similarity of both cases. $|\langle f, g\rangle|^2 \le \langle f, f\rangle \cdot \langle g, g\rangle$, for f, g ∈ H.
$$k_{\mathrm{SE}}(a, b) := \exp\left(-\frac{(a-b)^2}{2\lambda^2}\right),$$

together with the corresponding closed-form expressions for its definite integrals
(involving the Gaussian cdf Φ) and for the covariances between f and its
derivative, such as cov(f′(x), f(x)), all in terms of k_SE(x, x).
5
Gauss-Markov Processes
Filtering and Stochastic Differential Equations
$$p(x_t \mid y_{0:t-1}) = \frac{\int p(y_{0:t-1} \mid x_{0:t-1})\, p(x_0)\, \mathrm{d}x_0 \left(\prod_{0<j<t} p(x_j \mid x_{j-1})\, \mathrm{d}x_j\right) p(x_t \mid x_{t-1}) \left(\prod_{i>t} p(x_i \mid x_{i-1})\, \mathrm{d}x_i\right)}{\int p(y_{0:t-1} \mid x_{0:t-1})\, p(x_0)\, \mathrm{d}x_0 \left(\prod_{0<j<t} p(x_j \mid x_{j-1})\, \mathrm{d}x_j\right) p(x_t \mid x_{t-1})\, \mathrm{d}x_t \left(\prod_{i>t} p(x_i \mid x_{i-1})\, \mathrm{d}x_i\right)}$$
Predict: use the posterior from the previous step to compute²

$$p(x_t \mid y_{0:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{0:t-1})\, \mathrm{d}x_{t-1}. \tag{5.2}$$

If t = 0, start the induction with the prior p(x₀).

Update: include the local observation into the posterior by
Bayes’ theorem:

$$p(x_t \mid y_{0:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})}{\int p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})\, \mathrm{d}x_t}. \tag{5.3}$$

² These steps are a special case of a general algorithm called belief propagation or the sum-product algorithm (Pearl, 1988; Lauritzen and Spiegelhalter, 1988). It formalises the computational cost of inference in joint probability distributions given their factorisation into local terms. There are visual formal languages capturing such factorisation properties, known as graphical models (e.g. Figure 5.1).
Expression (5.2) is known as the Chapman-Kolmogorov equation.
It can be found as Eqs. (5) and (5*) in a seminal paper by
Kolmogorov (1936).
Then, we just insert Eq. (5.5) into Eq. (5.6),⁴ which yields

$$p(x_0) = \mathcal{N}(x_0; m_0, P_0), \tag{5.8}$$
$$p(x_{t+1} \mid x_t) = \mathcal{N}(x_{t+1}; A_t x_t, Q_t), \qquad p(y_t \mid x_t) = \mathcal{N}(y_t; H_t x_t, R_t). \tag{5.9}$$

⁴ Kitagawa (1987)
The latter two relations are also often written as, and known as,
the dynamic model and the measurement model, respectively.
In signal processing and control theory, the matrices A_t, H_t, Q_t, R_t
are known as the transition and measurement matrices, process-noise
and observation-noise covariances, respectively. We will
sometimes use these intuitions evoking a dynamical system, but
they will not always fit the numerical domain directly. The entire
setup is known as a linear Gaussian system in control theory.
However, to avoid confusion with the regression models defined
in §4.1 and §4.2 (which are also Gaussian and linear), we will
use another popular convention and call this a linear Gaussian
state-space model, to stress that the inference takes place in terms
of the time-varying states x_t. If the parameters are independent
of time, A_t = A, Q_t = Q, H_t = H and R_t = R for all t, they
define a linear time-invariant (LTI) system.
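As an illustration, here is a minimal Python sketch (ours; the book's Algorithm 5.1 is the authoritative pseudo-code) of a single predict/update step of the Kalman filter for this state-space model:

```python
# One predict/update step of the Kalman filter for the linear Gaussian
# state-space model p(x_{t+1}|x_t) = N(A x_t, Q), p(y_t|x_t) = N(H x_t, R).
import numpy as np

def kalman_step(m, P, A, Q, H, R, y):
    # predict, Eq. (5.2): p(x_t | y_{0:t-1}) = N(m_pred, P_pred)
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # update, Eq. (5.3): condition on the new observation y_t
    z = y - H @ m_pred                    # observation residual
    S = H @ P_pred @ H.T + R              # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # gain
    m_new = m_pred + K @ z
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new, m_pred, P_pred
```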
The update, Eq. (5.3), becomes

$$p(x_t \mid y_{0:t}) = \mathcal{N}(x_t; m_t, P_t),$$

Exercise 5.3 (easy). Using the basic properties of Gaussians from
Eqs. (3.4), (3.8) & (3.10) and the prediction-update Eqs. (5.2)
& (5.3), show that Eqs. (5.10) to (5.13) hold.
Algorithm 5.2: Single step of the RTS smoother. Notation as in Algorithm 5.1. The smoother, since it does not actually touch the observations y_t, has complexity O(L³).

1 procedure Smoother(m_t, P_t, A, m⁻_{t+1}, P⁻_{t+1}, mˢ_{t+1}, Pˢ_{t+1})
2   G = P_t Aᵀ (P⁻_{t+1})⁻¹                  // gain
3   mˢ_t = m_t + G (mˢ_{t+1} − m⁻_{t+1})     // posterior mean
4   Pˢ_t = P_t + G (Pˢ_{t+1} − P⁻_{t+1}) Gᵀ  // posterior covariance
5 end procedure
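The smoother step translates just as directly into code; a sketch (ours) mirroring Algorithm 5.2:

```python
# One backward step of the Rauch-Tung-Striebel smoother.
import numpy as np

def rts_step(m, P, A, m_pred_next, P_pred_next, m_s_next, P_s_next):
    G = P @ A.T @ np.linalg.inv(P_pred_next)       # smoother gain
    m_s = m + G @ (m_s_next - m_pred_next)         # smoothed mean
    P_s = P + G @ (P_s_next - P_pred_next) @ G.T   # smoothed covariance
    return m_s, P_s
```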
Algorithm 5.3: Algorithmic wrapper around Algorithm 5.1 to perform online prediction at a sequence of points tᵢ. Predict has constant memory requirements over time, and its computation cost is constant per time step. Thus inference from N observations requires N steps of Algorithm 5.1, at cost O(N(L³ + V³)). See the Bayesian ODE filter (Algorithm 38.1); one of its special cases, the EKF0, is analogous to this Algorithm 5.3 (as discussed in §38.3.4).

1 procedure Predict(m₀, P₀, [A_t, Q_t, H_t, y_t, R_t]_{t=1,...,M})
2   for t = 1, ... do
3     (m_t, P_t) = Filter(m_{t−1}, P_{t−1}, A_t, Q_t, H_t, R_t, y_t)
4   end for
5 end procedure

of the prediction and updated beliefs, as follows:
Let G_t := P_t Aᵀ (P⁻_{t+1})⁻¹ (smoother gain),
then mˢ_t = m_t + G_t (mˢ_{t+1} − m⁻_{t+1})
and Pˢ_t = P_t + G_t (Pˢ_{t+1} − P⁻_{t+1}) G_tᵀ.  (5.15)
These update rules are often simply called the “Kalman smoother”
or, more precisely, the (fixed-interval) Rauch-Tung-Striebel
(RTS) smoother equations.⁷ The estimates computed by the Kalman
filter and smoother are depicted in Figure 5.2.
⁷ Rauch, Striebel, and Tung (1965)
For reference, Algorithms 5.1 and 5.2 summarise the above
results in pseudo-code, providing the individual steps for both
the filter and the smoother. The wrapper algorithm Predict (Algorithm 5.3)
solves the task of continuously predicting the subsequent
state of a time series. For a finite time series running
from t = 0 to t = T, the algorithm Infer (Algorithm 5.4) returns
posterior marginals for every time step t. Since these marginals
(that is, their means and variances) are exactly equal to those
of GP regression (§4.2.2), algorithm Infer (Algorithm 5.4) is
nothing but a linear-time implementation of GP regression with
Markov priors.⁸

The Kalman filter (and smoother) are so efficient that they can
sometimes be applied even if the linearity or Gaussianity of
the dynamic or measurement model are violated.⁹ In numerics,
the case for such fast-and-Gaussian methods is even stronger

⁸ Särkkä and Solin (2019), §12.4
⁹ Särkkä (2013), §13.1
Algorithm 5.4: Algorithmic wrapper around Algorithms 5.1 and 5.2 to perform inference for a time series of finite length. While the runtime of this routine is also linear in T, in contrast to Algorithm 5.3 it has linearly growing memory cost, to store all the parameters of the posterior and intermediate distributions. See the Bayesian ODE smoother (Algorithm 38.2); one of its special cases, the EKS0, is analogous to this Algorithm 5.4 (as discussed in §38.3.4).

1 procedure Infer(m₀, P₀, [A_t, Q_t, H_t, y_t, R_t]_{t=1,...,T})
2   for t = 1 : 1 : T do
3     ((m_t, P_t), (m⁻_t, P⁻_t)) =
4       Filter(m_{t−1}, P_{t−1}, A_{t−1}, Q_{t−1}, H_t, R_t, y_t)
5   end for
6   for t = T − 1 : −1 : 0 do
7     (mˢ_t, Pˢ_t) =
8       Smoother(m_t, P_t, A_t, m⁻_{t+1}, P⁻_{t+1}, mˢ_{t+1}, Pˢ_{t+1})
9   end for
10 end procedure
For a moment, let us put aside the observations y and only consider
the prior defined by the dynamic model from Eq. (5.11).
The predictive means of this sequence of Gaussian variables follow
the discrete linear recurrence relation m_{t+1} = A_t m_t. When
solving numerical tasks, the time instances t will not usually be
immutable discrete locations on a regular grid, but values chosen
by the numerical method itself, on a continuous spectrum.
We thus require a framework creating continuous curves x(t) for
t ≥ t₀ ∈ ℝ that are consistent with such linear recurrence equations.
For deterministic quantities, linear differential equations are
that tool: consider the linear (time-invariant) dynamical system
for x(t) ∈ ℝᴺ. Thus, the linear ordinary differential equation (5.16), together
with a set of discrete time locations [t₀, t₁, ..., t_N = T], gives rise
to the linear recurrence relation

with

$$A_{t_i} := \exp\left(F(t_{i+1} - t_i)\right) \quad\text{and}\quad Q_{t_i} := \int_0^{t_{i+1} - t_i} e^{F\tau} L L^\top e^{F^\top\tau}\, \mathrm{d}\tau. \tag{5.21}$$

¹¹ This uses the matrix exponential

$$e^X = \exp(X) := \sum_{i=0}^\infty \frac{X^i}{i!}, \tag{5.17}$$

where Xⁱ := X ⋯ X is the ith power of X (defining X⁰ := I). The exponential exists for every complex-valued matrix. Among its properties are:
▸ e⁰ = I (for the zero matrix);
▸ if XY = YX, then e^X e^Y = e^Y e^X = e^{X+Y};
▸ if D = diagᵢ(dᵢ) then e^D = diagᵢ(e^{dᵢ}) (using the scalar exponential). In particular, if X = VDV⁻¹ is the eigendecomposition of X, then e^X = V e^D V⁻¹.
[Figure: filtering estimates of the process over the time grid t₀, t₁, ..., t₇.]
with entries scaling as h^{2q+3−i−j}. A detailed derivation of A and Q is provided in the literature.¹⁸
¹⁸ Kersting, Sullivan, and Hennig (2020)
$$\operatorname{cov}(f(a), f(b)) = \theta^2 \left(\frac{\min^3(a,b)}{3} + \frac{|a-b|\,\min^2(a,b)}{2}\right) \tag{5.27}$$
(see also Figure 4.3). We recall that the posterior mean function
of a Gaussian process is a weighted sum of kernel evaluations.
Hence, under the choice (5.27) and noise-free observations of
f, the posterior mean on f would be the piecewise cubic inter-
polant of the data (which is unique within the convex hull of the
$$\mathrm{d}x = -\frac{1}{\lambda}\, x\, \mathrm{d}t + \mathrm{d}\omega_t, \qquad k(t, t') = \theta^2 e^{-\frac{|t - t'|}{\lambda}}.$$

The naming of the family is due to Bertil Matérn.²¹

$$\mathrm{d}z(t) = \begin{pmatrix} 0 & 1 \\ -\zeta^2 & -2\zeta \end{pmatrix} z(t)\, \mathrm{d}t + \begin{pmatrix} 0 \\ \sigma \end{pmatrix} \mathrm{d}\omega_t,$$

$$\mathrm{d}z(t) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -\zeta^3 & -3\zeta^2 & -3\zeta \end{pmatrix} z(t)\, \mathrm{d}t + \begin{pmatrix} 0 \\ 0 \\ \sigma \end{pmatrix} \mathrm{d}\omega_t. \tag{5.30}$$
found in the book by Faßbender (2007). More generally, Riccati
equations have been the subject of deep analysis. A classic book
was written by Reid (1972). Bini, Iannazzo, and Meini (2011)
provide a more recent review.

...with symmetric matrices B, C. There is a corresponding stability analysis for the continuous-time algebraic Eq. (5.32) involving a Hamiltonian matrix.
6
Hierarchical Inference in Gaussian Models
$$p(y \mid \alpha, \beta) = \prod_{i=1}^N \mathcal{N}(y_i; \alpha, \beta^2) = \mathcal{N}(y; \alpha \mathbf{1}, \beta^2 I).$$
A prior π with this property is called conjugate to the likelihood
p(y | α, β). For the present case of a Gaussian likelihood
with latent mean and variance, the standard choice of conjugate
prior assigns a Gaussian distribution of mean μ₀ and variance
proportional to β² (with scale λ₀) to α, and a Gamma distribution
with parameters a₀, b₀ to the inverse of β². This is known
as the Gauss-Gamma, or Gauss-inverse-Gamma, prior, and has the
hyperparameters θ₀ := [μ₀, λ₀, a₀, b₀]:

$$\pi(\alpha, \beta \mid \mu_0, \lambda_0, a_0, b_0) = p(\alpha \mid \beta, \mu_0, \lambda_0)\, p(\beta \mid a_0, b_0) = \mathcal{N}\left(\alpha;\ \mu_0,\ \frac{\beta^2}{\lambda_0}\right) \mathcal{G}(\beta^{-2}; a_0, b_0),$$

$$\text{where}\quad \mathcal{G}(z; a, b) := \frac{b^a z^{a-1} e^{-bz}}{\Gamma(a)}.$$

Here, G(·; a, b) is the Gamma distribution with shape a > 0
and rate b > 0. The normalisation constant of the Gamma
distribution, and also the source of the distribution’s name, is
the Gamma function Γ.³ To compute the posterior, we multiply
prior and likelihood, and identify

$$p(\alpha, \beta \mid y) \propto p(y \mid \alpha, \beta)\, p(\alpha, \beta) = \mathcal{N}(y; \mathbf{1}\alpha, \beta^2 I)\, \mathcal{N}(\alpha; \mu_0, \beta^2/\lambda_0)\, \mathcal{G}(\beta^{-2}; a_0, b_0). \tag{6.2}$$

³ In the context of this text, it is a neat marginal observation that, while Legendre was interested in problems of chance, Euler’s original motivation for considering those integrals, in an exchange with Goldbach, was one of interpolation: he was trying to find a “simple” smooth function connecting the factorial function on the reals. And indeed, Γ(n) = (n − 1)! for n ∈ ℕ\0. Legendre is also to blame for this unsightly shift in the function’s argument, since he constructed Eq. (6.1) by rearranging Euler’s more direct result n! = ∫₀¹ (−log x)ⁿ dx. A great exposition on this story and the Gamma function can be found in a Chauvenet-prize-decorated article by Davis (1959). It is left as a research exercise to the reader to consider in which sense Euler’s answer to this interpolation problem is natural, in particular from a probabilistic-numerics standpoint (that is, which prior assumptions give rise to the Gamma function as an interpolant of the factorials, and whether there are other meaningful priors yielding other interpolations).
We first deal with the Gaussian part, using Eq. (3.5) (and some
simple vector arithmetic) to re-arrange this expression, and the
first one does not depend on α. To deal with the first part, we
use the matrix inversion lemma, Eq. (15.9), to rewrite

$$\left(I + \frac{\mathbf{1}\mathbf{1}^\top}{\lambda_0}\right)^{-1} = I - \frac{\mathbf{1}\mathbf{1}^\top}{N + \lambda_0}.$$

This allows writing the first line of Eq. (6.3) more explicitly
(leaving out normalisation constants independent of β) as

$$\mathcal{N}\left(y;\ \mu_0 \mathbf{1},\ \beta^2\left(I + \frac{\mathbf{1}\mathbf{1}^\top}{\lambda_0}\right)\right) \mathcal{G}(\beta^{-2}; a_0, b_0),$$

$$\hat\alpha := \frac{1}{N}\sum_{i=1}^N y_i \quad\text{(sample mean)}, \tag{6.5}$$

$$\hat\beta^2 := \frac{1}{N}\sum_{i=1}^N (y_i - \hat\alpha)^2 \quad\text{(sample variance)}, \tag{6.6}$$

$$\sum_{i=1}^N (y_i - \mu_0)^2 - \frac{\left(\sum_{i=1}^N y_i - N\mu_0\right)^2}{\lambda_0 + N} = N\hat\beta^2 + \frac{\lambda_0 N}{\lambda_0 + N}\left(\hat\alpha - \mu_0\right)^2.$$

The terms in Eq. (6.4) thus have the form of a Gamma distribution
over β⁻². All other terms suppressed by the ∝ sign make up
the normalisation constant of the new Gauss-Gamma posterior

$$p(\alpha, \beta \mid y, \mu_0, \lambda_0, a_0, b_0) = \pi(\alpha, \beta \mid \mu_N, \lambda_N, a_N, b_N), \tag{6.7}$$

whose b_N contains the additional term

$$\frac{\lambda_0 N}{\lambda_0 + N}\left(\hat\alpha - \mu_0\right)^2.$$

Intuitively, this term corrects for the fact that β̂² is a biased
estimate of the variance - the sample mean is typically closer to
the samples than the actual mean is, and this bias depends on
how far the initial estimate μ₀ is from the correct mean.

Exercise: ...of i.i.d. draws yᵢ, one observes individually scaled samples aᵢ = yᵢsᵢ with known scales sᵢ. By re-tracing the derivation, convince yourself that this situation can be addressed in the same fashion, but with new sufficient statistics. This will be helpful for hierarchical inference in §III.
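In code, the posterior hyperparameters take only a few lines; a sketch (ours) using the standard Normal-Gamma conjugacy updates consistent with the derivation above:

```python
# Gauss-Gamma (Normal-Gamma) posterior hyperparameters from data y.
import numpy as np

def gauss_gamma_posterior(y, mu0, lam0, a0, b0):
    N = len(y)
    alpha_hat = np.mean(y)                       # sample mean, Eq. (6.5)
    beta2_hat = np.mean((y - alpha_hat) ** 2)    # sample variance, Eq. (6.6)
    mu_N = (lam0 * mu0 + N * alpha_hat) / (lam0 + N)
    lam_N = lam0 + N
    a_N = a0 + N / 2
    b_N = b0 + 0.5 * (N * beta2_hat
                      + lam0 * N / (lam0 + N) * (alpha_hat - mu0) ** 2)
    return mu_N, lam_N, a_N, b_N
```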
⁵ An alternative contender for the multivariate case is the Gauss-inverse-Wishart distribution.
$$\mu_N = \frac{\lambda_0 \mu_0 + N\hat\alpha}{\lambda_0 + N}, \qquad \lambda_N = \lambda_0 + N, \qquad \nu_N = \nu_0 + N,$$

$$p(y_{N+1} \mid Y, \mu_0, \lambda_0, \nu_0, W_0) = \int p(y_{N+1} \mid \alpha, B)\, p(\alpha, B \mid Y, \mu_0, \lambda_0, \nu_0, W_0)\, \mathrm{d}\alpha\, \mathrm{d}B \tag{6.12}$$

$$= \mathrm{St}\left(y_{N+1};\ \mu_N,\ \frac{\lambda_N + 1}{\lambda_N(\nu_N - M + 1)} W_N,\ \nu_N - M + 1\right). \tag{6.13}$$
In particular, for SDEs like those of the preceding §5.4 and §5.5,
which can be written, with an explicit role for θ, as

$$p(y \mid \theta) = \prod_{i=1}^N \mathcal{N}\left(y_i;\ H m_i^-,\ \theta^2 H P_i^- H^\top\right), \qquad p(\theta) = \mathcal{G}(\theta^{-2}; a_0, \beta_0),$$

$$p(\theta^{-2} \mid y_{1:N}) = \mathcal{G}\left(\theta^{-2};\ a_0 + \frac{N}{2},\ \beta_0 + \frac{1}{2}\sum_{i=1}^N \frac{(y_i - H m_i^-)^2}{H P_i^- H^\top}\right) =: \mathcal{G}(\theta^{-2}; a_N, \beta_N). \tag{6.14}$$

In Algorithm 5.1, this can be realised¹⁰ by using Q̆ instead of
Q, and adding the lines a ← a + 1/2 and β ← β + z²/(2s) after line
8 (those two variables having been initialised to a ← a₀, β ← β₀).
The corresponding Student-t marginal on the state can be computed
locally, if necessary, in each step.

¹⁰ Pseudo-code can be found in the following chapter, as Algorithm 11.2.
7
Summary of Part I
► 9.1 Motivation
(Figure 9.1). The handful of symbols on paper that make up
Eq. (9.1) fully specify a unique deterministic function. For arbitrary
double-precision values of x ∈ ℝ, a laptop computer can
evaluate f(x) to machine precision in a few nanoseconds, using
only multiplication and addition, and the “atomic” functions
exp and sin, which are part of elementary programming language
definitions, like the GNU C library.³ Repeated evaluations
at the same value x will always return the same result. There is
nothing imprecise about Eq. (9.1) and thus, one may argue that
there is also nothing uncertain about the function f. However,
the definite real number

$$F := \int_{-3}^{3} f(x)\, \mathrm{d}x \in \mathbb{R} \tag{9.2}$$

cannot be computed straightforwardly or elementarily. Since
f is clearly integrable, there is one and only one correct value
of F. But this real number cannot be found in standard tables
of analytic integrals.⁴ And there is no atomic operation in low-level
libraries providing its value to machine precision. Despite
the formal clarity of Eq. (9.1), we are evidently uncertain about
the value of F, because we cannot provide a correct value for
it without further work. This is epistemic uncertainty, the kind
arising from a lack of knowledge.

³ www.gnu.org/software/libc. There is a fuzzy boundary between what constitutes an “atomic” operation and a numerical algorithm. We will be pragmatic and define the libc as the set of atomic operations. A more abstract notion, in line with the purposes of this text, albeit also less concrete, is as follows: consider an algorithm a, a map a(θ) = ω̂ taking inputs θ ∈ Ω from some space Ω that define a computational task with a true (potentially unknown) solution ω, and returning an estimate ω̂. If there are values of θ that the algorithm accepts without throwing an error, such that the resulting ω̂ deviates from the true ω by more than machine precision, then we might call a a numerical method, otherwise a low-level, atomic routine. Put succinctly, a numerical method produces an estimate that may be off, while an atomic routine just returns correct numbers. Hence, numerical methods can benefit from a non-trivial notion of uncertainty, while in atomic routines the associated uncertainty is always nil. This is not meant to be a perfectly rigorous definition of the term, and it is imperfect (for example, numerical methods also often have precision parameters and budgets to tune and spend, an aspect ignored by this definition), but a precise definition is not needed in practice anyway.
⁴ Gradshteyn and Ryzhik (2007)

But it is easy to constrain the value of F to a finite domain:
Because f(x) is strictly positive, F > 0. A second glance at
Eq. (9.1) lets us notice that f is bounded above by the Gaussian
function:

$$f(x) \le g(x) := \exp(-x^2) \quad \forall x \in \mathbb{R}.$$

Thus,

$$0 < F \le \int_{-\infty}^{\infty} g(x)\, \mathrm{d}x = \sqrt{\pi}.$$

Hence it is possible to define a proper prior measure over the
value of F, for example the uniform measure p(F) = U_{(0,√π)}(F).
Thus, if we can collect “data” Y that is related to F through some
correctly defined and sufficiently regular likelihood function
p(Y | F), then the resulting posterior p(F | Y) will converge
towards the true value F, at an asymptotically optimal rate for
this likelihood.⁵ The question is thus: which prior, and which
likelihood, should we choose?

Exercise 9.1 (inspirational, see discussion in §9.3). Even without
appealing to the Gaussian integral, we could also bound f
from above with the unit function u(x) = 1 on the integration
domain, and would arrive at the looser bounds 0 < F < 6. On the
other hand, if we allow the use of the function erf (which is also
in glibc), we could refine the upper bound to the definite integral
over g, arriving at 0 < F < √π erf(3). In which sense is this more
“precise” prior more “correct”? Is there a most correct, or optimal,
prior? There is no immediate answer to these questions, but it is
helpful to ponder them while reading the remainder of this chapter.

⁵ Le Cam (1973)
For the example integrand f with the integration limits (a, b) =
(−3, 3) from Eq. (9.1), the uniform measure p(x) = U_{[−3,3]}(x)
fulfils these properties. Another possibility is the Gaussian measure
restricted to [−3, 3], that is,
$$\hat F := \frac{1}{N}\sum_{i=1}^N w(x_i), \tag{9.4}$$

using the function w(x) := f(x)/p(x), which is well-defined by
our above assumptions on p(x). The logic of this procedure is
depicted in Figure 9.2.

Note that, in contrast to F, the estimator F̂ is a random number.
By constructing F̂, we have turned the problem (9.1) - inferring
an uncertain but unique deterministic number - into a stochastic,
statistical one. This introduction of external stochasticity
introduces a different form of uncertainty, the aleatory kind,
a lack of knowledge arising from randomness (more in §12.3).
Because we have full control over the nature of the random numbers,
it is possible to offer a quite precise statistical analysis of F̂:

Figure 9.2: Monte Carlo integration. Samples are drawn from the Gaussian measure p(x) (unnormalised measure as dashed line, samples as black dots), and the ratio w(x) = f(x)/p(x) is evaluated for each sample. The histogram plotted vertically on the left (arbitrary scale) shows the resulting distribution of w. Its expected value, times the normalisation erf(3)√π, is the true integral. Its standard deviation determines the scale for convergence of the Monte Carlo estimate.
assuming var_p(w) exists. Hence, the standard deviation (the square
root of the variance, which is a measure of expected error) drops as
O(N^{−1/2}) - the convergence rate of Monte Carlo integration.
This is a strong statement given the simplicity of the algorithm
it analyses: random numbers from almost any measure allow
estimating the integral over any integrable function. This integrator
is “good” in the sense that it is unbiased, and its error
drops at a known rate.⁷ The algorithm is also relatively cheap:
it involves drawing N random numbers, evaluating f(x) once
for each sample, and summing over the results. Given all these
strong properties, it is no surprise that Monte Carlo methods
have become a standard tool for integration. However, there is
a price for this simplicity and generality: as we will see in the
next section, the O(N^{−1/2}) convergence rate is far from the best
possible rate. In fact, we will find ourselves arguing that it is the
worst possible convergence rate amongst sensible integration
algorithms.

Proof of Lemma 9.2. F̂ is unbiased because its expected value is

$$\mathbb{E}(\hat F) = \frac{1}{N}\sum_{i=1}^N \int w(x_i)\, p(x_i)\, \mathrm{d}x_i = F,$$

given that the draws xᵢ are i.i.d., and assuming that the w(·) function is known (more on this later). As F̂ is a linear combination of i.i.d. random variables, the variance of F̂ is immediately a linear combination of the respective variances:

$$\operatorname{var}(\hat F) = \frac{\sum_i \operatorname{var}_p(w_i)}{N^2} = \frac{\operatorname{var}_p(w)}{N}. \qquad\square$$

⁷ The multiplicative constant var_p(w) can even be estimated at runtime! Albeit usually not without bias. See also Exercise 9.3 for some caveats.
Exercise 9.3 (moderate, solution on p. 361). One of the few
assumptions in Lemma 9.2 is the existence of var_p(w). Try to find an
example of a simple pair of integrand f and measure p for which
this assumption is violated.

Monte Carlo is not limited to problems where samples can
be drawn exactly. Where exact sampling from a distribution
is difficult, Monte Carlo is often practically realised through
Markov chain Monte Carlo (mcmc). These iterative methods do
not generally achieve the O(N^{−1/2}) convergence rate, but they
can still be shown to be consistent, meaning that their estimate
of the integral asymptotically converges to its true value.
The model thus encodes assumptions, not just over the integrand,
but also over its relationship to the numbers being
computed to estimate it.

$$F = \int_{\mathbb{X}} f(x)\, \mathrm{d}\nu(x), \tag{10.1}$$

where ν(x) is a measure.¹ The domain of integration, X, is, in
practical applications, often a bounded interval, such as [a, b] ⊂
ℝ. The integration problem (10.1) is written as univariate, which
will be the setting motivating the majority of this chapter. We
will, however, also explain in §10.1.2 how Bayesian quadrature
can be generalised to multivariate problems.

¹ The example of Eq. (9.2) can either be seen as integrating the function f(x) = e^{−sin²(3x)} against the Gaussian measure ν(x) = e^{−x²}, or as integrating the f from Eq. (9.1) against the Lebesgue measure.
As we already saw in Part I, Gaussian process models
are a fundamental tool for efficient formulations of inference.
And indeed they allow for an analytic formulation of integration
as probabilistic inference. Because Gaussian measures are closed
under linear maps (see Eq. (3.4)), they are, in particular, also
closed under integration. Following the exposition of §4.4, a
generic Gaussian process prior p(f) = GP(f; m, k) over the
integrand amounts to a joint Gaussian measure over both the
function values collected in Y := [f(x₁), ..., f(x_N)] and the
integral F:

$$p(F, Y) = \int p(F, Y \mid f)\, p(f)\, \mathrm{d}f = \int \delta\left(F - \int_{\mathbb{X}} f(x)\, \mathrm{d}\nu(x)\right) \prod_{i=1}^N \delta\left(y_i - f(x_i)\right) p(f)\, \mathrm{d}f$$
$$= m_x + k_{xX} k_{XX}^{-1}(Y - m_X),$$

► 10.1 Models

posterior on F:

$$\mathbb{E}(F \mid Y) = \int_{\mathbb{X}} \left(m_x + k_{xX} k_{XX}^{-1}(Y - m_X)\right) \mathrm{d}x, \qquad
\operatorname{var}(F \mid Y) = \iint \left(k_{xx'} - k_{xX} k_{XX}^{-1} k_{Xx'}\right) \mathrm{d}x\, \mathrm{d}x',$$

where

$$\hat m = k_X^\top k_{XX}^{-1} Y, \qquad \hat v = K - k_X^\top k_{XX}^{-1} k_X, \qquad
K = \theta^2 \left(\det 2\pi(2\Sigma + \Lambda)\right)^{-1/2} \in \mathbb{R}.$$
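A compact sketch (ours) of this pipeline for a squared-exponential prior with zero prior mean and Lebesgue measure on [a, b]; the kernel mean is analytic via erf, while the initial variance term ∫∫k is approximated on a grid for brevity:

```python
# Bayesian quadrature with an SE kernel: posterior mean z K^{-1} Y and
# variance ∫∫k - z K^{-1} z, where z_i = ∫ k(x, x_i) dx.
import numpy as np
from scipy.special import erf

theta, lam = 1.0, 0.5
k = lambda r: theta**2 * np.exp(-r**2 / (2 * lam**2))

def bq(x_nodes, y, a=-3.0, b=3.0):
    K = k(x_nodes[:, None] - x_nodes[None, :]) + 1e-10 * np.eye(len(x_nodes))
    z = theta**2 * lam * np.sqrt(np.pi / 2) * (
        erf((b - x_nodes) / (np.sqrt(2) * lam))
        - erf((a - x_nodes) / (np.sqrt(2) * lam)))   # analytic kernel mean
    w = np.linalg.solve(K, z)                         # BQ weights
    grid = np.linspace(a, b, 400)
    dx = grid[1] - grid[0]
    var0 = k(grid[:, None] - grid[None, :]).sum() * dx * dx  # ≈ ∫∫ k dx dx'
    return w @ y, var0 - z @ w            # posterior mean and variance of F

f = lambda x: np.exp(-np.sin(3 * x)**2 - x**2)
xn = np.linspace(-3, 3, 20)
mean, var = bq(xn, f(xn))
```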
[Table: quadrature rules with their design X, posterior variance v, kernel k, and reference.]

$$X^* = \operatorname*{argmin}_{X \in \mathbb{R}^N} v(X).$$
Designing the optimal grid for such a rule, even for regular kernels,
can be challenging, because the corresponding multivariate
optimisation problem can, in general, have high computational
complexity.⁵ However, instead of finding an optimal grid, one can
also sample, at cost O(N³), a draw from the N-determinantal point
process (dpp) associated with k. Results by Bardenet and Hardy
(2019) suggest that doing so causes only limited decrease in
performance over the optimal deterministic grid design (which
amounts to the maximum a posteriori assignment under the
N-dpp). Belhadji, Bardenet, and Chainais (2019) offer further
support for the use of determinantal point process sampling for
integrands known to live in an rkhs.⁶

⁶ The point of these papers does not disagree with our general argument, in §12.3, against the use of random sampling. Rather, these results show that allowing minor deviations from the optimal design can drastically reduce computational complexity at negligible decrease in performance. The necessary “samples” can even be drawn in a deterministic way.
the most desirable model (a log-gp) and the most desirable loss
function. However, bbq employs a first-order approximation to
the exponential function, along with the maintenance of a set of
candidate points x_c at which to refine the approximation. Such
approximation proves both highly computationally demanding
and only weakly expressive of the prior knowledge of the large
dynamic range of the integrand.

The first practical adaptive Bayesian quadrature algorithm
was wsabi,⁸ which adopts another means of expressing the
non-negativity of an integrand: the square root of the integrand
f(x) (minus a constant, α ∈ ℝ) is modelled with a gp f̃. Precisely,

$$f(x) = \alpha + \tfrac{1}{2}\tilde f(x)^2,$$

⁸ Gunter et al. (2014)
where, given data D,
where m(x) and V(x, x') are the usual gp posterior mean (4.6)
and covariance (4.7), respectively (and thus depend on D). An
integrand modelled as the square of a gp will have a smaller
dynamic range than one modelled as an exponentiated gp. In
this respect, wsabi is a step backwards from bbq.
However, the bbq approximations are significantly more
costly, both in computation and quality, than those required for
wsabi. That is, wsabi considers both linearisation and moment-
matched approximation to implement the square-transformation:
both prove more tractable than the linearised-exponential for
bbq. Linearisation gives the following (approximate) posterior
for the integrand:
mL(x) := a + 1 m(x)2,
VL(x,x') := m(x)VV(x,x')m(x'). (10.12)
$$F = \int f(x)\, \mathrm{d}x,$$
(ais).¹¹ These are remarkable results, because the algorithm
for wsabi requires the maintenance of a full gp, with its costly
O(N³) computational scaling in N, the number of evaluations.
These algorithms also require the computationally expensive
management of the gp’s hyperparameters. wsabi also selects
evaluations at each iteration by solving a global optimisation
problem, making calls to optimisation algorithms. In short,
wsabi requires substantial computational overhead. In comparison,
Monte Carlo makes use of a (pseudo-)random number generator
(prng) to select evaluations, at negligible overhead cost, and
a negligibly costly simple average as a model. Nonetheless,
the substantial overhead incurred by wsabi, included in those
measurements of wall-clock time, does not prevent wsabi from
converging more quickly than Monte Carlo alternatives. Again,
the overhead is perhaps better framed as an investment, whose
returns more than compensate for the initial outlay. We will
return to these considerations in our generic arguments against
random numbers in §12.3.
¹¹ Neal (2001)
Buttressing these promising empirical results, Kanagawa and
with
ai(x) = T(q2(x)VV(x, x)) bi(x),
and

$$\int_a^b k(x, x_i)\, \mathrm{d}x = \theta^2 \left(\int_a^{\min(b, x_i)} x\, \mathrm{d}x + \int_{\min(b, x_i)}^b x_i\, \mathrm{d}x - \xi(b - a)\right)$$
$$= \theta^2 \left(\tfrac{1}{2}\left(\min(b, x_i)^2 - a^2\right) + x_i\left(b - \min(b, x_i)\right) - \xi(b - a)\right)$$
$$= \theta^2 \left(x_i b - \tfrac{1}{2}(a^2 + x_i^2) - \xi(b - a)\right) \quad\text{by the assumption } a \le x_i \le b \text{ above,}$$

$$\hat F = \sum_{i=1}^{N-1} \frac{\delta_i}{2}\left(f(x_{i+1}) + f(x_i)\right), \qquad \delta_i := x_{i+1} - x_i. \tag{11.4}$$
This is the trapezoidal rule.³ It is arguably the most basic quadrature
rule, second only to Riemann sums. Hence we have the
following result.

Theorem 11.1. The trapezoidal rule is the posterior mean estimate⁴
for the integral F = ∫_a^b f(x) dx under any centred Wiener process
prior p(f) = GP(f; 0, k) with k(x, x′) = θ²(min(x, x′) − ξ) for
arbitrary θ ∈ ℝ₊ and ξ ≤ a.

³ Davis and Rabinowitz (1984), §2.1.4
⁴ Because the mean of a Gaussian distribution coincides with the location of maximum density, the trapezoidal rule is also the maximum a posteriori estimate associated with this setup.
$$A_i = \begin{pmatrix} 1 & \delta_i \\ 0 & 1 \end{pmatrix}, \qquad Q_i = \theta^2 \begin{pmatrix} \delta_i^3/3 & \delta_i^2/2 \\ \delta_i^2/2 & \delta_i \end{pmatrix}.$$

Using H = (0 1) and setting R = 0 to encode the likelihood
p(yᵢ | f) = δ(yᵢ − f(xᵢ)), we can thus write the steps of the
Kalman filter (Algorithm 5.1) explicitly,⁵ and find that they
simplify considerably. The mean and covariance updates in
lines⁶ 7 and 8 of Algorithm 5.1 are simply

⁵ Since the goal is to infer the definite integral F_b at the right end of the domain, there is no need to also run the smoother (Algorithm 5.2). It could be used, however, to also construct estimates for the antiderivative F_x (Eq. (11.5)) at arbitrary locations a < x < b.
⁶ Note that the symbol z has a different meaning here.
If we allow for an observation at x₁ = a, then the initial values
of the SDE are irrelevant. Since we know F_a = 0 by definition
(thus with vanishing uncertainty), the natural initialisation for
the filter at x₁ = a is

$$m_1 = \begin{pmatrix} 0 \\ f(a) \end{pmatrix}, \qquad P_1 = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$

The filter thus takes the straightforward form of Algorithm 11.1.
This algorithm is so simple that it barely makes sense to spell
it out. The significance of this result is that Bayesian inference
on an integral, using a nonparametric Gaussian process with
N evaluations of the integrand, can be performed in O(N)
operations.

Put another way, the simple software implementation of
the trapezoidal rule is identical to that of a particular Bayesian
quadrature algorithm.

Exercise 11.2 (easy). Convince yourself that Eqs. (11.6) and (11.7)
indeed arise as the updates in Algorithm 5.1 from the choices of
A, Q, H, R made above. Then show that the resulting mean estimate
m_N at x = b indeed amounts to the trapezoidal rule (e.g. by a
telescoping sum). That is,

$$\mathbb{E}(F) = \sum_{i=1}^{N-1} \frac{\delta_i}{2}\left(f_{i+1} + f_i\right), \qquad \operatorname{var}(F) = \frac{\theta^2}{12}\sum_{i=1}^{N-1} \delta_i^3.$$

In practice, the algorithm could thus be implemented in this simpler (and parallelisable)
form. Note again, however, that this algorithm is not a good practical integration
routine, only a didactic exercise. See §11.4 and the literature cited above for more
practical algorithms.
$$\operatorname{var}(F) = \frac{\theta^2}{12}\sum_{i=1}^{N-1} \delta_i^3. \tag{11.8}$$
► 11.1.4 Adaptive Error Estimation

Just like the mean estimate of the trapezoidal rule (11.4), the
equidistant grid as an optimal design choice for the rule is
independent of the scale θ² of the prior. Hence, there is a family of
models M(θ), parametrised by θ ∈ ℝ₊, so that every Gaussian
process prior in that family gives rise to the same design (the
same choice of evaluation nodes), and the same trapezoidal
estimation rule. The associated estimate for the square error, the
posterior variance, of these models is given by Eq. (11.8). If we
choose the design with equidistant steps as derived above, that
expression is given by

$$\operatorname{var}(F) = \frac{\theta^2}{12}(N-1)\left(\frac{b-a}{N-1}\right)^3 = \frac{\theta^2(b-a)^3}{12(N-1)^2}. \tag{11.9}$$

The gradient of Eq. (11.8) w.r.t. δⱼ, j < N − 1, is

$$\frac{\partial \operatorname{var}(F)}{\partial \delta_j} = \frac{\theta^2}{4}\left(\delta_j^2 - \delta_{N-1}^2\right).$$

Setting this to zero (recall that all δᵢ must be positive) gives δⱼ = δ_{N−1}, ∀j ≠ N − 1. Without the formal requirement of x₁ = a, x_N = b, it actually turns out (Sacks & Ylvisaker, 1970 & 1985) that the best design is

$$x_i^* = a + (b - a)\frac{2i}{2N+1}.$$

This leaves a little bit more room on the left end of the domain than on the right, due to the time-directed nature of the Wiener process prior. E.g., for N = 2, the optimal nodes on [a, b] = [0, 1] are at [2/5, 4/5].
$$p(F \mid Y) = \mathrm{St}\left(F;\ \mathbb{E}_{F \mid Y},\ \frac{\beta_N}{a_N}\,\sigma_F^2,\ 2a_N\right). \tag{11.10}$$
From Eq. (11.10) we see that, in contrast to Eq. (11.8), this estimated
variance now actually depends on the function values
collected in Y. For the specific choice of the Wiener process
prior (11.2), the values collected in β_N in Eq. (11.10) become⁹

$$\beta_N = \beta_0 + \frac{1}{2}\sum_{i=1}^N \frac{(y_i - H m_i^-)^2}{H P_i^- H^\top}
= \beta_0 + \frac{1}{2}\left(\frac{f(x_1)^2}{x_1 - \xi} + \sum_{i=2}^N \frac{\left(f(x_i) - f(x_{i-1})\right)^2}{\delta_i}\right). \tag{11.12}$$

For reference, Algorithm 11.2 on p. 97 provides pseudo-code
and highlights again that this Bayesian parameter adaptation
can be performed at linear cost, by collecting a running sum of
the local quadratic residuals (f(xᵢ) − Hmᵢ⁻)² of the filter.

⁹ If necessary, the second line of Eq. (11.12) can be used to fix ξ to its most likely value ξ_ML. However, if x₁ = a, and the first evaluation y₁ is made without observation noise, the value of ξ has no effect on the estimates.
be integrable. But of course it is a correct assumption, because
the integrand of (9.1) is indeed continuous (even smooth).

After about N = 64 evaluations, the trapezoidal rule settles
into a relatively homogeneous convergence at a rate of approximately
O(N⁻²). This behaviour is predicted by classical analyses¹¹
of this rule for differentiable integrands like this one.¹²
The probabilistically constructed non-adaptive error estimate
of the trapezoidal rule (Eq. (11.8)) predicts a more conservative,
slower, convergence rate O(N⁻¹). We know from Eq. (11.9) that
this is a direct consequence of the Wiener process prior assumption:
draws from Wiener processes are very rough functions
(almost surely continuous but not differentiable). While there
is no direct classic analysis for this hypothesis class of Wiener
samples itself, there is classical analysis of the trapezoidal rule
for Lipschitz continuous functions that agrees with the posterior
error estimate and predicts a linear error decay.¹³ Even
the adaptive error estimate arising from Eq. (11.11), although
converging faster than the 1/N rate of the gp posterior standard
deviation,

¹¹ Davis and Rabinowitz (1984), p. 53
¹² The intuition for the corresponding proof is that, if the integrand is continuously differentiable, then, by the midpoint rule, the infimum and supremum of f′ give an upper and lower bound on the deviation of the true integral in a segment [xᵢ, xᵢ₊₁] from the integral over the linear posterior mean in that segment, and that deviation drops quadratically with the width of the segments.
¹³ Davis and Rabinowitz (1984), p. 52. Note, however, that draws from the Wiener process are actually almost surely not Lipschitz. The nearest class of continuity that can be shown to contain them is that of Hölder-1/2 continuous functions; a very rough class for which the cited theorem can only guarantee O(N^{−1/2}) convergence.
¹⁴ Not every classical quadrature rule can
$$z(x) = \begin{pmatrix} F_a(x) \\ f(x) \\ f'(x) \\ \vdots \\ f^{(q)}(x) \end{pmatrix}$$
8   m⁻ = A m                            // predictive mean
9   P⁻ = A P Aᵀ + Q                     // predictive covariance
10  z = f(x) − H m⁻                     // observation residual
11  s = H P⁻ Hᵀ                         // residual variance
12  K = P⁻ Hᵀ / s                       // gain
13  m = m⁻ + K z                        // update mean
14  P = P⁻ − K s Kᵀ                     // update covariance
15  β = β + z²/(2s)                     // update hyperparameter
16 end for
17 E(F) = [m]₁                          // point estimate
18 var(F) = β/(a₀ + N/2 − 1) · [P]₁₁    // error estimate
19 r = β − β₀ − N                       // model-fit diagnostic (see §11.3)
20 return E(F), var(F)                  // return mean and variance of the integral
21 end procedure
$$P = \begin{pmatrix} 0 & 0 & 0_{1\times q} \\ 0 & 0 & 0_{1\times q} \\ 0_{q\times 1} & 0_{q\times 1} & \alpha I_q \end{pmatrix} \tag{11.13}$$

with a “very large” value of α.
Figure 11.5 shows empirical convergence of these rules when
integrating our running example of Eq. (9.1), for the choices
q = 0 (the trapezoidal rule) through q = 3.
mizu (2020)
quadrature rule for one specific problem involves finding a
good combination of prior mean m and covariance k so that the
three integrals of Eqs. (10.5), (10.6), and (10.7) are analytically
available. Recall that we introduced a diverse range of tractable
possibilities for Bayesian quadrature in §10.1. Exploring this
space to find the best model is daunting.
Even if we only consider Bayesian quadrature algorithms
based on linear state-space models, there is a large space of
potential SDEs to consider. In this setting, can the choice of model
be automated? This search for the “right” prior model is itself
another, potentially challenging, inference problem. In §11.1.4,
$$\log p(Y \mid M) = -\frac{1}{2}\sum_{i=1}^N \frac{(y_i - H m_i^-)^2}{H P_i^- H^\top} - \frac{1}{2}\sum_{i=1}^N \log\left|H P_i^- H^\top\right| + \text{const.} \tag{11.14}$$

hood is

$$\beta_N - \beta_0 = \frac{1}{2}\sum_{i=1}^N \frac{z_i^2}{s_i}.$$
$$\nu(f) = \int_{\mathbb{X}} f(x)\, \mathrm{d}\nu(x).$$

They will be assumed to hold without further comment.

$$Q(f) := \sum_{i=1}^N w_i \cdot f(x_i),$$

where X = [x₁, ..., x_N] are the nodes or knots (also sometimes
called sigma-points) of the rule, and w = [w₁, ..., w_N] are the
weights.

The most popular design criterion for quadrature rules is
to require the rule to be exact for all polynomials up to some
degree.

Definition 11.3. Let p_N(x) = Σ_{i=0}^N aᵢxⁱ be a polynomial of degree
N. A quadrature rule Q is of degree M if it is exact with respect to
ν for all polynomials p_N of degree N ≤ M, i.e. if

¹⁹ That rule is identified by the weights w satisfying the N monomial constraints Q(xⁱ) = ν(xⁱ) for i = 0, ..., N − 1, which amount to the Vandermonde system

$$\begin{pmatrix} x_1^0 & \cdots & x_N^0 \\ \vdots & & \vdots \\ x_1^{N-1} & \cdots & x_N^{N-1} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} \nu(x^0) \\ \vdots \\ \nu(x^{N-1}) \end{pmatrix}.$$
▸ Under the posterior on f arising from the kernel k_{2N}, the variance
is zero at the N nodes X of the Nth polynomial, but is
generally non-zero at x ∉ X (see Figure 11.9). To explain this

$$\nu(\psi_i) = 0, \quad \forall i > 0,$$
associated gp match the integrand as closely as possible.¹ Of
course, building a strong model entails its own design and
running costs (e.g. in updating the gp with new data). For
a cheap integrand, like the running example in this chapter,

¹ Analysis through the lens of the rkhs adds further analytical tools. For practical purposes, however, samples from the associated Gaussian process are arguably more explicit, and more interpretable.
$$\mathbb{E}_{X,Y}(f(x)) = R_x \bar m + k_{xX} k_{XX}^{-1} Y = R_x \bar m + \sum_{i=1}^N \mathbb{I}(x = x_i)\, y_i,$$

with

$$R_x := 1 - \mathbf{1}^\top k_{XX}^{-1} k_{Xx} = 1 - \sum_{i=1}^N \mathbb{I}(x = x_i), \quad\text{and}\quad
\bar m := \left(c + \mathbf{1}^\top k_{XX}^{-1}\mathbf{1}\right)^{-1}\mathbf{1}^\top k_{XX}^{-1} Y = \frac{\theta^{-2}}{c + \theta^{-2}N}\sum_{i=1}^N y_i.$$

The corresponding Gaussian posterior over the integral F = ∫_a^b f(x) dx has mean and variance

$$\mathbb{E}_{X,Y}(F) = \int_a^b \mathbb{E}_{X,Y}(f(x))\, \mathrm{d}x = (b-a)\bar m, \quad\text{and}\quad
\operatorname{var}_{X,Y}(F) = \int_a^b\!\!\int_a^b \operatorname{cov}_{X,Y}\left(f(x), f(x')\right) \mathrm{d}x\, \mathrm{d}x'.$$

⁵ Rasmussen and Williams (2006), §2.7
A defender of Monte Carlo might argue that its most truly desir
able characteristic is the fact that its convergence (see Lemma 9.2)
does not depend on the dimension of the problem. Performing
well even in high dimension is a laudable goal. However, the
statement “if you want your convergence rate to be independent
of problem dimension, do your integration with Monte Carlo”
is much like the statement “If you want your nail-hammering
to be independent of wall hardness, do your hammering with
a banana.” We should be sceptical of claims that an approach
performs equally well regardless of problem difficulty. An
explanation could be that the measure of difficulty is incorrect:
perhaps dimensionality is not an accurate means of assessing
the challenge of an integral. However, we contend that another
possibility is more likely: rather than being equally good for any
number of dimensions, Monte Carlo is perhaps better thought
of as being equally bad.
Recall from §10.1.2 that the curse of dimensionality results
from the increased importance of the model relative to the evaluations.
Theorem 12.1 makes it clear that Monte Carlo’s property
of dimensionality-independence is achieved by assuming the
weakest possible model. With these minimalist modelling assumptions,
very little information is gleaned from any given
evaluation, requiring Monte Carlo to take a staggering number
of evaluations to give good estimates of an integral. As a contrast,
Bayesian quadrature opens the door to stronger models
for integrands. The strength of a model - its inductive bias - can
indeed be a deficiency if it is ill-matched to a particular integrand.
However, if the model is well-chosen, it offers great gains
in performance. The challenge of high dimension is in finding
models suitable for the associated problems. Thus far, Probabilistic
Numerics has shone light on this problem of choosing
models, and has presented some tools to aid solving it. It is now
up to all of us to do the rest. Of course, we must acknowledge that
contemporary quadrature methods (both probabilistic and classical)
do not work well in high-dimensional problems: indeed,
they perform far worse than Monte Carlo. However, arguments
like those in this chapter show that there is a lot of potential
for far better integration algorithms. Such methods can work
For further evidence to this point, we note that even the most
general model underlying Monte Carlo integration can actually
converge faster if the nodes are not placed at random. Equation
(12.1) is independent of the node placement X. So if it is
used for guidance of the grid design as in §11.1.3, then any
arbitrary node placement yields the same error estimate (as
long as no evaluation location is exactly repeated). Since the
covariance k assumes that function values are entirely unrelated
to each other, a function value at one location carries no information
about its neighbourhood, so there is no reason to keep
the function values separate from each other.

The tempting conclusion one may draw from Theorem 12.1
is that, because, under this rule, any design rule is equally
good, one should just use a random set of evaluation nodes. This
argument is correct if the true integrand is indeed a sample
from the extremely irregular prior of Eq. (12.2). But imagine for
a moment that, against our prior assumptions, the integrand f
happens to be continuous after all. Now consider the choice of
a regular grid. Then, the mean estimate from Eq. (12.1) is the Riemann sum

$$\mathbb{E}_{X,Y}(F) = h \sum_i f(x_i).$$

For functions that are even Lipschitz continuous, this sum converges
to the true integral F at a linear rate,⁷ O(N⁻¹). That is,
the poor performance of Monte Carlo is due not just to its weak
model, but its use of random numbers. This insight into the
advantage of regular over random node placement is at the
heart of quasi Monte Carlo methods.⁸ As we have seen above,
however, it is possible to attain significantly faster convergence
rates by combining non-random evaluation placements with
explicit assumptions about the integrand.

⁷ Davis and Rabinowitz (1984), §2.1.6
⁸ E.g. Lemieux (2009)
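The contrast is easy to reproduce; a sketch (ours) comparing random nodes with a regular midpoint grid under the same sample-mean estimator, on the running example of Eq. (9.1):

```python
# Random versus regular node placement under the plain sample-mean estimator.
import numpy as np

f = lambda x: np.exp(-np.sin(3 * x)**2 - x**2)
rng = np.random.default_rng(0)
a, b = -3.0, 3.0

# high-resolution trapezoidal reference value for F
xs = np.linspace(a, b, 200001)
F_ref = np.sum((xs[1] - xs[0]) / 2 * (f(xs[1:]) + f(xs[:-1])))

for N in [10, 100, 1000, 10000]:
    x_rand = rng.uniform(a, b, size=N)                  # random design
    x_grid = a + (b - a) * (np.arange(N) + 0.5) / N     # regular midpoints
    err_rand = abs((b - a) * f(x_rand).mean() - F_ref)  # ~ O(N^{-1/2})
    err_grid = abs((b - a) * f(x_grid).mean() - F_ref)  # decays much faster
    print(N, err_rand, err_grid)
```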
Exploration But what is the right loss function for the task
addressed by a prng? It is hard to defend a single, one-off
choice being made by a prng: that is, to defend the expected loss
for such a choice being uniformly flat. A prng is perhaps more
productively considered as a heuristic for making a sequence of
decisions. The goal of this sequence (or design), X = {x₁, ...}, is
to achieve exploration, which we will roughly define as providing
information about the salient characteristics of some function
f(x). As a motivating example, consider f(x) as the integrand
of a quadrature problem. A prng provides exploration but,
remarkably, requires neither knowledge nor evaluations of f,
nor more than minimal storage of previous choices x. These
self-imposed constraints are extreme. First, in many settings,
including in the quadrature case considered in this chapter,
we have strong priors for f. Second, many problems (again,
as in quadrature) render evaluations f(x) pertinent to future
choices of x: for instance, a range of x values for which f(x)
is observed to be flat is unlikely to require dense sampling.
Third, as computational hardware has improved, memory has
become increasingly cheap. Is it still reasonable to labour under
computational constraints conceived in the 1940s?
The extremity of the prng approach is further revealed by
broader consideration of the problem it aims to solve. Explo
ration is arguably necessary for intelligence. For instance, all
branches of human creative work involve some degree of explo
ration. Human exploration, at its best, entails theorising, probing
and mapping. This fundamental part of our intelligence is ad
dressed by a consequentially broad and deep toolkit. Random
and pseudo-random algorithms, in contrast, are painfully dumb,
and are so by design.
To better achieve exploration, the Probabilistic Numerics ap
proach is to explicitly construct a model of what you aim to
explore - f (x). This model will serve as a guide to optimally ex
plorative points, avoiding the potential redundancy of randomly
sampled points.
Figure 12.1, in contrast to Figure 9.3, is a cartoon indictment
of the over-simplicity of a randomised approach.
We do not actually need to understand precisely what is and is
not predictable (Loveland, 1966). The point of the Monte Carlo argument is that,
if we have access to a stream of unpredictable numbers and
use it to build an integration method, then no one can design
an integrand that will foil our algorithm, simply because that
adversary cannot predict where our method will evaluate the
integrand.
The possible existence of adversarial problems motivates the
construction of unbiased Monte Carlo estimators. Unfortunately,
‘bias’ is an overloaded and contested term. In the context of
Monte Carlo, ‘unbiased’ simply means that the expected value
of a random number is equal to the quantity to be estimated
(Lemma 9.2). But this purely technical property draws some
emotional power from the (completely unrelated!) association,
in common language, of ‘bias’ with unfairness.11 The technical, 11 Pasquale (2015); O’Neil (2016).
statistical, definition of bias, that used in defining unbiased esti
mators, is one term within a particular decomposition of error
in predicting a data set. As argued by Jaynes,12 this term has no 12 Jaynes and Bretthorst (2003), §17.2
fundamental significance to inference. Our goal should simply
be to reduce error in the whole. (‘Inductive bias’,13 meaning 13 Mitchell (1980)
the set of assumptions represented within a machine learning
algorithm, represents yet another distinct use of the term. In
this sense, Probabilistic Numerics unashamedly encourages bias,
through the incorporation of useful priors.)
It is important to keep in mind that the users of numerical
methods are not adversaries of the methods’ designers. In fact,
the relationship is exactly the opposite of adversarial: users often
change their problems to be better-suited to numerical methods
(as, in deep learning, network architectures are chosen to suit
optimisers). As a result, the majority of integrands, and optimi
sation objective functions, are quite regular functions. Moreover,
this regularity is well-characterised and knowable by the nu
merical algorithm. There may be good use cases for random
numbers in areas like cryptography, where a lack of informa
tion, unpredictability, is very much the point. But numerical
computation is different.
1. 6224441111111114444443333333
2. 169399375105820974944592307816
3. 712904263472610590208336044895
4. 100011111101111111100101000001
5. 01110000011100100110111101100011
3. This sequence was generated by the von Neumann method,14 14 Von Neumann (1951)
a pseudo-random number generator, using the seed 908344. It
is the kind of sequence used in real Monte Carlo algorithms,
and - now that you know the seed - entirely deterministic.
It would have been OK to use this sequence for Monte Carlo
estimation up until three sentences ago, when we ripped
down the veil of randomness.
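The von Neumann generator is easy to state: square the current state and keep its middle digits as the next state. A minimal sketch (the exact digit conventions vary between descriptions; the six-digit variant below and the use of seed 908344 as a demonstration are assumptions for illustration):

```python
def middle_square(seed: int, n: int, digits: int = 6):
    """Von Neumann's middle-square PRNG: square the state, keep the middle digits."""
    state = seed
    for _ in range(n):
        sq = str(state * state).zfill(2 * digits)   # pad the square to 2*digits characters
        mid = digits // 2                           # start index of the middle block
        state = int(sq[mid:mid + digits])           # middle `digits` digits become new state
        yield state

print(list(middle_square(908344, 5)))   # first five six-digit states from this seed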
Software
At the time of writing, the open-source emukit library1 provides
a package for Bayesian quadrature with a number of different
models. [1: Paleyes et al. (2019), available at emukit.github.io]
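As a library-agnostic sketch of the underlying computation (this is not emukit's API; the Gaussian kernel, its length scale, and the uniform measure on [0,1] are assumptions for illustration), the posterior mean of the integral under a GP prior is the kernel-mean weighted combination $z^\top K^{-1} y$:

```python
import numpy as np
from scipy.special import erf

# Bayesian quadrature sketch: GP prior with kernel k(x,x') = exp(-(x-x')^2/(2 l^2)),
# uniform measure on [0,1]; posterior mean of F = \int_0^1 f(x) dx is z^T K^{-1} y.
def bq_posterior_mean(x, y, l=0.2, jitter=1e-10):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / l) ** 2)
    # kernel mean z_i = \int_0^1 k(x, x_i) dx, in closed form for this kernel:
    z = l * np.sqrt(np.pi / 2) * (erf((1 - x) / (np.sqrt(2) * l)) + erf(x / (np.sqrt(2) * l)))
    return z @ np.linalg.solve(K + jitter * np.eye(len(x)), y)

x = np.linspace(0, 1, 12)
print(bq_posterior_mean(x, np.sin(3 * x) + x))   # close to (1 - cos 3)/3 + 1/2
```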
Further Reading
Quasi-Monte Carlo
Convergence Analysis
This matrix maps from $\mathbb{R}^{M_A \cdot M_B}$, the space of vectorised real
$M_A \times M_B$ matrices, to $\mathbb{R}^{N_A \cdot N_B}$, the space of vectorised
$N_A \times N_B$ matrices. For $C \in \mathbb{R}^{M_A \times M_B}$ with vectorisation $\vec{C}$, we have
$$(A \otimes B)\,\vec{C} = \overrightarrow{A C B^{\top}}.$$
There are many norms3 on the space of real matrices $A \in \mathbb{R}^{N \times M}$,
some with certain analytical advantages over others.
The Frobenius norm is defined by
$$\|A\|_F^2 := \operatorname{tr}(A^{\top}A) = \sum_{i=1}^{N}\sum_{j=1}^{M} A_{ij}^2 = \vec{A}^{\top}\vec{A} = \|\vec{A}\|_2^2.$$
[3: A matrix norm $\|A\| \in \mathbb{R}$ has the properties: $\|A\| \ge 0$; $\|A\| = 0$ iff $A = 0$; $\|\alpha A\| = |\alpha|\,\|A\|$ for all $\alpha \in \mathbb{R}$; and $\|A + B\| \le \|A\| + \|B\|$.]
Every matrix $B \in \mathbb{R}^{N \times M}$ has a singular value decomposition (SVD)
$$B = Q\,S\,U^{\top}$$
with orthonormal5 matrices $Q \in \mathbb{R}^{N \times N}$, $U \in \mathbb{R}^{M \times M}$, whose
columns are called the left and right singular vectors, respectively,
and a rectangular diagonal matrix6 $S \in \mathbb{R}^{N \times M}$ which
contains non-negative real numbers, called singular values of B, on
the diagonal. [5: That is, $Q^{\top}Q = I_N$ and $U^{\top}U = I_M$.] [6: That is, $S_{ij} = 0$ if $i \ne j$.]
Assume, w.l.o.g., that $N \ge M$, that the diagonal elements of S
are sorted in descending order, and that $S_{rr}$ with $r \le M$
is the last non-zero singular value. Then Q can be decomposed
into its first r columns, $Q_+$, and the (potentially empty) $N - r$
columns, $Q_-$, as $Q = [Q_+, Q_-]$, and similarly $U = [U_+, U_-]$ for
the columns of U. The SVD is a powerful tool of matrix analysis:
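A quick check of this split in code (an illustrative sketch, not from the text):

```python
import numpy as np

# Split the SVD of a rank-deficient B into the ranges Q+, U+ (rank r)
# and their complements, and confirm B = Q+ S_r U+^T.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # rank 2, N=6 > M=4
Q, s, Ut = np.linalg.svd(B)                  # s holds singular values, descending
r = int(np.sum(s > 1e-12))                   # numerical rank
Q_plus, U_plus = Q[:, :r], Ut[:r, :].T       # first r left/right singular vectors
print(r, np.allclose(B, Q_plus @ np.diag(s[:r]) @ U_plus.T))    # -> 2 True
```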
with
$$M := (S - R\,P^{-1}Q)^{-1},$$
the blocks of the inverse of the partitioned matrix $\begin{bmatrix} P & Q \\ R & S \end{bmatrix}$ are
$$\tilde{P} = P^{-1} + P^{-1}QMRP^{-1}, \qquad \tilde{Q} = -P^{-1}QM, \qquad \tilde{R} = -MRP^{-1}, \qquad \tilde{S} = M.$$
The related result for block matrices is9 [9: Lütkepohl (1996), §4.2.2, Eq. (6)]
$$A \text{ non-singular} \;\Rightarrow\; \det\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det(A)\,\det(D - CA^{-1}B),$$
$$D \text{ non-singular} \;\Rightarrow\; \det\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det(D)\,\det(A - BD^{-1}C).$$
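A numerical check of the first identity (an illustrative sketch added here):

```python
import numpy as np

# Verify det([[A, B], [C, D]]) = det(A) * det(D - C A^{-1} B) for random blocks.
rng = np.random.default_rng(2)
A, B, C, D = (rng.standard_normal((3, 3)) for _ in range(4))
M = np.block([[A, B], [C, D]])
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.inv(A) @ B)
print(np.isclose(lhs, rhs))   # -> True (A is almost surely non-singular)
```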
16
Introduction
$$A^{-1} =: H.$$
$$A_i D_i = Z_i \quad\text{and}\quad H_i Z_i = D_i.$$
The crucial question is: how should the solver choose the action
$d_{i+1}$ from the posterior?
$$r(x) = Ax - b.$$
$$x \mapsto x - H\,r(x) = x_c - H(Ax_c - b) = Hb = x_*.$$
mation from $z_i$ within the loop, as long as it is possible to do so
at low computational cost. In particular, we can re-scale the step
as $d_{i+1} \mapsto \alpha_{i+1} d_{i+1}$, using a scalar $\alpha_i \in \mathbb{R}$. Doing so introduces
an ever so slight break in the consistency of the probabilistic
belief: the estimate $x_i$ in line 8 of Algorithm 17.1 will neither
be equal to $H_{i-1}b$ nor to $H_i b$. But this is primarily an issue of
algorithmic flow (the fact that $x_i$ and $H_i$ are computed in
different lines of the algorithm), and the practical improvements are
too big to pass on. In any case, this adaptation is also present
in the classic algorithms, so we need to include it to find exact
equivalences.
[Margin note: Hence, for $x_0 = 0$ (or $H_i A x_0 = x_0$), Eq. (17.1) is actually equal to $x_i = H_i b$. This is mostly a problem of presentation: Algorithm 17.1 is a compromise allowing a general probabilistic interpretation while staying close to classic formulations, which typically allow arbitrary $x_0$.]
Indeed, under the assumption of symmetric A, the optimal
scale $\alpha_{i+1}$ can be computed in linear time, using the observation
in line 7. We will consider symmetric A hereafter. Consider
the parametrised choice $x_i = x_{i-1} + \alpha_i d_i$. The derivative of the
objective with respect to $\alpha_i$ is
$$\frac{\partial f(x_{i-1} + \alpha_i d_i)}{\partial \alpha_i} = \alpha_i\,d_i^{\top}Ad_i + d_i^{\top}(Ax_{i-1} - b) = \alpha_i\,d_i^{\top}Ad_i + d_i^{\top}r_{i-1},$$
and setting it to zero yields
$$\alpha_i = -\frac{d_i^{\top}r_{i-1}}{d_i^{\top}z_i}. \quad (17.3)$$
Theorem 18.2 (proof on p. 183). If A is symmetric, and the inference
rule in line 12 of Algorithm 17.2 produces a symmetric estimator $H_i$,
then Algorithm 17.2 is a conjugate directions method.
$$r_i \perp r_j \quad \forall\; 0 \le i \ne j \le k,$$
and there exist $\gamma_i \in \mathbb{R}\setminus\{0\}$ for all $i \le k$ so that line 12 in Algorithm 17.2
can be written as
$$d_i = -H_{i-1}r_{i-1} = \gamma_i\Bigl(-r_{i-1} + \frac{\beta_i}{\gamma_{i-1}}\,d_{i-1}\Bigr), \quad (18.5)$$
with
$$\beta_i := \frac{r_{i-1}^{\top}r_{i-1}}{r_{i-2}^{\top}r_{i-2}}.$$
Comparing Algorithm 17.2 to cg (Algorithm 16.1), we note that
they are identical up to re-scaling by $\gamma_i$:
$$d_i^{\text{CG}} = \gamma_i\,d_i^{\text{Probabilistic}}.$$
[Figure 18.1: Analogous plot to Figure 17.1. The gradients at points sampled independent of the problem's structure ("needles" of point and gradient as black line, drawn from a spherical Gaussian distribution around the extremum) are likely to be dominated by the eigenvectors of the largest eigenvalues. Thus, by following the gradient of the problem, one can efficiently compute a low-rank approximation of A that captures most of the dominant structure. This intuition is at the heart of the Lanczos process that provides the structure of conjugate gradients.]
► 18.5 Preconditioning
(i.e. $U^{\top} = U^{-1}$):
$$\hat{A}\hat{x} = \hat{b} \quad\text{with}\quad \hat{A} := C^{-\top}AC^{-1}, \quad \hat{x} := Cx, \quad \hat{b} := C^{-\top}b. \quad (18.7)$$
5   for i = 1, ..., N do
6       d_i = -g_{i-1} + rho_{i-1} d_{i-1}          // compute direction
7       z_i = A d_i                                 // compute observation
8       alpha_i = -d_i^T r_{i-1} / d_i^T z_i        // optimal step-size
9       s_i = alpha_i d_i                           // re-scale step
10      y_i = alpha_i z_i                           // re-scale observation
11      x_i = x_{i-1} + s_i                         // update estimate for x
12      r_i = r_{i-1} + y_i                         // new gradient at x_i
13      g_i = K^{-1} r_i                            // corrected gradient
14      rho_i = r_i^T g_i / r_{i-1}^T g_{i-1}       // compute conjugate correction
15  end for
16 end procedure
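Read as classical pseudo-code, this is preconditioned conjugate gradients. A compact runnable sketch (illustrative; the Jacobi preconditioner and the test problem are arbitrary choices):

```python
import numpy as np

def pcg(A, b, K_inv, x0=None, maxiter=None, tol=1e-10):
    """Preconditioned conjugate gradients, following the structure above.
    K_inv(r) applies the inverse preconditioner to a residual vector."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = A @ x - b                       # gradient of the quadratic at x
    g = K_inv(r)                        # corrected gradient
    d = -g                              # initial direction
    for _ in range(maxiter or n):
        z = A @ d                       # observation z_i = A d_i
        alpha = -(d @ r) / (d @ z)      # optimal step size, Eq. (17.3)
        x = x + alpha * d
        r_new = r + alpha * z
        if np.linalg.norm(r_new) < tol:
            return x
        g_new = K_inv(r_new)
        rho = (r_new @ g_new) / (r @ g)  # conjugate correction
        d = -g_new + rho * d
        r, g = r_new, g_new
    return x

# usage: spd test problem with a Jacobi (diagonal) preconditioner
rng = np.random.default_rng(3)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = pcg(A, b, K_inv=lambda r: r / np.diag(A))
print(np.allclose(A @ x, b, atol=1e-6))   # -> True
```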
object.
This is not to say there is no use for uncertainty in linear
solvers. It just so happens that classic solvers address a corner
case, one less demanding of uncertainty. It is nevertheless useful
to understand the connection to probabilistic inference in this
domain, because uncertainty is more prominently important if:
- the question at hand involves the matrix inverse itself more
than the estimate of the solution. For example if we want to
compute Laplace approximations to large-scale models like
deep neural networks,8 which involve the inverse Hessian of 8 MacKay (1992); Daxberger et al. (2021).
the regularised empirical loss. Such matrices are routinely
much too large to be directly inverted. But they can also have
a limited number of prominent eigenvalues. If we use an
iterative solver in such situations, we will be forced to stop
it very early compared to the size of the matrix, and then
look for a good uncertainty estimate over the “unexplored”
remainder; or
$X \in \mathbb{R}^{N \times K}$ by using the vectorisation operation (§15.1) and
using the multivariate Gaussian distribution over vectors to
define3
$$N(\vec{X}; \vec{X}_0, \Sigma_0) := (2\pi)^{-NK/2}\,|\Sigma_0|^{-1/2}\exp\Bigl(-\tfrac{1}{2}(\vec{X} - \vec{X}_0)^{\top}\Sigma_0^{-1}(\vec{X} - \vec{X}_0)\Bigr). \quad (19.3)$$
[3: We will omit the arrow over matrices used in this sense, as there is no risk of ambiguity.]
tion itself - if we do not actually compute exact matrix-matrix
multiplications, but only approximations of it (a setting not
further discussed here).
The downside of this formulation is that it does not explicitly
involve x.6 This is an issue because a tractable probability
distribution on A may induce a complicated distribution on x.
For intuition, Figure 19.1 shows distributions of the inverse
of a scalar Gaussian variable of varying mean: for small values
of $\mu$, the distribution of the inverse becomes strongly bi-modal.
It is therefore clear that we will have to resort to approximations
if we want to infer both matrices and their inverse while using a
Gaussian distribution to model either variable (see also Figure 19.9
for more discussion). For matrices, the situation is even more
complicated, as the probability measure might put non-vanishing
density on matrices that are not invertible.
[6: We will generally assume that b itself is known with certainty, and thus not explicitly include it in the generative model.]
$$x = Hb,$$
$$Z = AD \quad\Longleftrightarrow\quad D = HZ,$$
i.e. with the roles swapped as $S \leftrightarrow Y$ and $A \leftrightarrow H$.
A better prior would encode the fact that A is not just a long
vector, but contains the elements of a square matrix. The
projection terms $(I \otimes S)$, with their Kronecker product structure,
already contain information about the generative process of the
observations. We thus consider a Kronecker product for the
prior covariance, too:7
$$\Sigma_0 = V_0 \otimes W_0 \quad\text{with spd } V_0, W_0 \in \mathbb{R}^{N \times N}. \quad (19.8)$$
[7: Distributions of the form $N(X; X_0, V \otimes W)$ are sometimes called a matrix-variate normal, due to a paper by Dawid (1981). This convention will be avoided here, since it can give the incorrect impression that this is the only possibility to assign a Gaussian distribution over the elements of a matrix, when in fact Eq. (19.3) is the most general such distribution.]
What kind of prior assumptions are we making here? If both
matrices in a Kronecker product are spd, so is their Kronecker
product (see Eq. (15.6)). Hence, Eq. (19.8) yields an spd overall
covariance, and the prior assigns non-vanishing probability
density to every matrix A, including non-invertible, indefinite
ones, etc., despite the fact that such spd matrices $V_0, W_0$ only offer
$$2 \cdot \tfrac{1}{2}N(N+1) = N(N+1)$$
degrees of freedom (as opposed to the $\tfrac{1}{2}N^2(N^2+1)$ degrees
of freedom in a general spd $\Sigma_0$).8
[8: One helpful intuition for this situation is to convince oneself that the space of Kronecker products spans a sub-space of rank one within the space of $N^2 \times N^2$ real matrices, and that this sub-space does contain a space of spd matrices.]
The prior assumptions encoded by a Kronecker product in the
covariance are subtle. A few intuitive observations follow. The
Kronecker covariance can be written as the covariance of the
following generative process:
1. Draw a vector b with $b_i \sim N(0, V_0)$, and
2. a vector c with $c_j \sim N(0, W_0)$.
3. Set $A_{ij} = b_i c_j + A_{0,ij}$, i.e. $A = bc^{\top} + A_0$ in matrix form.
The matrix A arising from this process is not Gaussian
distributed.9 But it indeed10 satisfies Eq. (19.9): the covariance is
Kronecker-structured.
[9: NB: the product of two Gaussian probability distributions is another Gaussian distribution (times a Gaussian normalisation constant). But the product of two Gaussian random variables is not itself a Gaussian random variable!]
[10: This is because
$$\mathbb{E}(A_{ij}A_{k\ell}) - \mathbb{E}(A_{ij})\,\mathbb{E}(A_{k\ell}) = \mathbb{E}(b_i c_j b_k c_\ell) = [V_0]_{ik}\,[W_0]_{j\ell}.$$]
Another helpful observation is that the marginal variance of
individual matrix elements under the choice $\Sigma_0 = V_0 \otimes W_0$ is
determined solely by the diagonals of the two matrices in the
Kronecker product:
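A quick empirical check of this generative picture (an illustrative sketch added here):

```python
import numpy as np

# Draw many rank-one matrices A = b c^T with b ~ N(0, V0), c ~ N(0, W0)
# and check cov(A_ij, A_kl) ~ [V0]_ik [W0]_jl.
rng = np.random.default_rng(4)
N, T = 3, 200_000
L = rng.standard_normal((N, N)); V0 = L @ L.T + N * np.eye(N)
R = rng.standard_normal((N, N)); W0 = R @ R.T + N * np.eye(N)
b = rng.multivariate_normal(np.zeros(N), V0, size=T)
c = rng.multivariate_normal(np.zeros(N), W0, size=T)
A = np.einsum('ti,tj->tij', b, c)                  # T samples of b c^T
i, j, k, l = 0, 1, 2, 1
emp = np.cov(A[:, i, j], A[:, k, l])[0, 1]
print(emp, V0[i, k] * W0[j, l])                    # the two numbers should be close
```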
Lemma 19.3. Assume $A_0$ and $W_0$ are spd, and the search directions
S are chosen to be linearly independent. Then, for our assumption of
spd A, the inverse (19.12) exists.
[Proof: If $A_0$ is spd, its inverse exists. $Y = AS$, and products of spd matrices are spd. Thus $W_0 A_0^{-1} A$ is spd, hence $S^{\top}W_0 A_0^{-1} A S$ is invertible. □]
This step alone does not encode symmetry (samples from this
distribution are still asymmetric with probability one), but it
avoids technical complications in the following.
For a formal treatment, we introduce two projection operators
acting on the space $\mathbb{R}^{N^2}$ of (vectorised) square $N \times N$ matrices:
$$[\Pi_{\ominus}(W \otimes W)\Pi_{\oplus}]_{(ij),(k\ell)} = \tfrac{1}{4}\bigl(W_{ik}W_{j\ell} - W_{i\ell}W_{jk} - W_{ik}W_{j\ell} + W_{i\ell}W_{jk}\bigr) = 0.$$
$$W \otimes W = \Pi_{\oplus}(W \otimes W)\Pi_{\oplus} + \Pi_{\ominus}(W \otimes W)\Pi_{\ominus} =: W \circledast W + W \circledcirc W.$$
These two products are known14 as the symmetric Kronecker
product $W \circledast W$ and the skew-symmetric Kronecker product $W \circledcirc W$,
with elements
$$[C \circledast D]_{(ij),(k\ell)} = \tfrac{1}{2}(C_{ik}D_{j\ell} + C_{i\ell}D_{jk}), \quad\text{and} \quad (19.16)$$
$$[C \circledcirc D]_{(ij),(k\ell)} = \tfrac{1}{2}(C_{ik}D_{j\ell} - C_{i\ell}D_{jk}).$$
[14: van Loan (2000)]
[Exercise 19.4 (easy). Show that for matrices X drawn from the Wishart distribution $W(X; V, \nu)$ (defined in Eq. (19.1)), the elements of X have the covariance $\operatorname{cov}(X_{ij}, X_{k\ell}) = 2\nu\,[V \circledast V]_{(ij),(k\ell)}$. Hint: Use the generative definition of Eq. (19.2). The fourth moment of a central normal distribution is given by Isserlis' theorem (Isserlis, 1918): $p(a) = N(a; 0, \Sigma) \Rightarrow \mathbb{E}(a_i a_j a_k a_\ell) = \Sigma_{ij}\Sigma_{k\ell} + \Sigma_{ik}\Sigma_{j\ell} + \Sigma_{i\ell}\Sigma_{jk}$.]
They inherit some, but not all, of the great properties of the
Kronecker product. In particular, when applied to symmetric
matrices,15
$$(W \circledast W)^{-1} = W^{-1} \circledast W^{-1}.$$
However,16 for general C and D,
$$(C \circledast D)^{-1} \ne C^{-1} \circledast D^{-1}.$$
We also have
$$(C \circledast D)\,\vec{X} = \tfrac{1}{2}\,\overrightarrow{(CXD^{\top} + CX^{\top}D^{\top})}, \quad\text{and} \quad (19.17)$$
$$(C \circledcirc D)\,\vec{X} = \tfrac{1}{2}\,\overrightarrow{(CXD^{\top} - CX^{\top}D^{\top})}.$$
[15: If $W \in \mathbb{R}^{N \times N}$ is of full rank, the matrix $W \circledast W$ has rank $\tfrac{1}{2}N(N+1)$, the dimension of the space of all real symmetric $N \times N$ matrices. That its inverse on that space is given by $W^{-1} \circledast W^{-1}$ can be seen from Eq. (19.17). The inverse on asymmetric matrices is not defined.]
[16: Alizadeh, Haeberly, and Overton (1988)]
Using this framework, the information about A's symmetry can
be explicitly written as an observation with likelihood
$$p(\ominus \mid A) = \delta(\Pi_{\ominus}\vec{A} - 0) = \lim_{\beta \to 0} N\bigl(0_{N^2};\ \Pi_{\ominus}\vec{A},\ \beta I_{N^2}\bigr). \quad (19.18)$$
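The action (19.17) is easy to check numerically (an illustrative sketch, using row-major vectorisation):

```python
import numpy as np

# Check (C ⊛ D) vec(X) = 0.5 * vec(C X D^T + C X^T D^T) against an explicit
# matrix built element-wise from Eq. (19.16).
rng = np.random.default_rng(5)
N = 3
C, D, X = (rng.standard_normal((N, N)) for _ in range(3))

sym_kron = np.zeros((N * N, N * N))
for i in range(N):
    for j in range(N):
        for k in range(N):
            for l in range(N):
                sym_kron[i * N + j, k * N + l] = 0.5 * (C[i, k] * D[j, l] + C[i, l] * D[j, k])

lhs = sym_kron @ X.reshape(-1)                        # row-major vectorisation
rhs = 0.5 * (C @ X @ D.T + C @ X.T @ D.T).reshape(-1)
print(np.allclose(lhs, rhs))                          # -> True
```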
;
$$p(A \mid \ominus) = N\Bigl(A;\ A_0 - \Sigma_0\Pi_{\ominus}^{\top}\bigl(\Pi_{\ominus}\Sigma_0\Pi_{\ominus}^{\top}\bigr)^{-1}(\Pi_{\ominus}A_0),\ \ \Sigma_0 - \Sigma_0\Pi_{\ominus}^{\top}\bigl(\Pi_{\ominus}\Sigma_0\Pi_{\ominus}^{\top}\bigr)^{-1}\Pi_{\ominus}\Sigma_0\Bigr). \quad (19.19)$$
$$W_M := W_0 - W_0 S(S^{\top}W_0 S)^{-1}S^{\top}W_0.$$
[Exercise 19.5 (moderate). Explicitly compute the evidence term
$$\int \delta(Y - AS)\,N(A;\ A_0,\ W_0 \otimes W_0)\,\mathrm{d}A.$$]
These expressions, in particular the posterior mean $A_M$, play
a central role not just in linear solvers, but also in nonlinear
optimisation. In explicit form,
$$A_M = A_0 + UV^{\top} + VU^{\top} = A_0 + \begin{bmatrix} U & V \end{bmatrix}\begin{bmatrix} 0 & I_M \\ I_M & 0 \end{bmatrix}\begin{bmatrix} U^{\top} \\ V^{\top} \end{bmatrix},$$
$$A_M^{-1} = A_0^{-1} - A_0^{-1}\begin{bmatrix} U & V \end{bmatrix}\begin{bmatrix} U^{\top}A_0^{-1}U & U^{\top}A_0^{-1}V + I \\ V^{\top}A_0^{-1}U + I & V^{\top}A_0^{-1}V \end{bmatrix}^{-1}\begin{bmatrix} U^{\top} \\ V^{\top} \end{bmatrix}A_0^{-1}. \quad (19.24)$$
$$u_i := \frac{W_i s_i}{s_i^{\top}W_i s_i}.$$
Now note that the rank-2 update in Eq. (19.25) can be written
as the sum of two symmetric rank-1 matrices; the determinant
then updates as
$$\det A_{i+1} = \det A_i\ \frac{(y_i^{\top}A_i^{-1}W_i s_i)^2 - (y_i^{\top}A_i^{-1}y_i)(s_i^{\top}W_i A_i^{-1}W_i s_i) + (s_i^{\top}W_i A_i^{-1}W_i s_i)(y_i^{\top}s_i)}{(s_i^{\top}W_i s_i)^2}.$$
Since $W_i$ is positive semi-definite, by Lemma 19.8, $A_{i+1}$ is thus
symmetric positive definite if and only if this factor is positive
(as holds, in particular, for the choice $W_0 = A$). □
$$[y_1\ \cdots\ y_M] \in \mathbb{R}^{N \times M}.$$
[Figure 19.8: Analogous to Figure 19.7, but with the covariance choice W = A considered in Corollary 19.9. Under this choice, the posterior mean always lies "to the right" of the true A along the projection line, thus in the positive definite cone.]
To endow that algorithm class with a probabilistic meaning,
we might directly model the matrix inverse H. Modelling H
allows a joint Gaussian model over both H and the solution x.
Alternatively, we might model the matrix A. Modelling A
allows direct treatment of Gaussian observation noise, and still
$$V_Y(A) = V_0 \otimes W_0(I - SU^{\top}), \qquad V_{\mid Y}(A) = W_0(I - SU^{\top}) \otimes W_0(I - SU^{\top}).$$
As we have already noted in §19.2, there are structural
differences between a Gaussian model over A and one over its inverse.
Consider two solvers of the form of Algorithm 17.2: one with a belief
on A, with prior mean $A_0$ and a covariance parameter $W_0^A$ (with
associated posterior mean $A_M$), and one maintaining a distribution
on H, with prior mean $H_0$ and covariance parameter $W_0^H$ (with
associated posterior mean $H_M$). We say their priors induce posterior
correspondence if19
$$A_M^{-1} = H_M \quad\text{for } 0 \le M \le N, \quad (19.28)$$
and we speak of weak posterior correspondence if we only have
$$A_M^{-1}Y = H_M Y. \quad (19.29)$$
[19: These statements are all from Wenger and Hennig (2020), where proofs can also be found.]
$$W_0^A S = Y, \quad\text{and}\quad S^{\top}\bigl(W_0^A A^{-1} - A W_0^H\bigr) = 0.$$
$$\hat{\vec{A}} = (C \otimes C)^{\top}\vec{A}.$$
[Figure 19.9: For a scalar Gaussian variable x with mean $\mu$, the density of $x^{-1}$ involves the imaginary error function $\operatorname{erfi}(z) = -i\operatorname{erf}(iz)$, via a term $e^{(x-\mu)^2/2}\operatorname{erfi}\bigl(\tfrac{x-\mu}{\sqrt{2}}\bigr)$. However, for $\mu \gg 1$, the distribution $p(x^{-1})$ is relatively well approximated by $N(x^{-1}; \mu^{-1}, \mu^{-2})$ (dashed line $\mu^{-1}$, dotted lines at $\mu^{-1} \pm 2\sqrt{\mu^{-2}}$). The plot also shows 20 samples $(x + \mu)^{-1}$. For $\mu = 0$, the mean does not exist.]
In this chapter, we have so far established that there are
self-contained probabilistic inference algorithms on matrix elements
whose behaviour is consistent with that of existing linear solvers.
These corroborate the view that computation is inference.
On a more practical level, a corollary of these results is that
we can use existing - efficient, stable - implementations of linear
solvers, cg in particular, and treat them as a source of data,
producing informative action-output pairs $(s_i, y_i = As_i)_{i=1,\dots,M}$.
Combined with the kind of structured Gaussian priors studied
above, these data then give rise to posteriors on A and H,
and these posteriors have convenient properties: their posterior
mean is a sum of $A_0$ and a term of low rank, thus easy to handle
both analytically and computationally. And the good convergence
properties1 of cg translate into desirable convergence
properties of the Gaussian posterior.
Thus, consider the data set $S, Y \in \mathbb{R}^{N \times M}$ collected by running
cg on (A, b). If we adopt this practically minded approach, the
desiderata for the Gaussian prior change. It no longer matters
whether the prior is particularly consistent with the actions of
the algorithm that collected the data. Instead, two new considerations arise.
[1: Expositions on the convergence of cg and related methods can for example be found in §5.1 (p. 112 onwards) of Nocedal and Wright (1999) and in §11.4.4 of Golub and Van Loan (1996). A frequently used result is that cg after k+1 steps, assuming exact computations, finds the solution $x_{k+1} = x_0 + P_k^*(A)r_0$, where $P_k^*$ is the (matrix) polynomial of degree k that solves the following optimisation problem over all such polynomials:
$$P_k^* = \arg\min_{P_k}\|x_0 + P_k(A)r_0 - x_*\|_A, \quad (20.1)$$
where $x_*$ is the true solution of $Ax = b$, and $\|v\|_A := \sqrt{v^{\top}Av}$. This result can be used to phrase the convergence in terms of the eigenvalue spectrum of A (e.g. p. 116 in Nocedal and Wright (1999)). In particular, if A has eigenvalues $\lambda_1 \le \dots \le \lambda_N$, then the error of cg after M+1 steps is roughly given by
$$\|x_{M+1} - x_*\|_A \approx \Bigl(\frac{\lambda_{N-M} - \lambda_1}{\lambda_{N-M} + \lambda_1}\Bigr)\|x_0 - x_*\|_A. \quad (20.2)$$
Simply put, if A has $K \ll N$ large eigenvalues and $N - K$ small ones, then cg finds a good estimate in only about K steps.]
Assume we want to perform inference on A with a
symmetry-encoding prior $p(A) = N(A;\ A_0,\ W_0 \circledast W_0)$. Both the
computational and the calibration viewpoints suggest choosing
$$W_0 = A.$$
[2: A scalar $A_0 = \alpha_0 I$ is also relatively easy to handle. Because cg produces orthogonal gradients and $y_i = As_i = r_i - r_{i-1}$, the matrix $Y^{\top}Y$ is symmetric tridiagonal positive definite.]
$$\frac{\bigl([A]_{ii} - \mathbb{E}_{p(A)}([A]_{ii})\bigr)^2}{\mathbb{E}_{p(A)}\bigl(([A]_{ii} - \mathbb{E}_{p(A)}([A]_{ii}))^2\bigr)} = 1. \quad (20.5)$$
because we assumed A to be spd, and thus $(A^{+}A)^{\top} = A^{+}A$ (20.6).
We cannot simply apply the matrix inversion lemma to compute
an inverse. However, the pseudoinverse3 of $A_M$ can be computed
efficiently and has the right conceptual properties for many
applications. For a factorised symmetric matrix like our
$A_M = YY^{\top}$, the pseudoinverse is given by
$$A_M^{+} = Y\,(Y^{\top}Y)^{-2}\,Y^{\top}.$$
Since $Y^{\top}Y$ is tridiagonal symmetric positive definite, its inverse
can be computed in 8M operations (see note 2).
[3: The concept seems to have been invented by Fredholm (1903) for operators, and discussed for matrices by Moore (1920). The pseudoinverse yields the least-squares solution $A^{+}b$ for our linear problem $Ax = b$ in the sense that $\|Ax - b\|^2 \ge \|AA^{+}b - b\|^2$ for all $x \in \mathbb{R}^N$. For the choice $A_0 = 0$, $A_M^{+}$ can also be seen as the natural limit of the estimator $A_M^{-1}$ arising from $A_0 = \alpha I$ for small $\alpha$, because, for general A,
$$A^{+} = \lim_{\alpha \to 0}\,(A^{\top}A + \alpha I)^{-1}A^{\top}.$$]
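A numerical check of this factorised pseudoinverse (an illustrative sketch added here):

```python
import numpy as np

# Check A+ = Y (Y^T Y)^{-2} Y^T for a factorised symmetric A = Y Y^T.
rng = np.random.default_rng(6)
Y = rng.standard_normal((8, 3))            # N = 8, M = 3, so A has rank 3
A = Y @ Y.T
A_plus = Y @ np.linalg.matrix_power(np.linalg.inv(Y.T @ Y), 2) @ Y.T
print(np.allclose(A_plus, np.linalg.pinv(A)))   # -> True
```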
A-M arising from A0 = al for small a,
because, for general A,
Alas, we can of course not set W0 = A, since A is the very
A + = lim (A т A + a I)- 1 A т
matrix we are trying to infer. We could set a 0
^а: = W0 - W0 S (S T W0 S) - 1 S T Wo
= YYт - YYт = 0.
$$Y^{\top}x = S^{\top}b. \quad (20.7)$$
$$\Sigma_0 := \tfrac{1}{2}\bigl(\beta W_0 + \tilde{b}\tilde{b}^{\top}\bigr) \quad\text{with}\quad \tilde{b} := W_0 b,\ \ \beta := b^{\top}W_0 b.$$
$$(Y^{\top}\Sigma_0 Y)^{-1} = 2\beta^{-1}\Bigl((Y^{\top}W_0 Y)^{-1} - \frac{(Y^{\top}W_0 Y)^{-1}Y^{\top}\tilde{b}\,\tilde{b}^{\top}Y(Y^{\top}W_0 Y)^{-1}}{\beta + \tilde{b}^{\top}Y(Y^{\top}W_0 Y)^{-1}Y^{\top}\tilde{b}}\Bigr),$$
$$x_M = x_0 + \bigl(W_0 Y + \tilde{b}\tilde{b}^{\top}Y\bigr)\,\beta^{-1}(Y^{\top}W_0 Y)^{-1}\Bigl(I - \frac{Y^{\top}\tilde{b}\,\tilde{b}^{\top}Y(Y^{\top}W_0 Y)^{-1}}{\beta + \tilde{b}^{\top}Y(Y^{\top}W_0 Y)^{-1}Y^{\top}\tilde{b}}\Bigr)\bigl(S^{\top}b - Y^{\top}x_0\bigr). \quad (20.10)$$
tive definite, fast-converging, computationally efficient) mean
estimate $A_M$. But we could have done all of this without a
probabilistic formulation. The final goal of this chapter is to
construct a tractable covariance that is both probabilistically
consistent with $A_M$ (i.e. both mean and covariance arise from
the same generative model) and well calibrated, so that it can
serve as a notion of uncertainty.
[Margin note: $N(A;\ \alpha I,\ \beta^2 I \circledast I)$ is another prior consistent with conjugate gradients. Since it gives rise to a posterior mean that offers less of these good properties (it is more expensive, and not necessarily positive definite), it is less interesting. But its simpler structure allows a different form of uncertainty calibration, via a conjugate prior. This will not be further explored here, but interested readers can find a derivation in the appendix to this chapter, in §22.5.]
Our approach to achieve this will be to set $W_0$ to a matrix
that acts like A on the span of S, and estimate its effect on the
complement of this space using regularity assumptions about
A. That is, $W_0$ could be chosen in the general form2
$$W_0 = YY^{\top} + \bigl(I - S(S^{\top}S)^{-1}S^{\top}\bigr)\,\Omega\,\bigl(I - S(S^{\top}S)^{-1}S^{\top}\bigr), \quad (21.1)$$
[2: See Hennig (2015). Additional discussion and further experiments can be found in Wenger and Hennig (2020), which, among other things, investigates the notion of Rayleigh regression.]
with
$$\Omega = \omega I. \quad (21.2)$$
This yields3
$$W_0 = YY^{\top} + \omega\bigl(I - S(S^{\top}S)^{-1}S^{\top}\bigr), \quad\text{and}$$
$$W_M = W_0 - W_0 S(S^{\top}W_0 S)^{-1}S^{\top}W_0 = W_0 - YY^{\top}.$$
[3: Recall, e.g. from Theorem 18.4, that for cg, the space spanned by the directions S and that of the A-projections Y are closely related.]
More precisely, consider the Rayleigh quotient
$$\alpha(m) := \frac{s_m^{\top}As_m}{s_m^{\top}s_m},$$
where $s_m$ is the m-th direction of cg. These coefficients are
readily available during the run of the solver, because the term
$s_m^{\top}As_m$ (up to a linear-cost re-scaling) is computed in line 7 of
Algorithm 17.2. From Eq. (21.3), there are straightforward upper
and lower bounds both for elements of A and for $\alpha(m)$. With
the eigenvalues $\lambda_1 \ge \dots \ge \lambda_N$ of A, we evidently have
$$\lambda_1 \ge \alpha(m) \ge \lambda_N \quad\text{for all } m,$$
and thus also5
[5: Both bounds hold because A is assumed to be spd, thus all its eigenvalues are real and non-negative. The upper bound holds because the trace is the sum of the eigenvalues. If $k_{XX} = UDU^{\top}$ is the eigenvalue decomposition of the spd matrix $k_{XX}$, then the lower bound holds because $UDU^{\top} + \sigma^2 I = U(D + \sigma^2 I)U^{\top}$. For this specific matrix, we also know from the functional form of $k_{XX}$ (Eq. (4.4)) that $[A]_{ij} \le 1 + \sigma^2\delta_{ij}$, although such a bound is not immediately available for $H = A^{-1}$.]
Since the $\alpha(m)$ are readily available during the solver run, it
is desirable to make additional use of them for uncertainty
quantification - to set $\omega$ in Eq. (21.2) based on the progression of
$\alpha(m)$. One possible use for the posterior mean $A_M$ is to construct
$$p(Av) = N\Bigl(Av;\ A_M v,\ \tfrac{1}{2}\bigl((v^{\top}W_M v)\,W_M + (W_M v)(W_M v)^{\top}\bigr)\Bigr) =: N(Av;\ A_M v,\ \Lambda_v).$$
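The Rayleigh quotients and their bounds are easy to track alongside a cg run (an illustrative sketch added here, not from the text):

```python
import numpy as np

# Track Rayleigh quotients alpha(m) = s_m^T A s_m / s_m^T s_m along cg directions
# and compare them with the extreme eigenvalues of A.
rng = np.random.default_rng(7)
Q = rng.standard_normal((40, 40)); A = Q @ Q.T + np.eye(40)
b = rng.standard_normal(40)

x, r = np.zeros(40), -b          # residual/gradient r = A x - b
d, alphas = b.copy(), []
for _ in range(10):              # ten cg steps
    Ad = A @ d
    alphas.append((d @ Ad) / (d @ d))       # Rayleigh quotient of the direction
    step = -(d @ r) / (d @ Ad)
    x, r_new = x + step * d, r + step * Ad
    d = -r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new

lams = np.linalg.eigvalsh(A)
print(lams[-1] >= max(alphas), min(alphas) >= lams[0])   # -> True True
```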
Proof. By induction. For the base case1 i = 2, i.e. after the first
iteration of the loop, we have (recall that $\alpha_1 = -d_1^{\top}r_0/d_1^{\top}Ad_1$);
the symmetry of the estimator $H_i$ is used in the third-to-last
equality below.
[1: For this proof, it does not actually matter how the first direction $d_1$ is chosen.]
For the inductive step, assume $\{d_0, \dots, d_{i-1}\}$ are pairwise
A-conjugate. For any $k < i$, using this assumption twice yields
$$d_k^{\top}Ad_i = -d_k^{\top}AH_i\Bigl(\sum_{j<i}y_j + r_0\Bigr) = -d_k^{\top}A\Bigl(\sum_{j<i}s_j + H_i r_0\Bigr) = -\alpha_k\,d_k^{\top}Ad_k - d_k^{\top}A(H_i r_0) = d_k^{\top}r_{k-1} - d_k^{\top}r_0.$$
With
$$\alpha_0 = \frac{r_0^{\top}r_0}{r_0^{\top}Ar_0},$$
a direct computation along the same lines confirms the base
case. In general, the directions can be written as
$$d_i = \sum_{j<i}\nu_j s_j + \nu_i r_{i-1}. \quad (22.3)$$
If $\ell < i - 1$, then the second term in this sum cancels by the first
induction assumption:
$$d_j = \gamma_j\Bigl(-r_{j-1} + \frac{\beta_j}{\gamma_{j-1}}d_{j-1}\Bigr) \;\Longleftrightarrow\; r_{j-1} = -\frac{1}{\gamma_j}d_j + \frac{\beta_j}{\gamma_{j-1}}d_{j-1} \quad\text{for all } j < i. \quad (22.5)$$
$$r_i = As_i + r_{i-1} = \alpha_i Ad_i + r_{i-1} = -\frac{d_i^{\top}r_{i-1}}{d_i^{\top}Ad_i}\,Ad_i + r_{i-1}. \quad (22.6)$$
Hence
$$r_{i-1}^{\top}r_i = -\frac{d_i^{\top}r_{i-1}}{d_i^{\top}Ad_i}\Bigl(-\frac{1}{\gamma_i}d_i + \frac{\beta_i}{\gamma_{i-1}}d_{i-1}\Bigr)^{\!\top}Ad_i + r_{i-1}^{\top}r_{i-1} = \frac{1}{\gamma_i}d_i^{\top}r_{i-1} + r_{i-1}^{\top}r_{i-1},$$
and use Eq. (22.4) to get
$$r_{i-1}^{\top}r_i = -r_{i-1}^{\top}r_{i-1} + \frac{\beta_j}{\gamma_{j-1}}d_{j-1}^{\top}r_{j-1} + r_{i-1}^{\top}r_{i-1} - \frac{\beta_j}{\gamma_{j-1}}d_{j-1}^{\top}r_{j-1} = 0,$$
using
$$r_{j-1} = \alpha_{j-1}Ad_{j-1} + r_{j-2} = -\frac{d_{j-1}^{\top}r_{j-2}}{d_{j-1}^{\top}Ad_{j-1}}\,Ad_{j-1} + r_{j-2}.$$
Proof of Theorem 19.10. We first note that for scalar $W_0$, we have
$W_i = \beta\bigl(I - S(S^{\top}S)^{-1}S^{\top}\bigr)$, so $\beta$ cancels out of the right-hand side
of Eq. (19.27), and $W_i s_i$ amounts to computing the projection
of $s_i$ onto the complement of the span of S. Now we make the
inductive assumption that $A_i$ is positive definite. If this holds,
then the second term being subtracted on the right-hand side
of Eq. (19.27) is strictly positive (the numerator is the square
of a real number, the denominator is non-negative because $A_i$
is positive definite). So the right-hand side of the inequality is
smaller than $y_i^{\top}A_i^{-1}y_i$. Furthermore, the left-hand side,
$y_i^{\top}s_i = s_i^{\top}As_i$, is positive, since A is assumed to be spd. We will show
that the upper bound $y_i^{\top}A_i^{-1}y_i$ can be brought arbitrarily close
to zero for large values of $\alpha$, so that the inequality eventually
has to hold.
where the matrix M in Eq. (19.24) is the inverse of the lower-right
$M \times M$ block (and depends on $\alpha$!). Its form is
$$M = \alpha\,\Bigl(Y^{\top}Y - (Y^{\top}S)(S^{\top}S)^{-1}(Y^{\top}S)^{\top}\Bigr)^{-1}.$$
$$p(Y, S \mid \alpha, \beta) = \int p(Y \mid A, S)\,p(A \mid \alpha, \beta)\,\mathrm{d}A.$$
In the analogue to Eq. (6.2), we take care of the Gaussian part
by re-arranging; define5
$$\bar{a} := M^{-1}\operatorname{tr}\bigl(Y^{\top}S(S^{\top}S)^{-1}\bigr). \quad (22.10)$$
With this, the second line of Eq. (22.8), the posterior on $\alpha$, becomes6
$$p(\alpha \mid \beta^2, Y, S) = N\Bigl(\alpha;\ \frac{\lambda_0\mu_0 + M\bar{a}}{\lambda_0 + M},\ \frac{\beta^2}{\lambda_0 + M}\Bigr).$$
[6: For an intuition of this expression, note that if $S = I_{:,1:M}$ (the first M columns of the identity), then the posterior mean on $\alpha$ is essentially (up to regularisation) computing a running average of A's first M diagonal elements.]
To approach the posterior on the variance $\beta^2$, we continue to
follow the guidance from Chapter I and apply the matrix inversion
lemma, which yields
$$\operatorname{tr}\bigl(2Y^{\top}Y(S^{\top}S)^{-1} - Y^{\top}S(S^{\top}S)^{-1}Y^{\top}S(S^{\top}S)^{-1}\bigr) - 2\mu_0 M\bar{a} + \mu_0^2 M - \frac{(M\bar{a} - M\mu_0)^2}{\lambda_0 + M},$$
and7
$$\det\Bigl(\beta\,G \circledast G + \frac{1}{\lambda_0}\bar{s}\bar{s}^{\top}\Bigr) = \det(\beta\,G \circledast G)\Bigl(1 + \frac{1}{\lambda_0}\bar{s}^{\top}(G \circledast G)^{-1}\bar{s}\Bigr) = \beta^{2(NM - \frac{1}{2}(M^2 - M))}\det(G)\Bigl(1 + \frac{M}{\lambda_0}\Bigr).$$
$$\mu_N = \frac{\lambda_0\mu_0 + M\bar{a}}{\lambda_0 + M}, \qquad \lambda_N = \lambda_0 + M, \qquad a_N = a_0 + \tfrac{1}{2}\bigl(NM - \tfrac{1}{2}(M^2 - M)\bigr),$$
$$b_N = b_0 + \tfrac{1}{2}\Bigl(\operatorname{tr}\bigl(2Y^{\top}Y(S^{\top}S)^{-1} - Y^{\top}S(S^{\top}S)^{-1}Y^{\top}S(S^{\top}S)^{-1}\bigr) - 2\mu_0 M\bar{a} + \mu_0^2 M - \frac{\bigl(M(\bar{a} - \mu_0)\bigr)^2}{\lambda_0 + M}\Bigr). \quad (22.11)$$
The trace term simplifies via the singular value decomposition:
$$\operatorname{tr}\bigl(2Y^{\top}Y(S^{\top}S)^{-1} - Y^{\top}S(S^{\top}S)^{-1}Y^{\top}S(S^{\top}S)^{-1}\bigr) = \operatorname{tr}\bigl(2UR^{\top}R\Sigma^{-2}U^{\top} - U(R^{\top}\Sigma^{-1})(R^{\top}\Sigma^{-1})U^{\top}\bigr)$$
$$= \operatorname{tr}\bigl(2R^{\top}R\Sigma^{-2} - (R^{\top}\Sigma^{-1})(R^{\top}\Sigma^{-1})\bigr) = \sum_{ij=1}^{M}\bigl(2R_{ji}R_{ji}\sigma_i^{-2} - R_{ij}R_{ji}\sigma_i^{-1}\sigma_j^{-1}\bigr) = \sum_{ij=1}^{M}(V^{\top}AV)^2_{ij}.$$
With
$$\hat{e}^2 := \frac{1}{M}\sum_{ij=1}^{M}\Bigl((V^{\top}AV)^2_{ij} - (V^{\top}AV)_{ii}(V^{\top}AV)_{jj}\Bigr),$$
this gives
$$b_N = b_0 + \frac{1}{2}\Bigl(M\hat{e}^2 + \frac{\lambda_0 M(\bar{a} - \mu_0)^2}{\lambda_0 + M}\Bigr).$$
insights.
Software
The ProbNum library1 provides reference implementations of
probabilistic linear solvers with posterior uncertainty quantification.
[1: Code at probnum.org. See the corresponding publication by Wenger et al. (2021).]
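A minimal usage sketch (API details may differ across ProbNum versions; the entry point and return values below follow the interface described by Wenger et al. (2021) and should be checked against probnum.org):

```python
import numpy as np
import probnum as pn   # assumed installed

# Solve a small spd system with a probabilistic linear solver; the returned
# beliefs over the solution, the matrix, and its inverse carry uncertainty.
rng = np.random.default_rng(8)
Q = rng.standard_normal((10, 10))
A = Q @ Q.T + 10 * np.eye(10)
b = rng.standard_normal(10)

x, Ahat, Ainv, info = pn.linalg.problinsolve(A, b)   # assumed entry point
print(x.mean, info)                                  # posterior mean of the solution
```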
Chapter IV
Local Optimisation
24
Key Points
The preceding chapters on integration and linear algebra dealt
with linear problems.1 In this chapter, we move to more general,
nonlinear tasks. This theme will then continue in the chapters
on differential equations.
[1: Integration is a linear operation insofar as, for two functions f, g and real numbers $a, \beta \in \mathbb{R}$,
$$\int af(x) + \beta g(x)\,\mathrm{d}x = a\int f(x)\,\mathrm{d}x + \beta\int g(x)\,\mathrm{d}x.$$
A Gaussian (process) prior on an integrand f is thus associated with a Gaussian marginal on the integral over f. The linear algebra problems studied in Chapter III actually do not have this property, because matrix inversion is non-linear ($(A + B)^{-1} \ne A^{-1} + B^{-1}$). The principal linear property used in Chapter III is that matrix-vector multiplications $As = y$ provide a linear projection of the latent matrix A.]
Nonlinear optimisation problems are another class of numerical
tasks that have been studied to extreme depths. They have
ubiquitous applications of massive economic relevance. As in
previous chapters, the scope of this text is too limited to give a
comprehensive overview, nor even to address just a significant
part of the myriad different types of optimisation problems
(some listed below). We will focus on the following basic set-up:
consider the real-valued function $f(x) \in \mathbb{R}$ with multivariate
inputs $x \in \mathbb{R}^N$. Typical values for N can vary from a handful
(e.g. in control engineering) to billions (e.g. in contemporary
machine learning).
(That is, we will use the notation arg min even for local minima,
for simplicity). Introductions to classic nonlinear optimisation
methods can be found in a number of great textbooks. Nocedal
and Wright (1999) provide an accessible, practical introduc
tion with an emphasis on unconstrained, not-necessarily-convex
problems. Boyd and Vandenberghe (2004) offer a more theoreti
cally minded introduction concentrating on convex problems.
Both books also discuss constrained problems, and continuous
optimisation problems that do not have a continuous gradient
everywhere (so-called non-smooth problems). These two areas
are at the centre of the book by Bertsekas (1999). Other popular
types of optimisation include discrete, and mixed-integer “pro-
grams”.3 Genetic Algorithms and Stochastic Optimisation are 3 For historical reasons, the optimisation
and operations research communities
also large communities, interested in optimising highly noisy or
use the terms “program” and “problem”,
fundamentally rough functions (see e.g. the book by Goldberg as well as “programming” and “optimi
(1989)). Such noise (i.e. uncertainty/imprecision) on function sation” synonymously. A mixed integer
program is a problem involving both con
values will play a central role in this chapter - in fact, one could tinuous (real-valued) and discrete param
make the case that Probabilistic Numerics can bridge some con eters.
ceptual gaps between numerical optimisation and stochastic
optimisation. However, we will make the assumption that there
is at least a smooth function “underneath” the noise. Stochastic
and evolutionary methods are also connected to the contents of
Chapter V.
produce a sequence of points4 $x_i$, $i = 0, \dots, M$, that should
ideally converge towards $x_*$ in a robust and fast way. Here
"robust" may mean that the sequence will converge from more
or less any choice of the starting point $x_0$. "Fast" means that
either the sequence of residuals of the estimates $\{\|x_i - x_*\|\}_{i\ge 0}$
or the sequence of function values $\{f(x_i) - f(x_*)\}_{i\ge 0}$ converge
to 0 at some high rate.5
[4: We will make the philosophical leap to call these iterates "estimates". That is how they are interpreted by practitioners (who can never run an optimiser to perfect convergence), and the classic convergence analysis also supports this interpretation.]
[5: Several different types of rates are used in optimisation when talking about how fast a sequence $\{r_i\}_{i\in\mathbb{N}}$ converges to zero. Stephen Wright once summarised them thus:
- A sublinear rate, in the optimisation context, means that $r_i \to 0$, but $r_{i+1}/r_i \to 1$. An example is the decrease $r_i \le C/i$ for some constant C.
- A linear or geometric (sometimes also exponential) rate means that $r_{i+1} \le d\,r_i$ for a constant $0 < d < 1$. Thus $r_i \le Cd^i$ for some constant C.
- A super-linear rate means $r_{i+1}/r_i \to 0$. In other words, a geometric decrease with decreasing constant $d \to 0$. We will see that this rate can only be achieved with comparably elaborate algorithms.
- A quadratic rate means that $r_{i+1} \le d\,r_i^2$ for some constant d (which may even be larger than 1 and still allow convergence if $r_0$ is sufficiently small!). This is the rate of Newton's method. Achieving it generally requires access to the Hessian function B.
In a quadratic decrease, the number of leading digits in $x_i$ that match those of $x_*$ doubles at each iteration. The consensus is that, in high-dimensional problems, this should be fast enough for anybody.]
For many classic local optimisers, each iteration from $x_i$ to
$x_{i+1}$ consists of the same two principal steps as for the linear
solvers defined in Algorithm 17.2 (a generic sketch in code follows below):
- Decide on a search direction $d_i \in \mathbb{R}^N$, meaning that the next
iterate will have the form $x_{i+1} = x_i + \alpha_i d_i$ with $\alpha_i \in \mathbb{R}$. This
step usually involves at least one call to the "black boxes"
f and/or $\nabla f$. Several options for the choice of $d_i$ will be
discussed in §28; but the naive choice is to set $d_i = -\nabla f(x_i)$.
This is known as gradient or steepest descent and is such an
elementary idea that there may be no meaningful citation
for its invention.6 Exercise 25.2 shows that gradient descent
actually has certain pathological properties. Nevertheless, it
remains a popular algorithm.
- Fix the step-size $\alpha_i$. This may be done by a closed-form guess
about the optimal step-size. But if this step is performed by
evaluating f and/or $\nabla f$ for different values of $\alpha$ in search of a
"good" location close to the optimum along this direction, it is
called a line search. If the line search finds the global optimum
(along this univariate direction) $\alpha^* = \arg\min_\alpha f(x_i + \alpha d_i)$,
it will be called perfect. This is an analytical device - for
nonlinear problems, perfect line searches do not exist in
practice, although a good line search on a convex problem
can come close.
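To fix ideas, here is the two-step loop in code (an illustrative sketch, not an algorithm from the text; the crude backtracking rule stands in for a real line search):

```python
import numpy as np

def local_minimise(f, grad, x0, iters=100):
    """Generic local optimiser: pick a direction, then fix a step size."""
    x = x0
    for _ in range(iters):
        d = -grad(x)                          # naive direction: steepest descent
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx and alpha > 1e-12:
            alpha *= 0.5                      # crude backtracking "line search"
        x = x + alpha * d
    return x

# usage on a simple convex quadratic with minimum at -c
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ x + x @ c
grad = lambda x: x + c
print(local_minimise(f, grad, np.zeros(2)))   # -> approximately [-1.  2.]
```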
Note that both $f(\alpha)$ and $f'(\alpha)$ are scalars. The entirety of this
section will be concerned with the problem of finding good
values for $\alpha$. This all happens entirely within one "inner loop",
with almost no propagation of state from one line search to
another. So we drop the subscript i. This is a crucial point that
is often missed at first. It means that line searches operate in
a rather simple environment, and their computational cost is
rather small. For intuition about the following results, it may
be helpful to keep in mind that a typical line search performs
between one and, rarely, ten evaluations of $f(\alpha)$ and $f'(\alpha)$,
respectively.4
[4: We here assume that it is possible to simultaneously evaluate both the function f and its gradient f'. This is usually the case in high-dimensional, truly "numerical" optimisation tasks. The theory of automatic differentiation (e.g. Griewank (2000)) guarantees that gradient evaluations can always be computed with cost comparable to that of a function evaluation. But there are some situations in which one of the two may be difficult to access, for example because it has different numerical stability. The probabilistic line search described below easily generalises to settings in which only one of the two, or in fact any set of linear projections of the objective function, can be computed.]
Because line searches are relatively simple yet important
algorithms, we can study them in detail. The following pages start by
constructing a non-probabilistic line search that is largely based
on versions found in practical software libraries and textbooks.
It is followed by its probabilistic extension.
► 26.3 The Wolfe Termination Conditions
Building a line search requires addressing two problems: where
to evaluate the objective (the search), and when to stop it (the
termination). We will start with the latter. Intuitively, it is not
necessary for these inner-loop methods to actually find a true local
$$f(\alpha) \le f(0) + c_1\,\alpha f'(0), \quad (26.7)$$
$$|f'(\alpha)| \le c_2\,|f'(0)|. \quad (26.8)$$
(Eq. (26.7) is identical to Eq. (26.5); it is just reprinted for easy
reference.) For every continuously differentiable f that is bounded
below, there exist step lengths that satisfy the conditions.7
[7: Nocedal and Wright (1999), Lemma 3.1]
[Figure 26.1: The Wolfe conditions for line search termination. Optimisation utility $f(\alpha)$ and $f'(\alpha)$ as black curves (top and bottom figure, respectively). True optimal step $\alpha^*$ is marked by a black circle. The sufficient-decrease condition excludes the grey region in the top figure, constraining the acceptable space to $[0, \alpha_2]$. Adding the weak curvature condition additionally excludes the lower grey region in the bottom plot (thus restricting the acceptable space now to $[\alpha_1, \alpha_2]$). The strong extension also excludes the top region (restricting to $[\alpha_1, \min\{\alpha_2, \alpha_3\}]$). All points in between are considered acceptable by the Wolfe conditions. In this example, the true extremum lies within that acceptable region, but this is not guaranteed by the conditions. (For this plot, the parameters were set to $c_1 = 0.4$, $c_2 = 0.5$ to get an instructive plot. These are not particularly smart choices for practical problems; see the end of §26.2.)]
As can be deduced from Figure 26.1, the Wolfe conditions do
not guarantee that the true optimum lies within their bracket of
acceptable step sizes (not even if the objective is convex). Nor do
they strictly guarantee that the optimal step size is particularly
close to the chosen one. However, they obviously prevent certain
kinds of undesirable behaviour in the optimiser, like increasing
function values from one step to another, or step size choices
that are drastically too small. When used in combination with
algorithms choosing the search directions di in the outer loop,
the Wolfe conditions can also provide some guarantees for
di+1. For example, for the BFGS method discussed in §28, the
curvature condition (26.6) guarantees that di+1 is a descent
direction (see more in that section). Good choices for the two
parameters c1, c2 depend a little bit on the application and
the outer-loop optimisation method. But practical line searches
often use lenient choices8 with a small $c_1$, e.g. $c_1 = 10^{-4}$, and
[8: Nocedal and Wright (1999), §3.1]
a large $c_2$, e.g. $c_2 = 0.9$. Under these settings, the conditions
essentially just require a decrease in both function value and
absolute gradient, no matter how small.
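In code, the termination check is a one-liner for each condition (an illustrative sketch added here):

```python
def wolfe(f0, df0, fa, dfa, a, c1=1e-4, c2=0.9, strong=True):
    """Check the (strong) Wolfe conditions for a trial step a, given
    f(0), f'(0) and f(a), f'(a) along the search direction."""
    sufficient_decrease = fa <= f0 + c1 * a * df0                 # Eq. (26.7)
    curvature = (abs(dfa) <= c2 * abs(df0)) if strong else (dfa >= c2 * df0)
    return sufficient_decrease and curvature

print(wolfe(f0=1.0, df0=-2.0, fa=0.8, dfa=-0.1, a=0.5))           # -> True
```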
Matching the observed pairs $(f_0, f'_0)$ at $\alpha = 0$ and $(f_1, f'_1)$ at
$\alpha = \alpha_1$ with the cubic $f(\alpha) \approx f_0 + f'_0\alpha + b\alpha^2 + a\alpha^3$ yields
$$a = \frac{\alpha_1(f'_1 + f'_0) - 2(f_1 - f_0)}{\alpha_1^3}, \qquad b = \frac{3(f_1 - f_0) - \alpha_1(f'_1 + 2f'_0)}{\alpha_1^2}. \quad (26.10)$$
The derivative f' is a quadratic function, which has a unique
minimum at
$$\alpha_{\min} = \frac{-b + \sqrt{b^2 - 3af'_0}}{3a}.$$
[Figure 26.2: Cubic spline interpolation for searching along the line. Each (top and bottom) pair of frames shows the same plot as in Figure 26.1, showing progress of the search and interpolation steps. Left: The first evaluation only allows a linear extrapolation, which requires an initial ad hoc extrapolation step. Centre: The first extrapolation (to the second evaluation) step happened to be too small, so another extrapolation step is required. Since the "natural" cubic spline extrapolation is linear, the next step is again based on an ad hoc extension of the initial step. Right: The third evaluation finally brackets a local minimum. An interpolation step follows, using a cubic spline interpolant. The next step will be at the local minimum of the interpolant. It so happens that this point (empty square) will provide an evaluation pair that satisfies the Wolfe conditions.]
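The interpolation step in code (an illustrative sketch of Eq. (26.10) and the minimiser of the cubic; assumes the cubic coefficient is non-zero):

```python
import numpy as np

def cubic_min(f0, df0, f1, df1, a1):
    """Fit f(a) ~ f0 + df0*a + b*a^2 + a3*a^3 to values/derivatives at 0 and a1,
    then return the local minimiser of the cubic."""
    a3 = (a1 * (df1 + df0) - 2 * (f1 - f0)) / a1**3
    b = (3 * (f1 - f0) - a1 * (df1 + 2 * df0)) / a1**2
    disc = b * b - 3 * a3 * df0
    return (-b + np.sqrt(disc)) / (3 * a3)    # '+' root is the minimum for a3 > 0

# sanity check on f(a) = a^3 - 3a, whose local minimum is at a = 1
f = lambda a: a**3 - 3 * a
df = lambda a: 3 * a**2 - 3
print(cubic_min(f(0.0), df(0.0), f(3.0), df(3.0), 3.0))   # -> 1.0
```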
It will turn out that all three of these issues can be addressed
jointly, by casting spline interpolation as the noise-free limit of
Gaussian process regression.
Recall from §5.4, specifically Eq. (5.27), that the total solution of
the stochastic differential equation
$$\mathrm{d}\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix}\mathrm{d}\alpha + \begin{bmatrix} 0 \\ q \end{bmatrix}\mathrm{d}\omega_\alpha \quad (26.11)$$
is the (once-)integrated Wiener process.
Then we can use the standard results from §4.2, and in particular
§4.4, to compute a Gaussian process posterior measure with
posterior mean function
$$\begin{bmatrix} \nu_x \\ \nu'_x \end{bmatrix} = \begin{bmatrix} \mu_x \\ \mu'_x \end{bmatrix} + \begin{bmatrix} k_{xA} & k^{\partial}_{xA} \\ {}^{\partial}k_{xA} & {}^{\partial}k^{\partial}_{xA} \end{bmatrix}\left(\begin{bmatrix} k_{AA} & k^{\partial}_{AA} \\ {}^{\partial}k_{AA} & {}^{\partial}k^{\partial}_{AA} \end{bmatrix} + \Lambda\right)^{-1}(Y - \mu_A). \quad (26.14)$$
Since the kernels $k_{xA}$ and $k^{\partial}_{xA}$ are (at most) cubic polynomials, so is the
posterior mean. Analogously, the posterior mean inherits the
points of non-differentiable first derivative from these kernels,
at the K points $\alpha = \alpha_i$, $i = 1, \dots, K$.
In the limit of $\Lambda \to 0$, i.e. if we assume the evaluations are
available without noise, the cubic splines of (26.9) with
parameters given by Eq. (26.10) are the only feasible piecewise
cubic estimate for $\alpha \in [\alpha_0, \alpha_K]$, since then each spline is restricted
by 4 conditions on either end of $[\alpha_{i-1}, \alpha_i]$, $i = 1, \dots, K$. In this
sense, the integrated Wiener process prior for f is an extension
of cubic spline interpolation to non-zero values of the noise
covariance $\Lambda \in \mathbb{R}^{K \times K}$. This construction allows explicit modelling
of the observation noise (Figure 26.5). The additional cost of this
probabilistic line search is often negligible, even without using
Kalman filtering and smoothing to speed up the computation
of the GP posterior from Eqs. (26.14) and (26.15).16
[16: A remark on implementation: Since $\Lambda$ is block diagonal, we can compute the posterior mean functions $\nu$, $\nu'$ by filtering (§5). Then each of the K filter and smoother steps involves matrix-matrix multiplications of 2 x 2 matrices, as does Equation (26.10), and the computational cost of this regression procedure is only a constant multiple of that of the classic interpolation routine. However, since a typical line search performs only a very limited number (< 10) of function and gradient evaluations, the concrete implementation practically does not matter, and the direct form of Eq. (26.15) can be used instead. More generally, the cost overhead of this line search is often negligible when compared to the demands of computing even a single gradient if $N \gg K$.]
► 26.5.2 Selecting Evaluation Nodes
In the noise-free setting, our line search chose the evaluation
node $\alpha_i$ either at an extrapolation point, following some rule for
how to grow extrapolation steps, or, once an outer bound for the
extrapolation has been established, as the minimum of the spline
interpolant. Because it was possible to perform deterministic
bisections and the interpolant is a cubic polynomial, there was
always exactly one17 such evaluation candidate. But in the noisy
setting, no part of the search domain can ever be "bisected away"
with absolute certainty. Hence, the probabilistic line search will
have to consider a set of candidates for the next evaluation $\alpha_i$,
and use some decision rule to settle on one of them. We will now
first design a finite set of candidate points $\tau := [\tau_1, \dots, \tau_L] \in \mathbb{R}_+^L$;
then address the question of how to choose among them.
[17: If the spline interpolant has no internal minimum, then the "right" end of the bisection either has negative gradient (thus we would extrapolate), or the right-most gradient is zero, and thus accepted by the Wolfe conditions.]
[Exercise 26.4 (advanced, discussion on p. 365). Are there other possible choices to extend the cubic spline model to Gaussian process regression? More precisely, is there a Gaussian process prior $\mathcal{GP}(f; \mu, k)$ with mean function $\mu$ and covariance function k, such that, given observations under the likelihood (26.12), the posterior mean, for $\Lambda \to 0$, converges to the cubic spline of Eq. (26.9) with parameters (26.10) in the inner intervals $\alpha_0 < \alpha < \alpha_K$?]
First, because the presence of noise makes it impossible to
rule out the possibility that a local optimum lies to the right
of $\alpha_{\max} := \max\{\alpha_i\}_{i=0,\dots,K}$, our list of candidates will always
include $\tau_1 = \alpha_{\max} + r_i$, where $r_i$ is some extrapolation step.
Just as in the noise-free case, the extrapolation strategy could
be chosen more or less aggressively, depending on the problem
setting. For high-dimensional optimisation problems in machine
learning, a constant ($r_i = 1$) or linearly growing ($r_i = i$) policy
may be better than the very aggressive exponential growth
($r_i = 2^i$).
To cover the domain between the previous evaluations, we
will add all the local minima of the posterior mean function $\nu(\alpha)$.
$$u_{EI}(\tau) = \mathbb{E}_{p(f(\tau)\mid Y)}\bigl(\min\{0,\ \eta - f(\tau)\}\bigr)$$
$$\begin{bmatrix} a_\alpha \\ b_\alpha \end{bmatrix} = \begin{bmatrix} 1 & c_1\alpha & -1 & 0 \\ 0 & -c_2 & 0 & 1 \end{bmatrix}\begin{bmatrix} f(0) \\ f'(0) \\ f(\alpha) \\ f'(\alpha) \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
Thanks to the closure of the Gaussian family under linear
transformations (Eq. (3.4)), the Gaussian process measure on (f, f')
directly implies a bivariate Gaussian measure on $(a_\alpha, b_\alpha)$ for all
$\alpha$ (see Figure 26.5). If the posterior Gaussian process measure
on (f, f') is
$$p(f, f' \mid Y) = \mathcal{GP}\left(\begin{bmatrix} f \\ f' \end{bmatrix};\ \begin{bmatrix} \nu \\ \nu' \end{bmatrix},\ \begin{bmatrix} k & k^{\partial} \\ {}^{\partial}k & {}^{\partial}k^{\partial} \end{bmatrix}\right),$$
then
$$p(a_\alpha, b_\alpha) = N\left(\begin{bmatrix} a_\alpha \\ b_\alpha \end{bmatrix};\ \begin{bmatrix} m^a_\alpha \\ m^b_\alpha \end{bmatrix},\ \begin{bmatrix} C^{aa}_\alpha & C^{ab}_\alpha \\ C^{ab}_\alpha & C^{bb}_\alpha \end{bmatrix}\right), \quad\text{with, e.g.,}$$
$$C^{bb}_\alpha = c_2^2\,{}^{\partial}k^{\partial}_{00} - 2c_2\,{}^{\partial}k^{\partial}_{0\alpha} + {}^{\partial}k^{\partial}_{\alpha\alpha}.$$
And the probability for the weak Wolfe conditions to hold is
given by the standardised bivariate normal probability20
$$p(a_\alpha > 0 \wedge b_\alpha > 0) = \int_{-m^a_\alpha/\sqrt{C^{aa}_\alpha}}^{\infty}\int_{-m^b_\alpha/\sqrt{C^{bb}_\alpha}}^{\infty} N\left(\begin{bmatrix} a \\ b \end{bmatrix};\ \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 1 & \rho_\alpha \\ \rho_\alpha & 1 \end{bmatrix}\right)\mathrm{d}a\,\mathrm{d}b,$$
with the correlation coefficient $-1 \le \rho_\alpha := C^{ab}_\alpha/\sqrt{C^{aa}_\alpha C^{bb}_\alpha} \le 1$.
[20: This is the "bivariate error function", the generalisation of the univariate Gaussian cumulative density
$$\int_{-\infty}^{x} N(x'; 0, 1)\,\mathrm{d}x' = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr).$$
Just like the univariate case, there is no "analytic" solution, only an "atomic" one (in standard libraries, the error function is implemented as a special case of the incomplete Gamma function, computed either via a series expansion or a continued fraction, depending on the input). For more, see §6.2 of Numerical Recipes (Press et al., 1992). A highly accurate approximation of comparable cost to the error function was provided by Alan Genz (2004), whose personal website at Washington State University is a treasure trove of exquisite atomic algorithms for Gaussian integrals.]
The strong condition is slightly more tricky. It amounts to
the still linear restriction of $b_\alpha$ on either side, to
$$0 \le b_\alpha \le -2c_2 f'(0).$$
But of course we do not have access to the exact value of f'(0),
just a Gaussian estimate for it. A computationally easy, albeit
ad hoc, solution is to use the expectation21 $f'(0) \approx \nu'_0$ to set an
upper limit $\bar{b} := -2c_2\nu'_0$, and use it to compute an approximate
probability for the strong Wolfe conditions, as
$$p(a_\alpha > 0 \wedge 0 \le b_\alpha \le \bar{b}) = \int_{-m^a_\alpha/\sqrt{C^{aa}_\alpha}}^{\infty}\int_{-m^b_\alpha/\sqrt{C^{bb}_\alpha}}^{(\bar{b} - m^b_\alpha)/\sqrt{C^{bb}_\alpha}} N\left(\begin{bmatrix} a \\ b \end{bmatrix};\ \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 1 & \rho_\alpha \\ \rho_\alpha & 1 \end{bmatrix}\right)\mathrm{d}a\,\mathrm{d}b. \quad (26.17)$$
[21: Alternatively, one could also use the 95%-confidence lower and upper bounds $f'(0) \lessapprox \nu'_0 + 2\sqrt{{}^{\partial}k^{\partial}_{00}}$ and $f'(0) \gtrapprox \nu'_0 - 2\sqrt{{}^{\partial}k^{\partial}_{00}}$ to build more lenient or restrictive decision rules, respectively.]
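Numerically, this bivariate orthant probability is available through standard routines (an illustrative sketch using SciPy's multivariate normal CDF, added here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def prob_weak_wolfe(m_a, m_b, C_aa, C_bb, C_ab):
    """P(a > 0 and b > 0) for (a, b) ~ N([m_a, m_b], [[C_aa, C_ab], [C_ab, C_bb]]).
    Uses P(a > 0, b > 0) = P(-a <= 0, -b <= 0), with (-a, -b) Gaussian as well."""
    cov = np.array([[C_aa, C_ab], [C_ab, C_bb]])
    return multivariate_normal(mean=[-m_a, -m_b], cov=cov).cdf([0.0, 0.0])

print(prob_weak_wolfe(1.0, 0.5, 1.0, 1.0, 0.3))   # a fairly likely Wolfe point
```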
Algorithm 26.1 provides pseudo-code for the thus completed
probabilistic extension of classic line searches.
1  procedure PROBLINESEARCHSKETCH(f~, y0, y0')
2      T, Y, Y' <- 0, y0, y0' in R                // initialise storage
3      t <- 1                                     // initial candidate equals previous step-length
4      while length(T) < 10 and no Wolfe-point found do
5          [y, y'] <- f~(t)                       // evaluate objective
6          T, Y, Y' <- T u t, Y u y, Y' u y'      // update storage
7          GP <- GPINFERENCE(T, Y, Y')
8          P_Wolfe <- PROBWOLFE(T, GP)            // Wolfe prob. at T
9          if any P_Wolfe > c_W then              // done?
10             return t* = arg max P_Wolfe        // success!
11         else                                   // keep searching!
12             T_cand <- COMPUTECANDIDATES(GP)    // candidates
13             EI <- EXPECTEDIMPROVEMENT(T_cand, GP)
14             PW <- PROBWOLFE(T_cand, GP)
15             t <- arg max(PW * EI)              // find best candidate
16         end if
17     end while
18     return error                               // fall-back: no acceptable point found
19 end procedure
[Algorithm 26.1: Pseudo-code for a probabilistic line search. Adapted from Mahsereci and Hennig (2017). The code assumes access to a function handle f~ that returns scaled pairs of values and gradients of the form
$$\tilde{f}(t) = \Bigl[\frac{f(t) - f(0)}{f'(0)},\ \frac{f'(t)}{f'(0)}\Bigr],$$
where $t = \alpha/\alpha_{-1}$ is the input variable scaled by the step length of the previous line search. This re-scaling allows the use of standardised scales for the GP prior, and thus avoids hyperparameters. Note that all operations take place on the scalar projected gradients in $\mathbb{R}$, not on full gradients in $\mathbb{R}^N$. The code sets a fixed budget of 10 evaluations, after which the search aborts. In this case, a fall-back solution can be returned, for example the point $t^*$ that minimises the GP posterior mean. In practice, this fall-back is very rarely required.]
$$S_i = \frac{1}{M}\sum_{m=1}^{M}\ell^2_m(\alpha_i), \quad\text{and}\quad S'_i = \frac{1}{M}\sum_{m=1}^{M}\bigl(\partial\ell_m(\alpha_i)\bigr)^2, \quad (26.18)$$
$$\sigma^2_{f,i} = \frac{S_i - y_i^2}{M - 1}, \qquad \sigma^2_{f',i} = \frac{S'_i - (y'_i)^2}{M - 1}.$$
from the probabilistic standpoint. These are just case studies,2
and should not be taken as the sole solutions to these issues.
Quite to the contrary, both aspects have been studied by many
authors in the machine learning community. The point of this
section is to highlight two aspects. First, on a conceptual level,
[2: The two sections are short summaries of the following two papers, respectively: Balles, Romero, and Hennig (2017) as well as Mahsereci et al. (2017). Additional information can be found in M. Mahsereci's PhD thesis (2018).]
section is to highlight two aspects. First, on a conceptual level,
the probabilistic viewpoint offers a unifying framework in which
to reason about algorithmic design. But second, practical consid
erations may also require us to once again ease off the Bayesian
orthodoxy. Especially when it comes to nuisance parameters
deep inside a low-level algorithm, not every quantity has to
have a full-fledged calibrated posterior. The probabilistic frame
work may then still help in the derivations, but point estimates
derived from the probabilistic formulation may just do the trick.
For the moment, we consider the simple (and flawed, yet popular)
case of stochastic gradient descent, i.e. the optimiser given
by the update rule
$$x_{i+1} = x_i - \alpha\,g_M(x_i),$$
where $g_M$ is the batch gradient estimate of Eq. (27.1).
For simplicity, we will assume that the optimiser has free con
trol over the batch size M.3 Deciding for a concrete value of 3In practice, aspects like the cache size
of the processing unit usually mean that
M, the optimisation algorithm now faces a trade-off: A large
batch sizes have to be chosen as an inte
batch size provides a more informative (precise) estimate of the ger multiple of the maximal number of
true gradient, but increases computation cost. From Eq. (27.1), data points that can be simultaneously
cached.
the standard deviation of gM only drops with M-1/2, but the
computation cost of course rises linearly with M. So we may
conjecture that there is an optimal choice for M. It would be
nice to know this optimal value; but since this is a very low-
level consideration about a hyperparameter of an inner-loop
algorithm, an exact answer is not as important as a cheap one.
Thus, we will now make a series of convenient assumptions to
arrive at a heuristic:
First, assume that the true gradient $\nabla L$ is Lipschitz continuous
(a realistic assumption for machine learning models) with
Lipschitz constant L. That is,
$$\|\nabla L(x) - \nabla L(y)\| \le L\,\|x - y\| \quad\text{for all } x, y.$$
Since cost is linear in M, we then consider the expected gain
per computation cost, $\mathbb{E}(G)/M$, which is a rational function in M,
and we can find the M that maximises this expression. It is
$$M^* = \frac{2L\alpha\operatorname{tr}\Sigma}{(2 - L\alpha)\,\|\nabla L(x_i)\|^2}. \quad (27.3)$$
Since L is usually not known a priori and the norm of the true
gradient is not accessible, we finally make a number of strongly
simplifying assumptions to arrive at a concrete heuristic that
does not add computational overhead. First, assume that $\nabla L$
is not just Lipschitz continuous but also differentiable, and
the Hessian of L is scalar: $B(x_i) \approx hI_N$. This means5 that
[5: A detailed derivation is in the original publication.]
$$M^* = \frac{\alpha\operatorname{tr}\Sigma}{(2 - h\alpha)\bigl(L(x_i) - L^*\bigr)},$$
and, for small step sizes $h\alpha \ll 2$,
$$M^* \approx \frac{\alpha\operatorname{tr}\Sigma}{2\bigl(L(x_i) - L^*\bigr)}.$$
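As a code sketch of this heuristic (illustrative; the per-example gradient matrix, the running loss, and the assumed optimal loss are inputs this toy version takes for granted):

```python
import numpy as np

def coupled_batch_size(grad_samples, alpha, loss, loss_opt=0.0, m_min=8):
    """Heuristic batch size from the variance of per-example gradients:
    M* ~ alpha * tr(Sigma) / (2 * (L(x) - L*)), the simplified form above."""
    tr_sigma = grad_samples.var(axis=0, ddof=1).sum()   # estimate of tr(Sigma)
    m_star = alpha * tr_sigma / (2.0 * max(loss - loss_opt, 1e-12))
    return max(m_min, int(np.ceil(m_star)))

# usage with fake per-example gradients (rows) at the current iterate
rng = np.random.default_rng(9)
g = rng.standard_normal((128, 20)) * 0.5 + 0.1
print(coupled_batch_size(g, alpha=0.1, loss=0.7))
```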
set. The optimiser only gets access to the empirical risk on the
training set (possibly sub-sampled into batches). A separate
monitor observes the evolution of the validation risk, and stops
the optimiser when the validation risk starts rising. Apart from
technicalities (reliably detecting a rise in the validation risk is
itself a noisy estimation problem), the principal downside of
this approach is that it “wastes” a significant part of the data
on the validation set. Collecting data often has a high financial
and time cost, and the data in the validation set cannot be used
by the optimiser to find a more general minimum. Even if we
ignore the issue of overfitting, we still need some way to decide
when to stop the optimiser. Many practitioners just run the
method “until the learning-curve is flat”, which is wasteful.
This section describes a simple statistical test as an alterna
tive, which is particularly suitable for small data sets, where
constructing a (sufficiently large) validation set is not feasible. It is
based on work by Mahsereci et al. (2017), and makes explicit
use of the observation likelihood (Eq. (27.1)). Let ppop. be the
distribution from which data points are sampled in the wild,
and f be the population risk
$$p(\nabla L(x) \mid 0) = N\bigl(\nabla L(x);\ 0,\ \tilde{\Sigma}_x\bigr),$$
which leads to the stopping criterion
$$1 - \frac{1}{N}\sum_{k=1}^{N}\frac{[\nabla L(x)]_k^2}{[\tilde{\Sigma}_x]_{kk}} > 0.$$
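A sketch of this test (illustrative; the variance of the batch gradient is estimated from per-example gradients, which stands in for the paper's variance estimate):

```python
import numpy as np

def evidence_stop(grad_samples):
    """Stop when the batch gradient is plausible under the null hypothesis
    that the true gradient is zero: 1 - (1/N) sum_k g_k^2 / Var[g_k] > 0."""
    M, N = grad_samples.shape
    g = grad_samples.mean(axis=0)                       # batch gradient
    var_mean = grad_samples.var(axis=0, ddof=1) / M     # variance of that mean
    return 1.0 - np.mean(g**2 / (var_mean + 1e-12)) > 0

rng = np.random.default_rng(10)
noise_only = rng.standard_normal((256, 50))   # pure noise: statistic hovers near 0
signal = noise_only + 5.0                     # strong true gradient: clearly negative
print(evidence_stop(noise_only), evidence_stop(signal))   # e.g. True False
```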
Section 28.3 will discuss a second type of rules that also consider
interactions between gradient elements. This classification is pri
marily a computational consideration, not an analytic one. Some
of the methods in this first class converge asymptotically faster
than gradient descent; virtually all the methods in the second
class converge slower than Newton’s method. But element-wise
rules scale more readily to very high-dimensional problems.
This is one of the reasons why they are currently the popular
choice in machine learning, where the number N of parameters
to be optimised is frequently in the range beyond $10^6$.
Nesterov's accelerated method3 is a variant of momentum.
[Margin equation: $m\ddot{x}(t) = -\kappa\,\dot{x}(t) - \nabla f(x(t))$. (28.1)]
what the real gradient may be. The optimiser can just check.
With this, we can consider two basic choices for the SDE defining
the Kalman filter: the Wiener process and the Ornstein-Uhlenbeck
process. Here we have allowed for separate drift ($\gamma_n$) and
diffusion ($\theta_n$) scales for each element of the gradient. They
translate into the Kalman filter parameters (see Algorithm 5.1
and Eq. (5.21)), and the iterates are updated as
$$x_i \gets x_{i-1} + \alpha_i\,m_{i-1}.$$
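A minimal sketch of such a gradient filter (illustrative; a scalar Ornstein-Uhlenbeck model per gradient element, with assumed drift gamma, diffusion theta, and observation noise r):

```python
import numpy as np

def ou_gradient_filter(noisy_grads, gamma=0.9, theta=0.5, r=1.0):
    """Kalman filter with OU dynamics g_i = gamma * g_{i-1} + N(0, theta^2),
    observed under noise N(0, r). Returns the filtered gradient means."""
    m, P = 0.0, 10.0                    # broad prior on the latent gradient
    means = []
    for y in noisy_grads:
        m_pred, P_pred = gamma * m, gamma**2 * P + theta**2   # predict
        K = P_pred / (P_pred + r)                             # Kalman gain
        m = m_pred + K * (y - m_pred)                         # update mean
        P = (1 - K) * P_pred                                  # update variance
        means.append(m)
    return np.array(means)

rng = np.random.default_rng(11)
obs = 2.0 + rng.standard_normal(50)          # noisy observations of gradient 2.0
print(ou_gradient_filter(obs)[-1])           # near 2.0, shrunk slightly toward 0
```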
Newton's method13 is about the fastest optimisation method for
multivariate problems anyone could wish for. A straightforward
way to derive this iterative update rule is to consider the
second-order Taylor expansion of the objective f around the current
iterate $x_i$,
[13: Also known as Newton-Raphson optimisation. The idea itself is ancient; a primitive form may already have been known to the Babylonians (Fowler and Robson, 1998, p. 376).]
$$f(x_i + d) \approx f(x_i) + d^{\top}\nabla f(x_i) + \tfrac{1}{2}d^{\top}B(x_i)\,d.$$
(Recall that B is the notation for the Hessian of f.) If the Hessian
is symmetric positive definite (if f is locally convex), then this
quadratic approximation has a unique minimum, which defines
the next Newton iterate,
$$x_{i+1} = x_i - B(x_i)^{-1}\nabla f(x_i).$$
$$y_i = B\,s_i, \quad (28.10) \qquad s_i \overset{?}{=} H\,y_i.$$
Name | $c_i$ | Reference
Symmetric Rank-1 (SR1) | $c_i = y_i - B_{i-1}s_i$ | Davidon (1959)
Powell Symmetric Broyden | $c_i = s_i$ | Powell (1970)
Greenstadt's method | $c_i = B_{i-1}s_i$ | Greenstadt (1970)
DFP | $c_i = y_i$ | Davidon (1959); Fletcher & Powell (1970)
BFGS | $c_i = y_i + \sqrt{\tfrac{y_i^{\top}s_i}{s_i^{\top}B_{i-1}s_i}}\,B_{i-1}s_i$ | Broyden (1969); Fletcher & Powell (1970); Goldfarb (1970); Shanno (1970)
[Table 28.1: The most popular members of the Dennis family, Eq. (28.11), defined by their choice of $c_i$ (middle column); see Martínez R. (1988) for more details. Note that the names DFP and BFGS consist of the first letters of the names of their inventors (right column).]
This situation sounds familiar, and indeed it is closely related
to the setup discussed at length in Chapter III on linear algebra.
Here as in the earlier chapter, an algorithm collects linear
projections of some matrix, and has to estimate that matrix.
This is also the reason why there is not just one ‘best’ quasi
Newton method. In contrast to the linear setting, where the
method of conjugate gradients is a contestant for the gold stan
dard for spd problems, there are entire families of quasi-Newton
methods. A widely studied one is the Dennis family¹⁶ of update rules of the form

    B_{i+1} = B_i + [(y_i − B_i s_i)c_iᵀ + c_i(y_i − B_i s_i)ᵀ] / (c_iᵀs_i)
                  − [s_iᵀ(y_i − B_i s_i)] c_i c_iᵀ / (c_iᵀs_i)², (28.11)

where c_i ∈ ℝᴺ is a parameter that determines the concrete member of the family. Quasi-Newton methods were the subject of intense study from the late 1950s to the late 1970s. The most widely used members of the Dennis family are presented in Table 28.1. Among these, the BFGS method is arguably the most popular in practice, but this should not tempt the reader to ignore the other ones. This is particularly true for problems with noisy gradients.

16 Dennis (1971)
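As an illustration, here is a small numpy sketch of the rank-two update (28.11) as reconstructed above; the helper name and interface are ours. Choosing c = y recovers the DFP member of Table 28.1.

    import numpy as np

    def dennis_update(B, s, y, c):
        # One Dennis-family update, Eq. (28.11): B is the current Hessian
        # estimate, s the step, y the gradient difference, c the family
        # parameter. The result satisfies the secant equation
        # B_new @ s == y for any valid c (with c @ s != 0).
        r = y - B @ s
        cs = c @ s
        return (B + (np.outer(r, c) + np.outer(c, r)) / cs
                  - (s @ r) * np.outer(c, c) / cs**2)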
matrix estimate B_{i+1} that satisfies the secant equation (28.10).¹⁷

17 The DFP and BFGS methods can also
So far so good, but now, in the final step to complete the connection so that the next iteration still behaves like the Dennis method, we have to implicitly force the filter to add, in the prediction step, just the right terms to P₁ so that P₂⁻ = A₁P₁A₁ᵀ + Q₁ = W₂ ⊗ W₂ with an spd matrix W₂ that yields the required match W₂s₂ = c₂ to the Dennis family. It turns out that such a step does not always exist, because the necessary update W₂ − W₁ is not always symmetric positive definite. So, beyond the special case of the linear problems discussed at length in Chapter III, we cannot hope to find a general and one-to-one interpretation of existing quasi-Newton methods as Kalman filtering models.
▸ The surrogate is the optimiser's model for the objective function, reflecting the designer's prior assumptions. Little is more important in practice than designing a surrogate that is well-informed about the features of the objective (particularly where the objective's input space possesses unusual features, such as graphs or strings). This surrogate must be a probabilistic model, such as a Gaussian process, as the probabilistic treatment of uncertainty is vital to global optimisation. First, evaluations of the objective are often inexact - for instance, they may be corrupted by noise - and
Chapter IV presented methods designed to find the extremum¹ of an objective function that is assumed to be convex. These methods are often used on objectives that are, instead, multimodal: in such a case, they will converge only to the mode of the objective nearest (in some sense) their starting position. We will refer to such methods as providing local optimisation.

Global optimisation, as its name suggests, instead tackles the problem of finding the global minimiser x* of an, e.g., multimodal objective f(x) ∈ ℝ, where f(x*) := min_x f(x) is the minimum of all local modes of the function². This is a much more challenging problem, demanding the balancing of the exploration-exploitation trade-off.³

1 Without loss of generality, our discussion in this chapter will focus on minimisation. Maximisation can be achieved by minimising the negative of the objective.
2 Note that f may have no minimum at all - min_x f(x) is, however, defined if f is continuous and the domain is compact (Garnett, 2022).
3 This term is more common in the multi-armed bandit and reinforcement learning literatures (Sutton and Barto, 1998).

Exploitation corresponds to making an evaluation with a high probability of improvement. Typically, an exploitative move
is an evaluation whose result is known with high confidence,
usually near a known low function value, and that is expected
to yield an improvement, perhaps an improvement over that
known low function value, even if only an incremental one.
Exploitation hence typically hones in on a local mode, as would
be performed by local optimisation.
Exploration instead corresponds to evaluating the objective
in a region of high uncertainty. Such an evaluation is high
risk, and may be expected to supply no improvement over
existing evaluations at all. Nonetheless, exploration is warranted
by improbable, high-payoff, possibilities: such as finding an
altogether new local mode.
Exploration is hard. To be clear, for most problems, it is not difficult to find evaluations that are high-uncertainty. The search domain is normally enormous⁴ and begins as uniformly high-uncertainty. The low-uncertainty regions produced by our existing evaluations are few, like stars dotted in the void. The challenge of exploration is hence in sifting through the mass of uncertainty to find the evaluation that best promises potential reward. This is a challenge common to many aspects of intelligence: think of artistic creativity, or venture capital, or simply finding your lost keys. When humans explore, we draw upon some of our most profoundly intelligent faculties: we theorise, probe and map. As such, exploration for global optimisation motivates sophisticated algorithms.

4 For instance, if the search domain is [0, 1]^D, evaluating just the corners of the search box will require 2^D evaluations. This number for a realistic problem, say with D = 20, could easily exceed the permitted budget of evaluations.
Relative to local optimisation, global optimisation typically:
environmental monitoring,⁷ and software engineering,⁸ amongst many more. The real world is very often more complex than

7 Marchant and Ramos (2012)
8 Hoos (2012)
► 31.1 Prior
These are not the only plausible candidate losses; we will meet
alternatives below. Crucial to distinguishing these losses is a
careful treatment of the end-point of the optimisation. The loss
function must make precise what is to happen to the set of
obtained objective evaluations once the procedure ends, and
how valuable this outcome truly is. One crucial question is that
of when our algorithm must terminate. Termination might be upon the exhaustion of an a priori fixed budget of evaluations, or, alternatively, when a particular criterion of performance or convergence is reached. The former assumption of a fixed budget of N evaluations is the default within Bayesian optimisation, and will be taken henceforth.

We present in Figure 31.2 the decision problem for Bayesian optimisation. We seek to illustrate the iterative nature of optimisation and its final termination. In particular, the terminating condition for optimisation will often require us to select a single point⁷ in the domain to be returned: we will denote this point as x_N. At the termination of the algorithm, we will define the full set of evaluation pairs gathered as D_N := {(x_i, y_i) | i = 0, ..., N − 1}. Here the ith evaluation is y_i = f(x_i). We will assume, for now, that evaluations are exact, hence noiseless. The returned point will often be limited to the set of evaluation locations, x_N ∈ D_N, but this need not necessarily be so.⁸

7 We will regard this final point as additional to our permitted budget of N evaluations.
8 In the absence of noise, limiting to the set of evaluation locations enforces the constraint that the returned function value (the putative minimum) is known with complete confidence. This is not unreasonable; however, in some settings, the user may be satisfied with a more diffuse probability distribution over the returned value: such considerations, of course, motivate the broader probabilistic-numerics vision. It is worth noting that the limitation to the set of evaluation locations does not permit returning unevaluated points, even if their values are known exactly. As an example where this is important, consider knowing that a univariate objective is linear: then, any pair of evaluations would specify exactly the minimum, on one of the two edges of a bounded interval. In such a case, would we really want to require that this minimum could not be returned until it had been evaluated?

The importance of the loss function can be brought out through consideration of the consequences of the terminal decision of the returned point, x_N. With our notation, the loss
uation.¹⁰ It is also used to describe other functions used for selecting evaluations: we will develop some of these subtleties below. An acquisition function is also sometimes distinguished from a recommendation strategy used to select the final putative location for the minimum, x_N. In a decision-theoretic, probabilistic-numeric framework, the acquisition function should be derived from a loss function defined on the results of the final selection: we have no need for a separate recommendation strategy.

10 More precisely, in the literature, the term acquisition function is more commonly used to describe a function that is to be maximised to determine the next evaluation location. Nonetheless, to maintain consistency with our expected-loss framework, we will use the term acquisition function to describe a function that is to be minimised rather than maximised. In some cases, this will result in us describing acquisition functions as the negation of their more common forms.
function

    a(x_n | D_n) = E(Λ(x_n, y_n, D_n)) = ∫ Λ(x_n, y_n, D_n) p(y_n | D_n) dy_n.
    η := min_{i∈{0, ..., n−1}} f(x_i),
For convenience, our notation no longer reflects the real dependence of Λ_EI on D_{n+1}. To be explicit, we will pick the next evaluation location as the minimiser of E(Λ_EI(x_n)), which is identical to the minimiser of a_EI(x_n) := E(Λ_EI(x_n)) − η,

    a_EI(x_n) = ∫_{−∞}^{η} (f(x_n) − η) p(f(x_n) | D_n) df(x_n). (32.1)
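Under a GP posterior with mean mu and standard deviation sigma at the candidate, the integral (32.1) has the closed form (mu − eta)Φ(z) − sigma·φ(z) with z = (eta − mu)/sigma. A minimal sketch (the function name and interface are ours):

    import numpy as np
    from scipy.stats import norm

    def a_ei(mu, sigma, eta):
        # Eq. (32.1) for f(x_n) ~ N(mu, sigma^2) and incumbent minimum eta.
        # Non-positive by construction; to be minimised over candidates x_n.
        z = (eta - mu) / sigma
        return (mu - eta) * norm.cdf(z) - sigma * norm.pdf(z)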
Knowledge gradient (kg)4 is an acquisition function that allows 4 Frazier, Powell, and Dayanik (2009)
the relaxation of one of the assumptions of ei. In particular,
while kg is myopic, it does not restrict the returned point xN
to be amongst the set of evaluated locations. Instead, after the
nth step of the optimisation procedure (that is, after the nth
evaluation), x_N is chosen as the minimiser of the posterior mean m_{n+1} for the objective f(x), which is conditioned on D_{n+1} := {(x_i, f(x_i)) | i = 0, ..., n}, the set of evaluation pairs
after the (n + 1)th step. That is, kg considers the posterior mean
after the upcoming next step.
This modification offers much potential value. Valuing im
provement in the posterior mean rather than in the evaluations
directly eliminates the need to expend a sample simply to re
turn a low function value that may already be well-resolved
by the model. For instance, if the objective is known to be a
quadratic, the minimum will be known exactly after (any) three
evaluations, even if it has not been explicitly evaluated. In this
setting, evaluating at the minimum, as would be required by ei,
is unnecessary.
kg does introduce some risk relative to ei, however. Note
that the final value f (xN) may not be particularly well-resolved
after the (n + 1)th step: the posterior variance for f (xN) may
be high. kg, in ignoring this uncertainty, may choose xN such
that the final value f (xN) is unreliable. That is, the final value
returned, f (xN), may be very different from what the optimiser
expects, mN+1 (xN).
The kg loss is hence the final value revealed at the minimiser
of the posterior mean after the next evaluation. Let’s define that
minimiser as

    x_{n+1} := arg min_x m_{n+1}(x),

where the posterior mean function,

    m_{n+1}(x) := E(f(x) | D_{n+1}),

takes the convenient form of Eq. (4.6) for a GP. The KG loss can now be written as

    Λ_KG(D_{n+1}) := f(x_{n+1}). (32.3)
The expected loss, the acquisition function, is hence⁵

    a_KG(x_n) = E(Λ_KG(D_{n+1})) = ∬ f(x_{n+1}) p(f(x_{n+1}) | D_{n+1}) p(f(x_n) | D_n) df(x_{n+1}) df(x_n).

5 Note that the KG acquisition function
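In practice this expectation is often approximated by Monte Carlo over fantasised observations. A sketch follows; the gp object with posterior_mean_std and condition methods is a hypothetical helper interface, not a specific library API, and x_grid is a finite candidate set over which the posterior mean is minimised.

    import numpy as np

    def a_kg(x, gp, x_grid, n_samples=64, seed=0):
        # Monte Carlo estimate of the knowledge-gradient acquisition:
        # average, over fantasised observations y at x, of the minimum of
        # the updated posterior mean. To be minimised over candidates x.
        rng = np.random.default_rng(seed)
        mu, sigma = gp.posterior_mean_std(x)
        values = []
        for _ in range(n_samples):
            y = rng.normal(mu, sigma)        # fantasised observation at x
            gp_next = gp.condition(x, y)     # posterior after the fantasy
            m_next, _ = gp_next.posterior_mean_std(x_grid)
            values.append(m_next.min())      # minimum of new posterior mean
        return np.mean(values)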
goal is to compute the expected loss of evaluating next at x₀. If N evaluations remain, we must marginalise N + 1 values, y₀, ..., y_N (recall that we're assuming that the final returned value is "free", additional to our budget), and N locations, x₁, ..., x_N. The latter random variables emerge from a decision process: x_i will be the optimiser of the ith acquisition function - we assume that all future decisions will be made optimally. That is,
lem have managed to consider no more than around 20 future steps.¹² To give a flavour of how such approaches proceed, Gonzalez, Osborne, and Lawrence (2016) and Jiang et al. (2020)

12 Streltsov and Vakili (1999); Osborne, Garnett, and Roberts (2009); Marchant, Ramos, and Sanner (2014); Gonzalez, Osborne, and Lawrence (2016); Jiang et al. (2020).
propose schemes in which the strong knowledge of the sequen
tial selection of observations, as in Eq. (32.4), is set aside in
favour of a model which assumes that all locations are chosen
at once, as in a batch (batch Bayesian optimisation will be de
scribed in §34.1). This approximate model is depicted in Figure
32.3. This coupling of locations and removal of nesting provides
a substantially simpler numerics problem, one solvable using
batch Bayesian optimisation techniques for the optimisation of
locations. Gonzalez, Osborne, and Lawrence (2016) additionally use expectation propagation¹³ for the marginalisation of their values.

13 Cunningham, Hennig, and Lacoste-Julien (2011)
    η_n := min_{i∈{0, ..., n−1}} f(x_i).
by any acquisition function. Its utility here is promoted by its fixed range, a_PI(x_n) ∈ [0, 1] ⊂ ℝ, with 1 deemed completely exploitative and 0 completely explorative, easing (visual) comparisons across different iterations of optimisation. Doing so is helpful in interrogating a completed Bayesian optimisation run, whatever the acquisition function. Quick inspection might reveal, for instance, that exploitation was never performed, or if the objective was inadequately explored.
Notably, this acquisition function does not distinguish be
tween improvements of different magnitudes: any improvement
(above the threshold), however small, is equally valued. This
caps the potential rewards from gambling on exploration.
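Under a Gaussian posterior, this probability of improvement has the standard closed form; a minimal sketch (the function name and interface are ours):

    from scipy.stats import norm

    def a_pi(mu, sigma, eta):
        # Posterior probability that the candidate value falls below the
        # incumbent eta; by construction in [0, 1], independent of the
        # magnitude of the improvement.
        return norm.cdf((eta - mu) / sigma)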
We could view the nth step loss function (33.1) as emerging
from an approximation to a single loss function applicable
across all steps,
    a_IAGO(x_n) = a_ES(x_n) = E(Λ_LIL(D_{n+1})),
    a_OPES(x_n) = a_MES(x_n) = −H(y_n | x_n, D_n) + E_{y*} H(y_n | y*, x_n, D_n).
▸ in optimising machine learning model architectures, many such architectures might be simultaneously evaluated to exploit parallel computing resources; and

▸ in searching for optimal policy parameters, one (or many) agent-based simulations of an economic system may be able to be run simultaneously with a real-world trial.

1 Ginsbourger, Le Riche, and Carraro (2008, 2010); Chevalier and Ginsbourger (2013); Marmin, Chevalier, and Ginsbourger (2015, 2016); Wang et al. (2016); Rontsis, Osborne, and Goulart (2020).
Batch optimisation requires proposing a set of evaluation locations, x_B, before knowing the values f(x_B) at (any of) the locations. Here the Probabilistic Numerics framing of Bayesian optimisation offers an explicit joint probability distribution over the values f(x_B), acknowledging the probabilistic relationships amongst the batch. These relationships are crucial to selecting a good batch, where desiderata include the exclusion of redundant measurements (that are likely to return the same information) and the balancing of exploration against exploitation.

The technical challenges of batch Bayesian optimisation are bespoke to the different priors and acquisition functions. Batch approaches exist for ei¹ (sometimes called multi-point ei), ucb²,

2 Desautels, Krause, and Burdick (2012); Daxberger and Low (2017).
3 Wu and Frazier (2016); Wu et al. (2017).
4 Shah and Ghahramani (2015)
5 Azimi, Fern, and Fern (2010); Azimi,
cost, e.g. the error achieved per second.⁸ The arguments of the
objective, such as hyperparameters, might include regularisa-
tion penalties (parameters of the prior), architecture choices,
and the parameters of internal numerics procedures (such as
learning rates). Conveniently, there are often not more than 10
or 20 such hyperparameters that are known to be important: the
dimensionality of such problems is compatible with Bayesian
optimisation. In real-world cases, these hyperparameters have
been historically selected manually by practitioners: it is not
difficult to make the case for the automated alternative pro
vided by Bayesian optimisation. As such, Bayesian optimisation
is a core tool in the quest for automated machine learning.⁹ As one example, Bayesian optimisation was used to tune the hyperparameters of AlphaGo for its high-profile match against Lee Sedol.¹⁰

8 Snoek, Larochelle, and Adams (2012)
9 See, e.g., www.ml4aad.org/automl and autodl.chalearn.org.
10 Chen et al. (2018)
Perhaps most interestingly, from a Probabilistic Numerics
perspective, there are many characteristics of the objective that
can be used to inform the choice of prior and loss function. First,
note that the relevance of some hyperparameters to an objective
function for hyperparameter tuning is often conditional on the
values of other hyperparameters. This includes examples in
which the objective has a variable number of hyperparameters:
we may wish to search over neural network architectures with a
variable number of layers. In that case, whether the number of
hidden units in the third layer will influence the objective will
be conditional on the value of the hyperparameter that specifies
the number of layers. The covariance function of a gp surrogate
can be chosen to capture this structure, leading to improved
Bayesian optimisation performance.¹¹

11 Swersky et al. (2013)
Another common feature of hyperparameter tuning problems
is that the objective can return partial information even before its
computation is complete. Concretely, such computation is used
in training a machine learning model, typically through the
use of a local optimiser (see Chapter IV). Even before that local
optimisation has converged, its early stages can be predictive of
the ultimate value of the objective. This is observable through
the familiar decaying-exponential shape of so-called training
curves or learning curves, giving training loss as a function of
the number of iterations of local optimisation. If the early
parts of a training curve do not promise a competitive value
after more computation is spent (a value strongly correlated
with or equal to the associated objective function value), it
might make sense to abort the computation prematurely. This
intuition was incorporated into a Bayesian optimisation model
& The main analytical desiderata for ODE solvers are high
polynomial convergence rates and numerical stability - as
well as (additionally for probabilistic methods) a calibrated
posterior uncertainty. For extended Kalman ODE filters with
an IWP prior (and, locally, with other priors), global conver
gence rates on par with standard methods hold, namely of
the same order as the number of derivatives contained in
vector field (aka dynamics) f : V → ℝᵈ on some non-empty open set V ⊂ ℝᵈ.

By this definition, we restrict our attention to first-order autonomous ODEs - but this comes without loss of generality, as higher-order ODEs can be reformulated as first-order² and as we only exclude the (mathematically analogous) non-autonomous case f(x(t), t) to declutter our notation.³

    ẋ(t) = f(x(t)), for all t ∈ [0, T], (36.2)

with initial value x(0) = x₀ ∈ ℝᵈ.

If there is an additional final condition x(T) = x_T ∈ ℝᵈ, then Eq. (36.2) turns into a boundary value problem (BVP).⁴

Note that, under this definition, IVPs can be ill-posed; in particular, some ODEs do not have a well-defined solution, either when Eq. (36.2) is satisfied for multiple⁵ choices of x or when a local solution x on [0, τ], 0 < τ < T, cannot be extended⁶ to the entire interval [0, T]. BVPs, of course, only admit a solution if x_T equals the final value x(T) of a solution of the underlying IVP. Consequently, the solutions of many IVPs (and even more BVPs) are not well-defined. Since there is nothing to approximate in these cases, the whole concept of a numerical error loses its meaning. Fortunately, the next two theorems will enable us to exclude such cases by requiring the following assumptions.

2 Most classical methods can only solve first-order ODEs, which is no restriction to their applicability: any ODE of nth order, x^(n)(t) = f(x^(n−1)(t), ..., x'(t), x(t)), can be transformed into an ODE of first order by defining the new object x⃗(t) := [x(t), x'(t), ..., x^(n−1)(t)]ᵀ, whose dynamics are x⃗'(t) = [x'(t), ..., x^(n−1)(t), f(x⃗(t))]ᵀ. This, however, hides the derivative-relation of the components of x⃗(t). In Probabilistic Numerics, this structure can be explicitly modelled in a state-space model; see Exercise 38.2 and Bosch, Tronarp, and Hennig (2022) who considered second-order ODEs directly.
3 Note, however, that we ignore, as most textbooks, the more general case of implicit ODEs, 0 = F(x^(n)(t), ..., x(t)), because only little is known about them; see Eich-Soellner and Fuhrer (1998).
4 Strictly speaking, this is only a special case of a BVP; the boundary condition can more generally be g(y(a), y(b)) = 0 for an arbitrary function g.
Assumption 36.3. The set V ⊂ ℝᵈ is open and x₀ ∈ V. The vector field f : V → ℝᵈ is locally Lipschitz continuous, i.e. for each open subset U ⊂ V there exists a constant L_U > 0 such that

    ‖f(x) − f(y)‖ ≤ L_U ‖x − y‖, for all x, y ∈ U.

While more general preconditions for uniqueness and existence exist in the literature,⁷ the following version of the Picard-Lindelöf theorem⁸ is adapted to Definition 36.1 and will suffice for our purposes.

5 Example: Consider the ODE x'(t) = 2 sign(x(t)) √|x(t)|, which admits a unique solution for all x₀ ≠ 0. However, if x₀ = 0, then the curves x(t) = 0 for t ∈ [0, t₀] and x(t) = ±(t − t₀)² for t ∈ (t₀, T] are solutions for all t₀ ∈ [0, T].
6 Example: Consider the ODE x'(t) = x(t)². Its solution x(t) = x₀/(1 − x₀t) only exists on [0, 1/x₀] and cannot be extended beyond its singularity at t = 1/x₀.
Theorem 36.4 (Picard-Lindelöf theorem). Under Assumption 36.3, let us choose a δ > 0 such that B_δ(x₀) := {x ∈ ℝᵈ : ‖x − x₀‖ ≤ δ} ⊂ V and set M := sup_{x∈B_δ(x₀)} ‖f(x)‖. Then there exists a unique

7 For the most general existence and

where x(t) = Φ_t(a) is the solution of the IVP (36.2) with initial value x(0) = a.
Given any numerical estimate x̂(t), standard ODE solvers now extrapolate from t to t + h by approximating this flow Φ_h(x̂(t)) as precisely and cheaply as possible. In fact, Euler's method can simply be interpreted as using a first-order Taylor expansion, x̂(t + h) = x̂(t) + h f(x̂(t)), to approximate Φ_h(x̂(t)). Hence, it is only natural that Euler's method produces (by Taylor's theorem) a local error of O(h²), and (after N = T/h ∈ O(1/h) steps) a global error of O(h).
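As a minimal illustration of this first-order Taylor extrapolation (interface ours):

    import numpy as np

    def euler(f, x0, T, N):
        # Euler's method: x(t + h) ~ x(t) + h f(x(t)); local error O(h^2),
        # global error O(h) after N = T/h steps.
        h = T / N
        xs = [np.asarray(x0, dtype=float)]
        for _ in range(N):
            xs.append(xs[-1] + h * f(xs[-1]))
        return np.array(xs)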
This removal of the flow map Φ from Eq. (37.5) to Eq. (37.6) means that the solver, after every completed step, falsely assumes that its current estimate x̂(t) is the true x(t) - a property of classical numerics, referred to as uncertainty-unawareness.⁴ To satisfy this overly-optimistic internal assumption of the solver, one would have to replace x̂(t) by the exact x(t) in the entire data set, Eq. (37.6). Iterated local Hermite interpolation on this more informative (but, to the solver, inaccessible) data set indeed yields a more accurate regression of x - which is numerically demonstrated in Figure 37.2 for the popular fourth-order Runge-Kutta method (RK4).

4 For more details on uncertainty-(un)awareness in numerics, see §1 in Kersting (2020).
As a remedy, we can build more “uncertainty-aware” numer
ical solvers by modelling the ignored uncertainty with proba
bility distributions, that is by adding appropriate noise to the
Hermite extrapolation performed by classical ODE solvers (as
in the generic GP regression from §4.2.2).
But our probabilistic, regression-based view of numerics will
lead us further than that, beyond the conventional categories
of single-step and multi-step methods. To see how, let us first
recall that classical solvers iterate local Hermite interpolations
on t → t + h using the data set from Eq. (37.6) for each respective
John Skilling (1991) was the first to recognise that ODEs can,
and perhaps should, be treated as a Bayesian (GP) process re
gression problem. But two decades passed by before, in parallel
development, Hennig and Hauberg (2014) and Chkrebtii et al.
(2016) set out to elaborate on his vision. While both papers used
GP regression as a foundation, the data generation differed.
Hennig and Hauberg (2014) generated data by evaluating f at
the posterior predictive mean, and Chkrebtii et al. (2016) by eval
uating f at samples from the posterior predictive distribution,
i.e. at Gaussian perturbations of the posterior predictive mean.
This difference stemmed from separate motivations: Hennig
and Hauberg had the initial aim to deterministically reproduce
classical ODE solvers in a Bayesian model, as had been previ
ously achieved in e.g. Bayesian quadrature. Chkrebtii et al., on
the other hand, intended to sample from the distribution of solution trajectories that are numerically possible given a Bayesian model and a discretisation. Thus, these two papers founded two distinct lines of work which we call ODE filters and smoothers and perturbative solvers;⁵ see §38 and §40 respectively.

5 This is not the only categorisation of probabilistic ODE solvers. Another possible distinction would be nonparametric vs Gaussian, or deterministic vs randomised, which would both group the particle ODE filter/smoother with the perturbative solvers. Note that the perturbative solvers have been called "sampling-based" solvers in several past publications.

The former approach, after an early success of reproducing Runge-Kutta methods (Schober, Duvenaud, and Hennig, 2014),
we can define the state misalignment² z(t) := g(x⃗(t)), which is equal to 0 for all t ∈ [0, T], since x⃗(t) solves the ODE (36.1) due to (38.3). Again, we model this feature by a (this time,

2 For more intuition on this state misalignment, see §4 in Kersting, Sullivan, and Hennig (2020).
3 The pair of system x⃗(t) and obser-

    p(x₀) = N(x₀; m₀, P₀), (38.10)
    p(x_{n+1} | x_n) = N(x_{n+1}; A(h_{n+1})x_n, Q(h_{n+1})), (38.11)
    p(z_n | x_n) = δ(z_n − g(x_n)), (38.12)

5 It might, however, be advantageous
Note, however, that this ssm is only known since its introduction by Tronarp et al. (2019); all preceding publications employed a less-general linear-Gaussian ssm which we will also define below in Eqs. (38.31)-(38.33). The new nonlinear ssm (38.10)-(38.13), instead, leaves the task of finding approximations to the inference algorithm and, in this way, engenders both Gaussian (§38.3) and non-Gaussian inference methods (§38.4).

This difference between the ssms in the literature is at the heart of a common source of confusion over the new ssm (38.10)-(38.13): To some readers, it might appear that the constant data z_n = 0 contains no information whatsoever. But this is mistaken because the use of information does not only depend on the data but also on the likelihood. While in the regression-formulation of classical solvers (§37) the data was an evaluation of f, this dependence on f is now hidden in the likelihood via the definition of g, Eq. (38.7). Since g(x_n) is by construction equal to 0 for the true x_n = x⃗(t_n), the observation of the constant data z_n = 0 amounts, by the form of the likelihood, Eq. (38.12), to "conditioning on the ODE" by imposing that x'(t_n) = f(x(t_n)) - which is similar to Eq. (37.7). In §38.3.4, we will explain how the alternative linear-Gaussian ssm echoes the logic of classical solvers.
The last row can still be set flexibly by choosing the scale σ > 0 of the Wiener process and the non-negative drift coefficients (a₀, ..., a_q) ≥ 0 - which parametrise the Matérn covariance family with ν = q + 1/2, as we saw in §5.5.

Although Matérn priors are popular for GP regression, they have (in their general form) not yet been explored for ODE filtering. Only the special case of (a₀, ..., a_{q−1}) = 0 has been studied, where the only free parameter is a_q > 0. In this case, X(t) is the q-times integrated Ornstein-Uhlenbeck process with (mean-reverting) drift coefficient a_q. While this prior can be advantageous for some exponentially decreasing curves (such as radioactive decay),¹⁰ it is, to date, not known if these advantages extend to more ODEs.

10 Magnani et al. (2017)
Meanwhile, the q-times IWP, which sets (a₀, ..., a_q) = 0, has become the standard prior for ODEs because the q-times IWP extrapolates (as we saw in §5.4) by use of polynomial splines of degree q. And this polynomial extrapolation also takes place for the derivatives: under the q-times IWP prior, the ith mean of the dynamic model (38.11) is, by Eq. (5.24), for all i = 1, ..., q + 1
given by

    [A(h_{n+1}) x_n]_i = Σ_{k=i}^{q+1} ( h_{n+1}^{k−i} / (k − i)! ) [x_n]_k, (38.14)

    m₀ = [x₀, f(x₀), f^(2)(x₀), ..., f^(q)(x₀)]ᵀ ∈ ℝ^{q+1},
    P₀ = 0 ∈ ℝ^{(q+1)×(q+1)}.
1  procedure ODE Filter(f, x(0), p(x_{n+1} | x_n))
2      initialise p(x₀)                                  // with available information about x(0)
3      for n = 0 : 1 : N − 1 do
4          optional: adapt dynamic model p(x_{n+1} | x_n)
5          optional: choose step size h_n > 0
6          predict p(x_{n+1} | z_{1:n}) from p(x_n | z_{1:n})      // by (38.11)
7          observe the ODE: z_{n+1} = 0                  // according to (38.13)
8          update p(x_{n+1} | z_{1:n+1}) from p(x_{n+1} | z_{1:n}) // by (38.12)
9      end for
10     return {p(x_n | z_{1:n}); n = 0, ..., N}
11 end procedure

Algorithm 38.1: Bayesian ODE filtering iteratively computes a sequence of predictive and filtering distributions. Recall from the graphical model of filtering (Figure 5.1) (with z instead of y) that the sequential form of this inference procedure (i.e. the for-loop) is legitimate. The form of the computations in lines 6-8 depends on the choice of filter. The initialisation (line 2) is explained in §38.2.1. The optional (but recommended) lines 4 and 5 are detailed in §38.5.
1  procedure ODE Smoother(f, x(0), p(x_{n+1} | x_n))
2      {p(x_n | z_{1:n}), p(x_n | z_{1:n−1})}_{n=0,...,N} =
3          ODE Filter(f, x(0), p(x_{n+1} | x_n))
4      for n = N − 1 : −1 : 0 do
5          compute p(x_n | z_{1:N}) from p(x_{n+1} | z_{1:N})   // by (38.26)
6      end for
7  end procedure

Algorithm 38.2: Bayesian ODE smoothing extends Alg. 38.1 by iteratively updating its output, the filtering distributions p(x_n | z_{1:n}), to the full posterior p(x_n | z_{1:N}). Note that, in line 3, the filter additionally returns the posterior predictive distributions p(x_n | z_{1:n−1}) which it, for all n, computes as an intermediate step anyway; see line 6 of Alg. 38.1.
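To make lines 6-8 of Algorithm 38.1 concrete, here is a minimal Python sketch of the EKF0 instance for a scalar ODE with a q-times IWP prior. All names, the choice R = 0, and the initialisation are our own illustration; this is a sketch under these assumptions, not the reference implementation in the ProbNum package.

    import numpy as np
    from math import factorial

    def iwp_transition(q, h):
        # A(h) and Q(h) of the q-times integrated Wiener process
        # (closed forms, zero-based indices i, j):
        #   A_ij = h^(j-i)/(j-i)! for j >= i
        #   Q_ij = h^(2q+1-i-j) / ((2q+1-i-j) (q-i)! (q-j)!)
        A = np.zeros((q + 1, q + 1))
        Q = np.zeros((q + 1, q + 1))
        for i in range(q + 1):
            for j in range(q + 1):
                if j >= i:
                    A[i, j] = h ** (j - i) / factorial(j - i)
                Q[i, j] = h ** (2 * q + 1 - i - j) / (
                    (2 * q + 1 - i - j) * factorial(q - i) * factorial(q - j))
        return A, Q

    def ekf0(f, m0, P0, h, N, q=1):
        # EKF0 ODE filter for scalar x(t); state = (x, x', ..., x^(q)).
        H0 = np.eye(q + 1)[0]      # reads off x(t)
        H = np.eye(q + 1)[1]       # reads off x'(t)
        A, Q = iwp_transition(q, h)
        m, P, traj = m0, P0, [(m0, P0)]
        for _ in range(N):
            m_p = A @ m                          # prediction, Eq. (38.11)
            P_p = A @ P @ A.T + Q
            z = f(H0 @ m_p) - H @ m_p            # residual for z = 0, (38.18)
            S = H @ P_p @ H                      # innovation cov., R = 0, (38.19)
            K = P_p @ H / S                      # gain, Eq. (38.20)
            m = m_p + K * z                      # Eq. (38.21)
            P = P_p - np.outer(K, H @ P_p)       # Eq. (38.22)
            traj.append((m, P))
        return traj

    # e.g. linear decay x' = -x, initialised with [x0, f(x0)] and q = 1:
    # ekf0(lambda x: -x, np.array([1.0, -1.0]), np.zeros((2, 2)), 0.1, 100)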
Figure 38.1: Depiction of the first step of the EKF0 with 2-times IWP prior initialised at x₀ and the implied derivatives as in (38.15); the left column shows the prediction step, the right column the update step (legend: true solution, samples, mean, uncertainty). In the prediction step, the predictive distribution p(x₁) is computed by extrapolating forward in time along the dynamic model. The samples can be thought of as different possibilities for the trajectory of x(t), x'(t) and x''(t) (from the top to the bottom row). Then, in the update step, the predictive distribution is conditioned on x'(t₁) = f(m₁⁻) (recall that the intuitive ssm can be used in the Gaussian case). This yields the filtering distribution p(x₁ | z₁) whose samples are now restricted to those with a first derivative of f(m₁⁻) at t₁. The uncertainty (dashed line) is drawn at two standard deviations in both directions of the mean, thus capturing a 95% credible interval. Note the reduction of uncertainty after conditioning on z₁. The same procedure is then repeated for t₁ → t₂, that is all possible trajectories (samples) are predicted forward in time, and then restricted to x'(t₂) = f(m₂⁻) in the subsequent update step. (Second step not depicted.)
use,²³ we use Taylor approximations around the predictive mean m⁻_{n+1}, which makes the update tractable. We will restrict our attention to the most important cases of a zeroth- and first-order Taylor approximation. The resulting methods are known as the (zeroth- and first-order) extended Kalman ODE filter - abbreviated EKF0 and EKF1.²⁴

23 Sarkka (2013), §5.1
24 Tronarp et al. (2019), §2.4
In terms of specific computations, the EKF0 and EKF1 thus perform the following approximate update step using the data z_{n+1} = 0, where the difference between the EKF0 and EKF1 lies in the choice of H̆:
    ẑ_{n+1} := f(H₀m⁻_{n+1}) − H m⁻_{n+1},        (innovation residual) (38.18)
    S_{n+1} := H̆ P⁻_{n+1} H̆ᵀ + R_{n+1},           (innovation cov.) (38.19)
    K_{n+1} := P⁻_{n+1} H̆ᵀ S⁻¹_{n+1},             (gain) (38.20)
    m_{n+1} := m⁻_{n+1} + K_{n+1} ẑ_{n+1},        (38.21)
    P_{n+1} := (I − K_{n+1}H̆) P⁻_{n+1}.           (38.22)
    G_n := P_n A(h_{n+1})ᵀ (P⁻_{n+1})⁻¹,          (gain) (38.23)
    m^s_n := m_n + G_n (m^s_{n+1} − m⁻_{n+1}),    (38.24)
    P^s_n := P_n + G_n (P^s_{n+1} − P⁻_{n+1}) G_nᵀ. (38.25)
These backward-recursion equations are, simply, an exact Gaussian execution of the well-known smoothing equation:²⁸

    p(x_n | z_{1:N}) = p(x_n | z_{1:n}) ∫ [ p(x_{n+1} | x_n) p(x_{n+1} | z_{1:N}) / p(x_{n+1} | z_{1:n}) ] dx_{n+1}, (38.26)

which (in our ssm) does not differ between the EKF0 and EKF1 because their dynamic model p(x_{n+1} | x_n) is the same.

28 We already provided this equation in Eq. (5.7).
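For concreteness, one backward (Rauch-Tung-Striebel) pass step of Eqs. (38.23)-(38.25) in numpy; a minimal sketch with our own function signature:

    import numpy as np

    def rts_backward_step(m, P, m_pred_next, P_pred_next, m_s_next, P_s_next, A):
        # One smoothing step: combine the filtering moments (m, P) at step n
        # with the predictive and smoothed moments at step n + 1.
        G = P @ A.T @ np.linalg.inv(P_pred_next)       # gain, Eq. (38.23)
        m_s = m + G @ (m_s_next - m_pred_next)         # Eq. (38.24)
        P_s = P + G @ (P_s_next - P_pred_next) @ G.T   # Eq. (38.25)
        return m_s, P_s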
With this in mind, we define the (zeroth- and first-order) extended Kalman ODE smoothers, EKS0 and EKS1,²⁹ as the instances of Algorithm 38.2 that employ the EKF0 or EKF1 in line 3 and then compute line 5 by Eqs. (38.23)-(38.25). The resulting smoothing-posterior distributions p(x_n | z_{1:N}) = N(x_n; m^s_n, P^s_n) can be extended beyond the time grid {t_n}_{n=0}^N by interpolation along the dynamic model, Eq. (38.11), and therefore contain the same information as the full GP posterior of Eqs. (4.7) and (4.6).³⁰

29 In some recent publications, the EKS0 and EKS1 are referred to as EK0 and EK1 - because smoothing has become the default (see §38.6).
30 This was also discussed above for generic Gaussian smoothers in §5.2.
Remark (Relation to Bayesian quadrature). Before introducing more ODE filters, let us briefly clarify the relation to Bayesian quadrature (BQ) - namely that the EKF0/EKS0 is a generalisation of BQ in the following sense: if the ODE is really just an integral (i.e. x'(t) = g(t)), then its solution is given by x(t) = x₀ + ∫₀ᵗ g(s) ds.

    x⃗*(t_{0:N}) = argmin_{x⃗(t_{0:N})} [ ‖x⃗(t₀) − m₀‖²_{P₀} + Σ_{n=1}^N ‖x⃗(t_n) − A(h_n) x⃗(t_{n−1})‖²_{Q(h_n)} ],
    subject to z_{1:N} = 0. (38.29)

Here, for a fixed positive definite matrix P, ‖x‖²_P := xᵀP⁻¹x is the squared Mahalanobis norm associated with P. By use of Eq. (38.1), we can extract a global MAP estimate
    p(x₀) = N(x₀; m₀, P₀), (38.31)
    p(x_{n+1} | x_n) = N(x_{n+1}; A(h_{n+1})x_n, Q(h_{n+1})),
    p(y_n | x_n) = N(y_n; H x_n, R_n), (38.32)
    with data y_n = f(H₀m⁻_n), (38.33)
Figure 38.2: Bifurcation detection by use of particle ODE filtering; the panels show the bifurcating ODE flow and its particle-filtering representation. We consider the Bernoulli ODE

    σ̂² = (1/N) Σ_{n=1}^N ẑ_nᵀ S_n⁻¹ ẑ_n. (38.41)
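Eq. (38.41) is straightforward to compute from the filter's stored innovation residuals ẑ_n and covariances S_n; a minimal sketch (names ours):

    import numpy as np

    def calibrate_sigma2(residuals, innovation_covs):
        # Quasi-MLE of a global output scale, Eq. (38.41):
        # sigma^2 = (1/N) sum_n z_n^T S_n^{-1} z_n.
        N = len(residuals)
        return sum(float(z @ np.linalg.solve(S, z))
                   for z, S in zip(residuals, innovation_covs)) / N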
are much faster and more stable than the non-Gaussian ones
(particle filtering). Among the Gaussian ones, the first-order
versions (EKF1 and EKS1) make use of the Jacobian of f (avail
able by automatic differentiation).62 This tends to produce a 62 Griewank and Walther (2008), §13
more precise mean with better-calibrated uncertainty. Moreover,
smoothing returns (unlike filtering) the full GP posterior distri
bution which exploits the whole data set z1:N along the entire
time axis [0, T] - while maintaining the O(N) complexity of fil
tering, both in the number of steps and of function evaluations.
Therefore the EKS1 is, altogether, our default recommendation.
But a longer answer would also involve other methods. As
a first alternative to the EKS1, both the EKF1 and the EKS0
recommend themselves. The EKF1 omits the smoothing pass
(38.23)-(38.25) backwards through time. It is therefore a bit
cheaper, i.e. its cost O(N) has a smaller constant. This can, e.g.,
be advantageous when only the distribution at the final time T
(where the filtering and smoothing distributions coincide) is of
interest.
The EKS0, on the other hand, does not require the Jacobian.
Compared with the EKS1, this again reduces the constant in the
O(N) cost. The Jacobian is beneficial to solve stiff ODEs and to
calibrate the posterior uncertainty accurately. But when rough
uncertainty estimates suffice, the EKS0 is an attractive cheaper
alternative for non-stiff ODEs.
Lastly, the EKF0 combines both of the modifications of the
EKF1 and EKS0, with respect to the EKS1. It is thus appropriate
for the intersection of cases where the EKF1 and EKS0 are
suitable.
The other above-mentioned ODE filters and smoothers are
more expensive and trickier to implement efficiently. Hence, we
recommend considering them only in very specific cases. For
instance, if the MAP estimate is desired, the IEKS is best suited
to compute it. The particle ODE filter should only be used when
capturing non-Gaussian structures is crucial. It is thus not really
an alternative to Gaussian ODE filters and smoothers, but rather
to the perturbative solvers of §40.
Efficient implementations of our recommended choice (the
EKS1), and its next best alternatives (EKF1, EKS0, EKF0) are
readily available in the ProbNum package.
As in classical numerics, the implied⁶ global convergence rates are O(h^q), which are satisfied under some additional restrictions.

6 Strictly speaking, one can obtain even faster convergence rates because filters exchange information between adjacent steps via the derivatives - like multi-step methods do. §39.3 presents two settings (for q = 1 and q = 2) where the EKF0 indeed has global convergence rates of order h^{q+1}.

Theorem 39.3. Under the same assumptions as in Theorem 39.2, let us add the restrictions that q = 1, that the prior is a q-times integrated Wiener process and that R ∈ O(h^q) is constant for all times t_n, n = 1, ..., N. Then, there exists a constant C(T) > 0, depending on final time T > 0, such that for all sufficiently small h > 0 the corresponding global error bound of order h^q holds for both the EKF0 and EKS0.⁸

8 Recall, from Eq. (38.22), the notation P_N for the posterior variance.
Hennig (2020).
evidence for choices up to q = 11 was added.¹⁰ Hence, it is widely believed that Theorem 39.3 can be extended to a general q ∈ ℕ.

10 See §4 in Kramer and Hennig (2020).
    Z[x](t_i) = 0.
Theorem 39.6. Under Assumption 39.5 and for any prior X(t) of smoothness q,¹³ there exists a constant C(T) > 0 such that

    sup_{t∈[0,T]} ‖ ∫₀ᵗ Z[x*(s)] ds ‖ ≤ C(T) h^q, (39.2)

where x*(t) = H₀x⃗*(t) is the MAP estimate of x(t) given a discretisation 0 = t₀ < t₁ < ⋯ < t_N = T.¹⁴

13 That is, for any process with a.s. q-times differentiable sample paths. In particular, this includes the Matérn family with ν = q + 1/2 (§38.2) and its special cases: the q-times integrated Wiener process and Ornstein-Uhlenbeck process. See §2.1 in Tronarp, Sarkka, and Hennig (2021) for an alternative definition of such priors by use of Green's functions.
14 Recall Eq. (38.30).
Proof. The proof idea is to first analyse (with the help of tools
from nonlinear analysis) which regularities the information
operator Z inherits from f under Assumption 39.5, and then to
apply results from scattered-data interpolation in the Sobolev
space associated with the prior X(t). Details in Tronarp, Sarkka,
and Hennig (2021), Theorem 3. □
(NB: In particular, this uniform bound also holds for the discrete MAP estimate x*(t_{0:N}) which the IEKS aims to estimate; see §38.3.3.)
for some real²² matrix Λ whose eigenvalues lie in the unit circle around zero, i.e. for which lim_{t→∞} x(t) = 0. An ODE solver is said to be A-stable if and only if its numerical estimate x̂(t) also converges to zero (for a fixed step size h > 0) as t → ∞.²³ Accordingly, a Gaussian ODE filter is A-stable if and only if its mean estimate H₀m_n goes (for a fixed step size h > 0) to zero, as n → ∞.

22 In the classical literature, Λ ∈ ℂ^{d×d} is a complex matrix, but ODE filters are only designed for real-valued ODEs. Hence, we here use the real-valued analogue (39.3) instead; cf. Eq. (31) in Tronarp et al. (2019).
23 Dahlquist (1963)
The following recursion holds by Eqs. (38.17)-(38.21) for the predictive mean m⁻_n of both the EKF0 and EKF1 (but with different K_n):
Proof. Theorem 2 in Tronarp et al. (2019) shows (using filtering theory)²⁵ that indeed, for the EKF1, K_∞ exists and [A(h) − A(h)K_∞B] has eigenvalues in the unit circle. As explained above, the A-stability of the EKF1 follows. For the corresponding smoother (EKS1), the claim follows from the fact that, at the final time (and thus also for t → ∞), the smoothing mean coincides with the filtering mean. □

25 Anderson and Moore (1979), p. 77
The A-stability of the EKS1 was recently demonstrated on a very stiff version of the Van der Pol ODE.²⁶ There are other

26 Bosch, Hennig, and Tronarp (2021), §6.1
    T := √h · diag( h^q/q!, h^{q−1}/(q−1)!, ..., h, 1 ) (39.5)
instead of the original SDE (38.4).³² In the discrete-time ssm,

    p(x₀) = N(x₀; T⁻¹m₀, T⁻¹P₀T⁻ᵀ), (39.7)
    p(x_{n+1} | x_n) = N(x_{n+1}; T⁻¹A(h)T x_n, T⁻¹Q(h)T⁻ᵀ),

instead of Eqs. (38.10) and (38.11).³³ In other words, we re-obtain Eq. (39.7). As desired, these new matrices are now scale-invariant:

    [Ā]_{ij} = 𝕀(j ≥ i) C(q+1−i, q+1−j),   [Q̄]_{ij} = 1 / (2q + 3 − i − j). (39.8)

32 Cf. the alternative Nordsieck transformation (39.14).
33 Note that, for notational simplicity, we here assumed a constant step size h. See Kramer and Hennig (2020) for a generalisation to variable step sizes {h_n}_{n=0}^N (in zero-based indexing!).
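A small sketch computing these scale-invariant matrices (function name ours; the indices follow the 1-based convention of (39.8)):

    import numpy as np
    from math import comb

    def preconditioned_iwp(q):
        # Step-size-free IWP system matrices of Eq. (39.8):
        #   A_ij = 1{j >= i} binom(q+1-i, q+1-j),  Q_ij = 1/(2q+3-i-j).
        idx = np.arange(1, q + 2)
        A = np.array([[comb(q + 1 - i, q + 1 - j) if j >= i else 0.0
                       for j in idx] for i in idx], dtype=float)
        Q = 1.0 / (2 * q + 3 - idx[:, None] - idx[None, :])
        return A, Q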
    P⁻_{n+1} = [A_n L_P, L_Q] [A_n L_P, L_Q]ᵀ, (39.10)
    P⁻_{n+1} = Rᵀ R. (39.11)
Fortunately, this R can be obtained without assembling P⁻_{n+1} from its square-root factors as in Eq. (39.10), since it (as Exercise 39.10 reveals) is equal to the upper-triangular factor of the QR decomposition of [A_n L_P, L_Q]ᵀ. Hence, we may replace the original prediction step (39.10) by the lower-dimensional matrix multiplication (39.11) in which the Cholesky factor R is efficiently obtained by a QR decomposition of [A_n L_P, L_Q]ᵀ (i.e. without ever computing P⁻_{n+1}).

Exercise 39.10. Prove the claim in the text, i.e. show that the upper-triangular matrix in the QR decomposition of [A_n L_P, L_Q]ᵀ is the transpose of the lower-triangular Cholesky factor of P⁻_{n+1}. (For a solution, see §3.3 in Kramer and Hennig (2020).)
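In code, the prediction step then reduces to a single QR decomposition; a numpy sketch with our own naming (the QR factor is unique only up to row signs, which does not affect P⁻_{n+1} = RᵀR):

    import numpy as np

    def sqrt_predict(L_P, A, L_Q):
        # Square-root prediction: returns a lower-triangular factor L with
        # L @ L.T = A P A^T + Q, given P = L_P L_P^T and Q = L_Q L_Q^T,
        # via QR of [A L_P, L_Q]^T (cf. Exercise 39.10).
        stacked = np.hstack([A @ L_P, L_Q]).T     # shape (2D, D)
        R = np.linalg.qr(stacked, mode="r")       # upper-triangular factor
        return R.T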
Thereby, the filter can summarise the predictive distribution of Eq. (38.17) by (m⁻_{n+1}, R) instead of (m⁻_{n+1}, P⁻_{n+1}). In the subsequent update step, the innovation-covariance matrix S of Eq. (38.19) can again be captured by its Cholesky factor, which is (via analogous reasoning) again available (without assembling S) in the form of the upper-triangular QR-factor of (HR)ᵀ. The conditioning on z_{n+1} from Eqs. (38.20)-(38.22) can then be executed solely by use of this Cholesky factor of S.³⁸ Finally,

38 For details, see Appendix A.3 in
filters and smoothers were designed directly from the first principles of Bayesian estimation in ssms, without attempting to imitate classical numerical solvers. While some loose connections have been observed,⁴² it has not been studied in detail how the whole range of ODE filters and smoothers relates to classical methods. Nonetheless, earlier research⁴³ has established one important connection - in the form of an equivalence between the EKF0 with IWP prior⁴⁴ (more precisely, its filtering mean) and Nordsieck methods,⁴⁵ which we will discuss in §39.3.2. But, first, we will present another, more elementary, special case.⁴⁶

42 For instance, it has been repeatedly pointed out that both the EKF1 and the classical Rosenbrock methods make use of the Jacobian matrix of f.
43 Schober, Sarkka, and Hennig (2019)
44 It is unsurprising that the equivalences are only known for the IWP prior as it is the only one with Taylor predictions; see Eq. (38.14).
45 Nordsieck (1962)
46 Note that, even earlier, the pioneering work by Schober, Duvenaud, and Hennig (2014) showed an equivalence between a single Runge-Kutta step and GP regression with an IWP prior. However, as it relies on imitating the sub-step structure of Runge-Kutta methods, this equivalence cannot be naturally reproduced with ODE filters. Therefore, we do not further discuss this result here.

► 39.3.1 Equivalence with the Explicit Trapezoidal Rule

In the case of the 1-times IWP prior and R = 0, the Kalman gains {K_n}_{n=1}^N are the same for all n. In other words, the filter is always in its steady state K_∞ = lim_{n→∞} K_n. Therefore, the recursion for the Kalman-filtering means {m_n}_{n=0}^N is independent of n, which leads to the following equivalence.
Proposition 39.11 (Schober, Sarkka and Hennig, 2018). The EKF0 with 1-times IWP prior and R = 0 is equivalent to the explicit trapezoidal rule (aka Heun's method). More precisely, its filtering mean estimates x̂_n := H₀m_n of x(t_n) follow the explicit trapezoidal rule, written

    x̂_{n+1} = x̂_n + (h/2) [ f(x̂_n) + f(x̂_n + h f(x̂_n)) ], (39.12)
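For reference, the classical side of this equivalence; a minimal sketch (interface ours):

    import numpy as np

    def heun(f, x0, h, N):
        # Explicit trapezoidal rule (Heun's method), Eq. (39.12): the
        # classical method matched by the EKF0 with 1-times IWP prior, R = 0.
        xs = [np.asarray(x0, dtype=float)]
        for _ in range(N):
            x = xs[-1]
            x_pred = x + h * f(x)                         # Euler predictor
            xs.append(x + 0.5 * h * (f(x) + f(x_pred)))   # trapezoidal corrector
        return np.array(xs)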
    T_Nord := diag( 1, 1!/h, 2!/h², ..., (q−1)!/h^{q−1}, q!/h^q ). (39.14)

    [A]_{ij} = 𝕀(j ≥ i) C(j−1, i−1).

Schober, Sarkka, and Hennig (2019).
    x̂(t + h) = [I − ℓH] A x̂(t) + h ℓ f(H₀A x̂(t)), (39.17)
recursions are the same in the steady state of the filter, i.e. after K_{n+1} has reached its limit K_∞ := lim_{n→∞} K_n.⁵² Note, however, that K_∞ depends on the ssm on which the EKF0 performs inference. While more equivalences with different Nordsieck methods should hold, the following result is so far the only one known.

52 For the details, see §3.1 in Schober, Sarkka, and Hennig (2019).
Remark. For the q-times IWP prior with q = 2, this theorem gives h^{q+1} instead of the h^q convergence rates suggested by Theorem 39.3 and Corollary 39.7, but only in the steady state. These rates indeed hold in practice.⁵³

53 Schober, Sarkka, and Hennig (2019), Figure 4
In the same way, one could interpret any instance of the EKF0 (in its steady state) as a Nordsieck method. But there is no guarantee that the corresponding Nordsieck method is practically useful, for at least two reasons: First, the numerical stability could be insufficient if more than two derivatives are included (see §39.2.2); second, the order of such a method will depend on the entries of K_∞ according to Theorem 4.2 from Skeel (1979).
IWP prior models x(t) and its first q derivatives. At any given discrete time point t_n (n = 0, ..., N), the filtering mean estimate for x(t_n), computed by the EKF0, will of course depend on all previous function evaluations {y_i = f(H₀m⁻_i)}_{i=1}^n. But, in the steady state, the mean estimates for the q modelled derivatives [x'(t_n), ..., x^(q)(t_n)] will depend only on a finite number j ∈ ℕ of these function evaluations, namely on {y_i = f(H₀m⁻_i)}_{i=n−j+1}^n. What is j for a given q ∈ ℕ?
40  Perturbative Solvers
So far in this chapter on ODEs, all methods (with the sole ex
ception of the particle ODE filter) were probabilistic, but not
stochastic. By this, we mean that they use probability distribu
tions to approximate x(t), but are not randomised (i.e. they
return the same output when run multiple times). This design
choice stems from a conviction held by some in the pn community¹ that it is never optimal to inject stochasticity into any deterministic approximation problem - except in an adversarial setting where it can be provably optimal. This view is, however, not shared by others who make the following arguments in favour of randomisation in ODE solvers.

1 See the corresponding discussion in §12.3.
known as chaos. Clearly, such chaotic long-term behaviour cannot be captured by Gaussians (or any other parametric family).⁶ But even in most non-chaotic ODEs, the long-term effect of the

    ẋ₁ = σ(x₂ − x₁),
    ẋ₂ = x₁(ρ − x₃) − x₂, (40.1)
    ẋ₃ = x₁x₂ − βx₃.
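Eq. (40.1) is the Lorenz system; as a vector field in code (the classical chaotic parameter values σ = 10, ρ = 28, β = 8/3 are our illustrative choice):

    def lorenz(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        # Lorenz vector field, Eq. (40.1).
        return [sigma * (x[1] - x[0]),
                x[0] * (rho - x[2]) - x[1],
                x[0] * x[1] - beta * x[2]]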
    ε_n(h_n) ≈ ξ_n(h_n) = ∫₀^{h_n} X_n(s) ds, (40.2)

Eq. (37.2).
(if f is globally Lipschitz.)
the sense that both the expected global error of the former and
the fixed global error of the latter are in O(hq). However, if
the added noise is larger than that (i.e. p < q), then the ex
pected global error is only in O(hp), i.e. larger than without
randomisation.
This is intuitive. Loosely speaking, it means that one can at most perturb the local error (in O(h^{q+1})) by a slightly larger¹⁶

16 Due to the independence of random

Assumption 40.6. The random variables {H_n}_{n=1}^N satisfy the following properties, for all n ∈ {1, ..., N}:

(iii) there exists p > 1/2 and C > 0 independent of n such that
Just like Theorem 40.5, Theorem 40.6 shows that the local error
rate of q + 1 and the local standard deviation of p + 1/2 combine
to a convergence rate of min( p, q). For the same reasons as
above, it is again recommended to choose p = q. Note that, like
in a weaker earlier version of Theorem 40.5 by Conrad et al., the
maximum is outside of the expectation here and f is assumed
to be globally Lipschitz. For Theorem 40.5 these restrictions
were later lifted by Lie, Stuart, and Sullivan (2019); maybe this
is also possible for Theorem 40.6. Since the desired properties
of geometric integrators hold for all h > 0, they a.s. carry over to a sample of {X_n} from Eq. (40.5).¹⁹

19 Abdulle and Garegnani (2020), Thm. 4
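To make the role of p concrete, here is a minimal sketch of an additively perturbed Euler solver in the spirit of the methods above; the function and the noise scaling h^{p+1/2} per step are our illustration (for Euler, q = 1, so p = 1 matches the local error rate):

    import numpy as np

    def perturbed_euler(f, x0, T, N, p=1, seed=0):
        # Each Euler step receives i.i.d. Gaussian noise with standard
        # deviation h^(p + 1/2); choosing p = q preserves the expected
        # O(h^q) global error while modelling the discretisation error.
        rng = np.random.default_rng(seed)
        h = T / N
        x = np.asarray(x0, dtype=float)
        xs = [x]
        for _ in range(N):
            noise = rng.normal(0.0, h ** (p + 0.5), size=x.shape)
            x = x + h * f(x) + noise
            xs.append(x)
        return np.array(xs)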
Notably, both of these methods can be thought of as frequentist as they sample i.i.d. approximations of x(t). The particle ODE filter (from §38.4), on the other hand, is Bayesian as it computes a dependent set of samples that approximate the true posterior distribution. While there are first experimental comparisons (Tronarp et al., 2019, §5.4), more research is needed to understand the differences between all of these nonparametric solvers. There are further important methods - such as the aforementioned one by Chkrebtii et al. (2016) as well as stochastic versions of linear multistep methods²⁰ and of implicit solvers.²¹ Finally, note that Abdulle and Garegnani (2021) recently published an extension of their ODE solver (40.5) to PDEs by randomising the meshes in finite-element methods.

20 Teymur, Zygalakis, and Calderhead (2016)
21 Teymur et al. (2018)
two astronomical objects.²³ This concrete example setup is taken from Hairer, Nørsett, and Wanner (1993, p. 129). Figure 40.1 shows two bodies (representing earth and moon) of mass 1 − μ and μ, respectively, rotating in a plane around their common centre of mass, with a third body of negligible mass orbiting them in the same plane. The equations of motion are given by the ODE system

23 Hairer, Nørsett, and Wanner (1993), p. 129
It is known that there are initial values that give closed, periodic
orbits. One example is
In the same way as for IVPs (§38.2), we can model the BVP solution x(t) and its derivative x'(t) by a GP prior p(x⃗(t_{0:N})), such that x(t) = H₀x⃗(t_{0:N}) and x'(t) = H x⃗(t_{0:N}) are contained in x⃗(t). The BVP posterior distribution is now written

In contrast to Definition 36.1, there is a more general boundary condition g(x(0), x(T)) = 0. Accordingly, the referenced probabilistic BVP solvers are not limited to our restrictive definition, which only serves to declutter the notation.
out the linear-time formulation) been applied to numerically approximate shortest paths and distances on Riemannian manifolds. This approach has concrete applications in fields such as Riemannian statistics and brain imaging, where quantifying the numerical uncertainty plays an important role for the final objective. This was first recognised by Hennig and Hauberg (2014) and further exploited by Schober et al. (2014). The current state of the art was provided by Arvanitidis et al. (2019). However, note that, in principle, all above (newer) probabilistic BVP solvers are applicable to this task as well.

tain equal information. Hence, filtering is less natural as it moves only in one direction through time.
8 See §38.3.2 and §38.3.3 for these two approaches, respectively.
    z := [z₁(t₁), ..., z₁(t_M), ..., z_d(t₁), ..., z_d(t_M)]ᵀ ∈ ℝ^{dM}.
In the case of ODE filters and smoothers, the (to date) only publi
cation is Kersting et al. (2020). Like the perturbative approaches,
it managed to reduce the overconfidence in the likelihood by in
serting an EKF0 in lieu of a classical ODE solver, as Figure 41.2
demonstrates. But, on top of that, it exploited the resulting,
more structured Gaussian form of the likelihood to estimate its
gradients and Hessian matrices in the following way.
First, let us assume w.l.o.g. that h > 0 is fixed, and recall,
from Eqs. (38.1) and (38.2), that the functions X and x' are a
priori jointly modelled by a Gauss-Markov process, written
    p( [x; x'] ) = GP( [x; x']; [x₀; f(x₀)], [ k, k∂ ; ∂k, ∂k∂ ] ),
where the posterior mean is given by²⁰

    m_θ(T_M) = x₀1_M + k∂_{T_M T_N} [∂k∂_{T_N T_N}]⁻¹ [y_{1:N} − f(x₀, θ)1_N], (41.8)

with T_N = {jh}_{j=1}^N, y_{1:N} = [f(m⁻(h), θ), ..., f(m⁻(Nh), θ)]ᵀ, and 1_M = [1, ..., 1]ᵀ ∈ ℝ^M. While the dependence of m⁻ on θ is at least as involved as the dependence of f(x, θ) on θ, the

20 We here, for its simplicity, use the notation of the linear-Gaussian ssm (38.31)-(38.33) to work with the EKF0 and EKS0. This is admissible because Kalman filtering (and smoothing) on this ssm is equal to the EKF0 and EKS0 on the general ssm (38.10)-(38.13), which yields Eq. (41.8); see §38.3.4. For notational simplicity we set R = 0, but a non-zero R > 0 can simply be added to ∂k∂_{T_N T_N} if needed.
following simplifying assumption fortunately holds for many ODEs.²¹

Assumption 41.1. The dependence of f on θ is linear. More precisely, we assume that f(x, θ) = Σ_{i=1}^n θ_i f_i(x), for some continuously differentiable f_i : ℝᵈ → ℝᵈ, for all i = 1, ..., n. (Note that no linearity assumption is placed on the f_i.)

21 In fact, most ODEs collected in Hull et al. (1972, Appendix I), a standard set of ODE benchmarking problems, satisfy Assumption 41.1 either immediately or after re-parametrisation. In particular, Assumption 41.1 does not restrict the numerical difficulty of the ODE since f inherits all inconveniences from the {f_i}_{i=1}^n, for which nothing (but the usual continuous differentiability) is assumed.

Under this assumption, Eq. (41.8) becomes²²

    m_θ = x₀1_M + Jθ, (41.9)

22 Note that Eq. (41.9) reveals how easily x₀, if unknown, can be treated as an additional parameter.
    ∇²L(z) ≈ Jᵀ (P + σ²I_M)⁻¹ J (41.14)

for the gradient and Hessian of the corresponding log-likelihood L(z) := log p(z | θ).²⁵ Both of these estimators can then be inserted into any existing gradient-based sampling or optimisation method to infer θ. Classically, such gradient and Hessian estimators are not accessible without an additional sensitivity analysis (Rackauckas et al., 2018).

25 Note that the size of these gradients and Hessians scales, as desired, inversely with the combined uncertainty of the numerical error P and statistical noise σ²I_M. This means that they inherit the uncertainty-awareness of the probabilistic likelihood (41.12).
To summarise, we have seen how the EKF0 gives rise to
an uncertainty-aware26 likelihood (41.12) with freely available 26 See Figure 41.2.
estimators for its gradient (41.13) and Hessian (41.14). Our expo
sition followed Kersting et al. (2020) who, however, executed this
strategy for the EKF0 (and not the EKS0). The filtering case is
more complicated (because the equivalence with GP regression
only holds locally), but essentially analogous. For the filter
ing case, the experiments in Kersting et al. (2020) demonstrate
that, indeed, the resulting gradient-based inference schemes
are significantly more efficient (i.e. they need fewer forward
ODE solutions) than their gradient-free counterparts - both
for MCMC sampling and for optimisation. Recently, Tronarp,
Bosch, and Hennig (2022) introduced another method to solve
ODE inverse problems by insertion of ODE filtering/smoothing
into the likelihood.
    p(y_n^obs | x_n) = N( y_n^obs ; H^obs x_n, R^obs ). (41.15)
    s(x) := sup_{f∈H, ‖f‖≤1} (m_x − f_x)²
          = sup_{f∈H, ‖f‖≤1} ( Σ_{i=1}^N f(x_i) w_i(x) − f_x )²
          = sup_{f∈H, ‖f‖≤1} ⟨ Σ_i w_i(x) k(·, x_i) − k(·, x), f(·) ⟩²_H,

we can rewrite:

    s(x) = ‖ Σ_i w_i(x) k(·, x_i) − k(·, x) ‖²_H
         = Σ_{ij} w_i(x) w_j(x) k(x_i, x_j) − 2 Σ_i w_i(x) k(x, x_i) + k(x, x)
         = k_{xx} − k_{xX} K⁻¹ k_{Xx},
    A₀ = exp(−ξh),

    A₁ = exp( h [ 0, 1 ; −ξ², −2ξ ] ) = e^{−ξh} [ ξh + 1, h ; −ξ²h, 1 − ξh ],

    A₂ = exp( h [ 0, 1, 0 ; 0, 0, 1 ; −ξ³, −3ξ², −3ξ ] )
       = e^{−ξh} [ ½(ξ²h² + 2ξh + 2),  h(ξh + 1),         ½h² ;
                   −½ξ³h²,             −(ξ²h² − ξh − 1),  −½h(ξh − 2) ;
                   ½ξ³h(ξh − 2),       ξ²h(ξh − 3),       ½(ξ²h² − 4ξh + 2) ]
Now write down the same equation for t + 1, and replace all occurrences of P_{t+1} using Eq. (42.1). Then use the matrix inversion lemma, Eq. (15.9), to simplify the expression.
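The matrix inversion lemma itself (stated as Eq. (15.9) in the text; the form below is the generic Woodbury identity, which we assume matches it up to notation) can be sanity-checked numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.diag(rng.uniform(1, 2, 4))   # invertible diagonal part
    U = rng.standard_normal((4, 2))
    C = np.eye(2)
    V = U.T

    # (A + U C V)^-1 = A^-1 - A^-1 U (C^-1 + V A^-1 U)^-1 V A^-1
    lhs = np.linalg.inv(A + U @ C @ V)
    Ainv = np.linalg.inv(A)
    rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
    print(np.allclose(lhs, rhs))        # True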
$$ f(x) = \mathcal{N}(x; 0, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{x^2}{2s^2} \right), \quad \text{and} \quad p(x) = \mathcal{N}(x; 0, \sigma^2). $$
$$ \big( (A \otimes B)(V_A \otimes V_B) \big)_{ia,jb} = \lambda_{A,a} \lambda_{B,b} (V_A \otimes V_B)_{ia,jb}, $$

$$ |A \otimes B| = |V_A \otimes V_B| \, |D_A \otimes D_B| \, |(V_A \otimes V_B)^{-1}| = |D_A \otimes D_B|. $$

Note again that (D_A ⊗ D_B) is a diagonal matrix with diagonal entries (D_A ⊗ D_B)_{ij,ij} = λ_{A,i} λ_{B,j}. So its determinant is

$$ |D_A \otimes D_B| = \prod_{i=1}^{N} \prod_{j=1}^{M} \lambda_{A,i} \lambda_{B,j} = |A|^M \, |B|^N. $$
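A quick numerical confirmation of this determinant identity, for A ∈ R^{N×N} and B ∈ R^{M×M} (our sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    N, M = 3, 4
    A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)  # s.p.d.
    B = rng.standard_normal((M, M)); B = B @ B.T + M * np.eye(M)

    lhs = np.linalg.det(np.kron(A, B))
    rhs = np.linalg.det(A) ** M * np.linalg.det(B) ** N
    print(np.isclose(lhs, rhs))   # True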
$$ \big( (I \otimes S) \Sigma \big)_{nm,k\ell} = \sum_{ij} \delta_{ni} S_{jm} \, \delta_{ik} \delta_{j\ell} W_{ij} = \delta_{nk} W_{k\ell} S_{\ell m}, $$
Selected Exercises from Chapter IV “Local Optimisation”
(The evidence is not a function of f_X, and thus does not affect the location of the maximum; it can be ignored.) The location of maxima is not affected by any monotonic re-scaling, so we can take the logarithm (a monotonic transformation) of the posterior. The logarithm of a Gaussian (again ignoring all constant scaling factors) is a quadratic form. Because we are interested in minimisation instead of maximisation, we also multiply by (-1), to get

$$ = r(x) + \frac{1}{2} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2. $$
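In code, minimising this expression is just regularised least squares. A minimal sketch (ours, not the book's), assuming a linear model f(x) = φ(x)ᵀw and the regulariser r(w) = ‖w‖²/(2α²), so that the minimiser coincides with the MAP estimate:

    import numpy as np

    rng = np.random.default_rng(2)
    Phi = rng.standard_normal((20, 3))                 # feature matrix, rows phi(x_i)
    w_true = np.array([1.0, -2.0, 0.5])
    y = Phi @ w_true + 0.1 * rng.standard_normal(20)   # noisy observations
    alpha2 = 1.0                                       # prior variance

    # Minimiser of r(w) + 1/2 sum_i (y_i - phi(x_i)^T w)^2:
    w_map = np.linalg.solve(Phi.T @ Phi + np.eye(3) / alpha2, Phi.T @ y)
    print(w_map)  # close to w_true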
A well-known theorem by de Finetti7 states that this is the case if the likelihood is exchangeable (invariant under permutations of the data). However, the opposite direction does not always work: not all regularisers r(w) and not all loss functions ℓ(y_i; w) can be interpreted as the negative logarithm of a prior and a likelihood, respectively, since their exponential may not have a finite integral, and thus cannot be normalised to become a probability measure. For example, the hinge loss used in support vector machines is not the logarithm of a likelihood.8

7 Diaconis and Freedman (1980)
8 For further discussion, see pp. 144-145 in Rasmussen and Williams (2006).
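To make the hinge-loss example concrete (a worked step of ours): with ℓ_hinge(z) = max(0, 1 - z), the candidate likelihood exp(-ℓ_hinge(z)) equals 1 for every z ≥ 1, so

$$ \int_{-\infty}^{\infty} \exp\big( -\max(0, 1 - z) \big) \, \mathrm{d}z \;\geq\; \int_{1}^{\infty} 1 \, \mathrm{d}z = \infty, $$

and no normalisation constant can turn exp(-ℓ_hinge) into a probability density.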
$$ \int_{-\infty}^{\eta_n} y_n \, \mathcal{N}\big( y_n; m(X_n), V(X_n) \big) \, \mathrm{d}y_n = \int_{-\infty}^{\eta_n - m(X_n)} \big( z + m(X_n) \big) \frac{1}{\sqrt{2\pi V(X_n)}} \exp\left( -\frac{z^2}{2 V(X_n)} \right) \mathrm{d}z, $$

using the substitution z = y_n - m(X_n).
$$ \mathcal{N}\left( \begin{bmatrix} m_1 \\ m_2 \end{bmatrix},\; \underbrace{\begin{bmatrix} C_1 & 0 \\ 0 & C_2 \end{bmatrix}}_{=:\,C} \right), $$

$$ p\left( \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ y_1 \\ y_2 \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ y_1 \\ y_2 \end{bmatrix};\; \begin{bmatrix} 0 \\ 0 \\ m_1 \\ m_2 \end{bmatrix},\; \begin{bmatrix} \sigma^2 I & 0 & \sigma^2 I & 0 \\ 0 & \sigma^2 I & 0 & \sigma^2 I \\ \sigma^2 I & 0 & C_1 + \sigma^2 I & 0 \\ 0 & \sigma^2 I & 0 & C_2 + \sigma^2 I \end{bmatrix} \right). $$
Selected Exercises from Chapter V “Global Optimisation”
$$ p\left( \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix};\; \sigma^2 (C + \sigma^2 I)^{-1} \begin{bmatrix} y_1 - m_1 \\ y_2 - m_2 \end{bmatrix},\; \sigma^2 I - \sigma^4 (C + \sigma^2 I)^{-1} \right), $$

$$ p\left( \varepsilon_1 - \varepsilon_2 \,\middle|\, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) = \mathcal{N}\left( \varepsilon_1 - \varepsilon_2;\; \sigma^2 (C + \sigma^2 I)^{-1} \big( y_1 - y_2 - (m_1 - m_2) \big),\; 2\sigma^2 - 2\sigma^4 (C + \sigma^2 I)^{-1} \right). $$
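Both conditionals above are instances of the standard Gaussian conditioning rule; the following generic numpy helper (our sketch, with hypothetical names) can be used to verify such closed-form expressions on random instances:

    import numpy as np

    def condition_gaussian(mu, Sigma, idx_a, idx_b, y_b):
        # p(x_a | x_b = y_b) for x ~ N(mu, Sigma).
        Saa = Sigma[np.ix_(idx_a, idx_a)]
        Sab = Sigma[np.ix_(idx_a, idx_b)]
        Sbb = Sigma[np.ix_(idx_b, idx_b)]
        gain = Sab @ np.linalg.inv(Sbb)
        m = mu[idx_a] + gain @ (y_b - mu[idx_b])
        S = Saa - gain @ Sab.T
        return m, S

    # Example: 4-dim joint, condition the first two coords on the last two.
    rng = np.random.default_rng(0)
    L = rng.standard_normal((4, 4))
    Sigma = L @ L.T + 4 * np.eye(4)
    mu = np.zeros(4)
    m, S = condition_gaussian(mu, Sigma, [0, 1], [2, 3], np.array([0.3, -0.1]))
    print(m, S)

Conditioning the joint over (ε_1, ε_2, y_1, y_2) on the observed (y_1, y_2) block in this way should reproduce the stated mean and covariance.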
$$ \int \left( \alpha_{\mathrm{UCB}}(x_n, D_n) + y_n \right) p(y_n \mid D_{n-1}) \, \mathrm{d}y_n - \beta_n^{1/2} \int \left( y_n - m(x_n) \right) p(y_n \mid D_{n-1}) \, \mathrm{d}y_n $$
Index

A-stability, 322, 323
acquisition function, 254
Adam, 230
affine transformation, 23
agent, 3, 4, 11
aleatory uncertainty, 11, 71
analytic, 1
Arenstorf orbit, 336
Armijo condition, see line search
Arnoldi process, 135
atomic operation, 70
Automated Machine Learning, 276
average-case, 182
average-case analysis, 7
backpack package, see software
Bayes' theorem, 21
Bayesian, 8
Bayesian ODE filters and smoothers, see ODE filters and smoothers
Bayesian Optimisation, 251
Bayesian quadrature, 72
belief propagation, 43
BFGS, 236
bias, 11
bifurcation, 311
boundary value problem, 286, 339
Brownian motion, 50
calibration, 12
CARE, see Riccati equation
Cauchy-Schwarz inequality, 37
chain graph, 42
chaos, 332
Chapman-Kolmogorov equation, 43, 45
Chebyshev polynomials, 103
Cholesky decomposition, 36, 138
code, see software
companion matrix, 53
conditional distribution, 21
conditional entropy, 272
conjugate gradients, 134, 137, 144, 145, 165, 166
  probabilistic, 165
conjugate prior, 55, 56
continuous time, 48, 296
continuous-time Riccati equation, see Riccati equation
convergence rate, 201
convex function, 199
covariance, 23
covariance function, see kernel
cubic splines, see splines
curse of dimensionality, 79
Dahlquist test equation, 323
DARE, see Riccati equation
data, 21
decision theory, 4
Dennis family, 236
detailed balance, 85
determinant lemma, 130
determinantal point process, 80
DFP, 237
Dirac delta, 27
discrete time, 42, 297
discrete-time algebraic Riccati equation, see Riccati equation
dynamic model, 45, 298
early stopping, 224
EKF0, EKF1, see ODE filters and smoothers
EKS0, EKS1, see ODE filters and smoothers
empirical risk minimisation, 37, 204
emukit package, see software
entropy, 23
epistemic uncertainty, 11, 70
equidistant grid, 92
ergodicity, 85
error analysis vs error estimation, 92
error function, 215
Euler's method, 289
Eulerian integrals, 56
evidence, 99
expected improvement, 259
expensive evaluations, 256
exploration-exploitation trade-off, 247
exponential kernel, see Ornstein-Uhlenbeck process
exponentiated quadratic kernel, see Gaussian kernel
filter, 44
  Kalman, 46
  ODE, see ODE filters and smoothers
  optimal, 46
  particle, 309
forward-backward algorithm, see sum-product algorithm
Frobenius norm, see norm
function-space view, 28
Galerkin condition, 143
gamma distribution, 56
gamma function, 56
Gauss-Markov process, 41
Gauss-inverse-gamma, 56
Gauss-inverse-Wishart, 59
Gaussian
  elimination, 133