PROBABILISTIC NUMERICS

Computation as Machine Learning

PHILIPP HENNIG, MICHAEL A. OSBORNE, AND HANS P. KERSTING
Probabilistic Numerics

Probabilistic numerical computation formalises the connection between machine learning and
applied mathematics. Numerical algorithms approximate intractable quantities from computable
ones. They estimate integrals from evaluations of the integrand, or the path of a dynamical system
described by differential equations from evaluations of the vector field. In other words, they
infer a latent quantity from data. This book shows that it is thus formally possible to think of
computational routines as learning machines, and to use the notion of Bayesian inference to build
more flexible, efficient, or customised algorithms for computation.
The text caters for Masters’ and PhD students, as well as postgraduate researchers in artificial
intelligence, computer science, statistics, and applied mathematics. Extensive background material
is provided along with a wealth of figures, worked examples, and exercises (with solutions) to
develop intuition.

Philipp Hennig holds the Chair for the Methods of Machine Learning at the University of
Tübingen, and an adjunct position at the Max Planck Institute for Intelligent Systems. He has
dedicated most of his career to the development of Probabilistic Numerical Methods. Hennig's
research has been supported by Emmy Noether, Max Planck, and ERC fellowships. He is a
co-Director of the Research Program for the Theory, Algorithms and Computations of Learning
Machines at the European Laboratory for Learning and Intelligent Systems (ELLIS).

Michael A. Osborne is Professor of Machine Learning at the University of Oxford, and a
co-Founder of Mind Foundry Ltd. His research has attracted £10.6M of research funding and has
been cited over 15,000 times. He is very, very Bayesian.
Hans P. Kersting is a postdoctoral researcher at INRIA and École Normale Supérieure in Paris,
working in machine learning with expertise in Bayesian inference, dynamical systems, and
optimisation.
‘This impressive text rethinks numerical problems through the lens of probabilistic inference and
decision making. This fresh perspective opens up a new chapter in this field, and suggests new and
highly efficient methods. A landmark achievement!’
- Zoubin Ghahramani, University of Cambridge

‘In this stunning and comprehensive new book, early developments from Kac and Larkin have been
comprehensively built upon, formalised, and extended by including modern-day machine learning,
numerical analysis, and the formal Bayesian statistical methodology. Probabilistic numerical
methodology is of enormous importance for this age of data-centric science and Hennig, Osborne,
and Kersting are to be congratulated in providing us with this definitive volume.’
- Mark Girolami, University of Cambridge and The Alan Turing Institute

‘This book presents an in-depth overview of both the past and present of the newly emerging
area of probabilistic numerics, where recent advances in probabilistic machine learning are used to
develop principled improvements which are both faster and more accurate than classical numerical
analysis algorithms. A must-read for every algorithm developer and practitioner in optimization!’
- Ralf Herbrich, Hasso Plattner Institute

‘Probabilistic numerics spans from the intellectual fireworks of the dawn of a new field to its
practical algorithmic consequences. It is precise but accessible and rich in wide-ranging, principled
examples. This convergence of ideas from diverse fields in lucid style is the very fabric of good
science.’
- Carl Edward Rasmussen, University of Cambridge

‘An important read for anyone who has thought about uncertainty in numerical methods; an
essential read for anyone who hasn’t’
- John Cunningham, Columbia University

‘This is a rare example of a textbook that essentially founds a new field, re-casting numerics on
stronger, more general foundations. A tour de force.’
- David Duvenaud, University of Toronto

‘The authors succeed in demonstrating the potential of probabilistic numerics to transform the way
we think about computation itself.’
- Thore Graepel, Senior Vice President, Altos Labs
PHILIPP HENNIG
Eberhard Karls Universität Tübingen, Germany

MICHAEL A. OSBORNE
University of Oxford

HANS P. KERSTING
École Normale Supérieure, Paris

PROBABILISTIC NUMERICS

COMPUTATION AS MACHINE LEARNING

Cambridge University Press

University Printing House, Cambridge CB2 8BS, United Kingdom


One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314-321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi - 110025, India
103 Penang Road, #05-06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107163447
DOI: 10.1017/9781316681411
© Philipp Hennig, Michael A. Osborne and Hans P. Kersting 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-16344-7 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To our families.
Measurement owes its existence to Earth
Estimation of quantity to Measurement
Calculation to Estimation of quantity
Balancing of chances to Calculation
and Victory to Balancing of chances.

Sun Tzu - The Art of War


§4.18: Tactical Dispositions
Translation by Lionel Giles, 1910
Contents

Acknowledgements page ix
Symbols and Notation xi
Introduction 1

I Mathematical Background 17
1 Key Points 19
2 Probabilistic Inference 21
3 Gaussian Algebra 23
4 Regression 27
5 Gauss-Markov Processes: Filtering and SDEs 41
6 Hierarchical Inference in Gaussian Models 55
7 Summary of Part I 61

II Integration 63
8 Key Points 65
9 Introduction 69
10 Bayesian Quadrature 75
11 Links to Classical Quadrature 87
12 Probabilistic Numerical Lessons from Integration 107
13 Summary of Part II and Further Reading 119

III Linear Algebra 123


14 Key Points 125
15 Required Background 127
16 Introduction 131
17 Evaluation Strategies 137
18 A Review of Some Classic Solvers 143
19 Probabilistic Linear Solvers: Algorithmic Scaffold 149
20 Computational Constraints 169
21 Uncertainty Calibration 175

22 Proofs 183
23 Summary of Part III 193

IV Local Optimisation 195


24 Key Points 197
25 Problem Setting 199
26 Step-Size Selection - a Case Study 203
27 Controlling Optimisation by Probabilistic Estimates 221
28 First- and Second-Order Methods 229

V Global Optimisation 243


29 Key Points 245
30 Introduction 247
31 Bayesian Optimisation 251
32 Value Loss 259
33 Other Acquisition Functions 267
34 Further Topics 275

VI Solving Ordinary Differential Equations 279


35 Key Points 281
36 Introduction 285
37 Classical ODE Solvers as Regression Methods 289
38 ODE Filters and Smoothers 295
39 Theory of ODE Filters and Smoothers 317
40 Perturbative Solvers 331
41 Further Topics 339

VII The Frontier 349


42 So What? 351

VIII Solutions to Exercises 357

References 369
Index 395
Acknowledgements

Many people helped in the preparation of this book. We, the authors, extend our gratitude to the
following people, without whom this book would have been impossible.

We are particularly grateful to Mark Girolami for his involvement during the early stages of this
book as a project. Though he could not join as an author in the end, he provided a lot of support
and motivation to make this book a reality.

We would like to deeply thank the many people who offered detailed and thoughtful comments on
drafts of the book: Ondrej Bajgar, Nathanael Bosch, Jon Cockayne, Michael Cohen, Paul Duckworth,
Nina Effenberger, Carl Henrik Ek, Giacomo Garegnani, Roman Garnett, Alexandra Gessner, Saad Hamid,
Marius Hobbhahn, Toni Karvonen, Nicholas Krämer, Emilia Magnani, Chris Oates, Jonathan Schmidt,
Sebastian Schulze, Thomas Schön, Arno Solin, Tim J. Sullivan, Simo Särkkä, Filip Tronarp,
Ed Wagstaff, Xingchen Wan, and Richard Wilkinson.

Philipp Hennig

I would like to thank my research group, not just for thorough proof-reading, but for an intense
research effort that contributed substantially to the results presented in this book. And, above
all, for wagering some of their prime years, and their career, on me, and on the idea of
probabilistic numerics: Edgar Klenske, Maren Mahsereci, Michael Schober, Simon Bartels,
Lukas Balles, Alexandra Gessner, Filip de Roos, Frank Schneider, Emilia Magnani, Niklas Wahl and
Hans-Peter Wieser, Felix Dangel, Frederik Kunstner, Jonathan Wenger, Agustinus Kristiadi,
Nicholas Krämer, Nathanael Bosch, Lukas Tatzel, Thomas Gläßle, Julia Grosse, Katharina Ott,
Marius Hobbhahn, Motonobu Kanagawa, Filip Tronarp, Robin Schmidt, Jonathan Schmidt,
Marvin Pförtner, Nina Effenberger, and Franziska Weiler.

I am particularly grateful to those among this group who have contributed significantly to the
development of the probnum library, which would not exist at all, not even in a minimal state,
without their commitment.

Last but not least, I am grateful to my wife Maike Kaufman. And to my daughters Friederike and
Iris. They both arrived while we worked on this book and drastically slowed down progress on it
in the most wonderful way possible.

Michael A. Osborne

I would like to thank Isis Hjorth, for being the most valuable
source of support I have in life, and our amazing children
Osmund and Halfdan - I wonder what you will think of this
book in a few years?

Hans P. Kersting

I would like to thank my postdoc adviser, Francis Bach, for giving me the freedom to allocate
sufficient time to this book. I am grateful to Dana Babin, my family, and my friends for their
continuous love and support.
Symbols and Notation

Bold symbols ($\mathbf{x}$) are used for vectors, but only where the fact that a variable is a vector is relevant.
Square brackets indicate elements of a matrix or vector: if $x = [x_1, \ldots, x_N]$ is a row vector, then
$[x]_i = x_i$ denotes its entries; if $A \in \mathbb{R}^{n \times m}$ is a matrix, then $[A]_{ij} = A_{ij}$ denotes its entries. Round
brackets $(\cdot)$ are used in most other cases (as in the notations listed below).

Notation and meaning:

$a \propto c$ : a is proportional to c; there is a constant k such that $a = k \cdot c$.
$A \wedge B$, $A \vee B$ : the logical conjunctions "and" and "or"; i.e. $A \wedge B$ is true iff both A and B are true, $A \vee B$ is true iff $\neg A \wedge \neg B$ is false.
$A \otimes B$ : the Kronecker product of matrices A, B. See Eq. (15.2).
$A \circledast B$ : the symmetric Kronecker product. See Eq. (19.16).
$A \odot B$ : the element-wise product (aka Hadamard product) of two matrices A and B of the same shape, i.e. $[A \odot B]_{ij} = [A]_{ij} \cdot [B]_{ij}$.
$\vec{A}$ : the vector arising from stacking the elements of a matrix A row after row; the inverse operation restores the matrix A. See Eq. (15.1).
$\operatorname{cov}_p(x, y)$ : the covariance of x and y under p; that is, $\operatorname{cov}_p(x, y) := \mathbb{E}_p(x \cdot y) - \mathbb{E}_p(x)\,\mathbb{E}_p(y)$.
$C^q(V, \mathbb{R}^d)$ : the set of q-times continuously differentiable functions from V to $\mathbb{R}^d$, for some $q, d \in \mathbb{N}$.
$\delta(x - y)$ : the Dirac delta, heuristically characterised by the property $\int f(x)\,\delta(x - y)\,\mathrm{d}x = f(y)$ for functions $f : \mathbb{R} \to \mathbb{R}$.
$\delta_{ij}$ : the Kronecker symbol: $\delta_{ij} = 1$ if $i = j$, otherwise $\delta_{ij} = 0$.
$\det(A)$ : the determinant of a square matrix A.
$\operatorname{diag}(x)$ : the diagonal matrix with entries $[\operatorname{diag}(x)]_{ij} = \delta_{ij}[x]_i$.
$\mathrm{d}\omega_t$ : the notation for an Itô integral in a stochastic differential equation. See Definition 5.4.
$\operatorname{erf}(x)$ : the error function $\operatorname{erf}(x) := \tfrac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,\mathrm{d}t$.
$\mathbb{E}_p(f)$ : the expectation of f under p; that is, $\mathbb{E}_p(f) := \int f(x)\,\mathrm{d}p(x)$.
$\mathbb{E}_{|Y}(f)$ : the expectation of f under $p(f \mid Y)$.
$\Gamma(z)$ : the Gamma function $\Gamma(z) := \int_0^\infty x^{z-1} \exp(-x)\,\mathrm{d}x$. See Eq. (6.1).
$\mathcal{G}(\cdot\,; a, b)$ : the Gamma distribution with shape $a > 0$ and rate $b > 0$, with probability density function $\mathcal{G}(z; a, b) := \tfrac{b^a z^{a-1}}{\Gamma(a)} e^{-bz}$.
$\mathcal{GP}(f; \mu, k)$ : the Gaussian process measure on f with mean function $\mu$ and covariance function (kernel) k. See §4.2.
$H_p(x)$ : the (differential) entropy of the distribution p(x); that is, $H_p(x) := -\int p(x) \log p(x)\,\mathrm{d}x$. See Eq. (3.2).
$H(x \mid y)$ : the (differential) entropy of the conditional distribution $p(x \mid y)$; that is, $H(x \mid y) := H_{p(x \mid y)}(x)$.
$I(x; y)$ : the mutual information between random variables X and Y; that is, $I(x; y) := H(x) - H(x \mid y) = H(y) - H(y \mid x)$.
$I$, $I_N$ : the identity matrix (of dimensionality N): $[I]_{ij} = \delta_{ij}$.
$\mathbb{I}(\cdot \in A)$ : the indicator function of a set A.
$K_\nu$ : the modified Bessel function for some parameter $\nu \in \mathbb{C}$; that is, $K_\nu(x) := \int_0^\infty \exp(-x \cosh t)\cosh(\nu t)\,\mathrm{d}t$.
$\mathcal{L}$ : the loss function of an optimisation problem (§26.1), or the log-likelihood of an inverse problem (§41.2).
$\mathcal{M}$ : the model capturing the probabilistic relationship between the latent object and computable quantities. See §9.3.
$\mathbb{N}, \mathbb{C}, \mathbb{R}, \mathbb{R}_+$ : the natural numbers (excluding zero), the complex numbers, the real numbers, and the positive real numbers, respectively.
$\mathcal{N}(x; \mu, \Sigma) = p(x)$ : the vector x has the Gaussian probability density function with mean vector $\mu$ and covariance matrix $\Sigma$. See Eq. (3.1).
$X \sim \mathcal{N}(\mu, \Sigma)$ : the random variable X is distributed according to a Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
$O(\cdot)$ : Landau big-Oh; for functions f, g defined on $\mathbb{N}$, the notation $f(n) = O(g(n))$ means that $f(n)/g(n)$ is bounded as $n \to \infty$.
$p(y \mid x)$ : the conditional probability density function for variable Y having value y, conditioned on variable X having value x.
$\operatorname{rk}(A)$ : the rank of a matrix A.
$\operatorname{span}\{x_1, \ldots, x_n\}$ : the linear span of $\{x_1, \ldots, x_n\}$.
$\operatorname{St}(\cdot\,; \mu, \lambda_1, \lambda_2)$ : the Student's-t probability density function with parameters $\mu \in \mathbb{R}$ and $\lambda_1, \lambda_2 > 0$. See Eq. (6.9).
$\operatorname{tr}(A)$ : the trace of matrix A; that is, $\operatorname{tr}(A) = \sum_i [A]_{ii}$.
$A^\intercal$ : the transpose of matrix A: $[A^\intercal]_{ij} = [A]_{ji}$.
$\mathcal{U}_{a,b}$ : the uniform distribution with probability density function $p(u) := \mathbb{I}(u \in (a, b))$, for $a < b$.
$\mathbb{V}_p(x)$ : the variance of x under p; that is, $\mathbb{V}_p(x) := \operatorname{cov}_p(x, x)$.
$\mathbb{V}_{|Y}(f)$ : the variance of f under $p(f \mid Y)$; that is, $\mathbb{V}_{|Y}(f) := \operatorname{cov}_{p(f \mid Y)}(f, f)$.
$\mathcal{W}(V, \nu)$ : the Wishart distribution with probability density function $\mathcal{W}(x; V, \nu) \propto |x|^{(\nu - N - 1)/2} e^{-\operatorname{tr}(V^{-1}x)/2}$. See Eq. (19.1).
$x \perp y$ : x is orthogonal to y, i.e. $\langle x, y \rangle = 0$.
$x := a$ : the object x is defined to be equal to a.
$x = a$ : the object x is equal to a by virtue of its definition.
$x \leftarrow a$ : the object x is assigned the value of a (used in pseudo-code).
$X \sim p$ : the random variable X is distributed according to p.
$\mathbf{1}, \mathbf{1}_d$ : a column vector of d ones, $\mathbf{1}_d := [1, \ldots, 1]^\intercal \in \mathbb{R}^d$.
$\nabla_x f(x, t)$ : the gradient of f w.r.t. x. (We omit the subscript x if redundant.)
Introduction

The Uncertain Nature of Computation

Computation is the resource that has most transformed our age. Its application has established
and extracted value from globe-spanning networks, mobile communication, and big data. As its
importance to humankind continues to rise, computation has come at a substantial cost. Consider
machine learning, the academic discipline that has underpinned many recent technological
advances. The monetary costs of modern massively parallel graphics processing units (GPUs), that
have proved so valuable to much of machine learning, are prohibitive even to many researchers and
prevent the use of advanced machine learning in embedded and other resource-limited systems. Even
where there aren't hard resource constraints, there will always be an economic incentive to
reduce computation use. Troublingly, there is evidence¹ that current machine learning models are
also the cause of avoidable carbon emissions. Computations have to become not just faster, but
more efficient.

¹ R. Schwartz et al. "Green AI" (2019); D. Patterson. "The Carbon Footprint of Machine Learning
Training Will Plateau, Then Shrink" (2022).

Much of the computation consumption, particularly within machine learning, is due to problems
like solving linear equations, evaluating integrals, or finding the minimum of nonlinear
functions, all of which will be addressed in different chapters in this text. These so-called
numerical problems are studied by the mathematical subfield of numerics (short for: numerical
analysis). What unites these problems is that their solutions, which are numbers, have no
analytic form. There is no known way to assign an exact numerical value to them purely by
structured, rule-based thought. Methods to compute such numbers, by a computer or on paper, are
of an approximate nature. Their results are not exactly right, and we do not know precisely how
far off they are from the true solution; otherwise we could just add that known difference (or
error) and be done.

That we are thus uncertain about the answer to a numerical problem is one of the two central
insights of probabilistic numerics (pn). Accordingly, any numerical solver will be uncertain
about the accuracy of its output - and, in some cases, also about its intermediate steps. These
uncertainties are not ignored by classical numerics, but typically reduced to scalar bounds. As a
remedy, pn brings statistical tools for the quantification of uncertainty to the domain of
numerics.
What, precisely, are the benefits of quantifying this numerical uncertainty with probability
measures?

For a start, a full probability distribution is a richer output than a sole approximation (point
estimate). This is particularly useful in sequences of computations where a numerical solution is
an input to the next computational task - as is the case in many applications, ranging from the
elementary (matrix inversion for least squares) to the complex (solving differential equations
for real-world engineering systems). In these settings, a probabilistic output distribution
provides the means to propagate uncertainty to subsequent steps.
But pn's uncertainty-aware approach does not stop with quantifying the uncertainty over the final
output. Rather, it offers an uncertainty-aware alternative to the design of numerical methods in
two ways:

Oftentimes, numerical uncertainty already appears within numerical algorithms, because their
intermediate values are subject to approximations themselves. The resulting intra-algorithmic
accumulation of such uncertainties calls for an appropriate amount of caution. It appears, e.g.,
whenever expensive function evaluations are replaced by cheaper simulations (such as when cheaper
surrogate functions are used in global optimisation); or when imprecise steps are iterated (such
as when ODE solvers concatenate their extrapolation steps). Probabilistic numerical methods
account for these uncertainties using probability measures. This enables such methods to make
smarter uncertainty-aware decisions, which becomes most salient through their formulation as
probabilistic agents (as detailed below).
Moreover, probability distributions allow the expected structure of the numerical problem to be
encoded more precisely in the solver: numerical tasks can be solved by any number of algorithms,
and it is difficult to choose among them. Not all algorithms for, say, integration, work on all
integration problems. Some require the integrand to be highly regular, others only that it be
continuous, or even just integrable at all. If their requirements are met, they produce a
sequence of numbers that converge to the intractable solution. But they do not all converge at
the same speed. In particular, algorithms that work on a restricted set of problems, when applied
to a problem from that set, often converge faster than methods designed to work on a larger,
less-restrictive domain. Academic research in numerics has traditionally concentrated on generic
algorithms that work on large spaces of problems. Such methods are now widely available as
toolboxes. These collections are valuable to practitioners because they save design time and
leverage expert knowledge. But generic methods are necessarily inefficient. Generic methods are,
in essence, overly cautious. The more that is known about a computation to be performed before
one delves into it, the easier it will be to make progress. Some knowledge is, however, not
completely certain but only expected with high probability, and thus cannot be encoded by a
function space alone. A probabilistic numerical method, on the other hand, can exploit such
less-than-certain expectations by distributing its prior probability mass to expected subsets of
the function space, and away from less likely scenarios.
The mathematical toolkit of pn allows probabilistic algorithms that leverage these benefits of
uncertainty quantification. Such algorithms have proven able to achieve dramatic reductions in
computation.

[Figure 1: A computational agent interacts with the world just as its numerical algorithms
interact with the agent. That is, the agent receives data from the world and selects actions to
perform in the world. The numerical algorithm receives evaluations from the agent and selects
computations (actions) for the agent to perform. For example, the agent might feed evaluations of
an objective to an optimiser, which selects the next evaluations for the agent to make. pn
recognises that a numerical algorithm is just as much an agent as one directly interacting with
the world.]

The second of the central insights of pn is that a numerical algorithm can be treated as an
agent. For our purposes, an agent is an entity able to take actions so as to achieve its goals.
These agents receive data from the environment, use the data to make predictions, and then use
the predictions to decide how to interact with the environment. Machine learning often aims to
build such agents, most explicitly within its subfield of reinforcement learning. As another
example, consider an image classifier using active learning. This classifier receives labelled
images, uses those to predict the labels of unlabelled images, and then uses those predictions to
decide which new data should be acquired.

It is possible to treat a numerical algorithm as just such an agent, as laid out diagrammatically
in Figure 1. Traditionally, a numerical method takes in data, in the form of evaluations (e.g. of
an integrand), and returns predictions, or estimates (e.g. of the integral). A numerical method
must also provide a rule, perhaps an adaptive one, determining which computations are actually
performed (e.g. which nodes to evaluate): these are decisions. There is thus a feedback loop: an
agent that decides itself which data to collect may be inefficient if it collects redundant data,
or unreliable if it neglects to probe crucially informative areas of the data-domain. Explicitly,
pn treats numerical algorithms just as machine learning often treats its algorithms: as agents.
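To make this feedback loop concrete, here is a minimal runnable sketch - our own illustration,
not an algorithm from this book; the objective and all names are invented - of a numerical method
acting as an agent. It maintains a Gaussian process belief over an unknown objective and
repeatedly evaluates wherever its predictive uncertainty is largest, a crude stand-in for a full
expected-loss criterion:

```python
# A toy numerical agent (illustrative only): Gaussian process belief plus an
# "evaluate where uncertainty is largest" decision rule.
import numpy as np

def rbf(a, b, ell=0.2):
    """Squared-exponential covariance between point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

f = lambda x: np.sin(3 * x) + 0.5 * x      # the "unknown" objective
grid = np.linspace(0.0, 2.0, 200)          # candidate evaluation nodes
X, y = [], []                              # data gathered so far

for _ in range(8):
    if X:
        Xa = np.array(X)
        K = rbf(Xa, Xa) + 1e-9 * np.eye(len(X))    # Gram matrix (jittered)
        Kgx = rbf(grid, Xa)
        # Posterior variance on the grid: k(g, g) - k(g, X) K^{-1} k(X, g).
        var = 1.0 - np.einsum("ij,ij->i", Kgx @ np.linalg.inv(K), Kgx)
    else:
        var = np.ones_like(grid)                   # prior variance
    x_next = grid[np.argmax(var)]   # decision: most informative evaluation
    X.append(x_next)                # action: run the computation
    y.append(f(x_next))             # inference happens at the top of the loop

Xa, ya = np.array(X), np.array(y)
K = rbf(Xa, Xa) + 1e-9 * np.eye(len(X))
mean = rbf(grid, Xa) @ np.linalg.solve(K, ya)      # posterior mean belief
print("believed maximiser:", grid[np.argmax(mean)])
```

Pure uncertainty sampling ignores the actual loss and the cost of each evaluation; later
chapters, in particular Chapter V, replace this crude rule with genuine decision-theoretic
criteria.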
More precisely, pn is concerned with probabilistic agents. As above, pn uses probability
distributions to quantify uncertainty. Quantified uncertainty is crucial to all agents, and
numerical agents are no exception. In particular, this understanding of numerical solvers brings
probabilistic decision theory² to numerics, yielding a range of advantages.

² M. J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. 2015.

For one thing, a numerical agent must decide when to stop. Unlike in the computation of analytic
expressions, there is not always an obvious end to a numerical procedure: the error in the
current estimate is unknown, so it can be difficult to determine when to stop. Generic and
encapsulated methods take a cautious approach, usually aiming to satisfy high demands on
precision. Doing so requires many iterations, each further reducing uncertainty and improving
precision, but each coming at a cost. That is, this cautious approach consumes much computation.
pn provides a new solution to this problem: if uncertainty is well-quantified, we may be
satisfied with (quantified) vague, and thus cheap, answers. pn hence provides a principled means
of stopping early, enabling fewer iterations and a reduction in computation.

Uncertainty also guides exploration. An intelligent agent, making a decision, must weigh the
uncertain consequences of its actions, occasionally gambling by taking an uncertain action, in
order to explore and learn. Performing such exploration is core to effective numerics. Predictive
uncertainty unlocks a fundamental means of quantifying the value of a numerical iteration, and
weighing it against the real cost of the computation it will consume. There are almost always
choices for the character of an iteration, such as where to evaluate an integrand or an objective
function to be optimised. Not all iterations are equal, and it takes an intelligent agent to
optimise the cost-benefit trade-off.
On a related note, a well-designed probabilistic numerical agent gives a reliable estimate of its
own uncertainty over its result. This helps to reduce bias in subsequent computations. For
instance, in ODE inverse problems, we will see how simulating the forward map with a
probabilistic solver accounts for the tendency of numerical ODE solvers to systematically over-
or underestimate solution curves. While this does not necessarily give a more precise ODE
estimate (in the inner loop), it helps the inverse-problem solver to explore the parameter space
more efficiently (in the outer loop). As these examples highlight, pn hence promises to make more
effective use of computation.

The Deep Roots of Probabilistic Numerics

Probabilistic Numerics has a long history. Quite early in the history of numerical computation,
people noted that its demands closely matched what was provided by the professional process of
guesswork known as statistical inference. It seemed to those people that probability, central to
the process of inference, might be a natural language in which to describe computation as the
gathering of information. In the first chapter of his seminal nineteenth-century Calcul des
Probabilités,³ Henri Poincaré mused about assigning probabilities to not-yet-computed numbers
(here in rough translation from the French):

"The need for probability only arises out of uncertainty: it has no place if we are certain that
we know all aspects of a problem. But our lack of knowledge also must not be complete, otherwise
we would have nothing to evaluate. There is thus a spectrum of degrees of uncertainty. While the
probability for the sixth decimal digit of a number in a table of logarithms to equal 6 is 1/10 a
priori, in reality, all aspects of the corresponding problem are well determined, and, if we
wanted to make the effort, we could find out its exact value. The same holds for interpolation,
for the integration methods of Cotes or Gauss, etc."

³ H. Poincaré. Calcul des Probabilités. 1896. §I.7, pp. 30-31. Emphasis in the original.
Although Poincaré found it natural to assign probabilities to the value of determined but unknown
numbers, it seems the idea of uncertainty about a fully, formally determined quantity did not sit
well with a majority of mathematicians. Rather than assigning degrees of certainty to properties
of a problem, it seemed more acceptable to formally state assumptions required to be strictly
true at the onset of a theorem. When considering a particular numerical estimation rule, one can
then analyse the convergence of the estimate in an asymptotic fashion, thereby formally proving
that (and in which sense) the rule is admissible. This approach leaves the estimation of the
rule's error after a finite number of steps as a separate and often subordinate part of the
routine.

By contrast, probabilistic inference makes the formulation of probability distributions,
characterising possible error, primary. These distributions will include an explicit prior on the
latent quantity to be inferred, and an equally explicit likelihood function capturing the
relationship of computable numbers to that latent quantity. This approach may be more or less
restrictive than the asymptotic analysis described above. It might well be more cumbersome to
state, subject to philosophical intricacies, and requires care not to introduce new intractable
tasks along the way. Nonetheless, phrasing computation as probabilistic inference offers
substantial advantages. For one thing, the approach yields a posterior probability distribution
for quantities of interest that self-consistently combines estimate and uncertainty, and can
approximately track their effect through several steps of the computation.
In the twentieth century, the idea of computation as inference lingered in the margins. In the
1950s and 1960s - the golden age of probability theory following the formal works of Kolmogorov -
perhaps Sul'din should be mentioned as the first to return to address it in earnest.⁴ He focused
on the approximation of functions, a task underlying many numerical methods. In statistics, where
this process is known as regression, it took on a life of its own, leading to a deep
probabilistic framework studied extensively to this day, whose early probabilistic
interpretations were driven by people like Sard (1963), or Kimeldorf and Wahba (1970). Parallel
to Sul'din's work in Russia, the task of integration found the attention of Ajne and Dalenius
(1960) in Scandinavia. The English-speaking audience perhaps first heard of these connections
from Larkin (1972), who went on to write several pieces on the connection between inference and
computation. Anyone who missed his works might have had to wait over a decade. By then, the plot
had thickened and authors in many communities became interested in Bayesian ideas for numerical
analysis. Among them Kadane and Wasilkowski (1985), Diaconis (1988), and O'Hagan (1992). Skilling
(1991) even ventured boldly toward solving differential equations, displaying the physicist's
willingness to cast aside technicalities in the name of progress. Exciting as these insights must
have been for their authors, they seem to have missed fertile ground. The development also
continued within mathematics, for example in the advancement of information-based complexity⁵ and
average-case analysis.⁶ But the wider academic community, in particular users in computer
science, seems to have missed much of it. But the advancements in computer science did pave the
way for the second of the central insights of pn: that numerics requires thinking about agents.

⁴ Sul'din (1959); Sul'din (1960).
⁵ Traub, Wasilkowski, and Woźniakowski (1983); Packel and Traub (1987); Novak (2006).
⁶ Ritter (2000).

Twenty-First-Century Probabilistic Numerics

The twenty-first century brought the coming-of-age of machine learning. This new field raised new
computational problems, foremost among them the presence of big data sets in the computational
pipeline, and thus the necessity to sub-sample data, creating a trade-off between computational
cost and precision. Numerics is inescapably important to machine learning, with popular tracks at
its major conferences, and large tracts of machine learning masters' degrees, devoted to
optimisation alone. But machine learning also caused a shift in perspective on modelling itself.

Modelling (or inference) used to be thought of as a passive mathematical map, from data to
estimate. But machine learning often views a model as an agent in autonomous interaction with its
environment, most explicitly in reinforcement learning. This view of algorithms as agents is, as
above, central to pn.

Machine learning has been infused with the viewpoints of physicists and other scientists, who are
accustomed to limits on precision and the necessity of assumptions. The Bayesian viewpoint on
inference soon played a prominent (albeit certainly not the only leading) role in laying its
theoretical foundations. Textbooks like those of Jaynes and Bretthorst (2003), MacKay (2003),
Bishop (2006), and Rasmussen and Williams (2006) taught a generation of new students - the
authors amongst them - to think in terms of generative models, priors, and posteriors. Machine
learning's heavy emphasis on numerics couldn't help but lead some of those students to apply
their hammer, of probabilistic inference, to the nails of the numerical problems that they
encountered. These students were bound to revive questions from the history of pn. For instance,
if computation is inference, should it then not be possible to build numerical algorithms that:

1. rather than taking in a logical description of their task and returning a (floating-)point
   estimate of its solution, instead take probability distributions as inputs and outputs?;

2. use an explicit likelihood to capture a richer, generative description of the relation between
   the computed numbers and the latent, intractable quantity in question?; and

3. treat the CPU or GPU as an interactive source of data?

In 2012, the authors of this text co-organised, with J. P. Cunningham, a workshop at the Neural
Information Processing Systems conference on the shores of Lake Tahoe to discuss these questions.
We were motivated by our own work on Bayesian Optimisation and Bayesian Quadrature. At the time,
the wider issue seemed new and unexplored to us. By a stroke of luck, we managed to convince
Persi Diaconis to make the trip up from Stanford and speak. His talk pointed us to a host of
prior work. In search of an inclusive, short title for the workshop, we had chosen to call it
Probabilistic Numerics (pn). In the years since, this label has been increasingly used by a
growing number of researchers who, like us, feel that the time has come to more clearly and
extensively connect the notions of inference and computation. The 2012 workshop also marked the
beginning of a fruitful collaboration between the machine learning community and statisticians
like M. Girolami⁷ and C. J. Oates, as well as applied mathematicians like T. J. Sullivan,
I. Ipsen, H. Owhadi, and others. They ensured that existing knowledge in either community was not
forgotten,⁸ and their research groups have since laid an increasingly broad and deep foundation
to the notion of computation as probabilistic and, also more narrowly and carefully defined,
Bayesian inference.⁹ They have also undertaken a commendable effort to build a community for pn
within the mathematical fields.

⁷ Hennig, Osborne, and Girolami (2015).
⁸ Owhadi and Scovel (2016); Oates and Sullivan (2019).
⁹ Cockayne et al. (2019b).

Although numerous interesting insights have already been reached, just as many questions are
still scarcely explored. Many of them emerge from new application areas, and new associated
computational problems, in the age of Big Data, GPUs, and distributed, compartmental computation.
We made a conscious decision not to use the word Bayesian in naming Probabilistic Numerics. We
will unashamedly adopt a Bayesian view within this text, which forms a natural framework for the
core ideas within pn; those ideas can also be found in many alternative approaches to machine
learning and statistics. But there is also an important way in which pn can be viewed as
non-Bayesian. The Bayesian norm enforces hygiene between modelling and decision-making. That is,
you write down a prior capturing as much of your background knowledge as possible, and do
inference to compute a posterior. Then, with that posterior, you write down a loss function and
use expected loss to choose an action. The Bayesian's counterpart, the frequentist, has the loss
function in mind from the outset.

However, in numerics, we are rarely afforded the luxury of using models informed by all available
prior knowledge. Such models are usually more computationally expensive, in themselves, than
models that are built on only weak assumptions. Sometimes, we are willing to spend a little more
computation on a model to save even more computation in solving its numerical task, but other
times we are not. That is, in considering an additional computational cost of a model, we must
consider whether it is justified by improved performance on the given numerical task: this
performance is measured by a loss function. pn is hence, in this way, more akin to the
frequentist view in muddling the loss function and the prior. That is, numerics requires us to
make, in some cases, drastic simplifications to our models in order to achieve usable
computational complexity. This can be conceived as letting some (vaguely specified) loss function
on computation dictate which elements of the prior can be incorporated.

This Book

This book aims to give an overview of the emerging new area of Probabilistic Numerics,
particularly as influenced by contemporary machine learning. Even at this early point in the
field, we have to concede that a complete survey is not possible: we are bound to under-represent
some rapidly developing viewpoints, and apologise in advance to their authors. Our principal
goals will be to study uses and roles for uncertainty in numerical computation, and to employ
such uncertainty in making optimal decisions about computation.

Invariably, we will capture uncertainty in the language of probabilities. Any quantity that is
not known to perfect (machine) precision will ideally be assigned an explicit probability
distribution or measure. We will study algorithms that take and return probabilities as inputs
and outputs, respectively, but also allow for uncertain, imprecise computations to take place
within the algorithm itself. These algorithms will explicitly be treated as agents, intelligently
making decisions about which computations to perform. Within our probabilistic framework, these
decisions will be selected as those that minimise an expected loss.

Along the way, we will make several foundational observations. Here is a preview of some of them,
forming a quick tour through the text:

Classical methods are probabilistic

Classical methods often have clear probabilistic interpretations. Across the spectrum of
numerical tasks, from integration to linear algebra, nonlinear optimisation, and solving
differential equations, many of the foundational, widely used numerical algorithms can be
explicitly motivated as maximum a posteriori or mean point-estimates arising from concrete
(typically Gaussian) prior assumptions. The corresponding derivations take up a significant part
of this text. These insights are crucial for two reasons.

First, finding a probabilistic interpretation for existing methods shows that the idea of a
probabilistic numerical method is not some philosophical pipe dream. Probabilistic methods
tangibly exist, and are as fast, and as reliable, as the methods people trust and use every day -
because those very methods already are probabilistic numerical methods, albeit they are not
usually presented as such.
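As a small, concrete instance of this claim - our own numerical check, with illustrative nodes
and integrand; the derivation appears in the integration chapters - the posterior mean of
Bayesian quadrature under a Wiener-process prior, with covariance k(x, x') = min(x, x') (which
pins f(0) = 0), reproduces the classical trapezoidal rule:

```python
# Numerical check (illustrative): Bayesian quadrature with a Wiener-process
# prior on f (so f(0) = 0) has the trapezoidal rule as its posterior mean.
import numpy as np

x = np.array([0.2, 0.55, 0.8, 1.0])       # evaluation nodes in (0, 1]
y = np.sin(3 * x)                         # integrand values f(x_i)

K = np.minimum(x[:, None], x[None, :])    # [K]_ij = k(x_i, x_j) = min(x_i, x_j)
kF = x - 0.5 * x**2                       # kF_i = integral of min(t, x_i) over [0, 1]
bq = kF @ np.linalg.solve(K, y)           # posterior mean of the integral of f

# Trapezoidal rule on the same nodes, with the prior's f(0) = 0 prepended.
xs0, ys0 = np.r_[0.0, x], np.r_[0.0, y]
trap = np.sum(0.5 * (ys0[1:] + ys0[:-1]) * np.diff(xs0))
print(bq, trap)                           # agree up to round-off
```

Unlike the classical rule, the posterior also carries a variance, quantifying the expected
integration error - exactly the calibrated uncertainty discussed below.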
Second, once phrased as probabilistic inference, the classical methods provide a solid,
well-studied and understood foundation for the development of new numerical methods and novel
functionality addressing modern challenges. It may be tempting to try and invent probabilistic
methods de novo; but the analytical knowledge and practical experience embodied in numerical
libraries is the invaluable condensed labour of generations of skilled applied mathematicians. It
would be a mistake to throw them overboard. Classical numerical methods are computationally
lightweight (their cost per computational step is often constant, and small), numerically stable
(they are not thrown off by small machine errors), and analytically efficient (they converge to
the true solution at a "good" rate). When developing new functionality, one should strive to
retain these properties as much as possible, or at least to take inspiration from the form of the
classical methods. At the same time, we also have to leave a disclaimer at this point: given the
breadth and depth of numerical analysis, it is impossible to give a universal introduction in a
book like this. The presentations of classical methods in these pages are designed to give a
concise introduction to specific ideas, and highlight important aspects. We refer interested
readers from non-numerical backgrounds to the established literature, referenced in the
margins.¹⁰

¹⁰ The text contains a large number of notes in the margins. It is generally possible to read
just the main text and ignore these offshoots. Some of them provide short reference to background
knowledge for the readers' convenience. Others are entry points to related literature.

Numerical methods are autonomous agents

We will employ the decision-theoretic, expected-loss-minimisation framework to design numerical
algorithms. Often, it is not just that the way a classical numerical method combines collected
floating-point numbers into a numerical estimate can be interpreted as a posterior expectation or
estimate. Additionally, the decision rule for computing those numbers in the first place also
arises naturally from the underlying prior probabilistic model, through a decision-theoretic
treatment. Thus, such a classical numerical algorithm can indeed be interpreted as an autonomous
agent acting consistently with its internal probabilistic "beliefs". At first glance, this
insight has primarily aesthetic value. But we will find that it directly motivates novel
algorithms that act in an adaptive fashion.

Numerics should not be random

Importantly, we will not identify probabilistic uncertainty with randomness.¹¹ Randomness is but
one possible way for uncertainty to arise. This kind is sometimes called aleatory or stochastic,
in contrast to the epistemic uncertainty capturing a lack of knowledge. Probability theory makes
no formal distinction between the two; they are both captured by spreading unit measure over a
space of hypotheses. But there are some concepts, notably that of bias, which require a careful
separation of these types of uncertainty. Furthermore, randomness is often used within numerics
today to make (tough!) decisions, for instance, about where to make evaluations of an integrand
or objective. We will argue that randomness is ill-suited to this role, as can be seen in
describing a numerical algorithm as an agent (whose expected-loss-minimising action will never be
returned by a random number generator). We will show (§12.3) how non-random,
expected-loss-minimising decisions promise the reward of dramatically lowered computation
consumption. This is not a fundamental rejection of the concept of Monte Carlo methods, but it
reveals deep philosophical subtleties surrounding these algorithms that raise concrete research
questions.

¹¹ §12.3 delves into this topic.

Numerics must report calibrated uncertainty

To complete the description of classical methods as probabilistic, it does not suffice to note
that these methods give point estimates that arise from the most probable or expected value, the
"location" of a posterior distribution. We also have to worry about the "width", captured by
variance and support, of the posterior around this point estimate. We must ask whether the
posterior can indeed be endowed with an interpretation as a notion of uncertainty, connected to
the probable error of the numerical method. The sections that study connections to existing
methods will provide some answers in this regard. We will frequently argue that certain classical
methods arise from a family of Gaussian prior measures, parametrised by either a scalar or
multivariate scale of uncertainty. All members of that family give rise to Gaussian posteriors
with the same mean (which is identical to the classical method) and a posterior standard
deviation that contracts at the same rate, and thus only differs by the constant scale. We will
see that the contraction rate of this posterior variance is related to classical convergence
rates of the point estimate (in its non-adaptive form, it is a conservative, worst-case bound on
the error). We will further show that the remaining constant parameter can be inferred at runtime
with minimal computational overhead, using either probabilistic, statistical, or algebraic
estimation rules. Throughout, we will argue that the provision of well-calibrated quantifications
of uncertainty is crucial to numerical algorithms, whether classical or new. Such reliable
uncertainty underpins both the trust that can be placed in numerical algorithms and effective
decision-making about computation.

Imprecise computation is to be embraced

Having adopted reliable quantifications of uncertainty, we are freed from the burden of ensuring
that numerical calculations must always be highly precise. Not all numerical problems are equal:
computation can be saved, by lowering precision, on the least important. From our perspective,
some of the most pressing contemporary numerical problems arise in data science and machine
learning, in the processing pipelines of big data sets. In these settings, data are frequently
sub-sampled at runtime. This introduces stochastic disturbances to computed numbers that
significantly lower the precision of the computation. Yet sub-sampling also drastically reduces
computational cost compared to computations on the entire data set. The trade-off between
computation cost and precision results in tuning parameters (e.g. specifying how large a random
sub-sample should be chosen) being exposed to the user, a major irk to practitioners of data
science. pn here offers value in enabling optimisation of this trade-off, freeing the user of the
nasty task of fiddling with parameters.

pn consolidates numerical computation and statistical inference

Both numerical solvers and statistical-inference methods convert the information available to
them into an approximation of their quantity of interest. In numerics, this information consists
of evaluations of analytic expressions (or functions) - e.g. of the integrand for quadrature. In
statistics, it comes as measurements (data) of observable variables. But ultimately, when viewed
through the lens of information theory, these types of information are essentially the same -
namely, nothing but the output of a (noisy) communication channel,¹² either through an
(imprecise) function or a (noisy) statistical observation. pn exploits this analogy by recasting
numerical problems as statistical inference. More precisely, probabilistic numerical methods work
by providing a statistical model linking the accessible function evaluations to the solution of a
numerical problem, and then approximating the solution by use of statistical inference in this
model. But this statistical model can be useful beyond the numerical problem at hand: should the
occasion arise, it can be extended to include additional observational data containing more
information. This way, the computational and observational data (information) can work together,
in a single model, to improve the inference of the numerical solution, and of other latent
variables of this joint model. Real payoffs of this probabilistic consolidation of numerics and
statistics have, for example, been demonstrated for differential equations - as we will detail in
§41.3.

¹² MacKay (2003), §9.

Probabilistic numerical algorithms are already adding value

The pn approach to global optimisation is known as Bayesian optimisation, and, in fact, Bayesian
optimisation has been conceived as probabilistic since its invention. Bayesian optimisation is
widely used today to automatically make decisions otherwise made by human algorithm designers.
Machine learning's growth has created a bewildering variety of algorithms, each with their own
design decisions: the choice of an algorithm and the details of its design can be framed as an
"outer-loop" global optimisation problem. This problem makes careful selection of evaluations of
the algorithm and its design primary, as the cost of evaluating an algorithm on a full data set
is expensive enough to prevent anything approaching exhaustive search. In finding performant
evaluations sooner, Bayesian optimisation saves, relative to the alternative of grid search,
considerable computation on evaluations. Pre-dating the recent surge of research in wider
Probabilistic Numerics, Bayesian optimisation has already blossomed into a fertile sub-community
of its own, and produced significant economic impact. Chapter V is devoted to this domain,
highlighting a host of new questions and ideas arising when uncertainty plays a prominent role in
computation.

Pipelines of computation demand harmonisation

A probabilistic-numerics framework, encompassing all numerical algorithms, grants the ability to
efficiently allocate computation, and manage uncertainty, amongst them. A newly prominent
computational issue is that realistic data processing in science and industry happens not in a
single computational step, but in highly engineered pipelines of compartmental computations. Each
step of these chains consumes computation, and depends on and propagates errors and
uncertainties. The area of uncertainty quantification¹³ has developed methods to study and
identify such problems, but its methods tend to add computational overhead that is not recouped
in savings elsewhere. Probabilistic numerical methods, with their ability to handle uncertain
inputs and produce calibrated uncertain outputs, offer a natural notion of uncertainty
propagation through computational graphs. The framework of graphical models provides a
scaffolding for this process. Depending on the user's needs, it is then possible to scale between
simple Gaussian forms of uncertainty propagation, which produce simple error bars at only minimal
computational overhead, and full-fledged uncertainty propagation, with more significant
computational demands. With a harmonised treatment of the uncertainty resulting from each step,
and of the computational costs of reducing such uncertainty, pn allows the allocation of
computational resources to those steps that would benefit from it most.

¹³ Sullivan (2015).

Open questions

Finally, the text also highlights some areas of ongoing research, again with a focus on desirable
functionality for data-centric computation.

We will not pay very close attention to machine precision, machine errors, and problems of
numerical stability. These issues have been studied widely and deeply in numerical analysis. They
play an important role, of course, and their ability to cause havoc should not be underestimated.
But in many pressing computational problems of our time, especially those involving models
defined on external data sets, the dominant sources of uncertainty lie elsewhere. Sub-sampling of
big data sets to speed up a computation regularly causes computational uncertainty many orders of
magnitude above the machine's computational precision. In fact, in areas like deep learning, the
noise frequently dominates the signal in magnitude. This is also the reason why we spend
considerable time on methods of low order: advanced methods of high order are often not
applicable in the context of high computational uncertainty.

This Book and You

We wrote this book for anyone who needs to use numerical methods, from astrophysicists to deep
learning hackers. We hope that it will be particularly interesting for those who are, or are
aiming to become, developers of numerical methods, perhaps those with machine learning or
statistical training. We invite you to join the Probabilistic Numerics community. Why should you
care?

Probabilistic Numerics is beautiful

The study of pn is its own reward. It offers a unified treatment of numerical algorithms that
recognises them as first-class citizens, agents in their own right.

Probabilistic Numerics is just beginning to bloom

The pn banner has been borne since Poincaré, and ours will not be the generation to let it slip.
Despite these deep roots, the field's branches are only now beginning to be defined, and we can
only guess at what wonderful fruit they will produce. In Chapter VII, we will describe some of
the many problems open to your contributions to pn.

Probabilistic Numerics is your all-in-one toolkit for numerics

Numerics need not be considered foreign to those with statistical or machine learning expertise.
pn offers a machine learning or statistics researcher the opportunity to deploy much of their
existing skillset in tackling the numerical problems with which they are so commonly faced. pn
allows the design of numerical algorithms that are perfectly tailored to the needs of your
problem.
Probabilistic Numerics grants control of computation

pn's promise is to transform computation, sorely needed in an age of ballooning computation
demands and the ever-growing evidence that its costs cannot continue to be borne.

With this, it is time to get to the substance. The next chapter provides a concise introduction
to the most central arithmetic framework of probabilistic inference: Gaussian probability
distributions, which provide the basic toolbox for computationally efficient reasoning with
uncertainty. Readers fully familiar with this area can safely skip this chapter, and move
directly to the discussion of the first and arguably the simplest class of numerical problems,
univariate integration, in Chapter II.
Chapter I
Mathematical Background
1
Key Points

This chapter introduces mathematical concepts needed in the remainder. Readers with a background
in statistics or machine learning may find that they can skim through it. In fact, we recommend
that readers skip this chapter on their first, high-level pass through this text, as the first
conceptual arguments for Probabilistic Numerics will arrive in Chapter II. However, the
mathematical arguments made in later chapters require the following key concepts, which must be
developed here first:

§ 2 Probabilities provide the formal framework for reasoning


and inference in the presence of uncertainty.

§ 3 The notion of computation as probabilistic inference is


not restricted to one kind of probability distribution; but
Gaussian distributions play a central role in inference on
continuous-valued variables due to their convenient alge­
braic properties.

§ 4 Regression (inference on a function from observations of its values at a finite number of points) is an internal operation of all numerical methods, and arguably a low-level numerical algorithm in itself. But regression is also a central task in machine learning. We develop the canonical mathematical tools for probabilistic regression:

§ 4.1 Gaussian uncertainty over the weights of a set of basis functions allows inference on both linear and nonlinear functions within a finite-dimensional space of hypotheses. The choice of basis functions for this task is essentially unlimited and has wide-ranging effects on the inference process.

§ 4.2 Gaussian processes extend the notion of a Gaussian distribution to infinite-dimensional objects, such as functions.

§ 4.3 In the Gaussian case, the probabilistic view is very closely related to other frameworks of inference and approximation, in particular to least-squares estimation. This connection will be used in later chapters of this text to connect classic and probabilistic numerical methods. In fact, readers with a background in interpolation/scattered data approximation may interpret this section as covering a first example of building a probabilistic interpretation for these numerical tasks. In contrast, from the perspective of machine learning and statistics, regression is not a computational task, but a principal form of learning. This difference in viewpoints thus highlights the fundamental similarity of computation and learning once again.
§ 4.4 Gaussian process models allow inference on derivatives of functions from observations of the function, and vice versa. This will be relevant in all domains of numerical computation, in particular in integration, optimisation, and the solution of differential equations.

§ 5 Gauss-Markov processes are a class of Gaussian process models on univariate domains whose "finite memory" allows inference at cost linear in the number of observations. The inference process is packaged into algorithms known as filters and smoothers. The dynamics of the associated stochastic process are captured by linear stochastic differential equations.

§ 6 Conjugate priors for the hyperparameters of Gaussian models allow inference on the prior mean and covariance at low computational cost.
2 Probabilistic Inference

Probabilities are the mathematical formalisation of the concept of uncertainty. Many numerical tasks involve quantities taking values x in the real vector space ℝ^D for D ∈ ℕ. So consider a continuous variable X ∈ ℝ, and assume that its value x is not known precisely. Instead, it is only known that the probability for the value of X to fall into the set U ⊂ ℝ is P_X(U). If P_X is sufficiently regular, one can define the probability density function p(x) as the Radon-Nikodym derivative of P_X with respect to the Lebesgue measure; that is, the function p(x) with the property $P_X(U) = \int_U p(x)\,dx$ for all measurable sets U ⊂ ℝ.
There are two basic rules for the manipulation of probabilities: if two variables x, y ∈ ℝ are assigned the density function p(x, y), then the marginal distribution is given by the sum rule

$$p(y) = \int p(x, y)\, dx,$$

and the conditional distribution p(x | y) for x given that Y = y is provided implicitly by the product rule

$$p(x \mid y)\, p(y) = p(x, y), \qquad (2.1)$$

whose terms are depicted in Figure 2.1. [Figure 2.1: Conceptual sketch of a joint probability distribution p(x, y) over two variables x, y, with marginal p(y) and a conditional p(x | y).] The corollary of these two rules is Bayes' theorem, which describes how prior knowledge, combined with data generated according to the conditional density p(y | x), gives rise to the posterior distribution on x:

$$\underbrace{p(x \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid x)}^{\text{likelihood}}\;\overbrace{p(x)}^{\text{prior}}}{\underbrace{\int p(y \mid x)\, p(x)\, dx}_{\text{evidence}}}.$$

When interpreted as a function of x, the conditional distribution p(y | x) is called the likelihood (note that it is not a probability density function of x, but of y). Computing the posterior amounts to inference on x from y.

Although not without competition, probability theory is widely accepted as the formal framework for reasoning with imperfect knowledge (uncertainty) in the natural sciences,¹ statistics,² and computer science.³ This text accepts the general philosophical and mathematical arguments made in these texts and asks, more practically, how to use probabilistic formulations in computation.

¹ Jaynes and Bretthorst (2003).
² Cox (1946); Le Cam (1973); Ibragimov and Has'minskii (1981).
³ Pearl (1988); Hutter (2010).
3 Gaussian Algebra

The Gaussian or normal probability distribution over ℝ^D is identified by its probability density function

$$\mathcal{N}(x; \mu, \Sigma) := \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right). \qquad (3.1)$$

Here, a parameter vector μ ∈ ℝ^D specifies the mean of the distribution, and a symmetric positive definite (spd) matrix Σ ∈ ℝ^{D×D} defines the covariance of the distribution:¹

$$\mu_i = \mathbb{E}_{\mathcal{N}(x;\mu,\Sigma)}(x_i), \qquad \Sigma_{ij} = \operatorname{cov}_{\mathcal{N}(x;\mu,\Sigma)}(x_i, x_j).$$

¹ The matrix inverse Σ⁻¹ is known as the precision matrix. The meaning of its elements is a bit more involved than for those of Σ, and pertains to the distribution of one or two variables conditional on all the others: if p(x) = 𝒩(x; μ, Σ), then the variance of x_i conditioned on the values of all x_{j≠i} is var_{x_{j≠i}}(x_i) = 1/[Σ⁻¹]_{ii}. An interpretation of the off-diagonal elements of Σ⁻¹ is that they provide the coefficients of a linear equation that determines the expectation of one element of the variable when conditioned on all the others (cf. Eq. (3.10)). Assume μ = 0 for simplicity. Then, given the information x_{j≠i} = y, the expectation of x_i is
$$\mathbb{E}_{x_{j\neq i}=y}(x_i) = -\frac{1}{[\Sigma^{-1}]_{ii}} \sum_{j\neq i} [\Sigma^{-1}]_{ij}\, y_j.$$
More on this in a report by MacKay (2006).

There are many ways to motivate the prevalence of the Gaussian distribution. It is sometimes presented as arising from analytic results like the central limit theorem, or the fact that the normal distribution is the unique probability distribution with mean μ and covariance Σ maximising the differential entropy functional.² But the primary practical reason for the ubiquity of Gaussian probability distributions is that they have convenient algebraic properties. This is analogous to the popularity of linear approximations in numerical computations: the main reason to construct linear approximations is that linear functions offer a rich analytic theory, and that computers are good at the basic linear operations, addition and multiplication.

² The entropy,
$$H_p(x) := -\int p(x)\log p(x)\,dx, \qquad (3.2)$$
of the Gaussian is given by
$$H_{\mathcal{N}(x;\mu,\Sigma)}(x) = \frac{D}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2}\log|\Sigma|. \qquad (3.3)$$

In fact, the connection between linear functions and Gaussian distributions runs deeper: Gaussians are a family of probability distributions that are preserved under all linear operations. The following properties will be used extensively:

If a variable x ∈ ℝ^D is normally distributed, then every affine transformation of it also has a Gaussian distribution (Figure 3.1):

$$\text{if } p(x) = \mathcal{N}(x; \mu, \Sigma), \text{ and } y := Ax + b \text{ for } A \in \mathbb{R}^{M\times D},\ b \in \mathbb{R}^M,$$
$$\text{then } p(y) = \mathcal{N}(y; A\mu + b, A\Sigma A^\top). \qquad (3.4)$$

The product of two Gaussian probability density functions is another Gaussian probability density, scaled by a constant.³ The value of that constant is itself given by the value of a Gaussian density function:

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B),$$
$$\text{where } C := (A^{-1} + B^{-1})^{-1} \quad \text{and} \quad c := C(A^{-1}a + B^{-1}b). \qquad (3.5)$$

[Figure 3.1: The product of two Gaussian densities is another Gaussian density, up to normalisation.]

³ This statement is about the product of two probability density functions. In contrast, the product of two Gaussian random variables is not a Gaussian random variable.

These two properties also provide the mechanism for Gaussian inference: if the variable x ∈ ℝ^D is assigned a Gaussian prior, and observations y ∈ ℝ^M, given x, are Gaussian distributed,

$$p(x) = \mathcal{N}(x; \mu, \Sigma) \quad \text{and} \quad p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda),$$

then both the posterior and the marginal distribution for y (the evidence) are Gaussian (Figure 3.2):

$$p(x \mid y) = \mathcal{N}(x; \tilde{\mu}, \tilde{\Sigma}), \quad \text{with} \qquad (3.6)$$
$$\tilde{\Sigma} := (\Sigma^{-1} + A^\top \Lambda^{-1} A)^{-1} \qquad (3.7)$$
$$\phantom{\tilde{\Sigma}:} = \Sigma - \Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1} A\Sigma, \quad \text{and} \qquad (3.8)$$
$$\tilde{\mu} := \tilde{\Sigma}\left(A^\top \Lambda^{-1}(y - b) + \Sigma^{-1}\mu\right) \qquad (3.9)$$
$$\phantom{\tilde{\mu}:} = \mu + \Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1}\left(y - (A\mu + b)\right); \qquad (3.10)$$

and

$$p(y) = \mathcal{N}(y; A\mu + b, A\Sigma A^\top + \Lambda). \qquad (3.11)$$

[Figure 3.2: The posterior distribution p(x | y) arising from a Gaussian prior p(x) = 𝒩(x; [1, 0.5]^⊤, 3²I) and Gaussian likelihood p(y = 6 | x) = 𝒩(6; [1, 0.6]x, 1.5²) is itself Gaussian. The sketch also illustrates that the likelihood function does not need to be a proper probability distribution on x (only on y). The posterior distribution can be correlated, even if the prior is uncorrelated. This phenomenon, that two a priori independent "parent" variables x₁, x₂ can become correlated by an observation y = a₁x₁ + a₂x₂ connecting the two, is known as explaining away (Pearl, 1988).]

The equivalent forms, Eqs. (3.9) & (3.10) and Eqs. (3.7) & (3.8), show two different formulations of the same posterior mean and covariance, respectively. The former pair contains a matrix inverse of size D × D, the latter one of size M × M. Depending on which of the two numbers is larger, it is more efficient to compute one or the other. The marginal covariance of y in Eq. (3.11) also shows up as part of Eq. (3.8). Computing this evidence term thus adds only minimal computational overhead over the computation of the posterior mean.
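Because Eqs. (3.6)-(3.11) recur throughout this text, it may help to see them spelled out in code. The following is a minimal NumPy sketch (ours, purely illustrative; all names are our own) of the M × M formulation of Eqs. (3.8), (3.10) & (3.11):

import numpy as np

def gaussian_infer(mu, Sigma, A, b, Lam, y):
    # evidence p(y) = N(y; A mu + b, A Sigma A^T + Lam), Eq. (3.11)
    y_mean = A @ mu + b
    G = A @ Sigma @ A.T + Lam
    # "gain" Sigma A^T G^{-1}, via a linear solve rather than an explicit inverse
    K = np.linalg.solve(G, A @ Sigma).T
    mu_post = mu + K @ (y - y_mean)        # posterior mean, Eq. (3.10)
    Sigma_post = Sigma - K @ A @ Sigma     # posterior covariance, Eq. (3.8)
    return mu_post, Sigma_post, y_mean, G

# the toy setting of Figure 3.2: D = 2 latent variables, M = 1 observation
mu, Sigma = np.array([1.0, 0.5]), 3.0**2 * np.eye(2)
A, b, Lam = np.array([[1.0, 0.6]]), np.zeros(1), np.array([[1.5**2]])
mu_post, Sigma_post, _, _ = gaussian_infer(mu, Sigma, A, b, Lam, np.array([6.0]))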

An important special case arises if the matrix A maps to a proper subset of x. Consider a separation of x = [a, b]^⊤ into a ∈ ℝ^d and b ∈ ℝ^{D−d}. The joint is⁴

$$p(x) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix};\ \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},\ \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}\right).$$

Now consider A = [I_d, 0_{D−d}], i.e. the "selector" map extracting a subset of size d < D. By Eq. (3.4), the marginal of this Gaussian distribution is another Gaussian, whose mean and covariance are simply a sub-vector and sub-matrix of the full mean and covariance, respectively (Figure 3.3):

$$p(a) = \int p(a, b)\, db = \mathcal{N}(a; \mu_a, \Sigma_{aa}). \qquad (3.12)$$

Additionally using Eq. (3.6), we see that the conditional of a subset conditioned on its complement is also a Gaussian:

$$p(a \mid b) = \mathcal{N}\left(a;\ \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right). \qquad (3.13)$$

Since marginalisation (sum rule) and conditioning (product rule) are the two elementary operations of probability theory, "Gaussian distributions map probability theory to linear algebra": to matrix multiplication and inversion.

⁴ In other words, if we only care about a small number d of a large set of D interdependent (co-varying) Gaussian variables, then that marginal can be computed trivially, by selecting elements from the mean vector and covariance matrix. On the one hand, this property shows that covariance only captures a limited amount of structure. On the other, it makes it possible to consider arbitrarily large sets of variables, as long as we only have to deal with finite subsets of them. This observation is at the heart of the notion of a Gaussian process, discussed below.

[Figure 3.3: Projections and conditionals of a Gaussian are Gaussians.]
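The selector-map special case is equally short in code. A sketch (ours) of Eqs. (3.12) & (3.13), with our own variable names:

import numpy as np

def marginal_and_conditional(mu, Sigma, idx_a, b_value):
    # split the joint N(mu, Sigma) into blocks a (kept) and b (observed)
    idx_b = np.setdiff1d(np.arange(len(mu)), idx_a)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # marginal p(a) = N(mu_a, S_aa), Eq. (3.12): just select entries
    # conditional p(a | b), Eq. (3.13):
    W = np.linalg.solve(S_bb, S_ab.T).T          # = S_ab S_bb^{-1}
    mu_cond = mu_a + W @ (b_value - mu_b)
    S_cond = S_aa - W @ S_ab.T
    return (mu_a, S_aa), (mu_cond, S_cond)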
4 Regression

► 4.1 Parametric Gaussian Regression

Gaussian distributions assign probability density to vectors of real numbers. In numerical applications, the objects of interest are often the output of real-valued¹ functions f : 𝕏 → ℝ over some input domain 𝕏. A straightforward but powerful way to use the Gaussian inference framework on such functions is to make the assumption that f can be written as a weighted sum over a finite number F of feature functions [φ_i : 𝕏 → ℝ]_{i=1,...,F}, as

$$f(x) = \sum_{i=1}^F \phi_i(x)\, w_i =: \Phi_x^\top w \quad \text{with } w \in \mathbb{R}^F. \qquad (4.1)$$

The right-hand side of this equation introduces a slightly sloppy but helpful notation: given a collection² X ⊂ 𝕏 with elements [X]_i = x_i ∈ 𝕏, i = 1, ..., N, we will denote by Φ_X ∈ ℝ^{F×N} the matrix of feature vectors with elements [Φ_X]_{ij} = φ_i(x_j). Similarly, we will write f_X for the vector of function values [f(x₁), ..., f(x_N)].

The crucial algebraic property of Eq. (4.1) is that it models f as a linear function of the weights w. It is therefore called a linear regression model.³ To perform inference on such a function within the Gaussian framework, we assign a Gaussian density over the possible values for the weight vector w, written

$$p(w) = \mathcal{N}(w; \mu, \Sigma).$$

Let us assume that it is possible to collect observations y := [y₁, ..., y_N] ∈ ℝ^N of f corrupted by Gaussian noise with covariance Λ ∈ ℝ^{N×N} at the locations X (Figure 4.1). That is, according to the likelihood function⁴

¹ The formalism can be extended to complex-valued and multivariate functions, or to functions mapping virtually any input domain 𝕏 to an output domain 𝕐 that can be made isomorphic to a real vector space. This generality is suppressed here to allow a more accessible presentation. The type of 𝕏 does not matter at all since it is "encapsulated" and mapped to a real vector by the features φ. To capture multivariate outputs (or outputs that are isomorphic to the multivariate reals), the model has to return not one but several real numbers. This can be achieved by increasing the number of features and weights, i.e. by "stacking" several univariate functions. Chapter III, on linear algebra, prominently features this kind of model.
² We use the capital X instead of x to signify that X can really be an ordered collection of just about any type of inputs: real numbers, but also strings, graphs, etc. See e.g. §4.4 in Rasmussen and Williams (2006).
³ This need not mean that f is also a linear function in x!
⁴ This includes the special case Λ → 0, usually written suggestively with the Dirac delta as p(y | f) = δ(y − f_X). This corner case is important in numerical applications where function values can be computed "exactly", i.e. up to machine precision.

1 procedure Parametric_Infer(y, X, Λ, φ, μ, Σ)
2   Φ_X = φ(X)                              // evaluate features of data
3   Σ̃ = (Σ⁻¹ + Φ_X Λ⁻¹ Φ_X^⊤)⁻¹             // posterior covariance
4     // This is the most costly step, at O(F³).
5   μ̃ = Σ̃ (Σ⁻¹ μ + Φ_X Λ⁻¹ y)               // posterior mean
6 end procedure

[Algorithm 4.1: Basic implementation of the inference step of parametric regression, which constructs a posterior distribution on the weights. Here, inference is performed in weight space, so the computational cost, given N observations and F features, is O(NF² + F³). Compare with Algorithm 4.3 for the function space view, where the cost is O(N³).]

$$p(y \mid f) = \mathcal{N}(y; f_X, \Lambda). \qquad (4.2)$$

Then the posterior over the weights w, by Eq. (3.13), is

$$p(w \mid y) = \mathcal{N}(w; \tilde{\mu}, \tilde{\Sigma}), \quad \text{with} \quad \tilde{\Sigma} := (\Sigma^{-1} + \Phi_X \Lambda^{-1} \Phi_X^\top)^{-1} \quad \text{and} \quad \tilde{\mu} := \tilde{\Sigma}(\Sigma^{-1}\mu + \Phi_X \Lambda^{-1} y).$$

Using the matrix inversion lemma (Eq. (15.9)), these expressions can also be written with the inverse of a matrix in ℝ^{N×N} as

$$\tilde{\Sigma} = \Sigma - \Sigma\Phi_X(\Phi_X^\top\Sigma\Phi_X + \Lambda)^{-1}\Phi_X^\top\Sigma,$$
$$\tilde{\mu} = \mu + \Sigma\Phi_X(\Phi_X^\top\Sigma\Phi_X + \Lambda)^{-1}(y - \Phi_X^\top\mu).$$

Rasmussen and Williams (2006) call this the weight-space view of regression. Alternatively, one can also construct a function-space view, a density directly over values of the unknown f: since f is a linear map of w, Eq. (3.4) implies that the posterior over function values f_x at a finite subset x ⊂ 𝕏 is

$$p(f_x) = \mathcal{N}(f_x;\ \Phi_x^\top\tilde{\mu},\ \Phi_x^\top\tilde{\Sigma}\Phi_x) \quad \text{with} \qquad (4.3)$$
$$\Phi_x^\top\tilde{\mu} = \Phi_x^\top\mu + \Phi_x^\top\Sigma\Phi_X(\Phi_X^\top\Sigma\Phi_X + \Lambda)^{-1}(y - \Phi_X^\top\mu),$$
$$\Phi_x^\top\tilde{\Sigma}\Phi_x = \Phi_x^\top\Sigma\Phi_x - \Phi_x^\top\Sigma\Phi_X(\Phi_X^\top\Sigma\Phi_X + \Lambda)^{-1}\Phi_X^\top\Sigma\Phi_x.$$

[Figure 4.1: A simple nonlinear data set with i.i.d. Gaussian observation noise, i.e. Λ = σ²I_N, σ ∈ ℝ₊ (error-bars). Figures 4.2 and 4.3 demonstrate the effect of varying prior assumptions on the posterior induced by these observations.]

1 procedure Parametric_Predict(x, μ̃, Σ̃)        // predict at x
2   Φ_x = φ(x)                                  // features of x
3   m_x = Φ_x^⊤ μ̃                               // predictive mean
4   V_x = Φ_x^⊤ Σ̃ Φ_x                           // predictive covariance
5   s_x = Φ_x^⊤ · Chol(Σ̃)^⊤ · RandNormal(F) + m_x   // sample
6 end procedure

[Algorithm 4.2: Basic implementation of the prediction step of parametric regression in function space, using the outputs of Alg. 4.1. The cost, given D prediction locations, is O(DF) for the mean and element-wise variance, and O((DF)²) for a full covariance. Samples cost O((DF)²) if the Cholesky decomposition (p. 36) of Σ̃ is available (e.g. from Alg. 4.1).]
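For concreteness, here is a short NumPy rendering of Algorithms 4.1 and 4.2 (a sketch in our own notation; the function names mirror the pseudocode but are otherwise ours):

import numpy as np

def parametric_infer(y, Phi_X, Lam, mu, Sigma):
    # Algorithm 4.1: posterior N(w; mu_post, S_post) on the weights.
    # Phi_X is the F x N feature matrix.
    A = np.linalg.solve(Lam, Phi_X.T)                 # Lambda^{-1} Phi_X^T (N x F)
    S_post = np.linalg.inv(np.linalg.inv(Sigma) + Phi_X @ A)
    mu_post = S_post @ (np.linalg.solve(Sigma, mu) + Phi_X @ np.linalg.solve(Lam, y))
    return mu_post, S_post

def parametric_predict(Phi_x, mu_post, S_post):
    # Algorithm 4.2: predictive mean and covariance of f_x = Phi_x^T w
    return Phi_x.T @ mu_post, Phi_x.T @ S_post @ Phi_x

# cubic-polynomial features on a toy data set
phi = lambda t: np.vander(t, 4, increasing=True).T    # F = 4 features
X = np.linspace(-2, 2, 10)
y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(10)
mu_post, S_post = parametric_infer(y, phi(X), 0.01 * np.eye(10),
                                   np.zeros(4), np.eye(4))
m, V = parametric_predict(phi(np.array([0.5])), mu_post, S_post)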

[Figure 4.2: Odd-numbered rows: priors over univariate real functions using different features. Each plot shows the underlying features Φ_x as thin grey lines, the prior mean function Φ_x^⊤μ = 0 arising from μ = 0, the marginal standard deviation diag(Φ_x^⊤ΣΦ_x)^{1/2} for the choice Σ = I, and four samples drawn i.i.d. from the joint Gaussian over the function. Even-numbered rows: posteriors arising in these models from the observations shown in Figure 4.1. The feature functions giving rise to these eight different plots are the polynomials φ_i(x) = x^i, i = 0, ..., 3; the trigonometric functions φ_i(x) = sin(x/i), i = 1, ..., 8, and φ_i(x) = cos(x/(i−8)), i = 9, ..., 16; as well as, for i = −8, −7, ..., 8, the "switch" functions φ_i(x) = sign(x − i); the "step" functions φ_i(x) = 𝕀(x − i > 0); the linear functions φ_i(x) = |x − i|; the first 13 Legendre polynomials (scaled to [−10, 10]); the absolute's exponential φ_i(x) = e^{−|x−i|}; the square exponential φ_i(x) = e^{−(x−i)²}; and the sigmoids φ_i(x) = 1/(1 + e^{−3(x−i)}). The plots highlight the broad range of behaviour accessible by varying the choice of feature functions.]

Figure 4.2 shows a gallery of prior and posterior densities over function values⁵ arising from the same weight-space prior p(w) = 𝒩(w; 1, I), and differing choices of feature functions φ. The figure also shows posterior densities arising from these priors under one particular data set of observations (shown in Figure 4.1 for reference). The figure and the derivations above highlight the following aspects:

- Generalised linear regression allows inference on real-valued functions over arbitrary input domains. By varying the feature set φ, broad classes of hypotheses can be created. In particular, Gaussian regression models can model nonlinear, discontinuous, even unbounded functions. There are literally no limitations on the choice of φ : 𝕏 → ℝ.

- Neither the posterior mean nor the covariance over function values (Eq. (4.3)) contain "lonely" feature vectors, but only inner products of the form k_ab := Φ_a^⊤ΣΦ_b and m_a := Φ_a^⊤μ. For reasons that will become clear in the following section, these quantities are known as the covariance function k : 𝕏 × 𝕏 → ℝ and mean function m : 𝕏 → ℝ, respectively.

⁵ With some technical complications, Eq. (4.3) assigns a measure over an infinite-dimensional space of functions. But the plots only involve finitely many function values (albeit on a grid of high resolution), for which it suffices to consider the less tricky concept of a probability density. For more, see §4.2.2.

Exercise 4.1 (easy). Consider the likelihood of Eq. (4.2) with the parametric form for f of Eq. (4.1). Show that the maximum-likelihood estimator for w is given by the ordinary least-squares estimate
$$w_{\mathrm{ML}} = (\Phi_X \Phi_X^\top)^{-1}\Phi_X\, y.$$
To do so, use the explicit form of the Gaussian pdf to write out log p(y | X, w), take the gradient with respect to the elements [w]_i of the vector w and set it to zero. If you find it difficult to do this in vector notation, it may be helpful to write out Φ_X^⊤w = Σ_i w_i [Φ_X]_{i:}, where [Φ_X]_{i:} is the i-th row of Φ_X. Calculate the derivative of log p(y | X, w) with respect to w_i, which is scalar.

► 4.2 Gaussian Processes - Nonparametric Gaussian Inference

The basic properties of Gaussian distributions summarised in Eqs. (3.12) and (3.13) mean that computing the marginal of a high-dimensional Gaussian distribution involves only quantities of the dimensionality of the marginal. Thus one may wonder what happens if the high-dimensional distribution is in fact of infinite dimensionality. This limit is known as a Gaussian process, and extends the notion of regression to real-valued functions.

The theory and practical use of Gaussian processes are well-studied. Extended introductions can be found in the textbook by Rasmussen and Williams (2006) as well as the older book by Wahba (1990). The regularity⁶ and extremal behaviour⁷ of samples from a Gaussian process have been analysed in detail.

⁶ Adler (1981). ⁷ Adler (1990).

As will briefly be pointed out below, Gaussian process models are also closely related to kernel methods in machine learning, in particular to kernel ridge regression. Many of the theoretical concepts in that area transfer, directly or with caveats, to the Gaussian process framework. These are discussed for example in the books by Schölkopf and Smola (2002), and Steinwart and Christmann (2008). Moreover, Berg, Christensen, and Ressel (1984) provide an introduction to the theory of positive definite functions, which are a key concept in this area (see §4.2.1). With a focus on accessibility, this section provides the basic notions and algorithms required for inference in these models, as well as an intuition for the generality and limitations of Gaussian process regression.

> 4.2.1 Positive Definite Kernels

The main challenge in defining an "infinite-dimensional Gaussian distribution" is how to describe the infinite limit of a covariance matrix. Covariance matrices are symmetric positive definite;⁸ the extension of this notion to operators is called a positive definite kernel. We approach this area from the parametric case studied in §4.1: recall from Eq. (4.3) that, for general linear regression using features φ, the mean vector and covariance matrix of the posterior on function values do not contain isolated, explicit forms of the features. Instead, for finite subsets a, b ⊂ 𝕏 they contain only projections and inner products, of the form

$$m_a := \Phi_a^\top\mu \quad \text{and} \quad k_{ab} := \Phi_a^\top\Sigma\Phi_b = \sum_{i,j=1}^F \phi_i(a)\,\phi_j(b)\,\Sigma_{ij}.$$

⁸ For more on symmetric positive definite matrices, see §15.2.

These two types of expressions are themselves functions, called

- the mean function m : 𝕏 → ℝ, and
- the covariance function k : 𝕏 × 𝕏 → ℝ.

Because the features are thus "encapsulated", we may wonder whether one can construct a Gaussian regression model without explicitly stating a specific set of features. For the mean function, this is easy: since one can add an arbitrary linear shift b in the likelihood (4.2), the mean function does not actually have to be an inner product of feature functions. It can be chosen more generally, to be any computationally tractable function m : 𝕏 → ℝ. The covariance function requires more care; but if we find an analytic short-cut for the sum of F² terms for a smart choice of (φ, μ, Σ), it is possible to use a very large, even infinite set of features. This is known as "the kernel trick".⁹ It is based on one of the basic insights of calculus: certain structured sums and series have analytic expressions that allow computing their value without actually going through their terms individually.

⁹ Schölkopf (2000).

MacKay (1998) provides an intuitive example. Consider 𝕏 = ℝ, and decide to regularly distribute Gaussian feature functions with scale λ ∈ ℝ₊ over the domain [c_min, c_max] ⊂ ℝ. That is, set

$$\phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{\lambda^2}\right), \qquad c_i = c_{\min} + i\,\frac{c_{\max} - c_{\min}}{F}.$$

We also choose, for an arbitrary scale θ² ∈ ℝ₊, the covariance

$$\Sigma = \frac{\sqrt{2}\,\theta^2(c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, I_F,$$

and set μ = 0. It is then possible to convince oneself¹⁰ that the limit of F → ∞ and c_min → −∞, c_max → ∞ yields

$$k(a, b) = \theta^2 \exp\left(-\frac{(a - b)^2}{2\lambda^2}\right). \qquad (4.4)$$

¹⁰ Proof sketch, omitting technicalities:
$$k_{ab} = \frac{\sqrt{2}\,\theta^2(c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\sum_{i=1}^F e^{-\frac{(a - c_i)^2}{\lambda^2}}\, e^{-\frac{(b - c_i)^2}{\lambda^2}} = \frac{\sqrt{2}\,\theta^2(c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, e^{-\frac{(a - b)^2}{2\lambda^2}}\sum_{i=1}^F e^{-\frac{(c_i - \frac{1}{2}(a + b))^2}{\lambda^2/2}}.$$
In the limit of large F, the number of features in a region of width δc converges to F δc/(c_max − c_min), and the sum becomes the Gaussian integral
$$k_{ab} = \frac{\sqrt{2}\,\theta^2}{\sqrt{\pi}\,\lambda}\, e^{-\frac{(a - b)^2}{2\lambda^2}} \int_{c_{\min}}^{c_{\max}} e^{-\frac{(c - \frac{1}{2}(a + b))^2}{\lambda^2/2}}\, dc.$$
For c_max, c_min → ±∞, that integral converges to √π λ/√2, which yields Eq. (4.4).

This is called a nonparametric formulation of regression, since the parameters w of the model (regardless of whether their number is finite or infinite) are not explicitly represented in the computation. The function k constructed in Eq. (4.4), if used in a Gaussian regression framework, assigns a covariance of full rank to arbitrarily large data sets. Such functions are known as (positive definite) kernels.
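The construction is easy to check numerically. A short script (ours; it assumes the normalisation of Σ reconstructed above) compares the finite-F feature-space covariance with the limit (4.4):

import numpy as np

lam, theta, c_min, c_max, F = 1.0, 1.5, -20.0, 20.0, 4000
c = c_min + np.arange(1, F + 1) * (c_max - c_min) / F          # feature centres
s2 = np.sqrt(2) * theta**2 * (c_max - c_min) / (np.sqrt(np.pi) * lam * F)

def k_finite(a, b):
    # k_ab = sum_i phi_i(a) phi_i(b) times the prior variance of each weight
    return s2 * np.sum(np.exp(-(a - c)**2 / lam**2) * np.exp(-(b - c)**2 / lam**2))

a, b = 0.3, -1.2
print(k_finite(a, b))                                          # finite F
print(theta**2 * np.exp(-(a - b)**2 / (2 * lam**2)))           # limit, Eq. (4.4)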

Definition 4.2 (Kernel). A positive (semi-) definite kernel is a bivariate real function k : 𝕏 × 𝕏 → ℝ over some space 𝕏 such that, for any set X = [x₁, ..., x_N] ⊂ 𝕏, the matrix k_XX with elements

$$[k_{XX}]_{ij} = k(x_i, x_j)$$

is positive (semi-) definite.¹¹ Such functions are also known as Mercer kernels, positive definite functions, and, in the context of Gaussian processes, as covariance functions.

¹¹ It follows from the definition of positive definite matrices (§15.2) that all kernels obey k(a, a) ≥ 0 and k(a, b) = k(b, a), ∀a, b ∈ 𝕏.

In addition to the Gaussian kernel of Eq. (4.4), two further examples are the Wiener and Ornstein-Uhlenbeck (exponential) kernels over the reals (more in §5.4 and §5.5):

$$k_{\text{Wiener}}(a, b) = \min(a, b), \qquad (4.5)$$
$$k_{\text{Ornstein-Uhlenbeck}}(a, b) = \exp(-|a - b|).$$

There are other limits of certain feature sets that give rise to other kernels beyond the Gaussian form of Eq. (4.4) (Exercise 4.3). These constructions are not unique: there may be other choices of ostensibly very different feature sets that, in some limit, converge to the same covariance function. Nevertheless, this explicit construction provides two insights:

Exercise 4.3 (moderate). Convince yourself that similar limits for the "switches" and "steps" feature functions in Figure 4.2 give rise to the "linear splines" and Wiener kernels listed in Figure 4.3. More examples of this form, including infinite limits of polynomial and Fourier features, can be found in a technical report by Minka (2000).
4 Regression 33

- While the covariance function (4.4) is a "sum" (an integral) over an infinite set of features, we had to drop the prior variance of each feature proportionally to M⁻¹, the number of features.¹² In this sense, we have spread a finite probability mass over an infinite space of features. The model may have infinitely many degrees of freedom, but not necessarily infinite modelling flexibility.

- Constructing the kernel requires careful choices, like the scalar form of Σ, and feature functions with a well-defined and analytic integral limit. So the elegance of the nonparametric formulation comes at a price: while we can take any set of feature functions to build a parametric model, we cannot just take any bivariate function and hope that it might be a kernel.

¹² Over a bounded domain [c_min, c_max], the determinant of this Σ converges, in the limit of M → ∞, to a finite value.

For an intuition of how large the space of positive definite kernels is, we note that kernels form a semi-ring (a worked sketch follows this list).

1. If k is a kernel, then αk for any α ∈ ℝ₊ is a kernel.
   Proof: v^⊤Kv ≥ 0 ⇒ α v^⊤Kv ≥ 0. □

2. If k, h are both kernels, then k + h is a kernel.
   Proof: v^⊤(K + H)v = v^⊤Kv + v^⊤Hv ≥ 0. □

3. If k, h are both kernels, then their Hadamard product k ⊙ h, i.e. the function
   (k ⊙ h)(a, b) = k(a, b) · h(a, b),
   is a kernel. This result is known as Schur's product theorem and is significantly less straightforward than the former two. A proof can be found in Bapat (1997).

4. If φ : 𝕐 → 𝕏 is any function over a space 𝕐, then k(φ(y), φ(y′)) is a kernel over 𝕐. In particular, k(x/s, x′/s) is a kernel for any linear scale s ∈ ℝ₊.
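These closure rules translate directly into code. A small sketch (ours) builds new kernels from old ones and sanity-checks the positive semi-definiteness of a sample Gram matrix:

import numpy as np

def k_se(a, b, lam=1.0):                       # Gaussian kernel, Eq. (4.4)
    return np.exp(-(a - b)**2 / (2 * lam**2))

def k_wiener(a, b):                            # Wiener kernel, Eq. (4.5), for a, b > 0
    return np.minimum(a, b)

k_scaled = lambda a, b: 2.0 * k_se(a, b)                     # rule 1
k_sum = lambda a, b: k_se(a, b) + k_wiener(a, b)             # rule 2
k_prod = lambda a, b: k_se(a, b) * k_wiener(a, b)            # rule 3 (Schur)
k_warped = lambda a, b: k_se(np.log1p(a), np.log1p(b))       # rule 4

x = np.linspace(0.1, 5.0, 50)
for k in (k_scaled, k_sum, k_prod, k_warped):
    K = k(x[:, None], x[None, :])              # Gram matrix on a test set
    assert np.linalg.eigvalsh(K).min() > -1e-9  # psd up to numerical noise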

These observations show that, although not quite as unconstrained as the choice of features for parametric regression, the space of kernels is still quite large. Among the key results of subsequent chapters is the insight that certain classic numerical algorithms can be interpreted as performing regression using a particular kernel. These kernels will often be associated with rather general, unstructured priors, and it will be a natural question whether more specific priors can be designed to better address individual numerical problems; that is, to what degree additional structural knowledge can help reduce the complexity of a computation.

> 4.2.2 Gaussian Process Regression

If kernels provide an infinite-dimensional extension of positive definite matrices, what does it mean to perform linear regression (in the sense of Algorithms 4.1 and 4.2) with a kernel? The probability measure implied by this use of a positive definite kernel in a covariance model is called a Gaussian process.

Definition 4.4 (Gaussian process). Consider a function μ : 𝕏 → ℝ and a positive definite kernel k : 𝕏 × 𝕏 → ℝ. The Gaussian process (GP) p(f) = 𝒢𝒫(f; μ, k) is the probability measure identified by the property that, for any finite subset X := [x₁, ..., x_N] ⊂ 𝕏, the probability assigned to the function values f_X = [f(x₁), ..., f(x_N)] is given by the multivariate Gaussian probability density p(f_X) = 𝒩(f_X; μ_X, k_XX).¹³

The extension from parametric to nonparametric regression follows directly from Eq. (4.3). Given a Gaussian process prior p(f) = 𝒢𝒫(f; μ, k) over an unknown function f : 𝕏 → ℝ, and the likelihood function from Eq. (4.2) (i.e. p(y | f) = 𝒩(y; f_X, Λ)), the posterior over f is a Gaussian process p(f | y) = 𝒢𝒫(f; m, V), with mean function m : 𝕏 → ℝ and spd function V : 𝕏 × 𝕏 → ℝ, respectively, of the form

$$m_x = \mu_x + k_{xX}(k_{XX} + \Lambda)^{-1}(y - \mu_X), \qquad (4.6)$$
$$V(x, x') = k_{xx'} - k_{xX}(k_{XX} + \Lambda)^{-1}k_{Xx'}. \qquad (4.7)$$

¹³ It is not straightforward to show that Definition 4.4 is well-formed, and what restrictions have to be put on the form of the kernel to guarantee the existence of such a family of infinitely many Gaussian random variables. For our purposes, it suffices to require the kernel to be continuous in both variables to guarantee the existence of the Gaussian process. Some hints can be found on p. 1 of Wahba's book (1990) and in Adler's (1990) in-depth treatment.

Figure 4.3 shows nine different Gaussian process priors over 𝕏 = ℝ, alongside the posterior measures arising from the nonlinear noisy data set in Figure 4.1. The plots show the diversity of the Gaussian process hypothesis space. GP models can be stationary or spatially varying, and may produce smooth or rough sample paths. The combination of basic kernels and features, using the combination rules outlined in §4.2.1, provides an extensive toolbox for the construction of prior models with specific properties.

[Figure 4.3: Analogous plot to Figure 4.2, for Gaussian process regression with various kernels. From top left to bottom right, the prior processes are identified by the following kernels: the Wiener kernel (producing Brownian motion) k(a, b) = min(a, b) + c, c = 10; the linear spline kernel k(a, b) = 1 + c − |a − b|/100, c = 1; the integrated Wiener kernel (using ã := a − 10, b̃ := b − 10) k(ã, b̃) = ⅓ min(ã, b̃)³ + ½ |ã − b̃| min(ã, b̃)²; the exponential kernel (Ornstein-Uhlenbeck process) k(a, b) = e^{−|a−b|}; the Gaussian (square exponential, radial basis function) kernel k_SE = e^{−(a−b)²}; its linearly and nonlinearly scaled versions k(a, b) = k_SE(φ(a), φ(b)) for a linear and a nonlinear input transformation φ; the additive combination k(a, b) = k_SE(a, b) + Σ_{i=0}^{2} (ab)^i; and the point-wise product k(a, b) = k_SE(a, b) · e^{−(a²+b²)/4}. As in Figure 4.2, these examples demonstrate some of the variability among Gaussian process priors. Some processes (e.g. Wiener, spline, OU) produce sample paths that are almost surely nowhere differentiable, while others are smooth (e.g. the Gaussian kernel). Some are mean-reverting and stationary, others diverge.]

1 procedure GP_Infer(y, X, Λ, k, m)
2   k_XX = k(X, X)                      // kernel at data
3   m_X = m(X)                          // mean function at data
4   G = k_XX + Λ                        // data covariance
5   R = Chol(G)                         // Cholesky decomp. of data covariance
6     // (i.e. G = R^⊤R). This is the most costly step, at O(N³).
7   α = R⁻¹(R⁻¹)^⊤(y − m_X)             // predictive weights
8 end procedure

[Algorithm 4.3: Basic implementation of nonparametric Gaussian process inference. In contrast to Algorithm 4.1, inference must be performed in function space, and has cost O(N³). Line 7 re-uses the upper triangular matrix R for efficiency, because linear problems with such matrices can be solved at cost O(N²) by back-substitution.]

Algorithms 4.3 and 4.4 provide a compact basic implementation for Gaussian process regression. Like Algorithms 4.1 and 4.2, they are separated into an inference and a prediction step. The computational load for inference on the GP posterior arising from N observations is dominated by the O(N³) steps for a matrix square root of the so-called Gram matrix G; the algorithm uses the Cholesky decomposition, but other decompositions, such as eigendecompositions, could also be used in principle. The main point of such decompositions here is that they bring the matrix into a form from which linear systems can be solved in O(N²) time. The cost of computing the mean function (the mean prediction) at a new location x* is linear in N; computing the marginal variance at such a point is quadratic in N.

Exercise 4.5 (easy). Consider two functions f : ℝ^d → ℝ and g : ℝ^d → ℝ, both drawn from independent Gaussian processes as f ~ 𝒢𝒫(μ_f, k_f) and g ~ 𝒢𝒫(μ_g, k_g). From the definition of GPs it follows that the basic Gaussian property of Eq. (3.4) carries over: if A is a (sufficiently regular) linear operator and p(x) = 𝒢𝒫(x; m, k), then p(Ax) = 𝒢𝒫(Ax; Am, AkA*). Use this property to answer the following questions.

1. Consider the real number a ∈ ℝ. The function f̃ = a · f is also distributed according to a Gaussian process. What are its mean and covariance functions?
2. The sum f + g is also distributed according to a Gaussian process. What is the mean function and the kernel of f + g?
3. The sum f − g is also distributed according to a Gaussian process. What is the mean function and the kernel of f − g?

► 4.3 Relationship to Least-Squares and Statistical Learning

In the case of Gaussian models, the probabilistic formalism is closely connected to other frameworks for inference, learning and approximation. This connection is helpful to connect Probabilistic Numerics with classic numerical point estimates.

A theorem due to Moore and Aronszajn¹⁴ shows that each positive definite kernel is associated with a space of functions known as the reproducing kernel Hilbert space (rkhs), defined through two abstract properties:¹⁵

¹⁴ This theorem was first published by Aronszajn (1950), but he attributed it to E. H. Moore. A proof can also be found in Wahba's book (1990, Thm. 1.1.1).
¹⁵ Schölkopf and Smola (2002), Def. 2.9.

1 procedure GP_Predict(x, α, R, m)      // predict at x ∈ 𝕏^D
2   k_xx = k(x, x)                      // prior predictive covariance
3   k_xX = k(x, X)                      // cross covariance
4   m_x = m(x) + k_xX α                 // posterior predictive mean
5   L = k_xX / R                        // projection from data cov. to predictive space
6   V_x = k_xx − LL^⊤                   // predictive covariance
7   s_x = Chol(V_x)^⊤ RandNormal(D) + m_x   // samples
8 end procedure

[Algorithm 4.4: Basic implementation of nonparametric Gaussian process prediction. Predicting at D input locations has cost O(D²) for a full covariance, and O(D) for a mean prediction and marginal (element-wise) variance. In the latter case, line 7 is left out, and the diagonal of V_x in line 6 can be computed cheaply using [LL^⊤]_{ii} = Σ_j [L_{ij}]². Compare with Algorithm 4.2.]
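In NumPy, Algorithms 4.3 and 4.4 may be sketched as follows (ours; note that np.linalg.cholesky returns a lower-triangular factor, while the pseudocode above uses the upper-triangular convention):

import numpy as np

def gp_infer(y, X, Lam, k, m):
    # Algorithm 4.3: O(N^3) inference via a Cholesky factor of G = k_XX + Lam
    G = k(X[:, None], X[None, :]) + Lam
    R = np.linalg.cholesky(G)                       # lower-triangular, G = R R^T
    alpha = np.linalg.solve(R.T, np.linalg.solve(R, y - m(X)))  # predictive weights
    return alpha, R

def gp_predict(x, X, alpha, R, k, m):
    # Algorithm 4.4: posterior mean and covariance at new inputs x
    k_xX = k(x[:, None], X[None, :])
    mean = m(x) + k_xX @ alpha
    L = np.linalg.solve(R, k_xX.T)                  # projection to predictive space
    cov = k(x[:, None], x[None, :]) - L.T @ L
    return mean, cov

k = lambda a, b: np.exp(-(a - b)**2 / 2)            # SE kernel
m = np.zeros_like                                   # zero prior mean
X = np.linspace(-2, 2, 10)
alpha, R = gp_infer(np.sin(X), X, 1e-6 * np.eye(10), k, m)
mean, cov = gp_predict(np.linspace(-3, 3, 7), X, alpha, R, k, m)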

Definition 4.6. A Hilbert space ℋ of functions f : 𝕏 → ℝ is called an rkhs if there exists a function k : 𝕏 × 𝕏 → ℝ such that

1. for every x ∈ 𝕏, the function k(·, x) is in ℋ, and
2. k has the reproducing property that the value of every f ∈ ℋ at any x ∈ 𝕏 can be written as f(x) = ⟨f(·), k(·, x)⟩ (where ⟨·, ·⟩ is the inner product of ℋ).

Using the rkhs notion, the GP posterior mean estimate m_x of Eq. (4.6) can equivalently be derived as a nonparametric least-squares estimate, also known as the kernel ridge regression estimate: m_x equals the minimiser of a regularised empirical risk functional over the rkhs associated with k:¹⁶

$$m_x = k_{xX}(k_{XX} + \sigma^2 I)^{-1} y = \operatorname*{argmin}_{f \in \mathcal{H}_k} \left[ \sigma^2\, \|f\|_{\mathcal{H}_k}^2 + \sum_{i=1}^N \left(y_i - f(x_i)\right)^2 \right], \qquad (4.8)$$

where ‖f‖_{ℋ_k} is the rkhs norm of f.¹⁷ The corresponding statement for the posterior standard deviation

$$\sigma(x) := \sqrt{V(x, x)} = \sqrt{k_{xx} - k_{xX}(k_{XX} + \Lambda)^{-1}k_{Xx}} \qquad (4.9)$$

is given by the following theorem about the worst-case approximation error of ridge regression. For simplicity, we focus on the special case of function values observed without noise, which is most relevant for classic numerical problems.¹⁸

¹⁶ A proof of this theorem can be found in Kanagawa et al. (2018), a paper that also contains many other results on the connection between the GP and kernel formalisms. The proof requires an appeal to additional results, which are left out here for brevity.
¹⁷ Note that, in this text, the meaning of the symbol k is overloaded. On the one hand, it signifies the covariance function of a Gaussian process (gp) and, on the other hand, the kernel of an rkhs. This double use of k is common practice due to the similarity of both cases.
¹⁸ For the general case of Λ ≠ 0, see Kanagawa et al. (2018), §3.4.
Theorem 4.8. Let k be a positive definite kernel over 𝕏, and ℋ_k its associated rkhs. Consider a function f ∈ ℋ_k with ‖f‖² ≤ 1, and a finite subset X ∈ 𝕏^N. Denote y = f_X ∈ ℝ^N. Then, for any x ∈ 𝕏, the approximation error between f_x := f(x) ∈ ℝ and m_x ∈ ℝ as defined by Eq. (4.6) is tightly bounded above by the associated GP posterior marginal variance for Λ → 0:

$$\sup_{f \in \mathcal{H}_k,\ \|f\| \leq 1} (m_x - f_x)^2 = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx}.$$

Exercise 4.7 (moderate, solution on p. 359). Prove Theorem 4.8. Hint: use the reproducing property from Definition 4.6, and the Cauchy-Schwarz inequality |⟨f, g⟩|² ≤ ⟨f, f⟩ · ⟨g, g⟩ for f, g ∈ ℋ.

This statement shows that in the case of error-free evaluations of f, the posterior variance of Gaussian process regression (the probabilists' expected error) equals the worst-case approximation error if the true function is an element of the rkhs of unit-bounded norm.¹⁹

¹⁹ For clarity: the posterior standard deviation σ_x of Eq. (4.9), in general, is not itself an element of the rkhs.
Thus, if we are faced with the problem of inferring (estimating, approximating) a function f : 𝕏 → ℝ from observations (x_i, y_i)_{i=1,...,N}, we now have two quite different viewpoints on the very same point- and error-estimates:

1. Assign a Gaussian process prior p(f) = 𝒢𝒫(f; 0, k) over a hypothesis class of functions and use the asymptotic Gaussian likelihood p(y | f) = ∏_{i=1}^N δ(y_i − f(x_i)). The resulting posterior is a Gaussian process with mean m_x and marginal variance σ_x².

2. Assume that the true function f is an element of the rkhs ℋ_k associated with k, and decide to construct the regularised least-squares estimate for f within ℋ_k (Eq. (4.8)), which is given by m_x. The distance between this estimate and the true f(x) (if the assumption f ∈ ℋ_k is correct!) is bounded above, up to an unknown constant (the rkhs-norm of f), by σ_x.
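A quick numerical sanity check of the first half of this equivalence (Eq. (4.8)) is possible: the script below (ours) verifies that the GP posterior-mean weights minimise the regularised empirical risk among representer functions f(·) = Σ_i α_i k(·, x_i):

import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-(a - b)**2 / 2)
X = rng.uniform(-3, 3, 8)
y = np.sin(X) + 0.1 * rng.standard_normal(8)
sigma2 = 0.01
K = k(X[:, None], X[None, :])

def risk(alpha):
    # sigma^2 ||f||_H^2 + sum_i (y_i - f(x_i))^2, with ||f||_H^2 = alpha^T K alpha
    r = y - K @ alpha
    return sigma2 * alpha @ K @ alpha + r @ r

alpha_gp = np.linalg.solve(K + sigma2 * np.eye(8), y)   # GP mean weights, Eq. (4.6)
others = [risk(alpha_gp + 0.01 * rng.standard_normal(8)) for _ in range(100)]
print(risk(alpha_gp) <= min(others))                    # True: no perturbation wins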

In many sections of this text, this equivalence between probabilistic and statistical estimation will be used to construct probabilistic re-interpretations of classic numerical methods. The juxtaposition above highlights at least one reason why it is helpful to have access to both formulations, and thus why it makes sense to study Probabilistic Numerics in addition to classic methods: in contrast to an empirical risk functional, a prior is a generative object. We can study draws from the prior to analyse and criticise the assumptions encoded in it. If a particular GP prior does not encode certain properties we tangibly know about our concrete problem, we may change the prior to better reflect our knowledge. It is much harder to do so based on an empirical risk. We also note that while the Gaussian inference formulation adds this conceptual strength, it does not take anything away from the statistical one: both in terms of intrinsic assumptions and computational demands, the two are equivalent to each other.

There are, however, also subtle differences between the two formulations. In particular, draws from a Gaussian process are in general not elements of the rkhs.²⁰ Instead they come from a different space that, in general, is slightly "larger" and "rougher", surrounding the rkhs. This aspect will have tangible consequences for the error analysis of probabilistic numerical methods. In §II and later, we will see that probabilistic error estimates of classic methods are usually overly cautious on the problems that these methods are typically applied to, an indication that these methods are too generic, and can be improved upon with additional prior information.

²⁰ See p. 131 in Rasmussen and Williams (2006). Detailed analyses are offered by Driscoll (1973) and, more recently, Steinwart (2019). There is also a detailed discussion in §4 of Kanagawa et al. (2018).

[Figure 4.4: Gaussian process inference on a function f from evaluations of the function (f(x = −4)), its derivative (f′(x = −1.5)), and integrals (∫_a^b f(x) dx for [a, b] = [1, 3.5] and [2, 3]; plotted are the average function values, i.e. the value of the integral divided by (b − a)). Each observation is corrupted by Gaussian noise of varying degree (plotted error bars). These observations have a structured effect on the uncertainty assigned to function values. For example, the derivative observation constrains the derivatives of f, but contains only weak information about the absolute deviation of f from the mean function.]

► 4.4 Inference from and on Derivatives and Integrals

The closure of Gaussian distributions under linear projections (Eq. (3.4)) allows for an elegant extension of regression on functions from observed function values, which is of great importance for numerical uses of the Gaussian inference framework. Consider a real-valued function over a Euclidean vector space, f : ℝ^N → ℝ, and assign the Gaussian process prior p(f) = 𝒢𝒫(f; m, k). Assume that the mean function m(x) and the kernel k(x, x′) are at least q-times continuously differentiable in all their arguments, and integrable against the measure ν : ℝ^N → ℝ. Then it follows²¹ from Eq. (3.4) that all partial derivatives and integrals against ν are jointly Gaussian (process) distributed, with mean functions

$$\mathbb{E}\left[\frac{\partial^\ell f(x)}{\partial x_i^\ell}\right] = \frac{\partial^\ell m(x)}{\partial x_i^\ell} \quad \text{for } 0 \leq \ell \leq q, \qquad \mathbb{E}\left[\int f(x)\, d\nu(x)\right] = \int m(x)\, d\nu(x),$$

and covariance functions of the general form

$$\operatorname{cov}\left(\frac{\partial^\ell f(x)}{\partial x_i^\ell},\ \frac{\partial^m f(x')}{\partial x_j'^{\,m}}\right) = \frac{\partial^{\ell+m}\, k(x, x')}{\partial x_i^\ell\ \partial x_j'^{\,m}},$$
$$\operatorname{cov}\left(\int f(x)\, d\nu(x),\ \frac{\partial^m f(x')}{\partial x_j'^{\,m}}\right) = \int \frac{\partial^m k(x, x')}{\partial x_j'^{\,m}}\, d\nu(x),$$

and analogously for mixed partial derivatives and higher-order integrals. Of course, these results can be used to perform inference on integrals from derivative observations and function values, and the other way round. This tool will be a staple of the probabilistic numerical methods derived in later chapters.

²¹ There are subtleties involved due to the nonparametric nature of the Gaussian process measure, so the appeal to Eq. (3.4) is simplistic, made with practicality in mind. A more precise statement is that, if L is a linear operator acting on f, then Lf is a Gaussian process with mean function Lm and covariance function LkL*, if Lm and Lk(·, x′) are bounded. The specific result for derivatives can be found on p. 27 in Adler (1981). For a technical treatment, see Akhiezer and Glazman (2013). Thanks to Simo Särkkä for pointing out these references. A more recent discussion is offered by Owhadi and Scovel (2015).

The kernels used in this context in later chapters include, most prominently, the Wiener process and its integrals (see §5.4). Also in wide use, mostly for algebraic convenience, is the Gaussian (aka square-exponential, radial basis function) kernel

$$k_{\text{SE}}(a, b) := \exp\left(-\frac{(a - b)^2}{2\lambda^2}\right).$$

The Gaussian kernel amounts to an extreme assumption of smoothness on f, but its extremely convenient algebraic properties nevertheless make it a versatile tool for rather regular problems, if extreme numerical precision is not the primary goal. For easy reference, we include explicit forms for the most frequent combinations of partial derivatives and integrals here, using the shorthand Φ(z) := erf(z/(√2 λ)). If p(f) = 𝒢𝒫(f; 0, k_SE), then

$$\operatorname{cov}\left(\int_a^b f(x)\,dx,\ \int_c^d f(x')\,dx'\right) = \int_a^b\!\!\int_c^d k_{\text{SE}}(x, x')\, dx'\, dx$$
$$= \sqrt{\tfrac{\pi}{2}}\,\lambda\,\big[-(d - b)\Phi(d - b) + (d - a)\Phi(d - a) + (c - b)\Phi(c - b) - (c - a)\Phi(c - a)\big]$$
$$\quad + \lambda^2\big[-k_{\text{SE}}(d, b) + k_{\text{SE}}(d, a) + k_{\text{SE}}(c, b) - k_{\text{SE}}(c, a)\big],$$

$$\operatorname{cov}\left(\int_a^b f(x')\,dx',\ f(x)\right) = \int_a^b k_{\text{SE}}(x, x')\, dx' = \sqrt{\tfrac{\pi}{2}}\,\lambda\,\big[\Phi(b - x) - \Phi(a - x)\big],$$

$$\operatorname{cov}\left(\int_a^b f(x')\,dx',\ \frac{df(x)}{dx}\right) = \int_a^b \frac{\partial k_{\text{SE}}(x, x')}{\partial x}\, dx' = k_{\text{SE}}(x, a) - k_{\text{SE}}(x, b),$$

$$\operatorname{cov}\left(f(x'),\ \frac{df(x)}{dx}\right) = \frac{\partial k_{\text{SE}}(x', x)}{\partial x} = \frac{x' - x}{\lambda^2}\, k_{\text{SE}}(x', x),$$

$$\operatorname{cov}\left(\frac{df(x')}{dx'},\ \frac{df(x)}{dx}\right) = \frac{\partial^2 k_{\text{SE}}(x', x)}{\partial x'\,\partial x} = \left(\frac{1}{\lambda^2} - \frac{(x' - x)^2}{\lambda^4}\right) k_{\text{SE}}(x', x),$$

$$\operatorname{cov}\left(\frac{d^2 f(x)}{dx^2},\ \int_a^b f(x')\,dx'\right) = \lambda^{-2}\big[(x - b)\,k_{\text{SE}}(x, b) - (x - a)\,k_{\text{SE}}(x, a)\big],$$

$$\operatorname{cov}\left(\frac{d^2 f(x')}{dx'^2},\ f(x)\right) = \left(\frac{(x' - x)^2}{\lambda^4} - \frac{1}{\lambda^2}\right) k_{\text{SE}}(x', x).$$
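These formulas are all that is needed for a toy probabilistic integrator. The sketch below (ours; λ = 1, zero prior mean, noise-free evaluations) infers ∫₀² sin(x) dx from eight function values, using the first two covariances above:

import numpy as np
from math import erf, sqrt, pi

lam = 1.0
k_se = lambda a, b: np.exp(-(a - b)**2 / (2 * lam**2))
Phi = np.vectorize(lambda z: erf(z / (sqrt(2.0) * lam)))

def cov_int_f(a, b, x):
    # cov( int_a^b f , f(x) ) = sqrt(pi/2) lam [Phi(b - x) - Phi(a - x)]
    return sqrt(pi / 2) * lam * (Phi(b - x) - Phi(a - x))

def var_int(a, b):
    # the double-integral covariance above, with [c, d] = [a, b]
    return (sqrt(pi / 2) * lam * 2 * (b - a) * Phi(b - a)
            + 2 * lam**2 * (k_se(a, b) - 1.0))

X = np.linspace(0.0, 2.0, 8)                          # evaluation nodes
K = k_se(X[:, None], X[None, :]) + 1e-10 * np.eye(8)  # jitter for stability
c = cov_int_f(0.0, 2.0, X)
w = np.linalg.solve(K, c)                             # quadrature weights
est = w @ np.sin(X)                                   # posterior mean of the integral
std = sqrt(max(var_int(0.0, 2.0) - c @ w, 0.0))       # posterior standard deviation
print(est, std, 1 - np.cos(2.0))                      # estimate vs. ground truth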
5
Gauss-Markov Processes
Filtering and Stochastic Differential Equations

As presented above, Gaussian process regression is a rather general formulation of inference on a function from finitely many data. However, computing the posterior's parameters associated with N real-valued data, in Eqs. (4.6) and (4.7), involves the inversion of an N × N covariance matrix; hence the computational cost of doing so in general rises faster than N (general matrix inversion has cost O(N³), see also §III). This nonlinear cost is an obstacle to the application of this framework for the design of efficient numerical algorithms. Ideally, a numerical routine should be able to keep running virtually indefinitely and increase its precision over time. This requires an inference framework whose computational complexity scales at most linearly, O(N). Parametric regression, introduced in §4.1, has this property; but it also has finite model capacity, and can thus only compute approximations of limited quality.
Some numerical problems have a trait that offers another way to address this challenge. They construct an ordered sequence of estimates by moving along a one-dimensional path parametrised by a variable t that, for intuition, we may as well call time. The central examples are:

Quadrature: Univariate integration methods estimate an integral F = ∫_a^b f(t) dt by evaluating f for various values t_i ∈ ℝ, stepping from one end of the integration domain to the other.

ODEs: Solvers for ordinary differential equations estimate a curve (a continuous function of univariate input) x(t), t ∈ ℝ, such that x′(t) = f(x(t), t). In doing so they evaluate the multivariate function f for various values of t and x (often the sequence of values t_i is ordered). Of course, such solvers can be thought of as solving a univariate integration problem, where the integration path is itself an (unknown) curve.

Nonlinear optimisation: Algorithms that search for local extrema x* = arg min f(x) of a real-valued differentiable function f : ℝ^N → ℝ usually construct a sequence of estimates x_t ∈ ℝ^N designed to hopefully converge to the true x*. As in the ODE setting, the estimates themselves are multivariate; but the sequence of estimates can be thought of as lying on a (discretised) curve. This notion is less precise than the two above, since it is not always clear how to account for varying step-lengths x_t − x_{t−1}. The locations x_t actually lie in a high-dimensional vector space with a natural Euclidean metric, so the "times" separating the points x_t are subject to some subtle scaling. Nevertheless, the notion of a time-series will turn out to be helpful anyway in §IV.

► 5.1 Markov Chains and Message Passing on Chain Graphs

Parametric regression models keep a finite global memory. But in the above settings, with observations on an ordered one-dimensional space, we may also try to construct models that pass a finite amount of information forward and backward along the line. This notion of a finite memory through time is formalised by the Markov property.¹

¹ Markov (1906).

Definition 5.1 (Markov chain). Consider a discrete set of latent variables (called states) X = [x_t]_{t=0,...,T} with joint probability density p(X). Then p is a Markov chain if it has the Markov property that x_t is conditionally independent of all preceding states given the direct precursor x_{t−1}:

$$p(x_t \mid x_0, x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-1}). \qquad (5.1)$$

In this sense, the states collected in x_t can be thought of as a kind of sufficient local memory. If we know the state at any point in time, we know all that is possible to know about the future behaviour of the system at time t. In addition, assume observations y_t that only directly relate to the local state x_t:

$$p(y_t \mid X) = p(y_t \mid x_t).$$

These two structural restrictions on a probabilistic model suffice to make inference on the latent variables linear in T.

[Figure 5.1: Graphical model (factor graph) for a Markov chain. In this bipartite notation, every white node represents a latent variable, every black node an observable, and every grey box a factor in the joint probability distribution of (X, y). Edges in the graph connect variables and observables to the factors in which they appear. The Markov property is visibly reflected in the fact that this graph is a chain. If all the factors are Gaussian, this is the generic form of a linear state-space model. If they additionally all have the same parameters, it is a linear time-invariant model.]

A first interesting observation is that the Markov property allows recursive prediction. Assume we have observed the time series y_{0:t−1} := [y₀, ..., y_{t−1}] and want to predict the states as described by p(x_t | y_{0:t−1}).
This posterior marginal distribution is given by

$$p(x_t \mid y_{0:t-1}) = \frac{\int p(X)\, p(y_{0:t-1} \mid X)\, \prod_{j \neq t} dx_j}{\int p(X)\, p(y_{0:t-1} \mid X)\, dX} = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{0:t-1})\, dx_{t-1},$$

because the likelihood only involves the states x_{0:t−1}, the transition factors p(x_i | x_{i−1}) for i > t integrate to one, and the remaining terms combine into the posterior p(x_{t−1} | y_{0:t−1}).

In other words, it is possible to iteratively compute a posterior over x_t given all previous observations, by an alternating sequence of prediction-update steps that only involve local quantities:

Predict: use the posterior from the previous step to compute²

$$p(x_t \mid y_{0:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{0:t-1})\, dx_{t-1}. \qquad (5.2)$$

If t = 0, start the induction with the prior p(x₀).

Update: include the local observation into the posterior by Bayes' theorem:

$$p(x_t \mid y_{0:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})}{\int p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})\, dx_t}. \qquad (5.3)$$

Expression (5.2) is known as the Chapman-Kolmogorov equation. It can be found as Eqs. (5) and (5*) in a seminal paper by Kolmogorov (1936).

² These steps are a special case of a general algorithm called belief propagation or the sum-product algorithm (Pearl, 1988; Lauritzen and Spiegelhalter, 1988). It formalises the computational cost of inference in joint probability distributions given their factorisation into local terms. There are visual formal languages capturing such factorisation properties, known as graphical models (e.g. Figure 5.1).

The situation is only slightly more complicated if, having collected all observations y = y_{0:T}, one subsequently has to compute a marginal distribution p(x_t | y) for some 0 ≤ t ≤ T. Due to the Markov property, x_t is conditionally independent of later observations given x_{t+1}:

$$p(x_t \mid x_{t+1}, y) = p(x_t \mid x_{t+1}, y_{0:t}), \qquad (5.4)$$

and this quantity is of the form

$$p(x_t \mid x_{t+1}, y_{0:t}) = \frac{p(x_t, x_{t+1} \mid y_{0:t})}{p(x_{t+1} \mid y_{0:t})} = \frac{p(x_{t+1} \mid x_t, y_{0:t})\, p(x_t \mid y_{0:t})}{p(x_{t+1} \mid y_{0:t})} = \frac{p(x_{t+1} \mid x_t)\, p(x_t \mid y_{0:t})}{p(x_{t+1} \mid y_{0:t})}. \qquad (5.5)$$

Using this, we can write the marginal posterior on x_t as follows:³

$$p(x_t \mid y) = \int p(x_t, x_{t+1} \mid y)\, dx_{t+1} = \int p(x_t \mid x_{t+1}, y)\, p(x_{t+1} \mid y)\, dx_{t+1} = \int p(x_t \mid x_{t+1}, y_{0:t})\, p(x_{t+1} \mid y)\, dx_{t+1}. \qquad (5.6)$$

Then, we just insert Eq. (5.5) into Eq. (5.6),⁴ which yields

$$p(x_t \mid y) = p(x_t \mid y_{0:t}) \int p(x_{t+1} \mid x_t)\, \frac{p(x_{t+1} \mid y)}{p(x_{t+1} \mid y_{0:t})}\, dx_{t+1}. \qquad (5.7)$$

Exercise 5.2 (easy). Show that Eq. (5.4) holds. Hint: use Bayes' theorem and the Markov property (5.1).

³ This derivation is based on the exposition in §8.1 of Särkkä (2013).
⁴ Kitagawa (1987).

So to compute the marginal p(x_t | y) for all 0 ≤ t ≤ T, using the notion of time, we can think of this as first performing a forward pass as described above to compute the predictions p(x_t | y_{0:t−1}) and updated beliefs p(x_t | y_{0:t}). The final update step in that pass provides p(x_T | y), from which we can start a backward pass to compute the posterior marginals p(x_t | y) using Eq. (5.7) and all terms computed in the forward pass. This nomenclature, of passing messages along the Markov chain, is popular in statistics and machine learning, and applies more generally to tree-structured graphs (i.e. not just chains).⁵ In signal processing, where time series play a particularly central role, the forward pass is known as filtering, while the backward updates are known as smoothing. Figure 5.2 depicts the output of these methods. The next section spells out their computations in the case when the underlying state-space model is linear and Gaussian.

⁵ Bishop (2006), §8.4.2.

► 5.2 Linear Gaussian State-Space Models

The abstract form of the message-passing algorithm above becomes particularly elegant if all prior and conditional distributions in the model are Gaussian, involving only linear relations.

[Figure 5.2: Linear time-invariant filtering and smoothing in a discrete time series. At each discrete time step (next to each other for legibility), the plot shows the predictive distribution (5.10), the observation with likelihood (5.9), the intermediate estimation (filtering) posterior (5.3), and the smoothed posterior (5.14). If the state-space model is linear and Gaussian, the estimation and smoothed posteriors are computable by the Kalman filter and smoother, respectively. At each location, the variance drops from prediction to estimation belief, and from estimation to smoothed belief.]

We will assume the following forms, using popular notational conventions for the linear maps involved:

$$p(x_0) = \mathcal{N}(x_0; m_0, P_0), \qquad (5.8)$$
$$p(x_{t+1} \mid x_t) = \mathcal{N}(x_{t+1}; A_t x_t, Q_t),$$
$$p(y_t \mid x_t) = \mathcal{N}(y_t; H_t x_t, R_t). \qquad (5.9)$$

The latter two relations are also often written as, and known as, the dynamic model and the measurement model, respectively:

$$x_{t+1} = A_t x_t + \xi_t \quad \text{with } \xi_t \sim \mathcal{N}(0, Q_t) \quad \text{(dynamic)},$$
$$y_t = H_t x_t + \zeta_t \quad \text{with } \zeta_t \sim \mathcal{N}(0, R_t) \quad \text{(measurement)}.$$
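As a concrete instance, here is a short simulation (ours; the parameter values are made up for illustration) of such a linear time-invariant system; it also produces data that the filter code further below can consume:

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # e.g. position-velocity dynamics
Q = 0.01 * np.eye(2)                     # process noise covariance
H = np.array([[1.0, 0.0]])               # measure the position only
R = np.array([[0.25]])                   # observation noise covariance

x, xs, ys = np.zeros(2), [], []
for t in range(100):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # dynamic model
    y = H @ x + rng.multivariate_normal(np.zeros(1), R)   # measurement model
    xs.append(x); ys.append(y)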

In signal processing and control theory, the matrices A_t, H_t, Q_t, R_t are known as the transition and measurement matrices, and the process-noise and observation-noise covariances, respectively. We will sometimes use these intuitions evoking a dynamical system, but they will not always fit directly to the numerical domain. The entire setup is known as a linear Gaussian system in control theory. However, to avoid confusion with the regression models defined in §4.1 and §4.2 (which are also Gaussian and linear), we will use another popular convention and call this a linear Gaussian state-space model, to stress that the inference takes place in terms of the time-varying states x_t. If the parameters are independent of time, A_t = A, Q_t = Q, H_t = H and R_t = R for all t, they define a linear time-invariant (LTI) system.

Under these assumptions, the predictive, updated and posterior marginals constructed above all retain Gaussian form. The Chapman-Kolmogorov equation (5.2) becomes

$$p(x_{t+1} \mid y_{0:t}) = \mathcal{N}(x_{t+1}; m^-_{t+1}, P^-_{t+1}), \quad \text{with} \qquad (5.10)$$
$$m^-_{t+1} = A_t m_t \quad \text{and} \quad P^-_{t+1} = A_t P_t A_t^\top + Q_t. \qquad (5.11)$$

1 procedure Filter(m_{t−1}, P_{t−1}, A, Q, H, R, y)
2   m⁻ = A m_{t−1}                      // predictive mean
3   P⁻ = A P_{t−1} A^⊤ + Q              // predictive covariance
4   z = y − H m⁻                        // residual
5   S = H P⁻ H^⊤ + R                    // innovation covariance
6   K = P⁻ H^⊤ S⁻¹                      // gain
7   m_t = m⁻ + K z                      // updated mean
8   P_t = (I − K H) P⁻                  // updated covariance
9   return (m_t, P_t), (m⁻, P⁻)
10 end procedure

[Algorithm 5.1: Single step of the Kalman filter. Variables indexed by t are outputs, while the variables without indices are only used internally and can thus be overwritten. If the latent state is of dimensionality x_t ∈ ℝ^L and the observations have dimensionality y_t ∈ ℝ^V, then an individual step of the filter has computational complexity O(L³ + V³) (for the inversion of S in line 6, and the matrix-matrix multiplications in the latent states in line 3, respectively).]

The update, Eq. (5.3), becomes

$$p(x_t \mid y_{0:t}) = \mathcal{N}(x_t; m_t, P_t),$$

where the parameters of this distribution can be computed using the following procedure, giving explicit names to intermediate terms (capital letters denote matrices, lower-case letters vectors):

$$z_t := y_t - H_t m^-_t \quad \text{(innovation residual)}, \qquad (5.12)$$
$$S_t := H_t P^-_t H_t^\top + R_t \quad \text{(innovation covariance)},$$
$$K_t := P^-_t H_t^\top S_t^{-1} \quad \text{(gain)},$$
$$m_t := m^-_t + K_t z_t,$$
$$P_t := (I - K_t H_t)\, P^-_t. \qquad (5.13)$$

Exercise 5.3 (easy). Using the basic properties of Gaussians from Eqs. (3.4), (3.8) & (3.10) and the prediction-update Eqs. (5.2) & (5.3), show that Eqs. (5.10) to (5.13) hold.

Many of these terms originate from the field of control theory, where they were made popular by Rudolf Kalman (1960). The whole prediction-update scheme above is thus widely known as the Kalman filter, and the quantity K_t is called the Kalman gain in his honour.⁶

⁶ The Kalman gain is also known as the optimal gain for historical reasons (because it is the optimal sensitivity of an electrical filter designed to follow the dynamical system defined above). In our context, this name is redundant. Given the state-space model, the posterior is not just the optimal, but really the only meaningful probabilistic estimator.

The “full” posterior marginals - those arising from all observations, in the past and the future - are also Gaussian,

p(x_t | y) = N(x_t; m^s_t, P^s_t).   (5.14)
Algorithm 5.2: Single step of the RTS smoother. Notation as in Algorithm 5.1. The smoother, since it does not actually touch the observations y_t, has complexity O(L^3).

1  procedure Smoother(m_t, P_t, A, m^-_{t+1}, P^-_{t+1}, m^s_{t+1}, P^s_{t+1})
2      G = P_t A^T (P^-_{t+1})^{-1}                   // gain
3      m^s_t = m_t + G (m^s_{t+1} - m^-_{t+1})        // posterior mean
4      P^s_t = P_t + G (P^s_{t+1} - P^-_{t+1}) G^T    // posterior covariance
5  end procedure
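A matching NumPy sketch of Algorithm 5.2, under the same assumptions as the filter sketch above (dense matrices; names ours):

import numpy as np

def rts_smoother_step(m, P, A, m_pred_next, P_pred_next, m_s_next, P_s_next):
    # One step of Algorithm 5.2; P_pred_next is P^-_{t+1}, m_s_next is m^s_{t+1}.
    G = np.linalg.solve(P_pred_next, A @ P).T        # line 2: smoother gain
    m_s = m + G @ (m_s_next - m_pred_next)           # line 3: smoothed mean
    P_s = P + G @ (P_s_next - P_pred_next) @ G.T     # line 4: smoothed covariance
    return m_s, P_s

The transpose trick works because P and P^-_{t+1} are symmetric: solve(P^-_{t+1}, A P)^T = P A^T (P^-_{t+1})^{-1}.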

Algorithm 5.3: Algorithmic wrapper around Algorithm 5.1 to perform online prediction at a sequence of points t_i. Predict has constant memory requirements over time, and its computation cost is constant per time step. Thus inference from N observations requires N steps of Algorithm 5.1, at cost O(N(L^3 + V^3)). See the Bayesian ODE filter (Algorithm 38.1); one of its special cases, the EKF0, is analogous to this Algorithm 5.3 (as discussed in §38.3.4).

1  procedure Predict(m_0, P_0, [A_t, Q_t, H_t, y_t, R_t]_{t=1,...,M})
2      for t = 1, ... do
3          (m_t, P_t) = Filter(m_{t-1}, P_{t-1}, A_t, Q_t, H_t, R_t, y_t)
4      end for
5  end procedure

Their form can be written, from Eq. (5.7), using the parameters of the prediction and updated beliefs, as follows:
Let G_t := P_t A_t^T (P^-_{t+1})^{-1}   (smoother gain),
then m^s_t = m_t + G_t (m^s_{t+1} - m^-_{t+1}),
and P^s_t = P_t + G_t (P^s_{t+1} - P^-_{t+1}) G_t^T.   (5.15)

These update rules are often simply called the “Kalman smoother” or, more precisely, the (fixed-interval) Rauch-Tung-Striebel (RTS) smoother equations.⁷ The estimates computed by the Kalman filter and smoother are depicted in Figure 5.2.

⁷ Rauch, Striebel, and Tung (1965)
For reference, Algorithms 5.1 and 5.2 summarise the above results in pseudo-code, providing the individual steps for both the filter and the smoother. The wrapper algorithm Predict (Algorithm 5.3) solves the task of continuously predicting the subsequent state of a time series. For a finite time series running from t = 0 to t = T, the algorithm Infer (Algorithm 5.4) returns posterior marginals for every time step t. Since these marginals (that is, their means and variances) are exactly equal to those of GP regression (§4.2.2), algorithm Infer (Algorithm 5.4) is nothing but a linear-time implementation of GP regression with Markov priors.⁸

⁸ Särkkä and Solin (2019), §12.4

Algorithm 5.4: Algorithmic wrapper around Algorithms 5.1 and 5.2 to perform inference for a time series of finite length. While the runtime of this routine is also linear in T, in contrast to Algorithm 5.3 it has linearly growing memory cost, to store all the parameters of the posterior and intermediate distributions. See the Bayesian ODE smoother (Algorithm 38.2); one of its special cases, the EKS0, is analogous to this Algorithm 5.4 (as discussed in §38.3.4).

1   procedure Infer(m_0, P_0, [A_t, Q_t, H_t, y_t, R_t]_{t=1,...,T})
2       for t = 1 : 1 : T do
3           ((m_t, P_t), (m^-_t, P^-_t)) =
4               Filter(m_{t-1}, P_{t-1}, A_{t-1}, Q_{t-1}, H_t, R_t, y_t)
5       end for
6       for t = T - 1 : -1 : 0 do
7           (m^s_t, P^s_t) =
8               Smoother(m_t, P_t, A_t, m^-_{t+1}, P^-_{t+1}, m^s_{t+1}, P^s_{t+1})
9       end for
10  end procedure

The Kalman filter (and smoother) are so efficient that they can sometimes be applied even if the linearity or Gaussianity of the dynamic or measurement model is violated.⁹ In numerics, the case for such fast-and-Gaussian methods is even stronger

because (compared with conventional statistics) saved computational budget can be invested in finer discretisations. Thus, all probabilistic numerical methods in this text use Gaussian methods (such as the Kalman filter and smoother) for probabilistic regression - with the sole exception of the probabilistic ODE solvers presented in §40.3, where a stronger case for nonparametric posteriors can be made due to the highly nonlinear dynamics of ODEs. Accordingly, the standard sequential Monte Carlo versions of filters and smoothers, known as particle filters and smoothers,¹⁰ will not appear before the ODE chapter VI. There, they will be concisely introduced in §38.4.

⁹ Särkkä (2013), §13.1
¹⁰ Doucet, Freitas, and Gordon (2001)

► 5.3 Linear Stochastic Differential Equations

For a moment, let us put aside the observations y and only consider the prior defined by the dynamic model from Eq. (5.11). The predictive means of this sequence of Gaussian variables follow the discrete linear recurrence relation m_{t+1} = A_t m_t. When solving numerical tasks, the time instances t will not usually be immutable discrete locations on a regular grid, but values chosen by the numerical method itself, on a continuous spectrum. We thus require a framework creating continuous curves x(t) for t > t_0 ∈ R that are consistent with such linear recurrence equations. For deterministic quantities, linear differential equations are that tool: Consider the linear (time-invariant) dynamical system for x(t) ∈ R^N,

dx(t)/dt = F x(t),   and assume x(t_0) = x_0.   (5.16)

This initial value problem is solved by the curve¹¹

x(t) = exp(F(t - t_0)) x_0.

Thus, the linear ordinary differential equation (5.16), together with a set of discrete time locations [t_0, t_1, ..., t_N = T], gives rise to the linear recurrence relation

x_{t_{i+1}} = A_{t_i} x_{t_i}   with   A_{t_i} = exp(F(t_{i+1} - t_i)).

For clarity: the solution to a linear time-invariant ODE at discrete time locations is a linear recurrence relation. This does not mean that every recurrence relation can be identified with the solution of one particular linear ODE, or in fact with any ODE at all. But one way to construct recurrence relations is all we need to build interesting probabilistic inference algorithms.

¹¹ This uses the matrix exponential

e^X = exp(X) := Σ_{i=0}^{∞} X^i / i!,   (5.17)

where X^i := X ⋯ X (i times) is the ith power of X (defining X^0 := I). The exponential exists for every complex-valued matrix. Among its properties are:
▸ e^0 = I (for the zero matrix);
▸ if XY = YX, then e^X e^Y = e^Y e^X = e^{X+Y}. Thus, every exponential is invertible: (e^X)^{-1} = e^{-X};
▸ if Y is invertible, then e^{YXY^{-1}} = Y e^X Y^{-1}; thus, det e^X = e^{tr X};
▸ if D = diag_i(d_i), then e^D = diag_i(e^{d_i}) (using the scalar exponential). In particular, if X = VDV^{-1} is the eigendecomposition of X, then e^X = V e^D V^{-1} provides a practical way to compute the exponential of diagonalisable matrices. Another important way to compute matrix exponentials is if X is nilpotent (i.e. there exists an m ∈ N such that X^m = 0), in which case the exponential can be explicitly computed by Eq. (5.17).

The construction of probabilistic models for continuous time series requires a generalisation of the above construction for deterministic linear dynamic systems to stochastic processes: We need a way to build the Gaussian density around the mean prediction m_{t_{i+1}} = A_{t_i} m_{t_i}, such that the conditional distribution has the form of Eq. (5.10). In particular, given x_{t_i} = m_{t_i}, we would like a relative, or “differential”, way to add a centred Gaussian disturbance ξ_t ~ N(0, Q_t) over the time interval [t_i, t_{i+1}].

Such a generalisation is provided by linear stochastic differential equations (linear SDEs). Inconveniently, the stochastic paths that solve such “differential” equations are a.s. nowhere differentiable and thus have a.s. infinite total variation.¹² A rigorous definition of SDEs thus requires a new integral w.r.t. such a stochastic process (the Itô integral), which gives rise to a new kind of calculus. To avoid its complicated introduction, we will, however, only provide an intuitive definition of SDEs. The full-blown theory can, e.g., be found in the detailed book by Karatzas and Shreve (1991) or in the more concise one by Øksendal (2003); if slightly less rigour is acceptable to the reader, we recommend the more accessible book by Särkkä and Solin (2019).

Above, general Gaussian processes were defined as an infinite-dimensional generalisation of finite-dimensional distributions. In similar fashion, we will define a linear SDE as the abstract object giving rise to Gaussian probabilistic recurrence relations for finite time steps.

¹² Øksendal (2003), §3.1
¹³ In addition to sidestepping complications involved with general stochastic processes on the real line, the simplified definition for SDEs chosen here also attempts to avoid confusion with the definition of probabilistic algorithms for the solution of deterministic ordinary differential equations in Chapter VI. The aim in that chapter will be to cast the solution of ODEs as the construction of linear SDEs that concentrate as much probability mass as possible around the true solution. It will not be the aim to construct more involved stochastic processes driven by the ODE (and its unknown solution) itself.

Definition 5.4 (Linear SDE).¹³ Consider curves x : t ∈ R ↦ x(t) ∈ R^N for t > t_0. Assume matrices F ∈ R^{N×N} and a vector L ∈ R^N. For the purposes of this text, the linear time-invariant stochastic differential equation

dx(t) = F x(t) dt + L dω_t

(see below for an explanation of the notation dω_t), together with the initial value x(t_0) = x_0, describes the local behaviour of a unique Gaussian process, determined by the following mean and covariance function:

E(x(t)) = e^{F(t-t_0)} x_0,   (5.18)
cov(x(t), x(t')) = ∫_{t_0}^{min(t,t')} e^{F(t-τ)} L L^T e^{F^T(t'-τ)} dτ   (5.19)
                =: k(t, t').

This process is known as the solution of the SDE. In particular, this



gives rise to the discrete-time stochastic recurrence relation

p(x_{t_{i+1}} | x_{t_i}) = N(x_{t_{i+1}}; A_{t_i} x_{t_i}, Q_{t_i}),   (5.20)

with

A_{t_i} := exp(F(t_{i+1} - t_i))   and   Q_{t_i} := ∫_0^{t_{i+1}-t_i} e^{Fτ} L L^T e^{F^T τ} dτ.   (5.21)
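In implementations, A_{t_i} and Q_{t_i} of Eq. (5.21) are often computed together from a single matrix exponential of a block matrix, a standard trick usually credited to Van Loan. The following sketch (assuming SciPy is available; the function name is ours, not the book's) is one way to do it:

import numpy as np
from scipy.linalg import expm

def discretise_lti_sde(F, L, h):
    # A = exp(F h) and Q = ∫_0^h e^{Fτ} L L^T e^{F^T τ} dτ, via one expm call.
    n = F.shape[0]
    L = np.atleast_1d(L).reshape(-1, 1)
    C = L @ L.T                                    # diffusion outer product L L^T
    M = np.block([[-F, C], [np.zeros((n, n)), F.T]])
    E = expm(M * h)
    A = E[n:, n:].T                                # lower-right block is e^{F^T h}
    Q = A @ E[:n, n:]                              # upper-right block is e^{-F h} Q
    return A, Q

The two returned quantities follow because the lower-right block of E equals e^{F^T h} and the upper-right block equals e^{-F h} Q.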

Figure 5.3: Draws from the Wiener process ω_t (here with x_{t=0} = 0) are paths of Brownian motion: continuous, but non-differentiable, almost everywhere. Their expected deviation from x_0 (grey lines) grows as √t.

The most basic choice N = 1, L = 1, F = 0 yields the SDE “dx(t) = dω_t”, and thus an implicit definition of that ominous object dω_t, which is called the increment of the Wiener process. The solution of this SDE has (since e^0 = 1) the constant mean function E(x(t)) = μ(t) = x_0 (i.e. if we set x_0 = 0, vanishing mean), and the covariance function k(t, t') = min(t, t') - t_0. This solution is thus the Wiener process¹⁴ with starting time t_0 (see Eq. (4.5)), which is depicted in Figure 5.3 for t_0 = 0. The mysterious dω_t is, strictly speaking, nothing but a placeholder for an Itô integral; i.e. the SDE “dx = dω_t” is in essence just a shorthand for the Itô stochastic integral equation x(t) = ∫_{t_0}^{t} dω_s.¹⁵ Thus dω_t/dt is, loosely speaking, some kind of “derivative of the Wiener process”; it is hence often thought of as Gaussian white noise. While this is a useful intuition, it is completely informal, because a sample path from the Wiener process ω_t is almost surely not differentiable, almost everywhere. Figure 5.4 gives an intuition for state-space inference using the Wiener process itself as the prior measure.

¹⁴ Candidates for the invention of the Wiener process include the Danish astronomer Thiele in 1880 (see a review by Lauritzen in 1981), the French mathematician Bachelier, the Austrian physicist Smoluchowski, and Albert Einstein with his Annus Mirabilis paper on Brownian motion (Einstein, 1906). Wiener (1923), however, was arguably the first to show the existence of the process. An impressive encyclopedic handbook on the Wiener process was provided by Borodin and Salminen (2002).

¹⁵ See §5.2.A in Karatzas and Shreve (1991) for more information on the stochastic-integral form of SDEs.
► 5.4 Polynomial Splines

An interesting class of linear SDEs arises from integrals of the Wiener process. This can be achieved by considering the SDE with F ∈ R^{(q+1)×(q+1)} and L ∈ R^{q+1} as

F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix},   and   L = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ θ \end{bmatrix}.   (5.22)

This structure¹⁶ ensures that the elements of x are derivatives of each other. So they can be interpreted as derivatives of a function f : R → R:

x(t) = [ f(t)  f'(t)  f''(t)  ⋯  f^{(q)}(t) ]^T.   (5.23)

¹⁶ Another common notation, if the step size is fixed (t_{i+1} - t_i = h for all i), is to scale the state space to x̃(t) = B x(t) with B = diag(1, h, ..., h^q/q!), and a corresponding F̃ = B F B^{-1}, L̃ = B L (this means F̃_{i,i+1} = i/h). In this notation, Eq. (5.23) becomes x̃_i(t) = h^{i-1} f^{(i-1)}(t)/(i-1)!. The advantage is that the resulting Ã and Q̃ have much simpler form: Ã becomes the Pascal triangle matrix (the upper triangular matrix containing the Pascal triangle), and the elements of Q̃ depend on h only in a scalar fashion:

[Ã]_{ij} = I(j ≥ i) \binom{j-1}{i-1},
[Q̃]_{ij} = θ² h^{2q+1} / ( (2q + 3 - i - j) (q + 1 - i)! (q + 1 - j)! (i - 1)! (j - 1)! ).

This can help both with numerical stability and efficient implementations.

Figure 5.4: State-space inference using the most elementary Wiener process prior, i.e. the linear time-invariant model with F = 0, L = 1, H = 1. Observations shown as black circles with error bars at two standard deviations of the Gaussian observation noise. Top: Filtering. Predictive means in solid black, two standard deviations in thin black. Three joint samples from the predictive distribution, constructed in linear time during the prediction run, are plotted in thin black, also. Bottom: Posterior distribution after smoothing (filtering distribution in grey for comparison). The posterior samples (which are indeed valid draws from the joint posterior) are produced by taking the samples from the predictive distribution and scaling/shifting them (deterministically) during the smoothing run; they do not involve additional random numbers.

The corresponding discrete-time objects A and Q contain elements that are polynomials of the time steps h_i := t_{i+1} - t_i. The matrix A(h) is upper triangular (and Q, of course, is spd), with elements¹⁷ (1 ≤ i, j ≤ q + 1)

[A(h)]_{ij} = [exp(Fh)]_{ij} = I(j ≥ i) h^{j-i}/(j - i)!,   (5.25)

[Q(h)]_{ij} = θ² h^{2q+3-i-j} / ( (2q + 3 - i - j) (q + 1 - i)! (q + 1 - j)! ).   (5.26)

A detailed derivation of A and Q is provided in the literature.¹⁸

¹⁷ Since exp(X)^{-1} = exp(-X), Eq. (5.25) also immediately provides

[A^{-1}(h)]_{ij} = I(j ≥ i) (-h)^{j-i}/(j - i)!.   (5.24)

¹⁸ Kersting, Sullivan, and Hennig (2020), Appendix A; note that zero-based indexing is used there.

The corresponding covariance functions have a tedious form, but they are also polynomials. For example, for q = 1 - the integrated Wiener process - the kernel is (for t_0 = 0)

cov(f(a), f(b)) = θ² ( min³(a, b)/3 + |a - b| min²(a, b)/2 )   (5.27)

(see also Figure 4.3). We recall that the posterior mean function of a Gaussian process is a weighted sum of kernel evaluations. Hence, under the choice (5.27) and noise-free observations of f, the posterior mean on f would be the piecewise cubic interpolant of the data (which is unique within the convex hull of the

data). It is the cubic spline. More generally, posterior interpolants of the q-times integrated Wiener process are (2q+1)-ic splines.
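Eqs. (5.25) and (5.26) translate directly into code. A small sketch (the function name is ours; one-based indices i, j from the text map to zero-based NumPy indices):

import numpy as np
from math import factorial

def iwp_transition(q, h, theta=1.0):
    # A(h) and Q(h) for the q-times integrated Wiener process, Eqs. (5.25)-(5.26).
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(1, q + 2):
        for j in range(1, q + 2):
            if j >= i:
                A[i - 1, j - 1] = h ** (j - i) / factorial(j - i)
            Q[i - 1, j - 1] = (theta**2 * h ** (2 * q + 3 - i - j)
                               / ((2 * q + 3 - i - j)
                                  * factorial(q + 1 - i) * factorial(q + 1 - j)))
    return A, Q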

► 5.5 The Matérn Family

Another important base case is given by the univariate SDE (i.e. x(t) ∈ R) with a negative F. For convenience further below, we will use the parametrisation F = -1/λ and L = √(2θ²/λ) for θ, λ ∈ R_+. This setup captures the velocity of a gas particle of temperature θ/√λ in a linear (harmonic) potential well of force constant 1/λ, and its solution is known as the Ornstein-Uhlenbeck (OU) process.¹⁹ The SDE

dx = -(x/λ) dt + √(2θ²/λ) dω_t

with x(t_0) = x_0 yields, with Eqs. (5.18) and (5.19),

E(x(t)) = x_0 e^{-(t-t_0)/λ}   and   k(t, t') = θ² ( e^{-|t-t'|/λ} - e^{-(t+t'-2t_0)/λ} ).

The special case of t_0 = -∞ is the stationary limit of the process. It has vanishing mean function, and covariance

k(t, t') = θ² e^{-|t-t'|/λ}.

General integrals of the OU process form an important class of regression models in statistics and machine learning. The Matérn²⁰ class of covariances²¹ is given, using the distance r := |t - t'|, by

k_ν(r) = θ² (2^{1-ν}/Γ(ν)) ( √(2ν) r/λ )^ν K_ν( √(2ν) r/λ ),

where K_ν is the modified Bessel function. For ν = q + 1/2 with integer q, these kernels have the slightly more concrete form²²

k_{q+1/2}(r) = θ² ( Γ(q+1)/Γ(2q+1) ) Σ_{i=0}^{q} ( (q+i)!/(i!(q-i)!) ) ( √(8ν) r/λ )^{q-i} · exp( -√(2ν) r/λ ).

Or yet more explicitly, for the first three integer (and most popular) choices q = 0, 1, 2:

k_{1/2}(r) = θ² · exp(-r/λ),
k_{3/2}(r) = θ² (1 + √3 r/λ) · exp(-√3 r/λ),
k_{5/2}(r) = θ² (1 + √5 r/λ + 5r²/(3λ²)) · exp(-√5 r/λ).

¹⁹ Uhlenbeck and Ornstein (1930)

²⁰ Matérn (1960)

²¹ The naming of the family is due to Stein (1999). See also Rasmussen and Williams (2006), Eq. (4.16).

²² Readers with a background in functional analysis might find it helpful to know that the RKHS arising from the Matérn kernel k_{q+1/2} is norm-equivalent to the Sobolev space H^{q+1}(R). See, e.g., Kanagawa, Sriperumbudur, and Fukumizu (2020).

The associated Gaussian processes are solutions of multivariate state-space models of the form²³

dz(t) = F z(t) dt + L dω_t

with z ∈ R^{q+1} and

F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ -a_0 & -a_1 & \cdots & -a_{q-1} & -a_q \end{bmatrix}   and   L = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ θ \end{bmatrix},   (5.28)

where the a_i are the coefficients of the polynomial (z̃ + iω)^{q+1} with z̃ := √(2ν)/λ.²⁴ For example, for the three values q = 0, 1, 2 from above, we get

dz(t) = -z̃ z(t) dt + θ dω_t,   (5.29)

dz(t) = \begin{bmatrix} 0 & 1 \\ -z̃² & -2z̃ \end{bmatrix} z(t) dt + \begin{bmatrix} 0 \\ θ \end{bmatrix} dω_t,

dz(t) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -z̃³ & -3z̃² & -3z̃ \end{bmatrix} z(t) dt + \begin{bmatrix} 0 \\ 0 \\ θ \end{bmatrix} dω_t.   (5.30)

A more general and detailed introduction to this family of stochastic processes is provided by Särkkä and Solin (2019).

²³ Hartikainen and Särkkä (2010); Särkkä, Solin, and Hartikainen (2013).

²⁴ This matrix F is known as the companion matrix of the polynomial p(x) = a_0 + a_1 x + ⋯ + a_q x^q + x^{q+1}, because p is both the characteristic and the minimal polynomial of F. More in §7.4.6 of Golub and Van Loan (1996), and on pp. 405ff. of Wilkinson (1965).

Exercise 5.5 (moderate, solution on p. 360). Find explicit forms for the matrices A_t, Q_t (Eq. (5.21)) associated with the discrete-time forms of Eqs. (5.29)-(5.30).
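The companion structure of Eq. (5.28) is equally mechanical to construct in code. A sketch (names ours), which can be combined with the SDE discretisation sketch above to obtain A_t and Q_t:

import numpy as np
from math import comb, sqrt

def matern_state_space(q, lam, theta=1.0):
    # Companion-form (F, L) of Eq. (5.28) for the Matérn kernel with ν = q + 1/2.
    nu = q + 0.5
    z = sqrt(2 * nu) / lam                       # z̃ := √(2ν)/λ
    F = np.diag(np.ones(q), k=1)                 # shift structure: z_i' = z_{i+1}
    # last row holds -a_0, ..., -a_q, the coefficients of (z̃ + iω)^{q+1}
    F[-1, :] = [-comb(q + 1, i) * z ** (q + 1 - i) for i in range(q + 1)]
    L = np.zeros(q + 1)
    L[-1] = theta
    return F, L

For q = 1 this reproduces the middle case of Eq. (5.30): F = [[0, 1], [-z̃², -2z̃]] and L = [0, θ]^T.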

► 5.6 Riccati Equations and Steady State

In numerical computations, we will often run a Gaussian filter for many steps through time. The question will then arise whether the algorithm will actually be able to identify an object of interest to finite uncertainty, or whether it might “lose track” over time, in the sense that the posterior predictive covariance P^-_i rises without bound. If this is not the case, does P^-_i converge to some finite value (and which one) for i → ∞?

There is an elegant framework for this purpose, which is well-studied in control engineering,²⁵ where Kalman filters are a foundational tool and the ability to track and control a system is of central concern. Consider a linear time-invariant system defined by the parameters A, Q, H, R. From the prediction (Eq. (5.11)) and update (Eq. (5.13)), the step from P^-_i to P^-_{i+1} is given by the deterministic step (since it does not involve the observation y)

P^-_{i+1} = A P^-_i A^T - A P^-_i H^T (H P^-_i H^T + R)^{-1} H P^-_i A^T + Q   (5.31)
          = Q + A ( (P^-_i)^{-1} + H^T R^{-1} H )^{-1} A^T.

Equations of this form are known as discrete-time algebraic Riccati equations (DAREs).²⁶ Whether a DARE has a fixed point, and if so, where that fixed point lies, can be answered by considering the following symplectic matrix:²⁷

Z = \begin{bmatrix} A^T + H^T R^{-1} H A^{-1} Q & -H^T R^{-1} H A^{-1} \\ -A^{-1} Q & A^{-1} \end{bmatrix}.

If Z has no eigenvalues on the unit circle, then exactly half its eigenvalues are inside the unit circle, and the iteration (5.31) converges to a finite fixed point, known as the steady-state predictive covariance. To find it, consider the matrix U ∈ C^{2N×2N} of the eigenvectors of Z. Assume that they are sorted such that the first N columns of U correspond to the eigenvectors for eigenvalues inside the unit circle (and the other N columns to those outside the circle). Then separate U into square N × N sub-matrices as

U = \begin{bmatrix} U_1 & U_3 \\ U_2 & U_4 \end{bmatrix}.

The steady-state predictive covariance is given by

P^-_∞ = U_2 U_1^{-1}.

This solution is relatively easy to state, but it is not the numerically most efficient way to find P^-_∞. Advanced algorithms can be found in the book by Faßbender (2007). More generally, Riccati equations have been the subject of deep analysis. A classic book was written by Reid (1972). Bini, Iannazzo, and Meini (2011) provide a more recent review.

²⁵ Bougerol (1993)

²⁶ The name reflects that, under some regularity assumptions, there is a continuous-time limit given by the matrix differential equation

dP/dt = AP + PA^T - P H^T R^{-1} H P + Q   (5.32)

(in the sense that the sequence of P's produced by Eq. (5.31) are solutions of Eq. (5.32)). Historically, quadratic ordinary differential equations

x'(t) = q(t) + a(t) x(t) + b(t) x²(t)

(of which Eq. (5.32) is a matrix-valued generalisation) are named after the eighteenth-century work of Jacopo Riccati (1724).

²⁷ A symplectic matrix M is a matrix that satisfies M^T Ω M = Ω for some fixed non-singular, skew-symmetric matrix Ω. In our case, e.g.,

Ω = \begin{bmatrix} 0 & I \\ -I & 0 \end{bmatrix}.

Symplectic matrices have unit determinant. They are invertible, with M^{-1} = Ω^{-1} M^T Ω, and a product of two symplectic matrices is a symplectic matrix. Hence they form a group. A Lie group, in fact, known as the symplectic group. Its generators are the Hamiltonian matrices, matrices of the form

X = \begin{bmatrix} A & B \\ C & -A^T \end{bmatrix}

with symmetric matrices B, C. There is a corresponding stability analysis for the continuous-time algebraic Eq. (5.32) involving a Hamiltonian matrix.

Exercise 5.6 (moderate, solution on p. 360). Equation (5.31) demonstrates that the predictive covariances P^-_i are governed by a discrete-time algebraic Riccati equation. Show that the evolution of the estimation covariances P_i and the smoothed covariances P^s_i are also described by DAREs.
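In practice, the steady-state predictive covariance need not be computed via the eigendecomposition of Z; library DARE solvers can be used instead. A quick numerical check (the example matrices are hypothetical, and the transposed arguments map SciPy's control-form DARE onto the filtering form (5.31)):

import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # example LTI transition
Q = 0.01 * np.eye(2)                      # process noise covariance
H = np.array([[1.0, 0.0]])                # observe first state only
R = np.array([[0.25]])                    # observation noise covariance

P_inf = solve_discrete_are(A.T, H.T, Q, R)   # steady-state P^-_∞
# verify that P_inf is a fixed point of Eq. (5.31):
S = H @ P_inf @ H.T + R
P_next = A @ P_inf @ A.T - A @ P_inf @ H.T @ np.linalg.solve(S, H @ P_inf @ A.T) + Q
assert np.allclose(P_inf, P_next)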
6 Hierarchical Inference in Gaussian Models

As we saw in the preceding sections, the parameters θ := {m, k, A, b, Λ} of a Gaussian model,

p(f | θ) = N(f; m, k),   p(y | f, θ) = N(y; Af + b, Λ),

fundamentally affect the shape of the posterior on f. So what should be done if the right choice of parameters is itself unknown? The difference between a variable (a number of interest) and a parameter (a nuisance number required to specify the model, but not of central interest) is only conceptual, not formal. If the right value for a parameter is not clear a priori, the natural treatment is to assign uncertainty to it, too. For this purpose, the evidence term

p(y | θ) = ∫ p(f | θ) p(y | f, θ) df = N(y; Am + b, A k A^T + Λ)

provides a marginal likelihood (also known as the type-II likelihood) for the parameters.¹

Because inferring parameters involves “modelling the model”, this notion is known as hierarchical inference. Hierarchical inference can increase the computational cost of the overall estimation, so it is not always advisable. But for some special cases of general interest, analytical forms are available. Usually, the probability distribution over the parameters itself contains numbers specifying certain parts. Those numbers are called hyperparameters. If one truly wanted, they too could be inferred via additional hierarchical modelling layers, ad infinitum, but this is not done in practice, to keep computational cost finite. We will now outline such a framework for estimating the unknown mean and variance of a Gaussian model. We will employ a conjugate prior² over these hyperparameters, i.e. a prior which

ensures that the posterior distribution over these hyperparameters, given data, has the same (Gaussian) form.

¹ More on the kind of exact marginalisation performed below, with extensions, can be found in a book by Bretthorst (1988). A more compact discussion of conjugate priors for the Gaussian is in §24.1 of MacKay (2003).

² An early work on conjugate priors is by Edwin Pitman (1936). A general exposition of conjugate priors for exponential families was provided by Diaconis and Ylvisaker (1979). The core idea behind conjugate priors, namely to formally treat prior knowledge like additional data, is arguably first used, albeit without much discussion, in §VI of Pierre-Simon Laplace's Théorie analytique des probabilités (1814, p. 364), in the famous argument about inferring probabilities from observed events, which also introduces the Gaussian approximation to the Beta integral now known as the Laplace approximation (1814, p. 365). There, Laplace argues that one can always use a uniform prior, since non-uniform prior knowledge can be represented by adding counts to the observations.

► 6.1 Inference on Scalar Parameters of Gaussians

Assume the real numbers y := [y_i]_{i=1,...,N} are drawn i.i.d. from a Normal distribution with unknown mean α and variance β²,

p(y | α, β) = ∏_{i=1}^N N(y_i; α, β²) = N(y; α·1, β²·I).

We would like to assign probability densities over the latent quantities α and β. Purely from an algebraic perspective (as opposed to a deeper philosophical motivation for this kind of uncertainty), it is convenient to choose a prior distribution p(α, β) of some form p(α, β) = π(α, β; θ) with hyperparameter θ, such that the posterior distribution has the same functional form π as the prior, just with updated hyperparameters θ̃ instead of θ:

p(α, β | y) = p(y | α, β) π(α, β; θ) / ( ∫ p(y | α, β) π(α, β; θ) dα dβ ) = π(α, β; θ̃).

A prior π with this property is called conjugate to the likelihood p(y | α, β). For the present case of a Gaussian likelihood with latent mean and variance, the standard choice of conjugate prior assigns a Gaussian distribution of mean μ_0 and variance proportional to β² (with scale λ_0) to α, and a Gamma distribution with parameters a_0, b_0 to the inverse of β². This is known as the Gauss-Gamma, or Gauss-inverse-Gamma, prior, and has the hyperparameters θ_0 := [μ_0, λ_0, a_0, b_0]:

π(α, β | μ_0, λ_0, a_0, b_0) = p(α | β, μ_0, λ_0) p(β | a_0, b_0)
    = N(α; μ_0, β²/λ_0) G(β^{-2}; a_0, b_0),   where G(z; a, b) := b^a z^{a-1} e^{-bz} / Γ(a).

Here, G(·; a, b) is the Gamma distribution with shape a > 0 and rate b > 0. The normalisation constant of the Gamma distribution, and also the source of the distribution's name, is the Gamma function Γ.³ To compute the posterior, we multiply prior and likelihood, and identify

p(α, β | y) ∝ p(y | α, β) p(α, β)   (6.2)
    = N(y; 1α, β² I) N(α; μ_0, β²/λ_0) G(β^{-2}; a_0, b_0).

³ The Gamma function is an extension of the factorial function. The Greek symbol for this function is due to Legendre, who utilised two “Eulerian integrals”, B, Γ (the first symbol being a capital Greek letter Beta):

B(m, n) = ∫_0^1 x^{m-1} (1 - x)^{n-1} dx,   (6.1)
Γ(t) = ∫_0^∞ x^{t-1} e^{-x} dx.

In the context of this text, it is a neat marginal observation that, while Legendre was interested in problems of chance, Euler's original motivation for considering those integrals, in an exchange with Goldbach, was one of interpolation: he was trying to find a “simple” smooth function connecting the factorial function on the reals. And indeed,

Γ(n) = (n - 1)!   for n ∈ N \ 0.

Legendre is also to blame for this unsightly shift in the function's argument, since he constructed Eq. (6.1) by rearranging Euler's more direct result

n! = ∫_0^1 (- log x)^n dx.

A great exposition on this story and the Gamma function can be found in a Chauvenet-prize-decorated article by Davis (1959). It is left as a research exercise to the reader to consider in which sense Euler's answer to this interpolation problem is natural, in particular from a probabilistic-numerics standpoint (that is, which prior assumptions give rise to the Gamma function as an interpolant of the factorials, whether there are other meaningful priors yielding other interpolations).

We first deal with the Gaussian part, using Eq. (3.5) (and some simple vector arithmetic) to re-arrange this expression as

G(β^{-2}; a_0, b_0) · N( y; 1μ_0, β² ( I + 11^T/λ_0 ) )
    · N( α; (λ_0 μ_0 + Σ_i y_i)/(λ_0 + N), β²/(λ_0 + N) ).   (6.3)

The second Gaussian expression is evidently a Gaussian over α, and the first one does not depend on α. To deal with the first part, we use the matrix inversion lemma, Eq. (15.9), to rewrite

( I + 11^T/λ_0 )^{-1} = I - 11^T/(N + λ_0).

This allows writing the first line of Eq. (6.3) more explicitly (leaving out normalisation constants independent of β) as

G(β^{-2}; a_0, b_0) N( y; 1μ_0, β² ( I + 11^T/λ_0 ) )
    ∝ (β^{-2})^{a_0-1} exp( -b_0/β² ) (1/β²)^{N/2} exp( -(1/(2β²)) ( Σ_{i=1}^N (y_i - μ_0)² - (Σ_{i=1}^N y_i - Nμ_0)²/(λ_0 + N) ) ).   (6.4)

At this point, it is common practice to introduce the following substitutions, known as sufficient statistics:

α̂ := (1/N) Σ_{i=1}^N y_i              (sample mean),   (6.5)
β̂² := (1/N) Σ_{i=1}^N (y_i - α̂)²      (sample variance),   (6.6)

with which the cumbersome expression in Eq. (6.4) simplifies (by shuffling sums around) to

Σ_{i=1}^N (y_i - μ_0)² - (Σ_{i=1}^N y_i - Nμ_0)²/(λ_0 + N) = N β̂² + ( λ_0 N/(λ_0 + N) ) (α̂ - μ_0)².

The terms in Eq. (6.4) thus have the form of a Gamma distribution over β^{-2}. All other terms suppressed by the ∝ sign make up the normalisation constant of the new Gauss-Gamma posterior

p(α, β | y, μ_0, λ_0, a_0, b_0) = π(α, β | μ_N, λ_N, a_N, b_N),   (6.7)

with the new parameters

μ_N = (λ_0 μ_0 + N α̂)/(λ_0 + N),   λ_N = λ_0 + N,
a_N = a_0 + N/2,   b_N = b_0 + (1/2) ( N β̂² + ( λ_0 N/(λ_0 + N) ) (α̂ - μ_0)² ).   (6.8)

Exercise 6.1 (easy). Assume that, instead of i.i.d. draws y_i, one observes individually scaled samples a_i = y_i s_i with known scales s_i. By re-tracing the derivation, convince yourself that this situation can be addressed in the same fashion, but with new sufficient statistics

α̂ = (1/N) Σ_{i=1}^N a_i/s_i,   β̂² = (1/N) Σ_{i=1}^N ( a_i/s_i - α̂ )².

This will be helpful for hierarchical inference in §III.
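The update (6.8) is a one-liner per hyperparameter. A sketch (names ours):

import numpy as np

def gauss_gamma_posterior(y, mu0, lam0, a0, b0):
    # Posterior hyperparameters of Eq. (6.8) from i.i.d. scalar data y.
    N = len(y)
    alpha_hat = np.mean(y)                        # sample mean, Eq. (6.5)
    beta2_hat = np.mean((y - alpha_hat) ** 2)     # sample variance, Eq. (6.6)
    muN = (lam0 * mu0 + N * alpha_hat) / (lam0 + N)
    lamN = lam0 + N
    aN = a0 + N / 2
    bN = b0 + 0.5 * (N * beta2_hat
                     + lam0 * N / (lam0 + N) * (alpha_hat - mu0) ** 2)
    return muN, lamN, aN, bN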

The common interpretation for this result is that λ_N and a_N can be seen as counters for the number of observations collected, while μ_N and b_N are sufficient statistics for the population's mean and variance.⁴ In this sense, λ_0 and 2a_0 amount to pseudo-observations (see note 2) encoded in the prior. As N increases, μ_N (the expected value of α under the Gauss-Gamma prior) converges to the sample mean α̂, while E(β²) = b_N/a_N converges to the sum of the sample variance and a correction term

( λ_0 N/(λ_0 + N) ) (α̂ - μ_0)².

Intuitively, this term corrects for the fact that β̂² is a biased estimate of the variance - the sample mean is typically closer to the samples than the actual mean is, and this bias depends on how far the initial estimate μ_0 is from the correct mean.

⁴ The expected value of a Gamma-distributed random variable is E_{G(z;a,b)}[z] = a/b.
In applications, our main interest will usually not be in the posterior distribution over the parameters (α, β), but predictions of the subsequent sample y_{N+1}. Under the hierarchical Gauss-Gamma model, this posterior distribution is the marginal distribution reached by integrating out the posterior over the hyperparameters,

p(y_{N+1} | y) = ∫_{-∞}^{∞} ∫_0^{∞} p(y_{N+1} | y, α, β) p(α, β | y) dα dβ.

The integral over α is easy, as it is an instance of the basic Gaussian properties above. The remaining integral over β is over the product of a Gaussian and several Gamma terms, which is given by a famous result known as⁵ Student's-t distribution:

St(y_{N+1}; μ_N, a_N/b_N, 2a_N)
    := ∫_0^∞ N(y_{N+1}; μ_N, β²) G(β^{-2}; a_N, b_N) dβ   (6.9)
    = ( Γ(a_N + 1/2)/Γ(a_N) ) (2π)^{-1/2} b_N^{a_N} ( b_N + (y_{N+1} - μ_N)²/2 )^{-a_N-1/2}.

⁵ An alternative contender for the invention of this distribution is the German geodesist F. R. Helmert, but the endearing story of the statistician Gosset, writing under the pseudonym “Student” (1908) so as not to violate a non-disclosure agreement with the Guinness Brewery, his employer, has stuck. In fact, Helmert's 1875 letter to the Zeitschrift für Mathematik und Physik is an entertaining read, too. It contains a dressing-down of R. A. Mees (Professor of Physics at Göttingen) who, in an article published in the same newspaper some months prior (pp. 145 ff.), apparently misread a discussion, by Helmert, of Gauss' work on estimation of errors. The letter opens with the sentence “The point of the following is to show that Mr. Mees errs, not just in his evaluation of my work, but throughout his entire essay.”

► 6.2 Inference on Multivariate Gaussian Parameters

The derivations above can be extended to the case of multivariate observations Y = [y_1, ..., y_N] ∈ R^{M×N} drawn i.i.d. as

p(Y | α, B) = ∏_{i=1}^N N(y_i; α, B)   (6.10)

from a Gaussian with unknown mean vector α and unknown covariance matrix B. The corresponding conjugate prior is the

Gauss-inverse-Wishart distribution

p(α, B) = N(α; μ_0, B/λ_0) W(B^{-1}; W_0, ν_0),   (6.11)

with the parameters μ_0 ∈ R^M, λ_0 ∈ R_+, a symmetric positive definite matrix W_0, and the so-called degree of freedom ν_0 > M - 1. Here, W(B^{-1}; W_0, ν_0) is a Wishart distribution,⁶

W(B^{-1}; W_0, ν_0) := |B^{-1}|^{(ν_0-M-1)/2} exp( -(1/2) tr(W_0^{-1} B^{-1}) ) / ( 2^{ν_0 M/2} |W_0|^{ν_0/2} Γ_M(ν_0/2) ),

using what is sometimes called the multivariate Gamma function,

Γ_M(x) := π^{M(M-1)/4} ∏_{j=1}^M Γ( x + (1 - j)/2 ).

⁶ Wishart (1928)
The posterior resulting from the likelihood (6.10) and prior (6.11) emerges analogously to the derivations above. It is a Gauss-inverse-Wishart with the updated parameters

μ_N = (λ_0 μ_0 + N α̂)/(λ_0 + N),
λ_N = λ_0 + N,
ν_N = ν_0 + N,
W_N = W_0 + N B̂ + ( λ_0 N/(λ_0 + N) ) (α̂ - μ_0)(α̂ - μ_0)^T,

using the multivariate forms of the sufficient statistics from Eqs. (6.5) and (6.6):

α̂ = (1/N) Σ_i y_i,
B̂ = (1/N) Σ_i (y_i - α̂)(y_i - α̂)^T.

As above, it is also possible to marginalise over this multivariate posterior. This again yields a Student-t posterior, but a multivariate one:⁷

⁷ When comparing Eq. (6.12) to Eq. (6.9), an unexpected factor of 2 shows up here and there. This comes from differing definitions of the “counters” a_N and ν_N, which in turn are due to the standard definition of the Gamma distribution.

p(y_{N+1} | Y, μ_0, λ_0, ν_0, W_0) = ∫∫ p(y_{N+1} | α, B) p(α, B | Y, μ_0, λ_0, ν_0, W_0) dα dB   (6.12)

= St_M( y_{N+1}; μ_N, ( (λ_N + 1)/(λ_N (ν_N - M + 1)) ) W_N, ν_N - M + 1 )

= ( Γ((ν_N + 1)/2) / ( Γ((ν_N - M + 1)/2) π^{M/2} | ((λ_N + 1)/λ_N) W_N |^{1/2} ) ) ( 1 + ( λ_N/(λ_N + 1) ) (y_{N+1} - μ_N)^T W_N^{-1} (y_{N+1} - μ_N) )^{-(ν_N+1)/2}.



► 6.3 Conjugate Prior Hierarchical Inference for Filters

For linear Gaussian state-space models as defined by Eqs. (5.8)-(5.9), the marginal likelihood factorises over the Markov chain:⁸

p(y | θ) = ∏_{i=1}^N p(y_i | y_{1:i-1}, θ) = ∏_{i=1}^N N(y_i; H m^-_i, H P^-_i H^T + R).   (6.13)

In particular, for SDEs like those of the preceding §5.4 and §5.5, which can be written, with an explicit role for θ, as

dz(t) = F z(t) dt + θ L dω_t,

hierarchical inference can be formulated efficiently in the filter formulation:⁹ define the unit-scale predictive variance analogously to Eq. (5.21) as

Q̄_{t_i} := ∫_0^{t_{i+1}-t_i} e^{Fτ} L L^T e^{F^T τ} dτ,

which extracts θ in Eq. (6.13). We get

p(y | θ) = ∏_{i=1}^N N(y_i; H m^-_i, θ² H P̄^-_i H^T)

with a recursively defined P̄^-_{i+1} := A P̄_i A^T + Q̄, started at P̄^-_0 = P^-_0. Given the inverse Gamma prior

p(θ) = G(θ^{-2}; a_0, β_0),

the posterior on θ can then be updated recursively during the filtering pass, and the analogue to Eq. (6.7) simplifies to

p(θ^{-2} | y_{1:N}) = G( θ^{-2}; a_0 + N/2, β_0 + (1/2) Σ_{i=1}^N (y_i - H m^-_i)² / (H P̄^-_i H^T) )
    =: G(θ^{-2}; a_N, β_N).   (6.14)

In Algorithm 5.1, this can be realised¹⁰ by using Q̄ instead of Q, and adding the lines a ← a + 1/2 and β ← β + z²/(2S) after line 8 (those two variables having been initialised to a ← a_0, β ← β_0). The corresponding Student-t marginal on the state can be computed locally, if necessary, in each step.

⁸ This is known as the prediction error decomposition; see Eq. (12.5) in Särkkä (2013).

⁹ As in the preceding sections, analytic hierarchical inference on the scale θ requires noise-free observations. The same applies here. In the notation of the Kalman filter, this amounts to R = 0.

¹⁰ Pseudo-code can be found in the following chapter, as Algorithm 11.2.
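Combined with Algorithm 5.1, the recursive scale update of Eq. (6.14) amounts to two extra lines per step. A sketch for scalar observations with R = 0 (names and shapes are our choices; m is a column vector, y a scalar):

import numpy as np

def filter_step_with_scale(m_prev, P_prev, A, Qbar, H, y, a, beta):
    # One Kalman step under the unit-scale model Q̄, accumulating Eq. (6.14).
    m_pred = A @ m_prev                       # predictive mean, shape (L, 1)
    P_pred = A @ P_prev @ A.T + Qbar          # unit-scale predictive covariance
    z = y - (H @ m_pred).item()               # scalar residual
    s = (H @ P_pred @ H.T).item()             # unit-scale innovation variance
    K = (P_pred @ H.T) / s                    # gain, shape (L, 1)
    m = m_pred + K * z
    P = P_pred - K @ (H @ P_pred)
    a, beta = a + 0.5, beta + 0.5 * z**2 / s  # increments of Eq. (6.14)
    return m, P, a, beta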
7 Summary of Part I

This chapter provided a concise introduction to Gaussian probabilistic inference. Here is a brief summary of the key results from this chapter:

▸ Gaussian densities provide the link between probabilistic inference and linear algebra. Though of limited expressiveness, they thus form the basis for computationally efficient inference.

▸ Using linear models, the Gaussian framework can be leveraged for inference on infinite-dimensional objects like functions and curves.

▸ Gaussian processes provide Gaussian inference on infinite-dimensional hypothesis spaces. Even these nonparametric methods, however, embody non-trivial prior constraints. Like all consistent probabilistic models, they have to spread finite probability mass over their hypothesis space.

▸ Gauss-Markov processes, captured by state-space models, allow Gaussian inference in linear time, through the algorithms known as filtering and smoothing. They are well-suited for the design of computationally lightweight inference rules, as required in numerical computation. Linear stochastic differential equations capture the continuous-time limit of these stochastic processes.

▸ The parameters of Gaussian models can be inferred using hierarchical inference. In most cases this poses a nonlinear (non-Gaussian) optimisation or inference problem. But in the special case of scale and mean of a Gaussian model, conjugate priors allow for analytic inference.

The following chapters will employ this toolset in concrete numerical tasks.
Chapter II
Integration
8 Key Points

Upon first hearing the idea of Probabilistic Numerics - that computation may be phrased as inference - the critical reader may have some fundamental questions.

▸ Is Probabilistic Numerics even feasible? Nice idea, but how might probabilistic numerical methods be built in practice?

▸ Are probabilistic numerical algorithms fast? Classic numerical methods are efficient. Bayesian inference has a reputation for high computational cost. Can probabilistic numerical methods be implemented at low complexity?

▸ Do probabilistic numerical algorithms work? In fact, what does “works” even mean here? Which analytic desiderata should probabilistic numerical methods satisfy? Classic numerical analysis, most centrally, aims to show that the point estimates of a certain method converge at a desirable rate towards the truth. For probabilistic numerical methods, we will likely want a similar notion. But since probabilistic methods add uncertainty as a first-class citizen, we should also be able to make analytic statements showing their uncertainty to be meaningful in some sense.

▸ Does probabilistic mean stochastic? After all, Monte Carlo methods, relying on random numbers, are a classic, well-established field of their own.

This chapter uses the elementary problem of integration to establish basic notions and give first exemplary answers to the above questions. In short, these answers are as follows:

▸ It is actually quite straightforward to derive feasible instances of probabilistic numerical algorithms. In particular, a

GP prior with a Gaussian covariance yields a first probabilistic numerical quadrature algorithm, producing good estimates for an integral given a set of evaluation nodes. Further, the placement of evaluation nodes across the integration domain, the actions taken by the numerical agent, can be naturally controlled by the probabilistic interpretation, by, for example, selecting information gain as a utility function. In this sense, probabilistic numerical methods really are autonomous learning agents.

▸ We will demonstrate, in two different ways, how probabilistic numerical quadrature algorithms can be fast. First, many classical numerical algorithms can be directly derived as probabilistic numerical ones (we will repeatedly do so throughout this book to show connections). Providing an in-depth example, the trapezoidal rule is interpretable as the maximum a posteriori and mean estimator arising from a Wiener process prior on the integrand. Further, the Bayesian inference associated with the trapezoidal rule can be implemented as a Gaussian filter, at a cost identical to that of its classical equivalent. The fact that the trapezoidal rule is probabilistic numerical and the fact that the trapezoidal rule is fast prove that probabilistic numerical algorithms can be fast. We also discuss other common tractable priors for integration, and draw connections to the broader class of (classic) Gaussian quadrature rules. Such connections mean that, instead of having to “re-invent the wheel”, we can use classic methods as inspiration for probabilistic ones. Second, we will show that at least one ab-initio probabilistic numerical quadrature algorithm, warped sequential active Bayesian integration (WSABI), can be faster than standard alternatives. This speed is not despite, but because of, the computation spent on Bayesian inference. Such computation can be seen not as overhead, but as an investment that ultimately yields dividends in returning good estimates more quickly.

▸ Demonstrating what “works” means for Probabilistic Numerics, we show convergence statements for both the point estimate and the uncertainty (error estimate) of the probabilistic trapezoidal rule. The most basic uncertainty estimate is found to provide an upper, worst-case error bound; and we show that this error estimate can be adapted in an “empirical Bayesian” fashion, at low computational overhead, to provide better calibration, something closer to an expected-case estimate.

▸ Probabilistic Numerics is not just different to stochastic numerics: Probabilistic Numerics is opposed to stochastic numerics. In the univariate setting, we show empirically that Monte Carlo integration has deficiencies compared with probabilistic numerical deterministic algorithms. We use this example to construct informal, generic arguments against the use of random numbers in computation.
9 Introduction

The primary tools in this chapter are Gaussian process inference (§4.3), and the state-space formulation of Gauss-Markov processes (§5), in particular in the algorithmic form of filtering and smoothing (Algorithms 5.1 and 5.2).

► 9.1 Motivation

Integration is a significant numerical problem in many fields of science and engineering. It is a key step in inference, where it is encountered when averaging over the many states of the world consistent with observed data. Indeed, a provocative Bayesian view is that integration is the single challenge separating us from systems that fully automate statistics. More speculatively still, such systems may even exhibit artificial intelligence (AI).¹

But integration is also an operation with long history, and elementary enough to be covered at early levels of mathematical education. As such, integration provides the ideal starting point for a presentation of the intuitions driving Probabilistic Numerics and how they might be used to derive practical probabilistic numerical algorithms. This chapter will study how numerical integration - also known as quadrature - can be formulated as inference.

The results presented here expand and detail observations from a 1988 paper by Persi Diaconis (1988), and touch on related results by Anthony O'Hagan (1991).

As a running example, we will consider the univariate² function

f : R → R_+,   f(x) := exp( -(sin(3x))² - x² )   (9.1)

¹ M. Hutter. Universal Artificial Intelligence. Texts in Theoretical Computer Science. 2010.

² We first focus on the univariate case for its pedagogical simplicity. In fact, analogously to multivariate GP regression, all below Bayesian quadrature methods extend directly to the d-dimensional case for all d ∈ N. Note, however, that they - at the time of writing - are only practical up to d < 20 dimensions. See §10.1.2 for more details.

Figure 9.1: The smooth function f(x) and an upper bound, the Gaussian function.

(Figure 9.1). The handful of symbols on paper that make up Eq. (9.1) fully specify a unique deterministic function. For arbitrary double-precision values of x ∈ R, a laptop computer can evaluate f(x) to machine precision in a few nanoseconds, using only multiplication and addition, and the “atomic” functions exp and sin, which are part of elementary programming language definitions, like the GNU C library.³ Repeated evaluations at the same value x will always return the same result. There is nothing imprecise about Eq. (9.1), and thus one may argue that there is also nothing uncertain about the function f. However, the definite real number

F := ∫_{-3}^{3} f(x) dx ∈ R   (9.2)

cannot be computed straightforwardly or elementarily. Since f is clearly integrable, there is one and only one correct value of F. But this real number cannot be found in standard tables of analytic integrals.⁴ And there is no atomic operation in low-level libraries providing its value to machine precision. Despite the formal clarity of Eq. (9.1), we are evidently uncertain about the value of F, because we cannot provide a correct value for it without further work. This is epistemic uncertainty, the kind arising from a lack of knowledge.

But it is easy to constrain the value of F to a finite domain: Because f(x) is strictly positive, F > 0. A second glance at Eq. (9.1) lets us notice that f is bounded above by the Gaussian function:

f(x) ≤ g(x) := exp(-x²)   ∀x ∈ R.

Thus,

0 < F ≤ ∫_{-∞}^{∞} g(x) dx = √π.

Hence it is possible to define a proper prior measure over the value of F, for example the uniform measure p(F) = U_{(0,√π)}(F). Thus, if we can collect “data” Y that is related to F through some correctly defined and sufficiently regular likelihood function p(Y | F), then the resulting posterior p(F | Y) will converge towards the true value F, at an asymptotically optimal rate for this likelihood.⁵ The question is thus: which prior, and which likelihood, should we choose?

³ www.gnu.org/software/libc. There is a fuzzy boundary between what constitutes an “atomic” operation and a numerical algorithm. We will be pragmatic and define the libc as the set of atomic operations. A more abstract notion, in line with the purposes of this text, albeit also less concrete, is as follows: consider an algorithm a, a map a(θ) = ŵ taking inputs θ ∈ Ω from some space Ω that define a computational task with a true (potentially unknown) solution w, and returning an estimate ŵ. If there are values of θ that the algorithm accepts without throwing an error, such that the resulting ŵ deviates from the true w by more than machine precision, then we might call a a numerical method, otherwise a low-level, atomic routine. Put succinctly, a numerical method produces an estimate that may be off, while an atomic routine just returns correct numbers. Hence, numerical methods can benefit from a non-trivial notion of uncertainty, while in atomic routines the associated uncertainty is always nil. This is not meant to be a perfectly rigorous definition of the term, and it is imperfect (for example, numerical methods also often have precision parameters and budgets to tune and spend, an aspect ignored by this definition), but a precise definition is not needed in practice anyway.

⁴ Gradshteyn and Ryzhik (2007)

⁵ Le Cam (1973)

Exercise 9.1 (inspirational, see discussion in §9.3). Even without appealing to the Gaussian integral, we could also bound f from above with the unit function u(x) = 1 on the integration domain, and would arrive at the looser bounds 0 < F < 6. On the other hand, if we allow the use of the function erf (which is also in glibc), we could refine the upper bound to the definite integral over g, arriving at 0 < F < √π erf(3). In which sense is this more “precise” prior more “correct”? Is there a most correct, or optimal prior? There is no immediate answer to these questions, but it is helpful to ponder it while reading the remainder of this chapter.

► 9.2 Monte Carlo

Before introducing a probabilistic formulation of integration, it is helpful to briefly consider a popular non-probabilistic, but

stochastic, way to compute integrals. That is, an approach based on the use of random numbers. Methods that compute using summary statistics of random numbers are known as Monte Carlo (MC) methods. They are a mainstay of computational statistics.

Assume the task is to compute F = ∫_a^b f(x) dx, for some generic integrable function f. Consider a probability measure p(x) over the domain (a, b) that has the following properties:

1. p(x) > 0 wherever f(x) ≠ 0 in (a, b); and

2. it is possible to draw random samples from p at acceptable computational cost (that is, using a source of uniform random numbers and a small number of atomic operations).

For the example integrand f with the integration limits (a, b) = (-3, 3) from Eq. (9.1), the uniform measure p(x) = U_{-3,3}(x) fulfils these properties. Another possibility is the Gaussian measure restricted to [-3, 3], that is,

p(x) = exp(-x²)/(erf(3)√π)   if |x| ≤ 3,   and p(x) = 0 else.   (9.3)
Draws from Eq. (9.3) could, for example, be constructed relatively efficiently by rejection sampling.⁶ Using N i.i.d. samples x_i ~ p(x), i = 1, ..., N from any such p, we can construct the importance sampling estimator

F̂ := (1/N) Σ_{i=1}^N w(x_i),   (9.4)

using the function w(x) := f(x)/p(x), which is well-defined by our above assumptions on p(x). The logic of this procedure is depicted in Figure 9.2.

Note that, in contrast to F, the estimator F̂ is a random number. By constructing F̂, we have turned the problem (9.2) - inferring an uncertain but unique deterministic number - into a stochastic, statistical one. This introduction of external stochasticity introduces a different form of uncertainty, the aleatory kind, a lack of knowledge arising from randomness (more in §12.3). Because we have full control over the nature of the random numbers, it is possible to offer a quite precise statistical analysis of F̂:

⁶ Robert and Casella (2013), §2.3

Figure 9.2: Monte Carlo integration. Samples are drawn from the Gaussian measure p(x) (unnormalised measure as dashed line, samples as black dots), and the ratio w(x) = f(x)/p(x) evaluated for each sample. The histogram plotted vertically on the left (arbitrary scale) shows the resulting distribution p(w) = ∫ δ(w - w(x)) dp(x). Its expected value, times the normalisation erf(3)√π, is the true integral. Its standard deviation determines the scale for convergence of the Monte Carlo estimate.
Lemma 9.2. If f is integrable, the estimator F̂ is unbiased. Its variance is

var(F̂) = (1/N) var_p(w),   (9.5)

assuming var_p(w) exists. Hence, the standard deviation (the square root of the variance, which is a measure of expected error) drops as O(N^{-1/2}) - the convergence rate of Monte Carlo integration.

This is a strong statement given the simplicity of the algorithm it analyses: random numbers from almost any measure allow estimating the integral over any integrable function. This integrator is “good” in the sense that it is unbiased, and its error drops at a known rate.⁷ The algorithm is also relatively cheap: it involves drawing N random numbers, evaluating f(x) once for each sample, and summing over the results. Given all these strong properties, it is no surprise that Monte Carlo methods have become a standard tool for integration. However, there is a price for this simplicity and generality: as we will see in the next section, the O(N^{-1/2}) convergence rate is far from the best possible rate. In fact, we will find ourselves arguing that it is the worst possible convergence rate amongst sensible integration algorithms.

Proof of Lemma 9.2. F̂ is unbiased because its expected value is

E(F̂) = (1/N) Σ_i ∫_a^b w(x_i) p(x_i) dx_i = F,

given that the draws x_i are i.i.d., and assuming that the w(·) function is known (more on this later). As F̂ is a linear combination of i.i.d. random variables, the variance of F̂ is immediately a linear combination of the respective variances:

var(F̂) = ( Σ_i var_p(w) )/N² = var_p(w)/N.   □

⁷ The multiplicative constant var_p(w) can even be estimated at runtime! Albeit usually not without bias. See also Exercise 9.3 for some caveats.
Monte Carlo is not limited to problems where samples can be drawn exactly. Where exact sampling from a distribution is difficult, Monte Carlo is often practically realised through Markov chain Monte Carlo (MCMC). These iterative methods do not generally achieve the O(N^{-1/2}) convergence rate, but they can still be shown to be consistent, meaning that their estimate of the integral asymptotically converges to its true value.

Exercise 9.3 (moderate, solution on p. 361). One of the few assumptions in Lemma 9.2 is the existence of var_p(w). Try to find an example of a simple pair of integrand f and measure p for which this assumption is violated.
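For the running example, the estimator (9.4) with the uniform measure on (-3, 3) takes only a few lines. A sketch (names ours; the uniform density is 1/6, so w(x) = 6 f(x)):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x ** 2)   # integrand of Eq. (9.1)

N = 10_000
x = rng.uniform(-3, 3, size=N)          # i.i.d. samples from p(x) = U_{-3,3}
w = 6 * f(x)                            # w(x) = f(x)/p(x)
F_hat = w.mean()                        # Monte Carlo estimate of F, Eq. (9.4)
stderr = w.std(ddof=1) / np.sqrt(N)     # empirical O(N^{-1/2}) error scale, Eq. (9.5)
print(F_hat, "+/-", stderr)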

► 9.3 Relevance and Limitations of Prior Knowledge

The Monte Carlo method returns a “frequentist” statistical estimator: a random number whose distribution has known good properties. The strength of this approach is that the analysis of Lemma 9.2 requires only few assumptions. In particular, it involves almost no restrictive requirements on the integrand f. In practice, however, it is not necessary to be so cautious, for we invariably do know quite a lot about the integrand f - after all, the computer performing the computation must have access to a definition of f in a formal language: the source code encoding f. That description is bound to reveal additional information about the integrand. In our running example, we can inspect Eq. (9.1) to see that it is bounded above and below, smooth, etc.

How much performance can be gained if the algorithm is allowed some assumptions about the integrand - some prior information? This chapter introduces a probabilistic class of algorithms for integration, known as Bayesian quadrature (BQ).⁸ We will discover that some classical integration rules, in particular

⁸ See O'Hagan (1991). The term “Bayesian quadrature” is used ambiguously to mean a broad class of methods based on the idea of integration as probabilistic inference, or to refer to O'Hagan's specific algorithm, or something in between, for example any quadrature rule derived from a Gaussian process prior. We use it in the first, general sense.

ular the fundamental trapezoidal rule, arise naturally from this


perspective.
Constructing a probabilistic quadrature method requires two
ingredients. Together they form a general recipe that will also
feature for all other probabilistic numerical methods:

1. A model describing the relationship between the latent object of interest and the computable quantities. That is, a joint probability measure M := p(F, Y) over the integral F and a data set Y := [f(x_1), ..., f(x_N)] of function values. Assuming sufficient regularity of M, the product rule (2.1) allows us to write M as a generative model in terms of a prior p(F) and a likelihood p(Y | F) for F. Such a generative model will usually involve the integrand itself as a latent variable, because (as Y consists of evaluations of f) F and Y are independent when conditioned on f:

p(F, Y) = p(F) p(Y | F) = ∫ p(F | f) p(Y | F, f) dp(f) = ∫ p(F | f) p(Y | f) dp(f).

Thus, the posterior on F can be computed via the posterior on f:

p(F | Y) = ∫ p(F | f) dp(f | Y).

The model thus encodes assumptions, not just over the integrand, but also over its relationship to the numbers being computed to estimate it.

2. A design rule governing the choice of nodes X := [x_1, ..., x_N] where f is evaluated by the computer. In general, such design rules will be a function mapping the model M, the previous choices {x_j} and previously collected data Y_{<i} := {y_j := f(x_j) | j < i} to the decision x_i. If the rule explicitly includes previous evaluations Y_{<i}, it will be called adaptive or closed-loop. If X is chosen based solely on the model, the rule is called non-adaptive, or open-loop. A non-adaptive design may be less sample-efficient, but can also have lower computational cost, as the design can be pre-computed. We will discuss both non-adaptive and adaptive approaches in §10.2. The role of the design rule is thus to translate the information encoded in the model - and, for adaptive rules, the collected observations - into actions of the agent (see the sketch after this list).
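To make the recipe concrete, the following sketch shows how the two ingredients interact in a generic loop. All interfaces here (the condition and posterior_F methods, the design callable) are hypothetical illustrations of the structure, not a library API:

def probabilistic_integrator(f, model, design, N):
    """Generic two-ingredient recipe: a model M = p(F, Y) plus a design rule.
    `model.condition` updates the joint measure with an observed pair (x, y);
    `design` maps (model, past nodes, past data) to the next node."""
    X, Y = [], []
    for i in range(N):
        x_next = design(model, X, Y)   # open-loop rules ignore Y here
        y_next = f(x_next)             # the "data": a computed value of f
        X.append(x_next)
        Y.append(y_next)
        model = model.condition(x_next, y_next)
    return model.posterior_F()         # a probability measure over F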

Figure 9.3: An agent, and its model of the world, must necessarily be simpler and smaller than the world itself. Similarly, a numerical algorithm, and its model of the problem, must be simpler and smaller than the problem itself.

It is tempting to try and construct the model M by encoding as much valid prior information as possible in a probability measure. But this philosophical desire for a maximally informative prior must be balanced against practical constraints. We are looking for a probability measure M that not only puts high mass close to the truth, but also allows the efficient computation of a posterior p(F | Y), using only atomic operations.
A potential criticism of the probabilistic viewpoint on computation is that there is really only one acceptable prior - the one that puts unit probability mass on the true solution of the task, and thus expresses zero uncertainty. But such a “perfect”, or “informed”, prior would be just as analytically intractable as the numerical task itself, and thus of no practical concern. Asking for a probabilistic numerical method using the perfect prior is analogous to asking for a classical method that returns the correct answer without performing a single computation, or to positing that a learning machine is only “well-posed” if it does not require any training data. The goal must be to construct tractable priors that have various desirable properties. For instance, these priors should give rise to non-trivial uncertainty at intermediate points in the computation. Figure 9.3 illustrates the point: models must be simpler than the problem modelled.
10  Bayesian Quadrature

This section will introduce Bayesian quadrature as a practical tool for numerical integration, and will give the key procedures needed to apply it to real problems.
We will tackle the generic integration problem of computing

$$F = \int_{\mathbb{X}} f(x)\, \mathrm{d}\nu(x), \qquad (10.1)$$

where ν(x) is a measure.1 The domain of integration, 𝕏, is, in practical applications, often a bounded interval, such as [a, b] ⊂ ℝ. The integration problem (10.1) is written as univariate, which will be the setting motivating the majority of this chapter. We will, however, also explain in §10.1.2 how Bayesian quadrature can be generalised to multivariate problems.

1 The example of Eq. (9.2) can either be seen as integrating the function f(x) = e^{−sin²(3x)} against the Gaussian measure ν(x) = e^{−x²}, or as integrating the f from Eq. (9.1) against the Lebesgue measure.
As we already saw in Chapter I, Gaussian process models are a fundamental tool for efficient formulations of inference. And indeed they allow for an analytic formulation of integration as probabilistic inference. Because Gaussian measures are closed under linear maps (see Eq. (3.4)), they are, in particular, also closed under integration. Following the exposition of §4.4, a generic Gaussian process prior p(f) = GP(f; m, k) over the integrand amounts to a joint Gaussian measure over both the function values collected in Y := [f(x_1), ..., f(x_N)] and the integral F:

$$p(F, Y) = \int p(F, Y \mid f)\, p(f)\, \mathrm{d}f = \int \delta\Big(F - \int_{\mathbb{X}} f(x)\, \mathrm{d}\nu(x)\Big) \prod_{i=1}^{N} \delta\big(y_i - f(x_i)\big)\, p(f)\, \mathrm{d}f$$

$$= \mathcal{N}\left( \begin{bmatrix} Y \\ F \end{bmatrix};\; \begin{bmatrix} m_X \\ \int_{\mathbb{X}} m(x)\, \mathrm{d}\nu(x) \end{bmatrix},\; \begin{bmatrix} k_{XX} & \int_{\mathbb{X}} k_{Xx}\, \mathrm{d}\nu(x) \\ \int_{\mathbb{X}} k_{xX}\, \mathrm{d}\nu(x) & \iint_{\mathbb{X}\times\mathbb{X}} k(x, x')\, \mathrm{d}\nu(x)\, \mathrm{d}\nu(x') \end{bmatrix} \right).$$

Here we have used the Dirac point measure δ as a likelihood term to encode exact, perfectly certain, observations of f at

the locations X. The conditional for F given Y, the posterior p(F | Y), is a univariate Gaussian distribution2 with parameters m ∈ ℝ and v ∈ ℝ₊, given by

$$p(F \mid Y) = \mathcal{N}(F; m, v), \qquad (10.2)$$

where

$$m := \int_{\mathbb{X}} \big[ m_x + k_{xX} k_{XX}^{-1} (Y - m_X) \big]\, \mathrm{d}\nu(x) = m_0 + k_X^{\top} k_{XX}^{-1} (Y - m_X), \qquad (10.3)$$

$$v := \iint_{\mathbb{X}\times\mathbb{X}} \big[ k_{xx'} - k_{xX} k_{XX}^{-1} k_{Xx'} \big]\, \mathrm{d}\nu(x)\, \mathrm{d}\nu(x') = K - k_X^{\top} k_{XX}^{-1} k_X. \qquad (10.4)$$

2 Eq. (10.2) can also be reached by first forming the posterior p(f | Y) = GP(f; m_x + k_{xX} k_{XX}^{-1}(Y − m_X), k_{xx'} − k_{xX} k_{XX}^{-1} k_{Xx'}), and using the closure of this Gaussian process under the (linear) integration operation.

These expressions are tractable if there are analytic forms for the following three univariate and bivariate integrals:3

$$m_0 := \int_{\mathbb{X}} m(x)\, \mathrm{d}\nu(x) \in \mathbb{R}, \qquad (10.5)$$

$$k(x_i) := \int_{\mathbb{X}} k(x, x_i)\, \mathrm{d}\nu(x) \in \mathbb{R}, \qquad (10.6)$$

$$K := \iint_{\mathbb{X}\times\mathbb{X}} k(x, x')\, \mathrm{d}\nu(x)\, \mathrm{d}\nu(x') \in \mathbb{R}_+. \qquad (10.7)$$

When these expressions are available, Eq. (10.2) amounts to an analytic map of the observations Y onto a probability measure over F with support on the entirety of, or a subset of, the real numbers. These integrals do not involve the integrand f, but the kernel k (and m) instead. Embedding the single integrand in a hypothesis class of many possible true functions (the support of the prior) in this way thus turns the intractable problem of exact integration into tractable, uncertain inference.

3 In practical applications ν is often a probability measure. For example, in probabilistic statistics and machine learning, the typical situation is that ν(x) is a prior measure on x, while f(x) is a likelihood function arising from some data. In such settings k(x) equals the kernel mean embedding of ν in the rkhs of k, and Bayesian quadrature is related to the notion of kernel herding from the kernel community in machine learning. For more, see Smola et al. (2007) and Huszár and Duvenaud (2012).
Notably, the posterior mean in Eq. (10.2) can be written as an affine function of Y:

$$m = F_0 + w^{\top} Y = F_0 + \sum_{i=1}^{N} w_i y_i, \qquad (10.8)$$

with offset $F_0 := m_0 - k_X^{\top} k_{XX}^{-1} m_X$, and weights $w_i := \sum_{j=1}^{N} [k_X]_j\, [k_{XX}^{-1}]_{ji}$ for $i = 1, \ldots, N$.

The construction so far just provides a probability measure over F without making any claims about its usefulness. But we would like to interpret that measure as a probabilistic estimate of F, treating the posterior mean m as a point estimator for F, and the posterior variance v as an estimate of its square error, a notion of uncertainty. Not all Gaussian process priors will be equally

“good” for a particular integration problem. To argue for one


prior over another, we thus need to show that either quantity
has analytic properties supporting this interpretation of point
estimate and error estimate. This analysis will be provided, for
a specific model choice, in §11.1.5.

► 10.1 Models

After the abstract introduction, we turn to describe tangible models for Bayesian quadrature. By default, we consider the real domain, 𝕏 = ℝ, although more general cases will also be presented.
Given a fixed set of nodes, integration requires two choices: that of the measure ν(x) and the model for the integrand. The measure ν(x) is often stipulated by the problem. It may also be possible to choose ν(x) to ease the resulting integration problem. For example, given a real domain, a Gaussian ν may be taken for the convenience of the resulting numerical integration problem.
We can now turn to the choice of model. Within Bayesian quadrature, the choice of gp prior for the integrand leads to two further choices: that of the prior mean m(x) and the prior covariance k(x, x'). As above, these choices are not free, but have to be made such that the integrals in Eqs. (10.5)–(10.7) remain tractable. The mean m represents an initial guess for the integrand, while the covariance should aim to capture both the deviation of the true integrand from m and knowledge about analytic structure of the integrand beyond that represented by m, e.g. periodicity or differentiability.
In practice, the prior mean function m is almost always taken as zero: m(x) = 0. It is rare that an integrand is known sufficiently well ahead of its evaluation to support any better-informed choice. The zero mean function has the additional practical advantage of rendering m_0 from Eq. (10.5) trivially zero, and thereby simplifying the posterior mean equation for the integral, Eq. (10.3).
As in §6, this chapter will exclusively use a scaled covariance family, separating a simple scale θ ∈ ℝ from a “unit” kernel k̆(x, x'),

$$k(x, x') := \theta^2\, \breve{k}(x, x'). \qquad (10.9)$$

The unit kernel k̆ captures more intricate structure, and will control the shape of the posterior and thus the rate at which the uncertainty contracts. The scale θ can then be inferred hierarchically as introduced in §6, to find the scale constant. That is because the scale θ also shows up multiplicatively in the posterior on F:

$$m = \int_{\mathbb{X}} \big[ m_x + \breve{k}_{xX} \breve{k}_{XX}^{-1} (Y - m_X) \big]\, \mathrm{d}\nu(x),$$

$$v = \theta^2 \iint_{\mathbb{X}\times\mathbb{X}} \big[ \breve{k}_{xx'} - \breve{k}_{xX} \breve{k}_{XX}^{-1} \breve{k}_{Xx'} \big]\, \mathrm{d}\nu(x)\, \mathrm{d}\nu(x').$$

> 10.1.1 Gaussian Covariance / Squared-Exponential Kernel

Within this family of covariances is the Gaussian covariance discussed in §4.4, which can also be written indirectly using the Gaussian density function as

$$k(x, x') = \theta^2\, \mathcal{N}(x; x', \lambda^2).$$

When paired with the Gaussian measure

$$\nu(x) = \mathcal{N}(x; \mu, \sigma^2),$$

it yields tractable results (listed in §4.4) for the integrals required for Bayesian quadrature (Eqs. (10.5)–(10.7)). For convenience, we repeat that the posterior for the integral F given evaluations Y = [f(x_1), ..., f(x_N)] is

$$p(F \mid Y) = \mathcal{N}(F; m, v),$$

where

$$m = k_X^{\top} k_{XX}^{-1} Y, \qquad v = K - k_X^{\top} k_{XX}^{-1} k_X.$$

For the pairing of a Gaussian measure, zero prior mean and Gaussian covariance, we arrive at

$$[k_X]_i = \theta^2\, \mathcal{N}(x_i; \mu, \lambda^2 + \sigma^2) \in \mathbb{R}, \quad \text{for } i \in \{1, \ldots, N\},$$

$$K = \frac{\theta^2}{\sqrt{2\pi(2\sigma^2 + \lambda^2)}} \in \mathbb{R}_+.$$

Now we have all the ingredients for a practical Bayesian quadrature procedure for univariate problems.
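As a concrete illustration, the closed forms above fit in a few lines of code. The following is a minimal sketch (not code from this book), assuming NumPy and SciPy; the function name, default parameters, and the tiny jitter term are our own choices:

import numpy as np
from scipy.stats import norm

def bq_gauss(f, X, mu=0.0, sigma=1.0, theta=1.0, lam=0.5):
    """Bayesian quadrature for F = ∫ f(x) N(x; mu, sigma²) dx under a
    zero-mean GP prior with kernel k(x, x') = theta² N(x; x', lam²)."""
    X = np.asarray(X, dtype=float)
    Y = np.array([f(x) for x in X])
    # Gram matrix: k(x_i, x_j) = theta² N(x_i; x_j, lam²)
    Kxx = theta**2 * norm.pdf(X[:, None] - X[None, :], loc=0.0, scale=lam)
    # kernel mean, Eq. (10.6): [k_X]_i = theta² N(x_i; mu, lam² + sigma²)
    kX = theta**2 * norm.pdf(X, loc=mu, scale=np.sqrt(lam**2 + sigma**2))
    # initial error, Eq. (10.7): K = theta² / sqrt(2π (2 sigma² + lam²))
    K0 = theta**2 / np.sqrt(2 * np.pi * (2 * sigma**2 + lam**2))
    # posterior mean and variance of F, Eqs. (10.3)-(10.4) with m(x) = 0
    w = np.linalg.solve(Kxx + 1e-12 * np.eye(len(X)), kX)
    return w @ Y, K0 - w @ kX

For example, bq_gauss(lambda x: np.exp(-np.sin(3 * x)**2), np.linspace(-3, 3, 12)) returns both a point estimate and its error estimate for an integral against the standard Gaussian measure.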

> 10.1.2 Multivariate Integrals

Multivariate integrals are common. Fortunately, Bayesian quadrature is, in principle, readily extended to integrals over d variables (which we compile into the d-dimensional vector x). This extension requires only the choice of a Gaussian process prior over the d-dimensional space that yields tractable integrals against the measure. For instance, if the domain is ℝ^d, with multivariate Gaussian measure

$$\nu(x) = \mathcal{N}(x; \mu, \Sigma),$$

and we take a gp with zero prior mean and multi-dimensional Gaussian covariance

$$k(x, x') = \theta^2\, \mathcal{N}(x; x', \Lambda), \qquad (10.10)$$

we arrive at (where x_i is the ith node, a d-dimensional vector)

$$[k_X]_i = \theta^2\, \mathcal{N}(x_i; \mu, \Lambda + \Sigma) \in \mathbb{R}, \quad \text{for } i \in \{1, \ldots, N\},$$

$$K = \theta^2 \big( \det 2\pi(2\Sigma + \Lambda) \big)^{-1/2} \in \mathbb{R}_+.$$
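The multivariate case is a mechanical extension of the univariate sketch above. Again a hypothetical helper of our own, assuming SciPy’s multivariate normal:

import numpy as np
from scipy.stats import multivariate_normal as mvn

def bq_gauss_multi(f, X, mu, Sigma, theta=1.0, Lam=None):
    """Multivariate analogue of the univariate sketch in §10.1.1: zero-mean
    GP, kernel k(x, x') = theta² N(x; x', Lam), measure N(mu, Sigma)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    Lam = np.eye(d) if Lam is None else Lam
    Y = np.array([f(x) for x in X])
    Kxx = theta**2 * np.array([[mvn.pdf(xi, mean=xj, cov=Lam) for xj in X]
                               for xi in X])
    kX = theta**2 * np.atleast_1d(mvn.pdf(X, mean=mu, cov=Lam + Sigma))
    K0 = theta**2 / np.sqrt(np.linalg.det(2 * np.pi * (2 * Sigma + Lam)))
    w = np.linalg.solve(Kxx + 1e-12 * np.eye(n), kX)
    return w @ Y, K0 - w @ kX          # posterior mean and variance of F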

In practice, however, such a simple bq model is unlikely to work


well in high dimension. In general, multivariate integration is
plagued by the “curse of dimensionality”: the volume to be
integrated over grows exponentially in dimension. As a conse­
quence, if the number of evaluations does not grow similarly
exponentially, a high-dimensional volume will be only sparsely
explored by the evaluations. As a result, the higher the dimen­
sion, the greater the importance of the model, and the lesser
the importance of evaluations of the integrand. The Gaussian
covariance (10.10) is particularly poor in high dimension. The
light tails of the Gaussian covariance prevent an evaluation from
having influence over gp predictions at far-removed locations:
exactly what is necessary in order to predict anything other than
the gp prior mean in sparsely-explored spaces. At the time of
writing, Bayesian quadrature is thus rarely used in dimension
greater than around 20.

> 10.1.3 Other Tractable Models

We conclude this section with a survey of tractable models for Bayesian quadrature. Table 10.1 reviews other pairings of prior measure ν(x) and Gaussian process kernel k(x, x') that yield tractable results for the Bayesian quadrature equations, Eqs. (10.5)–(10.7). All assume a zero prior mean. This table is, by no means, to be considered exhaustive: a creative mind may be able to frame a given problem in additional ways to yield computable results. In doing so, it may be guided by the concrete interpretation of the model choice as a prior on functions. For example, one may consider the properties of samples from the prior to gain intuition for their fit to a particular problem.
Classic quadrature methods also form a similar “zoo” of choices (see Table 11.1), but choosing between them is a less intuitive process, and adding to them is even harder, because there are only analytical, rather than constructive, means to do so without the interpretation of a prior. Interpretability is a key strength of pn.

Table 10.1: A non-exhaustive list of distribution ν and kernel k pairs that provide a closed-form expression for both the kernel mean ∫ k(·, x) dν(x) and the initial error K from Eq. (10.7). Here TP refers to the tensor product of one-dimensional kernels, and S^d indicates the d-sphere {x = (x_1, ..., x_{d+1}) ∈ ℝ^{d+1} : ‖x‖_2 = 1}. This table is adapted from Briol et al. (2019).

X          | ν                          | k                      | Reference
[0, 1]^d   | Unif(X)                    | Wendland TP            | Oates et al. (2019a)
[0, 1]^d   | Unif(X)                    | Matérn Weighted TP     | Briol et al. (2019)
[0, 1]^d   | Unif(X)                    | Gaussian               | Use of error function
ℝ^d        | Mix. of Gaussians          | Gaussian               | Kennedy (1998)
S^d        | Unif(X)                    | Gegenbauer             | Briol et al. (2019)
Arbitrary  | Unif(X) or mix. of Gauss.  | Trigonometric          | Integration by parts
Arbitrary  | Unif(X)                    | Splines                | Wahba (1990)
Arbitrary  | Known moments              | Polynomial TP          | Briol et al. (2015)
Arbitrary  | Known ∇ log ν(x)           | Gradient-based kernel  | Oates, Girolami, and Chopin (2017); Oates et al. (2019a)

► 10.2 Node Selection

The Bayesian quadrature model engenders a natural design rule. That is, thinking of the design rule as resulting from the minimisation of an expected loss, one loss function immediately suggests itself: the square error. The expected loss is then equal to the variance of p(F | Y), v (the expected square error after receiving the evaluation Y). The resulting design rule selects the locations X = [x_1, ..., x_N] such that v is minimised (note that v is a function of X, a fact we will make explicit here):4

$$X = \underset{X \in \mathbb{R}^N}{\operatorname{arg\,min}}\; v(X).$$

4 Because the entropy of a Gaussian is a monotonic function of its variance (3.3), this design rule can also be phrased in an information-theoretic way as choosing the “most informative” design - that which minimises the conditional entropy of F given Y.

Designing the optimal grid for such a rule, even for regular kernels, can be challenging, because the corresponding multivariate optimisation problem can, in general, have high computational complexity.5 However, instead of finding an optimal grid, one can also sample, at cost O(N³), a draw from the N-determinantal point process (dpp) associated with k. Results by Bardenet and Hardy (2019) suggest that doing so causes only limited decrease in performance over the optimal deterministic grid design (which amounts to the maximum a posteriori assignment under the N-dpp). Belhadji, Bardenet, and Chainais (2019) offer further support for the use of determinantal point process sampling for integrands known to live in an rkhs.6

5 Ko, Lee, and Queyranne (1995).

6 The point of these papers does not disagree with our general argument, in §12.3, against the use of random sampling. Rather, these results show that allowing minor deviations from the optimal design can drastically reduce computational complexity at negligible decrease in performance. The necessary “samples” can even be drawn in a deterministic way.
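Because v depends on the nodes but not on any function value, the open-loop design can be computed offline with a generic optimiser. A small sketch under the Gaussian-kernel/Gaussian-measure model of §10.1.1 (function names are ours; a local optimiser like this only finds a local minimum of the non-convex objective):

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def integral_variance(X, mu, sigma, theta, lam):
    """v(X) from Eq. (10.4) for the model of §10.1.1; note that it never
    touches any observation Y."""
    X = np.atleast_1d(X)
    Kxx = theta**2 * norm.pdf(X[:, None] - X[None, :], scale=lam)
    kX = theta**2 * norm.pdf(X, loc=mu, scale=np.sqrt(lam**2 + sigma**2))
    K0 = theta**2 / np.sqrt(2 * np.pi * (2 * sigma**2 + lam**2))
    return K0 - kX @ np.linalg.solve(Kxx + 1e-12 * np.eye(len(X)), kX)

# open-loop design: pre-compute the nodes before a single evaluation of f
res = minimize(integral_variance, x0=np.linspace(-2, 2, 7),
               args=(0.0, 1.0, 1.0, 0.5), method="L-BFGS-B")
X_design = np.sort(res.x)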

Recall from Eq. (10.4) that the variance, v, of the integral


does not depend on the values of the observations, Y, at all.
Hence this is an open-loop, non-adaptive design criterion. This
is simultaneously a good and a bad property of all Gaussian
models. It means an evaluation pattern can be chosen a priori
and does not need to be adapted at runtime, potentially saving
computation time. However, the independence of the variance
from Y also implies that the error estimate (and the evalua­
tion locations) will be based entirely on prior assumptions, and
will not depend on the collected function values themselves.
These facts may feel disquieting: should we not hope that the
evaluations inform both our confidence and the locations of fu­
ture evaluations? To address these concerns, adaptive Bayesian
quadrature schemes have been devised.

> 10.2.1 Adaptive Selection

Adaptive Bayesian quadrature schemes have, broadly, adopted


non-Gaussian models, thereby shedding both the good and bad
of such models. To motivate such models, note that many inte­
gration problems possess non-negative integrands. The classic
case is probabilistic inference, where many integrands are a
product of non-negative densities like likelihoods and priors. A
particularly common case is the model evidence (or marginal
likelihood),

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid x)\, p(x)\, \mathrm{d}x. \qquad (10.11)$$

As noted in the motivation for this chapter, §9.1, solving such


integrals may also be a key step towards ai. To date, all known
Bayesian quadrature schemes that are adaptive (according to
our definition in §9.3) consider this setting of non-negative
integrands.
The first attempt at adaptive Bayesian quadrature7 was termed doubly-Bayesian quadrature (bbq) for the (Bayesian) decision-theoretic treatment of Bayesian quadrature. bbq adopts an (approximate) means of modelling the logarithm of the integrand with a gp, thereby (approximately) enforcing the non-negativity of the integrand. The motivation for this model is the large dynamic range (variation over many orders of magnitude) of many probabilistic integrands: a gp on the logarithm allows the model to learn such a large dynamic range. To select nodes, bbq uses the variance of the integral, (10.4), as a loss function, so that the expected loss is the expected square error after evaluating at that node. As such, bbq uses what is arguably both the most desirable model (a log-gp) and the most desirable loss function. However, bbq employs a first-order approximation to the exponential function, along with the maintenance of a set of candidate points x_c at which to refine the approximation. Such approximation proves both highly computationally demanding and to express only weakly the prior knowledge of the large dynamic range of the integrand.

7 Osborne et al. (2012).
The first practical adaptive Bayesian quadrature algorithm was wsabi,8 which adopts another means of expressing the non-negativity of an integrand: the square-root of the integrand f(x) (minus a constant, α ∈ ℝ) is modelled with a gp. Precisely,

$$f(x) = \alpha + \tfrac{1}{2} \tilde{f}(x)^2,$$

where, given data D,

$$p(\tilde{f} \mid \mathcal{D}) = \mathcal{GP}(\tilde{f}; m, V),$$

with m(x) and V(x, x') the usual gp posterior mean (4.6) and covariance (4.7), respectively (and thus depending on D). An integrand modelled as the square of a gp will have a smaller dynamic range than one modelled as an exponentiated gp. In this respect, wsabi is a step backwards from bbq.

However, the bbq approximations are significantly more costly, both in computation and quality, than those required for wsabi. That is, wsabi considers both linearisation and moment-matched approximation to implement the square-transformation: both prove more tractable than the linearised-exponential for bbq. Linearisation gives the following (approximate) posterior for the integrand:

$$p(f \mid \mathcal{D}) \approx \mathcal{GP}(f; m_L, V_L),$$
$$m_L(x) := \alpha + \tfrac{1}{2} m(x)^2,$$
$$V_L(x, x') := m(x)\, V(x, x')\, m(x'). \qquad (10.12)$$

The posterior for moment-matching is

$$p(f \mid \mathcal{D}) \approx \mathcal{GP}(f; m_M, V_M),$$
$$m_M(x) := \alpha + \tfrac{1}{2}\big( m(x)^2 + V(x, x) \big),$$
$$V_M(x, x') := \tfrac{1}{2} V(x, x')^2 + m(x)\, V(x, x')\, m(x'). \qquad (10.13)$$

8 Gunter et al. (2014).

The expressions above demonstrate that both the linearised and moment-matched approximations are readily implemented. In either case, the posterior for the integrand is a gp, manipulable using the standard gp equations (e.g. Eq. (4.6) for the posterior mean and Eq. (4.7) for the covariance). A closely related approach, moment-matched log transformation (mmlt),9 instead proposed moment-matching for the (arguably more desirable) exponential transformation, which, again, proves less costly than bbq.

9 Chai and Garnett (2019).
As a loss function, both wsabi and mmlt use the (point-wise) variance in the integrand (rather than the variance in the integral, as per bbq). This policy, of sampling a function where its variance is highest, is known generically as uncertainty sampling. That is, uncertainty sampling is the sequential design rule

$$x_{i+1} = \underset{x}{\operatorname{arg\,max}}\; V(x, x),$$

where V(x, x) is the posterior variance for the integrand (that is, V_L(x, x), Eq. (10.12), for linearised wsabi and V_M(x, x), Eq. (10.13), for moment-matched wsabi). Uncertainty sampling
seems sensible: by reducing the point-wise variance in the in­
tegrand, we are presumably also reducing the variance in the
integral so that, in the limit of evaluations, we should expect con­
vergence of the estimate for the integral. However, uncertainty
sampling is less clearly related to the ultimate goal of quadra­
ture than targeting the variance (and hence squared error) of
the integral.
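In code, uncertainty sampling reduces to maximising the (approximate) integrand variance. A sketch for the linearised transform of Eq. (10.12), over a finite candidate set rather than via a global optimiser (function names are hypothetical):

import numpy as np

def next_node_wsabi_linearised(m, V, candidates):
    """Uncertainty sampling for the linearised square-root transform:
    pick the candidate where the approximate integrand variance of
    Eq. (10.12), V_L(x, x) = m(x)² V(x, x), is largest. Here m and V
    are the posterior mean/covariance functions of the GP on the
    square-root of the integrand."""
    scores = np.array([m(x)**2 * V(x, x) for x in candidates])
    return candidates[int(np.argmax(scores))]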
For instance,10 consider the integral

$$F = \int f(x)\, \mathrm{d}x,$$

where f(x) := sin(x + φ) is known, but the phase φ ∈ ℝ is unknown. As a result, the pointwise variance in the integrand, var(f(x)), will depend on p(φ), and will generally be non-zero. If our loss function is the pointwise variance, we will be impelled to take evaluations (potentially many, if the evaluations are noisy) - however, these evaluations have no impact on our belief in F. Conversely, if our loss function is the variance in the integral, we need not make any evaluations at all. Given just prior information, we arrive at p(F) = δ(F): we can be certain that F = 0, so var(F) = 0. In such, arguably corner, cases, uncertainty sampling is clearly wasteful.

10 Thanks to Roman Garnett for this example.
However, in practice, uncertainty sampling leads to algorithms that are usually substantially faster than bbq. In fact, both mmlt and wsabi have demonstrated convergence that is significantly faster than Monte Carlo competitors - not in the number of evaluations, but in wall-clock time. As examples, Figures 10.1 and 10.2 demonstrate that wsabi can achieve a given estimation error in less time (measured in seconds) than even a sophisticated competitor like annealed importance sampling

Figure 10.1: A comparison of approaches to estimating the marginal likelihood F from Eq. (10.11) for a gp regressor fitted to real data - the yacht hydrodynamics benchmark data set (Gerritsma, Onnink, and Versluis, 1981). The required integral was eight-dimensional. This plot is adapted from Gunter et al. (2014), which contains full details.

(ais).11 These are remarkable results, because the algorithm for wsabi requires the maintenance of a full gp, with its costly O(N³) computational scaling in N, the number of evaluations. These algorithms also require the computationally expensive management of the gp’s hyperparameters. wsabi also selects evaluations at each iteration by solving a global optimisation problem, making calls to optimisation algorithms. In short, wsabi requires substantial computational overhead. In comparison, Monte Carlo makes use of a (pseudo-) random number generator (prng) to select evaluations, at negligible overhead cost, and a negligibly costly simple average as a model. Nonetheless, the substantial overhead incurred by wsabi, included in those measurements of wall-clock time, does not prevent wsabi from converging more quickly than Monte Carlo alternatives. Again, the overhead is perhaps better framed as an investment, whose returns more than compensate for the initial outlay. We will return to these considerations in our generic arguments against random numbers in §12.3.

11 Neal (2001).
Figure 10.2: A comparison of approaches to evaluating the marginal likelihood F from Eq. (10.11) for a gp classifier fitted to real citation network data (Garnett et al., 2012). The required integral was four-dimensional. Further details are available in the plot’s original source (Gunter et al., 2014).

Buttressing these promising empirical results, Kanagawa and Hennig (2019) provide a theoretical framework for adaptive Bayesian quadrature schemes. It develops a novel property -
termed weak adaptivity - that guarantees consistency (subject
to further regularity conditions) for adaptive bq rules with a
sequential design rule given by

$$x_{i+1} = \underset{x}{\operatorname{arg\,max}}\; a_i(x), \qquad \text{with} \qquad a_i(x) = T\big( q^2(x)\, V(x, x) \big)\, b_i(x),$$

where V(x, x) is the usual gp posterior variance from Eq. (4.7),


q is a positive function (e.g. the integration measure), T is the
transformation (also known as warping), and bi(x) is an “adap­
tivity” function (in linearised wsabi, that is the square posterior
mean). Weak adaptivity, loosely speaking, requires that bi is
bounded away from zero and infinity. Linearised wsabi does
not technically fulfil this condition, but can be made to do so
with a minor correction. Moment-matched wsabi and mmlt
do satisfy the condition. Intuitively, weak adaptivity means that
the method is not “too adaptive” relative to non-adaptive bq,
and so can only be “stuck for a while, but not forever ”. Kana­
gawa and Hennig (2019) also provide a worst-case bound on the
convergence rate that approaches that of non-adaptive Bayesian
quadrature. This result is weaker than that desired: empirically,
we usually see adaptive schemes offer improved convergence
over non-adaptive schemes.
Weak adaptivity plays a conceptually analogous role to the
notions of detailed balance and ergodicity used to show that
MCMC algorithms, likewise, can be “stuck for a while, but
not forever ”. For MCMC the resulting insight is that, on some
unknown time-scale, the mixing time of the Markov Chain, the
algorithm converges like direct Monte Carlo. Similarly, weak
adaptivity shows that, up to some constants, adaptive BQ works
at least as well as its non-adaptive counterpart. Detailed bal­
ance and ergodicity alone don’t necessarily make a good mcmc
method (consistency is a very weak property, after all). However,
the statistical community has used the theoretical underpinning
provided by detailed balance and ergodicity as a licence to de­
velop a diverse zoo of mcmc methods that are chiefly evaluated
empirically. One may hope that the licence of weak adaptivity
might enable a similar flourishing of adaptive bq schemes.
11  Links to Classical Quadrature

This section returns to the univariate integration problem. In this


setting, we will draw the many connections between classic and
Bayesian quadrature. We will also use the setting to illustrate
some deeper points about probabilistic integration, and, more
broadly, about Probabilistic Numerics.

► 11.1 The Bayesian Trapezoidal Rule

A particularly interesting Bayesian quadrature scheme is obtained by choosing the mean m(x) = 0, ∀x, and the covariance

$$k(x, x') = \theta^2 \big( \min(x, x') - \xi \big), \qquad (11.1)$$

with constants θ ∈ ℝ₊ and ξ ∈ ℝ, ξ < a, to set

$$p(f) = \mathcal{GP}\big( f;\, 0,\, \theta^2 (\min(x, x') - \xi) \big). \qquad (11.2)$$

As we saw on p. 50 (see also Figure 5.3), this defines the Wiener process with starting time ξ and intensity θ - a fundamental process, covering a particularly large hypothesis space.
So what is the integration rule arising from this prior? Assume function evaluations at an ordered set of nodes

$$X = \Big[ x_1,\; x_1 + \delta_1,\; x_1 + \delta_1 + \delta_2,\; \ldots,\; x_1 + \sum_{i=1}^{N-1} \delta_i \Big] =: [x_1, x_2, \ldots, x_N]$$

for δ_i ∈ ℝ₊, ∀i. To simplify the exposition, we assume a ≤ x_i ≤ b, ∀i. The marginal prior mean m_0 of Eq. (10.5) vanishes because of the vanishing prior mean m(x) = 0 chosen above, and the integrals from Eqs. (10.6) and (10.7) evaluate to

$$K = \iint k(x, x')\, \mathrm{d}x\, \mathrm{d}x' = \theta^2 \Big( \tfrac{1}{3} b^3 - a^2 b + \tfrac{2}{3} a^3 - \xi (b - a)^2 \Big)$$

Figure 11.1: The posterior measure over f under a Wiener process prior, after 11 evaluations at equidistant points (black circles). The plot shows the piecewise linear posterior mean of this process in solid black and the probability density function as a shading. A single sample from the posterior is shown for comparison to the integrand (both thin black). The sample is clearly much less regular than the integrand.

and

$$k(x_i) = \int_a^b k(x, x_i)\, \mathrm{d}x = \theta^2 \left( \int_a^{\min(b, x_i)} x\, \mathrm{d}x + \int_{\min(b, x_i)}^b x_i\, \mathrm{d}x - \xi (b - a) \right)$$
$$= \theta^2 \left( \tfrac{1}{2} \big( \min(b, x_i)^2 - a^2 \big) + x_i \big( b - \min(b, x_i) \big) - \xi (b - a) \right)$$
$$= \theta^2 \left( x_i b - \tfrac{1}{2} (a^2 + x_i^2) - \xi (b - a) \right)$$

by the assumption a ≤ x_i ≤ b above.
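The split of the min(·, ·) integral is easy to verify mechanically, for instance with SymPy (a throwaway check of our own, not part of the original derivation; t denotes the node x_i and c the starting time ξ):

import sympy as sp

x, t, a, b, c, th = sp.symbols('x t a b c theta')
# k(x, t) = theta**2 * (min(x, t) - c); for a <= t <= b the integral over
# x in [a, b] splits at x = t into the two pieces used above
k_mean = th**2 * (sp.integrate(x - c, (x, a, t)) + sp.integrate(t - c, (x, t, b)))
closed_form = th**2 * (t*b - (a**2 + t**2)/2 - c*(b - a))
assert sp.simplify(k_mean - closed_form) == 0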

These results provide one possible implementation, but as such they do not say much about the properties of this quadrature rule. As it turns out, they are a disguise for a well-known idea. A simple way to see this is to note that the posterior mean over f is not just a weighted sum of the observations Y (Eq. (10.8)), but also a weighted sum of kernel functions at the locations {x_i}:

$$\mathbb{E}_{p(f \mid Y)}\big( f(x) \big) = k_{xX} k_{XX}^{-1} Y = \sum_i k(x, x_i)\, \alpha_i, \qquad \text{with } \alpha := k_{XX}^{-1} Y. \qquad (11.3)$$

The k(x, x_i) of Eq. (11.1) are piecewise linear functions, each with a sole non-differentiable locus x = x_i. Thus the posterior mean of Eq. (11.3) is a sum of piecewise linear functions, hence itself piecewise linear, with “kinks” - points of non-differentiability - only at the input locations X; see Figure 11.1. As this is the posterior mean conditioned on Y, that piecewise linear function with N kinks has to pass through the N nodes in Y. Assuming, for simplicity, x_1 = a, x_N = b, there is only one such function1 on [a, b]: the linear spline connecting the evaluations: for a ≤ x_i ≤ x ≤ x_{i+1} ≤ b,

$$\mathbb{E}_{p(f \mid Y)}\big( f(x) \big) = f(x_i) + \frac{x - x_i}{x_{i+1} - x_i} \big( f(x_{i+1}) - f(x_i) \big).$$

The expected value of the integral is the integral of the expected value,2 written

$$\mathbb{E}_{p(f \mid Y)}\left( \int_a^b f(x)\, \mathrm{d}x \right) = \int_a^b \mathbb{E}\big( f(x) \big)\, \mathrm{d}x = \sum_{i=1}^{N-1} \frac{\delta_i}{2} \big( f(x_{i+1}) + f(x_i) \big). \qquad (11.4)$$

This is the trapezoidal rule.3 It is arguably the most basic quadrature rule, second only to Riemann sums. Hence we have the following result.

1 There are some technicalities to consider if x_1 does not coincide with a, because the choice of the starting time ξ affects the extrapolation behaviour on the left. One resolution is to choose the asymptotic setting ξ → −∞, which gives rise to a constant extrapolation, known as the natural spline (Minka, 2000) (Wahba, 1990, pp. 13-14). That same solution is also found by the filter of §11.1.1, which does not require a starting time.

2 The two integrals can be exchanged due to Fubini’s theorem, which states that this is possible whenever the (here: bivariate, over x and f) integrand is absolutely integrable, which is true for integrands fulfilling the assumptions above.

3 Davis and Rabinowitz (1984), §2.1.4.

Theorem 11.1. The trapezoidal rule is the posterior mean estimate4 for the integral F = ∫_a^b f(x) dx under any centred Wiener process prior p(f) = GP(f; 0, k) with k(x, x') = θ²(min(x, x') − ξ) for arbitrary θ ∈ ℝ₊ and ξ < a ∈ ℝ.

4 Because the mean of a Gaussian distribution coincides with the location of maximum density, the trapezoidal rule is also the maximum a posteriori estimate associated with this setup.

This is an important result: explicitly, it states that the trape­


zoidal rule is Bayesian quadrature, for a particular choice of co­
variance function. This foundational algorithm of classic quadra­
ture has a clear probabilistic numerical interpretation.
It is worth noting that the trapezoidal rule’s simplicity makes
it rarely the best integration rule in practice. Nonetheless, we
use the simplicity of the trapezoidal rule to drive intuitions
applicable to all numerical methods. For stronger methods, see
§10 and §11.4.

> 11.1.1 Alternative Derivation as a Filter

Another way to arrive at Theorem 11.1 is to phrase the joint


inference on (f, F) under the Wiener process model for the inte­
grand as a Kalman filter (§5). Doing so provides an alternative
proof of Theorem 11.1 and a convenient way to compute the
posterior variance on F, which we have so far ignored in the
analysis. A downside of the filtering formulation is that it does
not extend to the multivariate domain: the form of Eq. (10.2) is
more general. However, the filtering framework affords some
generalisations, introduced below.
Further, the algorithmic implementation of the filter serves
as an explicit example that a probabilistic formulation of com­
putation need not have prohibitively high cost. In fact, the
probabilistic numerical algorithm can be identical in cost to a
classical form. At this point, it may be helpful to review §5.3
on stochastic differential equations (SDEs), from which we adopt
notation.
To simplify notation, define the anti-derivative

$$F_x = \int_a^x f(\tilde{x})\, \mathrm{d}\tilde{x} \quad \text{for } x \geq a, \qquad (11.5)$$



and consider the state space z(x) = [F_x, f(x)]^⊤. This state space links the integrand and its anti-derivative, allowing the Gaussian process prior of Eq. (11.2) to equivalently be formulated as the linear SDE

$$\mathrm{d}z(x) = F z(x)\, \mathrm{d}x + L\, \mathrm{d}\omega_t, \qquad \text{with} \qquad F = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad L = \begin{bmatrix} 0 \\ \theta \end{bmatrix}.$$

As already derived in Eqs. (5.25) and (5.26), from Eq. (5.21) we find that this SDE is associated with the discrete-time quantities (for a step of length x_{i+1} − x_i =: δ_i)

$$A_i = \begin{bmatrix} 1 & \delta_i \\ 0 & 1 \end{bmatrix}, \qquad Q_i = \theta^2 \begin{bmatrix} \delta_i^3/3 & \delta_i^2/2 \\ \delta_i^2/2 & \delta_i \end{bmatrix}.$$

Using H = [0, 1] and setting R = 0 to encode the likelihood p(y_i | f) = δ(y_i − f(x_i)), we can thus write the steps of the Kalman filter (Algorithm 5.1) explicitly,5 and find that they simplify considerably. The mean and covariance updates in lines6 7 and 8 of Algorithm 5.1 are simply

$$m_i = \begin{bmatrix} [m_{i-1}]_1 + \tfrac{\delta_i}{2} \big( y_i + [m_{i-1}]_2 \big) \\ y_i \end{bmatrix}, \qquad (11.6)$$

$$P_i = \begin{bmatrix} [P_{i-1}]_{11} + \delta_i^3/12 & 0 \\ 0 & 0 \end{bmatrix}. \qquad (11.7)$$

5 Since the goal is to infer the definite integral F_b at the right end of the domain, there is no need to also run the smoother (Algorithm 5.2). It could be used, however, to also construct estimates for the anti-derivative F_x (Eq. (11.5)) at arbitrary locations a ≤ x ≤ b.

6 Note that the symbol z has a different meaning in this chapter than in Algorithm 5.1.

If we allow for an observation at x_1 = a, then the initial values of the SDE are irrelevant. Since we know F_a = 0 by definition (thus with vanishing uncertainty), the natural initialisation for the filter at x_1 = a is

$$m_1 = \begin{bmatrix} 0 \\ f(a) \end{bmatrix}, \qquad P_1 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

The filter thus takes the straightforward form of Algorithm 11.1. This algorithm is so simple that it barely makes sense to spell it out. The significance of this result is that Bayesian inference on an integral, using a nonparametric Gaussian process with N evaluations of the integrand, can be performed in O(N) operations.
Put another way, the simple software implementation of the trapezoidal rule is identical to that of a particular Bayesian quadrature algorithm.

Exercise 11.2 (easy). Convince yourself that Eqs. (11.6) and (11.7) indeed arise as the updates in Algorithm 5.1 from the choices of A, Q, H, R made above. Then show that the resulting mean estimate m_N at x = b indeed amounts to the trapezoidal rule (e.g. by a telescoping sum). That is,

$$\mathbb{E}(F) = \sum_{i=1}^{N-1} \frac{\delta_i}{2} \big( f_{i+1} + f_i \big), \qquad \mathrm{var}(F) = \frac{\theta^2}{12} \sum_{i=1}^{N-1} \delta_i^3.$$

In practice, the algorithm could thus be implemented in this simpler (and parallelisable) form. Note again, however, that this algorithm is not a good practical integration routine, only a didactic exercise. See §11.4 and the literature cited above for more practical algorithms.

Algorithm 11.1: The probabilistic trapezoidal rule formulated as a filter. The algorithm takes a handle to the integrand f, the integration limits a, b, a budget of N evaluations, and an externally set scale θ² (see below for how to adapt this at runtime).

1   procedure Integrate(@f, a, b, N, θ²)
2     δ := (b − a)/(N − 1)                  / choose step size
3     x ← a, y_1 = f(a), m ← 0, V ← 0       / initialise
4     for i = 2, ..., N do
5       x ← x + δ                           / step
6       y_i ← f(x)                          / evaluate
7       m ← m + δ/2 (y_{i−1} + y_i)         / update estimate
8       V ← V + δ³/12                       / update error estimate
9     end for
10    return E(F) = m, var(F) = θ²V         / probabilistic output
11  end procedure
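For readers who prefer executable code, here is a direct Python rendering of Algorithm 11.1 (a didactic sketch, like the algorithm itself; the function name is ours):

import numpy as np

def prob_trapezoid(f, a, b, N, theta2=1.0):
    """Python rendering of Algorithm 11.1: the trapezoidal rule together
    with its fixed-scale posterior error estimate."""
    delta = (b - a) / (N - 1)          # regular grid step
    x, y_prev = a, f(a)                # first evaluation at the left boundary
    m, V = 0.0, 0.0                    # running estimate and unit-scale variance
    for _ in range(N - 1):
        x += delta
        y = f(x)
        m += delta / 2 * (y_prev + y)  # mean update, cf. Eq. (11.6)
        V += delta**3 / 12             # variance update, cf. Eq. (11.7)
        y_prev = y
    return m, theta2 * V               # E(F), var(F)

On the regular grid, the returned variance equals θ²(b − a)³/(12(N − 1)²), in agreement with Eq. (11.9) below.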

> 11.1.2 Uncertainty

Formulating univariate integration as inference - as the construction of a posterior distribution over a latent quantity, under certain prior assumptions - we have “re-discovered” an old and well-trusted method for this numerical task. Though simplistic, this is an encouraging result for Probabilistic Numerics. Evidently, probabilistic numerical algorithms do not have to be abstract and involved, but can be quite simple indeed.
But the probabilistic formulation not only yielded a new way to derive the well-known trapezoidal rule, but something new: an estimator for the error of the trapezoidal estimate, in the form of the posterior variance v in Eq. (10.2). For the concrete choice of the Wiener process prior (11.2), we saw that this expression simply evaluates to

$$v = \mathrm{var}_{p(f \mid Y, X)}(F) = \iint V(x, x')\, \mathrm{d}x\, \mathrm{d}x' = \frac{\theta^2}{12} \sum_{i=1}^{N-1} \delta_i^3. \qquad (11.8)$$

We will have to ask whether this value - which, again, clearly does not depend on the collected values y_i at all - has any sensible interpretation as an error estimate. Before we address this issue, however, we find that its form can be used as a pleasingly simple way to fix the evaluation design, the step sizes [δ_i]_{i=1,...,N−1}.

> 11.1.3 Regular Grids as the Maximally-Informative Design

In many basic implementations of the trapezoidal rule, the evaluation nodes [x_1, ..., x_N] are placed on a regular grid. The above result motivates this choice probabilistically, without appealing to classical analysis. As mentioned in §10.2, a natural design rule is to require var(F) to be minimised. It follows from Eq. (11.8) that this is achieved precisely when the evaluation points lie on an equidistant grid.7 See §11.4 for more discussion on how node-placement in classic quadrature rules relates to optimal design from the probabilistic perspective.

7 To see this, first encode the assumption (made above) that the first and last node must lie on the boundaries, by setting δ_{N−1} = (b − a) − Σ_{i=1}^{N−2} δ_i. The gradient w.r.t. δ_j, j < N − 1, is then ∂var(F)/∂δ_j = θ²/4 [δ_j² − δ_{N−1}²]. Setting this to zero (recall that all δ_i must be positive) gives δ_j = δ_{N−1}, ∀j ≠ N − 1. Without the formal requirement of x_1 = a, x_N = b, it actually turns out (Sacks & Ylvisaker, 1970 & 1985) that the best design is x_i = a + (b − a) · 2i/(2N + 1). This leaves a little bit more room on the left end of the domain than on the right, due to the time-directed nature of the Wiener process prior. E.g., for N = 2, the optimal nodes on [a, b] = [0, 1] are at [2/5, 4/5].

> 11.1.4 Adaptive Error Estimation

Just like the mean estimate of the trapezoidal rule (11.4), the equidistant grid as an optimal design choice for the rule is independent of the scale θ² of the prior. Hence, there is a family of models M(θ), parametrised by θ ∈ ℝ₊, so that every Gaussian process prior in that family gives rise to the same design (the same choice of evaluation nodes), and the same trapezoidal estimation rule. The associated estimate for the square error, the posterior variance, of these models is given by Eq. (11.8). If we choose the design with equidistant steps as derived above, that expression is given by

$$\mathrm{var}(F) = \frac{\theta^2}{12} (N - 1) \left( \frac{b - a}{N - 1} \right)^3 = \frac{\theta^2 (b - a)^3}{12 (N - 1)^2}, \qquad (11.9)$$

which means that the standard deviation std(F) = √var(F), an estimate for the absolute error, contracts at a rate O(N⁻¹), and thus more rapidly than the O(N⁻¹/²) of the Monte Carlo estimate.
Hence the choice of the kernel k, other than the scale θ, determines the rate at which the error estimate var(F) contracts, while θ itself provides the constant scale of the error estimate. This situation mirrors the separation, in classical numerical analysis, between error analysis (rate) and error estimation (scale). The algebraic form of the estimation rule is difficult to fundamentally change at runtime without major computational overhead, so its properties (e.g. rate) are studied by abstract analysis. The scale, on the other hand, relates to an estimate of the concrete error of the estimate, and should be estimated at runtime.
Sections 6 and 6.3 introduced the mechanism of conjugate prior hierarchical inference on θ: using a Gamma distribution to define a prior on the inverse scale θ⁻², the joint posterior over θ and (f, F) remains tractable, and can be used to address the error estimation problem. Given the prior p(θ⁻²) = G(θ⁻²; α₀, β₀), the posterior on θ⁻² can be written using the recursive terms in the Kalman filter as (reproduced for convenience from Eq. (6.14))

$$p(\theta^{-2} \mid [Y]_{1:N}) = \mathcal{G}\left( \theta^{-2};\; \alpha_0 + \frac{N}{2},\; \beta_0 + \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - H m_i^-)^2}{H P_i^- H^{\top}} \right) =: \mathcal{G}(\theta^{-2}; \alpha_N, \beta_N). \qquad (11.10)$$

This posterior on θ is associated with a corresponding Student-t posterior marginal on the integral F (see Eq. (6.9)),

$$p(F \mid [Y]_{1:N}) = \int \mathcal{N}\big( F;\, \mu_{F \mid Y},\, \theta^2 \tilde{\sigma}_F^2 \big)\, \mathcal{G}(\theta^{-2}; \alpha_N, \beta_N)\, \mathrm{d}\theta^{-2} = \mathrm{St}\left( F;\, \mu_{F \mid Y},\, \frac{\alpha_N}{\beta_N \tilde{\sigma}_F^2},\, 2\alpha_N \right),$$

where σ̃_F² := var_{|Y}(F)/θ² is the standardised posterior variance for F under the “unit” kernel, and μ_{F|Y} := E_M(F | Y) is the corresponding posterior mean for F. The mean of the Student-t coincides with μ_{F|Y}. Its variance is8

$$\sigma_F^2 = \frac{\beta_N}{\alpha_N - 1}\, \tilde{\sigma}_F^2 = \frac{\beta_N}{\alpha_0 + N/2 - 1}\, \tilde{\sigma}_F^2. \qquad (11.11)$$

8 E.g. Eq. (B.66) in Bishop (2006). For large values of N, the shape of the distribution approaches that of a Gaussian, with mean μ_{F|Y} and variance given by Eq. (11.11).

From Eq. (11.10) we see that, in contrast to Eq. (11.8), this estimated variance now actually depends on the function values collected in Y. For the specific choice of the Wiener process prior (11.2), the values collected in β_N in Eq. (11.10) become9

$$\beta_N = \beta_0 + \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - H m_i^-)^2}{H P_i^- H^{\top}} = \beta_0 + \frac{1}{2} \left( \frac{f(x_1)^2}{x_1 - \xi} + \sum_{i=2}^{N} \frac{\big( f(x_i) - f(x_{i-1}) \big)^2}{\delta_i} \right). \qquad (11.12)$$

For reference, Algorithm 11.2 on p. 97 provides pseudo-code and highlights again that this Bayesian parameter adaptation can be performed in linear cost, by collecting a running sum of the local quadratic residuals (f(x_i) − Hm_i^−)² of the filter.

9 If necessary, the second line of Eq. (11.12) can be used to fix ξ to its most likely value, given by

$$\xi_{\mathrm{ML}} := x_1 - f(x_1)^2 \left( \frac{1}{N - 1} \sum_{i=2}^{N} \frac{\big( f(x_i) - f(x_{i-1}) \big)^2}{\delta_i} \right)^{-1}.$$

However, if x_1 = a, and the first evaluation y_1 is made without observation noise, the value of ξ has no effect on the estimates.

What is the cost of adding uncertainty to a computation? In the case of the trapezoidal rule, Algorithm 11.2 gives the answer, which depends on how one defines computation. If we count the number of evaluations of f, then probabilistic inference can be offered at no computational overhead over the classic equivalent. If we count the overall number of computations, then we need to consider only the additional accumulation of the terms (f(x_i) − Hm_i^−)². Hence the overhead is a small percentage of the cost of the classic method.

Figure 11.2: Convergence for Monte Carlo and trapezoidal rule quadrature estimates, along with different error estimates. The shown instance of Monte Carlo integration converges as O(N⁻¹/²), as suggested by Lemma 9.2 (theoretical standard deviation from Eq. (9.5) shown in dashed black). The trapezoidal rule overtakes the quality of the MC estimate after eight evaluations, and begins to approach its theoretical convergence rate for differentiable integrands, O(N⁻²) (each thin line corresponds to a different multiplicative constant). The non-adaptive GP error estimates of the form const./N (Eq. (11.9)) are under-confident. So is the adaptive Student-t error estimate (dash-dotted, as in Eq. (11.11)), reflecting the overly conservative assumption of continuity but non-differentiability in the Wiener process prior. Nevertheless, the adaptive error estimate contracts faster than the non-adaptive rate of O(N⁻¹).

> 11.1.5 Convergence Rates

How good, in practice, is the probabilistic numerical trapezoidal rule? Recall that this rule was derived by assigning, ad hoc, a Wiener process prior over the integrand f and placing evaluation nodes x_i to minimise the posterior variance on the integral F: these choices are far from canonical. Figure 11.2 compares the convergence of the estimator E_M(F) with that of the Monte Carlo integration estimate. To create this plot, the true integral was evaluated to high precision using another, advanced quadrature routine.
The solid black line of the Monte Carlo estimate varies stochastically, its trend described by the standard-deviation √(var_p(w)/N) convergence of Lemma 9.2. Compared to this stochastic rate, the trapezoidal rule estimate converges much faster: after N = 32 evaluations of f, the absolute error between the true integral F and the estimate of the trapezoidal rule is roughly 1.3 × 10⁻⁶. To reach the same fidelity with the Monte Carlo estimator, the expected number of required function evaluations is N ≈ 8.8 × 10¹⁰, or 2.75 billion times more evaluations. So using the trapezoidal rule allows a large saving of computational cost in this situation.10

10 Again, the argument here is not that the trapezoidal rule is particularly efficient (it is not!), but that even the limited prior information encoded in the Wiener process allows for a drastic increase in convergence rate over the Monte Carlo rule, which can be interpreted as arising from a maximally “uninformative prior” (see §12.2).
What is the reason for this increase in performance? The
probabilistic interpretation for the trapezoidal rule offers an
explanation, in form of the prior in (11.2). It defines a hypothe­
sis space which assigns non-zero probability measure to only
continuous functions f . This is a more restrictive assumption
than that of Monte Carlo, which only requires the integrand to
be integrable. But of course it is a correct assumption, because the integrand of (9.1) is indeed continuous (even smooth).
After about N = 64 evaluations, the trapezoidal rule settles into a relatively homogeneous convergence at a rate of approximately O(N⁻²). This behaviour is predicted by classical analyses11 of this rule for differentiable integrands like this one.12 The probabilistically constructed non-adaptive error estimate of the trapezoidal rule (Eq. (11.8)) predicts a more conservative, slower, convergence rate O(N⁻¹). We know from Eq. (11.9) that this is a direct consequence of the Wiener process prior assumption: draws from Wiener processes are very rough functions (almost surely continuous but not differentiable). While there is no direct classic analysis for this hypothesis class of Wiener samples itself, there is classical analysis of the trapezoidal rule for Lipschitz continuous functions that agrees with the posterior error estimate and predicts a linear error decay.13 Even the adaptive error estimate arising from Eq. (11.11), although converging faster than the 1/N rate of the gp posterior standard deviation, still yields a conservative estimate (the dash-dotted line in Figure 11.2).
The Wiener process is such a vague prior that it not only fails to capture salient analytical properties of the integrand, but also gives overly conservative estimates! This kind of insight is a strength of the probabilistic viewpoint. As we have associated the trapezoidal rule with the unnecessarily broad Wiener-process prior, that classic method, too, now seems under-constrained. Even without further analysis, we can strongly conjecture that better algorithms must be feasible for integrands like that of Eq. (9.1). And the gp prior with its defining parameters m, k provides a concrete handle guiding the search for such rules. We should seek to incorporate additional, helpful prior knowledge, to concentrate prior probability mass around the true integrand.
The space of such priors for integrands is large. It includes models that give rise to other, more advanced, classical quadrature rules (see §11.4), but also many others that are not associated with a classic method.14 Bayesian quadrature is thus a framework for the design of customised algorithms for specific problems. The remainder of this chapter contains some examples. The following section studies one particularly straightforward way to incorporate knowledge15 about the smoothness of the integrand.

11 Davis and Rabinowitz (1984), p. 53.

12 The intuition for the corresponding proof is that, if the integrand is continuously differentiable, then, by the midpoint rule, the infimum and supremum of f′ give an upper and lower bound on the deviation of the true integral in a segment [x_i, x_{i+1}] from the integral over the linear posterior mean in that segment, and that deviation drops quadratically with the width of the segments.

13 Davis and Rabinowitz (1984), p. 52. Note, however, that draws from the Wiener process are actually almost surely not Lipschitz. The nearest class of continuity that can be shown to contain them is that of Hölder-1/2 continuous functions; a very rough class for which the cited theorem can only guarantee O(N⁻¹/²) convergence.

14 Not every classical quadrature rule can be derived from a Gaussian process prior on the integrand - at least not a natural prior. An example of a method that cannot be derived in this way is the piecewise quadratic interpolation rule

$$\int_a^b f(x)\, \mathrm{d}x \approx \sum_i \frac{\delta}{3} \big( f(x_{i-1}) + 4 f(x_i) + f(x_{i+1}) \big)$$

(for a regular grid {x_i}_{i=1,...,N} with spacing δ, the sum running over every second index), which, depending on national allegiance, is either called Kepler’s or Simpson’s rule. There is actually a “contrived” probabilistic interpretation for this rule: given a fixed evaluation grid, it arises from a Gaussian prior on the weights for piecewise parametric polynomial features (Diaconis, 1988). But that interpretation is so far-fetched (among other problems, it requires a constant set of pre-determined evaluation nodes) as to seem useless.

15 There is a difference between knowledge and assumptions. A prerequisite for any numerical computation is the explicit definition of the task, in a machine-readable form like (9.1) - that is, an encoding in a formal (programming) language. Thus, properties like the existence of continuous derivatives can be truly known, not just assumed.

Figure 11.3: Draws from, and marginal


densities of, the zero-mean integrated
(top) and twice-integrated (bottom)
Wiener processes. Dotted lines denote
two standard deviations.

► 11.2 Spline Bayesian Quadrature for Smoother Integrands

If the prior behind the trapezoidal rule is too conservative, can we use additional information, e.g. about the integrand’s smoothness, to build a better quadrature rule? More precisely, what if we know that the integrand is differentiable q times (but not what those derivatives are)? The Wiener process prior on f, which turned out to be the probabilistic analogue of the trapezoidal rule, is the simplest case of a more general class of quadrature rules based on polynomial spline interpolation. We can hence use the derivations provided in §5.4 to build new quadrature routines.
Let us assume that the integrand f is not just continuous, but has q continuous derivatives. We define the state

$$z(x) = \begin{bmatrix} F_x \\ f(x) \\ f'(x) \\ \vdots \\ f^{(q)}(x) \end{bmatrix}$$

and again model dz(x) = M z(x) dx + L dω_t, now with the matrices M, L as defined in Eq. (5.22). For q = 1, we get the integrated Wiener process, for q = 2, the twice-integrated Wiener process (Figure 11.3), and so on. Posterior mean functions of the integrated Wiener process are cubic splines, those of the twice-integrated Wiener process are quintic splines, etc. (Figure 11.4).
Using these two, and H = [0, 1, 0, 0, ...], the standard Kalman filter of Algorithm 5.1 turns into a probabilistic integration method. It is reproduced, with the additional lines required for hyperparameter adaptation, in Algorithm 11.2. The only terms requiring some care are the initialisation. Because we do not know - at least not without further computation - the derivatives of the integrand, we should be ignorant about them at initialisation. A simplistic way to encode this is to initialise m = [0, f(a), 0, ...] and

$$P = \begin{bmatrix} 0 & 0 & 0_{1 \times q} \\ 0 & 0 & 0_{1 \times q} \\ 0_{q \times 1} & 0_{q \times 1} & a\, I_q \end{bmatrix} \qquad (11.13)$$

with a “very large” value of a.

Algorithm 11.2: General probabilistic quadrature rule in Kalman filter form, including marginalisation of the intensity θ². Lines 3–5 define the model, the rest is a generic form of the Kalman filter with conjugate prior hierarchical inference on θ². The notation @f is meant to imply access to the function f itself at arbitrary inputs.

1   procedure Integrate(@f, a, b, N)
2     x ← a                                      / initialise state
3     m ← [0; f(a); 0_q], P ← e.g. Eq. (11.13)   / define prior
4     A, Q ← e.g. Eqs. (5.25) & (5.26)
5     α ← α₀, β ← β₀
6     for i = 1, ..., N − 1 do
7       x ← x + δ                                / step
8       m⁻ = A m                                 / predictive mean
9       P⁻ = A P A^⊤ + Q                         / predictive covariance
10      z = f(x) − H m⁻                          / observation residual
11      s = H P⁻ H^⊤                             / residual variance
12      K = P⁻ H^⊤ / s                           / gain
13      m ← m⁻ + K z                             / update mean
14      P ← P⁻ − K s K^⊤                         / update covariance
15      β ← β + z²/(2s)                          / update hyperparameter
16    end for
17    E(F) ← m₁                                  / point estimate
18    var(F) ← β/(α₀ + N/2 − 1) · P₁₁            / error estimate
19    r ← N/2 − (β − β₀)                         / model fit diagnostic (see §11.3)
20    return E(F), var(F)                        / return mean, variance of integral
21  end procedure
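As with Algorithm 11.1, an executable rendering may help. The following sketch is our own: it writes out the transition matrices of the iterated-integral chain in the standard integrated-Wiener-process closed form, which should be checked against Eqs. (5.25) and (5.26); all function names and the "large" initialisation constant are assumptions for illustration.

import numpy as np
from math import factorial

def iwp_transition(q, delta, theta2=1.0):
    """A, Q for the state (F, f, f', ..., f^(q)), a white-noise-driven
    chain of iterated integrals; standard closed forms, cf. Eqs. (5.25)-(5.26)."""
    D = q + 2
    A = np.zeros((D, D))
    Q = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            if j >= i:
                A[i, j] = delta**(j - i) / factorial(j - i)
            p = 2 * D - 1 - i - j
            Q[i, j] = theta2 * delta**p / (p * factorial(D - 1 - i)
                                             * factorial(D - 1 - j))
    return A, Q

def prob_quadrature(f, a, b, N, q=1, alpha0=1.0, beta0=1.0, large=1e6):
    """Python rendering of Algorithm 11.2 (a sketch): Kalman-filter
    quadrature with conjugate inference on the scale theta²."""
    D = q + 2
    delta = (b - a) / (N - 1)
    A, Q = iwp_transition(q, delta)            # unit-scale model, theta² = 1
    H = np.zeros((1, D)); H[0, 1] = 1.0        # we observe f(x), the 2nd state
    m = np.zeros(D); m[1] = f(a)               # initialise as in Eq. (11.13)
    P = np.zeros((D, D)); P[2:, 2:] = large * np.eye(q)
    beta = beta0
    x = a
    for _ in range(N - 1):
        x += delta
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        z = f(x) - (H @ m_pred)[0]             # observation residual
        s = (H @ P_pred @ H.T)[0, 0]           # residual variance
        K = (P_pred @ H.T)[:, 0] / s           # gain
        m = m_pred + K * z
        P = P_pred - np.outer(K, K) * s
        beta += z**2 / (2 * s)                 # running sufficient statistic
    EF = m[0]
    varF = beta / (alpha0 + N / 2 - 1) * P[0, 0]
    r = N / 2 - (beta - beta0)                 # model-fit statistic, Eq. (11.15)
    return EF, varF, r

A call such as prob_quadrature(f, 0.0, 1.0, 64, q=1) returns the point estimate, its adapted error estimate, and the model-fit statistic r discussed in §11.3.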

Figure 11.5 shows empirical convergence of these rules when
integrating our running example of Eq. (9.1), for the choices
q = 0 (the trapezoidal rule) through q = 3.

► 11.3 Selecting the Right Model

We have seen how the performance of quadrature methods critically depends on how well their implicit assumptions - their model - describe the integrand. How should the model be selected? In the unusual case that the integrand is an element of the prior hypothesis space, performance can be increased by concentrating prior mass on the integrand. In the more realistic case that the integrand is not in the prior space, the issue changes to how much mass the prior assigns to good approximations that are close to the integrand.16 In either case, designing a good quadrature rule for one specific problem involves finding a good combination of prior mean m and covariance k so that the three integrals of Eqs. (10.5), (10.6), and (10.7) are analytically available. Recall that we introduced a diverse range of tractable possibilities for Bayesian quadrature in §10.1. Exploring this space to find the best model is daunting.

16 Kanagawa, Sriperumbudur, and Fukumizu (2020).

Figure 11.4: Posteriors over the integrand f of Eq. (9.1) arising from the integrated Wiener process (top, cubic spline mean) and twice-integrated Wiener process (bottom, quintic spline mean). Compare against the priors in Figure 11.3. For further explanation, see Figure 11.1.
Even if we only consider Bayesian quadrature algorithms based on linear state-space models, there is a large space of potential sdes to consider. In this setting, can the choice of model be automated? This search for the “right” prior model is itself another, potentially challenging, inference problem. In §11.1.4, we tackled a simple version of this problem: we managed (that is, analytically marginalised, rather than selected) the model’s parameter θ, the scale of uncertainty in the integral for the covariance family (10.9). For aspects of the prior other than this simple scalar term, such hierarchical inference is not always tractable. As such, its computational overhead is likely to be difficult to justify for a single integrand of modest evaluation cost.

Figure 11.5: Convergence of quadrature rules from higher-order Wiener filters on the integrand of Eq. (9.1). The lower plots show the model fit statistic introduced in Eq. (11.15).

However, inferring components of the model can still be interesting. Here we investigate doing so as a backstop mechanism, to detect if a numerical method is a poor match for the problem at hand. Consider the model class (10.9) with unit uncertainty-scale hyperparameter, θ = 1. A likelihood for the model itself can be found by marginalising over the latent function f, as given in (6.13). The logarithm of this likelihood is a simple sum

$$\log p(Y \mid M) = -\frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - H m_i^-)^2}{H P_i^- H^{\top}} - \frac{1}{2} \sum_{i=1}^{N} \log \big| H P_i^- H^{\top} \big| + \text{const.} \qquad (11.14)$$

To provide a reference baseline for this quantity, assume we have to predict this quantity before observing the values Y, but after having decided on the (regular) grid locations. Under the prior, the expected value of y_i is Hm_i^−, and the expected value of (y_i − Hm_i^−)² is just HP_i^−H^⊤. Hence, the expected value of Eq. (11.14) for observables Y drawn directly from the model is

$$\mathbb{E}_{Y \mid M}\big( \log p(Y \mid M) \big) = -\frac{N}{2} - \frac{1}{2} \sum_{i=1}^{N} \log \big| H P_i^- H^{\top} \big| + \text{const.}$$


The expected log-ratio between predicted and observed likeli-

hood is

$$r(Y, M) := \int \log \frac{p(Y \mid M)}{p(Y' \mid M)}\, p(Y' \mid M)\, \mathrm{d}Y' = \frac{N}{2} - \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - H m_i^-)^2}{H P_i^- H^{\top}} = \frac{N}{2} - \sum_{i=1}^{N} \frac{z_i^2}{2 s_i} = \frac{N}{2} - (\beta_N - \beta_0), \qquad (11.15)$$

using the interior Kalman filter variables from Algorithm 11.2.


We can see that the expected log-ratio is independent of the
terms that only depend on design choices, rather than the data
itself. It compares the variability of the observed data with its
expectation. If r(Y, M) > 0, then the integrand is less variable
and more regular than expected under the prior, and we can ex­
pect the error estimate varM (F) to be conservative (i.e. the true
error is likely smaller than the estimated one) - the numerical
method is under-confident. On the other hand, if r(Y, M) < 0,
then the integrand varies more than the model would have
predicted. Other, so far unexplored, areas are likely to lead to
similar surprises, and thus the error estimate should not be
trusted - it is over-confident.
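Concretely, r reduces to a one-line computation on quantities the filter
already tracks. The sketch below is a minimal illustration only, with
hypothetical variable names, assuming access to the innovations
z_i = y_i - H m_i^- and predictive variances s_i² = H P_i^- H^T of
Algorithm 11.2 (for θ = 1):

    import numpy as np

    def evidence_statistic(z, s2):
        # r(Y, M) of Eq. (11.15) from the filter's interior quantities:
        # innovations z_i and predictive variances s_i^2.
        # r > 0: integrand less variable than the prior expects
        #        (error estimate conservative, method under-confident);
        # r < 0: integrand more variable than expected (over-confident).
        z, s2 = np.asarray(z), np.asarray(s2)
        return 0.5 * (len(z) - np.sum(z**2 / s2))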
Conveniently, the final line of Eq. (11.15) reveals that r is a
trivial transformation of the sufficient statistic β, already
collected in Algorithm 11.2! It can thus be computed at runtime
without additional overhead. The bottom row of Figure 11.5
shows these values r during the integration process for the
different models. We see that, although the integrand in this case is
an infinitely smooth function, its derivative varies on a scale not
expected by the higher-order Wiener processes. Based on the
evidence, q = 1 appears like a good choice within this family
of integrators - and indeed, the convergence rates do not rise
further on increasing q.

[Figure 11.6: A draw from the integrated Wiener process. This particular
function was used as the alternative integrand in the experiments
represented in Figure 11.7.]
Figure 11.7 shows another example: this time, the integrand
is a true sample from an integrated (q = 1) Wiener process (the
integrand is shown in Figure 11.6). As expected, the integrator
based on the rough q = 0 process collects values for r that
suggest its model assumptions are too rough, while the q > 1
integrators can detect their overconfident smoothness assump­
tions. For fixed values of θ, these integrators have over-confident
error estimates. Posterior inference on θ adapts the error esti-
mates so they become cautious again, but remain unreliable.
[Figure 11.7: Convergence of quadrature rules when integrating a sample
from a q = 1 times integrated Wiener process (horizontal axes: # samples).
Top row: error of the mean estimator (solid black, other values of q in
grey for comparison). Dash-dotted lines are error estimates for a fixed
value of θ, dashed lines are marginal variances under the Gamma prior
on θ^-2. As in Figure 11.5, the bottom row shows the evolution of the
model evidence statistic introduced in Eq. (11.15).]

► 11.4 Connection to Gaussian Quadrature Rules

The detailed treatment above shows that the trapezoidal rule
(both the regular grid design and the estimation rule) can be
derived as a maximum a posteriori estimate arising from a Wiener
process prior. The trapezoidal rule is a simple integration rule,
often too simple for practical use. The most popular class of
generic integration routines are known as Gaussian quadrature
rules.17 They are more advanced than the trapezoidal rule, and
achieve significantly faster convergence on most practical prob-
lems. This section investigates the connection between Gaussian
quadrature and Gaussian process regression, reviewing results
by Karvonen and Sarkka (2017).

17 The naming is due to a separate 1814 paper by Gauss, and unrelated
to the Gaussian process prior used in Bayesian quadrature.
It will turn out that the probabilistic formulation of such
methods is not particularly pleasing, because it yields a vanish­
ing posterior variance on the estimate. These results nevertheless
provide a reference point and intuition. The main reason to be
interested in Gaussian and other classical quadrature rules, even
from a probabilistic perspective, is that they are highly efficient.
Figure 11.8 showcases the performance of the Gauss-Legendre
rule on our running example, where its asymptotic convergence is
very fast.
For this section, we denote the integration domain by Ω ⊂ R.
Let ν be a probability measure on R, and f : Ω → R denotes the
integrand.18 The task is to compute the integral

ν(f) = ∫_Ω f(x) dν(x).

18 Depending on the concrete quadrature rule in question, varying
regularity requirements on Ω, ν, and f may apply. They will be assumed
to hold without further comment.

A quadrature rule Q = (X, w) is a (linear) map of f to an approx-
imation Q(f) of ν(f) of the form

Q(f) := Σ_{i=1}^N w_i · f(x_i),

where X = [x_1, ..., x_N] are the nodes or knots (also sometimes
called sigma-points) of the rule, and w = [w_1, ..., w_N] are the
weights.19
The most popular design criterion for quadrature rules is
to require the rule to be exact for all polynomials up to some
degree.

19 That rule is identified by the weights w satisfying the N monomial
constraints Q(x^i) = ν(x^i) for i = 0, ..., N - 1, which amount to the
Vandermonde system

    [ x_1^0      ...  x_N^0     ] [ w_1 ]   [ ν(x^0)     ]
    [   ⋮              ⋮        ] [  ⋮  ] = [   ⋮        ].
    [ x_1^{N-1}  ...  x_N^{N-1} ] [ w_N ]   [ ν(x^{N-1}) ]

[Figure 11.8: Error evolution for Gauss-Legendre quadrature on the
running example of Eq. (9.1) (horizontal axis: # samples). In comparison
to the Monte Carlo and Trapezoidal estimates introduced in previous
sections, which each exhibit polynomial convergence rates as predicted
by their corresponding error analysis, the Gaussian rule converges much
faster. The curved grey line is a suggestive exponential function.]

Definition 11.3. Let p_N(x) = Σ_{i=0}^N a_i x^i be a polynomial of
degree N. A quadrature rule Q is of degree M if it is exact with respect
to ν for all polynomials p_N of degree N ≤ M, i.e. if

Q(p_N(x)) = ν(p_N(x)),


and inexact for at least one polynomial of degree N = M + 1.

For any set of N pairwise different nodes X ∈ Ω^N, there exists
a quadrature rule19 of degree at least N - 1. But the punchline
of classical quadrature is that, if the nodes are chosen carefully,
it is in fact possible to achieve degree 2N - 1 with those N
integrand evaluations. Such rules are called Gaussian quadrature
rules:20

Theorem 11.4 (Gaussian quadrature). For sufficiently regular ν,
there exists a unique quadrature rule with N nodes of degree 2N - 1.
Its nodes are given by the roots of the Nth ν-orthogonal polynomial21
p_N, and its weights w are all positive.

20 Theorem 11.4 is a simplified form of a more general result. For more
see, e.g., p. 97 of Davis and Rabinowitz (1984). A proof is in
Thms. 1.46-1.48 in §1.4 of Gautschi (2004).

21 A sequence of polynomials {p_0(x), p_1(x), ...} (where p_i is of
degree i) is called orthogonal under the measure ν if

ν(p_i · p_j) = ∫ p_i(x) · p_j(x) dν(x) = c_i δ_ij.

For each sufficiently regular ν, this sequence exists and is uniquely
identified up to the constants c_i (see §1.4 in Gautschi (2004)).
Ω          ν(x)                 {ψ_i}
[-1, 1]    1/2                  Legendre
[-1, 1]    1/√(1 - x²)          Chebyshev
[-1, 1]    (1 - x)^α (1 + x)^β  Jacobi
[0, ∞)     e^{-x}               Laguerre
(-∞, ∞)    e^{-x²}              Hermite

Table 11.1: Popular cases of integration domain Ω, (unnormalised)
measure ν and corresponding set of orthonormal polynomials ψ. The
parameters of the measure for the Jacobi polynomials are two reals
α, β > -1. The special case α = β gives rise to what are termed
Gegenbauer polynomials and Gauss-Gegenbauer quadrature.

A relatively small dictionary of domains, base measures and
corresponding orthogonal polynomials has found wide use in
practice. Some of them are listed in Table 11.1.
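For the Legendre case, nodes and weights are readily available in
standard libraries. A small sketch of our own using NumPy; note that
numpy's leggauss targets the unnormalised Lebesgue measure dx on
[-1, 1], so the weights are halved to match the normalised measure
ν(x) = 1/2 of Table 11.1:

    import numpy as np

    N = 5
    nodes, weights = np.polynomial.legendre.leggauss(N)
    weights = weights / 2            # re-normalise for nu(x) = 1/2 on [-1, 1]

    f = lambda x: np.exp(-x ** 2)    # a smooth test integrand
    estimate = weights @ f(nodes)    # exact for all polynomials of degree <= 2N - 1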

Karvonen and Sarkka (2017) provided the connection between
these foundational rules and Bayesian quadrature. Let [ψ_i]_{i=0,1,...}
be a set of orthonormal polynomials. That is,

ν(ψ_i ψ_j) = ∫ ψ_i(x) ψ_j(x) dν(x) = δ_ij.

Now consider the Gaussian process p(f) = GP(f; 0, k) arising
from the degenerate (i.e. finite-rank) kernel

k_q(x, x') = Σ_{i=0}^{q-1} c_i ψ_i(x) ψ_i(x'),      (11.16)

and assume a set of positive scales [c_i]_{i=0,1,...} ⊂ R_+. We might
want to call this a polynomial kernel, but that name is already
overloaded in the Gaussian process literature.
overloaded in the Gaussian process literature.
Recall from §4.1 that taking the degenerate kernel (11.16)
amounts to the assumption that f can be written as f(x) =
Σ_i ψ_i(x) v_i with weights v_i ∈ R drawn i.i.d. from p(v_i) =
N(v_i; 0, c_i). The following theorem shows that, for any design
X, the Bayesian quadrature algorithm arising from the Gaussian
process prior p(f) = GP(f;0,kq) yields a quadrature rule of
non-trivial degree, and the design choice that is optimal under
the Bayesian perspective coincides with the Gaussian quadra­
ture rule (i.e. the optimal one under the classic interpretation).

Theorem 11.5 (Karvonen and Sarkka, 2017). The Bayesian quadra-
ture rule with the kernel k_q on (Ω, ν) coincides with the classical
quadrature rule Q of degree M - 1 if, and only if, N ≤ q ≤ M. For
such a rule, the posterior variance on the integral vanishes: v = 0.

After setting q := 2N, we obtain the following statement.

Corollary 11.6 (Bayesian Gaussian Quadrature). For each N ∈ N,
there is a unique N-point optimal Bayesian quadrature rule for the
kernel k_{2N} on (Ω, ν), and it coincides with the Gaussian quadrature
rule.
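Corollary 11.6 can be checked numerically. The following sketch is our
own illustration, not part of the referenced results: it assumes
ν(x) = 1/2 on [-1, 1] with the orthonormalised Legendre polynomials
ψ_i = √(2i + 1) P_i, builds the Gram matrix of k_{2N} at the
Gauss-Legendre nodes, and confirms that the Bayesian quadrature weights
equal the Gaussian weights while the posterior variance on the integral
vanishes:

    import numpy as np
    from numpy.polynomial import legendre

    N = 5
    q = 2 * N                                # kernel rank, per Corollary 11.6
    x, w_gauss = legendre.leggauss(N)
    w_gauss = w_gauss / 2                    # weights for nu(x) = 1/2

    # Orthonormal Legendre polynomials under nu: psi_i = sqrt(2i + 1) P_i.
    Phi = legendre.legvander(x, q - 1) * np.sqrt(2 * np.arange(q) + 1)

    c = np.exp(-np.arange(q))                # arbitrary positive scales c_i
    K = (Phi * c) @ Phi.T                    # Gram matrix of kernel (11.16)

    # Kernel mean embedding z_i = nu(k_q(x_i, .)) = c_0, since psi_0 = 1
    # and nu(psi_i) = 0 for all i > 0.
    z = np.full(N, c[0])

    w_bq = np.linalg.solve(K, z)             # Bayesian quadrature weights
    print(np.allclose(w_bq, w_gauss))        # True: bq reproduces the Gaussian rule
    print(np.isclose(z @ w_bq, c[0]))        # True: posterior variance c_0 - z^T K^-1 z = 0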
[Figure 11.9: Probabilistic interpretation of Gauss-Legendre integration
(that is, for ν(x) the Lebesgue measure). Priors (left) and posteriors
(right) consistent with the Gauss-Legendre rules of degree 2q - 1 = 5
(top) and 15 (bottom), respectively. The left plots show two marginal
standard deviations as a shaded area, the first 2q - 1 Legendre
polynomials spanning the kernel (which are exactly integrated by the
associated quadrature rule) in white, and two samples from the prior in
dashed black. The right panels show the posterior after q evaluations at
the nodes of the qth polynomial, again with two standard deviations as a
shaded region, the posterior mean and the integrand in thick black, and
two samples in dashed thin black. Even though the posterior variance on
the integrand f is non-zero, the posterior variance on the integral F
vanishes - the two shown samples have the same integral as the posterior
mean. Changes in the scales c_i of the individual polynomials can change
the means, marginal variances, and samples shown in these figures
drastically, while leaving the resulting integral estimates identical.]
Although this result shows certain Bayesian quadrature rules
to be equivalent to Gaussian quadrature, it also highlights the
differences between the probabilistic and the classic approach.
Three key aspects are particularly worth noting:

• The equivalence class of gp priors whose associated bq rules
coincide with the Gaussian rule is large, because it includes
all choices of c ∈ R_+^{2N} in the prior

p(v) = N(v; 0, diag(c))

for the weights v of f(x) = Σ_{i=0}^{2N-1} v_i ψ_i(x). Figure 11.9 shows


priors for the particular choice c = 1, but changes to these
weights can give rise to wildly different priors in the space
of f. If the variances ci decay rapidly with growing i, the
prior assumes a more regular integrand, while a growing
sequence of ci amounts to the assumption of a “more variant”
(though always smooth) integrand. All these choices, though,
yield the same integral estimate when used in the setting
of Corollary 11.6, i.e. when there are N observations at the
nodes of the Nth polynomial.

• Under the posterior on f arising from the kernel k_{2N}, the vari-
ance is zero at the N nodes X of the Nth polynomial, but is
generally non-zero at x ∉ X (see Figure 11.9). To explain this
situation, note that in a set of ν-orthogonal polynomials with
c_0 > 0, all but the constant one have vanishing expectation
under ν. That is, they satisfy

ν(ψ_i) = 0,  for all i > 0,

because ψ_i(x) = ±ψ_i(x)ψ_0(x)/√c_0. In the setting of Corol-


lary 11.6, those N evaluations exactly identify the value of
the first coefficient v0, but not necessarily those of the other
coefficients. So there is flexibility left in the function values,
but only in ways that do not contribute to the integral. In the
posteriors shown in Figure 11.9, all sampled hypotheses, and
the posterior means, share the same integral.

• A related feature is that the prior of Eq. (11.16) must be
parametric (of finite rank, degenerate) to yield Gaussian
quadrature rules. But of course, the number N of evaluations
is meant to grow as the algorithm runs. From the probabilistic
perspective, this means the prior constantly becomes more
general, more flexible, as N grows. Of course, its dependence
on data, through N, means that this prior is not a real prior
at all, at least not one that an orthodox Bayesian would recognise.
The price to be paid for this oddity is that it is difficult
to associate a meaningful notion of uncertainty with the
posterior in this simple form, because v = 0. It remains an
open question at the time of writing, however, whether an
empirical Bayesian extension is possible.
12
Probabilistic Numerical Lessons
from Integration

Equipped with our survey of both current bq approaches and


their connections to classic quadrature algorithms, we will now
tease out some of the intuitions that will inform the remainder
of this book.

► 12.1 Why Be Probabilistic?

Discovering the sometimes close connections between proba­


bilistic inference and classical quadrature methods suggests
that the probabilistic approach is, at least, no worse than the
latter, widely accepted, one. Explicitly, probabilistic numerical
algorithms can be just as computationally lightweight, just as
performant, and just as reliable, as classical methods - because
the classical methods are implicitly probabilistic.
Having identified these close connections between probabilis­
tic and classical quadrature methods, we will now emphasise
their differences: in particular, the unique offerings of the prob­
abilistic approach.

> 12.1.1 Aiding Model Customisation

Perhaps the strongest argument for the probabilistic approach is


that it exposes the choice of model, as gp mean and covariance,
to the user. We have seen how getting this choice right can
yield methods that are substantively both more efficient and
more trustworthy. If the task is to come up with an efficient
quadrature rule for one specific (class of) univariate integrand,
the most productive path forward is likely to be choosing a
mean function m and covariance k such that samples from the
associated gp match the integrand as closely as possible.1 Of
course, building a strong model entails its own design and
running costs (e.g. in updating the gp with new data). For
a cheap integrand, like the running example in this chapter,
it is likely to be easier to simply use a standard quadrature
library. That said, as we discussed for wsabi in §10.2.1, the
costs of a strong model should be more properly viewed as
an investment of computation that, in many cases, will yield a
computational return. That is, the computational overhead is
more than compensated by savings in computation due to the
increased efficiency of the more accurate model. Such a case
is found in applications where each individual evaluation of
the integrand has high evaluation cost - perhaps because it
involves parsing a large data set, running a costly simulation,
or even performing a physical experiment.2 Here designing a
good prior is likely to be worth the work.

1 Analysis through the lens of the rkhs adds further analytical tools.
For practical purposes, however, samples from the associated Gaussian
process are arguably more explicit, and more interpretable.

2 A recent example is in Frohlich et al. (2021).

► 12.2 Monte Carlo as Probabilistic Inference

We close this chapter by returning to Monte Carlo integration


for another, closer look. Now that we have realised that several
classical quadrature methods can be constructed in a probabilis­
tic fashion, it is natural to ask whether such an interpretation
also exists for Monte Carlo integration. The answer is yes, as we
will see in this section.
We already noted in passing that, while the trapezoidal rule
amounts to the implicit assumption of a continuous integrand,
Monte Carlo integration is applicable to any Borel-integrable
function. This suggests that a probabilistic formulation of Monte
Carlo integration will involve a very vague prior on the inte-
grand. Thus, consider the assumption that the integrand is
"white noise" with an unknown mean:

f(x) ~ GP(m, k(x, x')),

where m ∈ R is an unknown constant, and k(x, x') = θ² I(x =
x') with the indicator function I(a) = 1 if a, 0 elsewhere,3
encoding the assumption that all function values are indepen-
dent of each other, individually varying around m with vari-
ance θ². Following an old idea,4 we place the Gaussian prior
p(m) = N(m; 0, c^-1) on the unknown constant (the precision
c will be taken to the limit c → 0 below). Marginalising over m,
we get the prior

p(f | c) = ∫ p(f | m) p(m | c) dm = GP(f; 0, k(x, x') + c^-1)

for the integrand.

3 Readers who are justly worried that this process is not strictly
well-defined can consider it as the limit case of the Gaussian process
with covariance function

k(x, x') = lim_{λ→0} e^{-(x - x')²/(2λ²)} I_{|x - x'| < 3λ}.

The restriction to a finite support is necessary for the argument below
to work; it has no practically relevant effect on this probability
measure. The argument here can then be performed for finite values of λ,
and the result below obtained as a limit case.

4 Blight and Ott (1975); O'Hagan and Kingman (1978).


Under this prior, after evaluations Y at locations X = [x_1, ..., x_N]
(assumed to be pairwise different), the posterior mean and
covariance functions on f can be re-arranged5 to

E_{|X,Y}(f(x)) = R_x m̄ + k_{xX} k_{XX}^-1 Y = R_x m̄ + Σ_{i=1}^N I(x = x_i) y_i,

cov_{|X,Y}(f(x), f(x')) = k_{xx'} - k_{xX} k_{XX}^-1 k_{Xx'}
                          + R_x (c + 1^T k_{XX}^-1 1)^-1 R_{x'},

with

R_x := 1 - 1^T k_{XX}^-1 k_{Xx} = 1 - Σ_{i=1}^N I(x = x_i), and

m̄ := (c + 1^T k_{XX}^-1 1)^-1 1^T k_{XX}^-1 Y = θ^-2/(c + θ^-2 N) · Σ_{i=1}^N y_i.

5 Rasmussen and Williams (2006), §2.7

The corresponding Gaussian posterior over the integral F =
∫_a^b f(x) dx has mean and variance

E_{|X,Y}(F) = ∫_a^b E_{|X,Y}(f(x)) dx = (b - a) m̄, and

var_{|X,Y}(F) = ∫_a^b ∫_a^b cov_{|X,Y}(f(x), f(x')) dx dx' = (b - a)²/(c + θ^-2 N).

Evidently, for the improper limit of c → 0 ("total ignorance"
about the value of m), we get

E_{|X,Y}(F) = (b - a)/N Σ_{i=1}^N y_i  and  var_{|X,Y}(F) = θ²(b - a)²/N.      (12.1)
The variance θ² of the "white noise" around the mean m could
be inferred using a Gamma prior p(θ) = G(θ^-2; α_0, β_0), analo-
gously to the procedure described in §11.1.2. For large values
of N, and for the improper limit of α_0 → 0, β_0 → 0, the resulting
Gamma posterior on θ^-2 would concentrate on its expected
value6

E_{G(θ^-2; α_N, β_N)}(θ²) ≈ β_N/α_N ≈ 1/N Σ_i f(x_i)²,

6 The mode of the Gamma distribution is at (α - 1)/β, which is often
used as an unbiased alternative. For large N, the two estimates are
asymptotically equal.
so this probabilistic method’s estimate for the square error of the
Monte Carlo estimate matches one that can also be constructed
from Eq. (9.5) by other means. We have just found the following
theorem.

Theorem 12.1. The Monte Carlo estimate is the limit as c → 0 of the
maximum a posteriori estimate under the prior

p(f) = GP(0, θ² I(x = x') + c^-1),      (12.2)

with arbitrary θ ∈ R_+. The corresponding posterior variance esti-
mate on the integral, θ²(b - a)²/N, matches the convergence rate
of the Monte Carlo estimator. Under this prior, any arbitrary (but
non-overlapping) design X = [x_1, ..., x_N] yields the same posterior
variance on F.
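A minimal sketch of Eq. (12.1) in code (our own illustration; the
integrand and the plug-in estimate for θ² are arbitrary choices of this
sketch, the latter following the Gamma-posterior limit above):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, N = 0.0, 1.0, 10_000
    f = lambda x: np.sin(3 * x) ** 2       # any square-integrable integrand

    X = rng.uniform(a, b, N)               # under prior (12.2), any
    Y = f(X)                               # non-repeating design is equivalent

    mean = (b - a) * Y.mean()              # posterior mean, Eq. (12.1)
    theta2 = np.mean(Y ** 2)               # plug-in scale estimate
    variance = theta2 * (b - a) ** 2 / N   # posterior variance, Eq. (12.1)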

A defender of Monte Carlo might argue that its most truly desir­
able characteristic is the fact that its convergence (see Lemma 9.2)
does not depend on the dimension of the problem. Performing
well even in high dimension is a laudable goal. However, the
statement “if you want your convergence rate to be independent
of problem dimension, do your integration with Monte Carlo”
is much like the statement “If you want your nail-hammering
to be independent of wall hardness, do your hammering with
a banana.” We should be sceptical of claims that an approach
performs equally well regardless of problem difficulty. An ex­
planation could be that the measure of difficulty is incorrect:
perhaps dimensionality is not an accurate means of assessing
the challenge of an integral. However, we contend that another
possibility is more likely: rather than being equally good for any
number of dimensions, Monte Carlo is perhaps better thought
of as being equally bad.
Recall from §10.1.2 that the curse of dimensionality results
from the increased importance of the model relative to the evalu­
ations. Theorem 12.1 makes it clear that Monte Carlo’s property
of dimensionality-independence is achieved by assuming the
weakest possible model. With these minimalist modelling as­
sumptions, very little information is gleaned from any given
evaluation, requiring Monte Carlo to take a staggering number
of evaluations to give good estimates of an integral. As a con­
trast, Bayesian quadrature opens the door to stronger models
for integrands. The strength of a model - its inductive bias - can
indeed be a deficiency if it is ill-matched to a particular inte­
grand. However, if the model is well-chosen, it offers great gains
in performance. The challenge of high dimension is in finding
models suitable for the associated problems. Thus far, Proba­
bilistic Numerics has shone light on this problem of choosing
models, and has presented some tools to aid solving it. It is now
up to all of us to do the rest. Of course, we must acknowledge that
contemporary quadrature methods (both probabilistic and clas­
sical) do not work well in high-dimensional problems: indeed,
they perform far worse than Monte Carlo. However, arguments
like those in this chapter show that there is a lot of potential
for far better integration algorithms. Such methods can work

only in moving away from Monte Carlo, and its fundamentally


limited convergence rate.

For further evidence to this point, we note that even the most
general model underlying Monte Carlo integration can actually
converge faster if the nodes are not placed at random. Equa­
tion (12.1) is independent of the node placement X. So if it is
used for guidance of the grid design as in §11.1.3, then any
arbitrary node placement yields the same error estimate (as
long as no evaluation location is exactly repeated). Since the
covariance k assumes that function values are entirely unrelated
to each other, a function value at one location carries no infor­
mation about its neighbourhood, so there is no reason to keep
the function values separate from each other.
The tempting conclusion one may draw from Theorem 12.1
is that, because, under this rule, any design rule is equally
good, one should just use a random set of evaluation nodes. This
argument is correct if the true integrand is indeed a sample
from the extremely irregular prior of Eq. (12.2). But imagine for
a moment that, against our prior assumptions, the integrand f
happens to be continuous after all. Now consider the choice of
a regular grid,

X = [a, a + h, a + 2h, ..., b - h]  with  h := (b - a)/N.

Then, the mean estimate from Eq. (12.1) is the Riemann sum

E_{|X,Y}(F) = h Σ_{i=1}^N f(x_i).

For functions that are even Lipschitz continuous, this sum con-
verges to the true integral F at a linear rate,7 O(N^-1). That is,
the poor performance of Monte Carlo is due not just to its weak
model, but its use of random numbers. This insight into the
advantage of regular over random node placement is at the
heart of quasi Monte Carlo methods.8 As we have seen above,
however, it is possible to attain significantly faster convergence
rates by combining non-random evaluation placements with
explicit assumptions about the integrand.

7 Davis and Rabinowitz (1984), §2.1.6

8 E.g. Lemieux (2009)
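The effect is easy to reproduce. A short experiment (our own sketch,
with an arbitrary Lipschitz integrand) applies the same unit-weight
estimator of Eq. (12.1) to a regular grid and to random nodes:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(3 * x) ** 2    # a Lipschitz integrand on [0, 1]
    F = 0.5 - np.sin(6.0) / 12          # its true integral

    for N in (10, 100, 1000, 10_000):
        h = 1.0 / N
        grid = np.arange(N) * h         # X = [a, a + h, ..., b - h]
        rand = rng.uniform(0.0, 1.0, N)
        print(N,
              abs(h * f(grid).sum() - F),   # Riemann sum: error ~ O(N^-1)
              abs(h * f(rand).sum() - F))   # Monte Carlo: error ~ O(N^-1/2)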

Thus, from a probabilistic perspective, there is an argument


for the integration rule (12.1) if there is no known regularity
in the integrand (note that this is virtually never the case in
practice, and other integration rules, using more informative
priors, should then be preferred). But, even though this prior
gives no reason itself to prefer any design X over another one,

it is still a good idea to use a regular evaluation rule, on the


off chance that the true integrand is continuous after all. From
these observations, we draw the contentious conclusion that it
is generally a bad idea to introduce random numbers into an
otherwise deterministic computation.

► 12.3 A Meditation on Randomness

We offer a few more, provocative, thoughts on the utility of


random numbers in the computation of deterministic quantities,
if only to counter the frequent reflex to equate probabilistic
methods with stochastic ones. We begin with generic arguments
against the use of a (pseudo-) random number generator in
computation. A prng:

1. does not have a coherent decision-theoretic motivation;

2. is worse at its goal of efficient exploration than a probabilistic


numerical approach;

3. can be expected to be more computationally expensive in


solving a numerical task; and

4. muddles subjectivity and bias.

A decision-theoretic view of randomness What would be the prob­


abilistic numerical view of a prng? In such an interpretation,
the output must be the minimiser of an expected loss func­
tion. However, a prng is built with the intent of expressing no
preference over its outputs. If we assume that the prng is the
right approach, and try to determine the associated expected
loss, the prng’s assertion that any possible output is equally
good implies that the expected loss surface is uniformly flat.
Put another way, this is the assertion that the calculation de­
manding the prng is completely insensitive to its output. This
is a strong assumption. An example is found in the extreme un­
derpinnings (see Theorem 12.1) of a Monte Carlo estimate of an
integral. In most real applications, there are symmetry-breaking
forces at play that render some outputs (at least slightly) better
than others. The best such output should always be picked. It
is more common that we have a given expected loss function,
with unique minimum, over a real interval. If a prng is used to
select an action in this setting, the action will be sub-optimal,
with probability one. If you accept that a random number is a
decision, it is necessarily a poor one.

Exploration But what is the right loss function for the task
addressed by a prng? It is hard to defend a single, one-off,
choice being made by a prng: that is, to defend the expected loss
for such a choice being uniformly flat. A prng is perhaps more
productively considered as a heuristic for making a sequence of
decisions. The goal of this sequence (or design), X = {x1, ...}, is
to achieve exploration, which we will roughly define as providing
information about the salient characteristics of some function
f (x). As a motivating example, consider f (x) as the integrand
of a quadrature problem. A prng provides exploration but,
remarkably, requires neither knowledge nor evaluations of f,
nor more than minimal storage of previous choices x. These
self-imposed constraints are extreme. First, in many settings,
including in the quadrature case considered in this chapter,
we have strong priors for f . Second, many problems (again,
as in quadrature), render evaluations f (x) pertinent to future
choice of x: for instance, a range of x values for which f (x)
is observed to be flat is unlikely to require dense sampling.
Third, as computational hardware has improved, memory has
become increasingly cheap. Is it still reasonable to labour under
computational constraints conceived in the 1940s?
The extremity of the prng approach is further revealed by
broader consideration of the problem it aims to solve. Explo­
ration is arguably necessary for intelligence. For instance, all
branches of human creative work involve some degree of explo­
ration. Human exploration, at its best, entails theorising, probing
and mapping. This fundamental part of our intelligence is ad­
dressed by a consequentially broad and deep toolkit. Random
and pseudo-random algorithms, in contrast, are painfully dumb,
and are so by design.
To better achieve exploration, the Probabilistic Numerics ap­
proach is to explicitly construct a model of what you aim to
explore - f (x). This model will serve as a guide to optimally ex­
plorative points, avoiding the potential redundancy of randomly
sampled points.
Figure 12.1, in contrast to Figure 9.3, is a cartoon indictment
of the over-simplicity of a randomised approach.

> 12.3.1 Computation Investment

A defender of a prng might argue that, in order to define an


expected loss, a model and loss function must be defined and
maintained. The model, and working with an expected loss, will,
of course, require computation and memory: surely this will
mean that the probabilistic numerical approach will be unable
to provide a practical replacement for the prng.9 This defender of
the prng might protest that it is most truly decision-theoretic to
choose the random algorithm - that which is actually practical.
The argument here depends on an assumption: that a prng
will lead to lower memory and computation costs than those of
an explicitly probabilistic numerical algorithm. This assumption
is false. As above, we would like to reframe the computation
used by a probabilistic numerical algorithm not as a burden, but
as an investment. That is, the computational expense incurred
by a numerical algorithm must be judged at the level of the al-
gorithm as a whole, not at the level of a single iteration. While a
probabilistic numerical integrator may spend more computation
than a random alternative in selecting a single node, this invest-
ment of computation may allow it to converge more quickly
to a good estimate of the integral. As such, the probabilistic
numerical algorithm, even if its iterations are more computa-
tionally expensive, is designed to reduce overall computation
consumption. In this spirit, we have seen practical examples of
the computational reductions possible with Probabilistic Nu-
merics in adaptive Bayesian quadrature (§10.2).

[Figure 12.1: As argued in Figure 9.3, a numerical algorithm, and its
model of the problem, must be simpler and smaller than the problem
itself. However, a random numerical algorithm takes this to an undue
extreme: its solution is too simple to be effective.]

9 On a pedantic note, prngs themselves have their own - low, but
non-zero - overheads. The popular Mersenne twister (Matsumoto and
Nishimura, 1998) prng requires 2.5 kilobytes of state and a
moderate-length sequence of computational operations to generate its
output.

Subjectivity and bias Adversarial settings are often used to moti­


vate the use of prngs. More concretely, if your adversary knows
which specific piece of computer code (with termination rules
and internal tolerances) you intend to use as a quadrature rou­
tine, that adversary can present a specific Riemann-integrable
integrand that will make your code return the wrong answer.
In contrast, the argument in §9.2 showed that the Monte Carlo
estimator, placing its evaluation nodes randomly, does not suffer
from the same problem: its estimator is guaranteed to converge.

It is unbiased on every Lebesgue-integrable function, and con­


sistent on every square-integrable function.
Let us examine this statement more closely. At its core is the
primary property of random numbers: they are unpredictable.
This property is likely to seem intuitively clear to most readers,
but is quite difficult to define precisely. Formal definitions of
this property are challenging because they inevitably require a
definition of all things predictable by a computer, a notion closely
connected to computational complexity itself.10 Thankfully, we
do not actually need to understand precisely what is and is
not predictable. The point of the Monte Carlo argument is that,
if we have access to a stream of unpredictable numbers and
use it to build an integration method, then no one can design
an integrand that will foil our algorithm, simply because that
adversary cannot predict where our method will evaluate the
integrand.

10 Church (1940); Kolmogorov (1968); Loveland (1966).
The possible existence of adversarial problems motivates the
construction of unbiased Monte Carlo estimators. Unfortunately,
'bias' is an overloaded and contested term. In the context of
Monte Carlo, 'unbiased' simply means that the expected value
of a random number is equal to the quantity to be estimated
(Lemma 9.2). But this purely technical property draws some
emotional power from the (completely unrelated!) association,
in common language, of 'bias' with unfairness.11 The technical,
statistical, definition of bias, that used in defining unbiased esti-
mators, is one term within a particular decomposition of error
in predicting a data set. As argued by Jaynes,12 this term has no
fundamental significance to inference. Our goal should simply
be to reduce error in the whole. ('Inductive bias',13 meaning
the set of assumptions represented within a machine learning
algorithm, represents yet another distinct use of the term. In
this sense, Probabilistic Numerics unashamedly encourages bias,
through the incorporation of useful priors.)

11 Pasquale (2015); O'Neil (2016).

12 Jaynes and Bretthorst (2003), §17.2

13 Mitchell (1980)
It is important to keep in mind that the users of numerical
methods are not adversaries of the methods’ designers. In fact,
the relationship is exactly the opposite of adversarial: users often
change their problems to be better-suited to numerical methods
(as, in deep learning, network architectures are chosen to suit
optimisers). As a result, the majority of integrands, and optimi­
sation objective functions, are quite regular functions. Moreover,
this regularity is well-characterised and knowable by the nu­
merical algorithm. There may be good use cases for random
numbers in areas like cryptography, where a lack of informa­
tion, unpredictability, is very much the point. But numerical
116 II Integration

tasks are communicated to a numerical algorithm through the


error-free, formally complete, channel of source code. In this
sense, a numerical algorithm knows as much about its task as is
possible. Acknowledging known regularity in a numerical task
can lead to substantive performance improvements.
Formal statements of the properties of stochastic algorithms
often depend on the use of “real” random numbers. The argu­
ment for random numbers in computation finally comes apart
when one considers that virtually every instance of Monte Carlo
methods does not actually use “real” random numbers, but only
pseudo-random numbers. So the argument rests entirely on the
user not knowing the random number generator used in the
computation (and its seed). If anything, this clearly shows that
computation is fundamentally conditional on prior knowledge.
We now appeal to aesthetic arguments. It seems odd that the
good behaviour of, for example, your optimisation algorithm
should require the hiding of information. Equally, it seems
unintuitive to artificially inject additional uncertainty into a
calculation that is, ultimately, designed to increase certainty.
Here are some sequences of numbers. Can you tell which
one is random?

1. 6224441111111114444443333333

2. 169399375105820974944592307816

3. 712904263472610590208336044895

4. 100011111101111111100101000001

5. 01110000011100100110111101100011

Here is the solution.

1. This sequence was generated by throwing a six-sided die


seven times and copying down the numbers in sequence of
their occurrence, repeating the result of the ith throw i times.
This sequence is “random” because it is unpredictable, but
it does not pass standard tests of randomness, and a Monte
Carlo integration rule using this sequence does not converge
in O(N-1/2).

2. These are the 41st to 70th digits of the irrational number π.


This sequence, if continued, is devoid of any structure. It is
perfectly good for use in a Monte Carlo method. Unless, of
course, you tell your reviewer where you got it from. Arising
from one of the most widely studied numbers of mathematics,
it is also anything but “random”.

3. This sequence was generated by the von Neumann method,14
a pseudo-random number generator, using the seed 908344. It
is the kind of sequence used in real Monte Carlo algorithms,
and - now that you know the seed - entirely deterministic.
It would have been ok to use this sequence for Monte Carlo
estimation up until three sentences ago, when we ripped
down the veil of randomness.

14 Von Neumann (1951)

4. These are digits taken from a CD-ROM published by George


Marsaglia containing random numbers generated using a
variety of physical random number generators. These digits
(perhaps) were once “random”. But, now that you know
where we got them, they are obviously deterministic. But
wait! Marsaglia’s CD is actually not available online anymore.
Most original copies have long been lost as retiring academics
cleared out their offices. We will admit that we do not own a
physical copy either. Does this revelation make those digits
above random again?

5. To generate this sequence we began with a certain, secret


string represented in ASCII, then dropped a coin 32 times
to decide whether to flip a bit. We always started with the
coin facing heads up. If it landed tails up, we flipped the
bit. Here is the catch: we will not tell the height from which
the coin was dropped. If we told you the drop was only five
millimetres, not enough for the coin to actually turn around,
then the sequence is not random. If we told you the height
was two meters, then you might conclude the sequence is
random. There are an awful lot of “we’s” and “you’s” in this
argument. Evidently, randomness is very much a subjective
property.
Our arguments here are not purely philosophical. A number
of studies15 have highlighted that the choice of random seed
is empirically significant to the performance of popular rein-
forcement learning algorithms. In scientific fields, discovering
that an unexpected variable affects outcomes should lead to
work to uncover the mechanism. When the variable is a random
seed, however, this approach is impossible: the sole purpose of
a pseudo-random generator is to render the outcome an unpre-
dictable function of the seed. If we are to be truly scientific, there
seems no solution other than to abandon the pseudo-random
generator.

15 Henderson et al. (2018); Islam et al. (2017); Colas, Sigaud, and
Oudeyer (2018); Mania, Guy, and Recht (2018).
Distinct from the seed’s role within the exploration routine,
the seed may also dictate the progression of a simulation of
an environment. Therein, it may be a way of acknowledging

environmental randomness. Environmental randomness may


similarly limit our ability to improve an agent, but is due to
forces outside our control. Even here, randomness is likely to be
more epistemic than aleatoric, and science is possible. Indeed,
simulations could be chosen to be informative for the training
of the agent, rather than naively randomly generated.16

16 Paul, Osborne, and Whiteson (2019)
Islam et al. (2017) notes that some reinforcement learning
algorithms search for the best random seed, and various au-
thors17 advocate instead for averaging over a large number of
possible seeds. Averaging, in effect, treats the seed as a ran-
dom variable, an uncontrollable and unknown property of the
problem that must be marginalised. However, the seed is not
a feature of the problem at all. Importantly, a random seed's
existence is owed to the prng driving the agents' exploration.
In this respect, a seed is a design choice. We typically make such
choices to improve the performance of our agents. When we
choose to use a pseudo-random generator, we artificially bound
our understanding of how to do so. Put another way, the use of
a pseudo-random generator introduces a variable, the seed, that
can never be improved. We have rendered our performance, in
this respect, unimprovable.

17 Mania, Guy, and Recht (2018); Colas, Sigaud, and Oudeyer (2018).
13
Summary of Part II
and Further Reading

This chapter - using a concrete univariate integral as a guide­


post - built intuition about the connection between inference
and computation. This led to some conceptual insights:

Classical methods can be re-framed as probabilistic. Certain elemen­


tary numerical methods can be derived precisely as maxi­
mum a posteriori estimates under equally elementary proba­
bilistic models. More specifically, we saw that the trapezoidal
quadrature rule arises as the MAP estimate under a Wiener
process prior on the integrand.

Design policies arise naturally. The placement of evaluation nodes


on a regular grid arose as the choice that minimises posterior
variance over the integral.

Prior knowledge begets bespoke methods. We encountered the gen­


eral recipe for the construction of a probabilistic numerical
method. First, define a joint generative model (a prior and a
likelihood) for the latent, intractable, quantity and the “ob­
servable” (computable) quantities. Next, construct an action
rule through specifying a loss function, determining which
computations to perform. For example, in integration, by
aiming to reduce the variance of the posterior. By encoding
knowledge in a prior measure and goals in a loss function,
one can construct customised numerical methods, tailored to
a specific task.

Meaningful error measures require good priors. The choice of prior


determines the scale of the posterior, and hence influences

how representative the posterior variance is as an error mea­


sure. Hierarchical inference allows the model class to be
adapted at runtime to calibrate the posterior uncertainty. This
calibration is epistemic - it relies on the assumption that the
“unobserved” parts of the problem (here, the integrand) are
subjectively similar to those already observed.

Probabilistic Numerics can be fast. Assigning a posterior proba­


bility distribution to a numerical task need not be signifi­
cantly more expensive than “classic” point estimation. Even
uncertainty calibration may be achieved at almost the same
cost as the classical point estimate. Moreover, we have seen
that some probabilistic numerical algorithms, like wsabi, are
legitimately faster than alternatives. This leads us to frame
the computation spent on probabilistic modelling as an in­
vestment, rather than overhead.

Imposed randomness can be harmful. While Monte Carlo estima­


tors can be elegant from an analytic perspective, computa­
tionally, it can be counter-productive to artificially introduce
randomness into an otherwise deterministic problem. Monte
Carlo estimators can at best converge with stochastic rate,
even on extremely regular, simple, estimation problems, like
the one used in this chapter.

These observations mirror aspects of inference well known in


statistics. In many ways, numerical computation and statistics
share the same challenges: which model classes lead to good
estimation performance, at acceptable computational cost? To
which degree can the error, the deviation of the estimator from
the true value, be estimated? But there is also a crucial difference
between numerical tasks and statistical inference from physical
data sources. In stark contrast to “real-world” statistics, e.g. in
the social sciences or medicine, a numerical task is defined not
through a vague concept, but in a fully descriptive formal lan­
guage - the programming language encoding the task. It is thus
possible, at least in principle, for the computer itself to check
the validity of prior assumptions, and thus to choose among
available numerical methods with differing prior measures. This
prospect is the subject of contemporary research.

Software

At the time of writing, the open-source emukit library1 provides
a package for Bayesian quadrature with a number of different
acquisition functions, including for WSABI-L. Emukit is an outer-
loop package in the sense that it provides the active learning
loop, but the surrogate model has to be coded and wrapped
by the user into an emukit interface. In contrast, the ProbNum
library,2 which focuses on the lower numerical level, additionally
provides surrogate model functionality and contains a Bayesian
quadrature component under active development.

1 Paleyes et al. (2019), available at emukit.github.io

2 Code at probnum.org. See the corresponding publication by Wenger et
al. (2021).
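As a rough sketch of how the ProbNum component is invoked (the entry
point and signature below are assumptions about the library's interface
at the time of writing and may differ between versions; probnum.org
documents the current API):

    import numpy as np
    from probnum.quad import bayesquad      # assumed entry point; check the docs

    def fun(x):                             # vectorised integrand, (n, 1) -> (n,)
        return np.exp(-x[:, 0] ** 2)

    F, info = bayesquad(fun=fun, input_dim=1, domain=(-1.0, 1.0))
    print(F.mean, F.std)                    # Gaussian posterior over the integral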

Further Reading

The main purpose of this chapter was to introduce core notions


of probabilistic numerical computation in the simple setting of
univariate integration. However, numerical integration is a
field in its own right, and remains the subject of intense study
to the present day. Readers interested in deeper insights may
find the following notes helpful.

Quasi-Monte Carlo

Quasi-Monte Carlo (qmc) is the name used to describe integra­


tion rules that are based on unit weights (i.e. of the algebraic
form of the Monte Carlo estimator of Eq. (9.4)), but which use
carefully crafted designs of low discrepancy. The error and con­
vergence analysis of qmc methods is related to that of Bayesian
quadrature, and has had some success in achieving appealing
convergence rates even in the high-dimensional setting. More
on qmc and its analysis using rkhs hypothesis spaces can be
found in a review by Dick et al.3

3 Dick, Kuo, and Sloan (2013)

Closely Related Concepts

Related to Bayesian quadrature are topics including “kernel


herding”4 and kernel quadrature, which themselves are related
to other numerical approximation methods.5 A subtle difference
between Bayesian quadrature and kernel quadrature is that the
latter often allows for the kernel to change as a function of
the observation nodes (in particular, to shrink as their number
grows). Doing so poses conceptual problems in the Bayesian
framework: it is not compatible with the philosophical notion
of a prior; but more practically, this adaptation of the kernel
amounts to a continuous relaxation of the associated probabilis-
tic error estimate (which does not play such a central role in the
kernel view).

4 Chen, Welling, and Smola (2010); Huszar and Duvenaud (2012).

5 Bach, Lacoste-Julien, and Obozinski (2012); Bach (2017).

Convergence Analysis

The theoretical treatment of convergence of Bayesian quadrature


has reached a sophistication beyond what can be presented in
this chapter. See for example, recent works by Briol et al. (2019),
and Kanagawa, Sriperumbudur, and Fukumizu (2020). In par­
ticular, both kernel quadrature and Bayesian quadrature using
translation-invariant kernels can achieve the optimal worst-case
rate for integrands constrained to lie in a Sobolev space, if cer­
tain requirements on the distribution of the design points are
fulfilled. As already mentioned above, even the more theoreti­
cally challenging case of adaptive Bayesian quadrature has now
been furnished6 with results establishing asymptotic consis-
tency (under some loose conditions).

6 Kanagawa and Hennig (2019)
Chapter III
Linear Algebra
14
Key Points

In this chapter we will consider computations involving matrices.


Here is a preview of the central results:

Algorithmic structure will play a prominent role. Since linear


algebra methods are part of the bedrock of computation on
which other methods are built, computational efficiency is
particularly important. Smart book-keeping is a large part of
good methods. Considerations on computational complexity
will significantly constrain the choice of probabilistic models,
to a class of Gaussian distributions with specific factorisation
structure.

Do not be confused by the term linear algebra. Matrix inver­


sion is in fact a nonlinear operation. This will force us to
make a conscious decision when designing the probabilistic
model underlying an algorithm. The model can either fit
neatly with the likelihood and thus allow for more compli­
cated observation (computation) paradigms, or fit cleanly
with the latent quantity (the solution of a linear problem),
and thus provide more explicit notions of posterior uncer­
tainty. This is in contrast to integration, where the latent
quantity (the integral) is linearly related to the observables
(the integrand), so that both can be captured jointly in a
Gaussian model.

• Certain properties of a matrix (for example positive definite-


ness), even when known a priori, can be difficult to capture in
a prior that still allows for computationally efficient inference.
We may decide to not encode certain kinds of knowledge in
favour of computational complexity. While this means that
such knowledge can then not be used in the action rule of the

algorithm, one may still require that resulting point estimates,


and even error estimates, are consistent with it.

Putting these abstract points together, we will arrive at a


concrete class of prior distributions that, when combined
with certain “natural” algorithmic choices, give rise to clas­
sic iterative solvers; in particular to the seminal method of
conjugate gradients. In contrast to the integration chapter,
however, we will find that calibrating uncertainty is a much
more intricate and challenging issue in linear algebra. The
reason for this, put simply, is that matrices are big objects:
their quadratic size means that linearly many matrix-vector
projections as observations identify only a small part of the
matrix.

Each chapter has a wider conceptual point that transcends


the concrete numerical setting. In the case of this chapter, it
is the observation that the design of a good computational
prior is a trade-off between constraints set by knowledge
and those set by computational considerations. Even though
one may patently know some things to be true, it may be
counter-productive to include this information in the prior
when doing so makes the computation considerably more
complex.
15
Required Background

In addition to the concepts of Chapter I, this chapter requires


some basic linear algebra concepts, briefly reviewed in the fol­
lowing.

► 15.1 Vectorised Matrices and the Kronecker Product

This chapter revolves around inferring matrix-valued objects.
For this purpose, a matrix A ∈ R^{N×M} will frequently be treated
as a collection of NM real numbers arranged into a rectangular
table, rather than an object associated with an algebraic struc-
ture. For a real matrix A ∈ R^{N×M}, the symbol Ā ∈ R^{NM} will
then denote the vector arising from stacking the elements of A
row after row ("row-major" indexing1). It would be possible
to index the elements of this vector with a single number run-
ning from 1 to NM. However, to avoid confusing translations
between index sets, the elements of this vector will be indexed
by the same index set (ij) ⊂ N × N as the matrix itself.
On some rare occasions it will be necessary to use the inverse
operation of vectorisation - to re-shape a vectorised matrix back
into rectangular form. This will be denoted by the ⇑ symbol:

A = ⇑Ā.      (15.1)

1 This is not the same as stacking the matrix column-by-column
("column-major"). The latter is used in Fortran and thus also by
Matlab's A(:) operator. Numpy's A.ravel() allows the user to specify
the order, but defaults to row-major indexing, which is also a standard
used in C. Under column-major indexing, Eq. (15.3) changes to
(A ⊗ B) C̄ = (B C A^T)‾, which would cause all sorts of havoc down
the line.
A = ft A. (15.1) down the line.
The Kronecker product A ⊗ B of two matrices A ∈ R^{N_A×M_A},
B ∈ R^{N_B×M_B} is a matrix of size N_A N_B × M_A M_B with elements

[A ⊗ B]_{ij,kℓ} = [A]_{ik} [B]_{jℓ}.      (15.2)

This matrix maps from R^{M_A·M_B}, the space of vectorised real
M_A × M_B matrices, to R^{N_A·N_B}, the space of vectorised N_A × N_B
matrices. For C̄ ∈ R^{M_A·M_B}, and C = ⇑C̄, we have

[(A ⊗ B) C̄]_{ij} = Σ_{kℓ} [A]_{ik} [C]_{kℓ} [B]_{jℓ} = [A C B^T]_{ij}.      (15.3)
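Eq. (15.3) is easily verified numerically; note that numpy's default
ravel() is row-major, matching the convention above (a small sketch of
our own):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 4))          # N_A x M_A
    B = rng.normal(size=(2, 5))          # N_B x M_B
    C = rng.normal(size=(4, 5))          # M_A x M_B

    lhs = np.kron(A, B) @ C.ravel()      # (A kron B) applied to vectorised C
    rhs = (A @ C @ B.T).ravel()          # row-major vectorisation of A C B^T
    print(np.allclose(lhs, rhs))         # True, Eq. (15.3)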
In this sense the Kronecker product "translates" between matrix
multiplication and vectorisation. The Kronecker product has a
number of analytically useful properties,2 which mostly arise
from the relation

(A ⊗ B)(C ⊗ D) = AC ⊗ BD.      (15.4)

Among them are

(A ⊗ B)^-1 = A^-1 ⊗ B^-1,
|A ⊗ B| = |A|^{rk(B)} · |B|^{rk(A)},      (15.5)
(A ⊗ B) = (V_A ⊗ V_B)(D_A ⊗ D_B)(V_A ⊗ V_B)^-1,      (15.6)
(A ⊗ B)^T = A^T ⊗ B^T,      (15.7)

where |A| is the matrix determinant of A, rk(A) is the rank
of A, and A = V_A D_A V_A^-1, B = V_B D_B V_B^-1 are the eigenvalue
decompositions of A and B, respectively. These properties only
hold assuming that these decompositions and inverses exist.

2 van Loan (2000)

Exercise 15.1 (easy but instructive, solution on p. 362). Prove
Eqs. (15.4)-(15.7). Show that Eq. (15.6) is indeed the eigen-
decomposition of A ⊗ B.

► 15.2 Positive Definite Matrices

A real matrix A ∈ R^{N×N} will be called spd if it is symmetric
(A = A^T) and positive (semi-) definite. It is positive (semi-) defi-
nite if v^T A v > 0 (if v^T A v ≥ 0) for all non-zero vectors v ∈ R^N.
Like all symmetric real matrices, spd matrices have orthogo-
nal eigenvectors. Strictly positive definite matrices have strictly
positive real eigenvalues, and are thus invertible.

► 15.3 Frobenius Matrix Norm

There are many norms3 on the space of real matrices A ∈
R^{N×M}, some with certain analytical advantages over others.
The Frobenius norm is defined by

||A||²_F := tr(A^T A) = Σ_{i=1}^N Σ_{j=1}^M A²_{ij} = Ā^T Ā = ||Ā||²_2,

where || · ||_2 is the standard Euclidean (ℓ2) vector-norm. Of
particular importance will be the generalised, weighted Frobenius
norm, using two symmetric positive definite matrices V, W:

||A||²_{F,V,W} := tr(A^T V^-1 A W^-1) = Ā^T (V ⊗ W)^-1 Ā.      (15.8)

3 A matrix norm ||A|| ∈ R has the properties that ||A|| ≥ 0;
||A|| = 0 iff A = 0; ||αA|| = |α| ||A|| for all α ∈ R; and
||A + B|| ≤ ||A|| + ||B||.

Exercise 15.2 (easy, solution on p. 362). Using Eq. (15.6), show
that the Kronecker product of two symmetric positive definite
matrices is itself spd, then use this result to show that Eq. (15.8)
does indeed define a matrix norm.
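The identity in Eq. (15.8) can be confirmed directly (a small
self-contained check of our own, with arbitrary spd weight matrices):

    import numpy as np

    rng = np.random.default_rng(1)
    N, M = 3, 4
    A = rng.normal(size=(N, M))

    def random_spd(n):
        L = rng.normal(size=(n, n))
        return L @ L.T + n * np.eye(n)   # symmetric positive definite

    V, W = random_spd(N), random_spd(M)

    lhs = np.trace(A.T @ np.linalg.inv(V) @ A @ np.linalg.inv(W))
    rhs = A.ravel() @ np.linalg.inv(np.kron(V, W)) @ A.ravel()
    print(np.allclose(lhs, rhs))         # True, Eq. (15.8)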

► 15.4 Singular Value Decomposition

Every real matrix4 B ∈ R^{N×M} has a singular value decomposition
(SVD)

B = Q S U^T

with orthonormal5 matrices Q ∈ R^{N×N}, U ∈ R^{M×M}, whose
columns are called the left- and right-singular vectors, respec-
tively, and a rectangular diagonal matrix6 S ∈ R^{N×M} which
contains non-negative real numbers called singular values of B on
the diagonal. Assume, w.l.o.g., that N ≥ M and the diagonal el-
ements of S are sorted in descending order, and S_rr with r ≤ M
is the last non-zero singular value. Then Q can be decomposed
into its first r columns, Q_+, and the (potentially empty) N - r
columns, Q_-, as Q = [Q_+, Q_-], and similarly U = [U_+, U_-] for
the columns of U. The SVD is a powerful tool of matrix analysis:

4 There is an analogous formulation for complex matrices, but the
chapter only involves real matrices.

5 That is, Q^T Q = I_N and U^T U = I_M.

6 That is, S_ij = 0 if i ≠ j.

• r equals the rank of B: rk(B) = r;

• the columns of U_- span the null space (the kernel) of B (those
of U_+ the pre-image of B);

• the columns of Q_+ span the range (the image) of B (those of
Q_- the co-kernel of B);

• the matrix B̂ = argmin_{rk(B')=k} ||B - B'||_F ("the best rank-k
approximation to B in Frobenius norm") is given by B̂ =
Q Ŝ U^T, where Ŝ equals S except that Ŝ_ii = 0 for i > k;

• the matrix V = argmin_{V^T V=I} ||B - V||_F ("the orthonormal
matrix closest to B in Frobenius norm") is V = Q U^T.
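These properties map directly onto numpy.linalg.svd (an illustrative
sketch of our own; the tolerance used to count "non-zero" singular
values is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    N, M = 6, 4
    B = rng.normal(size=(N, 2)) @ rng.normal(size=(2, M))   # a rank-2 matrix

    Q, s, Ut = np.linalg.svd(B)          # B = Q @ S @ U^T, singular values in s
    r = int(np.sum(s > 1e-10))           # numerical rank: 2

    k = 1                                # best rank-k approximation in
    Bk = (Q[:, :k] * s[:k]) @ Ut[:k, :]  # Frobenius norm: keep first k values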

► 15.5 Matrix Identities

For matrices A, U, C, V, if all the inverses in the following equa-
tion exist, then the matrix inversion lemma7 holds:

(A + UCV)^-1 = A^-1 - A^-1 U (C^-1 + V A^-1 U)^-1 V A^-1.      (15.9)

It is helpful if U is a "skinny" matrix (i.e. U ∈ R^{N×M}, M ≪ N),
and the inverse of A is known, because then the inverse of the
larger N × N left-hand side of Eq. (15.9) can be found from the
smaller M × M inverse on the right-hand side.

7 The matrix inversion lemma has been ascribed to many different
authors, including M. Woodbury, J. Sherman and W. Morrison, M. Bartlett,
and I. Schur. It is likely that the result is even older than any of the
corresponding works. Some historical background, along with interesting
additional numerical analysis, can be found in a review by Hager (1989).
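A direct numerical check of Eq. (15.9), with a diagonal A whose inverse
is cheap (a sketch of our own):

    import numpy as np

    rng = np.random.default_rng(3)
    N, M = 6, 2
    d = rng.uniform(1.0, 2.0, N)
    A, A_inv = np.diag(d), np.diag(1.0 / d)   # A with a trivially known inverse
    U = rng.normal(size=(N, M))               # "skinny": M << N
    C = np.eye(M)
    V = U.T

    lhs = np.linalg.inv(A + U @ C @ V)                   # N x N inverse
    rhs = A_inv - A_inv @ U @ np.linalg.inv(
        np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv    # only an M x M inverse
    print(np.allclose(lhs, rhs))                         # True, Eq. (15.9)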

The inverse of the block matrix

A = [ P  Q ]
    [ R  S ],

if it exists, is

A^-1 = [ P̃  Q̃ ]      (15.10)
       [ R̃  M ]

with

M := (S - R P^-1 Q)^-1,        P̃ := P^-1 + P^-1 Q M R P^-1,
Q̃ := -P^-1 Q M,                R̃ := -M R P^-1.
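Again, Eq. (15.10) is straightforward to verify numerically (our own
sketch; the diagonal shift merely ensures invertibility):

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 3, 2
    A = rng.normal(size=(n + m, n + m)) + (n + m) * np.eye(n + m)
    P, Q = A[:n, :n], A[:n, n:]
    R, S = A[n:, :n], A[n:, n:]

    P_inv = np.linalg.inv(P)
    M = np.linalg.inv(S - R @ P_inv @ Q)      # inverse of the Schur complement
    block_inv = np.block([
        [P_inv + P_inv @ Q @ M @ R @ P_inv, -P_inv @ Q @ M],
        [-M @ R @ P_inv,                     M            ],
    ])
    print(np.allclose(block_inv, np.linalg.inv(A)))   # True, Eq. (15.10)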
This relationship is often attributed to I. Schur (1917). The term
P is sometimes called the pivot block; the term M, and others of
its general form, are widely called Schur complements.8 This is
due to its use in the Schur determinant formula, also simply
known as the determinant lemma, which follows from the above:

|A| = |P| · |S - R P^-1 Q|.

8 An introduction to Schur complements, including motivations for the
terminology and a discussion of applications and related concepts, is
provided by Cottle (1974).

It can also be re-phrased in the notation of the matrix inversion


lemma above:

|A + UCV| = |A| ■ |C| ■ |C-1 + VA-1U. (15.11)

The related result for block matrices is9 9 Lutkepohl (1996), §4.2.2, Eq. (6)

A B
A non-singular ^ det det(A) det(D - CA-1B),
C D
r
A B
D non-singular ^ det det(D) det(A - BD-1C).
C D
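The block-inverse and Schur determinant identities can be verified in a few lines. The following NumPy sketch (random blocks, so the required inverses exist with probability one) builds Eq. (15.10) explicitly and checks it against a direct inverse.

import numpy as np

rng = np.random.default_rng(3)
n = 4
P, Q, R, S = (rng.standard_normal((n, n)) for _ in range(4))
A = np.block([[P, Q], [R, S]])

# Schur complement and the blocks of Eq. (15.10)
Pi = np.linalg.inv(P)
M = np.linalg.inv(S - R @ Pi @ Q)
A_inv = np.block([[Pi + Pi @ Q @ M @ R @ Pi, -Pi @ Q @ M],
                  [-M @ R @ Pi,              M]])
assert np.allclose(A_inv, np.linalg.inv(A))

# Schur determinant formula
assert np.isclose(np.linalg.det(A),
                  np.linalg.det(P) * np.linalg.det(S - R @ Pi @ Q))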
16
Introduction

Linear algebra operations like the solution of linear systems, inversion and decomposition of matrices are arguably the most basic kind of numerical computations. Alongside even more elemental operations like matrix-vector and matrix-matrix multiplication, they are the building blocks of virtually all heavyweight computation on contemporary computers. Consequently, this area has been studied to great depth, which has led to extremely well-crafted algorithms. A thorough introduction can be found in a tome by Golub and Van Loan (1996). A more recent - and only slightly less voluminous - treatment is provided by Björck (2015). Just seeing these books, each well over 700 pages thick,¹ sitting on a bookshelf should convince anyone that it would be foolish to attempt a probabilistic analysis of all of numerical linear algebra in this chapter. As in the other chapters, we will focus on a few key aspects.

¹ A less hefty introduction at undergraduate level is offered by Ipsen's (2009) concise book.
One strong simplification we will make is to assume that computations can be performed with arbitrary precision. Doing so gives us a licence to ignore all questions of numerical stability. Seasoned numerical analysts may raise their eyebrows here - making algorithms stable to numerical errors is a core goal of research in numerical linear algebra. But in machine learning, machine precision is regularly by far the smallest source of uncertainty, dominated by issues arising from the lack of data and data sub-sampling. By putting stability aside, we can concentrate on the question of epistemic uncertainty: what is the most efficient way to extract information about the solution of a linear system from a computer?

To simplify things further, we will focus on one particular problem - the solution of the symmetric, positive definite, real linear system (see Figure 16.1)

$$Ax = b, \quad \text{where } A \in \mathbb{R}^{N\times N} \text{ spd, and } x, b \in \mathbb{R}^N. \qquad (16.1)$$

Equation (16.1) can also be written as an optimisation problem: x is the unique vector that minimises the convex quadratic function

$$f(x) = \tfrac{1}{2}\,x^\top A x - x^\top b, \qquad (16.2)$$

and is thus known as the least-squares problem. f(x) has gradient (see Figure 16.1)

$$r(x) := \nabla f(x) = Ax - b, \qquad (16.3)$$

also known as the residual (for reasons to be introduced in §17.2). The terms residual and gradient will be used interchangeably, depending on which aspect is to be emphasised. f(x) has constant Hessian matrix ∇∇ᵀf(x) = A. Because A is spd, it has a unique inverse, which will play such a central role that it will be afforded its own symbol:

$$A^{-1} =: H.$$

Figure 16.1: Sketch of a symmetric positive definite linear problem.

(The choice of the letter H is historic convention for inverse Hessians in optimisation.) Problem (16.1) is interesting for two reasons. First, it lies at the heart of least-squares estimation, which in turn is the basis for many basic numerical, statistical, and indeed machine learning algorithms, like Gaussian process regression (§4.2). Second, as we will see in Chapter IV, estimating spd matrices and their inverses also features in local nonlinear optimisation. In fact, some of the most core algorithms of nonlinear optimisation will arise as natural extensions of the results of this chapter.

Solving linear problems is also interesting more generally because certain other core tasks of linear algebra can be solved with algorithms that are structurally similar. In particular, there is a deep connection between iterative linear solvers and methods for computing matrix decompositions (like eigen- and singular value decompositions) through the Krylov sequence.² Developing a probabilistic description for (iterative) linear solvers thus holds the promise of similar advances in other linear algebra tasks.

² See §11.3 in Golub and Van Loan (1996) for an introduction. This connection is also the reason why computing the posterior uncertainty (covariance) of Gaussian process regression adds little to no overhead over just computing the associated point estimate (posterior mean): finding the minimum of a quadratic problem requires the same considerations as characterising the geometry (Hessian) of the problem. As we have already seen in previous chapters, this has crucial implications for the complexity of Probabilistic Numerics: if we are happy to accept approximately Gaussian posteriors, then these can often be had at essentially no extra cost over the classic point estimate.

The largest part of this chapter will be devoted to re-phrasing an existing, elementary linear algebra method - the method of conjugate gradients. The point is to build an intuition for the constraints on the model class, in particular on the choice of prior covariance, that are required to build a lightweight, efficient estimator for linear algebra quantities. Because they form the foundation of many computational methods, linear solvers have to work on a large class of problems at low computational cost. For the most basic methods, this also means they must not themselves require the solution of a linear problem, not even one smaller than the original one. This restriction is less important for practical applications requiring uncertainty, where one may be willing to pay a small computational overhead for probabilistic "error bars" on point estimates. In such situations, it may be convenient to use a classic (highly optimised) solver for internal computations, and construct a probabilistic estimate "around" its output, using some statistics of the solver's computations to calibrate uncertainty.

► 16.1 Classic Methods

The "pedestrian" solution to Problem (16.1) is Gaussian elimination,³ one of the oldest algorithms known to humankind, and certainly one of the best-known ones, widely taught at high-school level. Gaussian elimination provides a solution for any simultaneously solvable set of linear equations, not just symmetric positive definite ones. It is the standard approach for small and medium-sized problems, among other reasons because it can be made numerically quite stable by ordering the operations suitably.⁴ Gaussian elimination of an N × N system in general involves N intermediate steps, each involving a matrix-vector multiplication, hence the overall complexity is O(N³).

However, Gaussian elimination is not an "anytime" algorithm - it only provides a correct answer after it runs to completion. In fact it can be shown⁵ that if this algorithm is stopped at step i < N and the computations performed until this point are used to compute an "estimate" x̂ᵢ of the true solution x, then there are scenarios in which the error ‖x̂ᵢ − x‖ can actually grow with every step i < N, and only converge suddenly on the final step i = N. When studying probabilistic formulations of linear algebra, we will primarily be interested in assigning uncertainty to incompletely solved problems, and thus in algorithms which improve a point estimate over the course of their run.

³ See Gauss (1809). For a historical perspective, see Grcar (2011).
⁴ In formal terms, Gaussian elimination is often captured in the so-called LU decomposition, a notion introduced by Turing (1948). The decomposition is made numerically stable by helpful permutations of A, a process known as pivoting. Details can be found in §3.2-3.4 of Golub and Van Loan (1996). The LU decomposition splits the matrix A into a Lower and an Upper triangular matrix, and is a generalisation of the Cholesky decomposition (Benoit, 1924) (which only applies to spd matrices). More can be found in Turing's very readable work.
⁵ Hestenes and Stiefel (1952).

Algorithm 16.1: Conjugate gradients (basic form). Adapted from Nocedal and Wright (1999). The algorithm takes as inputs the problem description (A, b) and an initial guess x₀ (often set to either 0 or b), then iteratively computes estimates xᵢ of typically increasing quality. The dominant computation is the matrix-vector multiplication in line 6; all other steps are of either linear or constant cost. Lines 8 and 9 could obviously be folded into the following two lines; they are kept separate here to ease the comparison to probabilistic equivalents constructed below.

 1  procedure CG(A(·), b, x₀)
 2      r₀ = Ax₀ − b                      / initial gradient
 3      d₀ = 0, β₀ = 0                    / for notational clarity
 4      for i = 1, ..., N do
 5          dᵢ = −rᵢ₋₁ + βᵢ₋₁dᵢ₋₁          / compute direction
 6          zᵢ = Adᵢ                       / compute
 7          αᵢ = −dᵢᵀrᵢ₋₁ / dᵢᵀzᵢ          / optimal step-size
 8          sᵢ = αᵢdᵢ                      / re-scale step
 9          yᵢ = αᵢzᵢ                      / re-scale observation
10          xᵢ = xᵢ₋₁ + sᵢ                 / update estimate for x
11          rᵢ = rᵢ₋₁ + yᵢ                 / new gradient at xᵢ
12          βᵢ = rᵢᵀrᵢ / rᵢ₋₁ᵀrᵢ₋₁         / compute conjugate correction
13      end for
14  end procedure
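For readers who prefer executable code, here is a direct transcription of Algorithm 16.1 into NumPy (a sketch; the function name, the stopping tolerance, and the test problem are illustrative and not part of the text). Note that the single matrix-vector product per iteration is passed in as a function, matching the A(·) notation.

import numpy as np

def cg(A_mul, b, x0, maxiter=None, tol=1e-10):
    # Conjugate gradients as in Algorithm 16.1; A_mul(v) returns A @ v.
    n = b.shape[0]
    maxiter = n if maxiter is None else maxiter
    x = x0.copy()
    r = A_mul(x) - b                      # initial gradient (line 2)
    d, beta = np.zeros(n), 0.0
    for _ in range(maxiter):
        d = -r + beta * d                 # direction (line 5)
        z = A_mul(d)                      # the matrix-vector product (line 6)
        alpha = -(d @ r) / (d @ z)        # optimal step size (line 7)
        x = x + alpha * d                 # update estimate (line 10)
        r_new = r + alpha * z             # new gradient (line 11)
        beta = (r_new @ r_new) / (r @ r)  # conjugate correction (line 12)
        r = r_new
        if np.linalg.norm(r) < tol:
            break
    return x

# usage on a random spd system
rng = np.random.default_rng(4)
L = rng.standard_normal((20, 20))
A = L @ L.T + 20 * np.eye(20)
b = rng.standard_normal(20)
x = cg(lambda v: A @ v, b, x0=np.zeros(20))
assert np.allclose(A @ x, b)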

Iterative solvers provide such a behaviour, and are used for large-scale problems where the cubic cost of finding the exact solution is too large to contemplate. These methods also involve individual steps of cost O(N²) each. But they aim to continuously improve an initial guess x₀, so that the algorithm can be stopped for i ≪ N and already provide a good estimate of x. The prototypical algorithm in this class is the method of conjugate gradients (cg),⁶ an algorithm applicable to the spd problem (16.1) above. For reference, it is displayed in Algorithm 16.1.

cg consists of a single loop of iterative refinements to an estimated solution xᵢ of Eq. (16.1). Each iteration computes a single matrix-vector multiplication (line 6), a projection of A along a vector dᵢ. The other lines consist of cheaper computations that track the current estimate xᵢ and the current residual r(xᵢ). The steps are determined by two "control" parameters α, β. The former, α, regulates how the observed result of the projection is used to update xᵢ, and the latter, β, governs which projection is chosen next. Hence, this algorithmic structure fits with the separation of an active agent into an estimation (inference) and action (decision, policy) part that we already employed in the integration chapter.

⁶ Hestenes and Stiefel (1952).
The central result of this chapter on linear algebra will be the construction of a set of active probabilistic inference methods that are consistent with conjugate gradients. That is, they construct a sequence of Gaussian posterior measures associated with a mean estimate for x that equals the sequence constructed by cg. Even for readers interested in non-symmetric, non-definite problems, cg is a good starting point, as it can be generalised to larger classes of tasks. Many iterative solvers can be constructed in this way, including the generalised minimal residual method (gmres),⁷ the bi-conjugate gradient method (bicg),⁸ the conjugate gradient squared method (cgs),⁹ and the quasi-minimal residual method (qmr).¹⁰ Further, cg is closely connected to the Lanczos process for approximately computing the eigenvalues of symmetric matrices,¹¹ and the generalisation to gmres is analogously connected to the more general Arnoldi process for iterative approximation of eigenvalues.¹² A probabilistic analysis of cg thus gets us closer to a general understanding of linear algebra in terms of probabilistic inference.

There are several good textbooks available for readers interested in the classic analysis of linear algebra methods. German-speaking readers can find a great overview of all the methods mentioned above, and of their relationship to each other, in §4.3 of the book by Meister (2011). A more compact introduction to gmres and cg in particular is also in §IV of Trefethen and Bau III (1997). The book by Parlett (1980) specifically focuses on symmetric matrices; the contents of the present chapter are related to the exposition in Chapters 12 and 13 therein.

⁷ Saad and Schultz (1986). ⁸ Fletcher (1975). ⁹ Sonneveld (1989). ¹⁰ Freund and Nachtigal (1991). ¹¹ Lanczos (1950). ¹² Arnoldi (1951).
17
Evaluation Strategies

In pursuit of a probabilistic formulation of linear algebra, it is tempting to start in the same way we started the treatment of integration in Chapter II - by writing down a family of probability measures, say, over H and x, and seeing if certain choices of parameters can be identified with the classic methods listed above. But due to the structured nature of linear algebra, this approach runs the risk of yielding intractable algorithms. A large part of linear algebra consists of efficient book-keeping. It is thus a better strategy to leave the probabilistic model abstract initially, and focus on algorithmic structure. Concrete implementations informed by the lessons from this section will be addressed in §19.
With this in mind, assume we want to solve the problem Ax = b with a probabilistic solver that mimics the structure of an iterative solver like cg (Algorithm 16.1). As noted above, these methods proceed by iteratively collecting observations of matrix-vector multiplications zᵢ = Adᵢ of some smartly chosen vector dᵢ with the matrix A. The vectors dᵢ are termed search directions or projections. We will thus consider probabilistic methods that collect action-observation pairs (dᵢ, zᵢ = Adᵢ). To simplify notation, we will collect the d₁, ..., dᵢ and z₁, ..., zᵢ in the columns of two rectangular matrices Dᵢ and Zᵢ, so that after M iterations of the method, Zᵢ, Dᵢ ∈ R^{N×M} and the entire set of observations can be written compactly as Zᵢ = ADᵢ. After each iteration, the solver constructs a posterior measure p(A | Dᵢ, Zᵢ), which can be used both to construct an estimate xᵢ, and to decide on the next projection dᵢ₊₁. For the moment, the only assumption we will make about the as-of-yet unspecified probabilistic inference scheme is that it is consistent with the observation, in the sense that it puts measure zero on all matrices A for which ADᵢ ≠ Zᵢ. This also implies that any reasonable point estimators Aᵢ and Hᵢ (for H = A⁻¹) constructed from this probability measure should have the property

$$A_i D_i = Z_i \quad \text{and} \quad H_i Z_i = D_i.$$

The crucial question is, how should the solver choose the action dᵢ₊₁ from the posterior?

► 17.1 Direct Methods

The most straightforward idea is to choose dᵢ₊₁ a priori, i.e. independent of the observations Zᵢ - for example along a randomly chosen direction, or as dᵢ₊₁ = eᵢ₊₁, the unit vector with elements e_{i+1,j} = δ_{i+1,j}. One advantage of this strategy is that the directions can be chosen in a way that keeps computational cost low. To wit, sparse directions like the choice dᵢ = eᵢ allow the computation of the projections zᵢ = Aeᵢ = A_{:,i} in linear time. This idea is at the heart of approximation methods that do not attempt to find an optimal solution to a linear problem, but only an ad hoc, reasonable yet cheap projection. Such approaches can work well if outside information about the structure of the matrix is available - for instance in least-squares problems where the matrix A is constructed by the evaluation of a positive definite function, as in kernel ridge regression or Gaussian process regression. Ideas like the Nyström approximation¹ (the name was chosen by Baker,² based on the quadrature work of Nyström³), inducing point methods,⁴ spectral approximations,⁵ and of course Gaussian elimination itself (i.e. LU and Cholesky decompositions), all fall in this category. M individual steps of such methods tend to have cost O(NM²), linear in the size of the matrix.

The downside of choosing projection directions a priori is that the solver cannot adapt to the structure of the matrix, so if the prior is badly calibrated, the solver's estimate may be bad, too. For the rest of this chapter, we turn our attention to solvers which use collected search directions to converge towards the exact solution of the linear problem (defined in Eq. (16.1)). Each iteration of such a method will typically have quadratic cost O(N²) because it involves a generic matrix-vector multiplication, so the cost of performing M steps is O(N²M).

¹ Williams and Seeger (2000). ² Baker (1973), §3. ³ Nyström (1930). ⁴ Snelson and Ghahramani (2005); Quiñonero-Candela and Rasmussen (2005). ⁵ Rahimi and Recht (2007).

► 17.2 Iterative Methods

We will take an intuitive approach to constructing an algorithmic skeleton for our adaptive probabilistic solver, which will be complemented by a subtle improvement later. Our initial developments will not require symmetry of A, but we will change this later. Recall the residual (defined earlier in §16)

$$r(x) = Ax - b.$$

For any estimate x, wherever it may come from, the update

$$x \;\mapsto\; x - Hr(x) = x - H(Ax - b) = Hb$$

would yield the exact solution, if we had access to the exact matrix inverse H. Since the solver has access to a posterior measure p(A | Dᵢ, Zᵢ), we will assume that this measure can be used to compute some estimate Hᵢ for H. The solver's estimate for the solution x should be consistent with Hᵢ. This suggests the estimation update rule

$$x_{i+1} := x_i - H_i r(x_i), \qquad (17.1)$$

where the inference on H is so far left abstract.


Following our general recipe, the second part of the solver is the action rule, the choice for the next projection dᵢ₊₁ of A. There are two related, but not identical, objectives for this rule. On the one hand, we would like to know the new residual r(xᵢ₊₁), if only to track progress and check for convergence (remember that the problem is solved iff r(xᵢ₊₁) = 0). On the other hand, we want to efficiently collect information about A and H; i.e. explore aspects of A that will maximally improve subsequent estimates xᵢ. To this end, consider the projection and accompanying observation

$$d_{i+1} := x_{i+1} - x_i = -H_i r(x_i), \qquad z_{i+1} = A d_{i+1}. \qquad (17.2)$$

An appealing aspect of this choice is that it allows us to compute the new residual without having to evaluate a second matrix-vector multiplication. We can simply update the residual in O(N) time as

$$r(x_{i+1}) = Ax_{i+1} - b = A(x_{i+1} - x_i + x_i) - b = z_{i+1} + r(x_i).$$

Figure 17.1: A quadratic optimisation problem: extremum as black centre, Hessian with eigen-directions represented by an ellipse with principal axes. The restriction of the quadratic to a linear sub-space is also a quadratic. The optimum in that sub-space can be found in a single division, but it is not identical to the projection of the global optimum onto the sub-space.

The downside of this action rule is that, without further analysis, there is no guarantee that it will produce particularly informative observations of A. After all, the step xᵢ₊₁ − xᵢ is just the greedy choice that moves the estimate to whatever is currently

Algorithm 17.1: A draft for a probabilistic iterative linear solver. The notation A(·) is meant to represent a function performing multiplication with A (which does not necessarily require N² operations, but might). In this algorithm and others below, the line collecting the observation is marked with a bold-font comment to signify the expensive part of the iteration. Note the structural similarity to cg (Algorithm 16.1). Line 10 is only a placeholder so far; concrete inference rules will be constructed later. In a practical implementation, lines 5 and 10 would be merged, so that no actual estimate of the square matrix H has to be formed.

 1  procedure LinSolve_Draft(A(·), b, p(A))
 2      x₀ = H₀b                            / initial guess
 3      r₀ = Ax₀ − b                        / initial gradient
 4      for i = 1, ..., N do
 5          dᵢ = −Hᵢ₋₁rᵢ₋₁                   / compute optimisation direction
 6          sᵢ = dᵢ                          / define projection
 7          yᵢ = Asᵢ                         / observe
 8          xᵢ = xᵢ₋₁ + sᵢ                   / update estimate for x
 9          rᵢ = rᵢ₋₁ + yᵢ                   / new gradient at xᵢ
10          Hᵢ = Infer(H | Yᵢ, Sᵢ, p(A))     / estimate H
11      end for
12  end procedure

estimated to be the solution. However, Figure 17.1 illustrates an intuition that might inspire some hope in Eq. (17.2). Recall from (16.3) that the residual is the gradient of the quadratic objective f(x). As such, if we consider the problem from an optimisation perspective, then the residual r(xᵢ) is the direction of maximal improvement. We will just have to take care that the mapping through the estimate Hᵢ does not destroy this beneficial aspect.

Algorithm 17.1 translates the two design choices of Eqs. (17.1) and (17.2) into more concrete pseudo-code for a first draft of our probabilistic iterative solver.⁶ There is a small change to the lines in Algorithm 17.1 that amounts to a significant improvement in practice. Note that the observation yᵢ (line 7 of Algorithm 17.1) is available at a point in the loop iteration before xᵢ is actually assigned (line 8). This offers an opportunity to already use some information from yᵢ within the loop, as long as it is possible to do so at low computational cost. In particular, we can re-scale the step as dᵢ ↦ αᵢdᵢ, using a scalar αᵢ ∈ R. Doing so introduces an ever so slight break in the consistency of the probabilistic belief: the estimate xᵢ in line 8 of Algorithm 17.1 will neither be equal to Hᵢ₋₁b nor to Hᵢb. But this is primarily an issue of algorithmic flow (the fact that xᵢ and Hᵢ are computed in different lines of the algorithm), and the practical improvements are too big to pass on. In any case, this adaptation is also present in the classic algorithms, so we need to include it to find exact equivalences.

⁶ A minor notational complication: line 3 of Algorithm 17.1 wastes one initial matrix-vector multiplication that is not used to infer information about H. This is for generality, so that x₀ = 0 (and H₀ = 0) remains a possible choice. A related issue one may worry about: why does the algorithm not simply set xᵢ = Hᵢb, which may seem like the "natural" estimate for the solution (e.g. it is the mean estimate for x if H is assigned a Gaussian measure)? Note that the update (17.1) can also be written as

$$x_{i+1} = x_i - H_i(Ax_i - b) = x_i - H_i\Big(\sum_{j\le i} As_j + Ax_0 - b\Big) = x_0 - H_iAx_0 + H_ib.$$

Hence, for x₀ = 0 (or HᵢAx₀ = x₀), Eq. (17.1) is actually equal to xᵢ = Hᵢb. This is mostly a problem of presentation: Algorithm 17.1 is a compromise, allowing a general probabilistic interpretation while staying close to classic formulations, which typically allow arbitrary x₀.
Indeed, under the assumption of symmetric A, the optimal scale αᵢ can be computed in linear time, using the observation zᵢ = Adᵢ. We will consider symmetric A hereafter. Consider the parametrised choice xᵢ = xᵢ₋₁ + αᵢdᵢ. The derivative of the

Algorithm 17.2: Projected linear optimisation. The variables s, y are redundant with d, z; either pair could be dropped in a real implementation. Both are kept around here to provide uniform notation across algorithms. Note the structural similarity to classic iterative solvers like conjugate gradients (Algorithm 16.1). The only difference is in lines 5 and 12.

 1  procedure LinSolve_Project(A(·), b, p(A))
 2      x₀ = H₀b                             / initial guess
 3      r₀ = Ax₀ − b                         / initial gradient
 4      for i = 1, ..., N do
 5          dᵢ = −Hᵢ₋₁rᵢ₋₁                    / compute optimisation direction
 6          zᵢ = Adᵢ                          / observe
 7          αᵢ = −dᵢᵀrᵢ₋₁ / dᵢᵀzᵢ             / optimal step-size
 8          sᵢ = αᵢdᵢ                         / re-scale step
 9          yᵢ = αᵢzᵢ                         / re-scale observation
10          xᵢ = xᵢ₋₁ + sᵢ                    / update estimate for x
11          rᵢ = rᵢ₋₁ + yᵢ                    / new gradient at xᵢ
12          Hᵢ = Infer(H | Yᵢ, Sᵢ, p(A))      / estimate H
13      end for
14  end procedure

convex objective f, Eq. (16.2), in this scalar parameter is

$$\frac{\partial f(x_{i-1} + \alpha_i d_i)}{\partial \alpha_i} = \alpha_i\, d_i^\top A d_i + d_i^\top (Ax_{i-1} - b) = \alpha_i\, d_i^\top A d_i + d_i^\top r_{i-1}.$$

This derivative is zero at

$$\alpha_i = -\frac{d_i^\top r_{i-1}}{d_i^\top A d_i}.$$

Recalling that ∇f(x) = Ax − b, note that this αᵢ defines the point where ∇f is orthogonal to dᵢ:

$$d_i^\top \nabla f(x_{i-1} + \alpha_i d_i) = d_i^\top (Ax_{i-1} + \alpha_i A d_i - b) = 0.$$

Thus, line 7 of Algorithm 17.1 provides exactly the required matrix-vector multiplication zᵢ = Adᵢ to find the root of the projected gradient. Explicitly,

$$\alpha_i = -\frac{d_i^\top r_{i-1}}{d_i^\top z_i}. \qquad (17.3)$$

Algorithm 17.2 incorporates this insight. It involves only a few re-arrangements from Algorithm 17.1, and one new line of linear cost, to compute the optimal step size αᵢ. This algorithm will serve as the structural skeleton for the remainder of this chapter. In §19 and following, we will consider several possible choices for the inference step in line 12, and see how they relate to existing classic algorithmic families. In preparation for this step, the following sections give a brief overview of some relevant families of classic linear solvers.
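The skeleton of Algorithm 17.2 is compact enough to be written out directly. Below is a minimal NumPy sketch (the function names linsolve_project and infer are illustrative placeholders, not from the text); the inference rule of line 12 is passed in as a function, to be made concrete in §19. As a sanity check, the trivial rule that always returns the identity turns the skeleton into steepest descent with exact line searches.

import numpy as np

def linsolve_project(A_mul, b, infer, H0, maxiter):
    # Sketch of Algorithm 17.2. `infer` maps the collected data (S_i, Y_i)
    # to a new estimate H_i of the inverse (line 12, model-dependent).
    x = H0 @ b                              # initial guess
    r = A_mul(x) - b                        # initial gradient
    H, S, Y = H0, [], []
    for _ in range(maxiter):
        if np.linalg.norm(r) < 1e-10:
            break
        d = -H @ r                          # optimisation direction (line 5)
        z = A_mul(d)                        # observe (line 6)
        alpha = -(d @ r) / (d @ z)          # optimal step size, Eq. (17.3)
        s, y = alpha * d, alpha * z         # re-scaled step and observation
        x, r = x + s, r + y                 # update estimate and gradient
        S.append(s); Y.append(y)
        H = infer(np.column_stack(S), np.column_stack(Y))
    return x

rng = np.random.default_rng(0)
L = rng.standard_normal((10, 10))
A = L @ L.T + 10 * np.eye(10)
b = rng.standard_normal(10)
x = linsolve_project(lambda v: A @ v, b,
                     infer=lambda S, Y: np.eye(10),   # placeholder inference
                     H0=np.eye(10), maxiter=200)
assert np.linalg.norm(A @ x - b) < 1e-6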
18
A Review of Some Classic Solvers

Several different notions and nomenclatures are used to describe linear solvers. This section connects Algorithm 17.2, in its general form or under certain restrictions, to some popular classic concepts. This will allow us, in Corollary 18.5, to identify a requirement that probabilistic solvers have to fulfil to be equivalent to cg.

► 18.1 Projection Methods

Independent of the concrete inference rule in line 12, Algorithm 17.2 selects the estimate xᵢ as xᵢ ∈ x₀ + Kᵢ with Kᵢ = span{d₁, d₂, ..., dᵢ}. That is, the estimate lies in an expanding sequence of sub-spaces spanned by the search directions. Algorithms that select the estimate as xᵢ ∈ x₀ + Kᵢ, for some Kᵢ, are known as projection methods.¹ Because αᵢ is chosen to minimise the objective f along the search direction, the resulting gradient rᵢ₊₁ is orthogonal to dᵢ (but not necessarily to all dⱼ for j < i). More generally, there is a space Lᵢ orthogonal to the gradient,

$$\nabla f(x_i) = r_i = Ax_i - b \;\perp\; \mathcal{L}_i. \qquad (18.1)$$

Methods for which K = L are called orthogonal projection methods, and Eq. (18.1) is then known as the Galerkin condition.²

¹ Meister (2011), Def. 4.58.
² Conversely, if K ≠ L, the term oblique-projection method is occasionally used; Eq. (18.1) is then denoted as the Petrov-Galerkin condition. Galerkin's name is attached to these methods because they relate to Galerkin methods, a family of solvers for partial differential equations that solve a high- or infinite-dimensional problem by projecting to a lower-dimensional space.

► 18.2 Conjugate Directions

A further restriction of orthogonal direction methods is formed by conjugate directions methods.
Definition 18.1 (Conjugate directions). Given a symmetric matrix A ∈ R^{N×N}, two vectors v, w ∈ R^N are called A-conjugate to each other if vᵀAw = 0. An iterative linear solver for Ax = b with symmetric A is called a conjugate directions method if it chooses projections dᵢ that are pairwise A-conjugate to each other.

Conjugate directions methods converge to the correct solution x in at most N steps,³ a property called linear consistency. They also ensure that the gradient rₖ after k steps is orthogonal to the preceding search directions,⁴ i.e. rₖᵀdᵢ = 0 for i < k. So they are orthogonal projection methods.

³ Nocedal and Wright (1999), Thm. 5.1.
⁴ Nocedal and Wright (1999), Thm. 5.2.
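Both properties are easy to observe empirically. The following NumPy sketch (an illustrative check, not from the text) runs plain cg (Algorithm 16.1) on a small spd system, records the directions, and confirms that DᵀAD is numerically diagonal, i.e. the directions are pairwise A-conjugate.

import numpy as np

rng = np.random.default_rng(5)
N = 8
L = rng.standard_normal((N, N))
A = L @ L.T + np.eye(N)                  # a symmetric positive definite matrix
b = rng.standard_normal(N)

# run N steps of cg from x0 = 0 and record the search directions d_i
x, r, d, beta, D = np.zeros(N), -b, np.zeros(N), 0.0, []
for _ in range(N):
    d = -r + beta * d
    z = A @ d
    alpha = -(d @ r) / (d @ z)
    x, r_new = x + alpha * d, r + alpha * z
    beta = (r_new @ r_new) / (r @ r)
    D.append(d)
    r = r_new

# pairwise A-conjugacy: D^T A D is (numerically) diagonal
D = np.column_stack(D)
G = D.T @ A @ D
off = G - np.diag(np.diag(G))
assert np.max(np.abs(off)) < 1e-8 * np.max(np.abs(G))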

The following result connects our skeletal Algorithm 17.2 to conjugate directions methods. For the sake of readability, most technical derivations have been moved to the end of the chapter.

Theorem 18.2 (proof on p. 183). If A is symmetric, and the inference rule in line 12 of Algorithm 17.2 produces a symmetric estimator Hᵢ, then Algorithm 17.2 is a conjugate directions method.

► 18.3 Krylov Subspace Methods

Another important sub-class of projection methods is formed by choosing Kᵢ as the Krylov sequence

$$\mathcal{K}_i = \mathcal{K}_i(r_0, A) = \operatorname{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{i-1}r_0\}. \qquad (18.2)$$

Such algorithms are called Krylov subspace methods. They include the aforementioned seminal cg and gmres. Krylov subspace methods have been studied in great detail since the 1980s. In addition to the textbooks already mentioned above, further information can also be found in the books by Demmel (1997), Greenbaum (1997), and Saad (2003). Due to the generic form of line 12, Algorithm 17.2 is not a Krylov subspace method in general, but we will see below what we need to do to make it one.
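That the cg iterates indeed live in the expanding Krylov spaces of Eq. (18.2) can be checked directly. In the sketch below (an illustrative check under the stated setup, not from the text), we run a few cg steps and verify that the estimate is exactly reproduced by its projection onto an orthonormal basis of K_m(r₀, A).

import numpy as np

rng = np.random.default_rng(6)
N, m = 8, 4
L = rng.standard_normal((N, N))
A = L @ L.T + np.eye(N)
b = rng.standard_normal(N)

# m steps of cg from x0 = 0 (cf. Algorithm 16.1); r0 = -b
x, r, d, beta = np.zeros(N), -b, np.zeros(N), 0.0
for _ in range(m):
    d = -r + beta * d
    z = A @ d
    alpha = -(d @ r) / (d @ z)
    x, r_new = x + alpha * d, r + alpha * z
    beta, r = (r_new @ r_new) / (r @ r), r_new

# orthonormal basis of the Krylov space K_m(r0, A) of Eq. (18.2)
K = np.column_stack([np.linalg.matrix_power(A, j) @ (-b) for j in range(m)])
Q, _ = np.linalg.qr(K)

# x_m lies in x0 + K_m: projecting onto the basis recovers it exactly
assert np.allclose(Q @ (Q.T @ x), x)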

► 18.4 Conjugate Gradients

One characterisation of cg is that it is the conjugate directions method that is also a Krylov subspace method. Section 19.9 will identify a class of probabilistic models under which Kᵢ is the Krylov sequence. Under this choice, Algorithm 17.2 will be shown to be equivalent to the method of conjugate gradients, based on the following technical result.

As in the preceding section, assume that the estimator Hᵢ constructed in line 12 of Algorithm 17.2 is symmetric. Further, now assume that Hᵢ is selected⁵ such that, for all i ≥ 1,

$$d_i = -H_{i-1}r_{i-1} \in \operatorname{span}\{s_1, \ldots, s_{i-1}, y_1, \ldots, y_{i-1}, r_{i-1}\}. \qquad (18.3)$$

The following lemma establishes the connection to Krylov subspace methods.

Lemma 18.3 (proof on p. 184). Assumption (18.3) is equivalent to the statement that Algorithm 17.2 is a Krylov subspace method, i.e. it is equivalent to the statement

$$s_i \in \operatorname{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{i-1}r_0\}. \qquad (18.4)$$

⁵ Eq. (18.3) is obviously redundant, because the spans of Sᵢ, Yᵢ, and rᵢ may overlap. This somewhat awkward form is deliberate, as it will later simplify the construction of concrete probability measures satisfying this restriction. Lemma 18.3 clears out the notational undergrowth, revealing that the assumption can be written much more succinctly.

We will now show the following theorem.

Theorem 18.4 (proof on p. 184). If A is spd, Hᵢ is symmetric for all i ≥ 0, Assumption (18.3) holds, and Algorithm 17.2 does not terminate before step k < N, then

$$r_i \perp r_j \quad \forall\; 0 \le i \ne j < k,$$

and there exist γᵢ ∈ R∖{0} for all i < k so that line 12 in Algorithm 17.2 can be written as

$$d_i = -H_{i-1}r_{i-1} = \gamma_i\left(-r_{i-1} + \beta_{i-1}\,\frac{d_{i-1}}{\gamma_{i-1}}\right), \qquad (18.5)$$

with

$$\beta_{i-1} := \frac{r_{i-1}^\top r_{i-1}}{r_{i-2}^\top r_{i-2}}.$$

Comparing Algorithm 17.2 to cg (Algorithm 16.1), we note that they are identical up to re-scaling by γᵢ:

$$d_i^{\text{cg}} = \gamma_i^{-1}\, d_i^{\text{probabilistic}}.$$

Since the scaling γᵢ ∈ R is cancelled by the step size αᵢ in line 7, the two algorithms produce the same sequence of estimates {xᵢ}, and we have the following result:

Corollary 18.5. Under the assumptions of Theorem 18.4 on the estimators Hᵢ, Algorithm 17.2 is equivalent to the method of conjugate gradients, in the sense that it produces the exact same sequence {x₀, x₁, ..., x_N} of estimates as cg initialised at x₀ = H₀b.

Figure 18.1 depicts how the gradients are sampled by cg - or equivalently, if the assumptions of Theorem 18.4 are satisfied, by Algorithm 17.2.

Figure 18.1: Analogous plot to Figure 17.1. The gradients at points sampled independent of the problem's structure ("needles" of point and gradient as black line, drawn from a spherical Gaussian distribution around the extremum) are likely to be dominated by the eigenvectors of the largest eigenvalues. Thus, by following the gradient of the problem, one can efficiently compute a low-rank approximation of A that captures most of the dominant structure. This intuition is at the heart of the Lanczos process that provides the structure of conjugate gradients.

► 18.5 Preconditioning

Krylov subspaces (Eq. (18.2)) have the following invariance properties⁶ for scalars α, τ ∈ R∖{0} and orthonormal matrices U (i.e. Uᵀ = U⁻¹):

$$\text{scaling:} \quad \mathcal{K}_m(\alpha r, \tau A) = \mathcal{K}_m(r, A),$$
$$\text{translation:} \quad \mathcal{K}_m(r, A + \alpha I) = \mathcal{K}_m(r, A),$$
$$\text{change of basis:} \quad \mathcal{K}_m(Ur, UAU^\top) = U\,\mathcal{K}_m(r, A). \qquad (18.6)$$

A generalisation of the last property, Eq. (18.6), is the notion of a preconditioned iterative solver. Take a non-singular matrix C, and consider the transformed - the pre-conditioned - problem

$$\tilde{A}\tilde{x} = \tilde{b}, \quad \text{with} \quad \tilde{A} := C^{-\top}AC^{-1}, \quad \tilde{x} := Cx, \quad \tilde{b} := C^{-\top}b. \qquad (18.7)$$

⁶ Parlett (1980), §12.2.2.

If C is chosen smartly, the conjugate gradient method can converge significantly faster on the transformed system (18.7) than on the original one. As shown in Algorithm 18.1, preconditioning can be almost totally externalised from the basic solver, by adding only two additional lines, both solving a linear problem of the form Kgᵢ = rᵢ, with the matrix K = CCᵀ. The ideal but impractical choice is K = A, which directly yields the exact solution in one step. Of course, having to compute the inverse C⁻¹ of what would then be the matrix square root of A begs the question of running the solver in the first place. Good choices of pre-conditioners C are thus some form of incomplete or approximate decompositions of A that can be efficiently inverted.⁷

⁷ For a general introduction to this topic and some specific examples, see Golub and Van Loan (1996), §11.5.

Let us return to the context of probabilistic solvers, and speak simplistically for the moment. Imagine that we find a probability measure p_cg(A) over A consistent with the conjugate gradient method, in the sense of Theorem 18.4. Then consider a variation p_pcg(A) of it that maps dᵢ to the transformed space

$$d_i \in \operatorname{span}\{s_1, \ldots, s_{i-1}, y_1, \ldots, y_{i-1}, K^{-1}r_{i-1}\}. \qquad (18.8)$$

Then, p_pcg(A) is consistent with a preconditioned conjugate gradient method using the pre-conditioner K = CCᵀ.

Since the process of preconditioning involves using an a priori guess K⁻¹ for the matrix inverse H to simplify the work of the solver, it will be no surprise for us to find later (Corollary 19.17) that this modification in Eq. (18.8) is associated with a change to the prior of a probabilistic solver. And, with the derivations still to follow, we can already anticipate that the ideal choice K = A will be associated with a particularly "natural" prior.

Algorithm 18.1: Preconditioned conjugate gradients. Adapted from Nocedal and Wright (1999). The only difference to Algorithm 16.1 is the construction of the corrected gradient in lines 3 and 13.

 1  procedure pCG(A(·), b, x₀)
 2      r₀ = Ax₀ − b                         / initial gradient
 3      g₀ = K⁻¹r₀                           / corrected gradient
 4      d₀ = 0, β₀ = 0                       / for notational clarity
 5      for i = 1, ..., N do
 6          dᵢ = −gᵢ₋₁ + βᵢ₋₁dᵢ₋₁             / compute direction
 7          zᵢ = Adᵢ                          / compute
 8          αᵢ = −dᵢᵀrᵢ₋₁ / dᵢᵀzᵢ             / optimal step-size
 9          sᵢ = αᵢdᵢ                         / re-scale step
10          yᵢ = αᵢzᵢ                         / re-scale observation
11          xᵢ = xᵢ₋₁ + sᵢ                    / update estimate for x
12          rᵢ = rᵢ₋₁ + yᵢ                    / new gradient at xᵢ
13          gᵢ = K⁻¹rᵢ                        / corrected gradient
14          βᵢ = rᵢᵀgᵢ / rᵢ₋₁ᵀgᵢ₋₁            / compute conjugate correction
15      end for
16  end procedure
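For concreteness, here is a NumPy transcription of Algorithm 18.1 (a sketch; function names, the Jacobi preconditioner used in the usage example, and the test problem are illustrative choices, not from the text). The preconditioner enters only through a routine that solves Kg = r.

import numpy as np

def pcg(A_mul, b, x0, K_solve, maxiter, tol=1e-10):
    # Preconditioned CG as in Algorithm 18.1; K_solve(r) solves K g = r.
    x = x0.copy()
    r = A_mul(x) - b                        # initial gradient
    g = K_solve(r)                          # corrected gradient
    d, beta = np.zeros_like(b), 0.0
    for _ in range(maxiter):
        d = -g + beta * d                   # direction (line 6)
        z = A_mul(d)
        alpha = -(d @ r) / (d @ z)          # optimal step size (line 8)
        x = x + alpha * d
        r_new = r + alpha * z
        g_new = K_solve(r_new)              # corrected gradient (line 13)
        beta = (r_new @ g_new) / (r @ g)    # conjugate correction (line 14)
        r, g = r_new, g_new
        if np.linalg.norm(r) < tol:
            break
    return x

# usage with a Jacobi (diagonal) preconditioner, a common cheap choice
rng = np.random.default_rng(7)
N = 50
L = np.tril(rng.standard_normal((N, N)))
A = L @ L.T + np.diag(rng.uniform(1, 100, N))   # poorly scaled spd matrix
b = rng.standard_normal(N)
x = pcg(lambda v: A @ v, b, np.zeros(N), lambda r: r / np.diag(A), maxiter=N)
assert np.allclose(A @ x, b, atol=1e-6)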

► 18.6 Summary: Connecting Algorithm 17.2 to Classic Solvers

This section cast several families of classic linear solvers in the notation of Algorithm 17.2. Doing so identified certain properties of the prior measure p(A) that make Algorithm 17.2 equivalent to these classic methods: any probabilistic estimator for the matrix inverse H that is merely consistent with the observations can give rise to a projection method. If the estimator is also symmetric, Algorithm 17.2 becomes a conjugate directions method. Finally, we identified certain technical properties that make the algorithm equivalent to the method of conjugate gradients, and its preconditioned generalisation.

None of these restrictions directly implies a concrete form for p(A) - in particular, not a Gaussian one. So the situation in linear algebra differs from that in integration, where each quadrature rule was directly associated, up to a single scalar degree of freedom, with a particular Gaussian process prior on the integrand. The underlying abstract reason is that iterative linear solvers do a comparably simple job: an N × N invertible matrix has a finite number N² of degrees of freedom. At each iteration, the solver identifies N of them perfectly, while learning nothing about the other N(N − 1). Because of this, the estimation aspect of the computation is less pronounced than in, say, integration, where a finite number of evaluations must be used to estimate an infinite sum. Linear algebra, to a significant degree, amounts to sophisticated book-keeping rather than inferring an intractable object.
This is not to say there is no use for uncertainty in linear solvers. It just so happens that classic solvers address a corner case, one less demanding of uncertainty. It is nevertheless useful to understand the connection to probabilistic inference in this domain, because uncertainty is more prominently important if:

* the object of interest is an infinite-dimensional operator, rather than a finite matrix (this situation arises in the solution of partial differential equations); or
* the question at hand involves the matrix inverse itself more than the estimate of the solution - for example if we want to compute Laplace approximations to large-scale models like deep neural networks,⁸ which involve the inverse Hessian of the regularised empirical loss. Such matrices are routinely much too large to be directly inverted. But they can also have a limited number of prominent eigenvalues. If we use an iterative solver in such situations, we will be forced to stop it very early compared to the size of the matrix, and then look for a good uncertainty estimate over the "unexplored" remainder; or
* the projection observations are corrupted by noise. This situation arises regularly in machine learning applications when data is sub-sampled to lower computational cost. Sometimes such situations can be re-phrased as projections of a Schur complement, which is effectively a reduction to the previous points.

⁸ MacKay (1992); Daxberger et al. (2021).

A lack of restrictions is simultaneously a blessing and a curse for our search for a probabilistic interpretation of linear algebra. We may worry: if many quite different probabilistic priors can simultaneously be associated with, say, the conjugate gradient strategy, which is the right one? The answer we will find is: it depends. On the one hand, considerable additional prior assumptions will be required to arrive at a tractable uncertainty estimate. On the other, it will become clear that different kinds of priors may be useful to address different kinds of tasks.
19
Probabilistic Linear Solvers: Algorithmic Scaffold

► 19.1 Gaussian Distributions over Matrices

To build a concrete solver, we need a concrete family of probability distributions over matrices. And since matrices are structured objects, constructing such a family is not entirely straightforward. There are arbitrarily many ways to define a probability measure over the elements of a real matrix. To arrive at tractable forms, though, we require a probability measure that "fits well" with observations of the form Y = AS.

For example, the widely known Wishart distribution¹ might at first seem like a great choice to model least-squares problems (Eq. (16.1)), since it assigns probability density to only symmetric positive definite matrices.² Unfortunately, the posterior arising from conditioning on the linear projections (Y, S) is not Wishart, and has no obvious compact form. However, see Exercise 19.4 and Section 15.2 for a further discussion of how the Wishart distribution can be connected, at least in approximation, to the following derivations.

Instead of the Wishart, we will once again resort to Gaussian distributions. It is straightforward to define a joint Gaussian measure over the elements of any real matrix, even a rectangular X ∈ R^{N×K}, by using the vectorisation operation (§15.1) and the multivariate Gaussian distribution over vectors to define³

$$\mathcal{N}(X; \vec{X}_0, \Sigma_0) := \frac{1}{(2\pi)^{NK/2}\,|\Sigma_0|^{1/2}} \exp\left(-\tfrac{1}{2}(\vec{X} - \vec{X}_0)^\top \Sigma_0^{-1} (\vec{X} - \vec{X}_0)\right), \qquad (19.3)$$

¹ Wishart (1928).
² The Wishart distribution W(X; V, ν) is the measure over symmetric matrices X ∈ R^{N×N} with ν > N − 1 ∈ R degrees of freedom and the positive definite scale matrix V ∈ R^{N×N}, defined by the probability density function (suppressing a complicated normalisation constant)

$$\mathcal{W}(X; V, \nu) \propto |X|^{(\nu - N - 1)/2}\, e^{-\frac{1}{2}\operatorname{tr}(V^{-1}X)}. \qquad (19.1)$$

It can also be characterised as the measure arising from the following generative process. Consider ν vectors wᵢ ∈ R^N drawn independently from p(wᵢ) = N(0, V). Then the symmetric N × N matrix

$$X = \sum_{i=1}^{\nu} w_i w_i^\top \qquad (19.2)$$

has distribution W(X; V, ν).
³ We will omit the arrow over matrices in the shorthand N(·; ·, ·). For matrices, the Gaussian distribution will always be used in this sense, so there is no risk of ambiguity.

where X⃗₀ is any vector in R^{NK}, interpreted as a vectorised matrix, and Σ₀ is a symmetric, positive semi-definite matrix in R^{NK×NK}. If Σ₀ is full rank (i.e. if it is spd), this distribution assigns non-zero measure to every matrix in R^{N×K}. In particular, if X is square, this distribution assigns non-vanishing measure to symmetric and asymmetric, to positive definite and indefinite matrices, and to invertible and non-invertible ones. Sections 19.5 and 19.6 will consider what kind of restrictions to this generality can, and cannot, be readily incorporated in a Gaussian prior. For the moment, though, we consider the general form of Eq. (19.3).
► 19.2 A Prior over A, H, or x?

When considering probabilistic formulations of the linear problem, we have to consider what, exactly, are the uncertain aspects of the equation Ax = b. Is it just the vector x? Are we looking for one solution x for one particular b, or for the general inverse H = A⁻¹ that can solve for any b? In fact, we may even be uncertain about the elements of A themselves, because being able to perform multiplications Ad = z for arbitrary d ∈ R^N (the only operation on A required for Algorithm 17.2) is not the same as having access to every element of A.⁴ For the majority of this chapter, we will consider the more general problem of inferring the matrix-valued objects. Section 20.2 considers simplifications arising if one only requires one particular solution x to one particular b. This means there are principally two different ways to use a Gaussian prior in the solution of a linear system. Both have historical⁵ and practical relevance.

1. One may treat the matrix A itself as the latent object, and define a joint probability distribution⁶ p(A, Y, S). This model class will be called inference on A. This approach has the advantage that the computation of the matrix-matrix product AS = Y is described explicitly. This would be relevant, for example, if the main source of uncertainty is in this computation itself - if we do not actually compute exact matrix-matrix multiplications, but only approximations of them (a setting not further discussed here).

The downside of this formulation is that it does not explicitly involve x. This is an issue because a tractable probability distribution on A may induce a complicated distribution on x. For intuition, Figure 19.1 shows distributions of the inverse of a scalar Gaussian variable of varying mean. For matrices, the situation is even more complicated, as the probability measure might put non-vanishing density on matrices that do not even have an inverse to begin with.

⁴ Instead, A may be an implicitly defined linear operation. For example, a convolution operation can be defined as "take the Fourier transform of x, multiply the resulting vector element-wise, then take the Fourier back-transform", and can be implemented without actually writing it out as an explicit matrix.
⁵ Dennis and Moré (1977), §7.3.
⁶ We will generally assume that b itself is known with certainty, and thus not explicitly include it in the generative model.

Figure 19.1: Inverses of Gaussian variables are not themselves normally distributed. The plot shows the distribution of x⁻¹ if p(x) = N(x; μ, 1), for five different values of μ. Since (x + ε)⁻¹ ≈ x⁻¹ − εx⁻², in the limit |μ| → ∞, the distribution approaches p(x⁻¹) ≈ N(x⁻¹; μ⁻¹, μ⁻⁴). However, for small values of μ, the distribution becomes strongly bi-modal. It is therefore clear that we will have to resort to approximations if we want to infer both matrices and their inverse while using a Gaussian distribution to model either variable. See also Figure 19.9 for more discussion.

2. The alternative is to write the solution x explicitly as

$$x = Hb$$

(recall that we assumed A to be spd, so the inverse H = A⁻¹ exists), and to formulate a distribution p(H, Y, S) on H. This approach will be called inference on H. The advantage of this formulation is that, since x is a linear function of H, simple tractable posteriors on H translate into tractable posteriors on x. The downside is that if Y = AS does not hold exactly - if the computations are performed approximately - it can be difficult to capture the likelihood p(Y, S | H).

We will focus on the special but important case that observations Z = AD are made without noise. In this setting, the same set of observations can be written as linear maps of either A or H:

$$Z = AD \iff D = HZ.$$

In this case, inference on A can be directly transformed into inference on H by simultaneously exchanging

$$S \leftrightarrow Y \quad \text{and} \quad A \leftrightarrow H.$$

Because of this connection, to avoid cluttering notation, we will consider only inference on A for the moment, and return to study inference on H later.

► 19.3 General Gaussian Inference

How far can we get with a general Gaussian prior

$$p(A; \vec{A}_0, \Sigma_0) = \mathcal{N}(A; \vec{A}_0, \Sigma_0)$$

over the elements of A? The observations Y = AS can be cast as a Dirac likelihood, the limit of a Gaussian likelihood:

$$p(S, Y \mid A) = \delta\big(\vec{Y} - (I \otimes S^\top)\vec{A}\big) = \lim_{\beta \to 0} \mathcal{N}\big(\vec{Y}; (I \otimes S^\top)\vec{A}, \beta\Lambda\big),$$

for any constant, spd Λ ∈ R^{NM×NM} (recall that M is the number of observations: S ∈ R^{N×M}). The Kronecker product formulation shows that the observation Y is a linear projection of the elements of A, hence the posterior measure arising from conditioning on this observation will also have Gaussian form. From Eqs. (3.10) and (3.8), it is

$$p(A \mid Y, S) = \mathcal{N}(A; \vec{A}_M, \Sigma_M), \qquad (19.4)$$

with mean and covariance given by, respectively,

$$\vec{A}_M := \vec{A}_0 + \Sigma_0(I \otimes S)\underbrace{\big((I \otimes S^\top)\Sigma_0(I \otimes S)\big)^{-1}}_{=:\,G_M^{-1}}\big(\vec{Y} - (I \otimes S^\top)\vec{A}_0\big), \quad \text{and} \qquad (19.5)$$

$$\Sigma_M := \Sigma_0 - \Sigma_0(I \otimes S)\big((I \otimes S^\top)\Sigma_0(I \otimes S)\big)^{-1}(I \otimes S^\top)\Sigma_0. \qquad (19.6)$$

Unfortunately, the Gram matrix G_M featuring in both mean and covariance is of size NM × NM, and hence actually larger than the original N × N matrix. But there is little point in an inference model for the inverse of a matrix which requires computing the inverse of an even larger matrix! Thus, some structure will have to be imposed on Σ₀ to simplify the posterior's parameters, and build a practical algorithm.
We briefly reassure ourselves that, no matter the prior covariance Σ₀, the posterior of Eq. (19.4) is consistent with the observations, in the sense that it only puts non-zero mass on matrices A obeying AS = Y. In this regard, we use the general property of Gaussian measures from Eq. (3.4) to see that the marginal distribution over the linear projection (I ⊗ S)ᵀA⃗ = vec(AS) is

$$p(\vec{AS}) = \mathcal{N}\big(\vec{AS};\, (I \otimes S)^\top \vec{A}_M,\, (I \otimes S)^\top \Sigma_M (I \otimes S)\big) = \mathcal{N}(\vec{AS}; \vec{Y}, 0). \qquad (19.7)$$

Hence, any sample Ã from this measure is consistent with the observations. If this sample is invertible, it also has the property Ã⁻¹Y = S required in Algorithm 17.2. This holds, in particular, also for the posterior mean A_M. To arrive at a practical algorithm, we will now consider two classes of covariances. Each allows efficient inference over matrix elements, and also ensures that A_M is invertible, with an analytically computable inverse. As such, they provide good candidates for the estimators Hᵢ used in Algorithm 17.2.
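Before specialising the covariance, it is instructive to compute the general posterior once by brute force. The following NumPy sketch (illustrative sizes and priors; row-major reshaping plays the role of the vectorisation, under which vec(AS) = (I ⊗ Sᵀ) vec(A)) builds the full NM × NM Gram matrix of Eq. (19.5) explicitly - exactly the cost problem the text warns about - and confirms the consistency property A_M S = Y of Eq. (19.7).

import numpy as np

rng = np.random.default_rng(9)
N, M = 4, 2
Lm = rng.standard_normal((N, N))
A_true = Lm @ Lm.T + N * np.eye(N)           # the latent spd matrix
S = rng.standard_normal((N, M))              # actions
y = (A_true @ S).reshape(-1)                 # vectorised observations vec(Y)

# general Gaussian prior over vec(A): mean A0, covariance Sigma0
A0 = np.eye(N).reshape(-1)
C = rng.standard_normal((N * N, N * N))
Sigma0 = C @ C.T + np.eye(N * N)

P = np.kron(np.eye(N), S.T)                  # observation operator (I kron S^T)

# Gaussian conditioning, Eqs. (19.5) and (19.6)
G = P @ Sigma0 @ P.T                         # Gram matrix, size NM x NM
gain = Sigma0 @ P.T @ np.linalg.inv(G)
A_post = A0 + gain @ (y - P @ A0)
Sigma_post = Sigma0 - gain @ P @ Sigma0

# the posterior mean reproduces the observations: A_M S = Y, cf. Eq. (19.7)
assert np.allclose(A_post.reshape(N, N) @ S, A_true @ S)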

► 19.4 Kronecker Covariances for Efficient Computation

The general Gaussian prior (19.3) is quite a flexible model; it allows for arbitrary covariance between any pair of matrix elements. To reduce computational cost, we can try to restrict the expressive power of Σ₀.

This situation is not unrelated to that in integration. To construct tractable integration methods in Chapter II, we had to restrict the prior to gp models with mean and covariance functions on the integrand chosen such that the posterior mean could be analytically integrated. In this chapter, we will similarly be forced to restrict the prior covariance on matrix elements to a class that allows analytic inversion of the posterior mean. While it is a somewhat ill-posed task to identify the set of all analytically integrable kernels for quadrature, we will below identify a concrete class of tractable priors for linear algebra.

A naive choice would be to model all matrix elements as independent of each other. This corresponds to a diagonal Σ₀, and means computing G costs O(NM²). However, this leads to a G whose inversion costs more than M³ (Exercise 19.1), which is still an unacceptably high cost.

Exercise 19.1 (intermediate, and somewhat tedious. Solution on p. 362). Consider the assumption Σ_{ij,kℓ} = δ_{ik}δ_{jℓ}W_{ij}, with a positive definite W. It amounts to an independent Gaussian prior, of variance W_{ij}, for every single matrix element. Show that this choice gives rise to a G in Eq. (19.5) whose inverse involves N separate positive definite matrices, each of size M × M.

A better prior would encode the fact that A is not just a long vector, but contains the elements of a square matrix. The projection terms (I ⊗ S), with their Kronecker product structure, already contain information about the generative process of the observations. We thus consider a Kronecker product for the prior covariance, too:⁷

$$\Sigma_0 = V_0 \otimes W_0 \quad \text{with spd } V_0, W_0 \in \mathbb{R}^{N\times N}. \qquad (19.8)$$

What kind of prior assumptions are we making here? If both matrices in a Kronecker product are spd, so is their Kronecker product (see Eq. (15.6)). Hence, Eq. (19.8) yields an spd overall covariance, and the prior assigns non-vanishing probability density to every matrix A, including non-invertible, indefinite ones, etc., despite the fact that such spd matrices V₀, W₀ only offer

$$2 \cdot \tfrac{1}{2}N(N+1) = N(N+1)$$

degrees of freedom (as opposed to the ½N²(N² + 1) degrees of freedom in a general spd Σ₀).⁸

The prior assumptions encoded by a Kronecker product in the covariance are subtle. A few intuitive observations follow. The Kronecker covariance can be written as

$$\operatorname{cov}(A_{ij}, A_{k\ell}) = [V_0 \otimes W_0]_{ij,k\ell} = [V_0]_{ik} \cdot [W_0]_{j\ell}. \qquad (19.9)$$

As such, it is tempting to think of V₀ as capturing covariance "among the rows" and W₀ "among the columns" (Figure 19.2). This is not entirely incorrect. Consider the following process to create a matrix Ã:

1. Draw N column-vectors bᵢ ∈ R^{N×1}, i = 1, ..., N, i.i.d. as bᵢ ~ N(0, V₀).

2. Draw N row-vectors cⱼ ∈ R^{1×N}, j = 1, ..., N, i.i.d. as cⱼ ~ N(0, W₀).

⁷ Distributions of the form N(X; X₀, V ⊗ W) are sometimes called a matrix-variate normal, due to a paper by Dawid (1981). This convention will be avoided here, since it can give the incorrect impression that this is the only possibility to assign a Gaussian distribution over the elements of a matrix, when in fact Eq. (19.3) is the most general such distribution.
⁸ One helpful intuition for this situation is to convince oneself that the space of Kronecker products spans a sub-space of rank one within the space of N² × N² real matrices, and that this sub-space does contain a space of spd matrices.

Figure 19.2: Imposing Kronecker product structure V ⊗ W on the prior covariance means the covariance over matrix elements is governed by two factors. V affects covariance as a function of row number, W as a function of column number. This is not quite the same as assuming the matrix itself is a product of two terms.
3. Set Ãᵢⱼ = bᵢcⱼ + A₀,ᵢⱼ, i.e. Ã = B ⊙ C + A₀, if B and C denote the matrices resulting from arranging the vectors bᵢ and cⱼ into square matrices in the obvious way.

The matrix Ã arising from this process is not Gaussian distributed.⁹ But it indeed¹⁰ satisfies Eq. (19.9): the covariance is Kronecker-structured.

Another helpful observation is that the marginal variance of individual matrix elements under the choice Σ₀ = V₀ ⊗ W₀ is determined solely by the diagonals of the two matrices in the Kronecker product:

$$\operatorname{var}(A_{ij}) = [V_0]_{ii} \cdot [W_0]_{jj}.$$

Figure 19.3 shows five samples each from two different Gaussian distributions with Kronecker product covariance.

The main takeaway is that while the Kronecker covariance does represent a helpful restriction, it is not one that limits the space of matrices that we can infer. The prior measure encompasses all matrices.

⁹ NB: the product of two Gaussian probability distributions is another Gaussian distribution (times a Gaussian normalisation constant). But the product of two Gaussian random variables is not itself a Gaussian random variable!
¹⁰ This is because E(ÃᵢⱼÃₖℓ) − E(Ãᵢⱼ)E(Ãₖℓ) = E(bᵢcⱼbₖcℓ) = [V₀]ᵢₖ · [W₀]ⱼℓ.

What Kronecker structure on the covariance does achieve is a drastic reduction of computational complexity. Choosing Σ₀ as in Eq. (19.8) shortens the posterior's mean and covariance (Eqs. (19.5) and (19.6)) to, respectively,

$$A_M = A_0 + \underbrace{(Y - A_0S)}_{=:\,\Delta_M \in \mathbb{R}^{N\times M}}\underbrace{(S^\top W_0 S)^{-1}}_{\in \mathbb{R}^{M\times M}}\underbrace{S^\top W_0}_{\in \mathbb{R}^{M\times N}}, \quad \text{and} \qquad (19.10)$$

$$\Sigma_M = V_0 \otimes \underbrace{\big(W_0 - W_0S(S^\top W_0S)^{-1}S^\top W_0\big)}_{=:\,W_M}. \qquad (19.11)$$

Note the absence of arrows over A_M and W_M. The posterior mean A_M is a sum of the prior mean (which can, of course, be chosen to be simple, e.g. a scalar or diagonal matrix), and an outer product of rank at most M. All the objects can be stored in O(NM²), and they contain a single matrix inverse of only an M × M matrix. The terms in this outer product are:

Δ_M - the residual between the expected value A₀S (the prediction for Y under the prior) and the actual observation;
SᵀW₀S - the predictive covariance between the rows of the residual; and
SᵀW₀ - the predictive covariance between the residual's rows and the rows of A.

Figure 19.3: Five i.i.d. samples from the distribution N(I, V₀ ⊗ W₀) for the choices V₀ = W₀ = I (left column) and V₀ = diag[10⁴, 9⁴, 8⁴, ...], W₀ = I (right column), respectively.

Exercise 19.2 (easy). Using Eqs. (15.2) to (15.7), prove that the choice (19.8) simplifies Eqs. (19.5) & (19.6) to (19.10) & (19.11), respectively.
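The promised cost reduction is visible in code: the posterior mean of Eq. (19.10) requires solving only an M × M system. A minimal NumPy sketch (the choices of A₀ and W₀ below are illustrative, not prescribed by the text) also confirms the consistency property A_M S = Y:

import numpy as np

rng = np.random.default_rng(10)
N, M = 6, 2
Lm = rng.standard_normal((N, N))
A_true = Lm @ Lm.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A_true @ S

A0 = np.eye(N)                               # prior mean (illustrative)
W0 = np.eye(N)                               # covariance factor; V0 drops out of the mean

# posterior mean, Eq. (19.10): only an M x M matrix is inverted
Delta = Y - A0 @ S                           # residual Delta_M
G = S.T @ W0 @ S                             # M x M predictive covariance
AM = A0 + Delta @ np.linalg.solve(G, S.T @ W0)

assert np.allclose(AM @ S, Y)                # consistency with the observations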

The vector Y shows up in A_M without a mapping to its left. In the simple case of A₀ = 0, the columns of A_M are a scaled copy of Y, with each row transformed by W₀. This is reflected in the posterior variance: observing linear projections of all the rows of A, given assumptions about the covariance along rows and columns, affects the "uncertainty along the rows", but not "along the columns".

The posterior mean A_M of Equation (19.10) provides an estimator for A. Algorithm 17.2 instead requires an estimator for H = A⁻¹. Thanks to the low-rank structure of A_M, however, it is straightforward to compute its inverse, and use it as such an estimator Hᵢ = Aᵢ⁻¹. Using the matrix inversion lemma (15.9), we find, after a few simple re-arrangements,

$$A_M^{-1} = \big(A_0 + \Delta_M(S^\top W_0S)^{-1}S^\top W_0\big)^{-1} = A_0^{-1} + (S - A_0^{-1}Y)(S^\top W_0A_0^{-1}Y)^{-1}S^\top W_0A_0^{-1}. \qquad (19.12)$$

But we remind ourselves that this inverse of the expected value of A is not the same as the expected value of the inverse! The latter is more challenging to compute. Nevertheless, the following result shows that, for sensible choices of the prior parameters A₀ and W₀, the estimate A_M⁻¹ exists.

Lemma 19.3. Assume A₀ and W₀ are spd, and the search directions S are chosen to be linearly independent. Then, for our assumption of spd A, the inverse (19.12) exists.

Proof. If A₀ is spd, its inverse exists. Y = AS, and products of spd matrices are spd. Thus, W₀A₀⁻¹A is spd, hence SᵀW₀A₀⁻¹AS is invertible. □

We thus have our first way to concretely realise Algorithm 17.2: its line 12 can be implemented by assigning the Gaussian prior

$$p(A) = \mathcal{N}(A; A_0, V_0 \otimes W_0), \quad \text{with } A_0, V_0, W_0 \text{ spd}, \qquad (19.13)$$

to A, which gives rise to the posterior mean of Eq. (19.10). If the directions S are independent, the inverse of the mean is given by Eq. (19.12) and provides the estimator Hᵢ.¹¹ That is, explicitly,

$$H_i = \big(A_0 + \Delta_M(S^\top W_0S)^{-1}S^\top W_0\big)^{-1} = A_0^{-1} + (S - A_0^{-1}Y)(S^\top W_0A_0^{-1}Y)^{-1}S^\top W_0A_0^{-1}.$$

¹¹ If the columns of S are not linearly independent, this could be rectified internally in the solver by keeping track of a linearly independent sub-space. Doing so would require extra linear algebra operations (for example, it could be realised using the SVD of S).
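Again, the inverse estimator Hᵢ of Eq. (19.12) only requires M × M work beyond knowledge of A₀⁻¹. The following sketch (illustrative prior choices, continuing the setup of the previous snippet) verifies both that it really inverts the posterior mean and that it maps observations back to actions, HᵢY = S:

import numpy as np

rng = np.random.default_rng(11)
N, M = 6, 2
Lm = rng.standard_normal((N, N))
A_true = Lm @ Lm.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A_true @ S

A0, W0 = np.eye(N), np.eye(N)                # spd prior parameters (illustrative)
A0_inv = np.linalg.inv(A0)

# inverse of the posterior mean via Eq. (19.12): again only M x M work
small = S.T @ W0 @ A0_inv @ Y                # invertible by Lemma 19.3
Hi = A0_inv + (S - A0_inv @ Y) @ np.linalg.solve(small, S.T @ W0 @ A0_inv)

AM = A0 + (Y - A0 @ S) @ np.linalg.solve(S.T @ W0 @ S, S.T @ W0)
assert np.allclose(Hi @ AM, np.eye(N))       # H_i really is the inverse of A_M
assert np.allclose(Hi @ Y, S)                # and it maps observations to actions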

► 19.5 Encoding Symmetry

We defined the problem Ax = b with the assumption that A is a


symmetric positive definite matrix. The priors on A or A-1 con­
sidered so far, however, do not encode this knowledge yet. They
156 III Linear Algebra

assign non-zero probability density to all real-valued matrices.


This is not just an aesthetic shortcoming: §18.2 guarantees conjugacy of the search directions if the estimator H_M is symmetric; but the estimator of Eq. (19.12) is not symmetric in general. We will now find that symmetry can be encoded in the prior, with a bit of legwork, without deviating from tractable Gaussian inference. This is because symmetry is a linear constraint; it can be written as the set of linear equations

A_ij − A_ji = 0,  ∀ 1 ≤ i, j ≤ N.

From Eq. (3.4), we recall that the Gaussian family is closed under such constraints. In contrast, positive definiteness is a considerably more intricate property, because the positive definite cone is not a linear sub-space of R^{N²}. A discussion of positive definiteness follows in §19.6.

To capture symmetry, we will begin by simplifying the prior covariance from Eq. (19.13) to the choice V_0 = W_0, i.e.

p(A) = N(A; A_0, W_0 ⊗ W_0).                                                       (19.14)

This step alone does not encode symmetry (samples from this distribution are still asymmetric with probability one), but it avoids technical complications in the following.

For a formal treatment, we introduce two projection operators acting on the space R^{N²} of (vectorised) square N × N matrices:

Π_⊕ : R^{N²} → R^{N²},  with elements  [Π_⊕]_{(ij),(kℓ)} := 1/2(δ_ik δ_jℓ + δ_iℓ δ_jk),

is the projection onto the space of symmetric matrices. It has the characteristic property

Π_⊕ X = 1/2(X + X^T).

Π_⊖ : R^{N²} → R^{N²},  with elements  [Π_⊖]_{(ij),(kℓ)} := 1/2(δ_ik δ_jℓ − δ_iℓ δ_jk),

is the projection onto the space of anti-symmetric¹² matrices. It has the characteristic property

Π_⊖ X = 1/2(X − X^T).

¹² Also known as skew-symmetric matrices.

¹³ This also implies that every matrix X ∈ R^{N×N} can be written as a sum of a symmetric and an antisymmetric part, X = Π_⊕ X + Π_⊖ X, which holds simply because X = 1/2(X + X^T + X − X^T).

It is easy to convince oneself that Π_⊕ and Π_⊖ are orthogonal projection operators that jointly span R^{N²}, i.e. that¹³

Π_⊕ Π_⊕ = Π_⊕,   Π_⊖ Π_⊖ = Π_⊖,
Π_⊕^T = Π_⊕,     Π_⊖^T = Π_⊖,  and
Π_⊕ Π_⊖ = Π_⊖ Π_⊕ = 0_{N²},   Π_⊕ + Π_⊖ = I_{N²}.                                  (19.15)

We also notice that, for Kronecker products of a matrix W with itself, we have (leaving out a few simple lines of algebra)

[Π_⊕ (W ⊗ W) Π_⊖]_{(ij),(kℓ)} = 1/4 (W_ik W_jℓ − W_iℓ W_jk − W_ik W_jℓ + W_iℓ W_jk) = 0.

So a Kronecker product of this type can be written as a direct sum of a "symmetric" part and an "anti-symmetric" part:

W ⊗ W = Π_⊕ (W ⊗ W) Π_⊕ + Π_⊖ (W ⊗ W) Π_⊖ =: (W ⊛ W) + (W ⊘ W).

These two products are known¹⁴ as the symmetric Kronecker product W ⊛ W and the skew-symmetric Kronecker product W ⊘ W, with elements

[C ⊛ D]_{(ij),(kℓ)} = 1/2(C_ik D_jℓ + C_iℓ D_jk),  and                              (19.16)
[C ⊘ D]_{(ij),(kℓ)} = 1/2(C_ik D_jℓ − C_iℓ D_jk).

¹⁴ van Loan (2000)

Exercise 19.4 (easy). Show that for matrices X drawn from the Wishart distribution W(X; V, ν) (defined in Eq. (19.1)) the elements of X have the covariance cov(X_ij, X_kℓ) = 2ν (V ⊛ V)_{(ij),(kℓ)}. Hint: Use the generative definition of Eq. (19.2). The fourth moment of a central normal distribution is given by Isserlis' theorem (Isserlis, 1918): p(a) = N(a; 0, Σ) implies E(a_i a_j a_k a_ℓ) = Σ_ij Σ_kℓ + Σ_ik Σ_jℓ + Σ_iℓ Σ_jk.

They inherit some, but not all, of the great properties of the Kronecker product. In particular, when applied to symmetric matrices,¹⁵

(W ⊛ W)^{-1} = W^{-1} ⊛ W^{-1}.

However,¹⁶ for general C and D,

(C ⊛ D)^{-1} ≠ C^{-1} ⊛ D^{-1}.

We also have

(C ⊛ D) X = 1/2(C X D^T + C X^T D^T),  and                                          (19.17)
(C ⊘ D) X = 1/2(C X D^T − C X^T D^T).

¹⁵ If W ∈ R^{N×N} is of full rank, the matrix W ⊛ W has rank 1/2 N(N + 1), the dimension of the space of all real symmetric N × N matrices. That its inverse on that space is given by W^{-1} ⊛ W^{-1} can be seen from Eq. (19.17). The inverse on asymmetric matrices is not defined.

¹⁶ Alizadeh, Haeberly, and Overton (1988)

Using this framework, the information about A's symmetry can be explicitly written as an observation with likelihood

p(⊖ | A) = δ(Π_⊖ A − 0) = lim_{β→0} N(0_{N²}; Π_⊖ A, β I_{N²}).                     (19.18)
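Before moving on, note that Eq. (19.17) is computationally significant: it lets one apply the N² × N² operators C ⊛ D and C ⊘ D to a matrix at the cost of a few N × N matrix products, without ever forming them. A small NumPy sketch of ours, for illustration:

import numpy as np

def sym_kron_apply(C, D, X):
    """(C sym-Kronecker D) X = 1/2 (C X D^T + C X^T D^T), per Eq. (19.17)."""
    return 0.5 * (C @ X @ D.T + C @ X.T @ D.T)

def skew_kron_apply(C, D, X):
    """(C skew-Kronecker D) X = 1/2 (C X D^T - C X^T D^T), per Eq. (19.17)."""
    return 0.5 * (C @ X @ D.T - C @ X.T @ D.T)

For symmetric X and invertible W, applying sym_kron_apply(W, W, ·) and then sym_kron_apply with W^{-1} recovers X, which illustrates the inversion identity (W ⊛ W)^{-1} = W^{-1} ⊛ W^{-1} on symmetric matrices.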

By Eqs. (3.8) and (3.10), the density on A resulting from conditioning the prior (19.14) with this likelihood is

p(A | ⊖) = N(A;  A_0 − Σ_0 Π_⊖^T (Π_⊖ Σ_0 Π_⊖^T)^{-1} Π_⊖ A_0,
                 Σ_0 − Σ_0 Π_⊖^T (Π_⊖ Σ_0 Π_⊖^T)^{-1} Π_⊖ Σ_0).                     (19.19)

Using the Kronecker structure Σ_0 = W_0 ⊗ W_0 and the identity from Eq. (19.15), the mean reduces to its symmetric part:

A_0 − (Π_⊖ + Π_⊕) Σ_0 Π_⊖^T (Π_⊖ Σ_0 Π_⊖^T)^{-1} Π_⊖ A_0 = (Π_⊖ + Π_⊕) A_0 − Π_⊖ A_0 = Π_⊕ A_0,

while the covariance simplifies to a symmetric Kronecker product:

Σ_0 − Σ_0 Π_⊖^T (Π_⊖ Σ_0 Π_⊖^T)^{-1} Π_⊖ Σ_0 = Π_⊕ Σ_0 Π_⊕ = W_0 ⊛ W_0.

The resulting probabilistic model

p(A) = N(A; A_0, W_0 ⊛ W_0),                                                       (19.20)

with a symmetric prior mean matrix A_0, only assigns non-zero measure to symmetric matrices; samples from Eq. (19.20) are depicted in Figure 19.4. Like for the Kronecker product covariance, this prior, too, can be conditioned on the linear observations Y = AS, giving a posterior

p(A | Y, S, ⊖) = N(A; A_M, Σ_M).

Some algebraic footwork¹⁷ is required to find the posterior mean and covariance, the analogues to Eqs. (19.10) and (19.11). They are

A_M = A_0 + (Y − A_0 S)(S^T W_0 S)^{-1} S^T W_0                                     (19.21)
      + W_0 S (S^T W_0 S)^{-1} (Y − A_0 S)^T
      − W_0 S (S^T W_0 S)^{-1} S^T (Y − A_0 S)(S^T W_0 S)^{-1} S^T W_0,

Σ_M = W_M ⊛ W_M,  with (as in Eq. (19.11))                                          (19.22)
W_M := W_0 − W_0 S (S^T W_0 S)^{-1} S^T W_0.

¹⁷ For a derivation, see Hennig (2015).

Figure 19.4: Samples from a Gaussian prior encoding symmetry. Results analogous to Figure 19.3. Five i.i.d. samples from the distribution of Eq. (19.20) for W_0 = I (left column) and W_0 = diag[10², 9², 8², ...] (right column), respectively. Note the differing choice for W_0 relative to Figure 19.3, since W_0 here enters both terms of the product.

Exercise 19.5 (moderate). Explicitly compute the evidence term ∫ δ(Y − AS) N(A; A_0, W_0 ⊛ W_0) dA. What is its form?

Exercise 19.6 (hard). Derive the result in Eq. (19.21). In performing the derivation, try to gain an intuition for why the posterior mean (19.21) is not simply the symmetrised form Π_⊕ A_M of the posterior mean from Eq. (19.10) (consider Eq. (19.19) for a hint).

These expressions, in particular the posterior mean A_M, play a central role not just in linear solvers, but also in nonlinear optimisation. Let us take a closer look. We first note that A_M is indeed symmetric if A_0 is symmetric, because S^T Y = S^T A S is symmetric. What is less obvious is that the update added to A_0 in Eq. (19.21) is of at most rank 2M. This can be seen by defining the helpful terms

U := W_0 S (S^T W_0 S)^{-1} ∈ R^{N×M}  and                                          (19.23)
V := (I − 1/2 U S^T)(Y − A_0 S) ∈ R^{N×M}.

Using these, A_M can be written as

A_M = A_0 + U V^T + V U^T = A_0 + [U  V] [ 0    I_M ] [U^T]
                                         [ I_M  0   ] [V^T].

The inverse of this expression, if it exists, follows from the matrix inversion lemma:

A_M^{-1} = A_0^{-1} − A_0^{-1} [U  V] [ U^T A_0^{-1} U      U^T A_0^{-1} V + I ]^{-1} [U^T] A_0^{-1}.    (19.24)
                                      [ V^T A_0^{-1} U + I  V^T A_0^{-1} V     ]      [V^T]

There is no particularly enlightening way to simplify this expression for general model parameters A_0, W_0. But note that the matrix to be inverted on the right-hand side is of size 2M × 2M. So, assuming the inverse of the prior mean A_0 is known or easy to compute, computation of this inverse estimator has complexity of at most O(M³) (and multiplying with it has cost O(NM²)). Below we will find that there are special instances for which the complexity is significantly lower.
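The rank-2M structure translates directly into code. A hedged NumPy sketch of ours (all names hypothetical) of Eqs. (19.21)-(19.24):

import numpy as np

def symmetric_posterior_mean(A0, W0, S, Y):
    """A_M = A0 + U V^T + V U^T with U, V from Eq. (19.23), and its inverse
    via the 2M x 2M system of Eq. (19.24). Assumes A0 spd."""
    G = S.T @ W0 @ S
    U = W0 @ S @ np.linalg.inv(G)             # Eq. (19.23)
    Delta = Y - A0 @ S
    V = Delta - 0.5 * U @ (S.T @ Delta)
    A_M = A0 + U @ V.T + V @ U.T              # Eq. (19.21), a rank-2M update

    M = S.shape[1]
    A0inv = np.linalg.inv(A0)
    UV = np.concatenate([U, V], axis=1)       # N x 2M
    inner = UV.T @ A0inv @ UV                 # the four M x M blocks
    inner[:M, M:] += np.eye(M)                # + I in the off-diagonal blocks
    inner[M:, :M] += np.eye(M)
    H = A0inv - A0inv @ UV @ np.linalg.solve(inner, UV.T @ A0inv)   # Eq. (19.24)
    return A_M, H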

► 19.6 What about Positive Definiteness?

Encoding symmetry in the prior is analytically feasible because it amounts to a linear constraint (Eq. (19.18)). Since we assume throughout this chapter that A is not just symmetric, but also positive definite, it would be desirable to also encode this information in the prior. Unfortunately, the space of positive definite matrices is a cone, a nonlinear sub-space of R^{N²}, the space of all (vectorised) square N × N matrices, and also a nonlinear sub-space of R^{N(N+1)/2}, the space of symmetric such matrices (see Figure 19.5). Information about positive definiteness can thus not be captured in a Gaussian likelihood term using only linear terms of A.

Figure 19.5: Outer boundaries of the positive definite cone within the space of symmetric 2 × 2 matrices (the only case that allows a plot). The thick line down the centre of the cone marks scalar matrices. The outer edge of the 2 × 2 positive definite cone is given by matrices A with A_12 = A_21 = ±√(A_11 A_22).

It is, however, possible to scale the parameters of the Gaussian prior post hoc to ensure that the posterior mean estimate always lies within the positive definite cone. This is helpful in so far as it means this posterior point estimate can be trusted to be admissible, and this is how this correction is used in practice. From a probabilistic perspective, however, this is not particularly satisfying, since it means the model cannot make use of the known positive definiteness during inference.

The following is a minor generalisation of a derivation in a seminal review by Dennis and Moré (1977, §7.2), reproduced in some detail here because it provides valuable insights, also used in Chapter IV. Consider the symmetry-encoding prior (19.20),

which gives rise to the posterior mean of Eq. (19.21). Assume that the prior mean A_0 has been chosen to be symmetric positive definite. Since the posterior mean is of the same algebraic form (Gaussian, with a symmetric Kronecker product posterior covariance), the posterior can equivalently be computed iteratively, as rank-2 updates to the mean estimator. Consider the posterior belief

p(A | Y_i) = N(A; A_i, W_i ⊛ W_i),

conditioned on observations Y_i, the i matrix-vector multiplications y_j = A s_j for j = 1, ..., i.

Figure 19.6: A Gaussian prior measure of mean A_0 = 3I and symmetric Kronecker covariance with W = 3I, shown relative to the positive definite cone. The symmetric Kronecker product inherits some of the cone's structure in so far as the marginal variance of off-diagonal elements under this prior is half that of diagonal elements. But the distribution still assigns non-vanishing measure to the indefinite matrices outside of the cone.

Using p(A | Y_i), and given the next observation y_{i+1} = A s_{i+1}, we can calculate a Gaussian posterior on A with the mean and covariance

A_{i+1} = A_i + ((y_i − A_i s_i) s_i^T W_i + W_i s_i (y_i − A_i s_i)^T) / (s_i^T W_i s_i)
          − ((y_i − A_i s_i)^T s_i) (W_i s_i s_i^T W_i) / (s_i^T W_i s_i)²           (19.25)
        = A_i + u v^T + v u^T,

for (see Eq. (19.23))

u := W_i s_i / (s_i^T W_i s_i),
v := (y_i − A_i s_i) − (W_i s_i s_i^T (y_i − A_i s_i)) / (2 s_i^T W_i s_i),
W_{i+1} = W_i − (W_i s_i s_i^T W_i) / (s_i^T W_i s_i).
Each posterior mean update is thus a rank-2 update. This iterative form is more manageable from an analytic perspective than the immediate form of Eq. (19.21). The idea is now to ask for a value of W_0 such that A_i can be shown by induction to be positive definite, a notion that Dennis and Moré call hereditary positive definiteness. For this we make use of a result from matrix perturbation theory.¹⁸ Intuitively speaking, a rank-1 update can at most shift the eigenvalues of the original matrix up or down to the value of the nearest neighbouring eigenvalues.

¹⁸ Wilkinson (1965), pp. 95-98

Lemma 19.7. Let A ∈ R^{N×N} be symmetric with eigenvalues

λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N,

and let A* := A + α a a^T, with a ∈ R^N, α ∈ R. If α > 0, then A* has eigenvalues λ*_i such that

λ_1 ≤ λ*_1 ≤ λ_2 ≤ ⋯ ≤ λ_N ≤ λ*_N.



If α < 0, then the eigenvalues of A* can be arranged so that

λ*_1 ≤ λ_1 ≤ λ*_2 ≤ ⋯ ≤ λ*_N ≤ λ_N.

Now note that the rank-2 update in Eq. (19.25) can be written as the sum of two symmetric rank-1 matrices:

A_{i+1} = A_i + 1/2((u + v)(u + v)^T − (u − v)(u − v)^T).                           (19.26)

This insight can be used to arrive at the following helpful statement about rank-2 updates.

Lemma 19.8 (Dennis and Moré, Thm. 7.5). Consider the updated estimator A_{i+1} as defined in Eq. (19.25). Assuming that A_i is positive definite, and s_i^T W_i s_i ≠ 0, the matrix A_{i+1} is positive definite if and only if det A_{i+1} > 0.

To leverage this result, we make use of the matrix determinant lemma, Eq. (15.11). Conveniently rearranged for the present context, it reads

det(I + a b^T + b a^T) = (1 + a^T b)² − (a^T a)(b^T b).

Thus, the determinant of the updated A_{i+1} is

det A_{i+1} = det A_i · ((y_i^T A_i^{-1} W_i s_i)² − (y_i^T A_i^{-1} y_i)(s_i^T W_i A_i^{-1} W_i s_i) + (s_i^T W_i A_i^{-1} W_i s_i)(y_i^T s_i)) / (s_i^T W_i s_i)².

Since W_i is positive semi-definite, by Lemma 19.8, A_{i+1} is thus symmetric positive definite if and only if

y_i^T s_i > y_i^T A_i^{-1} y_i − (y_i^T A_i^{-1} W_i s_i)² / (s_i^T W_i A_i^{-1} W_i s_i).          (19.27)
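A compact numerical check of this condition (a sketch of ours, not the book's code) implements the rank-2 update of Eq. (19.25) and evaluates the two sides of Eq. (19.27):

import numpy as np

def rank2_update(A_i, W_i, s, y):
    """One mean update of Eq. (19.25)."""
    r = y - A_i @ s                      # residual
    Ws = W_i @ s
    sWs = s @ Ws
    u = Ws / sWs
    v = r - Ws * (s @ r) / (2.0 * sWs)
    return A_i + np.outer(u, v) + np.outer(v, u)

def hereditary_condition(A_i, W_i, s, y):
    """True iff Eq. (19.27) holds, i.e. the updated mean stays positive definite."""
    Ainv_y = np.linalg.solve(A_i, y)
    Ainv_Ws = np.linalg.solve(A_i, W_i @ s)
    rhs = y @ Ainv_y - (y @ Ainv_Ws) ** 2 / (s @ W_i @ Ainv_Ws)
    return y @ s > rhs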
There are two interesting special cases for which this condition can be simplified. They are summarised in the following two statements.

Corollary 19.9. Assume A_0 is symmetric positive definite, A_i is the posterior mean estimator of Eq. (19.21), and Algorithm 17.2 is used to construct search directions s_i, i = 1, ... (in particular, this means the search directions are conjugate). If W_i has the property that W_i s_i = y_i, then all A_i are symmetric positive semi-definite. (This is the case for the unrealistic, but conceptually interesting, parameter choice W_0 = A.)

Proof of Corollary 19.9. Under the assumptions of the Corollary, the right-hand side of Eq. (19.27) vanishes, and the condition is always fulfilled: by assumption, A is spd and y_i^T s_i = s_i^T A s_i. Algorithm 17.2 ensures the search directions s_i are conjugate under A, and thus

W_i s_i = A s_i − Y (S^T Y)^{-1} Y^T s_i = y_i − 0.  □

Theorem 19.10 (proof on p. 187). Assume A_0 = αI, W_0 = βI for α, β ∈ R_+, A_i is the posterior estimator of Eq. (19.21), and the search directions S are conjugate to each other. Then there exists a finite α_0 > 0, so that any choice α > α_0 ensures that all A_i are positive semi-definite.

In general, the posterior mean is not guaranteed to be positive definite (see Figure 19.7). But Corollary 19.9 and Theorem 19.10 establish two different paths to guarantee positive definiteness of the posterior mean: Corollary 19.9 establishes that if we choose W_0 such that W_0 S = Y and set any positive definite prior mean A_0, then inference on a positive definite A will always produce positive definite posterior means (see Figure 19.8). If we consider scalar covariances instead, then Theorem 19.10 provides the weaker statement that, in that setting, it is at least possible to "drag the posterior mean into the positive definite cone" by increasing the scale of the prior mean.

Figure 19.7: Gaussian inference under the Gaussian prior of Figure 19.6 and the projection s = [1, 0]^T, on the spd matrix A with A_11 = A_22 = 9, A_12 = 0.7 · 9 (black circle). The plots on the "left wall" of the plot show the projections of the prior and A into the observation space [A_11, A_12]^T. Although both the prior mean and the true matrix are symmetric positive definite, the posterior mean (black square, connected to A_0 by a dashed line) lies outside of the cone. Theorem 19.10 shows that one way to fix this is to increase the prior mean. The graphical representation of this result is that A_1 always lies on the black projection line connecting A and y (recommended instant exercise: why?). As the prior mean increases, A_1 eventually moves along that line into the cone.

Both statements, however, are dissatisfying from the probabilistic perspective, for two reasons. First, they are just statements about the mean. The posterior distribution, being Gaussian, will always put non-zero measure on parts of the real vector-space outside of the positive definite cone. Second, these statements are of a post-hoc nature, literally: the prior still puts mass outside of the cone (see Figure 19.6). The information that A is positive definite, available a priori, can thus not be leveraged by the solver in its action policy. The real value of prior information is that it can change the way the algorithm acts, not just the final estimate. At the time of writing, there is no clear solution to this problem.

► 19.7 Summary: Gaussian Linear Solvers

We are interested in Gaussian inference models that can be made consistent with the evaluation strategy and the estimates constructed by iterative linear solvers for symmetric positive definite matrices. The derivations of the preceding pages show that even these comparably concrete constraints still allow for a diverse space of models. Table 19.1 summarises the four different candidates for a Gaussian framework of linear solvers. Aiming to solve Ax = b for x, assuming that A is symmetric positive definite, we adopt the algorithmic paradigm of an iterative solver, as defined in Table 17.2, constructing projection-observation pairs S = [s_1, ..., s_M] ∈ R^{N×M} and Y = AS = [y_1, ..., y_M] ∈ R^{N×M}.

Figure 19.8: Analogous to Figure 19.7, but with the covariance choice W = A considered in Corollary 19.9. Under this choice, the posterior mean always lies "to the right" of the true A along the projection line, thus in the positive definite cone.

To endow that algorithm class with a probabilistic meaning, we might directly model the matrix inverse H. Modelling H allows a joint Gaussian model over both H and the solution x. Alternatively, we might model the matrix A. Modelling A allows direct treatment of Gaussian observation noise, and still

gives rise to a low-rank mean estimate for A, thus an easily computable estimate for the inverse A^{-1}. In either model class, we can also decide to explicitly model the symmetry of the matrix, or to not encode this information in the probability measure. Modelling symmetry complicates the derivations somewhat, but turns Algorithm 17.2 into a conjugate direction method.

Table 19.1: A summary of the four different classes of probabilistic models under consideration for the construction of linear solvers. Each cell lists the form of the prior over matrix elements, the observation likelihood, and the posterior mean and covariance over the matrix elements thus arising. See text for a discussion.

Model for H, asymmetric:
    p(H) = N(H_0, V_0 ⊗ W_0)
    p(Y, S | H) = δ(S − HY) = lim_{γ→0} N(S; HY, γ²(I_N ⊗ I_M))
    E_{|Y}(H) = H_0 + (S − H_0 Y)(Y^T W_0 Y)^{-1} Y^T W_0,  writing Δ := S − H_0 Y and U^T := (Y^T W_0 Y)^{-1} Y^T W_0
    V_{|Y}(H) = V_0 ⊗ W_0 (I − Y U^T)

Model for H, symmetric:
    p(H) = N(H_0, W_0 ⊛ W_0)
    p(Y, S | H) = δ(S − HY) = lim_{γ→0} N(S; HY, γ²(I_N ⊛ I_M))
    E_{|Y}(H) = H_0 + Δ U^T + U Δ^T − U (Y^T Δ) U^T
    V_{|Y}(H) = W_0 (I − Y U^T) ⊛ W_0 (I − Y U^T)

Model for A, asymmetric:
    p(A) = N(A_0, V_0 ⊗ W_0)
    p(Y, S | A) = δ(Y − AS) = lim_{γ→0} N(Y; AS, γ²(I_N ⊗ I_M))
    E_{|Y}(A) = A_0 + (Y − A_0 S)(S^T W_0 S)^{-1} S^T W_0,  writing Δ := Y − A_0 S and U^T := (S^T W_0 S)^{-1} S^T W_0
    V_{|Y}(A) = V_0 ⊗ W_0 (I − S U^T)

Model for A, symmetric:
    p(A) = N(A_0, W_0 ⊛ W_0)
    p(Y, S | A) = δ(Y − AS) = lim_{γ→0} N(Y; AS, γ²(I_N ⊛ I_M))
    E_{|Y}(A) = A_0 + Δ U^T + U Δ^T − U (S^T Δ) U^T
    V_{|Y}(A) = W_0 (I − S U^T) ⊛ W_0 (I − S U^T)

Before making an explicit link between certain classes of classic solvers and particular choices within those four families, we point out some subtle connections and differences between these four model choices.

Although the observation likelihoods in all four cases are Dirac point masses on a projection AS or HY, these point masses arise from different limit processes: if the prior encodes symmetry, i.e. uses the symmetric Kronecker product ⊛, then the likelihood has to also involve a symmetric Kronecker product in the covariance; otherwise, the posterior does not inherit the symmetric property, and both the posterior mean and covariance do not have the compact forms listed in Table 19.1. Without going into further details, this is the principal reason why it is not straightforward to extend this framework to the noisy setting, i.e. a situation in which matrix-vector multiplications can only be computed with (Gaussian) noise.

► 19.8 Consistency between Beliefs on the Matrix and its Inverse

As we have already noted in §19.2, there are structural differences between a Gaussian model over A and one over its inverse H. These differences reflect a larger point: one should be very careful when considering the inverse of a Gaussian random variable x ~ N(μ, σ²) if the signal-to-noise ratio is small, μ/σ ≲ 1 (see Figure 19.9). But if the ratio is large, the inverse is relatively well approximated by a Gaussian itself. What does this mean for our matrix-valued Gaussian beliefs? Are there situations in which a Gaussian prior on the matrix A can be associated with a reciprocal Gaussian belief on H; and does this relationship persist as observations arrive and the posterior evolves? This section summarises some such results.¹⁹

Figure 19.9: The inverse x^{-1} of a scalar Gaussian random variable x ~ N(μ, 1) has a complicated distribution if |μ| ≲ 1 (see also Figure 19.1). For μ ≠ 0, the mean (solid line) is given, in a principal value sense, by Dawson's function (Lecomte, 2013),

E_{N(μ,1)}(x^{-1}) = √(π/2) e^{−μ²/2} erfi(μ/√2),

where erfi(z) = −i erf(iz). However, for μ ≫ 1, the distribution p(x^{-1}) is relatively well approximated by N(x^{-1}; μ^{-1}, μ^{-2}) (dashed line at μ^{-1}, dotted lines at μ^{-1} ± 2√(μ^{-2})). The plot also shows 20 samples (x + μ)^{-1}. For μ = 0, the mean does not exist.

¹⁹ These statements are all from Wenger and Hennig (2020), where proofs can also be found.
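The scalar phenomenon is easy to reproduce numerically (our own snippet, not from the book): draw samples of x ~ N(μ, 1) and compare the empirical distribution of 1/x with the Gaussian approximation N(μ^{-1}, μ^{-2}).

import numpy as np

rng = np.random.default_rng(0)
for mu in [0.5, 2.0, 10.0]:
    inv = 1.0 / rng.normal(mu, 1.0, size=100_000)
    # quantiles as a robust summary: for small mu the moments of 1/x diverge
    q = np.quantile(inv, [0.1, 0.5, 0.9])
    print(f"mu={mu}: quantiles={q}, Gaussian approx N({1/mu:.3f}, {mu**-2:.4f})")

For μ = 10 the quantiles closely match the Gaussian approximation; for μ = 0.5 they do not, illustrating the small signal-to-noise regime discussed above.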
Definition 19.11 (Posterior correspondence). Consider two solvers of the form of Algorithm 17.2, one with a belief on A with prior mean A_0 and a covariance parameter W_0^A (with associated posterior mean A_M), and one maintaining a distribution on H with prior mean H_0 and covariance parameter W_0^H (with associated posterior mean H_M). We say their priors induce posterior correspondence if

A_M^{-1} = H_M  for 0 ≤ M ≤ N.                                                     (19.28)

And we speak of weak posterior correspondence if we only have

A_M^{-1} Y = H_M Y.                                                                (19.29)

For asymmetric models, it is then possible to first show a general result:

Lemma 19.12. Assume an asymmetric prior. Let 1 ≤ M ≤ N, W_0^A, W_0^H symmetric positive definite, and assume A_0^{-1} = H_0. Then posterior correspondence (Eq. (19.28)) holds if, and only if,

0 = (AS − A_0 S)[(S^T W_0^A A_0^{-1} A S)^{-1} S^T W_0^A A_0^{-1} − (S^T A^T W_0^H A S)^{-1} S^T A^T W_0^H].

For example, a specific choice that fits this structure is A_0 = α_0 I, W_0^A = β_0 A, paired with H_0 = α_0^{-1} I, W_0^H = (β_0/α_0) I. A more general, and more practical, form will be developed in §21.

Symmetric models are more restricted. But we can still show weak correspondence:

Theorem 19.13. Assume a symmetric prior. Let 1 ≤ M ≤ N, W_0^A, W_0^H symmetric positive definite, and assume A_0^{-1} = H_0. Further assume that W_0^A, A_0, W_0^H satisfy

W_0^A S = Y,  and
S^T (W_0^A A_0^{-1} − A W_0^H) = 0.

Then weak posterior correspondence (Eq. (19.29)) holds.

► 19.9 Gaussian Models Consistent with Classic Solvers

By inspecting the above in light of the theorems in §17 onward, we can now establish the following results.

Theorem 19.14 (Probabilistic projection methods). Any Gaussian generative model on elements of A, p(A) = N(A; A_0, Σ^A), or on elements of the inverse H, p(H) = N(H; H_0, Σ^H), when used in the form of Algorithm 17.2, gives rise to a projection method.

Proof. By Eq. (19.7), the estimator arising from these priors obeys the consistency requirement for line 12 of Algorithm 17.2 and is thus a projection method by construction. □

Theorem 19.15 (Probabilistic conjugate direction methods). Any Gaussian generative model with symmetric Kronecker covariance for either A or H, i.e. p(A) = N(A; A_0, W^A ⊛ W^A) or p(H) = N(H; H_0, W^H ⊛ W^H), with symmetric positive definite A_0, W^A or H_0, W^H, when used in Algorithm 17.2, gives rise to a conjugate direction method.

Proof. The theorem follows immediately from Theorem 18.2, using the consistency of the Gaussian posterior (19.7), and the fact that the posterior estimates A_M or H_M are symmetric. □

Theorem 19.16 (Probabilistic conjugate gradients). Consider a prior p(H) = N(H; H_0, W^H ⊛ W^H). For all parameter choices of (H_0, W^H) with scalar H_0 = αI and W^H = βI + γH, for α ∈ R and β, γ ∈ R_+, Algorithm 17.2 is equivalent to the method of conjugate gradients, in the sense that it produces the exact same sequence of estimates x_i. The same is true for the model class p(A) = N(A; A_0, W^A ⊛ W^A) with scalar A_0 = αI and W^A = βI + γA.²⁰

²⁰ Yes, W^H includes the true, inaccessible, H, and, similarly, W^A includes A. This oddity is discussed in more depth in §20.
Proof. We will make use of Theorem 18.4. First, by Theorem 19.15, the algorithm constructed here is a conjugate directions method. For Theorem 18.4 to hold, we have to show that the estimator H_i resulting from these models and parameter choices satisfies Assumption (18.5), i.e. that the gradient r_{i−1} is mapped to s_i = −H_{i−1} r_{i−1} in the span of {S, Y, r_{i−1}}. We verify that this is true by considering the image of H_i as listed in the right column of Table 19.1. For the model on H, if S_{:i−1} denotes (s_1, ..., s_{i−1}), and analogously for Y_{:i−1}, then H_{i−1} maps any vector v ∈ R^N to the span of {H_0 v, S_{:i−1}, H_0 Y_{:i−1}, W^H Y_{:i−1}}. Hence, if H_0 is scalar and W^H = βI + γH, then (since HY = S), r_{i−1} is mapped to the span of {r_{i−1}, S_{:i−1}, Y_{:i−1}}.

For models on A, H_{i−1} maps r_{i−1} to the span of

{A_0^{-1} r_{i−1}, A_0^{-1} W^A S_{:i−1}, A_0^{-1} Y_{:i−1}, S_{:i−1}}.

For a scalar A_0 and W^A = βI + γA, using AS = Y, this is the span of {r_{i−1}, Y_{:i−1}, S_{:i−1}}. This concludes the proof. □

Corollary 19.17 (Probabilistic preconditioned conjugate gradients). Consider a prior p(A) = N(A; αK, β² K ⊛ K) with a positive definite matrix K that can be decomposed as K = C^T C, and scalars α, β ∈ R. Then Algorithm 17.2 is equivalent to the preconditioned method of conjugate gradients with the preconditioner K. The same is true for the model class p(H) = N(H; αK^{-1}, β²(K ⊛ K)^{-1}).

Proof. This result follows from Theorem 19.16, by noting that preconditioned cg amounts to running cg on the transformed problem (C^{-T} A C^{-1}) Cx = C^{-T} b. If we define the transformed quantities Ā := C^{-T} A C^{-1}, x̄ := Cx and b̄ := C^{-T} b, then, by Theorem 19.16, running conjugate gradients on Ā x̄ = b̄ is equivalent to running Algorithm 17.2 to infer Ā from the prior p(Ā) = N(Ā; αI, β²(I ⊛ I)). We note that this transformation of A can be written as

A = (C ⊗ C)^T Ā.

Thus, by Eq. (3.4), the associated belief over A is a Gaussian of the form²¹

p(A) = N(A; α (C ⊗ C)^T I, β² (C ⊗ C)^T (I ⊛ I)(C ⊗ C))
     = N(A; αK, β² K ⊛ K).

²¹ The step from the first to the second line can be shown from the definitions of the symmetric (Eq. (19.16)) and standard (Eq. (15.2)) Kronecker products.

An analogous computation for H = (C ⊗ C)^{-1} Ā^{-1} yields the corresponding prior for H. □

The above results, in particular Theorem 19.16, show that there is an interesting class of probabilistic interpretations for widely used iterative linear solvers. In the sense of these theorems, we may interpret an algorithm like cg as the policy of an agent with an internal probabilistic model for the matrix A, or its inverse H, who explicitly uses this model to iteratively estimate the solution of the linear problem Ax = b, and performs probabilistic inference from the observations collected in this process.
It may seem questionable that the exact A, or its inverse H, show up as possible parameter choices in Theorem 19.16. After all, these are the objects of interest in the inference scheme, and thus definitely not available at runtime. For the moment, we just note in passing that the posterior means H_M and A_M only contain W^H and W^A in the form W^H Y and W^A S, respectively. Since we know HY = S and AS = Y, respectively, we can replace W^H Y = (βI + γH) Y = βY + γS and, analogously, W^A S = (βI + γA) S = βS + γY. So the means, the point estimators, can be computed for these choices: they only contain aspects of the intractable W that are accessible at runtime (i.e. not A or H).

In contrast, the posterior covariances explicitly contain the matrix W. So if these error estimates are required, this "trick" cannot be used, at least not without further thought. A way to address this issue will be considered in §21. The principal idea will be that of empirical Bayesian inference: if the algorithm has already run for several iterations, and we know what structure it needs to have to be equivalent to cg, what is our best guess for a covariance matrix W that is both consistent with the algorithm's actions so far and also might preserve the equivalence in future steps? Of course, this will require some assumption of regularity about the matrix A itself.
20  Computational Constraints

► 20.1 Inference Interpretations of Algorithms vs Algorithms for Inference

In this chapter, we have so far established that there are self-contained probabilistic inference algorithms on matrix elements whose behaviour is consistent with that of existing linear solvers. These corroborate the view that computation is inference.

On a more practical level, a corollary of these results is that we can use existing (efficient, stable) implementations of linear solvers, cg in particular, and treat them as a source of data, producing informative action-output pairs (s_i, y_i = A s_i)_{i=1,...,M}. Combined with the kind of structured Gaussian priors studied above, these data then give rise to posteriors on A and H, and these posteriors have convenient properties: their posterior mean is a sum of A_0 and a term of low rank, thus easy to handle both analytically and computationally. And the good convergence properties¹ of cg translate into desirable convergence properties of the Gaussian posterior.

¹ Expositions on the convergence of cg and related methods can for example be found in §5.1 (p. 112 onwards) of Nocedal and Wright (1999) and in §11.4.4 of Golub and Van Loan (1996). A frequently used result is that cg after k + 1 steps, assuming exact computations, finds the solution x_{k+1} = P*_k(A) r_0, where P*_k is the (matrix) polynomial of degree k that solves the following optimisation problem over all such polynomials:

P*_k = argmin_{P_k} ||x_0 + P_k(A) r_0 − x_*||_A,                                   (20.1)

where x_* is the true solution of Ax = b, and ||v||_A := √(v^T A v). This result can be used to phrase the convergence in terms of the eigenvalue spectrum of A (e.g. p. 116 in Nocedal and Wright (1999)). In particular, if A has eigenvalues λ_1 ≤ ⋯ ≤ λ_N, then the error of cg after M + 1 steps is roughly given by

||x_{M+1} − x_*||_A ≲ ((λ_{N−M} − λ_1)/(λ_{N−M} + λ_1)) ||x_0 − x_*||_A.            (20.2)

Simply put, if A has K ≪ N large eigenvalues and N − K small ones, then cg finds a good estimate in only K steps. Since these optimisation steps are taken in the span of S_M (the Krylov sequence of Eq. (18.2)), this also means that the low-rank term in the posterior mean A_M approximately covers the dominant eigenvalues of A.

Thus, consider the data set S, Y ∈ R^{N×M} collected by running cg on A, b. If we adopt this practically minded approach, the desiderata for the Gaussian prior change. It no longer matters whether the prior is particularly consistent with the actions of the algorithm that collected the data. Instead, two new considerations arise:

Computational efficiency: The Gaussian posterior should have particularly low storage and evaluation cost. In particular, one would like to use the known properties of cg: orthogonal gradients, conjugate directions.

Uncertainty calibration: The posterior covariance (the main

"added value" of the probabilistic solver over a classic solver) should be analytically linked to the estimation error. We will find below that this intuitive desideratum is not straightforward to translate into a concrete analytical statement. Since the object of interest is matrix-valued, design criteria focussing on different aspects of the matrix lead to different choices for the prior.
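For concreteness: "treating cg as a source of data" can look like the following minimal, unpreconditioned textbook cg loop, which records the pairs (s_i, y_i) alongside the iterates. This is our own sketch, not a library interface:

import numpy as np

def cg_with_data(A, b, x0, M):
    """Run M steps of conjugate gradients on Ax = b, returning the
    solution estimate together with the action-output pairs S, Y."""
    x = x0.copy()
    r = b - A @ x                      # residual (negative gradient)
    d = r.copy()
    S, Y = [], []
    for _ in range(M):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        s, y = alpha * d, alpha * Ad   # s_i and y_i = A s_i
        S.append(s); Y.append(y)
        x = x + s
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x, np.stack(S, axis=1), np.stack(Y, axis=1)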

Assume we want to perform inference on A with a symmetry-encoding prior p(A) = N(A; A_0, W_0 ⊛ W_0). Both the computational and the calibration viewpoints suggest choosing

A_0 = 0,  and
W_0 such that W_0 S = Y.

Computationally, this choice is appealing, because then the Gram matrix S^T W_0 S = S^T Y = S^T A S is diagonal (since cg produces conjugate directions), and, due to A_0 = 0, all terms with S^T A_0 S in the posterior vanish.² From Eq. (19.21), the posterior mean is then given simply by

A_M = Y (S^T Y)^{-1} Y^T = Σ_{m=1}^M (y_m y_m^T)/(s_m^T y_m) = Ȳ Ȳ^T,               (20.3)

and can thus be stored in form of the N × M matrix Ȳ := Y (S^T Y)^{-1/2}. Since, for spd A, S^T Y = S^T A S is a diagonal matrix with only positive entries on the diagonal, its matrix square root can be computed simply from the real numbers √(s_m^T y_m) > 0.

² A scalar A_0 = α_0 I is also relatively easy to handle. Because cg produces orthogonal gradients and y_i = A s_i = r_i − r_{i−1}, the matrix Y^T Y is symmetric tridiagonal:

(Y^T Y)_ij = y_i^T y_j = (r_i − r_{i−1})^T (r_j − r_{j−1})
           = δ_ij (r_i^T r_i + r_{i−1}^T r_{i−1}) − δ_{i,j−1} r_i^T r_i − δ_{i,j+1} r_{i−1}^T r_{i−1}.

Tridiagonal problems can be efficiently solved in O(M) time and can thus be treated as "essentially trivial". For spd tridiagonal systems, a concrete example is Algorithm 4.3.6 (p. 181) of Golub and Van Loan (1996). This algorithm requires 8M floating point operations to solve the M × M system. In lapack (Anderson et al., 1999), the corresponding solvers are called xPTSV (for Positive definite Tridiagonal SolVer, with x = S, D, C, Z for single, double, complex, double complex, respectively).
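As a small illustration (our own sketch, not the book's code), assembling Ȳ from cg's collected pairs requires only the M diagonal entries s_m^T y_m:

import numpy as np

def lowrank_mean_factor(S, Y):
    """Ybar = Y (S^T Y)^{-1/2}, so that A_M = Ybar @ Ybar.T (Eq. (20.3)).
    Assumes S, Y were collected by cg on an spd A, so that S^T Y is
    diagonal with positive entries s_m^T y_m."""
    d = np.einsum('nm,nm->m', S, Y)     # the diagonal of S^T Y
    return Y / np.sqrt(d)               # scale each column of Y

# Applying A_M to a vector v then costs O(NM):  Ybar @ (Ybar.T @ v)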

Let us again consider the hypothetical choice

W_0 = A.

We find that it is also favourable from the viewpoint of uncertainty calibration: the prior p(A) = N(A; 0, W_0 ⊛ W_0) assigns element-wise variance

var_{p(A)}([A]_ij) = 1/2([W_0]_ii [W_0]_jj + [W_0]²_ij).                            (20.4)

Thus, for elements of the diagonal, we have what might be called perfect calibration; the true square error equals the expected square error:

([A]_ii − E_{p(A)}([A]_ii))² / E_{p(A)}(([A]_ii − E_{p(A)}([A]_ii))²) = 1.          (20.5)

For off-diagonal elements, the variance is an upper error bound,

([A]_ij − E_{p(A)}([A]_ij))² / E_{p(A)}(([A]_ij − E_{p(A)}([A]_ij))²) = [A]²_ij / (1/2([A]_ii [A]_jj + [A]²_ij)) ≤ 1,    (20.6)

because we assumed A to be spd, and thus

|[A]_ij| ≤ √([A]_ii [A]_jj),  ∀ 1 ≤ i, j ≤ N.

It would of course be preferable to achieve perfect calibration for all matrix elements, not just on the diagonal. But the symmetric Kronecker structure imposes restrictions on the form of the covariance, so we have to live with some trade-off. This situation is analogous to the one we encountered in the chapter on integration: to infer integrals, we had to model the integrand with a GP whose covariance function (kernel) was chosen such that the posterior mean can be analytically integrated. This cannot be achieved in general with perfect calibration, so we settled for under-confidence. In this chapter, to infer matrix inverses, we have to model the matrix with a Gaussian whose covariance is chosen such that the posterior mean can be analytically inverted. Again, we find that perfect calibration cannot be achieved, and settle for under-confidence.

Corollary 19.9 further supports the choice

W_0 = A,  A_0 = 0,

since A_M is then always positive semi-definite. Setting A_0 = 0 yields a rank-deficient estimator A_M, so we cannot use the matrix inversion lemma to compute an inverse. However, the pseudoinverse³ of A_M can be computed efficiently and has the right conceptual properties for many applications. For a factorised symmetric matrix like our A_M = Ȳ Ȳ^T, the pseudoinverse is given by

A_M⁺ = Ȳ (Ȳ^T Ȳ)^{-2} Ȳ^T.

Since Ȳ^T Ȳ is tridiagonal symmetric positive definite, its inverse can be computed in 8M operations (see note 2).

³ The Moore-Penrose pseudoinverse A⁺ of a matrix A is a generalisation of the inverse A^{-1} that exists for any matrix A. It is defined (for real-valued matrices) as a matrix with the properties

A A⁺ A = A,    A⁺ A A⁺ = A⁺,    (A A⁺)^T = A A⁺,    (A⁺ A)^T = A⁺ A.

(The concept seems to have been invented by Fredholm (1903) for operators, and discussed for matrices by Moore (1920).) The pseudoinverse yields the least-squares solution A⁺b for our linear problem Ax = b, in the sense that ||Ax − b||² ≥ ||A A⁺ b − b||² for all x ∈ R^N. For the choice A_0 = 0, A_M⁺ can also be seen as the natural limit of the estimator A_M^{-1} arising from A_0 = αI for small α, because, for general A,

A⁺ = lim_{α→0} (A^T A + αI)^{-1} A^T = lim_{α→0} A^T (A A^T + αI)^{-1}.

Alas, we can of course not set W_0 = A, since A is the very matrix we are trying to infer. We could set

W_0 = A_M

in what might be called an empirical Bayesian approach. This would ensure W_0 S = Y. But then the posterior variance vanishes

(from Table 19.1):

W_M := W_0 − W_0 S (S^T W_0 S)^{-1} S^T W_0 = Ȳ Ȳ^T − Ȳ Ȳ^T = 0.

So we will have to do something more elaborate. Section 21 discusses ways to estimate a prior covariance parameter W_0 that is consistent with the choice of A_M and achieves non-zero posterior variance according to different desiderata for calibration.
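For illustration, applying the pseudoinverse A_M⁺ from above only ever touches M × M quantities (our own sketch; a production version would replace the dense solves with an O(M) tridiagonal routine such as lapack's xPTSV):

import numpy as np

def pinv_apply(Ybar, v):
    """Apply A_M^+ = Ybar (Ybar^T Ybar)^{-2} Ybar^T to a vector v."""
    T = Ybar.T @ Ybar                       # M x M, tridiagonal spd
    w = Ybar.T @ v
    return Ybar @ np.linalg.solve(T, np.linalg.solve(T, w))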

► 20.2 Inferring the Solution x

Across this chapter, the linear problem Ax = b has been phrased in terms of finding the matrix inverse A^{-1}. This provides a way to find x for any b ∈ R^N. Depending on the application, however, it may be entirely enough to just solve for one specific b. In such cases the detour through the matrix inverse is not necessary, and we may wonder whether we can get away with less computational overhead. A derivation and analysis of linear solvers phrased as inference on x can be found in Cockayne et al. (2019a); the connection to matrix-based inference is further explored in Bartels et al. (2019).

This setting is where a formulation with a prior on H is more convenient than one with a prior on A. Because x = Hb is a linear map of H, any Gaussian prior

p(H) = N(H; H_0, Σ)  with Σ ∈ R^{N²×N²}

directly translates into a Gaussian prior on x, with

p(x = Hb) = N(x; H_0 b, (I ⊗ b)^T Σ (I ⊗ b)) =: N(x; x_0, Σ_0)  with Σ_0 ∈ R^{N×N}.

The (noise-free) observations AS = Y can then be interpreted in two different ways: either as linear projections of H of the form S = HY, or as linear projections of x = H^T b of the form

Y^T x = S^T b.                                                                     (20.7)

Equation (20.7) is a statement about only the M numbers in Y^T H^T b = S^T b ∈ R^M, rather than the N × M numbers in S, highlighting that inference on x is more limited, but also less expensive, than explicit inference on H followed by a projection on b. The direct Gaussian posterior on x has the form

p(x | Y^T x = S^T b) = N(x; x_M, Σ_M),                                              (20.8)
with x_M = x_0 + Σ_0 Y (Y^T Σ_0 Y)^{-1} (S^T b − Y^T x_0),
and  Σ_M = Σ_0 − Σ_0 Y (Y^T Σ_0 Y)^{-1} Y^T Σ_0.                                    (20.9)

For example, the non-symmetric prior p(H) = N(H; H_0, V_0 ⊗ W_0) induces the prior p(x) = N(x; x_0, Σ_0) with x_0 = H_0^T b and

Σ_0 = (I ⊗ b)^T (W_0 ⊗ V_0)(I ⊗ b) = (b^T V_0 b) W_0.

Exercise 20.1 (medium). The more general yet more expensive path is to condition the prior p(H) = N(H; H_0, W_0 ⊛ W_0) on the N × M observations S = HY, then project onto x. The resulting posterior mean is x_M = H_M b, with H_M from Table 19.1. Compute this projection H_M b and convince yourself that it has a more complicated form than Eq. (20.10) (e.g. it explicitly contains terms in H_0). What is the significance of these additional terms? Do they change the value of x_M? (Hint: Consider their effect on the space outside the span of Y^T W_0 b.)

On the other hand, the symmetric Kronecker-structured prior p(H) = N(H; H_0, W_0 ⊛ W_0) induces

p(x) = N(x; x_0, Σ_0)  with x_0 = H_0 b and
Σ_0 = (I ⊗ b)^T (W_0 ⊛ W_0)(I ⊗ b)
    = 1/2 (W_0 (b^T W_0 b) + W_0 b b^T W_0)
    =: 1/2 (β W_0 + b̃ b̃^T)  with b̃ := W_0 b, β := b^T W_0 b.

The matrix inversion lemma yields

(Y^T Σ_0 Y)^{-1} = 2 (β (Y^T W_0 Y) + Y^T b̃ b̃^T Y)^{-1}
                 = 2 β^{-1} (Y^T W_0 Y)^{-1} (I − (Y^T b̃ b̃^T Y (Y^T W_0 Y)^{-1}) / (β + b̃^T Y (Y^T W_0 Y)^{-1} Y^T b̃)),

and we see from Eqs. (20.8) and (20.9) that conditioning on Y^T x = S^T b gives rise to a posterior on x with mean

x_M = x_0 + (β W_0 Y + b̃ b̃^T Y) β^{-1} (Y^T W_0 Y)^{-1} (I − (Y^T b̃ b̃^T Y (Y^T W_0 Y)^{-1}) / (β + b̃^T Y (Y^T W_0 Y)^{-1} Y^T b̃)) (S^T b − Y^T x_0).    (20.10)

While these terms have a tedious structure, they can be managed efficiently by keeping track of the M-dimensional vector b^T W_0 Y and the M × M matrix Y^T W_0 Y. As before, this can be achieved in a particularly efficient manner if the choices of W_0 and Y are well matched to each other: if the observations Y are constructed by running cg and W_0 = ωI, then Y^T Y is a tridiagonal spd matrix.
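The bookkeeping of Eq. (20.10) fits in a few lines of NumPy (a hedged sketch of ours; all names hypothetical):

import numpy as np

def solution_posterior_mean(H0, W0, S, Y, b):
    """Posterior mean x_M of Eq. (20.10) under the symmetric
    Kronecker prior on H."""
    x0 = H0 @ b
    bt = W0 @ b                          # b-tilde = W_0 b
    beta = b @ bt                        # beta = b^T W_0 b
    G = Y.T @ W0 @ Y                     # M x M matrix Y^T W_0 Y
    u = Y.T @ bt
    Ginv_u = np.linalg.solve(G, u)
    resid = S.T @ b - Y.T @ x0
    inner = resid - u * (Ginv_u @ resid) / (beta + u @ Ginv_u)
    return x0 + (beta * (W0 @ Y) + np.outer(bt, u)) @ np.linalg.solve(G, inner) / beta

Only M-dimensional and M × M quantities are inverted; with W_0 = ωI and cg-generated Y, the solve against G = ω Y^T Y is a tridiagonal spd solve.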
21  Uncertainty Calibration

We know from the preceding chapters that we can use the iterations of classic cg to construct¹ a "good" (symmetric, positive definite, fast-converging, computationally efficient) mean estimate A_M. But we could have done all of this without a probabilistic formulation. The final goal of this chapter is to construct a tractable covariance that is both probabilistically consistent with A_M (i.e. both mean and covariance arise from the same generative model) and well calibrated, so that it can serve as a notion of uncertainty.

¹ We saw in Theorem 19.16 that p(A) = N(A; αI, β² I ⊛ I) is another prior consistent with conjugate gradients. Since it gives rise to a posterior mean that offers fewer of these good properties (it is more expensive, and not necessarily positive definite), it is less interesting. But its simpler structure allows a different form of uncertainty calibration, via a conjugate prior. This will not be further explored here, but interested readers can find a derivation in the appendix to this chapter, in §22.5.

Our approach to achieve this will be to set W_0 to a matrix that acts like A on the span of S, and to estimate its effect on the complement of this space using regularity assumptions about A. That is, W_0 could be chosen as the general form²

W_0 = Ȳ Ȳ^T + (I − S(S^T S)^{-1} S^T) Ω (I − S(S^T S)^{-1} S^T),                    (21.1)

with a general spd matrix Ω.

² See Hennig (2015). Additional discussion and further experiments can be found in Wenger and Hennig (2020), which, among other things, investigates the notion of Rayleigh regression.

with a general spd matrix Q.. The projection matrices surround­


ing Q. ensure that it only acts on the space not covered by
Am = YYT. For any such П, this choice of Wo evidently gives
rise to the posterior AM from Eq. (20.3).
For simplicity, we will only consider the simplest form of this
nature, a scalar matrix

П = w I. (21.2)

This is arguably a natural choice, too, because in the absence


of more specific prior knowledge about A, there is no way to
identify certain directions in the null-space of S over others.
A scalar Q. also simplifies Eq. (21.1) because the two terms on
176 III Linear Algebra

either side of Q. are projection matrices, thus we get

Wо = YYT + w (I - S (S TS)-1ST) and 3Recall, e.g. from Theorem 18.4, that for
cg, the space spanned by the directions
Wm = Wo - Wo S (ST Wo S) S T Wo S and that of the A-projections Y are
= Wo - YYT closely related. More precisely,

= w(I - S(STS)-1ST). span{ro, r1,..., rm-1}


= span{ro, y1,..., ym-1}
The scale w can then be interpreted as scaling the remaining = span{s1,...,sm}
uncertainty over the entire null-space of S, the space not yet = span{ro, Aro,..., Am-1ro}.
explored by cg.3 How should w be set? We already saw in For a proof, see e.g. Theorem 5.3 in No-
Eqs. (20.4)-(20.6) that the very (symmetric) Kronecker struc­ cedal and Wright (1999).
ture in the covariance that engenders the desirable low-rank
structure of the posterior mean also restricts the calibration of
uncertainty and causes a trade-off: calibrated uncertainty on the 4 See Vijayakumar and Schaal (2000), or
§2.5 in Rasmussen and Williams (2006).
diagonal elements implies under-confidence off the diagonal,
The data can be found, at the time
and conversely, calibrated uncertainty off the diagonal means of writing, at www.gaussianprocess.org/
over-confidence on the diagonal. We can thus expect to have to gpml/data/. It contains a time series of
trajectories mapping 21-dimensional in­
strike some balance between the two. puts x G R21 (positions, velocities, and
accelerations, respectively, of 7 joints of a
robot arm) to 7 output torques. The first
► 21.1 Rayleigh Regression of these torques is typically used as the
target y(x) G R for regression, as was
Figure 21.1 shows results from an empirical example, a run of done here, too. The entire training set
contains 44 484 input-output pairs. For
cg on a specific matrix. The SARCOS data set4 is a popular,
the purposes of this experiment, to allow
simple test setup for kernel regression. It was used to construct some comparisons to analytical values,
this was thinned by a factor of 1/3, to
a kernel ridge regression problem Ax = b with
N = 14 828 locations. The data was stan­
dardised to have vanishing mean and
A := kXX + a21 G R14828x 14828 and b := y, (21.3) unit covariance.

where k is the isotropic Gaussian kernel (Eq. (4.4)) with length-scale 2 and noise level σ = 0.1. On this problem, standard cg (Algorithm 16.1, with some adaptations for stability) was run for M = 300 steps. The plot shows the sequence of the Rayleigh coefficients, the projected values of A arising as

a(m) := (s_m^T A s_m)/(s_m^T s_m),

where s_m is the mth direction of cg. These coefficients are readily available during the run of the solver, because the term s_m^T A s_m (up to a linear-cost re-scaling) is computed in line 7 of Algorithm 17.2. From Eq. (21.3), there are straightforward upper and lower bounds, both for elements of A and for a(m). With the eigenvalues λ_1 ≥ ⋯ ≥ λ_N of A, we evidently have

λ_1 ≥ a(m) ≥ λ_N  for all m,

and thus also⁵

⁵ Both bounds hold because A is assumed to be spd, thus all its eigenvalues are real and non-negative. The upper bound holds because the trace is the sum of the eigenvalues. If k_XX = U D U^T is the eigenvalue decomposition of the spd matrix k_XX, then the lower bound holds because U D U^T + σ² I = U (D + σ² I) U^T. For this specific matrix, we also know from the functional form of k_XX (Eq. (4.4)) that [A]_ij ≤ 1 + σ² δ_ij, although such a bound is not immediately available for H = A^{-1}.

σ² ≤ (v^T A v)/(v^T v) ≤ tr A  for any v ∈ R^N.

Since the trace is the sum of all N eigenvalues, this upper bound is relatively loose (as Figure 21.1 confirms). But of course we do not know the eigenvalues λ_i themselves, so constructing a better uncertainty scale will require a little bit more work.

Figure 21.1: Scaled matrix projections collected by conjugate gradients for the SARCOS problem. The plot shows, as a function of the iteration index m, the value of the scalar projection a(m) := (s_m^T A s_m)/(s_m^T s_m) (black circles), where s_m is the search direction of cg. These observations for m = [3, 4, ..., 24, 50, 51, ..., 149] (vertical lines) are used to estimate a structured regression model for a(m) (see text for details, local models as thin black lines). The regression line of that model is shown as a broad grey curve. The plot also indicates the two strict upper (tr(A)) and lower (σ²) bounds on the eigenvalues of A, which are clearly loose (outside the plot range). The constant value of diagonal elements, which happens to be known for this problem, is indicated by a horizontal line. For comparison, the plot also shows the values of the 300 largest eigenvalues of A (dotted line).

While this experimental setup is not particularly challenging for cg, Figure 21.1 shows the typical behaviour of this iterative solver. The collected projections rapidly (i.e. in M ≪ N steps) decay as the solver explores an expanding sub-space of relevant directions. Those directions, while not identical to the eigenvectors of A corresponding to the largest eigenvalues (shown dotted in the plot), are related to them (see Eqs. (20.1) and (20.2)). The plot shows a behaviour frequently observed in practice: a small number of initial steps (in this example, from m = 1 to about m = 50) reveal large projections a_m. Then comes a "kink" in the plot, followed by a relatively continuous decay over a longer time scale. It is tempting to think of the first phase as revealing dominant "structure" in A, while the remainder is "noise". But since there are N − 50 ≫ 50 such suppressed directions, their overall influence is still significant. We also note that, while the a_m clearly exhibit a decaying trend, they do not in fact decrease monotonically.

Since the a(m) are readily available during the solver run, it is desirable to make additional use of them for uncertainty quantification, i.e. to set ω in Eq. (21.2) based on the progression of a(m). One possible use for the posterior mean A_M is to construct

a cheap estimator AMv for matrix-vector multiplications Av


with arbitrary v E RN, including for v E RN that lie outside
of the span S. If this is the target application, then w should
be set to provide the right scale for such projections. Since this
amounts to a statement about aspects of the matrix that have
not yet been observed, it must hinge on either prior knowledge
or prior assumptions about A and their implications for a(m).
Apart from the fact that a(m) comes “for free” during the solver
run, another nice aspect is that the a(m) are scalar. So we can
use relatively inexpensive univariate regression from the M
observations of a(m) to try and predict values for a(m > M),
and thus also the value vTAv. We call this process Rayleigh
regression.6 It is visually represented in Figure 21.1. The broad 6 Wenger and Hennig (2020)
grey curve isafit for a function of the form

a( m) = a2 + 10 ^1+^2 m + 10 ^3+^4 m, (21.4)

with real constants £1,...,£4. The constants were found by


a least squares fit (or equivalently, but more on-message, as
the posterior mean of parametric Gaussian regression) over
the transformed observations log10 a(m) on the region m =
[3,4,..., 24] (for £1, £2) and m = [50,51,..., 149] (for £3, £4, the
contribution of the previous term can be essentially ignored in
this domain, simplifying the fit). It is then possible to estimate
the average value of am from any particular stopping point M to
N, to get a candidate for the scale w:

wprojections — N M a(m) dm. (21.5)


Of course a specific parametric form like Eq. (21.4) is not always
available. A more general, and more automatic model based on
GP regression on the values of log a(m) can be found in Wenger
and Hennig (2020).
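A minimal implementation of this recipe (our own sketch, not the book's; the region boundaries follow the text, and for simplicity each region is fit while ignoring the other exponential term):

import numpy as np

def rayleigh_regression_scale(a, sigma2, M, N):
    """Fit the two-exponential model of Eq. (21.4) to observed Rayleigh
    coefficients a(1..M) and return omega from Eq. (21.5)."""
    m = np.arange(1, len(a) + 1)

    def linfit(lo, hi):
        # least-squares fit of log10(a(m) - sigma2) ~ xi_a + xi_b * m
        mask = (m >= lo) & (m <= hi)
        X = np.stack([np.ones(mask.sum()), m[mask]], axis=1)
        coeff, *_ = np.linalg.lstsq(X, np.log10(a[mask] - sigma2), rcond=None)
        return coeff

    xi1, xi2 = linfit(3, 24)        # fast initial decay
    xi3, xi4 = linfit(50, 149)      # slow long-term decay

    def term_integral(c0, c1):      # integral of 10^{c0 + c1 m} from M to N
        return (10 ** (c0 + c1 * N) - 10 ** (c0 + c1 * M)) / (c1 * np.log(10.0))

    return sigma2 + (term_integral(xi1, xi2) + term_integral(xi3, xi4)) / (N - M)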

> 21.1.1 Predicting General Matrix Projections

Under the posterior p(A) = N(A; A_M, W_M ⊛ W_M), the marginal over a matrix projection Av = (I ⊗ v)^T A is

p(Av) = N(Av; A_M v, 1/2 ((v^T W_M v) W_M + (W_M v)(v^T W_M))) =: N(Av; A_M v, Σ_v).

Figure 21.2 shows results from experiments with random directions v (with elements drawn from a shifted Gaussian, a uniform, and a binary distribution) on the SARCOS set-up described above (see figure caption). The predicted scale ω = 0.02 (fitted

using Eq. (21.5)) is not perfect; indeed, it would be surprising if it were, given the ad hoc nature of the fitted model. But it captures the scale of the vector elements quite well. The more conservative estimate ω = 1, let alone the hard upper bound ω = tr A, would give radically larger scales. So wide, in fact, that the corresponding pdf would not even be visible in the plot.

Figure 21.2: Predicting the projection Av after M = 300 steps of cg on the SARCOS problem (see Figure 21.1). For this plot, the elements of a vector v ∈ R^{14828} were drawn i.i.d. from a unit-mean Gaussian distribution, a uniform distribution, and as binary values [v]_i ∈ {−1, 1}, respectively for each panel. To simplify computations, all steps were only performed on a subset of 4000 randomly chosen indices of Av. The plot investigates the standardised variable z := Σ_v^{-1/2}(Av − E_{p(A|S,Y)}(Av)), using the matrix square root of Σ_v := cov_{p(A|S,Y)}(Av), the predictive covariance of Av under the posterior. If the probabilistic model were perfectly calibrated, the elements of this vector should be distributed like independent standard Gaussian random variables. The plot shows this standard Gaussian prediction for z_i (solid black) alongside, for three independent realisations of v, the empirical distribution of the actual elements z_i (histogram), and an empirical fit (dashed) of a Gaussian pdf to the elements of z. The means and standard deviations of such empirical distributions (estimated from 10 realisations of v, not shown) are printed in each plot. This figure uses the value for the scale ω fitted as described in §21.1.1, which gives ω_projections(M = 300) = 0.02 (in other words, for the naive setting ω = 1, the shown solid pdf would be about 50 times wider).

The plot also clearly shows that the sampled matrix projections are indeed modelled quite well by a Gaussian probability distribution. This is neither surprising nor particularly profound: since the elements of v are drawn from a probability distribution v_i ~ p_v, independent of each other and of the elements of A, the central limit theorem applies, and the elements

[Av]_i = Σ_j [A]_ij v_j

are approximately Gaussian distributed, with mean and variance

E_{p_v}([Av]_i) = E_{p_v}([v]_j) Σ_j [A]_ij,  and
var_{p_v}([Av]_i) = var_{p_v}([v]_j) Σ_j [A]²_ij.

Hence, the fact that the elements of Av have a Gaussian distribution is not surprising. What is reassuring is that the Gaussian posterior on Av manages to capture the two moments of this distribution rather well, even though it does not make use of (and has no access to) p_v. That the posterior mean A_M provides a very good prediction of arbitrary matrix-vector products is a

testament to the ability of the Lanczos process to quickly expand a sub-space of relevant directions, and not in itself related to the probabilistic side. The predicted variance Σ_v is a new, fundamentally probabilistic error estimate, emerging from the probabilistic interpretation of linear solvers. The relatively simple and computationally cheap, hand-crafted regression model of Eq. (21.4) on the cg step sizes manages to find the right scale for the remaining aspects of A not captured by the mean.
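The predictive mean and covariance of Av never require forming W_M explicitly. A sketch of ours, under the choices of Eqs. (20.3) and (21.2):

import numpy as np

def projection_predictive(Ybar, S, omega, v):
    """Predictive mean of Av and the action of its covariance Sigma_v,
    under p(A) = N(A_M, W_M sym-Kronecker W_M), with A_M = Ybar Ybar^T
    and W_M = omega (I - S (S^T S)^{-1} S^T)."""
    mean = Ybar @ (Ybar.T @ v)
    Wv = omega * (v - S @ np.linalg.solve(S.T @ S, S.T @ v))   # W_M v
    vWv = v @ Wv                                               # v^T W_M v

    def cov_apply(u):
        # Sigma_v u = 1/2 ( (v^T W_M v) W_M u + (W_M v)(W_M v)^T u )
        Pu = u - S @ np.linalg.solve(S.T @ S, S.T @ u)
        return 0.5 * (vWv * omega * Pu + Wv * (Wv @ u))

    return mean, cov_apply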

> 21.1.2 Predicting Individual Matrix Elements

The above section is an example in which the prediction of an unknown variable is apparently made easier by randomness. In our koan on randomness (§12.3), we argued that randomness removes structure. Here, the removal of structure by randomness leaves us with a genuinely Gaussian distribution, the Gaussian being the "least-structured" (in a maximum-entropy sense) of all distributions of a given mean and variance over the real line. One may argue that, in this case, the randomness is helpful because it ensures an asymptotically perfect fit to the posterior. But the more honest statement is that this randomness has washed out all remaining, interesting, structure. We are left with the plain old Gaussian, which we have allowed ourselves to model.

The estimation task becomes harder when we consider more explicit, deterministic aspects of the latent matrix A. As a particularly glaring case, we consider the prediction of individual matrix elements [A]_ij. The posterior marginal distribution on these scalars under our model, conditioned on cg's observations, is (see Eq. (20.4))

p([A]_ij | Y, S) = N([A]_ij; [A_M]_ij, 1/2([W_M]_ii [W_M]_jj + [W_M]²_ij)).

We already know from Eqs. (20.5) and (20.6) that there is no scalar ω (in fact, not even a full spd matrix W_0) such that the posterior variance is a tight prediction for the approximation error on all matrix elements. So we are forced to choose between a worst-case, hard error bound on all matrix elements, and a reasonably scaled error estimate that can be too small for some elements. Figure 21.3 shows the progression of the posterior distribution for an increasing number of cg steps on the SARCOS task introduced above (details in the caption; the figure only shows a small sub-set of the matrix elements for visibility). The top row

Figure 21.3: Contraction of the posterior


measure on A during a cg run on the
shows the posterior mean converging towards the true matrix SARCOS problem (see Figure 21.1). Top
row: Posterior mean AM on a (randomly
A. The bottom row shows the element-wise posterior marginal
sampled) subset of 1o index pairs, af­
variance,7 for the following two different choices of the scale ter M = 1, 3o, 25o steps of cg (the full
parameter w: 14 828 x 14,828 matrix is too big to print).
The target sub-matrix of A is shown for
A hard upper bound: From Eq. (20.4), we see that varp(A) ([A]ij) comparison on the right. Middle and
bottom row: Absolute estimation error
is an upper bound to the square estimation error [A - AM]i2j | A - AM |, scaled element-wise by the
if posterior standard deviation. The mid­
1 dle row shows the choice for the scaling
2 ([ W0 ] ii [ Wo] jj + [ Wo] ))) > IA - AM \i). w = 1 (grey-scale from 0 to 1), the bot­
tom row for wprojection ~ 0.02, the value
For symmetric positive definite matrices (which obey | A|2 < used for Figure 21.2 (grey-scale from 0 to
| [Aii] | • | [A])) |), one way to ensure this bound holds is to set 3). ‘Mis-scaled’ entries, where the scaled
error is outside of the grey-scale range,
w > max[A]i). are marked with a cross-hatch pattern.
ij j

Such an element-wise bound can either be constructed explic­


itly by inspecting the matrix, or may be known a priori. For
spd matrices, one way to construct such a bound in O(N)
time would be to leverage the aforementioned bound again
7Under the joint posterior, these matrix
and set w = maxi [A]i2i. For the example of the SARCOS ma­ elements are correlated. So one should
trix, which is a kernel Gram matrix, this bound is known a not try to build a mental histogram of
priori to be 1. The resulting scaling is shown in the middle the numbers in the plotted matrix and
ask about their relative frequencies.
row of Figure 21.3. The Figure confirms that the variance
provides an upper bound, but also shows this bound to be
rather loose for off-diagonal elements (which are by far the
majority of matrix elements!).

An estimated average: If instead we again use Rayleigh regression
and set $w$ to $w_\text{projection}$ from Eq. (21.5), this will not
give an upper bound on the error. But since this value is
constructed to capture the typical scale of the matrix, it may
provide a more aggressive scaling that, while not offering
any guarantees, might be more useful in practical applications.
The resulting scaling is shown in the bottom row of
Figure 21.3. Note the different colour scale relative to the row
above. This part of the figure shows the scaled estimate to
provide a better average-case error estimate for off-diagonal
elements (values close to 1). On the diagonal, the error
estimates can be far off, though. Some of the outliers (marked
by a cross-hatch pattern) can have ratios of true to estimated
error beyond 10.
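To make the comparison concrete, the diagnostic plotted in Figure 21.3 fits in a few lines of NumPy. This is an illustrative sketch, not the book's code; `A_M` and `W_M` are stand-ins for whatever posterior mean and covariance factor are available after $M$ steps:

```python
import numpy as np

def elementwise_std(W_M):
    # Marginal posterior standard deviation of [A]_ij, cf. the variance
    # formula above: var([A]_ij) = 1/2 (W_ii W_jj + W_ij^2).
    d = np.diag(W_M)
    return np.sqrt(0.5 * (np.outer(d, d) + W_M**2))

def scaled_error(A, A_M, W_M):
    # Ratios <= 1 everywhere: the variance acts as a hard upper bound.
    # Ratios scattered around 1: a well-calibrated average-case estimate.
    # Ratios >> 1: over-confident, "mis-scaled" entries.
    return np.abs(A - A_M) / elementwise_std(W_M)
```

Evaluating this ratio once with the hard-bound choice of $w$ and once with $w_\text{projection}$ folded into $W_M$ reproduces the middle and bottom rows of the figure, respectively.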
22
Proofs

► 22.1 Proof of Theorem 18.2

Theorem 18.2 establishes that Algorithm 17.2 amounts to a
conjugate direction method if the estimator $H_i$ is symmetric.

Proof. By induction: For the base case¹ $i = 2$, i.e. after the first
iteration of the loop, we have (recall that $\alpha_1 = -d_1^\top r_0 / d_1^\top A d_1$).
The symmetry of the estimator $H_i$ is used in the third-to-last
equality:

$$d_1^\top A d_2 = -d_1^\top A (H_1 r_1) = -d_1^\top A\big(H_1(y_1 + r_0)\big) = -d_1^\top A (s_1 + H_1 r_0) = -\alpha_1 d_1^\top A d_1 - d_1^\top A H_1 r_0$$
$$= d_1^\top r_0 - \alpha_1^{-1} s_1^\top A H_1 r_0 = d_1^\top r_0 - \alpha_1^{-1} y_1^\top H_1 r_0 = d_1^\top r_0 - \alpha_1^{-1} s_1^\top r_0 = d_1^\top r_0 - d_1^\top r_0 = 0.$$

1 For this proof, it does not actually matter how the first direction $d_1$ is chosen.

For the inductive step, assume $\{d_0, \dots, d_{i-1}\}$ are pairwise A-conjugate.
For any $k < i$, using this assumption twice yields

$$d_k^\top A d_i = -d_k^\top A (H_{i-1} r_{i-1}) = -d_k^\top A H_{i-1}\Big(\sum_{j<i} y_j + r_0\Big) = -d_k^\top A\Big(\sum_{j<i} s_j + H_{i-1} r_0\Big)$$
$$= -d_k^\top A\Big(\sum_{j<i} \alpha_j d_j + H_{i-1} r_0\Big) = -\alpha_k d_k^\top A d_k - d_k^\top A (H_{i-1} r_0)$$
$$= d_k^\top r_{k-1} - d_k^\top r_0 = d_k^\top\Big(\sum_{j<k} y_j + r_0\Big) - d_k^\top r_0 = \sum_{j<k} \alpha_j\, d_k^\top A d_j = 0. \qquad \square$$




► 22.2 Proof of Lemma 18.3

Lemma 18.3, establishing the connection between cg and Krylov


subspace methods, also helps in the proof of Theorem 18.4.

Proof of Lemma 18.3. We begin by noting that $d_i$ and $s_i = \alpha_i d_i$
only differ by a scalar, and recall both that $y_i = A s_i$ and $r_{i-1} = r_0 + \sum_{j<i} y_j$.
Hence, Eq. (18.3) can immediately be shortened to

$$s_i \in \operatorname{span}\{r_0, s_1, \dots, s_{i-1}, A s_1, A s_2, \dots, A s_{i-1}\}. \tag{22.1}$$

The proof can then be completed by an almost trivial induction.
For $i = 1$, the statement reads $s_1 \propto r_0$. For $i > 1$, assume

$$s_{i-j} \in \operatorname{span}\{r_0, A r_0, \dots, A^{i-2} r_0\} \quad \forall\ 0 < j < i.$$

Recursive application of this assumption to Eq. (22.1) directly
yields Eq. (18.4). Evidently, this derivation also implies that
Eq. (18.4) can be written equivalently as

$$d_i \in \operatorname{span}\{r_0, A d_1, \dots, A d_{i-1}\}$$
$$= \operatorname{span}\{s_1, \dots, s_{i-1}, r_{i-1}\} \tag{22.2}$$
$$= \operatorname{span}\{r_0, y_1, y_2, \dots, y_{i-1}\}$$
$$= \operatorname{span}\{r_0, r_1, \dots, r_{i-1}\}.$$
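The statement of Lemma 18.3 is easy to check numerically. The following sketch (a random test problem of my choosing, not code from the book) runs a few steps of cg and verifies that each step $s_i$ lies in the Krylov space $\operatorname{span}\{r_0, A r_0, \dots, A^{i-1} r_0\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 6
Q = rng.standard_normal((N, N))
A = Q @ Q.T + N * np.eye(N)          # a random spd test matrix
b = rng.standard_normal(N)

x = np.zeros(N)
r = A @ x - b                        # residual = gradient of the quadratic
r0, d = r.copy(), -r
steps = []
for _ in range(M):
    alpha = (r @ r) / (d @ A @ d)    # exact (perfect) step length
    steps.append(alpha * d)
    x = x + alpha * d
    r_new = r + alpha * (A @ d)
    d = -r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new

for i, s in enumerate(steps, start=1):
    # Normalised Krylov basis {r0, A r0, ..., A^{i-1} r0}:
    K = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(i)])
    K /= np.linalg.norm(K, axis=0)
    coeff, *_ = np.linalg.lstsq(K, s, rcond=None)
    # Relative residual of the projection; ~ round-off if s_i is in the span:
    print(i, np.linalg.norm(K @ coeff - s) / np.linalg.norm(s))
```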

► 22.3 Proof of Theorem 18.4

Theorem 18.4 identifies a property of the probabilistic model


on A that, when used in Algorithm 17.2, makes that algorithm
equivalent to the method of conjugate gradients.

Proof of Theorem 18.4. Because $H_i$ is symmetric, Theorem 18.2
holds and the algorithm is a conjugate directions method. Hence
$r_i^\top s_{j<i} = 0$ and $s_i^\top A s_{j<i} = s_i^\top y_{j<i} = 0$. Because the Theorem
posits that the algorithm has not yet terminated at step $i$, $r_j \neq 0$
for all $j < i$.

The proof now again proceeds by induction. For $i = 1$, the
sets $S_{i-1}, Y_{i-1}$ are empty, so the assumption trivially amounts
to $d_1 \propto r_0$. In other words, $d_1 = -\gamma_1 r_0$, for some $\gamma_1 \neq 0$. With
this choice, we get ($\gamma_1$ cancels in $\alpha_0$)

$$\alpha_0 = \frac{r_0^\top r_0}{r_0^\top A r_0}.$$

We thus have $x_1 = x_0 - \alpha_0 r_0$ and $r_1 = r_0 - \alpha_0 A r_0$. This means
$r_1 \perp r_0$, so the first statement holds. Also,

$$\operatorname{span}\{r_1, s_1, y_1\} = \operatorname{span}\{r_0, A r_0\}, \quad\text{and}\quad d_2 = \delta_0 r_0 + \delta_1 A r_0, \ \text{with } \delta_0, \delta_1 \in \mathbb{R}.$$

Because the algorithm produces conjugate directions, $d_2^\top A r_0 \propto d_2^\top A d_1 = 0$,
thus $\delta_0\, r_0^\top A r_0 + \delta_1\, r_0^\top A A r_0 = 0$, i.e.

$$\delta_1 = -\frac{r_0^\top A r_0}{r_0^\top A A r_0}\,\delta_0, \quad\text{and}\quad d_2 = \delta_0\Big(r_0 - \frac{r_0^\top A r_0}{r_0^\top A A r_0}\, A r_0\Big)$$
$$= \delta_0 \frac{(r_0^\top A r_0)^2}{(r_0^\top r_0)(r_0^\top A A r_0)}\bigg(\underbrace{r_0 - \alpha_0 A r_0}_{=r_1} + r_0\Big(\frac{(r_0^\top r_0)(r_0^\top A A r_0)}{(r_0^\top A r_0)^2} - 1\Big)\bigg).$$

Defining $\tilde\gamma_0 := -\delta_0 (r_0^\top A r_0)^2 / \big((r_0^\top r_0)(r_0^\top A A r_0)\big)$ and using $r_0 = -d_1/\gamma_1$,

$$d_2 = \tilde\gamma_0\bigg({-r_1} + \frac{1}{\gamma_1}\Big(\frac{(r_0^\top r_0)(r_0^\top A A r_0)}{(r_0^\top A r_0)^2} - 1\Big)\, d_1\bigg).$$

We see $\tilde\gamma_0 \neq 0$, because $A$ is spd and $r_0 \neq 0$ by assumption
(otherwise the algorithm would be converged!). To close this
part, we observe that

$$r_1^\top r_1 = (r_0 - \alpha_0 A r_0)^\top (r_0 - \alpha_0 A r_0) = r_0^\top r_0 - 2\alpha_0\, r_0^\top A r_0 + \alpha_0^2\, r_0^\top A A r_0 = -r_0^\top r_0 + \frac{(r_0^\top r_0)^2\, r_0^\top A A r_0}{(r_0^\top A r_0)^2},$$

and hence

$$\frac{(r_0^\top r_0)(r_0^\top A A r_0)}{(r_0^\top A r_0)^2} - 1 = \frac{r_1^\top r_1}{r_0^\top r_0}.$$

Thus, we indeed have the required statement $d_2 = \tilde\gamma_0\big({-r_1} + \beta_2/\gamma_1\, d_1\big)$.


For the inductive step, assume $r_k \perp r_j$ for all $k \neq j$ and
$k, j \leq i - 1$. Further assume that

$$d_j = \gamma_j\Big({-r_{j-1}} + \frac{\beta_j}{\gamma_{j-1}}\, d_{j-1}\Big) \quad\text{for all } j < i,$$

with $\beta_j$ defined as in Eq. (18.5). By the Theorem's assumption
(and the re-formulation of Eq. (22.2)), direction $d_i$ can be written
using scalars $\{v_j\}_{j=1,\dots,i}$ as

$$d_i = \sum_{j<i} v_j s_j + v_i r_{i-1}. \tag{22.3}$$

Because the algorithm produces conjugate directions, we have,
for $\ell < i$:

$$0 = s_\ell^\top A d_i = \sum_{j<i} v_j\, s_\ell^\top A s_j + v_i\, s_\ell^\top A r_{i-1} = v_\ell\, s_\ell^\top A s_\ell + v_i\, y_\ell^\top r_{i-1}.$$

If $\ell < i - 1$, then the second term in this sum cancels by the first
induction assumption:

$$y_\ell^\top r_{i-1} = (r_\ell - r_{\ell-1})^\top r_{i-1} \overset{\ell < i-1}{=} 0.$$

But $A$ is positive definite and the algorithm has not converged,
so $s_\ell \neq 0$, hence

$$v_j = 0 \quad\text{for all } j < i - 1.$$

Equation (22.3) thus simplifies to

$$d_i = v_{i-1} s_{i-1} + v_i r_{i-1} = v_{i-1} \alpha_{i-1} d_{i-1} + v_i r_{i-1}.$$

One degree of freedom is removed because $d_i$ must be conjugate
to $s_{i-1}$, so

$$0 = v_{i-1}\alpha_{i-1}\, s_{i-1}^\top A d_{i-1} + v_i\, s_{i-1}^\top A r_{i-1}$$
$$= v_{i-1}\alpha_{i-1}\, s_{i-1}^\top A\, \gamma_{i-1}\Big({-r_{i-2}} + \frac{\beta_{i-1}}{\gamma_{i-2}}\, d_{i-2}\Big) + v_i\, y_{i-1}^\top r_{i-1}$$
$$= -v_{i-1}\alpha_{i-1}\gamma_{i-1}\, y_{i-1}^\top r_{i-2} + v_i\, (r_{i-1} - r_{i-2})^\top r_{i-1}$$
$$= v_{i-1}\alpha_{i-1}\gamma_{i-1}\, r_{i-2}^\top r_{i-2} + v_i\, r_{i-1}^\top r_{i-1}, \quad\text{and hence}\quad v_{i-1}\alpha_{i-1} = -\frac{\beta_i}{\gamma_{i-1}}\, v_i.$$

Setting $\gamma_i = -v_i \neq 0$, we thus get the required statement

$$d_i = \gamma_i\Big({-r_{i-1}} + \frac{\beta_i}{\gamma_{i-1}}\, d_{i-1}\Big). \tag{22.4}$$
What remains is to show that $r_i \perp r_j\ \forall\ j < i$. To do so, we re-arrange
the recursion property (which is now proven for all
$j \leq i$), to write the preceding gradients $r_j$ in terms of the search
directions:

$$d_j = \gamma_j\Big({-r_{j-1}} + \frac{\beta_j}{\gamma_{j-1}}\, d_{j-1}\Big) \;\Leftrightarrow\; r_{j-1} = -\frac{1}{\gamma_j}\, d_j + \frac{\beta_j}{\gamma_{j-1}}\, d_{j-1} \quad\text{for all } j \leq i. \tag{22.5}$$

At step $i$, the updated gradient is

$$r_i = A s_i + r_{i-1} = \alpha_i A d_i + r_{i-1} = -\frac{d_i^\top r_{i-1}}{d_i^\top A d_i}\, A d_i + r_{i-1}. \tag{22.6}$$

For $j < i - 1$, Eq. (22.5) directly shows that $r_j \perp r_i$, since $r_j$ is
then orthogonal to $r_{i-1}$ on the one hand, and can be written as
a combination of $d_{j+1}$ and $d_j$ on the other hand (both of which
are A-conjugate to $d_i$). For $j = i - 1$, we use Eq. (22.6) to see

$$r_{i-1}^\top r_i = -\frac{d_i^\top r_{i-1}}{d_i^\top A d_i}\, r_{i-1}^\top A d_i + r_{i-1}^\top r_{i-1} = \frac{1}{\gamma_i}\, d_i^\top r_{i-1} + r_{i-1}^\top r_{i-1},$$

and use Eq. (22.4) to get

$$r_{i-1}^\top r_i = -r_{i-1}^\top r_{i-1} + \frac{\beta_i}{\gamma_{i-1}}\, d_{i-1}^\top r_{i-1} + r_{i-1}^\top r_{i-1} = \frac{\beta_i}{\gamma_{i-1}}\, d_{i-1}^\top r_{i-1}.$$

But recall that $d_{i-1} \perp r_{i-1}$ by construction:

$$d_{i-1}^\top r_{i-1} = d_{i-1}^\top\Big({-\frac{d_{i-1}^\top r_{i-2}}{d_{i-1}^\top A d_{i-1}}}\, A d_{i-1} + r_{i-2}\Big) = -d_{i-1}^\top r_{i-2} + d_{i-1}^\top r_{i-2} = 0.$$

Hence $r_i \perp r_j$ for all $j < i$, and the proof is completed. □
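The chain of identities above can be sanity-checked numerically. A minimal sketch (again a random test problem, not the book's code) verifies the three claims at once: pairwise A-conjugacy of the directions, pairwise orthogonality of the residuals, and the update ratio from Eq. (18.5):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 40
Q = rng.standard_normal((N, N))
A = Q @ Q.T + N * np.eye(N)                  # spd test matrix
b = rng.standard_normal(N)

x = np.zeros(N)
r = A @ x - b
d = -r
directions, residuals = [d], [r]
for _ in range(10):
    alpha = (r @ r) / (d @ A @ d)
    x = x + alpha * d
    r_new = r + alpha * (A @ d)
    beta = (r_new @ r_new) / (r @ r)         # the ratio from Eq. (18.5)
    d = -r_new + beta * d
    r = r_new
    directions.append(d)
    residuals.append(r)

D, R = np.array(directions), np.array(residuals)
print(np.abs(np.triu(D @ A @ D.T, 1)).max())  # A-conjugacy: ~ round-off
print(np.abs(np.triu(R @ R.T, 1)).max())      # orthogonality: ~ round-off
```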

► 22.4 Proof for Theorem 19.10

Theorem 19.10 states that for symmetric inference on the elements
of a symmetric positive definite matrix with a scalar
mean and covariance, it is possible to ensure hereditary positive
definiteness of the posterior mean by shifting the scalar prior
mean "far out" into the positive definite cone. Unfortunately,
deriving a quantitative version of this result requires seriously
tedious linear algebra derivations. So instead, Theorem 19.10
only claims that there is some sufficiently large value $\alpha_0$ ensuring
hereditary positive definiteness.

Proof of Theorem 19.10. We first note that for scalar $W_0$, we have
$W_i = \beta(I - S(S^\top S)^{-1} S^\top)$, so $\beta$ cancels out of the right-hand side
of Eq. (19.27), and $W_i s_i$ amounts to computing the projection
of $s_i$ onto the complement of the span of $S$. Now we make the
inductive assumption that $A_i$ is positive definite. If this holds,
then the second term being subtracted on the right-hand side
of Eq. (19.27) is strictly positive (the numerator is the square
of a real number, the denominator is nonnegative because $A_i$
is positive definite). So the right-hand side of the inequality is
smaller than $y_i^\top A_i^{-1} y_i$. Furthermore, the left-hand side, $y_i^\top s_i = s_i^\top A s_i$,
is positive, since $A$ is assumed to be spd. We will show
that the upper bound $y_i^\top A_i^{-1} y_i$ can be brought arbitrarily close
to zero for large values of $\alpha$, so that the inequality eventually
has to hold.

Consider the expression for $A_i^{-1}$ given in Eq. (19.24). To compute
$y_i^\top A_i^{-1} y_i$, it will be multiplied from the left and right by $y_i$.
Since $s_i$ is conjugate to all preceding directions, by assumption
$y_i^\top S = 0$, and thus $y_i^\top A_0^{-1} U = 0$, as well as $y_i^\top A_0^{-1} V = \tfrac{1}{\alpha}\, y_i^\top Y$.
This insight significantly reduces the amount of work required,
but still leaves some tedious derivations that prevent a quantitative
result here. Inspecting Eq. (19.24), we see that we need to
compute

$$y_i^\top A_i^{-1} y_i = \frac{1}{\alpha}\, y_i^\top y_i - \frac{1}{\alpha^2}\, y_i^\top Y M Y^\top y_i,$$

where the matrix $M$ is the inverse term in Eq. (19.24) (and depends on $\alpha$!). Its form is
given by Eq. (15.10), but it requires a lengthy derivation, left out
here for space, to find that its concrete form here is

$$M = \Big(\tfrac{1}{\alpha}\big((Y^\top Y) - (Y^\top S)(S^\top S)^{-1}(Y^\top S)\big) - Y^\top S\Big)^{-1}.$$

Hence, for $\alpha \to \infty$, $M \to -(Y^\top S)^{-1}$, and we have

$$y_i^\top A_i^{-1} y_i \;\xrightarrow{\ \alpha \to \infty\ }\; \frac{1}{\alpha}\, y_i^\top y_i + \frac{1}{\alpha^2}\, y_i^\top Y (Y^\top S)^{-1} Y^\top y_i \;\xrightarrow{\ \alpha \to \infty\ }\; 0,$$

which completes the proof. □

To an informal degree, this derivation also gives a rough idea
of scale: since $\|y_i\|$ is bounded above by $\lambda_{\max}\|s_i\|$, where $\lambda_{\max}$ is
the largest eigenvalue of $A$, hereditary positive definiteness is
achieved by setting $\alpha$ much larger than $\lambda_{\max}$.

► 22.5 Conjugate Prior Inference for Matrices

Section 21 introduces a method to calibrate the posterior covariance
for observations collected from cg in an empirical
Bayesian fashion, for models with $W_A S = Y$. Such models are
preferable because their associated prior mean has several good
algebraic properties (see §20). But the alternative interpretation
for cg from Theorem 19.16, with scalar covariance,²

$$p_A(A \mid \alpha_A, \beta_A) = \mathcal{N}(A;\ \alpha_A I,\ \beta_A^2\, I \circledast I) \quad\text{or} \tag{22.7}$$
$$p_H(H \mid \alpha_H, \beta_H) = \mathcal{N}(H;\ \alpha_H I,\ \beta_H^2\, I \circledast I),$$

allows another conceptually appealing form of uncertainty calibration,
via the conjugate prior.

2 To avoid clutter and redundancy, the exposition below, as above, will focus on $p_A$, the model inferring the matrix $A$. Since there is little risk of confusion, the subscripts will be dropped there, so the model parameters will be denoted as $\alpha, \beta$. The case for inference on $H$ is entirely analogous under the exchange $A \leftrightarrow H$, $y \leftrightarrow s$.
In principle, this is a case of applying the general framework
of hyperparameter inference from conjugate priors, outlined in
§6. Annoyingly, its application to probability measures over matrix
elements raises a few tedious linear algebra complications.

To find the Gauss-Gamma posterior arising from the symmetric
Gaussian prior with scalar parameters, Eq. (22.7), we use
the closure of the Gaussian family under marginalisation and
multiplication (Eq. (3.5)) to compute the marginal likelihood
analogously to §6.2:

$$p(Y, S \mid \alpha, \beta) = \int p(Y \mid A, S)\, p(A \mid \alpha, \beta)\, \mathrm{d}A = \int \delta\big(Y - (I \circledast S)^\top A\big)\, \mathcal{N}\big(A;\ \alpha I,\ \beta^2 (I \circledast I)\big)\, \mathrm{d}A$$
$$= \mathcal{N}\big(Y;\ \alpha S,\ \beta^2 (I \circledast S)^\top (I \circledast I)(I \circledast S)\big).$$

Then we multiply with the Gauss-Gamma prior to get the posterior
up to normalisation

$$p(\alpha, \beta \mid Y) \propto p(Y \mid \alpha, \beta, S)\, p(\alpha, \beta) = \mathcal{N}\big(Y;\ \alpha S,\ \beta^2 (I \circledast S)^\top (I \circledast I)(I \circledast S)\big)\, \mathcal{N}(\alpha;\ \mu_0,\ \beta^2/\lambda_0)\, \mathcal{G}(\beta^{-2};\ a_0, b_0).$$

In the analogue to Eq. (6.2), we take care of the Gaussian part by re-arranging

$$p(\alpha, \beta \mid Y, S) \propto \mathcal{G}(\beta^{-2};\ a_0, b_0)\, \mathcal{N}\Big(Y;\ \mu_0 S,\ \beta^2\Big((I \circledast S)^\top (I \circledast I)(I \circledast S) + \frac{1}{\lambda_0}\, S S^\top\Big)\Big) \tag{22.8}$$
$$\times\, \mathcal{N}\Big(\alpha;\ \Upsilon\Big(\lambda_0 \mu_0 + S^\top \big((I \circledast S)^\top (I \circledast I)(I \circledast S)\big)^{-1} Y\Big),\ \beta^2\, \Upsilon\Big),$$
$$\text{with}\quad \Upsilon := \Big(\lambda_0 + S^\top \big((I \circledast S)^\top (I \circledast I)(I \circledast S)\big)^{-1} S\Big)^{-1}. \tag{22.9}$$

In contrast to its scalar analogue, this equation contains several
cumbersome expressions that require further treatment before
we can arrive at a manageable representation. Central among
them is the marginal covariance

$$G := (I \circledast S)^\top (I \circledast I)(I \circledast S) \in \mathbb{R}^{NM \times NM}.$$

We can use Eq. (19.16) and the definition of the Kronecker
product (15.2) to find an explicit form for this expression:³

$$G_{ia,jb} = \sum_{k,\ell,n,m=1}^{N} \delta_{ik} S_{\ell a}\, \tfrac{1}{2}(\delta_{kn}\delta_{\ell m} + \delta_{km}\delta_{\ell n})\, \delta_{nj} S_{mb} = \tfrac{1}{2}\big(\delta_{ij}\, (S^\top S)_{ab} + S_{ja} S_{ib}\big).$$

So if this matrix acts on an arbitrary $X \in \mathbb{R}^{NM}$, the result is

$$GX = \tfrac{1}{2}\big(X (S^\top S) + S X^\top S\big).$$

3 It follows from the structure of the symmetric Kronecker product that the inverse of this matrix exists on the space of all matrices that can be written in the form $CS \in \mathbb{R}^{N \times M}$, where $C = C^\top \in \mathbb{R}^{N \times N}$. Of course, in Eq. (22.8), this inverse is invariably multiplied with such matrices. If $S$ is full rank, then this space has $NM - \tfrac{1}{2}(M^2 - M)$ degrees of freedom, so, for $a \in \mathbb{R}\setminus\{0\}$, $\det(aG) = a^{NM - \frac{1}{2}(M^2 - M)} \det(G)$.



To simplify Eq. (22.8), we require solutions $X$ to problems of the
type $GX = Z$, where $Z$ is of the form $Z = CS$ with a symmetric
$C$. This is a similar problem to the inversion of $W \circledast W$ for
Eq. (19.21). Explicit constructions of such solutions can be found
in the literature,⁴ but one can just convince oneself by inspection
that such a solution is given by

$$GX = Z \;\Leftrightarrow\; X = 2Z(S^\top S)^{-1} - S(S^\top S)^{-1}(Z^\top S)(S^\top S)^{-1}.$$

In this sense, we have

$$S^\top G^{-1} S = \operatorname{tr}\big(S^\top S (S^\top S)^{-1}\big) = M,$$

which simplifies $\Upsilon$ from Eq. (22.9) to $\Upsilon = (\lambda_0 + M)^{-1}$. Analogous
to the scalar base case, we introduce a "sample mean"
defined by⁵

$$\hat a := M^{-1} \operatorname{tr}\big(Y^\top S (S^\top S)^{-1}\big). \tag{22.10}$$

4 Hennig (2015)

5 The trace in Eq. (22.10) has a particularly simple form if the directions $S$ are A-conjugate (as is the case if they are produced by cg). In that case, $Y^\top S$ is a diagonal matrix, and the expression becomes $\operatorname{tr}\big(Y^\top S (S^\top S)^{-1}\big) = \sum_m (s_m^\top A s_m)\, [(S^\top S)^{-1}]_{mm}$.
With this, the second line of Eq. (22.8), the posterior on $\alpha$, becomes⁶

$$p(\alpha \mid \beta^2, Y, S) = \mathcal{N}\Big(\alpha;\ \frac{\lambda_0 \mu_0 + M \hat a}{\lambda_0 + M},\ \frac{\beta^2}{\lambda_0 + M}\Big).$$

6 For an intuition of this expression, note that if $S = I_{:,1:M}$ (the first $M$ columns of the identity), then the posterior mean on $\alpha$ is essentially (up to regularisation) computing a running average of $A$'s first $M$ diagonal elements.

To approach the posterior on the variance $\beta^2$, we continue to follow
the guidance from Chapter I and apply the matrix inversion
lemma, which yields

$$\Big(G + \frac{1}{\lambda_0}\, S S^\top\Big)^{-1} = G^{-1} - \frac{(G^{-1} S)(S^\top G^{-1})}{\lambda_0 + S^\top G^{-1} S},$$

$$(Y - \mu_0 S)^\top \Big(G + \frac{1}{\lambda_0}\, S S^\top\Big)^{-1} (Y - \mu_0 S)$$
$$= (Y - \mu_0 S)^\top \Big(2Y(S^\top S)^{-1} - S(S^\top S)^{-1}(Y^\top S)(S^\top S)^{-1} - \mu_0 S(S^\top S)^{-1}\Big) - \frac{(M\hat a - M\mu_0)^2}{\lambda_0 + M}$$
$$= \operatorname{tr}\big(2 Y^\top Y (S^\top S)^{-1} - Y^\top S (S^\top S)^{-1} Y^\top S (S^\top S)^{-1}\big) - 2\mu_0 M \hat a + \mu_0^2 M - \frac{M^2 (\hat a - \mu_0)^2}{\lambda_0 + M}.$$

As mentioned in Note 3 above, some care must be taken when
considering the number of pseudo-observations. The matrix
determinant lemma (15.11) provides

$$\det\Big(\beta^2\Big(G + \frac{1}{\lambda_0}\, S S^\top\Big)\Big) = \det(\beta^2 G)\Big(1 + \frac{1}{\lambda_0}\, S^\top G^{-1} S\Big) = \beta^{2(NM - \frac{1}{2}(M^2 - M))} \det(G)\Big(1 + \frac{M}{\lambda_0}\Big).$$

So the sufficient statistics for the Gauss-Gamma posterior on
$\alpha, \beta^2$ are

$$\mu_N = \frac{\lambda_0 \mu_0 + M \hat a}{\lambda_0 + M}, \qquad \lambda_N = \lambda_0 + M, \qquad a_N = a_0 + \tfrac{1}{2}\big(NM - \tfrac{1}{2}(M^2 - M)\big),$$
$$b_N = b_0 + \tfrac{1}{2}\Big(\operatorname{tr}\big(2 Y^\top Y (S^\top S)^{-1} - Y^\top S (S^\top S)^{-1} Y^\top S (S^\top S)^{-1}\big) - 2\mu_0 M \hat a + \mu_0^2 M - \frac{M^2 (\hat a - \mu_0)^2}{\lambda_0 + M}\Big). \tag{22.11}$$

These expressions tell us what computations are needed to
estimate the parameters $\alpha, \beta$. But what do they mean? In contrast
to the scalar case, it is not so straightforward to define
this expression in terms of a sample mean and variance. To
understand their structure, assume that the singular value decomposition
of $S$ is given by $S = V\Sigma U^\top$. Then $Y$ can be written
in the same basis as $Y = VRU^\top$ with a dense rectangular matrix
$R \in \mathbb{R}^{N \times M}$ that - since $Y = AS$ - is given by $R = V^\top A V \Sigma$.
Recall from the introduction (§15.4) that $\Sigma$ is a "rectangular
diagonal" matrix; its only non-zero elements are the singular
values $\sigma_i = \Sigma_{ii}$, $i = 1, \dots, M$. By $\Sigma^{-1}$, we will denote its pseudo-inverse
- the rectangular diagonal matrix containing $\sigma_i^{-1}$ on its
diagonal. This notation drastically simplifies the expressions
above. We find that⁷

$$\hat a = \tfrac{1}{M} \operatorname{tr}\big(Y^\top S (S^\top S)^{-1}\big) = \tfrac{1}{M} \operatorname{tr}\big(U R^\top \Sigma^{-1} U^\top\big) = \tfrac{1}{M} \operatorname{tr}\big(R^\top \Sigma^{-1}\big) = \tfrac{1}{M} \sum_{i=1}^{M} (V^\top A V)_{ii}.$$

7 Note that the surrounding orthonormal matrices $V$ do not cancel in this trace, as the sum only runs from 1 to $M$, not over all elements. However, in the final state $M = N$, we indeed get $\hat a = \tfrac{1}{M}\sum_{i=1}^{N} A_{ii}$.

So $\hat a$ is indeed an empirical average of the "diagonal" entries
of $A$, but in the basis of the singular vectors of $S$. What about
the unwieldy trace term in Eq. (22.11)? Transforming it into the
basis of the SVD yields (note again that both sums only run to
$M$, not $N$, and that $V^\top A V$ is symmetric by assumption)

$$\operatorname{tr}\big(2 Y^\top Y (S^\top S)^{-1} - Y^\top S (S^\top S)^{-1} Y^\top S (S^\top S)^{-1}\big)$$
$$= \operatorname{tr}\big(2 U R^\top R \Sigma^{-2} U^\top - U (R^\top \Sigma^{-1})(R^\top \Sigma^{-1}) U^\top\big)$$
$$= \operatorname{tr}\big(2 R^\top R \Sigma^{-2} - (R^\top \Sigma^{-1})(R^\top \Sigma^{-1})\big)$$
$$= \sum_{ij=1}^{M} 2 R_{ji} R_{ji} \sigma_i^{-2} - R_{ij} R_{ji} \sigma_i^{-1} \sigma_j^{-1} = \sum_{ij=1}^{M} (V^\top A V)_{ij}^2.$$

And we see that introducing the sufficient statistic

$$\hat\beta^2 := \frac{1}{M} \sum_{ij=1}^{M} \Big((V^\top A V)_{ij}^2 - \frac{(V^\top A V)_{ii}\,(V^\top A V)_{jj}}{M}\Big)$$

gives a result analogous to Eq. (6.8):

$$b_N = b_0 + \frac{1}{2}\Big(M \hat\beta^2 + \frac{\lambda_0 M}{\lambda_0 + M}\,(\hat a - \mu_0)^2\Big).$$

This is a pleasing result. The sufficient statistics compute an
empirical expectation over the elements of $A$ in the left-singular
basis of $S$. The term $\hat a$ is an empirical mean over the diagonal.
The more involved expression $\hat\beta^2$ is an empirical sum over
the squares of elements, corrected by the outer product of the
means. We also note that things are particularly easy if $S$ is
an orthonormal matrix. Since then its SVD is simply given by
$S = S \cdot I \cdot I$, i.e. $V = S$, we simply get $V^\top A V = S^\top A S = Y^\top S$.
This is interesting because both direct and iterative solvers that
use conjugate search directions project along a transformation
of a set of orthogonal directions. For conjugate gradients, that
set is the sequence of gradients (residuals) $r_i = A x_i - b$.
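To make these expressions concrete, here is a sketch of the computations behind Eqs. (22.10) and (22.11) in NumPy. This is illustrative code under the assumptions of this section (exact observations $Y = AS$, full-rank $S$), not a reference implementation:

```python
import numpy as np

def gauss_gamma_posterior(S, Y, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """Sufficient statistics of the Gauss-Gamma posterior on (alpha, beta^2)
    for the scalar-mean, scalar-covariance matrix model, Eqs. (22.10)/(22.11)."""
    N, M = S.shape
    K = S.T @ S
    C_Kinv = np.linalg.solve(K, (Y.T @ S).T).T      # (Y^T S)(S^T S)^{-1}
    a_hat = np.trace(C_Kinv) / M                    # Eq. (22.10)
    trace_term = (2 * np.trace(np.linalg.solve(K, Y.T @ Y))
                  - np.trace(C_Kinv @ C_Kinv))      # the trace in Eq. (22.11)
    lamN = lam0 + M
    muN = (lam0 * mu0 + M * a_hat) / lamN
    aN = a0 + 0.5 * (N * M - 0.5 * (M**2 - M))
    bN = b0 + 0.5 * (trace_term - 2 * mu0 * M * a_hat + mu0**2 * M
                     - M**2 * (a_hat - mu0)**2 / lamN)
    return muN, lamN, aN, bN
```

For directions produced by cg, $Y^\top S$ is diagonal, and `a_hat` reduces to the average of the Rayleigh-type coefficients from note 5.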
23
Summary of Part III

The solution of linear systems of equations $Ax = b$ is a highly
structured process that requires keeping track of just the right
quantities, in the right order. Considering the process from the
probabilistic perspective, we have again seen that classic iterative
solvers arise as mean estimates of Gaussian models, with a
highly structured surrounding posterior uncertainty. In particular,
the method of conjugate gradients can be motivated in two
separate ways, from Gaussian priors on either the matrix $A$ or
its inverse. The resulting posterior uncertainties (covariances)
can be made to be closely related, though not identical to each
other.

Conjugate gradients efficiently explores the expanding subspaces
of the Krylov sequence, providing potentially rapidly
converging point estimates for the solution $x$. Since this process
perfectly identifies the solution within the sub-space, it is not
surprising that a probabilistic interpretation of this process is
associated with vanishing uncertainty within the subspace. In
its complement, on the other hand, the solution is not identified
at all. A first task for a probabilistic linear solver is thus to assign
uncertainty across this part of the solution space, and adapt or
calibrate this uncertainty. We saw that this can be possible under
regularity assumptions on $A$, which allow predicting the value
of upcoming projections $As_m$ in future steps in the process of
Rayleigh regression. Since this process uses observations (the
Rayleigh coefficients) that are essentially already collected by cg
anyway, it offers a convenient way to tune uncertainty estimates,
and add predictive uncertainty to algorithms like conjugate
gradients.
Along the way, we also made several structural observations
that may seem marginal at first sight, but hold interesting
insights, in particular:

• Preconditioning is connected to the notion of prior information.
This connection is not just intuitive, but can be made
quite precise (Corollary 19.17). One way to interpret these
results is that a pre-conditioner is a means to calibrate not
just the initial point estimate $A_0$, but also the surrounding
uncertainty.

• Certain kinds of prior information, even if known with certainty,
can only be imperfectly captured in the prior without
sacrificing computational tractability. As a concrete example,
while we can ensure that, for symmetric positive definite
matrices, the posterior mean estimate is always spd, too, this
cannot be captured in a Gaussian prior. In this sense, not just
the probabilistic solver, but cg et al. with it, are "missing out"
on useful prior information about positive definiteness, in
the sense that they do not directly make use of it to guide
their search strategy.

• In linear algebra, more than in other domains, performance
hinges on some crucial assumptions being true, which allow
for ignoring or not tracking certain quantities. In particular,
the assumption that matrix-vector multiplications $As = y$
can be computed without error. If this does not hold (which
is a concrete problem in machine learning wherever data-subsampling
leads to stochastic imprecision), then various
tacit assumptions fly out the window. We can no longer
equate models on $A$ and its inverse with each other (because
$AS = Y$ is then not equivalent with $S = HY$). Even if the disturbance
is Gaussian, the resulting Gaussian posterior loses
its convenient algebraic structure (Eqs. (19.10) and (19.11), as
well as Eqs. (19.21) and (19.22)). In general, we can thus not
expect to find linear algebra methods for the "noisy" setting
that come close, even in complexity, to existing iterative routines.
Stochasticity has very severe effects on linear algebra.

Software

The ProbNum library¹ provides reference implementations of
probabilistic linear solvers with posterior uncertainty quantification.

1 Code at probnum.org. See the corresponding publication by Wenger et al. (2021).
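As an illustration (not an excerpt from the book), a call to ProbNum's probabilistic linear solver might look as follows; the exact interface may differ between versions, so probnum.org remains the authoritative reference:

```python
import numpy as np
from probnum.linalg import problinsolve

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 10))
A = Q @ Q.T + 10 * np.eye(10)        # an spd system matrix
b = rng.standard_normal(10)

# Returns random variables for the solution x, the matrix A and its
# inverse, each equipped with posterior uncertainty:
x, Ahat, Ainv, info = problinsolve(A, b)
print(x.mean, x.cov)
```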
Chapter IV
Local Optimisation
24
Key Points

Most of the background for this chapter is provided by Chapter I.
To understand the interpretation of quasi-Newton methods,
readers should first pass through Chapter III, which introduces
most of the relevant concepts.

Chapter II on integration focussed on explicitly re-constructing
existing numerical quadrature rules as probabilistic estimators.
Chapter III on linear algebra took a similar approach, adding
some new functionality to the existing methods at the end. In
this chapter, we will move further away from existing methods,
and develop new functionality. Classic optimisation routines
will remain a reference point. But contemporary optimisation
problems, in particular in the area of machine learning, pose
challenges that classic methods are not particularly well-suited
to address. Here, the probabilistic viewpoint offers new avenues.

We will argue in §26.1 that the key numerical challenge separating
machine learning from classic numerical problems is
posed by the central role of (big, external) data in the computation.
When data sets are large - as they often are - data points
are regularly sub-sampled for internal computations. This batching
causes stochasticity, noise, of a magnitude far beyond the
machine precision, and classic notions of stability are no longer
applicable. It no longer makes sense to talk about a correct number
that is being computed with a tiny error. A real likelihood
function has to enter the picture. In this chapter, we will see that
if this object is not properly modelled and represented in the
computation, it can contribute to a number of problems that are
very visible in areas like deep learning, which exhibits a rather
disappointing algorithmic landscape at the time of writing this
text.

Selecting algorithmic parameters: There are several "hidden"
or internal quantities that control and define an optimiser.
Examples addressed below include step sizes, stopping criteria,
and the computational precision of observations. These
hyperparameters are either absent or mere nuisances in problems
where gradients can be accessed with essentially perfect
certainty. But they can become real obstacles in stochastic
optimisation. In fact, some parameters may even be wholly
unidentified, and setting them may then require the computation
of additional observables. We will find that probabilistic
formulations help both in identifying and solving such problems.
It is not always necessary to track a full, calibrated
posterior in such cases; just using probabilistic reasoning,
with likelihoods and evidences, may be enough.

Search directions: For high-dimensional stochastic optimisation
problems, a frequent analytical approach is to take an
optimiser designed for noise-free optimisation, then analyse
its robustness to noise. This can lead to a conflation of concepts,
where design choices that were originally conceived
to address problems unrelated with uncertainty (e.g. under-damped
dynamics of the optimisation rule) are now also
used to address noise. Explicit probabilistic inference on underlying
quantities of interest (like the latent gradient, which
is observed with low certainty or corrupted by noise) can
help separate different design aspects of an optimisation routine,
and provide more nuanced ways to achieve differing
desiderata.
25
Problem Setting

The preceding chapters on integration and linear algebra dealt
with linear problems.¹ In this chapter, we move to more general,
nonlinear tasks. This theme will then continue in the chapters
on differential equations.

1 Integration is a linear operation in so far as, for two functions $f, g$ and real numbers $\alpha, \beta \in \mathbb{R}$, $\int \alpha f(x) + \beta g(x)\, \mathrm{d}x = \alpha \int f(x)\, \mathrm{d}x + \beta \int g(x)\, \mathrm{d}x$. A Gaussian (process) prior on an integrand $f$ is thus associated with a Gaussian marginal on the integral over $f$. The linear algebra problems studied in Chapter III actually do not have this property, because matrix inversion is nonlinear ($(A + B)^{-1} \neq A^{-1} + B^{-1}$). The principal linear property used in Chapter III is that matrix-vector multiplications $As = y$ provide a linear projection of the latent matrix $A$.

Nonlinear optimisation problems are another class of numerical
tasks that have been studied to extreme depths. They have
ubiquitous applications of massive economic relevance. As in
previous chapters, the scope of this text is too limited to give a
comprehensive overview, nor even to address just a significant
part of the myriad different types of optimisation problems
(some listed below). We will focus on the following basic set-up:
consider the real-valued function $f(x) \in \mathbb{R}$ with multivariate
inputs $x \in \mathbb{R}^N$. Typical values for $N$ can vary from a handful
(e.g. in control engineering) to billions (e.g. in contemporary machine
learning). Throughout this chapter, we will assume that
$f$ is at least twice continuously differentiable. Slightly adapting
the widely used notation, we will denote the gradient and
Hessian functions of $f$ by

$$\nabla f : \mathbb{R}^N \to \mathbb{R}^N, \quad [\nabla f(x)]_i = \frac{\partial f(x)}{\partial x_i}, \quad\text{and}$$
$$B : \mathbb{R}^N \to \mathbb{R}^{N \times N}, \quad [B(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$

The function $f$ will be called (strongly) convex if $B(x)$ is (strictly)
positive definite everywhere. We will not generally assume
that $f$ is convex. And we will stick to the comparably simple
unconstrained setting. For the purposes of this chapter, we are
looking for a local minimum² of $f$, which will be denoted by

$$x^* = \operatorname*{arg\,min}_x f(x) \;:\Leftrightarrow\; \big(\nabla f(x^*) = 0\big) \wedge \big(B(x^*) \text{ is spd}\big).$$

2 One could equally well decide to look for maxima. The minima of $f$ are the maxima of $-f$. Methods searching for global extrema are fundamentally different, and covered in Chapter V.

1   procedure OPTIMISE(f, ∇f, x0)
2       [f0, ∇f0] ← [f(x0), ∇f(x0)]                        // first evaluation
3       [F, ∇F] ← [f0, ∇f0]                                // initialise storage
4       for i = 0, 1, ... do
5           di ← DIRECTION(F, ∇F)                          // decide search direction
6           αi ← LINESEARCH(f(xi + t·di), diᵀ∇f(xi + t·di))  // move
7           xi+1 ← xi + αi·di
8           fi+1 ← f(xi+1)                                 // observe
9           ∇fi+1 ← ∇f(xi+1)                               // observe
10          if TERMINATE(∇fi) then                         // done?
11              return xi                                  // success!
12          end if
13          [F, ∇F] ← [F ∪ fi+1, ∇F ∪ ∇fi+1]               // update storage
14      end for
15  end procedure

Algorithm 25.1: Pseudo-code for a generic iterative (first-order) optimisation algorithm. For generality, this code assumes access to both the objective $f$ and its gradient $\nabla f$. Many practical optimisers require only the gradient. The algorithm iterates over three principal steps: decide on a search direction and a step length, then update the estimate and collect a new observation of the objective and/or its gradient. The line search sub-routine deciding the step length operates on a projected, univariate sub-problem. The first part of this chapter primarily deals with the line search, the second part with the selection of directions.
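In Python, the same loop structure might look as follows. This is a sketch of Algorithm 25.1's skeleton, not library code; the `direction` and `line_search` callables are placeholders for the sub-routines discussed in the rest of this chapter:

```python
import numpy as np

def optimise(f, grad_f, x0, direction, line_search, tol=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    storage = []                                  # the [F, ∇F] storage
    g = grad_f(x)
    for _ in range(max_iter):
        d = direction(x, g, storage)              # decide search direction
        # the line search only sees the projected, univariate problem:
        alpha = line_search(lambda t: f(x + t * d),
                            lambda t: d @ grad_f(x + t * d))
        x = x + alpha * d                         # move
        g = grad_f(x)                             # observe
        storage.append((f(x), g))                 # update storage
        if np.linalg.norm(g) < tol:               # done?
            return x                              # success!
    return x

# Gradient descent with a fixed step size, as a trivial instance:
# optimise(f, grad_f, x0,
#          direction=lambda x, g, s: -g,
#          line_search=lambda fa, dfa: 0.1)
```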

(That is, we will use the notation arg min even for local minima,
for simplicity.) Introductions to classic nonlinear optimisation
methods can be found in a number of great textbooks. Nocedal
and Wright (1999) provide an accessible, practical introduction
with an emphasis on unconstrained, not-necessarily-convex
problems. Boyd and Vandenberghe (2004) offer a more theoretically
minded introduction concentrating on convex problems.
Both books also discuss constrained problems, and continuous
optimisation problems that do not have a continuous gradient
everywhere (so-called non-smooth problems). These two areas
are at the centre of the book by Bertsekas (1999). Other popular
types of optimisation include discrete, and mixed-integer "programs".³
Genetic Algorithms and Stochastic Optimisation are
also large communities, interested in optimising highly noisy or
fundamentally rough functions (see e.g. the book by Goldberg
(1989)). Such noise (i.e. uncertainty/imprecision) on function
values will play a central role in this chapter - in fact, one could
make the case that Probabilistic Numerics can bridge some conceptual
gaps between numerical optimisation and stochastic
optimisation. However, we will make the assumption that there
is at least a smooth function "underneath" the noise. Stochastic
and evolutionary methods are also connected to the contents of
Chapter V.

3 For historical reasons, the optimisation and operations research communities use the terms "program" and "problem", as well as "programming" and "optimisation", synonymously. A mixed integer program is a problem involving both continuous (real-valued) and discrete parameters.

Algorithms for unconstrained nonlinear optimisation often have


a basic algorithmic structure (Algorithm 25.1) that mirrors that
of the linear solvers discussed in Chapter III. They iteratively

produce a sequence of points⁴ $x_i$, $i = 0, \dots, M$, that should
ideally converge towards $x^*$ in a robust and fast way. Here
"robust" may mean that the sequence will converge from more
or less any choice of the starting point $x_0$. "Fast" means that
either the sequence of residuals of the estimates $\{\|x_i - x^*\|\}_{i \geq 0}$
or the sequence of function values $\{f(x_i) - f(x^*)\}_{i \geq 0}$ converge
to 0 at some high rate.⁵

4 We will make the philosophical leap to call these iterates "estimates". That is how they are interpreted by practitioners (who can never run an optimiser to perfect convergence), and the classic convergence analysis also supports this interpretation.

5 Several different types of rates are used in optimisation when talking about how fast a sequence $\{r_i\}_{i \in \mathbb{N}}$ converges to zero. Stephen Wright once summarised them thus:
• A sublinear rate, in the optimisation context, means that $r_i \to 0$, but $r_{i+1}/r_i \to 1$. An example is the decrease $r_i \leq C/i$ for some constant $C$.
• A linear or geometric (sometimes also exponential) rate means that $r_{i+1} \leq d\, r_i$ for a constant $0 < d < 1$. Thus $r_i \leq C d^i$ for some constant $C$.
• A super-linear rate means $r_{i+1}/r_i \to 0$. In other words, a geometric decrease with decreasing constant $d \to 0$. We will see that this rate can only be achieved with comparably elaborate algorithms.
• A quadratic rate means that $r_{i+1} \leq d\, r_i^2$ for some constant $d$ (which may even be larger than 1 and still allow convergence if $r_0$ is sufficiently small!). This is the rate of Newton's method. Achieving it generally requires access to the Hessian function $B$.
In a quadratic decrease, the number of leading digits in $x_i$ that match those of $x^*$ doubles at each iteration. The consensus is that, in high-dimensional problems, this should be fast enough for anybody.

For many classic local optimisers, each iteration from $x_i$ to
$x_{i+1}$ consists of the same two principal steps as for the linear
solvers defined in Algorithm 17.2:

Decide on a search direction $d_i \in \mathbb{R}^N$, meaning that the next
iterate will have the form $x_{i+1} = x_i + \alpha_i d_i$ with $\alpha_i \in \mathbb{R}$. This
step usually involves at least one call to the "black boxes"
$f$ and/or $\nabla f$. Several options for the choice of $d_i$ will be
discussed in §28; but the naive choice is to set $d_i = -\nabla f(x_i)$.
This is known as gradient or steepest descent and is such an
elementary idea that there may be no meaningful citation
for its invention.⁶ Exercise 25.2 shows that gradient descent
actually has certain pathological properties. Nevertheless, it
remains a popular algorithm.

6 Cauchy may be a contender, in 1847. See Lemaréchal (2012).

Fix the step-size $\alpha_i$. This may be done by a closed-form guess
about the optimal step-size. But if this step is performed by
evaluating $f$ and/or $\nabla f$ for different values of $\alpha$ in search of a
"good" location close to the optimum along this direction, it is
called a line search. If the line search finds the global optimum
(along this univariate direction) $\alpha^* = \operatorname{arg\,min}_\alpha f(x_i + \alpha d_i)$,
it will be called perfect. This is an analytical device - for
nonlinear problems, perfect line searches do not exist in
practice, although a good line search on a convex problem
can come close.

Gradient descent, both with an efficient choice of step size and a
fixed step size, offers important reference points. The following
two results⁷ show that gradient descent with exact line searches
has a linear convergence rate.

7 Thms. 3.3 and 3.4 in Nocedal and Wright (1999). The proof for the first one is in Luenberger (1984).
Theorem 25.1 (Convergence of noise-free, exact line search,
steepest descent on quadratic functions). Consider the strongly
convex quadratic function (already studied in Chapter III)

$$f(x) = \tfrac{1}{2}\, x^\top B x - b^\top x, \qquad B \in \mathbb{R}^{N \times N},\ b \in \mathbb{R}^N,$$

with symmetric positive definite Hessian $B$. This function has a global
minimum at $x^* = B^{-1} b$. Let $0 < \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_N$ be the
eigenvalues of $B$. Then the sequence of iterates of steepest descent with
exact line searches⁸

$$x_{i+1} = x_i - \frac{\nabla f_i^\top \nabla f_i}{\nabla f_i^\top B \nabla f_i}\, \nabla f_i$$

satisfies

$$\|x_{i+1} - x^*\|_B \leq \Big(\frac{\lambda_N - \lambda_1}{\lambda_N + \lambda_1}\Big)\, \|x_i - x^*\|_B. \tag{25.1}$$

8 The optimal step length used here was derived in Eq. (17.3). We use the shorthand $\nabla f(x_i) = \nabla f_i$.

Two special cases to consider: in an isometric problem (all eigenvalues
equal to each other), the iteration converges in one step.
On the other hand, if the condition number $\kappa(B) = \lambda_N/\lambda_1$ is
very large, then this bound is essentially vacuous, because the
constant on the right-hand side of Eq. (25.1) is almost unit.
Although these are just bounds, they reflect the practical behaviour
of gradient descent well. The following theorem shows
that these properties translate relatively directly to the nonlinear
case.

Theorem 25.3 (Convergence of steepest descent on general
functions). Consider a general, twice continuously differentiable
$f : \mathbb{R}^N \to \mathbb{R}$, and assume that the iterates generated by the steepest-descent
method with exact steps converge to a point $x^*$ at which $B(x^*)$
is symmetric positive definite with eigenvalues $0 < \lambda_1 \leq \dots \leq \lambda_N$.
Let $c$ be any number with

$$c \in \Big(\frac{\lambda_N - \lambda_1}{\lambda_N + \lambda_1},\ 1\Big).$$

Then for all $i$ sufficiently large, it holds that

$$f(x_{i+1}) - f(x^*) \leq c^2\, \big(f(x_i) - f(x^*)\big).$$

Exercise 25.2 (easy, instructive, solution on p. 363). Theorem 25.1 characterises the convergence of gradient descent if optimal step sizes can be found. Many practitioners actually just set the step size to a fixed constant, like $\alpha = 0.1$. This exercise may help gain an intuition for why this is problematic, and hides some underlying assumptions. Two wheeled robots are standing on top of a steep hill. Their task is to drive down the hill by performing "gradient descent". At every time step $i$, standing at location $x_i$, they evaluate their potential energy density $f(x_i) = E(x_i)/m = g \cdot h(x_i)$, and its gradient $\nabla f(x_i)$, then move a step of size $\alpha = 0.1$ to the new location $x_{i+1} = x_i - \alpha \nabla f(x_i)$. Here, $x_i \in \mathbb{R}^2$ is the robot's 2D GPS co-ordinate, $g$ is the free-fall acceleration, and $h(x_i)$ is the height of the ground at $x_i$. ($m$ is the robot's mass; we use energy density rather than energy, so that this mass cancels out of all calculations.) The first robot uses SI units. For it, the starting point is $h(x_0) = 456\,\mathrm{m}$ above sea level, $g = 9.81\,\mathrm{m/s^2}$, and the initial gradient, at $x_0$, is $\nabla f(x_0) = 5\,\mathrm{J/(kg \cdot m)}$. The other robot uses Imperial units. Hence, $h(x_0) = 1496\,\mathrm{ft}$ above sea, $g = 32.19\,\mathrm{ft/s^2}$, and the initial gradient is $\nabla f(x_0) = 1.03 \cdot 10^{-5}\,\mathrm{Cal/(oz \cdot ft)}$. How far will either robot move in its first step, and (assuming $h(x)$ is locally well-described by a linear function), what is the new energy at $x_1$?
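The contraction factor of Theorem 25.1 can also be observed directly. A minimal sketch (diagonal $B$ chosen purely for convenience; any spd matrix works):

```python
import numpy as np

lam = np.linspace(1.0, 10.0, 20)             # eigenvalues of B, kappa = 10
B = np.diag(lam)
rng = np.random.default_rng(2)
b = rng.standard_normal(20)
x_star = b / lam                             # minimum of the quadratic
bound = (lam[-1] - lam[0]) / (lam[-1] + lam[0])

def B_norm(v):
    return np.sqrt(v @ B @ v)

x = np.zeros(20)
for i in range(10):
    g = B @ x - b                            # gradient
    x_new = x - (g @ g) / (g @ B @ g) * g    # exact line search step
    # Eq. (25.1) holds at every step:
    print(B_norm(x_new - x_star) / B_norm(x - x_star) <= bound + 1e-12)
    x = x_new
```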
26
Step-Size Selection - a Case Study

In the linear problem, solving for $x$ in $Ax = b$, the line search
step turned out to be trivial - once the observation $z_i = A d_i$ was
computed, $\alpha_i$ could be computed in a single analytic step involving
an inner product between two vectors. This does not work
in the nonlinear setting; but line searches are still considered an
easier problem than the "outer loop" deciding on the direction
$d_i$. In fact, line search methods are often presented as essentially
a solved problem,¹ while the choice of search direction is still
the subject of contemporary research. But in stochastic optimisation
problems (definition below), line searches are anything
but straightforward. And, as Theorem 25.3 and Exercise 25.2
show, they play a crucial role in making gradient descent a practical
algorithm. Thus, this chapter will begin with an extensive
discussion of line searches, and only then move to the outer
loop.

1 Nocedal and Wright (1999) discuss line searches at the start of their book, in Chapters 3.1 and 3.5. Bertsekas (1999) even relegates them to Appendix C.

The probabilistic treatment of line searches will provide a concrete
example of a situation in which classic numerical routines
do not meet the needs of contemporary numerical problems,
and the probabilistic viewpoint provides a clean description that
does not add undue overhead. We will end up building a probabilistic
line search by carefully replacing all non-probabilistic
aspects of a classic routine with their probabilistic equivalent,
extending wherever necessary.

► 26.1 Numerical Uncertainty is a Feature of Data Science

We briefly digress to explain why stochastic problems are of in­


terest in the first place. Whenever we have made connections to
classic numerical algorithms so far, we ended up re-discovering
them as mean or maximum a posteriori point estimates for

some unknown quantity z arising from computed numbers c


under a model M by Bayes’ theorem with a Dirac likelihood
and a Gaussian prior. That is, from a posterior of the form

$$p(z \mid c, \mathcal{M}) = \frac{p(z \mid \mathcal{M})\, p(c \mid z, \mathcal{M})}{p(c \mid \mathcal{M})} = \frac{\mathcal{N}(z;\ \mu_\mathcal{M}, \Sigma_\mathcal{M}) \cdot \delta(c - P_\mathcal{M} z)}{\mathcal{N}(c;\ P_\mathcal{M} \mu_\mathcal{M},\ P_\mathcal{M} \Sigma_\mathcal{M} P_\mathcal{M}^\top)}, \tag{26.1}$$

where $P_\mathcal{M}$ is some linear operator, and $\mu_\mathcal{M}, \Sigma_\mathcal{M}$ are some, potentially
nonparametric, prior parameters. The Dirac distribution
$\delta$ in this expression encodes the underlying expectation that a
computer, when told to compute $c = Pz$, actually computes that
number, to machine precision. That assumption is quite nat­
ural, and fundamental to classic numerical analysis. Machine
precision is not infinite, and even small disturbances can be
amplified by subsequent operations to cause significant overall
error. Studying such errors is the point of traditional numerical
stability analysis. When this approach is applicable, it amounts
to allowing the numerical method to pretend that the compu­
tational error vanishes. This simplifies the construction. If only
point estimates are needed, Bayes’ theorem can then often be
side-stepped by easier derivations.

However, some computational tasks of contemporary interest
feature errors of a more drastic nature. In such cases, the Dirac
distribution in Eq. (26.1) should be replaced with an explicit
likelihood. Contemporary "Big Data" tasks are one area where
such noisy function evaluations are of central importance. Optimisation
problems in machine learning typically arise when
fitting a set of parameters $x$ to a data set $\Xi := [\xi_1, \dots, \xi_K]$ by
minimising a loss or risk function $f(x) = L(\Xi, x)$. This is known
as (regularised) empirical risk minimisation.² The data $\Xi$ are external,
empirical quantities and should not be confused with the
numbers computed by the numerical method. Often, the loss
can be written generally as a sum over the data-points and a
regulariser $r$ that does not depend on the data (the regulariser
may also be nil)

$$L(\Xi, x) = r(x) + \frac{1}{K} \sum_{k=1}^{K} \ell(\xi_k, x), \tag{26.2}$$

with loss terms $\ell(\xi_k, x)$ involving only one individual datum.
Examples of this setting include most popular deep-learning
models, support vector machines, basic least-squares and logistic
regressors, and maximum a posteriori inference in any
probabilistic models in which the generative probability $p(\Xi \mid x)$
for the data is an exchangeable measure.

2 This loss function is itself a corrupted estimate of the actual, inaccessible target function: the population risk $L_\text{pop.}(x) = r(x) + \mathbb{E}_{p_\text{pop.}}(\ell(\xi, x)) = r(x) + \int \ell(\xi, x)\, \mathrm{d}p_\text{pop.}(\xi)$, where $p_\text{pop.}$ is the unknown distribution from which the physical quantities $\xi$ are drawn in the wild. This will become relevant later in §27.2.

If $K$ is a large number - if we have Big Data - then it may be
impossible or at least inconvenient to evaluate the entire sum
every time the optimisation algorithm asks for its value or its
gradient. Instead, one may decide to select some "representer"
points from $\Xi$ to build what is known as a batch $\Xi_J := [\xi_{J_1}, \dots, \xi_{J_M}]$
for some index set $J$ of much reduced size $M \ll K$. This
can be used to compute an approximation

$$\hat L(x) = r(x) + \frac{1}{M} \sum_{m=1}^{M} \ell(\xi_{J_m}, x) \approx L(\Xi, x). \tag{26.3}$$

A typical approach is to draw $J \subset [1, K]$ at random in an
i.i.d. fashion, and to re-draw the batch every single time the
optimiser asks for a function or gradient value. In that case,
the smaller sum is an unbiased estimator for the larger one
and, by the central limit theorem, $\hat L$ is approximately Gaussian
distributed around $L$:

$$p\big(\hat L(x) \mid L(\Xi, x)\big) \approx \mathcal{N}\big(\hat L(x);\ L(\Xi, x),\ \sigma^2\big), \tag{26.4}$$

with variance $\sigma^2 \propto 1/M$. From the point of view of the optimiser,
evaluations at different $x$ are then disturbed by independent
Gaussian noise, because the batches are re-drawn every time the
optimiser requests a value of $L$ or its gradient. Batching thus
effectively provides a knob, which the user or the algorithm may
twist to trade off computational precision against computational
cost.³ Because data set sizes $K$ tend to be large and low-level
cache sizes are limited, the balance in this decision will often be
dominated by cost considerations. In deep-learning problems,
even signal-to-noise ratios well below one are quite common.

The Gaussian noise introduced by batching explicitly introduces
a likelihood term into the computation, and thus naturally suggests
a probabilistic treatment. Knowing that classic numerical
methods are associated with Dirac likelihoods, it is not surprising
that classic methods for optimisation, in particular the
efficient ones, tend to struggle with the noisy setting.

Exercise 26.1 (easy, highly recommended if you have never done this before. Solution on p. 364). Consider the basic Gaussian process regression model (using the standard notation for such tasks): Data $Y := [y_1, \dots, y_N] \subset \mathbb{R}$ is assumed to be produced by a latent function $f : \mathbb{X} \to \mathbb{R}$ at locations $X := [x_1, \dots, x_N]$ according to the i.i.d. Gaussian likelihood $p(Y \mid f) = \mathcal{N}(Y;\ f_X, \sigma^2 I_N)$, with $\sigma \in \mathbb{R}_+$ and $f_X := [f(x_1), \dots, f(x_N)]$. Assume a general Gaussian process prior $p(f) = \mathcal{GP}(f;\ \mu, k)$ on the function $f$. Show that the posterior mean $\mathbb{E}_{p(f \mid Y, X)}(f_X)$ for the function values at $X$ is the solution to an optimisation problem involving a loss function with the form given in Eq. (26.2), and can thus be computed with the methods discussed in this chapter (to get the notation of that equation, replace $Y \leftrightarrow \Xi$, $y_i \leftrightarrow \xi_i$, and $f_X \leftrightarrow x$). What effect does the choice of prior $p(f)$ have on this structure? Which other priors would retain the structure of Eq. (26.2)? What about the likelihood $p(Y \mid f)$? Which structure does it need to have to keep the connection to Eq. (26.2)?

3 Eq. (26.4) really is an entirely quantitative, identified object that can be explicitly used in a probabilistic numerical method. Assume that $K \gg M$, and, for simplicity, that $J$ is drawn uniformly from $[1, K]$. Further assume that the data $\xi_i$ are drawn i.i.d. from some measure $p$ - e.g. from $p \propto \exp(\ell(\xi \mid x^*))$, recalling $\ell$ is the loss of a single datum and where $x^*$ is the "correct" value of $x$. Then $\sigma^2 = \operatorname{var}(\ell(x))/M$, where $\operatorname{var}(\ell(x)) = \mathbb{E}_p(\ell^2(\xi, x)) - \mathbb{E}_p(\ell(\xi, x))^2$. Even if $\operatorname{var}(\ell(x))$ cannot be analytically computed because $p$ is unknown, it can be estimated empirically during the batching process at low cost overhead, using the statistic $\sum_j \ell^2(\xi_{J_j}, x)$. For a while, it was difficult to access these quantities in standard deep learning libraries, but following work like that of Dangel, Kunstner, and Hennig (2020), even the established libraries are beginning to make full-batch quantities available.
j
► 26.2 Classic Line Searches

In the remainder of this section, we will study how significant
computational noise can invalidate a classic numerical routine,
and how this problem can be addressed from the probabilistic
perspective. The object of interest will be the line search, the

internal sub-routine that determines the step-sizes of optimisers.


We start with a review of the classic solution.

As Exercise 25.2 shows, the performance of optimisation methods
that iterate in steps $x_{i+1} = x_i + \alpha_i d_i$ does not just depend
on the choice of search direction $d_i$, but also crucially on the
step size $\alpha_i \in \mathbb{R}_+$. This is not just true for first-order methods
like gradient descent, but also for more advanced algorithms.
In linear problems, the step size can be computed analytically
(see Eq. (17.3)); in nonlinear problems this is not generally possible.
The task of finding a "good" step size is addressed by a
line search method. Line searches solve a univariate optimisation
problem. To stress this crucial point, we adopt a simplifying,
if slightly abusive, notation: we will overload the symbol $f$ to
refer both to $f(x)$ with a multivariate input $x \in \mathbb{R}^N$, and the
univariate function $f(\alpha) := f(x_i + \alpha d_i)$. The derivative of the
univariate function is the projected gradient

$$f'(\alpha) := \frac{\partial f(x_i + \alpha d_i)}{\partial \alpha} = d_i^\top \nabla f(x_i + \alpha d_i) \in \mathbb{R}.$$

Note that both $f(\alpha)$ and $f'(\alpha)$ are scalars. The entirety of this
section will be concerned with the problem of finding good
values for $\alpha$. This all happens entirely within one "inner loop",
with almost no propagation of state from one line search to
another. So we drop the subscript $i$. This is a crucial point that
is often missed at first. It means that line searches operate in
a rather simple environment, and their computational cost is
rather small. For intuition about the following results, it may
be helpful to keep in mind that a typical line search performs
between one and, rarely, 10 evaluations of $f(\alpha)$ and $f'(\alpha)$, respectively.⁴

4 We here assume that it is possible to simultaneously evaluate both the function $f$ and its gradient $f'$. This is usually the case in high-dimensional, truly "numerical" optimisation tasks. The theory of automatic differentiation (e.g. Griewank (2000)) guarantees that gradient evaluations can always be computed with cost comparable to that of a function evaluation. But there are some situations in which one of the two may be difficult to access, for example because it has different numerical stability. The probabilistic line search described below easily generalises to settings in which only one of the two, or in fact any set of linear projections of the objective function, can be computed.

Because line searches are relatively simple yet important algorithms,
we can study them in detail. The following pages start by
constructing a non-probabilistic line search that is largely based
on versions found in practical software libraries and textbooks.
They are followed by their probabilistic extension.

► 26.3 The Wolfe Termination Conditions

Building a line search requires addressing two problems: where
to evaluate the objective (the search), and when to stop it (the termination).
We will start with the latter. Intuitively, it is not necessary
for these inner-loop methods to actually find a true local

minimum $\alpha^* = \operatorname{arg\,min}_\alpha f(x_i + \alpha d_i)$ of the objective - doing so
may be very expensive, and all it does is move the iterate $x_{i+1}$
forward or backward by a tiny amount, a precision that may
be "washed out" quickly in the optimiser's subsequent steps.
Instead, the line search should look for a point that is "good
enough". Several authors have proposed ways to quantify this
notion of sufficient improvement.⁵ The most widely quoted formulation
is by Philip Wolfe (1969). The Wolfe conditions consider
a step length $\alpha$ acceptable if the following two inequalities hold,
using two constants $0 < c_1 < c_2 < 1$:

$$f(\alpha) \leq f(0) + c_1 \alpha f'(0), \tag{26.5}$$
$$f'(\alpha) \geq c_2 f'(0). \tag{26.6}$$

5 See e.g. §8.3 in Ortega and Rheinboldt (1970). Alternatively, see Warth and Werner (1977), or Werner (1978).

Figure 26.1 gives an intuition. The first condition (26.5) is also
known as the Armijo,⁶ or sufficient decrease condition. It requires
that the function lie sufficiently below the initial function value.
For $c_1 = 0$, it suffices if $f(\alpha) \leq f(0)$, i.e. the objective is just
below the initial value. Positive values of $c_1$ impose a linear
decrease.

6 Armijo (1966)

The second condition (26.6) is called the curvature condition
because it concerns a change in the gradient, requiring that the
gradient increase beyond the initial one. These conditions are
also, more specifically, known as the weak form of the Wolfe
conditions. The strong Wolfe conditions add an upper bound on
the gradient. They are

$$f(\alpha) \leq f(0) + c_1 \alpha f'(0), \tag{26.7}$$
$$|f'(\alpha)| \leq c_2 |f'(0)|. \tag{26.8}$$

(Eq. (26.7) is identical to Eq. (26.5); it is just reprinted for easy reference.)

Exercise 26.2 (check whether you follow). Why is it necessary to set $c_2 > c_1$?

Figure 26.1: The Wolfe conditions for line search termination. Optimisation utility $f(\alpha)$ and $f'(\alpha)$ as black curves (top and bottom figure, respectively). The true optimal step $\alpha^*$ is marked by a black circle. The sufficient-decrease condition excludes the grey region in the top figure, constraining the acceptable space to $[0, \alpha_2]$. Adding the weak curvature condition additionally excludes the lower grey region in the bottom plot (thus restricting the acceptable space to $[\alpha_1, \alpha_2]$). The strong extension also excludes the top region (restricting to $[\alpha_1, \min\{\alpha_2, \alpha_3\}]$). All points in between are considered acceptable by the Wolfe conditions. In this example, the true extremum lies within that acceptable region, but this is not guaranteed by the conditions. (For this plot, the parameters were set to $c_1 = 0.4$, $c_2 = 0.5$ to get an instructive plot. These are not particularly smart choices for practical problems; see the end of §26.3.)

For every continuously differentiable $f$ that is bounded below,
there exist step lengths that satisfy the conditions.⁷

7 Nocedal and Wright (1999), Lemma 3.1

As can be deduced from Figure 26.1, the Wolfe conditions do
not guarantee that the true optimum lies within their bracket of
acceptable step sizes (not even if the objective is convex). Nor do
they strictly guarantee that the optimal step size is particularly
close to the chosen one. However, they obviously prevent certain
kinds of undesirable behaviour in the optimiser, like increasing
function values from one step to another, or step size choices
that are drastically too small. When used in combination with
algorithms choosing the search directions $d_i$ in the outer loop,
the Wolfe conditions can also provide some guarantees for
$d_{i+1}$. For example, for the BFGS method discussed in §28, the
curvature condition (26.6) guarantees that $d_{i+1}$ is a descent
direction (see more in that section). Good choices for the two
parameters $c_1, c_2$ depend a little bit on the application and
the outer-loop optimisation method. But practical line searches
often use lenient choices⁸ with a small $c_1$, e.g. $c_1 = 10^{-4}$, and
a large $c_2$, e.g. $c_2 = 0.9$. Under these settings, the conditions
essentially just require a decrease in both function value and
absolute gradient, no matter how small.

8 Nocedal and Wright (1999), §3.1
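In code, checking a candidate step against the conditions is one comparison per inequality. A sketch, with the lenient default constants mentioned above:

```python
def wolfe_conditions(f0, df0, f_a, df_a, alpha, c1=1e-4, c2=0.9, strong=True):
    # f0, df0: value and projected gradient at alpha = 0;
    # f_a, df_a: the same quantities at the candidate step length alpha.
    sufficient_decrease = f_a <= f0 + c1 * alpha * df0    # Eqs. (26.5)/(26.7)
    if strong:
        curvature = abs(df_a) <= c2 * abs(df0)            # Eq. (26.8)
    else:
        curvature = df_a >= c2 * df0                      # Eq. (26.6)
    return sufficient_decrease and curvature
```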

► 26.4 Search by Spline Interpolation

The other ingredient for a classic line search is an action rule
that determines where the algorithm decides to probe $f$ and
$f'$. For this problem, too, many different paradigms have been
proposed and studied. They are guided by various motivations,
such as robustness against "roughness" in the objective, speed of
convergence, and others.⁹ Here, we will only consider one such
policy: interpolation with cubic splines. It is used in popular
optimisation libraries because it is quite efficient. It is also convenient
in our setting because of the close connection between
cubic splines and Gaussian process regression (see §5.4).

9 Nocedal and Wright (1999), §3.3

Our interpolating line search will iteratively construct cubic
local approximations of the univariate $f(\alpha)$ using collected values
of $f$ and $f'$, at each iteration stepping to either a local
minimum or an extrapolation point. Cubic spline interpolation
can be phrased in various different ways, stressing different
aspects.¹⁰ In §26.5 we will re-encounter a probabilistic interpretation
for them. Here is a more traditional description.¹¹
Figure 26.2 gives the pictorial story.

10 Wahba (1990)
11 Adapted from the presentation in Nocedal and Wright (1999), §3.5.

At the beginning of the line search, we only have the two
numbers $f_0 := f(0)$, and $f_0' := f'(0)$ available. So our best

guess for f (see below for a more precise motivation) is a linear


approximation
/(a) = f0 + af0.
Since this linear function has no local minima, we have to decide
on a first step length a 1 ad hoc. A perhaps natural choice is
to set a 1 to the step length at which the preceding line search
terminated. We compute f1 := f (a 1) and f1 := f1 (a 1). If (f1,f1)
satisfy the Wolfe conditions, then the line search ends.
If the Wolfe conditions are not yet satisfied, the next step
depends on the gradient at a 1. If f1 < 0, then the initial step
was too short, and, assuming the objective is bounded below,
there must be a local minimum to “the right” of a1. This case
is shown in the left panels of Figure 26.2. In this situation, we
perform another extrapolation step. For example, we could set
a2 = 2a 1. On the other hand, if f1 > 0, then we know that
there must be a local minimum in (a0,a1). Because we have
four numbers, f0, f1, f0, f1, available, there is a unique cubic
polynomial interpolating them, given by

f (a) = aa3 + ba2 + af0 + f0, (26.9)

where a, b are the unique solution to the constraints f (a 1) = f1


and f' (a 1) = f1. They are given by (deliberately not performing
some obvious simplifications so as to show the path to the
solution transparently)

a 1 2a 1 3
— a2 f1 — a 1 f0 — f0 (26.10)
b 3
a4(2 - ) -a2 a1 f1— f0

The derivative f̂' is a quadratic function, which has a unique minimum at

$$\alpha_2 = \frac{-b + \sqrt{b^2 - 3a f_0'}}{3a}.$$

Figure 26.2: Cubic spline interpolation for searching along the line. Each (top and bottom) pair of frames shows the same plot as in Figure 26.1, showing progress of the search and interpolation steps. Left: The first evaluation only allows a linear extrapolation, which requires an initial ad hoc extrapolation step. Centre: The first extrapolation step (to the second evaluation) happened to be too small, so another extrapolation step is required. Since the "natural" cubic spline extrapolation is linear, the next step is again based on an ad hoc extension of the initial step. Right: The third evaluation finally brackets a local minimum. An interpolation step follows, using a cubic spline interpolant. The next step will be at the local minimum of the interpolant. It so happens that this point (empty square) will provide an evaluation pair that satisfies the Wolfe conditions.

Figure 26.3: Spline interpolation can be brittle to noise. Figure adapted from Mahsereci and Hennig (2017) with permission. The left panel shows four exact evaluations of function values and derivatives, interpolated by the natural cubic spline, along with the Gaussian process posterior arising from the Wiener process prior. In this noise-free case, the GP posterior mean and the spline are identical. In the right plot, Gaussian noise was added to all four function values and gradients. The original interpolant is shown in white for reference. The noise causes oscillations in the spline interpolant (dashed black) that, in particular, cause new local minima. The GP posterior mean (solid black) reverts toward the prior mean, yielding a smoother interpolant.
This will be the next evaluation point. If f(α_2), f'(α_2) satisfy the Wolfe conditions, the line search ends there. Otherwise, the interval is bisected: depending on the sign of f'(α_2), we know that the local minimum has to lie either in (α_0, α_2) or in (α_2, α_1), and we can build a new cubic interpolation within that region, using the four numbers constraining it on either side.
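To make the interpolation step concrete, here is a minimal sketch of Eqs. (26.9) and (26.10) and of the minimiser of the resulting cubic. It is not the book's reference implementation; all names are ours.

```python
# A minimal sketch of one spline-interpolation step, Eqs. (26.9)/(26.10).
# Assumes a != 0, i.e. the interpolant is a genuine cubic.
import numpy as np

def cubic_coefficients(alpha1, f0, df0, f1, df1):
    """Coefficients a, b of the cubic through (0, f0, df0), (alpha1, f1, df1)."""
    A = np.array([[alpha1**3, alpha1**2],
                  [3 * alpha1**2, 2 * alpha1]])
    rhs = np.array([f1 - f0 - df0 * alpha1, df1 - df0])
    a, b = np.linalg.solve(A, rhs)   # the linear system behind Eq. (26.10)
    return a, b

def cubic_minimum(a, b, df0):
    """Local minimiser of a*t^3 + b*t^2 + df0*t + f0: the root of the
    quadratic derivative at which the second derivative is positive."""
    disc = b**2 - 3 * a * df0
    if disc < 0:
        return None                  # no real stationary point
    return (-b + np.sqrt(disc)) / (3 * a)
```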

This sequence of interval bisection by interpolation, terminated by the Wolfe conditions, a popular standard in nonlinear optimisation, provides an example of the general problem, discussed above, of existing numerical methods that do not generalise well to situations with significant uncertainty or noise (see Figure 26.3). Let us assume that approximate evaluations y'(α) ≈ f'(α) of the gradient are available only with Gaussian observation likelihood

$$p(y'(\alpha) \mid f'(\alpha)) = \mathcal{N}\big(y'(\alpha);\, f'(\alpha),\, \sigma^2\big),$$

where the standard deviation σ is about as large as the starting value f'(0) (this situation, a signal-to-noise ratio of 1, is not unrealistic in Big Data problems). Then, we may observe y'(α_1) > 0, and decide to start the interpolation phase of the line search, even though the true value f'(α_1) is negative. Once such an erroneous bisection decision is taken, the line search is headed down a dead-end street, from which it cannot recover, at least not in its basic form. The Wolfe criteria, too, may be erroneously evaluated as satisfied even though they are not, or the other way round.

► 26.5 Probabilistic Line Searches

This section constructs an "uncertain" extension to the line search paradigm that is more robust to evaluation noise in both the objective and its gradient. The content is based on Mahsereci and Hennig (2015).12 The resulting probabilistic line search allows efficient local step-size selection in noisy settings. In doing so, it also generalises and formalises the notion of a line search as a sequence of decision problems. An implementation can be found online.13

12 For a longer version with many more details, see Mahsereci and Hennig (2017).
13 https://github.com/ProbabilisticNumerics/probabilistic_line_search
A word of caution: The line search developed below provides a rigorous solution to the instability of its classic progenitor. It performs robustly on large multi-layer perceptrons and logistic regression models on comparably small data sets such as MNIST or CIFAR10 and stochastic gradients of non-trivial noise (Figure 26.6). On some other network architectures, and on large data sets, it is still outperformed by standard methods and hand-tuned parameters. Some hypotheses for why this may be the case can be found in Schneider, Dangel, and Hennig (2021). In any case, as presented here, the method does provide a "textbook" example for how deterministic/nonprobabilistic numerical routines can be methodologically generalised to the stochastic/probabilistic setting:

We will identify instances of point estimation and deterministic decision making in the classic deterministic line search developed above, and explicitly replace them with probabilistic inference and uncertain decisions. In total we have to make three such changes:

• the spline interpolation rule has to be extended to noisy observations;

• the decision rule for where to probe the objective function has to reflect the fact that no region of inputs can be "bisected away" deterministically;

• the Boolean Wolfe termination criteria have to be replaced with a probability of having found a suitable input point.

It will turn out that all three of these issues can be addressed
jointly, by casting spline interpolation as the noise-free limit of
Gaussian process regression.

> 26.5.1 Cubic Spline Interpolation Replaced by Gaussian Regression

Recall from §5.4, specifically Eq. (5.27), that the total solution of the stochastic differential equation

$$d\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} f(\alpha) \\ f'(\alpha) \end{bmatrix} d\alpha + \begin{bmatrix} 0 \\ \theta \end{bmatrix} dw_t, \qquad (26.11)$$

with initial value f(0) = f'(0) = 0 is given by the Gaussian process with kernel

$$\operatorname{cov}(f(a), f(b)) = \theta^2\left(\tfrac{1}{3}\min^3(a, b) + \tfrac{1}{2}\,|a - b|\,\min^2(a, b)\right).$$


We will confirm below that, like the kernel itself, posterior means arising from this prior and a Gaussian likelihood are piecewise cubic polynomials that, in the limit of noise-free observations, revert to the cubic spline interpolant.

Thus, we can generalise the spline interpolation rule used in the non-probabilistic line search to the noisy setting, using Gaussian process regression as the framework. Since modelling Gaussian observation noise within the Gaussian process regression model adds essentially no computational overhead compared to the noise-free case, the resulting probabilistic line search will have computational cost similar to its classic counterpart.

We only have to take some care with the likelihood. The following model captures the situation described in Eq. (26.2).
Model 26.3. Assume the Gaussian likelihood

$$p(Y \mid f) = \mathcal{N}\big(Y;\, [f(\alpha_1), \dots, f(\alpha_K), f'(\alpha_1), \dots, f'(\alpha_K)]^\top,\, \Lambda\big), \qquad (26.12)$$

for Y ∈ R^{2K}, with a symmetric positive definite noise covariance matrix Λ ∈ R^{2K×2K}.14 Further assume that Λ has the following structure:

$$\Lambda_{ij} = \begin{cases} \sigma_{f_i}^2 & \text{if } i = j,\ i \le K, \\ \sigma_{f'_{i-K}}^2 & \text{if } i = j,\ i > K, \\ \rho_i\,\sigma_{f_i}\sigma_{f'_i} & \text{if } j = i + K, \\ \rho_j\,\sigma_{f_j}\sigma_{f'_j} & \text{if } i = j + K, \\ 0 & \text{else.} \end{cases}$$

14 So Y is a vector containing the noisy function values in the first half, then the derivatives stacked below. We have here assumed that we always evaluate f and f' together, so the number K of observed function values and gradients is the same. Of course this could be changed, at the cost of a slightly more involved notation.

This form allows the noise on function values and gradients to co-vary at one location α_i with correlation coefficient ρ_i, but the noise at two different locations is assumed to be independent. This assumption is correct in the case of Eq. (26.3) if the batches are re-drawn, independently, for each pair of evaluations. We will use the shorthand A = [α_0, ..., α_K] for the set of evaluation nodes.

Figure 26.4: The integrated Wiener process prior yields the cubic spline interpolant as the posterior mean if the observation noise Λ vanishes. This figure shows the same data as Figure 26.2, with the Gaussian process prior (left), and posteriors after one (centre) and two (right) evaluations of the objective f and its gradient f'. Shown are the posterior mean and two marginal standard-deviations (solid lines), three samples (dashed) and the marginal density (shading). The marginal distribution on f' is a standard Wiener process (note rough, Brownian motion, samples).

For the prior, we extend the introduction from the background to deal with the nontrivial observation at the initial value α = 0. The total solution of Eq. (26.11) conditioned on [f(0), f'(0)]^⊤ = [f_0, f'_0]^⊤ is a Gaussian process with mean function15

$$\begin{bmatrix} \mu(\alpha) \\ \mu'(\alpha) \end{bmatrix} = \begin{bmatrix} f_0 + f_0'\,\alpha \\ f_0' \end{bmatrix}$$

15 One way to see this is from Eqs. (5.18) and (5.19), using that F is nilpotent (F · F = 0), thus
$$\exp(F\,\delta t) = \begin{bmatrix} 1 & \delta t \\ 0 & 1 \end{bmatrix}.$$

and covariance function (kernel), using the notation α_∧ := min(α_a, α_b),

$$\begin{bmatrix} k_{\alpha_a\alpha_b} & k^{\partial}_{\alpha_a\alpha_b} \\ {}^{\partial}k_{\alpha_a\alpha_b} & {}^{\partial}k^{\partial}_{\alpha_a\alpha_b} \end{bmatrix} := \begin{bmatrix} \operatorname{cov}(f(\alpha_a), f(\alpha_b)) & \operatorname{cov}(f(\alpha_a), f'(\alpha_b)) \\ \operatorname{cov}(f'(\alpha_a), f(\alpha_b)) & \operatorname{cov}(f'(\alpha_a), f'(\alpha_b)) \end{bmatrix} = \theta^2\begin{bmatrix} \tfrac{1}{3}\alpha_\wedge^3 + \tfrac{1}{2}\,|\alpha_a - \alpha_b|\,\alpha_\wedge^2 & \alpha_a\alpha_\wedge - \tfrac{1}{2}\alpha_\wedge^2 \\ \alpha_b\alpha_\wedge - \tfrac{1}{2}\alpha_\wedge^2 & \alpha_\wedge \end{bmatrix}. \qquad (26.13)$$

Then we can use the standard results from §4.2, and in particular §4.4, to compute a Gaussian process posterior measure with posterior mean function

$$\begin{bmatrix} \nu_\alpha \\ \nu'_\alpha \end{bmatrix} = \begin{bmatrix} \mu_\alpha \\ \mu'_\alpha \end{bmatrix} + \begin{bmatrix} k_{\alpha A} & k^{\partial}_{\alpha A} \\ {}^{\partial}k_{\alpha A} & {}^{\partial}k^{\partial}_{\alpha A} \end{bmatrix}\left(\begin{bmatrix} k_{AA} & k^{\partial}_{AA} \\ {}^{\partial}k_{AA} & {}^{\partial}k^{\partial}_{AA} \end{bmatrix} + \Lambda\right)^{-1}(Y - \mu_A), \qquad (26.14)$$

where μ_A stacks the prior mean evaluated at the nodes in the same order as Y, and covariance function

$$\begin{bmatrix} \kappa_{\alpha_a\alpha_b} & \kappa^{\partial}_{\alpha_a\alpha_b} \\ {}^{\partial}\kappa_{\alpha_a\alpha_b} & {}^{\partial}\kappa^{\partial}_{\alpha_a\alpha_b} \end{bmatrix} = \begin{bmatrix} k_{\alpha_a\alpha_b} & k^{\partial}_{\alpha_a\alpha_b} \\ {}^{\partial}k_{\alpha_a\alpha_b} & {}^{\partial}k^{\partial}_{\alpha_a\alpha_b} \end{bmatrix} - \begin{bmatrix} k_{\alpha_a A} & k^{\partial}_{\alpha_a A} \\ {}^{\partial}k_{\alpha_a A} & {}^{\partial}k^{\partial}_{\alpha_a A} \end{bmatrix}\left(\begin{bmatrix} k_{AA} & k^{\partial}_{AA} \\ {}^{\partial}k_{AA} & {}^{\partial}k^{\partial}_{AA} \end{bmatrix} + \Lambda\right)^{-1}\begin{bmatrix} k_{A\alpha_b} & k^{\partial}_{A\alpha_b} \\ {}^{\partial}k_{A\alpha_b} & {}^{\partial}k^{\partial}_{A\alpha_b} \end{bmatrix}. \qquad (26.15)$$
It is relatively straightforward to see that ν_α = E_{|Y}[f(α)] is a piecewise cubic polynomial, with at most K "seams" (points of non-differentiability in ν') at α_1, ..., α_K (Figure 26.4). By inspection, we see that the posterior mean is a weighted sum of kernels and the prior mean. Since both the prior mean and the

kernels k_{αA} and k^∂_{αA} are (at most) cubic polynomials, so is the posterior mean. Analogously, the posterior mean inherits the points of non-differentiable first derivative from these kernels, at the K points α = α_i, i = 1, ..., K.
In the limit of Λ → 0, i.e. if we assume the evaluations are available without noise, then the cubic splines of (26.9) with parameters given by Eq. (26.10) are the only feasible piecewise cubic estimate for α ∈ [α_0, α_K], since then each spline is restricted by 4 conditions on either end of [α_{i−1}, α_i], i = 1, ..., K. In this sense, the integrated Wiener process prior for f is an extension of cubic spline interpolation to non-zero values of the noise covariance Λ ∈ R^{2K×2K}. This construction allows explicit modelling of the observation noise (Figure 26.5). The additional cost of this probabilistic line search is often negligible, even without using Kalman filtering and smoothing to speed up the computation of the GP posterior from Eqs. (26.14) and (26.15).16

16 A remark on implementation: Since Λ is block diagonal, we can compute the posterior mean functions ν, ν' by filtering (§5). Then each of the K filter and smoother steps involves matrix-matrix multiplications of 2 × 2 matrices, as does Equation (26.10), and the computational cost of this regression procedure is only a constant multiple of that of the classic interpolation routine. However, since a typical line search performs only a very limited number (< 10) of function and gradient evaluations, the concrete implementation practically does not matter, and the direct form of Eq. (26.15) can be used instead. More generally, the cost overhead of this line search is often negligible when compared to the demands of computing even a single gradient if N ≫ K.
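The following sketch spells out this construction in code: GP regression under the once-integrated Wiener process prior, with joint (possibly noisy) observations of f and f', as in Eqs. (26.13)-(26.15). All names are ours, and for clarity it uses the direct cubic-cost form rather than the filtering recursion of the margin note; with Λ = 0 its posterior mean reproduces the cubic spline interpolant.

```python
# Sketch: GP regression under the once-integrated Wiener process prior,
# with joint noisy observations of f and f' (cf. Eqs. (26.13)-(26.15)).
# Direct O(K^3) form; all names are ours.
import numpy as np

def iwp_blocks(s, t, theta=1.0):
    """Kernel blocks (k, kd, dk, dkd) of Eq. (26.13) for scalar s, t."""
    m = min(s, t)
    k = theta**2 * (m**3 / 3.0 + 0.5 * abs(s - t) * m**2)
    kd = theta**2 * (s * m - 0.5 * m**2)    # cov(f(s), f'(t))
    dk = theta**2 * (t * m - 0.5 * m**2)    # cov(f'(s), f(t))
    dkd = theta**2 * m                      # cov(f'(s), f'(t))
    return k, kd, dk, dkd

def posterior_mean(A, Y, Lam, f0, df0, ts, theta=1.0):
    """Posterior mean nu(t) at the points ts, from observations
    Y = [f(a_1..a_K), f'(a_1..a_K)] with noise covariance Lam,
    under the prior conditioned on f(0) = f0, f'(0) = df0."""
    K = len(A)
    G = np.zeros((2 * K, 2 * K))
    for i, s in enumerate(A):
        for j, t in enumerate(A):
            k, kd, dk, dkd = iwp_blocks(s, t, theta)
            G[i, j], G[i, K + j] = k, kd
            G[K + i, j], G[K + i, K + j] = dk, dkd
    mu_A = np.concatenate([f0 + df0 * np.asarray(A), df0 * np.ones(K)])
    w = np.linalg.solve(G + Lam, np.asarray(Y) - mu_A)   # cf. Eq. (26.14)
    nu = []
    for t in ts:
        cross = np.array([iwp_blocks(t, a, theta)[0] for a in A]
                         + [iwp_blocks(t, a, theta)[1] for a in A])
        nu.append(f0 + df0 * t + cross @ w)
    return np.array(nu)
```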
> 26.5.2 Selecting Evaluation Nodes
In the noise-free setting, our line search chose the evaluation node α_i either at an extrapolation point, following some rule for how to grow extrapolation steps, or, once an outer bound for the extrapolation has been established, as the minimum of the spline interpolant. Because it was possible to perform deterministic bisections and the interpolant is a cubic polynomial, there was always exactly one17 such evaluation candidate. But in the noisy setting, no part of the search domain can ever be "bisected away" with absolute certainty. Hence, the probabilistic line search will have to consider a set of candidates for the next evaluation α_i, and use some decision rule to settle on one of them. We will now first design a finite set of candidate points τ := [τ_1, ..., τ_L] ∈ R_+^L; then address the question of how to choose among them.

Exercise 26.4 (advanced, discussion on p. 365). Are there other possible choices to extend the cubic spline model to Gaussian process regression? More precisely, is there a Gaussian process prior GP(f; μ, k) with mean function μ and covariance function k, such that, given observations under the likelihood (26.12), the posterior mean, for Λ → 0, converges to the cubic spline of Eq. (26.9) with parameters (26.10) in the inner intervals α_0 < α < α_K?

17 If the spline interpolant has no internal minimum, then the "right" end of the bisection either has negative gradient (thus we would extrapolate), or the right-most gradient is zero, and thus accepted by the Wolfe conditions.

First, because the presence of noise makes it impossible to
rule out the possibility that a local optimum lies to the right of α_max := max{α_i}_{i=0,...,K}, our list of candidates will always include τ_1 = α_max + r_i, where r_i is some extrapolation step. Just as in the noise-free case, the extrapolation strategy could be chosen more or less aggressively, depending on the problem setting. For high-dimensional optimisation problems in machine learning, a constant (r_i = 1) or linearly growing (r_i = i) policy may be better than the very aggressive exponential growth (r_i = 2^i).
To cover the domain between the previous evaluations, we
will add all the local minima of the posterior mean function ν(α)

in [0, α_max] \ {α_i}_{i=0,...,K}, that is, excluding previous evaluation points. Because ν(α) is still a cubic spline, there are still at most K such points.18 Of course it can happen that some or all of the "cells" between evaluation points do not contain a local extremum in the posterior mean. Then these regions are not included in the list τ of candidates; and if the list only consists of the extrapolation point, then extrapolation is the only acceptable next step.

18 These points can also still be computed in O(K), by evaluating the four derivatives ν(α), ν'(α), ν''(α), ν'''(α) of the posterior mean, at any location τ within the "cell" between two previous evaluations. These all exist, even though there is no associated posterior measure over f'' and f''', because the posterior mean is smoother than posterior samples. Together they fix the four parameters of the cubic spline (Eq. (26.9)) and thus determine the location of the internal extrema.

Given this now constructed list τ of L (with L ≤ K + 1) candidate points, which of these points should be chosen for the next evaluation? There are several possible ways to frame this question into a utility. Intuitively speaking, we are looking for the "most promising" candidate point τ; where a point is promising if it is likely to produce a low function value and a small absolute gradient. A general way to encode this is to propose that we should choose the element α ∈ τ that maximises a utility19

$$\alpha = \arg\max\,\{u(\tau);\ \tau \in \boldsymbol{\tau}\},$$

where u(τ) = u(p(f(τ), f'(τ) | Y)) is a function of the marginal bivariate Gaussian posterior over the two numbers f(τ), f'(τ). For example, it could be chosen as the expected improvement19 in f over a previous best guess η, such as η = min_i{ν(α_i)}, the lowest mean estimate among previously collected nodes. That is, the expected (negative) distance to η, if f(τ) is smaller than η:

$$u_{\mathrm{EI}}(\tau) = \mathbb{E}_{p(f(\tau)\mid Y)}\big(\min\{0,\, \eta - f(\tau)\}\big) = \frac{\eta - \nu(\tau)}{2}\left(1 + \operatorname{erf}\left(\frac{\eta - \nu(\tau)}{\sqrt{2\kappa_{\tau\tau}}}\right)\right) + \sqrt{\frac{\kappa_{\tau\tau}}{2\pi}}\,\exp\left(-\frac{(\eta - \nu(\tau))^2}{2\kappa_{\tau\tau}}\right). \qquad (26.16)$$

19 Jones, Schonlau, and Welch (1998)

Exercise 26.5 (moderate). As an alternative to Eq. (26.16), find the concrete value of a utility that could be called the "expected gradient improvement", given by
$$u_{\mathrm{EGI}} = \mathbb{E}_{p(f'(\tau)\mid Y)}\big(\min\{0,\, |f'(\tau)| - \eta\}\big),$$
where η = min_i |ν'(α_i)| is the lowest absolute expected gradient found so far. (The main challenge in this exercise is the computation of the first indefinite moment integral of a Gaussian distribution.)

Many other options are possible. Chapter V provides an in-depth introduction to such Bayesian Optimisation utilities. For example, the probability for the Wolfe conditions to hold, introduced below, can be used as another choice for u. For a practical implementation, a utility consisting of the product of u_EI and the Wolfe probability defined in Eq. (26.17) has been found to work well.
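For illustration, the closed form on the right-hand side of Eq. (26.16) is a two-line computation given the marginal posterior mean ν(τ) and variance κ_ττ. This is a hedged sketch with our own naming, not code from the published implementation.

```python
# Sketch of the expected-improvement utility, Eq. (26.16); nu and kappa
# are the marginal posterior mean and variance at tau, eta the incumbent.
import numpy as np
from scipy.special import erf

def expected_improvement(nu, kappa, eta):
    z = (eta - nu) / np.sqrt(2.0 * kappa)
    return (0.5 * (eta - nu) * (1.0 + erf(z))
            + np.sqrt(kappa / (2.0 * np.pi))
            * np.exp(-(eta - nu)**2 / (2.0 * kappa)))
```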

> 26.5.3 Probabilistic Wolfe Conditions

Our classic line search would terminate if it found a point α such that the weak (or strong) Wolfe conditions (26.5) and (26.6) (or (26.8)) were satisfied. For reference, here they are again:

$$f(\alpha) \le f(0) + c_1\alpha f'(0) \quad\wedge\quad \begin{cases} f'(\alpha) \ge c_2\, f'(0) & \text{(weak)} \\ |f'(\alpha)| \le c_2\,|f'(0)| & \text{(strong)} \end{cases}$$

Figure 26.5: Gaussian process interpolation allows explicit modelling of observation noise. Same plot as in Figure 26.4 (samples from the belief omitted for readability) but with independent observation noise on all values of f and f'. Neither the posterior mean nor the true function now necessarily pass through the observations. But the posterior mean on f is still a differentiable piecewise cubic polynomial (and the mean on f' a continuous piecewise quadratic). The lower half of the figure shows the belief over the two Wolfe conditions imposed by this Gaussian process posterior on (f, f'). From top to bottom: Beliefs over variables a_α (encoding the Armijo condition), b_α (encoding the curvature condition), their correlation coefficient ρ_α, and the induced probability for the weak or (approximately) strong conditions to hold. While this plot shows continuous curves for all beliefs for intuition, in an implementation, all these variables, even the beliefs on f and f', would only ever be computed at the finitely many evaluation nodes α; and a point would only be accepted if the probability for the Wolfe conditions to hold crosses a threshold after an evaluation has taken place there.

If the optimiser has no access to exact function values and gradients, then it is impossible to ever know for sure that these conditions are fulfilled at any particular point, and we can only hope to compute a probability for them to hold. This probability should arise from (and thus be consistent with) the Gaussian process model for f, f' used for the rest of the line search. In this context, it is a relief to note that the weak form of the Wolfe conditions can be written as a linear projection of f and f'. They amount to requiring that two scalar functions a_α and b_α ∈ R are both positive at α:

$$\begin{bmatrix} a_\alpha \\ b_\alpha \end{bmatrix} = \begin{bmatrix} 1 & c_1\alpha & -1 & 0 \\ 0 & -c_2 & 0 & 1 \end{bmatrix}\begin{bmatrix} f(0) \\ f'(0) \\ f(\alpha) \\ f'(\alpha) \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

Thanks to the closure of the Gaussian family under linear transformations (Eq. (3.4)), the Gaussian process measure on (f, f') directly implies a bivariate Gaussian measure on (a_α, b_α) for all α (see Figure 26.5). If the posterior Gaussian process measure

on (f, f') is

$$p(f, f' \mid Y) = \mathcal{GP}\left(\begin{bmatrix} f \\ f' \end{bmatrix};\ \begin{bmatrix} \nu \\ \nu' \end{bmatrix},\ \begin{bmatrix} \kappa & \kappa^{\partial} \\ {}^{\partial}\kappa & {}^{\partial}\kappa^{\partial} \end{bmatrix}\right),$$

with the posterior mean and covariance functions ν, κ and their derivatives given by Eqs. (26.14) and (26.15), then the induced bivariate Gaussian marginal on [a_α, b_α] is

$$p(a_\alpha, b_\alpha) = \mathcal{N}\left(\begin{bmatrix} a_\alpha \\ b_\alpha \end{bmatrix};\ \begin{bmatrix} m^a_\alpha \\ m^b_\alpha \end{bmatrix},\ \begin{bmatrix} C^{aa}_\alpha & C^{ab}_\alpha \\ C^{ba}_\alpha & C^{bb}_\alpha \end{bmatrix}\right),$$

with parameters (n.b. C^{ab}_α = C^{ba}_α)

$$m^a_\alpha := \nu(0) - \nu(\alpha) + c_1\alpha\,\nu'(0), \qquad m^b_\alpha := \nu'(\alpha) - c_2\,\nu'(0),$$

and

$$\begin{aligned} C^{aa}_\alpha &= \kappa_{00} + (c_1\alpha)^2\,{}^{\partial}\kappa^{\partial}_{00} + \kappa_{\alpha\alpha} + 2\big(c_1\alpha(\kappa^{\partial}_{00} - {}^{\partial}\kappa_{0\alpha}) - \kappa_{0\alpha}\big), \\ C^{ab}_\alpha &= -c_2(\kappa^{\partial}_{00} + c_1\alpha\,{}^{\partial}\kappa^{\partial}_{00}) + \kappa^{\partial}_{0\alpha} + c_2\,{}^{\partial}\kappa_{0\alpha} + c_1\alpha\,{}^{\partial}\kappa^{\partial}_{0\alpha} - \kappa^{\partial}_{\alpha\alpha}, \\ C^{bb}_\alpha &= c_2^2\,{}^{\partial}\kappa^{\partial}_{00} - 2c_2\,{}^{\partial}\kappa^{\partial}_{0\alpha} + {}^{\partial}\kappa^{\partial}_{\alpha\alpha}. \end{aligned}$$

And the probability for the weak Wolfe conditions to hold is given by the standardised bivariate normal probability20

$$p(a_\alpha \ge 0 \,\wedge\, b_\alpha \ge 0) = \int_{-m^a_\alpha/\sqrt{C^{aa}_\alpha}}^{\infty}\int_{-m^b_\alpha/\sqrt{C^{bb}_\alpha}}^{\infty}\mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix};\ \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 1 & \rho_\alpha \\ \rho_\alpha & 1 \end{bmatrix}\right) da\, db,$$

with the correlation coefficient −1 ≤ ρ_α := C^{ab}_α / √(C^{aa}_α C^{bb}_α) ≤ 1.

20 This is the "bivariate error function", the generalisation of the univariate Gaussian cumulative density
$$\int_{-\infty}^{x}\mathcal{N}(x'; 0, 1)\, dx' = \tfrac{1}{2}\big(1 + \operatorname{erf}(x/\sqrt{2})\big).$$
Just like the univariate case, there is no "analytic" solution, only an "atomic" one (in standard libraries, the error function is implemented as a special case of the incomplete Gamma function, computed either via a series expansion or a continued fraction, depending on the input). For more, see §6.2 of Numerical Recipes (Press et al., 1992). A highly accurate approximation of comparable cost to the error function was provided by Alan Genz (2004), whose personal website at Washington State University is a treasure trove of exquisite atomic algorithms for Gaussian integrals.

The strong condition is slightly more tricky. It amounts to the still linear restriction of b_α on either side, to

$$0 \le b_\alpha \le -2c_2\, f'(0).$$

But of course we do not have access to the exact value of f'(0), just a Gaussian estimate for it. A computationally easy, albeit ad hoc, solution is to use the expectation21 f'(0) ≈ ν'(0) to set an upper limit b̄ := −2c_2 ν'(0), and use it to compute an approximate probability for the strong Wolfe conditions, as

$$p(a_\alpha \ge 0 \,\wedge\, 0 \le b_\alpha \le \bar b) \approx \int_{-m^a_\alpha/\sqrt{C^{aa}_\alpha}}^{\infty}\int_{-m^b_\alpha/\sqrt{C^{bb}_\alpha}}^{(\bar b - m^b_\alpha)/\sqrt{C^{bb}_\alpha}}\mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix};\ \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 1 & \rho_\alpha \\ \rho_\alpha & 1 \end{bmatrix}\right) da\, db. \qquad (26.17)$$

21 Alternatively, one could also use the 95%-confidence lower and upper bounds f'(0) ≲ ν'(0) + 2√(∂κ∂₀₀) and f'(0) ≳ ν'(0) − 2√(∂κ∂₀₀) to build more lenient or restrictive decision rules, respectively.
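For concreteness, here is one way such a Wolfe probability could be evaluated numerically, as the rectangle probability of the standardised bivariate Gaussian above. The sketch uses SciPy's bivariate normal CDF and our own names; a dedicated routine in the spirit of Genz (2004), referenced in the margin note, would be the more accurate choice in practice.

```python
# Sketch: p(a >= 0 and 0 <= b <= b_bar) for (a, b) ~ N([ma, mb], C),
# via inclusion-exclusion on the bivariate normal CDF. Names are ours.
import numpy as np
from scipy.stats import multivariate_normal

def prob_wolfe(ma, mb, Caa, Cab, Cbb, b_bar=np.inf):
    rho = Cab / np.sqrt(Caa * Cbb)
    mvn = multivariate_normal(mean=[0.0, 0.0],
                              cov=[[1.0, rho], [rho, 1.0]])
    la, lb = -ma / np.sqrt(Caa), -mb / np.sqrt(Cbb)   # lower limits
    ub = (b_bar - mb) / np.sqrt(Cbb)                  # upper limit on b
    big = 10.0   # stand-in for +infinity in standardised coordinates
    return (mvn.cdf([big, min(ub, big)]) - mvn.cdf([la, min(ub, big)])
            - mvn.cdf([big, lb]) + mvn.cdf([la, lb]))
```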
Algorithm 26.1 provides pseudo-code for the thus completed
probabilistic extension of classic line searches.


Figure 26.6: Empirical performance of the probabilistic line search on some basic benchmarks. Figure adapted from Mahsereci and Hennig (2015). Left column: Experiments on the CIFAR10 data set (Krizhevsky and Hinton, 2009). Right column: the MNIST data set (http://yann.lecun.com/exdb/mnist/). The model is a 2-layer feed-forward network in either case. All plots show a comparison between stochastic gradient descent (sgd) run with fixed or decaying learning rate α, compared to an sgd optimiser initialised at the same learning rate, then controlled by the line search. Top plots: Test error after 10 training epochs vs learning rate. Bottom plots: Test error vs training epoch. Error bars show empirical standard-deviation over 20 repetitions. The uncontrolled sgd instances show strong variance in performance, while the line search quickly finds and tracks a good choice of learning rate. It does not always perform as well as the best possible learning rate, but finding that rate requires a tedious manual, or externally controlled, search, while the line search requires no intervention.

► 26.6 Uncertain Observations May Require Additional Observables

The likelihood defined by Eq. (26.12) contains the covariance matrix Λ, in particular its internal parameters σ_{f_i}, σ_{f'_i}. In a practical empirical risk minimisation problem (as described by Eq. (26.3)), these variables are not necessarily known. And of course they are not present in the noise-free setting. This is a situation in which computational uncertainty introduces new complexities that have to be addressed by extra work. For example, they may require the computation of new observables that are not part of the classic paradigm. Here is one way to do this for the variances required in Λ.22

22 For simplicity, we assume independence between function values f and gradients f', setting ρ = 0. The extension to an empirical estimator of ρ is straightforward, but ρ = 0 also seems to work well in practice.

If the batch size M is larger than one, then σ_{f_i}, σ_{f'_i}, ρ_i can be estimated empirically from the elements ℓ_m(α) := ℓ(ξ_m, x(α)) of the batch objective, and its gradients ℓ'_m(α) := ∂ℓ_m/∂α. In addition to the already computed empirical batch means

$$y_i = r(\alpha_i) + \frac{1}{M}\sum_{m=1}^{M}\ell_m(\alpha_i), \qquad y_i' = \partial_\alpha r(\alpha_i) + \frac{1}{M}\sum_{m=1}^{M}\ell'_m(\alpha_i),$$

procedure ProbLineSearchSketch(f̃, y_0, y'_0, σ_f, σ_{f'})
    T, Y, Y' ← 0, y_0, y'_0 ∈ R              // initialise storage
    t ← 1                                     // initial candidate equals previous step-length
    while length(T) < 10 and no Wolfe-point found do
        [y, y'] ← f̃(t)                       // evaluate objective
        T, Y, Y' ← T ∪ t, Y ∪ y, Y' ∪ y'      // update storage
        GP ← GPInference(T, Y, Y')
        P_Wolfe ← probWolfe(T, GP)            // Wolfe prob. at T
        if any P_Wolfe > c_W then             // done?
            return t* = arg max P_Wolfe       // success!
        else                                  // keep searching!
            T_cand ← computeCandidates(GP)    // candidates
            EI ← expectedImprovement(T_cand, GP)
            PW ← probWolfe(T_cand, GP)
            t ← arg max(PW ⊙ EI)              // find best candidate
        end if
    end while
    return error / fall-back                  // no acceptable point found
end procedure

Algorithm 26.1: Pseudo-code for a probabilistic line search. Adapted from Mahsereci and Hennig (2017). The code assumes access to a function handle f̃ that returns scaled pairs of values and gradients of the form

$$\tilde f(t) = \left[\frac{f(t) - f(0)}{f'(0)},\ \frac{f'(t)}{f'(0)}\right],$$

where t = α/α_{i−1} is the input variable scaled by the step length of the previous line search. This re-scaling allows the use of standardised scales for the GP prior, and thus avoids hyperparameters. Note that all operations take place on the scalar projected gradients ∈ R, not on full gradients ∈ R^N. The code sets a fixed budget of 10 evaluations, after which the search aborts. In this case, a fall-back solution can be returned, for example the point t* that minimises the GP posterior mean. In practice, this fall-back is very rarely required.

this requires collecting the additional statistics

$$S_i = \frac{1}{M}\sum_{m=1}^{M}\ell_m^2(\alpha_i), \qquad S_i' = \frac{1}{M}\sum_{m=1}^{M}\big(\ell'_m(\alpha_i)\big)^2, \qquad (26.18)$$

and then using the estimators

$$\hat\sigma_{f_i}^2 = \frac{S_i - y_i^2}{M - 1}, \qquad \hat\sigma_{f'_i}^2 = \frac{S_i' - (y_i')^2}{M - 1}.$$
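In code, these estimators are a few lines given the per-sample batch losses and projected gradients; the following sketch (our naming) mirrors Eq. (26.18).

```python
import numpy as np

def noise_estimates(losses, dlosses):
    """losses, dlosses: the M per-sample values ell_m(alpha_i) and
    projected derivatives d ell_m / d alpha at one node alpha_i."""
    losses, dlosses = np.asarray(losses), np.asarray(dlosses)
    M = losses.size
    y, dy = losses.mean(), dlosses.mean()
    S, dS = (losses**2).mean(), (dlosses**2).mean()   # Eq. (26.18)
    return (S - y**2) / (M - 1), (dS - dy**2) / (M - 1)
```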

Doing so requires computing M inner products when collecting S_i, at cost O(NM). However, the computation of the batch losses ℓ_m(α_i) and gradients ∇ℓ_m(α_i) is usually more expensive than these inner products (in deep networks, these gradients are computed by back-propagation). And these numbers already needed to be computed for the other parts of the line search and overall optimiser anyway. Thus the empirical estimation of σ_{f_i}, σ_{f'_i} only adds a minor computational overhead. Recent software packages23 give access to these quantities at low or very low computational overhead. A more detailed discussion can be found in §5.2 of Maren Mahsereci's PhD thesis (2018), and also in the original works on the probabilistic line search.24

23 E.g. Dangel, Kunstner, and Hennig (2020), software at backpack.pt
24 Mahsereci and Hennig (2015; 2017)

Speaking more abstractly, the interesting aspect here is that


the presence of significant computational uncertainty (noise) in
an optimisation problem may not only require the design of
more general probabilistic algorithms, such as the line search
constructed here. It might also require the computation of new
quantities, for example to identify the likelihood. By giving an
explicit role to the likelihood, the probabilistic viewpoint makes
this aspect obvious. One may wonder whether non-probabilistic
formulations have in the past sometimes hidden such challenges
a bit too well.
27  Controlling Optimisation by Probabilistic Estimates

Beyond the step size, optimisation algorithms have a number of other hyperparameters that affect their performance. In the classic noise-free setting virtually all of these are either automatically set by the algorithm, or their performance is so robust to the parameter choice that a global value can be set once and for all by the designer.1 The noisy setting complicates things both by making some decisions less straightforward, and by introducing new parameters to be tuned. In this section we will address the following two examples, one each of these cases:

1 This is the case, for example, for the two constants c_1, c_2 in the Wolfe conditions (Eqs. (26.7) and (26.8)), which are frequently set to c_1 = 10^{-4}, c_2 = 0.9 for quasi-Newton methods; see pp. 33-34 in Nocedal and Wright (1999) and §28.3 in this text.

• Since a noisy gradient measurement will almost surely never actually return the value 'zero' (even if the optimiser should pass by the extremum of either the full-data or population loss), the issue of when to stop the optimiser turns from a trivial Boolean conditional in the noise-free case into an actual concern. This is a real problem in empirical risk minimisation: As it approaches the minimum of the population risk, stochastic optimisers at best begin a wasteful diffusion around the optimum that is difficult to detect. At worst, they may over-fit, i.e. "home in" on superficial features of only the empirical data set.

• The batch size M is a new parameter under control of the optimiser that is not present in the classic case. Hardware characteristics often put constraints on the available choices of M. But nevertheless, a good trade-off between cost and precision can help the optimiser converge faster.

The following two sections present ways to address these issues from the probabilistic standpoint. These are just case-studies,2 and should not be taken as the sole solutions to these issues. Quite the contrary: both aspects have been studied by many authors in the machine learning community. The point of this section is to highlight two aspects. First, on a conceptual level, the probabilistic viewpoint offers a unifying framework in which to reason about algorithmic design. But second, practical considerations may also require us to once again ease off the Bayesian orthodoxy. Especially when it comes to nuisance parameters deep inside a low-level algorithm, not every quantity has to have a full-fledged calibrated posterior. The probabilistic framework may then still help in the derivations, but point estimates derived from the probabilistic formulation may just do the trick.

2 The two sections are short summaries of the following two papers, respectively: Balles, Romero, and Hennig (2017) as well as Mahsereci et al. (2017). Additional information can be found in M. Mahsereci's PhD thesis (2018).

► 27.1 Choosing Batch Sizes

For the moment, we consider the simple (and flawed, yet popular) case of stochastic gradient descent, i.e. the optimiser given by the update rule

$$x_{i+1} = x_i - \alpha_i\,\hat\nabla L_M(x_i) =: x_i - \alpha_i\, g_M(x_i).$$

Let us assume we have found a good value for α_i, e.g. by using the line search algorithm described above. Here we simplified the notation by introducing the shorthand g_M(x_i), which explicitly exposes the batch size M in the noisy gradient ∇̂L_M(x_i). By Eq. (26.3), the variance of the gradient elements scales inversely with M. If we can assume that the entire data set is very large (K ≫ M) and that the batch elements are drawn independently of each other, then the elements of g_M(x_i) are distributed according to the likelihood

$$g_M(x_i) \sim \mathcal{N}\left(\nabla L(x_i),\ \frac{1}{M}\Sigma(x_i)\right), \qquad (27.1)$$

where Σ(x) is the covariance between gradient elements,

$$\Sigma(x) := \frac{1}{K}\sum_{i=1}^{K}\big(\nabla\ell(\xi_i, x) - \nabla L(x)\big)\big(\nabla\ell(\xi_i, x) - \nabla L(x)\big)^\top.$$

We will drop the variable x from the notation below, as all subsequent considerations apply to one specific local value of x_i. The full matrix Σ is unknown, and very costly to even estimate empirically. But we already saw above (Eq. (26.18)) that the diagonal elements of Σ can be estimated relatively cheaply at runtime, by computing the additional observable

$$S = \frac{1}{M}\sum_{j=1}^{M}\big(\nabla\ell(\xi_j, x) - g_M(x)\big)^{.2}, \qquad (27.2)$$

where (·)^{.2} denotes the element-wise square.

For simplicity, we will assume that the optimiser has free control over the batch size M.3 Deciding on a concrete value of M, the optimisation algorithm now faces a trade-off: A large batch size provides a more informative (precise) estimate of the true gradient, but increases computation cost. From Eq. (27.1), the standard deviation of g_M only drops with M^{-1/2}, but the computation cost of course rises linearly with M. So we may conjecture that there is an optimal choice for M. It would be nice to know this optimal value; but since this is a very low-level consideration about a hyperparameter of an inner-loop algorithm, an exact answer is not as important as a cheap one. Thus, we will now make a series of convenient assumptions to arrive at a heuristic:

3 In practice, aspects like the cache size of the processing unit usually mean that batch sizes have to be chosen as an integer multiple of the maximal number of data points that can be simultaneously cached.
First, assume that the true gradient ∇L is Lipschitz continuous (a realistic assumption for machine learning models) with Lipschitz constant L. That is,

$$\|\nabla L(x) - \nabla L(x_i)\| \le L\,\|x - x_i\| \quad \forall\, x \in \mathbb{R}^N.$$

This assumption implies a quadratic upper bound on the loss itself:4

$$L(x_{i+1}) \le L(x_i) + \nabla L(x_i)^\top(x_{i+1} - x_i) + \frac{L}{2}\,\|x_{i+1} - x_i\|^2.$$

4 See, e.g., Eq. 4.3 in Bottou, Curtis, and Nocedal (2016).

Re-arranging for the stochastic gradient descent rule x_{i+1} = x_i − α_i g_i yields a lower bound on the gain in loss for the optimiser's step:

$$L(x_i) - L(x_{i+1}) \ge G := \alpha_i\,\nabla L(x_i)^\top g_i - \frac{L\alpha_i^2}{2}\,\|g_i\|^2.$$

Here is where the probabilistic description becomes helpful: From Eq. (27.1), we know that E(g_i) = ∇L(x_i), and

$$\mathbb{E}\big(\|g_i\|^2\big) = \|\nabla L(x_i)\|^2 + \frac{\operatorname{tr}\Sigma}{M},$$

so we can compute an expected gain from the next step of gradient descent, as

$$\mathbb{E}(G) = \left(\alpha_i - \frac{L\alpha_i^2}{2}\right)\|\nabla L(x_i)\|^2 - \frac{L\alpha_i^2}{2M}\operatorname{tr}\Sigma.$$

Since cost is linear in M, we then consider the expected gain per cost, E(G)/M, which is a rational function in M, and we can find the M that maximises this expression. It is

$$M_* = \frac{2L\alpha_i}{2 - L\alpha_i}\cdot\frac{\operatorname{tr}\Sigma}{\|\nabla L(x_i)\|^2}. \qquad (27.3)$$

Since L is usually not known a priori and the norm of the true gradient is not accessible, we finally make a number of strongly simplifying assumptions to arrive at a concrete heuristic that does not add computational overhead. First, assume that ∇L is not just Lipschitz continuous but also differentiable, and the Hessian of L is scalar: B(x_i) ≈ h·I_N. This means5 that the Lipschitz constant is L = h and the gradient norm can be approximated linearly as ‖∇L(x_i)‖² ≈ 2h(L(x_i) − L_*), where L_* = min_{α_i} L(x_i − α_i g_i) is the loss from the optimal stochastic gradient descent step size. This simplifies Eq. (27.3) to

$$M_* = \frac{\alpha}{2 - h\alpha}\cdot\frac{\operatorname{tr} S}{L(x_i) - L_*}.$$

5 A detailed derivation is in the original paper by Balles, Romero, and Hennig (2017).

Under the scalar Hessian assumption, the optimal learning rate is α = 1/h. Since we started out assuming that the optimiser is in fact using an approximately optimal step-size, we get

$$M_* = \alpha\,\frac{\operatorname{tr} S}{L(x_i) - L_*}.$$

Finally, empirical risks are typically energy functions, with an optimum L_* ≈ 0. So we get an upper bound for the optimal batch size as

$$M_* \le \alpha\,\frac{\operatorname{tr} S}{L(x_i)} \qquad (27.4)$$

(recall that the batch loss provides an unbiased estimate of L).
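As a sketch (with our own names), the rule of Eq. (27.4) can be computed from quantities available in any mini-batch anyway:

```python
import numpy as np

def adaptive_batch_size(alpha, grads, batch_loss):
    """grads: (M, N) array of per-sample gradients at the current x;
    batch_loss: unbiased batch estimate of L(x)."""
    g = grads.mean(axis=0)                        # batch gradient g_M
    tr_S = ((grads - g)**2).mean(axis=0).sum()    # tr(S), cf. Eq. (27.2)
    return alpha * tr_S / batch_loss              # bound of Eq. (27.4)
```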

This heuristic has shown good empirical performance on a number of test cases (Figure 27.1). For the purposes of this text, this result serves as another example for the use of probabilistic information in algorithm design. As in the preceding section on step-size selection, we once again face a parameter in our algorithm (the batch size M) whose optimal value is fundamentally affected by computational uncertainty. And as before, the tuning heuristic we arrive at involves the diagonal elements of the covariance matrix S, a quantity that would not exist in the noise-free case. We already saw in the preceding section that it can be estimated at runtime, and that doing so requires computing an additional observable (of low computational cost) that would not be part of a classic analysis.

► 27.2 Early Stopping

When should the optimiser stop? In the noise-free case, the answer is trivial: stop when the gradient vanishes, ∇f = 0. But in the noisy case, we never know with certainty that this is the case, and the optimiser may well not actually converge to a true root of the gradient. In the empirical risk minimisation setting, this holds when significant computational noise arises from batching, but it is even a problem when the risk is computed over the entire data set. The real target of the optimisation problem is the population risk, which can fundamentally not be accessed because we only have access to a finite data set. This is an example where computational uncertainty and empirical uncertainty overlap.

Figure 27.1: Optimisation progress of stochastic gradient descent when batch sizes are controlled by the adaptive rule of Eq. (27.4). Results reproduced from Balles, Romero, and Hennig (2017). Each column of plots shows results on a specific empirical risk minimisation problem, an image recognition task on four different standard benchmark data sets (MNIST: see Figure 26.6. Street View House Numbers: Netzer et al., 2011. CIFAR10 and 100: Krizhevsky and Hinton, 2009). Top to bottom: Training error (a measure of the optimisers' raw efficiency); test accuracy (a measure of generalisation); and the batch size M chosen by the optimiser over the course of the optimisation. Results are plotted against the number of read data-points (not the number of optimiser steps) for a fairer comparison. Especially for the moderately larger problems (CIFAR10 and CIFAR100), the adaptive schedule improves on fixed batch sizes, even though the locally chosen batch sizes generally lie within the range of the constant-M comparisons.

The stopping problem has practical relevance. For contemporary machine learning models, the number N of parameters to be fitted often exceeds the number K of datapoints. In such situations, if the optimiser is not stopped, it might over-fit to a minimum of the empirical risk that reflects features of the data set that are only due to sampling noise, and not present in the population. The standard approach to this problem is to separate the data into a training set and a separate validation

set. The optimiser only gets access to the empirical risk on the training set (possibly sub-sampled into batches). A separate monitor observes the evolution of the validation risk, and stops the optimiser when the validation risk starts rising. Apart from technicalities (reliably detecting a rise in the validation risk is itself a noisy estimation problem), the principal downside of this approach is that it "wastes" a significant part of the data on the validation set. Collecting data often has a high financial and time cost, and the data in the validation set cannot be used by the optimiser to find a more general minimum. Even if we ignore the issue of overfitting, we still need some way to decide when to stop the optimiser. Many practitioners just run the method "until the learning-curve is flat", which is wasteful.

This section describes a simple statistical test as an alternative, which is particularly suitable for small data sets, where constructing a (sufficiently large) validation set is not feasible. It is based on work by Mahsereci et al. (2017), and makes explicit use of the observation likelihood (Eq. (27.1)). Let p_pop. be the distribution from which data points are sampled in the wild, and f be the population risk

$$f(x) = r(x) + \int \ell(\xi, x)\, dp_{\text{pop.}}(\xi).$$

Then the gradient of the empirical risk L is distributed as

$$p(\nabla L(x) \mid f) = \mathcal{N}\left(\nabla L(x);\ \nabla f(x),\ \frac{\Sigma(x)}{K}\right).$$

So the probability to observe a particular empirical loss gradient if the gradient of the true population loss is actually zero is the evidence,

$$p(\nabla L(x) \mid 0) = \mathcal{N}\left(\nabla L(x);\ 0,\ \frac{\Sigma(x)}{K}\right).$$

Once again the quantity Σ makes an appearance. Since we already decided to estimate the diagonal of Σ empirically in the above sections, we may as well re-use the estimator S from Eq. (27.2). Making once again the simplifying assumption that the sampling noise is independent across gradient elements,6 we get

$$p(\nabla L(x) \mid 0) = \prod_{i=1}^{N}\mathcal{N}\left([\nabla L(x)]_i;\ 0,\ \frac{[\Sigma(x)]_{ii}}{K}\right).$$

6 This independence assumption is a relatively strong simplification, which may be possible to improve on with some extra work. In particular if other structural information about the data set and the model encoded in ℓ is available, a block, or low-rank structure for Σ is probably helpful.
Analogous to the treatment in §11.3 (see Eq. (11.15)), we can now construct a log-likelihood ratio test. If the probability for the observed gradients under the assumption that the population risk gradient vanishes is larger than the expected value of this probability, then the hypothesis that we really are close to the population optimum cannot be ruled out any more, and we may consider stopping the algorithm. This happens when

$$\log p(\nabla L(x) \mid 0) - \mathbb{E}_{p(\nabla L(x)\mid 0)}\big[\log p(\nabla L(x)\mid 0)\big] > 0,$$

which, under the element-wise model above, amounts to

$$1 - \frac{K}{N}\sum_{i=1}^{N}\frac{[\nabla L(x)]_i^2}{[S]_i} > 0.$$

(The variable ∇L is the integration variable for the expectation.) That is, when the average gradient element is within the "error bar" estimate S.
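A minimal sketch of this test follows. All names are ours, and the factor K should be read as the number of data points over which the risk gradient was computed (M for a plain mini-batch).

```python
import numpy as np

def should_stop(grads, K):
    """grads: (M, N) array of per-sample gradients; K: number of data
    points over which the risk gradient was computed."""
    g = grads.mean(axis=0)                 # empirical risk gradient
    S = ((grads - g)**2).mean(axis=0)      # element-wise variance, Eq. (27.2)
    N = g.size
    return 1.0 - (K / N) * np.sum(g**2 / S) > 0.0
```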

More details can be found in Mahsereci et al. (2017). The stopping test provides another example for how probabilistic analysis of the result of computations can inform algorithm design. It also demonstrates how uncertainty arising from external and computational sources can interact with each other in practice. From the probabilistic viewpoint there is no problem with this mixing. Mathematically, uncertainties arising from different sources do not need to be distinguished formally.
28  First- and Second-Order Methods

In this section, we address the design question of iterative algorithms: in the generic update rule (see Algorithm 25.1)

$$x_{i+1} = x_i + \alpha_i d_i, \qquad \alpha_i \in \mathbb{R},\ d_i \in \mathbb{R}^N,$$

how should the direction d_i be chosen? There are myriad ways to answer this question; we will discuss only a few important basic choices. Very roughly, they split into algorithms motivated by gradient descent (a first-order method), and methods motivated by the Newton-Raphson rule (a second-order method). More specifically, in §28.1 we will first discuss algorithms in which the update rule for d_i can be phrased independently for each element. That is,

$$[d_i]_n = [d_i]_n\big(\{f_j\}_{j=0,\dots,i},\ \{[\nabla f_j]_n\}_{j=0,\dots,i}\big).$$

Section 28.3 will discuss a second type of rules that also consider interactions between gradient elements. This classification is primarily a computational consideration, not an analytic one. Some of the methods in this first class converge asymptotically faster than gradient descent; virtually all the methods in the second class converge slower than Newton's method. But element-wise rules scale more readily to very high-dimensional problems. This is one of the reasons why they are currently the popular choice in machine learning, where the number N of parameters to be optimised is frequently in the range beyond 10^6.

► 28.1 Element-wise Methods

Typically, the update rule is phrased in terms of some sufficient


statistics kept in memory and updated locally. Here are some
classic choices. In the following, all operations are element-wise.

• The base case is gradient descent:

$$x_{i+1} = x_i - \alpha_i\,\nabla f(x_i).$$

• Gradient descent with momentum1 is motivated by the Newtonian dynamics2 of a massive particle moving, with friction, in a potential field given by f. It uses an auxiliary variable v_i, and is described by the update rule

$$v_i = -\alpha_i(1 - \beta_i)\,\nabla f(x_i) + \beta_i v_{i-1}, \quad \text{with } \beta_i \in (0, 1), \qquad x_{i+1} = x_i + v_i. \qquad (28.3)$$

• Nesterov's accelerated method3 is a variant of momentum that uses a look-ahead:

$$v_i = -\alpha_i(1 - \beta_i)\,\nabla f\big(x_i + \alpha_i(1 - \beta_i)v_i\big) + \beta_i v_{i-1}, \qquad x_{i+1} = x_i + v_i.$$

Note that this is an implicit definition, because v_i features both on the left-hand side and in the argument of ∇f. Nesterov's method will not be further discussed in this chapter; it is more readily understood in the framework of implicit ODE solvers.4

• Other recently popular algorithms include rules such as AdaDelta5 and Adam.6 These will not be discussed further here, but we note that these algorithms, which are specifically designed for stochastic optimisation, both retain a running average of the element-wise square of the gradient, similar to the statistic described in Eq. (27.2). Due to this they can in fact be interpreted as a form of "uncertainty-damping", i.e. the length of a step, in each individual dimension, is scaled by the signal-to-noise ratio. The technical details are involved, though.7

1 Polyak (1964). Due to the derivation below, the algorithm is historically also known as the heavy ball method.
2 Since this physical interpretation has an interesting connection to the probabilistic derivation below, we briefly derive it here. Assume the particle has mass m and moves in a potential given by f, with friction coefficient κ. Then its Newtonian dynamics are described by the second-order ODE
$$m\,\ddot x(t) = -\kappa\,\dot x(t) - \nabla f(x(t)). \qquad (28.1)$$
A simple approximate way to solve this problem is to assume that f is locally linear (that ∇f is locally constant), which turns Eq. (28.1) into a first-order separable linear ODE in the velocity v(t) := ẋ. This locally constant approximation amounts to an explicit Euler step (see §37). Given the initial values x(t_i) = x_i and v(t_i) = v_i, the approximate solution at time t_{i+1} := t_i + τ is then given by Eq. (28.3), with the constants given by
$$\alpha_i = 1/\kappa, \qquad \beta_i = \exp\left(-\frac{\kappa}{m}\,\tau\right). \qquad (28.2)$$
Note that this dynamical model reverts to that of gradient descent in the massless limit m → 0 (since then β → 0).
3 Nesterov (1983)
4 Nesterov's method can be derived from the Newtonian dynamics of a massive particle analogous to the derivation of the momentum method above, where the explicit Euler step is replaced with an implicit step. For further discussion of the relationship between ODE solvers and optimisation methods, see Scieur et al. (2017), and also Exercise 38.3 in Chapter VI.
5 Zeiler (2012)
6 Kingma and Ba (2015)
7 Balles and Hennig (2018)

For the purposes of this text, it is important to note that the first three methods on this list (gradient descent, momentum, accelerated momentum) can all be motivated and analysed purely on noise-free optimisation problems. If an optimisation problem is stochastic, one can then add analysis of the noise's effect on these methods, but noise is not a principal ingredient in their construction.
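As a small illustration (names and parameter defaults are ours), the heavy-ball constants of Eq. (28.2) plug directly into the update (28.3):

```python
import numpy as np

def momentum_step(x, v, grad_f, kappa=10.0, m=1.0, tau=0.1):
    """One heavy-ball step, Eq. (28.3), with the physical constants
    of Eq. (28.2): friction kappa, mass m, time step tau."""
    alpha = 1.0 / kappa
    beta = np.exp(-kappa * tau / m)
    v_new = -alpha * (1.0 - beta) * grad_f(x) + beta * v
    return x + v_new, v_new
```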
We thus now turn to ask whether the presence of noise can be more explicitly encoded in these algorithms. As in previous chapters, we will thus construct generative probabilistic models that revert to gradient descent and its momentum variant in the limit of a Dirac observation likelihood. Since stochastic noise is such a prevalent problem in contemporary optimisation problems, however, we will be less interested in analysing the underlying prior assumptions that yield these classic algorithms, and more focussed on how the presence of noise can be included explicitly in such an optimiser. The result will be a series of "add-ons" to these classic methods, rather than a probabilistic re-interpretation for them. Algorithm 28.1 shows how these can be realised within the structure of classic methods by updating specific parts of the optimisation loop. It is worth doing these extensions carefully. The probabilistic formulation allows a clear separation between the effects of observation noise and those of the geometry of the underlying smooth objective.

1  procedure prob_optimise(l(x), x_0)
2      l_{x_0} ← l(x_0)                     // first evaluation
3      L ← l_{x_0}                          // initialise storage
4      for i = 0, 1, ... do
5          d_i ← PDirection(L)              // decide search direction
6          α_i ← PLineSearch(d_i^⊤ l(t))    // probabilistic line search
7          x_{i+1} ← x_i + α_i d_i          // move
8          l_{i+1} ← l(x_{i+1})             // observe
9          if Terminate(L) then             // done?
10             return x_i                   // success!
11         end if
12         L ← L ∪ l_{i+1}                  // update storage
13     end for
14 end procedure

Algorithm 28.1: Pseudo-code for a probabilistic iterative optimisation algorithm. This is a variant of the generic classic optimiser in Algorithm 25.1. The differences to that algorithm are that the probabilistic variant assumes access to a likelihood l(x) = p(y, y' | f(x), ∇f(x)) for the objective and its gradient, rather than noise-free access to those quantities. For generality, the data structure returned by the likelihood is not further specified. As a concrete example, one may think of a joint or independent Gaussian distribution over y ∈ R, y' ∈ R^N centred on (f, ∇f), with covariance Λ ∈ R^{(N+1)×(N+1)}. Observations are stored in a suitable structure L. The algorithm uses the probabilistic line search described in §26.5 to choose step-sizes. To signify that this subroutine only operates on the univariate sub-problem, line 6 uses the suggestive notation d_i^⊤ l(t) (cf. the same line in Algorithm 25.1). The search direction is computed in a decision step from the posterior on the problem arising from the data in L. Importantly, the resulting algorithm is structurally virtually identical to the classic method. The probabilistic operations are encapsulated in the subroutines. If those routines can be designed with cost comparable to the classic method, then the probabilistic optimiser need not have higher computational cost than the classics.

► 28.2 Probabilistic Element-wise Rules

Since the above rules are defined element-wise, we will simplify the notation for this section and use the scalar quantity f'_n(x_i) ∈ R to represent the element [∇f(x_i)]_n of the gradient. Where the index n ∈ [1, ..., N] does not matter, we will also drop it. As the above methods can be motivated within the framework of noise-free gradient observations, we need to add a probabilistic estimation model for these gradients. Doing so requires a joint generative model p(∇f(x)) for the gradients, both over the N output dimensions and over the input domain R^N. In the classic methods, this is not necessary: if one assumes the gradients to be accessible without uncertainty, there is never a need to guess

what the real gradient may be. The optimiser can just check.

Once again, computational cost constraints will guide our model design just as much as knowledge about the loss function itself. To keep the element-wise structure of the above rules, we will assume that the elements of ∇f evolve independently of each other,8 imposing the factorisation structure

$$p(\nabla f(x)) =: \prod_{n=1}^{N} p_n\big(f'_n(x)\big).$$

8 This assumption is patently incorrect, simply because ∇f is the gradient of the smooth scalar field f; thus, the gradient theorem imposes constraints on the elements of ∇f relative to each other: for any two points x_1, x_2 and any curve c from x_1 to x_2, the gradient must satisfy
$$f(x_2) - f(x_1) = \int_c \nabla f(x)\, dx.$$
The discussion in §28.3 will show that including these constraints in a computationally efficient way is not straightforward. Further discussion of these issues can be found in Hennig (2013).

For the same algorithmic reasons as in previous chapters, we are again drawn to Gaussian process models as candidates for p_n. But even within that class of probabilistic models, there is now a fundamental choice to be made. The conceptually cleaner path is to define a consistent Gaussian process model over the input domain R^N. However, this would impose the usual cubic cost in i, the number of the optimiser's iterations, a problem that would require further approximations to fix. We also know that the optimiser will only ever ask for gradients along its trajectory, which forms a univariate curve, albeit no linear sub-space of R^N. For this reason, we make another leap of faith and treat the individual gradient observations f'_n(x_i) as separate univariate time series f'_n(t_i) for t_i ∈ R. Inference can then be addressed with the Kalman filter (§5). This raises the question how the multivariate input x_i should be transformed into a scalar t: is the difference from one optimisation step to another unit (t → t + 1) or does it have a length? For the purposes of this section, we will use the latter option, and set t_{i+1} = t_i + τ with τ := ‖x_{i+1} − x_i‖.

With this, we can consider two basic choices for the SDE defining the Kalman filter: the Wiener process and the Ornstein-Uhlenbeck process,

$$df'_n(t) = 0\, dt + \theta_n\, d\omega_t, \quad\text{and} \qquad (28.4)$$
$$df'_n(t) = -\gamma_n\, f'_n(t)\, dt + \theta_n\, d\omega_t, \quad\text{respectively.} \qquad (28.5)$$

Here we have allowed for separate drift (γ_n) and diffusion (θ_n) scales for each element of the gradient. They translate into the Kalman filter parameters (see Algorithm 5.1 and Eq. (5.21))

$$A_n = 1, \qquad Q_n = \theta_n^2\,\tau \qquad \text{(Wiener)}, \qquad (28.6)$$
$$A_n = e^{-\gamma_n\tau}, \qquad Q_n = \frac{\theta_n^2}{2\gamma_n}\big(1 - e^{-2\gamma_n\tau}\big) \qquad \text{(OU)}. \qquad (28.7)$$

We recall from §5 that the Wiener process assumes the gradient elements to drift in a free random walk, while the OU process

expects the gradients to revert to zero. As Eq. (28.7) shows, the latter model is slightly more complicated to implement, but it might provide a more realistic model of the gradients produced by an optimisation routine that actively tries to drive the gradients to zero.

Consider the behaviour of the Kalman filter (28.6) arising from the Wiener prior (28.4). Assume that, at step i − 1, the N independent filters for each gradient element together hold posterior (estimation) mean m_{i−1} ∈ R^N and the vector of scalar (element-wise) covariances P_{i−1} ∈ R^N. To build a probabilistic version of stochastic gradient descent, we take m_{i−1} as the estimate for ∇f(x_i), so the optimiser moves to

$$x_i = x_{i-1} - \alpha_i\, m_{i-1}.$$

Now the optimiser collects an observation y_i ∈ R^N. In the classic derivation, this is taken as a direct observation y_i = ∇f(x_i). To build a probabilistic extension, we instead use the explicit likelihood

$$p(y_i \mid \nabla f(x_i)) = \mathcal{N}\big(y_i;\ \nabla f(x_i),\ \operatorname{diag} R\big),$$

where R ∈ R^N is a vector of individual observation noise variances; we already saw in Eqs. (26.18) and (27.2) how to build empirical estimators for this observation noise in empirical risk minimisation problems, by summing squares of gradient elements over batches. Since we can compute a noisy observation of the gradient itself, we set the observation projection in Algorithm 5.1 to H = 1. From Algorithm 5.1 one can then deduce (taking care to do all operations element-wise) that the updated mean and covariances at x_i will be given by the vectors (all operations element-wise!)
$$m_i = (1 - K)\,m_{i-1} + K\,y_i \quad\text{and}\quad P_i = (1 - K)\big(P_{i-1} + \theta^2\tau\big) = \frac{(P_{i-1} + \theta^2\tau)\,R}{P_{i-1} + \theta^2\tau + R}, \qquad (28.8)$$

where K = (P_{i−1} + θ²τ)/(P_{i−1} + θ²τ + R) is the Kalman gain. Thus the algorithm, which we might call probabilistic gradient descent, will now step to the new location

$$x_{i+1} = x_i - \alpha_i m_i = x_i - \alpha_i\big((1 - K)\,m_{i-1} + K\,y_i\big). \qquad (28.9)$$

Exercise 28.1 (easy, instructive). What is the value of the Kalman gain K, and thus the update in Eq. (28.9), if the Wiener process prior (28.4) is replaced with the Ornstein-Uhlenbeck prior (28.5)? Compare the role of τ in this update with the values of Eq. (28.2), and compare the role of the friction coefficient κ in the derivation of the momentum method with that of the drift coefficient γ.

Despite the simplicity of these derivations, there are a few interesting aspects to observe here:

• Not surprisingly, for noise-free observations (R = 0), the algorithm simply reverts to gradient descent.

• The diffusion scale θ introduces a new free parameter into the algorithm. Assuming that R is empirically estimated as described above, θ directly determines the gain K. So the algorithm could also be phrased in terms of K. Setting K empirically is about as hard as setting the momentum parameter β in gradient descent with momentum.

• It is tempting to note the similarities between Eqs. (28.9) and (28.3) and think that we have re-constructed the momentum method in a probabilistic fashion. But the two rules are not just subtly different (note the position of the learning rate α_i inside and outside the brackets), they are also constructed from entirely different motivations. The momentum method was originally designed for noise-free problems, to address a problem with under-damped oscillations in gradient descent (hence the quite literal notion of friction). Here, we have added another feature to this method to address an entirely different challenge: the evaluation uncertainty arising from noisy gradients. In fact, it is of course possible to combine both notions with each other and build a probabilistic, smoothed, momentum method. This is perhaps important to note because the contemporary literature on stochastic optimisation, especially in the machine learning community, tends to conflate these two aspects and present the notion of momentum as a remedy for stochasticity. But noise on gradients and under-damping of gradient descent are two separate issues. If the long-term goal is to develop a clear and reliable theory for optimisation, then separate problems should be addressed by separate notions.
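Collecting Eqs. (28.6), (28.8) and (28.9), one step of this probabilistic gradient descent is only a handful of element-wise operations. The following sketch uses our own naming and treats all filter quantities as vectors over the N gradient elements.

```python
# Sketch of "probabilistic gradient descent", Eqs. (28.8)-(28.9): an
# element-wise Kalman filter (Wiener-process prior) smoothing noisy
# gradients before each step. All names are ours.
import numpy as np

def prob_gd_step(x, m, P, y, R, alpha, theta2, tau):
    """x: iterate; m, P: filter mean/variance per gradient element;
    y: noisy gradient; R: its element-wise noise variance."""
    P_pred = P + theta2 * tau           # predict under Wiener prior, Eq. (28.6)
    K = P_pred / (P_pred + R)           # Kalman gain
    m_new = (1 - K) * m + K * y         # Eq. (28.8)
    P_new = (1 - K) * P_pred
    x_new = x - alpha * m_new           # Eq. (28.9)
    return x_new, m_new, P_new
```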

Research on element-wise optimisation rules motivated by Kalman filtering continues, for example with the idea to use curvature information to calibrate the gain of the filter.9

9 Chen et al. (2020)

► 28.3 Quasi-Newton Methods

We now move on to optimisation rules that treat the entire gradient jointly, allowing for interactions between gradient elements. Here, too, we will focus on one sub-class of this wider area, algorithms that aim to approximate the behaviour of Newton's method and are thus known as quasi-Newton methods.10 Other popular families of update rules, not discussed here, include nonlinear conjugate gradients11 and trust-region methods.12

10 For reasons that will become clearer below, the same class of methods has also historically been known as secant and variable metric methods.
11 Fletcher and Reeves (1964)
12 Winfield (1970); Powell (1975); Moré (1983).

Newton's method13 is about the fastest optimisation method for multivariate problems anyone could wish for. A straightforward way to derive this iterative update rule is to consider the second-order Taylor expansion of the objective f around the current iterate x_i,

$$f(x_i + d) \approx f(x_i) + d^\top \nabla f(x_i) + \tfrac{1}{2}\, d^\top B(x_i)\, d.$$

(Recall that B is the notation for the Hessian of f.) If the Hessian is symmetric positive definite (if f is locally convex), then this quadratic approximation has a unique minimum, which defines the next Newton iterate,

$$x_{i+1} = x_i - B^{-1}(x_i)\, \nabla f(x_i).$$

13 Also known as Newton–Raphson optimisation. The idea itself is ancient; a primitive form may already have been known to the Babylonians (Fowler and Robson, 1998, p. 376).
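As a point of reference for the rest of this section, here is a minimal numpy sketch of the (damped) Newton iteration; note that it solves the linear problem rather than forming the inverse Hessian. The names and the quadratic test problem are illustrative.

```python
import numpy as np

def newton_step(x, grad, hess, alpha=1.0):
    """One damped Newton step: solve B(x) z = grad f(x), then move along -z."""
    z = np.linalg.solve(hess(x), grad(x))   # Newton direction without explicit inverse
    return x - alpha * z

# example: a convex quadratic f(x) = 0.5 x^T A x - b^T x, with Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = newton_step(np.zeros(2), grad=lambda x: A @ x - b, hess=lambda x: A)
# on a quadratic, a single exact Newton step lands on the minimiser A^{-1} b
```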

Newton's method has some issues with convergence, in particular on non-convex problems. But when it converges, it can converge quadratically fast,14 which is to say: very fast. The primary problem of Newton's method for medium- and large-scale problems is not its stability, but the need to compute the Hessian B(x_i) and invert it (or rather, to solve the linear problem B(x_i) z = ∇f(x_i)). Quasi-Newton methods15 are one way to address the computational cost of Newton's method by constructing an approximation B̂(x_i) to B(x_i). They are based on the observation that each pair of subsequent gradient observations [∇f(x_i), ∇f(x_{i−1})] collected by the optimiser provides information about the Hessian function, because the Hessian is the rate of change of the gradient, or more precisely:

$$y_i := \nabla f(x_i) - \nabla f(x_{i-1}) = \bar B\, (x_i - x_{i-1}) =: \bar B s_i,$$

where $\bar B$ is the average Hessian

$$\bar B := \int_0^1 B\big(x_{i-1} + t\, (x_i - x_{i-1})\big)\, \mathrm{d}t.$$

14 Nocedal and Wright (1999), Theorem 3.5
15 A great contemporaneous review with extensive analysis and discussion can be found in Dennis and Moré (1977).


Thus, any matrix B̂ that satisfies this so-called secant equation

$$y_i = \hat B\, s_i \tag{28.10}$$

is a candidate estimator for the Hessian B(x_i). If we manage to invert this estimator, we have a way to estimate the Newton direction. Alternatively, if we are lucky enough to collect noise-free gradients, we could also use the inverse secant equation

$$s_i = \hat H\, y_i$$

to estimate the inverse Hessian H = B^{-1}.



Table 28.1: The most popular members of the Dennis family, Eq. (28.11), defined by their choice of c_i (middle column); see Martínez R. (1988) for more details. Note that the names DFP and BFGS consist of the first letters of the names of their inventors (right column).

Name                      | c_i                                                 | Reference
Symmetric Rank-1 (SR1)    | c_i = y_i − B_i s_i                                 | Davidon (1959)
Powell Symmetric Broyden  | c_i = s_i                                           | Powell (1970)
Greenstadt's method       | c_i = B_i s_i                                       | Greenstadt (1970)
DFP                       | c_i = y_i                                           | Davidon (1959); Fletcher and Powell (1963)
BFGS                      | c_i = y_i + √(y_i^⊤ s_i / s_i^⊤ B_i s_i) B_i s_i    | Broyden (1970); Fletcher (1970); Goldfarb (1970); Shanno (1970)

This situation sounds familiar, and indeed it is closely related to the setup discussed at length in Chapter III on linear algebra. Here, as in the earlier chapter, an algorithm collects linear projections of some matrix, and has to estimate that matrix or its inverse. So we can save a lot of derivations and re-use the results from Chapter III. However, there are pitfalls in the nonlinear optimisation setting that complicate both the classic theory of quasi-Newton methods and the development of a useful probabilistic theory for them. The primary problem is that, in nonlinear optimisation, the Hessian is of course not a constant function. Since the algorithm only ever gets to collect rank-1 projections of the Hessian, it has to make assumptions not just about the aspects of the matrix it has not yet seen, but also about how previously observed aspects may have changed and become outdated over the preceding steps.

This is also the reason why there is not just one ‘best’ quasi-Newton method. In contrast to the linear setting, where the method of conjugate gradients is a contestant for the gold standard for spd problems, there are entire families of quasi-Newton methods. A widely studied one is the Dennis family16 of update rules of the form

$$B_{i+1} = B_i + \frac{(y_i - B_i s_i)\, c_i^\top + c_i\, (y_i - B_i s_i)^\top}{c_i^\top s_i} - \frac{s_i^\top (y_i - B_i s_i)}{(c_i^\top s_i)^2}\, c_i c_i^\top, \tag{28.11}$$

where $c_i \in \mathbb{R}^N$ is a parameter that determines the concrete member of the family. Quasi-Newton methods were the subject of intense study from the late 1950s to the late 1970s. The most widely used members of the Dennis family are presented in Table 28.1. Among these, the BFGS method is arguably the most popular in practice, but this should not tempt the reader to ignore the other ones. This is particularly true for problems with noisy gradients.

16 Dennis (1971)
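To make the family concrete, the following sketch implements the rank-2 update of Eq. (28.11) and checks numerically that choices of c_i from Table 28.1 yield estimates satisfying the secant equation (28.10); the helper name is illustrative.

```python
import numpy as np

def dennis_update(B, s, y, c):
    """Dennis-family update of Eq. (28.11): returns the new Hessian estimate."""
    r = y - B @ s                                   # residual of the secant equation
    cs = c @ s
    return (B + (np.outer(r, c) + np.outer(c, r)) / cs
              - (s @ r) * np.outer(c, c) / cs**2)

rng = np.random.default_rng(0)
B = np.eye(4)
s, y = rng.standard_normal(4), rng.standard_normal(4)

B_sr1 = dennis_update(B, s, y, c=y - B @ s)         # Symmetric Rank-1
B_dfp = dennis_update(B, s, y, c=y)                 # DFP

# every non-zero choice of c yields an estimate with B_new s = y
assert np.allclose(B_sr1 @ s, y) and np.allclose(B_dfp @ s, y)
```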

One can easily check that every non-zero choice of c_i yields a matrix estimate B_{i+1} that satisfies the secant equation (28.10).17 We can also recognise in Eq. (28.11) the rank-2 update form we already know from the linear algebra chapter, for example from Eq. (19.21). Since quasi-Newton methods take a step in the direction of the estimated Newton direction,

$$x_{i+1} = x_i - \alpha_i B_{i+1}^{-1}\, \nabla f(x_i),$$

their algorithmic structure is very similar to that of the generic linear probabilistic solver of Algorithm 17.2. Since we used that algorithm to construct a probabilistic version of conjugate gradients, it is perhaps not surprising that quasi-Newton methods are closely related to the method of conjugate gradients. In fact, when run on the quadratic problem of Eq. (16.2), and using exact line searches, all five of the methods specifically listed above produce a sequence of estimates that is identical to that of the method of conjugate gradients.18

17 The DFP and BFGS methods can also be constructed as estimates of the inverse Hessian H: if, in Eq. (28.11), one replaces all instances of B with H and swaps the roles of s_i and y_i, then DFP and BFGS correspond to the respective updates in Table 28.1, where one also has to swap s_i and y_i in these equations. More details can be found in §7.3 of Dennis and Moré (1977).
18 Dixon (1972a); Dixon (1972b).

This connection allows us to transfer some of the results from the linear algebra chapter.19 In particular, we can read off the following result directly from Eq. (19.21).

19 Historically, the probabilistic interpretations developed the other way round, starting with a study of quasi-Newton methods (Hennig and Kiefel, 2012), then extending towards the linear case. More about these connections, as well as proofs of some of the results quoted herein, can be found in Hennig (2015).

Corollary 28.2. Let $W_i \in \mathbb{R}^{N \times N}$ be an spd matrix with the property $W_i s_i = c_i$. Then the Dennis family estimate (28.11) equals the posterior mean on B arising from the Gaussian prior

$$p(B) = \mathcal{N}(B;\, B_i,\, W_i \otimes W_i)$$

and the single observation likelihood $p(y_i \mid B, s_i) = \delta(y_i - B s_i)$.

Thus, if one is interested in a direct probabilistic interpretation of quasi-Newton methods, it is possible to think about the Dennis family in terms of a Kalman filter, albeit with a somewhat disappointing choice for the underlying dynamics: initialise the filter with m_0 = B_0 = εI for some reasonable choice of ε, so that the first search direction s_0 of the optimiser is the negative gradient −∇f(x_0) (this is a standard choice for such methods; alternatively, use a pre-conditioner). The algorithm performs one optimisation step, collecting the first observation pair (y_1, s_1). Now implicitly set $P_1^- = W_1^- \otimes W_1^-$ with any matrix $W_1^-$ so that $W_1^- s_1 = c_1$, for the value of c_1 we aim to reproduce from Table 28.1. This yields a filter estimate with updated estimation mean m_1 = B_1 equal to the estimate of the Dennis family member, and updated estimation covariance (from Eq. (19.11))

$$P_1 = W_1^- - \frac{W_1^- s_1\, s_1^\top W_1^-}{s_1^\top W_1^- s_1}.$$



So far so good, but now, in the final step to complete the connection so that the next iteration still behaves like the Dennis method, we have to implicitly force the filter to add, in the prediction step, just the right terms to P_1 so that $P_2^- = A_1 P_1 A_1^\top + Q_1 = W_2^- \otimes W_2^-$ with an spd matrix $W_2^-$ that yields the required match $W_2^- s_2 = c_2$ to the Dennis family. It turns out that such a step does not always exist, because the necessary update W_2 − W_1 is not always symmetric positive definite. So, beyond the special case of the linear problems discussed at length in Chapter III, we cannot hope to find a general and one-to-one interpretation of existing quasi-Newton methods as Kalman filtering models.

However, given how deeply quasi-Newton methods have been studied in the past, and how long it took to develop a good theoretical grasp on them, it is perhaps not so urgent to add yet another probabilistic interpretation to them. What is urgently needed are efficient optimisation methods for stochastic, in particular high-dimensional stochastic, optimisation problems. Here, probabilistic approaches have very recently begun to yield promising advances. For example, Wills and Schön (2017) report on stochastic quasi-Newton methods with stochastic line-search algorithms. Some ideas can also be found in Hennig (2013), and in Chapter 9 of Mahsereci (2018).
Summary

This chapter discussed uses for probabilistic reasoning in optimisation. Classic algorithms remained an important reference point. We reviewed line searches, element-wise optimisation routines, and advanced methods like quasi-Newton algorithms. But in contrast to earlier chapters, we were less interested in explicitly re-constructing these algorithms probabilistically, and instead identified some contemporary problems where these methods expose weaknesses, then addressed these problems from the probabilistic perspective:

Readers who have used classic optimisation routines, on


the noise-free setups they were designed for, know such
methods as well-designed black boxes. Given any reasonably
well-conditioned task, they tend to converge to a good esti­
mate without user intervention. Contemporary problems, in
particular in areas like machine learning where large, sub­
sampled data sets play a key role, introduce a significant
degree of computational imprecision - uncertainty. In such
settings even simple routines like gradient descent and its
element-wise variants suddenly require laborious and time­
consuming manual tuning by the user. From the probabilistic
perspective, one cause for this problem is that classic routines
implicitly assume that all numbers are computed to machine
precision, a notion reflected in the Dirac likelihood we find in
probabilistic interpretations of such methods. We addressed
this issue in two ways. First, we introduced an explicit Gaus­
sian likelihood. Identifying this likelihood required addi­
tional computations, of quantities that have no direct classic
analogue. We found those estimators to be relatively cheap
to compute, adding acceptable computational overhead. Sec­
ond, we saw how internal parameter-tuning routines (the
line search) of classic methods can be generalised to the prob­
abilistic setting by taking the existing method and carefully
replacing every instance of a point estimate with a probabilistic generalisation: cubic splines with integrated Wiener


process regression; bisection search with a discrete Bayesian
decision rule; and the Boolean Wolfe conditions with a proba­
bilistic confidence. The resulting method once again liberates
the user from frustrating parameter choices.

The core part of an iterative optimisation routine for con­


tinuous problems is formed by the decision rule that maps
observed gradient values into search directions. When faced
with a stochastically corrupted version of an otherwise stan­
dard optimisation problem (such as empirical risk minimisa­
tion in machine learning), it is tempting to just keep using the
classic methods for high-dimensional problems and analyse
their stability to the noise. The risk in this approach is that it
can lead one to re-purpose algorithmic design choices orig­
inally intended to address an entirely different problem to
now deal with noise. An example is the use of momentum in
gradient descent (originally intended to dampen oscillations
in noise-free optimisation) to stabilise against noise. Here,
it can be conceptually and practically helpful to introduce
an explicit probabilistic sub-routine to take care of inference,
and only then combine its output with the old design tricks.
Doing so can help identify some hyperparameters, or to
separate challenges in an optimisation problem that arises
from the objective’s geometry from those originating from
observation noise.

The take-away from this chapter is that Probabilistic Numerics


does not always have to mean the reinvention of classic meth­
ods from an inference perspective. It can also help address new
challenges, for example those arising from severe computational
noise in Big Data applications. By combining the analytical and
practical strengths of inference with those of classic numerical
algorithms, it is possible to build new methods that address
the challenges of the present, while heeding the lessons of the
past. The probabilistic viewpoint here provides guardrails, a
mechanism for the principled development of solutions to the
issues. This is important at a time when the unsatisfactory
performance of classic deterministic methods has led to a bewil­
dering growth of new ad-hoc stochastic routines.20 In a world where practitioners can choose among well over a hundred different optimisers, each with their own set of hyperparameters, it is less important to find one that sometimes works well than to develop conceptual principles to guide the development of methods that autonomously tune their parameters and work robustly across many problems.

20 Schmidt, Schneider, and Hennig (2021)


This chapter argued that internal hyperparameters of a nu­
merical algorithm should not be tuned by the user, but set
efficiently by the algorithm itself. Even where this approach is
wholly successful, there are often also external parameters of
a software solution that still require tuning. For example, in
a machine learning system, the machine learning model itself
may have to be tuned. The entire software solution then turns
into a utility function to be optimised. Even a single evaluation
of this global utility function then can be quite expensive, as
it requires training the entire architecture to some reasonable
precision. Where this is the case, sample efficiency (converging
in fewer steps), rather than computational efficiency (cheaper
steps), becomes the primary objective of algorithm design. The
next chapter discusses probabilistic optimisation algorithms for
such experimental design problems.
Chapter V
Global Optimisation
29 Key Points

This chapter presents Bayesian optimisation, the probabilistic numerical approach to global optimisation: the problem of finding the lowest point of a likely multi-modal objective function. Here, unlike in local optimisation (Chapter IV), gradients of the objective function might not exist, or might be too expensive to compute. The chapter's core messages are the following.

Bayesian optimisation is different from the other examples


of pn in this text. First, it was conceived as an ab initio
probabilistic approach, competing with a diverse range of
other global optimisation algorithms. Second, at time of writ­
ing, Bayesian optimisation is already popular and successful,
whether judged by citations, competition results, usage of
software libraries, or economic impact.

A Bayesian optimiser is a probabilistic agent that acts by


evaluating the objective function, receiving back data in the
form of evaluations. As such, the designer of a Bayesian
optimiser must make two important choices: its surrogate
and its loss function.

The surrogate is the optimiser's model for the objective function, reflecting the designer's prior assumptions. Little is more important in practice than designing a surrogate that is well-informed about the features of the objective (particularly where the objective's input space possesses unusual features, such as graphs or strings). This surrogate must be a probabilistic model, such as a Gaussian process, as the probabilistic treatment of uncertainty is vital to global optimisation. First, evaluations of the objective are often inexact – for instance, they may be corrupted by noise – and so the surrogate must be able to accommodate uncertain data. Second, it is always important to reason correctly about the uncertainty in unvisited regions of the objective. It is the surrogate's assessment of uncertainty that controls the challenging task of exploration.

The loss function quantifies how exploration should be bal­


anced against exploitation, i.e. against focusing on an iden­
tified mode of the objective. The choice of loss function is
non-trivial, as the task of finding the optimum can be sen­
sibly framed in a number of distinct ways. When mapped
through the surrogate, each loss function leads to a different
expected loss, known as an acquisition function, each with
different strengths and weaknesses.
30 Introduction

Chapter IV presented methods designed to find the extremum1 of an objective function that is assumed to be convex. These methods are often used on objectives that are, instead, multi-modal: in such a case, they will converge only to the mode of the objective nearest (in some sense) their starting position. We will refer to such methods as providing local optimisation.

1 Without loss of generality, our discussion in this chapter will focus on minimisation. Maximisation can be achieved by minimising the negative of the objective.

Global optimisation, as its name suggests, instead tackles the problem of finding the global minimiser x* of an, e.g., multi-modal objective f(x) ∈ ℝ, where f(x*) := min_x f(x) is the minimum of all local modes of the function.2 This is a much more challenging problem, demanding the balancing of the exploration–exploitation trade-off.3

2 Note that f may have no minimum at all – min_x f(x) is, however, defined if f is continuous and the domain is compact (Garnett, 2022).
3 This term is more common in the multi-armed bandit and reinforcement learning literatures (Sutton and Barto, 1998).

Exploitation corresponds to making an evaluation with a high probability of improvement. Typically, an exploitative move is an evaluation whose result is known with high confidence, usually near a known low function value, and that is expected to yield an improvement, perhaps an improvement over that known low function value, even if only an incremental one. Exploitation hence typically hones in on a local mode, as would be performed by local optimisation.
Exploration instead corresponds to evaluating the objective in a region of high uncertainty. Such an evaluation is high risk, and may be expected to supply no improvement over existing evaluations at all. Nonetheless, exploration is warranted by improbable, high-payoff, possibilities: such as finding an altogether new local mode.

Exploration is hard. To be clear, for most problems, it is not difficult to find evaluations that are high-uncertainty. The search domain is normally enormous4 and begins as uniformly high-uncertainty. The low-uncertainty regions produced by our existing evaluations are few, like stars dotted in the void. The challenge of exploration is hence in sifting through the mass of uncertainty to find the evaluation that best promises potential reward. This is a challenge common to many aspects of intelligence: think of artistic creativity, or venture capital, or simply finding your lost keys. When humans explore, we draw upon some of our most profoundly intelligent faculties: we theorise, probe and map. As such, exploration for global optimisation motivates sophisticated algorithms.

4 For instance, if the search domain is [0, 1]^D, evaluating just the corners of the search box will require 2^D evaluations. This number, for a realistic problem, say with D = 20, could easily exceed the permitted budget of evaluations.

Figure 30.1: A global optimisation problem. We are given an objective function f(x), which can be evaluated (possibly corrupted by observation noise) for any chosen location x. The objective may be expensive to evaluate, limiting the number of such evaluations: in the figure, only three evaluations have been gathered thus far. We must choose future evaluation locations so as to determine the objective's minimiser x* and/or minimum f(x*).
Relative to local optimisation, global optimisation typically:

is less amenable to theoretical treatment;

requires more computation for the process of optimisation itself (distinct from that required for the evaluation of the objective), and is hence used predominantly for objectives whose expense, in computation, time and/or money, justifies the computational overhead; and

has a computational cost that scales more poorly in the dimension of the objective (e.g. convergence only being reliable in practice for problems of no more than around 20 independent, relevant, inputs).

An illustration of a global optimisation problem is provided in


Figure 30.1.
Despite such obstacles, research into global optimisation has been spurred on by the importance and ubiquity of its applications. These encompass robotics,5 sensor networks,6 environmental monitoring,7 and software engineering,8 amongst many more. The real world is very often more complex than convex. To address non-convex problems, a zoo of competing global optimisation techniques have been developed. Popular approaches include evolutionary methods, branch-and-bound methods and Monte-Carlo-based algorithms.9

5 Calandra et al. (2014a)
6 Garnett, Osborne, and Roberts (2010)
7 Marchant and Ramos (2012)
8 Hoos (2012)
9 Weise (2009)
Bayesian optimisation is a probabilistic framework for global optimisation. It is an exception among probabilistic numerics approaches: it was directly conceived from a probabilistic viewpoint, with no direct non-probabilistic predecessor (although there is, of course, a rich literature on non-probabilistic global optimisation).10 Although still a young area by the standards of applied mathematics,11 compared to other areas discussed in this text, Bayesian optimisation is already a surprisingly mature, developed field, and has developed a rich set of algorithms. In this chapter, our ambition is to provide a compact take on Bayesian optimisation from a probabilistic-numerics perspective: more comprehensive overviews of Bayesian optimisation exist elsewhere.12

10 Cf. Horst and Tuy (2013) and Mitchell (1998).
11 Bayesian optimisation's rich history, detailed in Garnett (2022), is perhaps best traced back to Kushner (1962).
12 Garnett (2022)
31 Bayesian Optimisation

As for any probabilistic numerical procedure, it is important to distinguish the two components of a Bayesian optimisation algorithm: its prior and its loss function. Its prior must provide a model for the objective function, p(f), and hence also for its minimum, p(f(x*)),1 whereas the loss function specifies the goals of optimisation.

1 Both densities exist under mild assumptions (Garnett, 2022).

► 31.1 Prior

The prior for global optimisation must, first, act as a means of modelling the objective function. In the language of global optimisation, a prior for the objective can be viewed as providing a surrogate2 for the objective. A surrogate is usually viewed as a function informative of, but easier to optimise than, the objective itself. Indeed, as we will discuss in §31.2, Bayesian optimisation does require optimising over a function derived from the prior. The choice of such a prior will seem familiar from our efforts in Chapter II to construct prior distributions for integrand functions. There, we concentrated on priors that reflected expectations that the integrand be smooth or otherwise structured to some degree; the objective is another function for which we are likely to have such expectations. Indeed, if there were no structure at all, optimisation could be no more effective than attempting to find a needle in a haystack. Just as in Bayesian quadrature, Bayesian optimisation predominantly uses Gaussian process priors encoding strong structure. An illustration of how a gp prior can inform optimisation is provided in Figure 31.1. Other choices include random forests3 and neural networks.4 As a generalisation, the attractive scaling (of the computational cost, in both the number of evaluations and the number of dimensions) of these alternative models is often offset by the poorer calibration of their credibility intervals (uncertainties).5 Managing such uncertainty is at the heart of the exploration–exploitation trade-off faced in global optimisation.

2 A surrogate is the equivalent of the model within numerical integration.
3 Hutter, Hoos, and Leyton-Brown (2011)
4 Snoek et al. (2015); Springenberg et al. (2016).
5 Shahriari et al. (2016)

Figure 31.1: A Bayesian answer to the global optimisation problem is to assign a gp prior to the latent function. This particular gp prior arises from a zero prior mean and a rational quadratic kernel with unit length scale and degree of freedom α = 0.5. Given the three observations from Figure 30.1, we plot: the gp posterior mean as a dark line; marginal densities as shading; three sample functions as dotted lines; and the location of each sample's minimum as squares. This gp gives rise to an (intractable) probability density function (pdf) over the location, x*, of the function's minimum. This is plotted along the bottom of the figure. For this univariate problem, and given sufficient computational resources, we can represent this pdf as a histogram from exhaustive sampling. Note that there is a finite probability for the minimum to lie exactly at the domain boundary (one of the samples is an example case).
While the most common setting for Bayesian optimisation is x ∈ X ⊂ ℝ^d, for compact X, this is by no means a fundamental constraint for this class of techniques. Any domain for x for which an appropriate prior p(f(x)) can be defined is feasible. It is not uncommon in Bayesian optimisation to consider richer input spaces, including those that are discrete, or that correspond to graph-based representations (of, for example, molecules).6

6 Gómez-Bombarelli et al. (2016)

► 31.2 Loss Function

Effective Bayesian optimisation depends crucially on the choice


of a faithful loss function. The loss function for Bayesian opti­
misation quantifies how the exploration-exploitation trade-off
should be negotiated. When first considering a loss function for
optimisation, it might be thought that the goal to be achieved
is simply specified: find the minimum. However, the goal of
finding the minimum may be encoded in at least several distinct
but plausible ways.

1. First, our loss might be the lowest function value evaluated, such that our goal is to uncover as low a function value as possible: we will call this the value loss (vl).

2. Alternatively, our loss might be the entropy in the location of the minimum, x*, which we will call the location-information loss (lil). Figure 31.1 depicts a posterior for the minimiser x*, whose entropy would serve as the lil.

3. Another competing possibility is the value-information loss (vil), equal to the entropy in the value of the minimum, f(x*).

Figure 31.2: A conceptual sketch of the Bayesian optimisation decision problem. Given the current data set, Dn, we must decide upon the decision variable (represented as a diamond node) xn. The objective is then evaluated, returning yn = f(xn), and (xn, yn) added to Dn to give Dn+1. Given Dn+1, a decision will be made about xn+1 – as this decision belongs to the future, and depends on the uncertain yn, xn+1 is uncertain. This pattern repeats until the final decision, for the input to be returned to the user, xN, must be made and the final value, yN, returned. Note that the diagram does define a graphical model (Bayesian network), but its joint distribution is challenging to determine. Considered as a graphical model, all variables along the dark line are to be taken as dependent.

These are not the only plausible candidate losses; we will meet alternatives below. Crucial to distinguishing these losses is a careful treatment of the end-point of the optimisation. The loss function must make precise what is to happen to the set of obtained objective evaluations once the procedure ends, and how valuable this outcome truly is. One crucial question is that of when our algorithm must terminate. Termination might be upon the exhaustion of an a priori fixed budget of evaluations, or, alternatively, when a particular criterion of performance or convergence is reached. The former assumption of a fixed budget of N evaluations is the default within Bayesian optimisation, and will be taken henceforth.

We present in Figure 31.2 the decision problem for Bayesian optimisation. We seek to illustrate the iterative nature of optimisation and its final termination. In particular, the terminating condition for optimisation will often require us to select a single point7 in the domain to be returned: we will denote this point as xN. At the termination of the algorithm, we will define the full set of evaluation pairs gathered as D_N := {(x_i, y_i) | i = 0, ..., N−1}. Here the ith evaluation is y_i = f(x_i). We will assume, for now, that evaluations are exact, hence noiseless. The returned point will often be limited to the set of evaluation locations, xN ∈ D_N, but this need not necessarily be so.8

7 We will regard this final point as additional to our permitted budget of N evaluations.
8 In the absence of noise, limiting to the set of evaluation locations enforces the constraint that the returned function value (the putative minimum) is known with complete confidence. This is not unreasonable; however, in some settings, the user may be satisfied with a more diffuse probability distribution over the returned value: such considerations, of course, motivate the broader probabilistic-numerics vision. It is worth noting that the limitation to the set of evaluation locations does not permit returning unevaluated points, even if their values are known exactly. As an example where this is important, consider knowing that a univariate objective is linear: then, any pair of evaluations would specify exactly the minimum, on one of the two edges of a bounded interval. In such a case, would we really want to require that this minimum could not be returned until it had been evaluated?

The importance of the loss function can be brought out through consideration of the consequences of the terminal decision of the returned point, xN. With our notation, the loss functions can be defined as follows:

$$\begin{aligned}
\Lambda_{\text{vl}}(x_N, y_N, \mathcal{D}_N) &= y_N,\\
\Lambda_{\text{lil}}(x_N, y_N, \mathcal{D}_N) &= H(x_* \mid x_N, y_N, \mathcal{D}_N),\\
\Lambda_{\text{vil}}(x_N, y_N, \mathcal{D}_N) &= H\big(f(x_*) \mid x_N, y_N, \mathcal{D}_N\big).
\end{aligned} \tag{31.1}$$
It is not difficult to find an application demanding each of
these three losses. The value loss would be appropriate if the
final evaluation, yN, provided a persistent object with worth
equal to the objective value. An example might be optimising
the activity of a drug molecule: after the budget of expensive
trials (evaluations) has been exhausted, the best of the trialled
molecules is chosen for further development. The value loss
hinges entirely on the returned value yN , subsequent to which
no future evaluations (no variation) will be permitted. Recall
that our definitions (31.1) assume exact observations - in §32.3,
we will consider the case in which evaluations are corrupted by
noise.
The location-information loss might be appropriate if, at
the end-point of the optimisation process, it were possible to
make evaluations with inputs in some neighbourhood of xN
(rather than exactly at xN, as is considered by the value loss).
For instance, in the drilling of an oil well, after drilling a certain
number of test wells down into a plane, it might be possible to
drill a small distance sideways from the best well until an even
better location were found. The location-information loss might
also be appropriate if the selected location for the minimum
xN was corrupted by a noise contribution e before the ultimate
value yN = f (xN + e) was realised.
The value-information loss might be appropriate if the mini­
mum were a quantity of scientific interest, as is the equilibrium
state in an economic model of loss-minimising consumers. Here,
it is not the minimum itself that has value, but what its deter­
mination reveals about the world around us.
As in any application of decision theory, the quantity that most directly determines our actions is not the loss, but the expected loss. In Bayesian optimisation, the term acquisition function (or, less commonly, infill function or query selection function) is used to describe expected loss functions.9 We will henceforth place ourselves at the nth step in the optimisation procedure, such that the optimiser has gathered a set of evaluation pairs, Dn := {(x_i, f(x_i)) | i = 0, ..., n−1} (where n < N, the total budget of evaluations). Then, an acquisition function a(xn) is considered to be a function of the next evaluation location, xn: its optimum represents the optimal placement for this next evaluation.10 It is also used to describe other functions used for selecting evaluations: we will develop some of these subtleties below. An acquisition function is also sometimes distinguished from a recommendation strategy used to select the final putative location for the minimum, xN. In a decision theoretic, probabilistic numeric, framework, the acquisition function should be derived from a loss function defined on the results of the final selection: we have no need for a separate recommendation strategy.

9 The acquisition function fills the same role as a design rule in integration.
10 More precisely, in the literature, the term acquisition function is more commonly used to describe a function that is to be maximised to determine the next evaluation location. Nonetheless, to maintain consistency with our expected-loss framework, we will use the term acquisition function to describe a function that is to be minimised rather than maximised. In some cases, this will result in us describing acquisition functions as the negation of their more common forms.

Within an acquisition function, an expectation must be computed over the variables in Figure 31.2 that are neither already observed (Dn) nor decided upon (xn). That is, an expected loss for Bayesian optimisation must, in general, marginalise not just the evaluations yn, ..., yN, but also the future locations xn+1, ..., xN. This marginalisation is typically impossible in closed form.11 Kushner (1964) asserted that the full, multiple-step expected loss

“depends on the locations of the future observations, and it is generally so complicated that it usually is not a practical calculation with present-day computing equipment”.

Computing this expected loss remains profoundly challenging to this day. We will tackle this problem in §32.4. Until then, we will adopt severe myopic (short-sighted) approximations that ignore the potential impact of all future evaluations other than the very next one, f(xn). Kushner (1964) motivates myopic approximations by noting that the surrogate model is often wrong, encoding assumptions that do not hold for the objective at hand. Non-myopic approaches, in forecasting the impact of many future evaluations, rely more heavily on the model than do myopic approaches. Speaking roughly, performance may be aided by reducing reliance upon a model that is wrong – when the model is wrong, a myopic approach may be helpfully conservative. As another justification for this approximation, as described by Hennig and Schuler (2012), note that, unlike general planning problems, active inference does not suffer from ‘dead ends’. That is, the consistency of Bayesian inference means that an uninformative evaluation can always be overcome by future informative evaluations: it will always remain possible to learn the entire function. The myopic loss functions that we will describe will all, to some degree, promote exploration and perform reasonably in practice.

11 One exception is that of independent, discrete-valued evaluations y (Gittins, 1979). This is a substantially simpler setting than one which takes a prior, like a gp, that allows an evaluation for one location to be informative of others. Other, uninteresting, exceptions exist: for instance, the expected loss is closed-form for a loss function equal to a constant, ignoring the evaluations.
Under a myopic approximation, the expected loss (associated with loss function Λ) will take the form of the acquisition function

$$a(x_n \mid \mathcal{D}_n) = \mathbb{E}\big(\Lambda(x_n, y_n, \mathcal{D}_n)\big) = \int \Lambda(x_n, y_n, \mathcal{D}_n)\, p(y_n \mid \mathcal{D}_n)\, \mathrm{d}y_n.$$
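Where this one-dimensional integral has no closed form, it can be approximated by simple Monte Carlo over the predictive p(yn | Dn). A minimal sketch, assuming a Gaussian predictive (as under a gp surrogate); all names are illustrative.

```python
import numpy as np

def myopic_acquisition(loss, x_n, post_mean, post_var, data,
                       n_samples=1000, seed=0):
    """Monte Carlo estimate of a(x_n | D_n) = E[ loss(x_n, y_n, D_n) ],
    with y_n ~ N(post_mean(x_n), post_var(x_n))."""
    rng = np.random.default_rng(seed)
    y = post_mean(x_n) + np.sqrt(post_var(x_n)) * rng.standard_normal(n_samples)
    return float(np.mean([loss(x_n, yi, data) for yi in y]))
```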

Having evaluated at xn, we will then use the acquisition func­


tion a (x | Dn+1) to select xn+1. Note that a (x | Dn) will be
similar, but not identical, to a(x | Dn+1): the former's underlying probability distributions, such as p(yn | Dn), will have been
updated in light of new data, (xn, yn) (becoming less diffuse
in the process). Some such distributions, notably p(xN | Dn+1),
and hence p(yN | Dn+1), must also reflect that there are fewer
future evaluations remaining. Referring back to Figure 31.2,
we observe that the posterior for xN (the returned location)
at the nth step, p(xN | Dn), will be more diffuse than that at
the (n + 1)th step, p(xN | Dn+1), even if the nth evaluation is
entirely uninformative. With fewer remaining evaluations, there
will be fewer potential surprises (such as the discovery of a new
mode of the objective) in the future to consider: such surprises
can lead to large changes in beliefs about the location of the
minimum. With fewer such surprises in store, it is reasonable
for the optimisation agent to be more confident about its actions
in the future, at the final, Nth, step. Importantly, for coherent
optimisation, the loss function itself should remain the same
throughout the optimisation algorithm.
Most acquisition functions, aiming to balance both exploita­
tion and exploration, will be non-convex. That is, there will be a
diverse set of potentially valuable locations (some near existing
modes, others far removed from them) at which to evaluate
the objective. As such, practical Bayesian optimisation requires
the use of a further global optimiser to optimise the acquisition
function and hence select the next evaluation location.
You could be forgiven for stumbling over the previous sen­
tence. If our stated goal is global optimisation, what have we
achieved in re-framing this in such a way as to require yet an­
other global optimisation problem? The answer is found in the
realisation that the acquisition function is more amenable to
optimisation than the original objective function. To begin, the
acquisition function is usually substantially less expensive than
the objective: for instance, it can be evaluated on a computer,
where the objective might require the drilling of an oil well.
Second, the acquisition function usually admits closed-form ex­
pressions for its gradient and Hessian, where the objective may
not: this additional information can greatly aid optimisation. A
final argument is that performance is usually relatively insensi­
31 Bayesian Optimisation 257

tive to the optimisation of the acquisition function: even local


minima of the acquisition function will usually result in usefully
informative evaluations. As such, the optimiser used for the
acquisition function can be taken as something relatively cheap
and dirty, without necessarily sacrificing ultimate performance.
Such considerations are common in numerics: it is common
for one numerical problem to require the solution of another.
There is a roughly perceived hierarchy of algorithms, with more
important algorithms (like quadrature) permitted to call less
important algorithms (like linear solvers). As discussed in §, one
goal of Probabilistic Numerics is to formalise this hierarchy, by
designing numerical algorithms that universally communicate
probability distributions, and propagate uncertainty throughout
a pipeline of algorithms. In such a setting, it is not impossible
to imagine that design decisions hitherto underpinned only by
intuition, such as the rule that the optimisation of the acquisition
function is relatively unimportant for performance, might be a
naturally emergent result.
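To summarise the structure described in this chapter, here is a minimal sketch of a Bayesian optimisation loop with an abstract surrogate and acquisition function, using a cheap grid as the inner 'global optimiser' of the acquisition; in practice one would multi-start a local optimiser instead. All names are illustrative, and the concrete acquisition functions are the subject of the next chapters.

```python
import numpy as np

def bayes_opt(objective, acquisition, fit_surrogate, bounds, n_evals=20):
    """Generic Bayesian optimisation loop over a 1d box [lo, hi].

    fit_surrogate(X, y) -> (mean_fn, var_fn): surrogate posterior.
    acquisition(x, mean_fn, var_fn, y_best) -> scalar, to be minimised.
    """
    lo, hi = bounds
    X = list(np.linspace(lo, hi, 3))            # small space-filling initial design
    y = [objective(x) for x in X]
    grid = np.linspace(lo, hi, 512)             # cheap inner optimiser for the acquisition
    while len(X) < n_evals:
        mean_fn, var_fn = fit_surrogate(np.array(X), np.array(y))
        a = [acquisition(x, mean_fn, var_fn, min(y)) for x in grid]
        x_next = float(grid[int(np.argmin(a))]) # minimise the acquisition function
        X.append(x_next)
        y.append(objective(x_next))
    i = int(np.argmin(y))
    return X[i], y[i]                           # returned location and value
```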
32 Value Loss

The vl is perhaps the most intuitive of the loss functions de­


scribed in §31.2, and has been developed through a number of
avenues within Bayesian optimisation.

► 32.1 Expected Improvement

Let us begin a tour through the canonical combination of prior and loss function for Bayesian optimisation: a gp prior (see §4.2) with the expected improvement (ei) acquisition function.1 The latter is best seen as an approximation to the value loss. Figure 32.1 provides an illustration of the combination of ei and a gp.

1 Mockus, Tiesis, and Žilinskas (1978)

Let us place ourselves at the nth step in the optimisation procedure, such that the optimiser has gathered a set of evaluation pairs, Dn = {(x_i, f(x_i)) | i = 0, ..., n−1} (where n < N, the total budget of evaluations). We will again assume exact, noiseless, evaluations. The gp posterior for the objective, p(f | Dn) = GP(f; m, V), has posterior mean function m(x) and posterior covariance function V(x). We now make the approximation2

$$\Lambda_{\text{vl}}\big(x_n, f(x_n), \mathcal{D}_n\big) \approx \Lambda_{\text{ei}}(\mathcal{D}_{n+1}) := \min_{i \in \{0, \ldots, n\}} f(x_i).$$

2 Note that, under this approximation, the loss is not a function of xN.

Recall that Dn+1 includes xn and f(xn). This approximation is myopic, as discussed in §31.2: it disregards all future evaluations other than f(xn). The approximation also restricts the returned location xN to the set of evaluations; consequently, the returned location is the one that corresponds to the lowest evaluation. This assumption removes the dependence of the loss on xN. Defining the lowest function value available at the nth step as

$$\eta := \min_{i \in \{0, \ldots, n-1\}} f(x_i),$$

Figure 32.1: The expected improvement (ei) acquisition function at x is the expected amount that f(x) improves upon the current lowest evaluation, η (marked by a circle). Naturally, the expected improvement cannot be worse than zero. The light grey curves at integer values of x are the integrands (f(x) − η) p(f(x) | Dn−1) of Eq. (32.1), whose integral is the acquisition function, aei(x), at those values of x. On a grid of example locations, the plot shows as a dark, thick line the resulting acquisition function aei(x). Its maximiser, marked with a square, gives the best possible location for the next evaluation, xn. Note that this plot considers noisy evaluations, tackled more fully in §32.3.

we can simply rewrite the loss as

$$\Lambda_{\text{ei}}(\mathcal{D}_{n+1}) = \min\{\eta, f(x_n)\}.$$

The expected loss is hence

$$\begin{aligned}
\mathbb{E}\big(\Lambda_{\text{ei}}(\mathcal{D}_{n+1})\big) &= \int \min\{\eta, f(x_n)\}\, p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n)\\
&= \eta + \int \min\{0, f(x_n) - \eta\}\, p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n)\\
&= \eta + \int_\eta^\infty 0 \times p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n) + \int_{-\infty}^\eta \big(f(x_n) - \eta\big)\, p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n)\\
&= \eta + \int_{-\infty}^\eta \big(f(x_n) - \eta\big)\, p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n).
\end{aligned}$$

Just as promised by its name, this expected loss does indeed consider the expected improvement over the current best point, η. Note that, owing to this text's preference for minimisation over maximisation, "improvement" here is considered as being better when more negative. Given that η is "in the bag", the overall outcome cannot be worse (higher) than η. The expected loss will be determined by the probability mass that p(f(xn) | Dn) assigns to the fortunate outcomes in which f(xn) improves upon η, and the magnitude of those improvements.

To fulfil its purpose as an acquisition function, we usually rewrite this expected loss as a function of the next evaluation location, so that

$$\mathbb{E}(\Lambda_{\text{ei}}) = \mathbb{E}(\Lambda_{\text{ei}})(x_n).$$

For convenience, our notation no longer reflects the real dependence of Λei on Dn+1. To be explicit, we will pick the next evaluation location as the minimiser of E(Λei)(xn), which is identical to the minimiser of aei(xn) := E(Λei)(xn) − η,

$$a_{\text{ei}}(x_n) = \int_{-\infty}^{\eta} \big(f(x_n) - \eta\big)\, p(f(x_n) \mid \mathcal{D}_n)\, \mathrm{d}f(x_n). \tag{32.1}$$

Now, given $p(f(x_n) \mid \mathcal{D}_n) = \mathcal{N}\big(f(x_n); m(x_n), V(x_n)\big)$, and letting $\Phi(x; a, b^2)$ be the cumulative distribution function of the Gaussian distribution $\mathcal{N}(x; a, b^2)$,

$$a_{\text{ei}}(x_n) = -V(x_n)\, \mathcal{N}\big(\eta; m(x_n), V(x_n)\big) + \big(m(x_n) - \eta\big)\, \Phi\big(\eta; m(x_n), V(x_n)\big). \tag{32.2}$$

Exercise 32.1 (easy, solution on p. 366). Given Eq. (32.1), derive Eq. (32.2).

An example of this acquisition function is provided in Figure 32.1. Note that Eq. (32.2) will be low (indicating a desirable location for the next evaluation) where m(xn) is very low, and/or where V(xn) is very large. The former desideratum encapsulates the drive towards exploitation, and the latter that towards exploration. As such, ei gives one means of balancing these two competing goals. That said, empirically, ei has been observed to be under-exploratory,3 weighting its evaluations too heavily towards exploitation. This can be understood as a consequence of the myopic approximation underpinning ei. A model that believes it has only a single evaluation remaining is less likely to feel it can afford the luxury of exploration than one that has many evaluations at its disposal: the former model must focus on achieving immediate value through exploitation.

3 Calandra et al. (2014b)

The ei acquisition function is relatively cheap to evaluate, but also multi-modal, and admits evaluations of gradient and Hessian (omitted here). The ei approach hence provides a clear example of why it can be productive to convert the optimisation of an objective into the optimisation of an acquisition function.
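For reference, Eq. (32.2) is straightforward to implement; a minimal sketch using scipy's Gaussian cdf and pdf, written so that lower values indicate more promising evaluation locations:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, V, eta):
    """Closed-form ei of Eq. (32.2) at a location with gp posterior mean m
    and variance V, given the current lowest evaluation eta."""
    s = np.sqrt(V)
    z = (eta - m) / s
    # equals -V * N(eta; m, V) + (m - eta) * Phi(eta; m, V)
    return (m - eta) * norm.cdf(z) - s * norm.pdf(z)
```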

► 32.2 Knowledge Gradient

Knowledge gradient (kg)4 is an acquisition function that allows the relaxation of one of the assumptions of ei. In particular, while kg is myopic, it does not restrict the returned point xN to be amongst the set of evaluated locations. Instead, after the nth step of the optimisation procedure (that is, after the nth evaluation), xN is chosen as the minimiser of the posterior mean m_{n+1} for the objective f(x), which is conditioned on D_{n+1} := {(x_i, f(x_i)) | i = 0, ..., n}, the set of evaluation pairs after the (n+1)th step. That is, kg considers the posterior mean after the upcoming next step.

4 Frazier, Powell, and Dayanik (2009)
This modification offers much potential value. Valuing im­
provement in the posterior mean rather than in the evaluations
directly eliminates the need to expend a sample simply to re­
turn a low function value that may already be well-resolved
by the model. For instance, if the objective is known to be a
quadratic, the minimum will be known exactly after (any) three
evaluations, even if it has not been explicitly evaluated. In this
setting, evaluating at the minimum, as would be required by ei,
is unnecessary.
kg does introduce some risk relative to ei, however. Note
that the final value f (xN) may not be particularly well-resolved
after the (n + 1)th step: the posterior variance for f (xN) may
be high. kg, in ignoring this uncertainty, may choose xN such
that the final value f(xN) is unreliable. That is, the final value returned, f(xN), may be very different from what the optimiser expects, m_{n+1}(xN).
The kg loss is hence the final value revealed at the minimiser of the posterior mean after the next evaluation. Let's define that minimiser as

$$x_{n+1} := \arg\min_x m_{n+1}(x),$$

where the posterior mean function,

$$m_{n+1}(x) := \mathbb{E}\big(f(x) \mid \mathcal{D}_{n+1}\big),$$

takes the convenient form of Eq. (4.6) for a gp. The kg loss can now be written as

$$\Lambda_{\text{kg}}(\mathcal{D}_{n+1}) := f(x_{n+1}). \tag{32.3}$$

The expected loss, the acquisition function, is hence5

$$\begin{aligned}
a_{\text{kg}}(x_n) :&= \mathbb{E}\big(\Lambda_{\text{kg}}(\mathcal{D}_{n+1})\big)\\
&= \int f(x_{n+1})\, p\big(f(x_{n+1}) \mid f(x_n), \mathcal{D}_n\big)\, p\big(f(x_n) \mid \mathcal{D}_n\big)\, \mathrm{d}f(x_{n+1})\, \mathrm{d}f(x_n)\\
&= \int \min_{x'} m_{n+1}(x')\, p\big(f(x_n) \mid \mathcal{D}_n\big)\, \mathrm{d}f(x_n).
\end{aligned}$$

Unfortunately, the kg acquisition function is not closed-form due to the required minimisation within the integral, but approximations6 exist and have demonstrated useful empirical performance. The effect of kg is to value improvements in the posterior mean, rather than simply improvements in the evaluations.

5 Note that the kg acquisition function presented here differs from that in Frazier, Powell, and Dayanik (2009) in omitting an additive constant.
6 Frazier, Powell, and Dayanik (2009)
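Because the minimisation sits inside the integral, a_kg is commonly approximated by sampling the next evaluation from the gp predictive, conditioning the posterior mean on each sample, and minimising over a finite candidate set. A minimal Monte Carlo sketch, assuming a noiseless evaluation and illustrative names:

```python
import numpy as np

def kg_acquisition(x_n, mean, cov, X_cand, n_samples=64, seed=0):
    """Monte Carlo estimate of a_kg(x_n) = E[ min_x m_{n+1}(x) ].

    mean(x), cov(a, b): current gp posterior mean and covariance.
    X_cand: finite candidate set over which the inner minimum is taken.
    """
    rng = np.random.default_rng(seed)
    m_n, v_n = mean(x_n), cov(x_n, x_n)
    m_cand = np.array([mean(x) for x in X_cand])
    gain = np.array([cov(x, x_n) / v_n for x in X_cand])  # conditioning weights
    vals = []
    for _ in range(n_samples):
        f_n = m_n + np.sqrt(v_n) * rng.standard_normal()  # sample f(x_n) | D_n
        vals.append((m_cand + gain * (f_n - m_n)).min())  # min of rank-1-updated mean
    return float(np.mean(vals))
```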

The kg loss (32.3) differs in an important respect from the vl


of Eq. (31.1). kg rewards improvements in the posterior mean,
rather than improvements in the evaluations. The vl is influ­
enced only by the final objective value, yN, where kg considers
the posterior mean at all locations. That is, kg values not just the
next evaluation, but its impact on our beliefs about evaluations
at all other locations. In this, the vl, and the resulting ei, can
be seen as local methods, where the kg is global.

► 32.3 Noisy Expected Improvement

It is routine in regression settings, particularly in Gaussian process regression, to manage observations y of a function f that have been corrupted by noise. Noisy observations are less common in numerics: usually, the function is evaluated using deterministic software. Nonetheless, given that Bayesian optimisation (and much of Probabilistic Numerics) is built upon a regression model, it might be expected that managing noise in optimisation might be achieved with similar confidence as that with which one manages noise in performing regression. Unfortunately, this expectation would be ill-founded. To understand why, imagine that the objective is observed only up to some i.i.d. Gaussian noise. What then should be returned as the minimum f(xN)? In fact, noisy optimisation makes for an excellent case study of the importance of considering the terminating stages of optimisation.

Let us begin with ei. A simple and common approach to ei for noisy objectives would be to take the returned value as the lowest of the (noisy) evaluations. Explicitly, this approach commits to returning an evaluation that is known to have been corrupted by noise. In fact, this approach to ei for noisy evaluations introduces a version of the winner's curse:7 the lowest function evaluation is probably more noise-corrupted than other evaluations. ei can hence be a problematic approach for objectives that are substantively corrupted by noise. For a deeper discussion of the problems of noisy ei, including proposals for overcoming these problems, see §6 and §8 of Garnett (2022).

7 Thaler (1988)

Exercise 32.2 (moderate, solution on p. 366). Take a gp prior for the objective f(x), along with an i.i.d. Gaussian noise model, y_i = f(x_i) + ε_i, ε_i ~ N(0; σ²). Demonstrate the effect of the winner's curse by computing the posterior for ε_1 − ε_2 given two observations y_1 and y_2. It may be assumed that f(x_1) and f(x_2) have negligible prior covariance (e.g. that x_1 and x_2 are far apart), and that their prior variances are equal. Hint: p(ε) ≠ p(ε | y).

Now let us return to kg, a more natural choice for noisy evaluations. Recall that kg rewards improvements in the posterior mean, rather than improvements in the evaluations. The surrogate model, usually a gp, will smooth over noise in the evaluations: the use of its posterior mean should ameliorate the impact of large negative noise contributions in assessing the current estimate of the minimum. As such, kg is less sensitive to the winner's curse described for ei.

Figure 32.2: Imagine that we can afford two evaluations of a function on the domain indicated by the line segment. The best first evaluation location (dot), according to any ignorant prior and myopic acquisition function, will be the midpoint of the domain (the left plot). However, this choice means that the second and final evaluation, which will fall in one or the other half of the domain, will leave the other half entirely unexplored (the centre plot). If a non-myopic strategy were used, the two evaluations could be more sensibly (and, in this case, more uniformly) distributed across the domain (the right plot).

Another simple approach to noisy Bayesian optimisation was introduced by Osborne, Garnett, and Roberts (2009), following kg in returning the optimum of the posterior mean. The authors add one extra constraint to the kg approach: the variance in the returned value f(xN) is restricted to being no greater than a specified threshold. This additional constraint means that the approach avoids the problem of returning a putative minimum which is very uncertain. This gives another example of the importance of thinking clearly about the goals of optimisation: in this case, how important is the trustworthiness of the returned minimum?

In §33.3, we will discuss how information-theoretic approaches provide a natural solution to managing noisy function evaluations.

► 32.4 Overcoming Myopia - Multi-Step Look-Ahead

In §31.2, we introduced myopic approximations to the expected


loss. Our motivation was the severe difficulty of computing the
expected loss of an evaluation when considering the potentially
many evaluations that will follow. Unfortunately, myopic ap­
proximations can introduce real performance impairments. The
most important is that myopia results in a preference for ex­
ploitation over exploration. An illustration of this phenomenon
is provided in Figure 32.2. A non-myopic optimiser, that knows
it has the luxury of many evaluations remaining, is more likely
to indulge in risky exploration over exploitation. Note also that,
ideally, an optimiser should slowly shift its behaviour from ex­
plorative to exploitative over the course of optimisation. That is,
early on, with a large budget of evaluations in hand, exploration
is more attractive than later on. However, a myopic optimisation
strategy cannot be influenced by the true number of evaluations
remaining: it is static, where we would prefer dynamism.
What would it take to abandon the myopia of ei and kg and move towards a better approximation of the value loss? The challenge to be overcome is captured by Figure 31.2. Our goal is to compute the expected loss, E(Λvl), of evaluating next at x0. If N evaluations remain, we must marginalise N+1 values, y0, ..., yN (recall that we are assuming that the final returned value is "free", additional to our budget), and N locations, x1, ..., xN. The latter random variables emerge from a decision process: xi will be the optimiser of the ith acquisition function – we assume that all future decisions will be made optimally. That is,

$$p(x_i \mid \mathcal{D}_i) = \delta\Big(x_i - \arg\min_x \mathbb{E}\big(\Lambda_{\text{vl}}(x) \mid \mathcal{D}_i\big)\Big). \tag{32.4}$$

This means that the thorny problem to be solved is an interleaved sequence of numerically integrating over the y_i variables and numerically optimising over the x_i variables. This is a sequential decision-making problem, related to the Bellman equation and solvable, in principle, by dynamic programming.8 Our problem shares with others of this class a cost that is exponential in the horizon, N − n.

8 Jiang et al. (2019)
The difficulty of this problem has meant that progress has been largely limited to various specialisations or relations of the generic Bayesian optimisation problem. As examples, non-myopic results have been presented for independent, discrete-valued, evaluations,9 finding a level-set of a one-dimensional and Markov objective,10 and active search.11 Those approaches12 that introduce approximations to tackle the full multi-step problem have managed to consider no more than around 20 future steps. To give a flavour of how such approaches proceed, González, Osborne, and Lawrence (2016) and Jiang et al. (2020) propose schemes in which the strong knowledge of the sequential selection of observations, as in Eq. (32.4), is set aside in favour of a model which assumes that all locations are chosen at once, as in a batch (batch Bayesian optimisation will be described in §34.1). This approximate model is depicted in Figure 32.3. This coupling of locations and removal of nesting provides a substantially simpler numerics problem, one solvable using batch Bayesian optimisation techniques for the optimisation of locations. González, Osborne, and Lawrence (2016) additionally use expectation propagation13 for the marginalisation of their values.

9 Gittins (1979)
10 Cashore, Kumarga, and Frazier (2015)
11 Jiang et al. (2017)
12 Streltsov and Vakili (1999); Osborne, Garnett, and Roberts (2009); Marchant, Ramos, and Sanner (2014); González, Osborne, and Lawrence (2016); Jiang et al. (2020).
13 Cunningham, Hennig, and Lacoste-Julien (2011)

Figure 32.3: Approximate graphical model (Bayesian network) for the Bayesian optimisation decision problem. Given the current data set, Dn, we must decide upon the decision variable (diamond node) xn: unlike in the true problem (Figure 31.2), however, the sequential nature of the problem is ignored. All variables along the dark line are dependent.
__________ 33
Other Acquisition Functions

Acquisition functions (beyond those derived from the vl) re­


main an active area of research within Bayesian optimisation,
and the field has produced a diverse range of proposals. In this
section, we will review a few of the most prominent, and dis­
cuss their probabilistic numerical interpretation. All acquisition
functions proposed in this section are myopic.

► 33.1 Probability of Improvement

One of the earliest acquisition functions to be proposed1 is pi (also known as maximum probability of improvement (mpi)). It is myopic, and, as with ei, defines the lowest function value available at the nth step as

ηn := min_{i ∈ {0, ..., n−1}} f(xi).

1 Kushner (1964)

We will also introduce new hyperparameters, εn > 0. Now let us define a loss function specific to the nth step:

Λn,pi(Dn+1) := I(f(xn) > ηn − εn).    (33.1)

Here I is the indicator function, so that the loss is 0 when f(xn) < ηn − εn and 1 otherwise (expressing that the former is the preferred outcome). With that, the pi expected loss, and hence acquisition function, has the simple form

αn,pi(xn) := E(Λn,pi(Dn+1)) = P(f(xn) > ηn − εn | Dn).

The hyperparameter εn controls the degree of exploration (at the nth step), with smaller εn resulting in more exploitative behaviour. Kushner (1964) suggested setting εn to a large value for small n (so as to be more exploratory early in optimisation) and to a small value for large n (so as to be more exploitative late in optimisation). Jones (2001) recommended using several values of εn, and identifying clusters of the resulting x values. The hyperparameter is commonly taken as εn = 0, which results in aggressively exploitative behaviour.2 The pi acquisition function for εn = 0 is depicted in Figure 33.1.

2 Jones (2001)

Figure 33.1: The probability of improvement (pi) acquisition function at x is the probability that f(x) improves upon the current lowest evaluation, η (marked by a circle). The light grey curves at integer values of x describe p(f(x) | Dn) for f(x) < η. On a grid of example locations, the plot shows as a dark, thick line the resulting negative acquisition function −αpi(x) (the integral of the grey curves), affinely rescaled for visualisation. Its maximiser, marked with a square, gives the best possible location for the next evaluation, xn. The plot is analogous to that for ei in Figure 32.1. Note the subtle differences between the two acquisition functions: in particular, pi is the more exploitative of the two.
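Computationally, pi is among the cheapest acquisition functions: it is a single Gaussian tail probability under the gp surrogate. A minimal sketch (the function name and example numbers are ours; any gp posterior supplies the mean and standard deviation):

import numpy as np
from scipy.stats import norm

def pi_acquisition(mean, std, eta, eps=0.0):
    """Probability of improvement: P(f(x) < eta - eps | D_n).
    mean, std: gp posterior at candidate points; eta: lowest
    observed value so far; eps: exploration hyperparameter."""
    return norm.cdf((eta - eps - mean) / std)

# usage with any gp posterior (made-up numbers for illustration)
mean = np.array([0.1, -0.3, 0.4])
std = np.array([0.5, 0.2, 0.9])
x_next_index = np.argmax(pi_acquisition(mean, std, eta=-0.1, eps=0.05))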
In fact, with εn = 0, pi is sometimes used to (retrospectively) score the exploitativeness of evaluations that have been made by any acquisition function. Its utility here is promoted by its fixed range, αn,pi(xn) ∈ [0, 1] ⊂ R, with 1 deemed completely exploitative and 0 completely explorative, easing (visual) comparisons across different iterations of optimisation. Doing so is helpful in interrogating a completed Bayesian optimisation run, whatever the acquisition function. Quick inspection might reveal, for instance, that exploitation was never performed, or that the objective was inadequately explored.
Notably, this acquisition function does not distinguish be­
tween improvements of different magnitudes: any improvement
(above the threshold), however small, is equally valued. This
caps the potential rewards from gambling on exploration.
We could view the nth step loss function (33.1) as emerging
from an approximation to a single loss function applicable
across all steps,

Λpi(Dn) = I(f(xn) > ηn − εn),   n = 1, . . . , N.    (33.2)

From the decision-theoretic perspective, this reveals another deficiency of pi: Eq. (33.2) stipulates an odd goal for optimisation: incremental improvement at each step. Why should an optimiser weight each step in the optimisation process equally? The loss functions described in §31.2 instead put their emphasis on uncovering a single, exceptional, function value. A probabilistic-numerics view would argue that these goals are more coherent, and hence more suitable for optimisation.

Figure 33.2: The upper confidence bound (ucb) acquisition function at x is a linear combination of the mean and standard deviation (sd) of the gp posterior. For the purposes of this plot, β = 1.5. On a grid of example locations, the plot shows as a dark, thick line the resulting acquisition function αucb(x), affinely rescaled for visualisation. Its maximiser, marked with a square, gives the best possible location for the next evaluation, xn. The plot is analogous to that for ei in Figure 32.1 and pi in Figure 33.1. Note the subtle differences between the three acquisition functions: in particular, ucb (for β = 1.5) is the most explorative of the three.

► 33.2 Upper Confidence Bound

A popular acquisition function finds its roots in the multi-armed bandit literature:3 the ucb. Again, this acquisition function is myopic in considering no further ahead than the next function value. Rather than marginalising over that function value, yn, this criterion adopts an optimistic approach: it assumes that yn will take a value better than its expectation, one attained with some fixed probability. Srinivas et al. (2010) framed the ucb, given a gp surrogate posterior (with mean m(xn) and variance V(xn)) at the proposed xn, as

3 Lai and Robbins (1985)

αucb(xn) := m(xn) − βn V(xn)^{1/2}.    (33.3)

Note that, owing to this text’s preference for minimisation over


maximisation, Eq. (33.3) describes a lower confidence bound
rather than an upper confidence bound - we use the term ucb
out of tradition. The first term in Eq. (33.3), the posterior mean, m(xn), rewards exploitation, by encouraging evaluation near to existing low evaluations. The second term, proportional to the posterior standard deviation, V(xn)^{1/2}, promotes exploration.

Figure 33.3: The information-theoretic rationale for Bayesian optimisation: when considering a proposed next evaluation location xn, one should marginalise over potential observations yn to be collected at this point. The plot shows two equally-probable potential scenarios for two different potential values of yn. Each of the two potential observations would lead to a change in the gp posterior (represented using dashed lines), which in turn would give rise to a global change of the distribution over the minimiser (each corresponding p(x*) shown using matching dashed lines at the bottom of the plot, at arbitrary scale).

The parameter βn ∈ R+ explicitly specifies the exploration-exploitation trade-off. For an appropriately large choice of βn, ucb can be made more explorative than ei; as you will recall from Eq. (32.2), the myopia of ei can lead to insufficient exploration. As such, the greater explorative nature of ucb has been observed to yield superior performance to that of ei and pi.4 Figure 33.2 illustrates this ucb acquisition function.

4 Calandra et al. (2014b)
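Eq. (33.3) is similarly cheap to evaluate. A minimal sketch, assuming a fixed β rather than a tuned schedule (the function name is ours):

import numpy as np

def ucb_acquisition(mean, std, beta=1.5):
    """The criterion of Eq. (33.3): m(x) - beta * sd(x). Under this
    text's minimisation convention it is a lower confidence bound,
    and the next evaluation goes where the bound is lowest."""
    return mean - beta * std

# usage with any gp posterior (made-up numbers for illustration)
mean = np.array([0.1, -0.3, 0.4])
std = np.array([0.5, 0.2, 0.9])
x_next_index = np.argmin(ucb_acquisition(mean, std))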
The severely optimistic assumption underlying ucb is motivated through the resulting simplicity of the acquisition function. This lends itself to theoretical treatment, yielding, for instance, regret bounds.5 This theory also provides schedules for the adaptation of βn as a function of n. Nonetheless, it is difficult to reconcile ucb with a defensible loss function (see Exercise 33.1). As such, this class of approaches has no known (sensible) probabilistic numerical interpretation.6

5 Srinivas et al. (2010); Freitas, Smola, and Zoghi (2012).
6 That said, an interesting re-interpretation of the ucb criterion is provided by Jones (2001).

Exercise 33.1. (hard, solution on p. 367) Derive a loss function for which the ucb acquisition function (33.3) is the myopic expected loss, given that the posterior for yn is that from a gp: N(yn; m(xn), V(xn)).

► 33.3 Information-Theoretic Approaches

Let us now return to the two alternatives to the value loss (vl) developed in detail above: the location-information loss (lil) and

value-information loss (vil), which we collectively describe as


information-theoretic. The lil and vil select observations that
best yield information about the minimiser and minimum, re­
spectively. Note, first, that the meanings of exploration and ex­
ploitation are not completely clear for an information-theoretic
approach. In an information-theoretic approach, the worth of an
evaluation is not contained within its value, as would be true for

an exploitative evaluation with the vl, but instead is contained


in the information it yields. As such, all evaluations selected by
an information-theoretic method are in some sense exploratory:
the sole goal of these methods is to gather information.
For reasons similar to those discussed in §31.2, implementing
information-theoretic loss functions exactly is computationally
infeasible. To improve tractability, existing information-theoretic
approaches adopt (like the other approaches above) myopia.
Nonetheless, information-theoretic approaches do possess some
advantages over their equally-myopic alternatives. Recall that,
as mentioned in §32.2, ei is local: it values improvements only
at the location of the next evaluation. pi and ucb are equally
local. On the other hand, even if treated myopically, the lil and
vil are truly global: they value information about the entire
domain. This can be of value in improving exploration, which,
in turn, can improve performance. As discussed above, the
myopia underpinning alternative acquisition functions leads to
under-exploratory behaviour.
Several myopic acquisition functions based on the lil have
been proposed, differing only in implementation details. An
illustration of such approaches is given by Figure 33.3. Recall
that the lil demands minimisation of the entropy of x*. More
precisely, under the myopic approximation, which considers
only the impact of the next, (n + 1)th, step, we must consider
the loss function

Λlil(Dn+1) = H(x* | Dn+1).

This loss, of course, depends on the yn, unobserved at the


current (nth) step.
Typically, x* is continuous, an element of Rd. But determin­
ing the distribution for this minimiser, p(x* ), on a continuous
domain is intractable. For one thing, note from Figure 33.3 that
p(x* ) may give a finite probability (mass) to the minimiser x*
being exactly at a boundary. As a result of these difficulties,
all implementations of the lil to date discretise x*: this then
requires only the maintenance of a (discrete) probability dis­
tribution. The approximation of x* as discrete also enables its
entropy to be more easily computed,

H(x* | Dn+1) = − Σi P(x*,i | Dn+1) log P(x*,i | Dn+1).


If x* is (correctly) treated as continuous, we might think to
use a differential entropy, which suffers from two significant
drawbacks: it is not invariant to changes of variables; and it

is difficult to compute. These pathologies can be addressed


through approaches described in §8.8 of Garnett (2022).
The first acquisition functions built on the lil are named informational approach to global optimisation (iago)7 and entropy search (es).8 They differ in implementation details: while these differences are of practical significance, they will not concern us here. Both consider an acquisition function that is the myopic expected loss

αiago(xn) = αes(xn)
    = E(Λlil(Dn+1))
    = ∫ H(x* | Dn+1) p(yn | xn, Dn) dyn
    =: Eyn(H(x* | yn, xn, Dn)).

Eyn H(x* | yn, xn, Dn) is a conditional entropy, the expected entropy in x* after an observation yn whose value is currently unknown.

7 Villemonteix, Vazquez, and Walter (2009)
8 Hennig and Schuler (2012)
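To make the structure of this expected loss concrete, the following is a deliberately crude Monte Carlo sketch - not the implementation of either es or iago, which use more careful approximations. Here x* is discretised to a grid, H(x* | Dn+1) is estimated from posterior gp samples on that grid, and the expectation over yn is taken with a few fantasised observations; the rbf kernel, sample counts, and function names are illustrative assumptions.

import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(X, y, grid, jitter=1e-8):
    """Joint gp posterior (mean, covariance) on the grid."""
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X, grid)
    A = np.linalg.solve(K, Ks)
    return A.T @ y, rbf(grid, grid) - Ks.T @ A

def minimiser_entropy(mean, cov, n_samples=2000, rng=None):
    """H(x*) for x* discretised to the grid: draw gp samples,
    histogram their argmins, compute the discrete entropy."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(mean)))
    samples = mean[:, None] + L @ rng.standard_normal((len(mean), n_samples))
    p = np.bincount(np.argmin(samples, axis=0), minlength=len(mean)) / n_samples
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def es_acquisition(x_cand, X, y, grid, n_fantasies=8, rng=None):
    """Myopic expected loss E_{y_n}[H(x* | y_n, x_cand, D_n)]."""
    rng = rng or np.random.default_rng(1)
    m, c = gp_posterior(X, y, np.array([x_cand]))
    s = np.sqrt(max(c[0, 0], 1e-12))
    h = 0.0
    for _ in range(n_fantasies):
        y_f = m[0] + s * rng.standard_normal()   # fantasised observation
        mean, cov = gp_posterior(np.append(X, x_cand), np.append(y, y_f), grid)
        h += minimiser_entropy(mean, cov, rng=rng)
    return h / n_fantasies   # lower is better: less remaining entropy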
Predictive entropy search (pes)9 is an alternative acquisition function derived from the lil. It first notes that

arg min_{xn} Eyn H(x* | yn, xn, Dn)
    = arg max_{xn} [H(x* | Dn) − Eyn H(x* | yn, xn, Dn)],

as the prior entropy of the minimiser is independent of the next measurement. pes then makes use of the identity

I(x*; yn) = H(x* | Dn) − Eyn H(x* | yn, xn, Dn)
          = H(yn | xn, Dn) − Ex* H(yn | x*, xn, Dn),    (33.4)

where I(·; ·) is the mutual information between two random variables, and Ex*(H(yn | x*, xn, Dn)) is the conditional entropy of yn given the random variable x*. Eq. (33.4) yields the acquisition function

αpes(xn) = −H(yn | xn, Dn) + Ex*(H(yn | x*, xn, Dn)).

9 Hernandez-Lobato et al. (2015)

This acquisition function will select identical evaluations to those of es and iago. Nonetheless, the rearrangement is well-motivated, following the arguments of the Bayesian active learning by disagreement (bald) algorithm,10 which itself relies on old insights about the mutual information. First, H(yn | xn, Dn) is straightforward to calculate: it is the entropy of a univariate Gaussian. The second term requires the computation of another univariate Gaussian's entropy: H(yn | x*, xn, Dn). This is complicated by having to condition on x* being a minimiser, achieved readily in practice through heuristics like ensuring that the objective at x* has zero gradient and positive curvature. The term must also be marginalised over the posterior over the minimiser, P(x* | Dn), a task whose central difficulty is constructing that posterior. In comparison, es/iago require P(x* | yn, xn, Dn) (so as to compute its entropy). The principal difference between es/iago and pes is that P(x* | yn, xn, Dn) must be constructed afresh for each proposed sampling location, xn, whereas P(x* | Dn) need only be constructed once per step n.

10 Houlsby et al. (2011)
The vil was first proposed by Hoffman and Ghahramani (2015), giving an acquisition function known as output-space entropy search (opes). Follow-on work11 produced an acquisition function known as max-value entropy search (mes) that provided some improvements in implementation. These acquisition functions both modify pes only in replacing the minimiser, x*, with the minimum, y*:

αopes(xn) = αmes(xn)
    = −H(yn | xn, Dn) + Ey*(H(yn | y*, xn, Dn)).

11 Wang and Jegelka (2017)

As discussed in §31.2, the vil may be preferred to the lil for


some applications. Moreover, opes and mes have advantages
in implementation over pes. For instance, the posterior for the
minimum, p(y* | Dn), is univariate, whereas the posterior for
the minimiser, P(x* | Dn), has dimension equal to that of the
search domain.
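One consequence of the minimum's univariate posterior is that it is easy to sample: draws of y* are simply the minima of posterior function samples. The following sketch (an illustrative assumption of ours, not the published mes approximation, which uses further refinements such as a Gumbel approximation) builds such empirical draws, reusing a joint gp posterior on a grid as in the gp_posterior sketch above; the draws could then be plugged into the Ey* term:

import numpy as np

def sample_min_values(mean, cov, n_samples=1000, rng=None):
    """Draw posterior gp samples on a grid; each sample's minimum is
    a draw from the (univariate) posterior over the minimum y*."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(mean)))
    f = mean[:, None] + L @ rng.standard_normal((len(mean), n_samples))
    return f.min(axis=0)   # empirical draws of y* | D_n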
Returning to the discussion in §32.3, all the information-
theoretic acquisition functions described above are relatively
robust to noise in the objective function. Information-theoretic
acquisition functions reward the information yielded by an
observation, rather than measuring characteristics of the obser­
vation itself, as do ei, ucb, and pi. In particular, all information-
theoretic acquisition functions are influenced by prospective
observations only through their impact on entropy terms. So
long as posteriors given the noisy observations can be obtained,
such entropy terms will naturally accommodate the noise.
Figure 33.4 depicts a comparison of the information-theoretic
acquisition functions from this section against those of previous
sections.

Figure 33.4: A direct comparison of


many of the acquisition functions in this
chapter. Since each has its own units of
measure and given that only the location
of the extremum matters (indicated by
a dot), each acquisition function is plot­
ted on a different (arbitrary) scale. Recall
that iago, es and pes are differing imple­
mentations of the same underlying loss
function, lil, with opes and mes like­
wise representing different implementa­
tions of vil. We caution that the exact
locations of the acquisition function op­
tima, indicated by labelled vertical bars,
are not particularly general, and can be
changed significantly through seemingly
innocuous variations in the data and the
gp model.

► 33.4 Portfolios of Acquisition Functions

The acquisition functions described in this chapter each possess various limitations. One answer to such limitations is to propose the use of a portfolio of multiple acquisition functions12 - the hope is that an average over different acquisition functions will be less vulnerable to the failure modes of any individual one of those averaged acquisition functions. This approach requires, first, finding the best candidate evaluation location according to each acquisition function in the portfolio, forming a set of candidates. The actual location chosen is then the element of that set that maximises an independent meta-criterion.

12 Hoffman, Brochu, and Freitas (2011); Shahriari et al. (2016).
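Schematically, a portfolio step looks as follows (a minimal sketch of ours; portfolio is a list of cheap acquisition functions, such as the pi and ucb sketches above, and meta_criterion an expensive criterion, such as an lil approximation):

def portfolio_select(portfolio, meta_criterion, candidates_grid):
    """Portfolio Bayesian optimisation step: each cheap acquisition
    function nominates its best candidate location; an expensive
    meta-criterion arbitrates among the nominees."""
    nominees = [max(candidates_grid, key=a) for a in portfolio]
    return max(nominees, key=meta_criterion)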
Of course, any decision-theoretic problem requires a single
loss function to be chosen. Considering multiple acquisition
functions (which correspond to distinct expected loss functions)
is inconsistent with this view. However, the portfolio Bayesian
optimisation approaches do, in fact, operate according to a
single loss function: that inherent in the meta-criterion. The
acquisition functions within the portfolio are typically computa­
tionally cheap, such as ei, pi and ucb. The meta-criterion, on
the other hand, is expensive but powerful: for instance, Shahri-
ari et al. (2014) choose the lil loss function of §33.3. Portfolio
approaches, then, are useful in providing a cheap heuristic for
the optimisation of an expensive meta-criterion.
34
Further Topics

► 34.1 Batch Evaluation

It is not uncommon in optimisation to be permitted many si­


multaneous evaluations of the objective: this is known as batch
optimisation. For instance:

• several time-consuming drug trials might be run in parallel, with the goal of determining the most effective drug molecule;

• in optimising machine learning model architectures, many such architectures might be simultaneously evaluated to exploit parallel computing resources; and

• in searching for optimal policy parameters, one (or many) agent-based simulations of an economic system may be able to be run simultaneously with a real-world trial.

Batch optimisation requires proposing a set of evaluation locations, xB, before knowing the values f(xB) at (any) of the locations. Here the Probabilistic Numerics framing of Bayesian optimisation offers an explicit joint probability distribution over the values f(xB), acknowledging the probabilistic relationships amongst the batch. These relationships are crucial to selecting a good batch, where desiderata include the exclusion of redundant measurements (that are likely to return the same information) and the balancing of exploration against exploitation.

The technical challenges of batch Bayesian optimisation are bespoke to the different priors and acquisition functions. Batch approaches exist for ei1 (sometimes called multi-point ei), ucb2, kg3, pes4, and for a flexible family of acquisition functions5 (particularly, ei).

1 Ginsbourger, Le Riche, and Carraro (2008, 2010); Chevalier and Ginsbourger (2013); Marmin, Chevalier, and Ginsbourger (2015); Marmin, Chevalier, and Ginsbourger (2016); Wang et al. (2016); Rontsis, Osborne, and Goulart (2020).
2 Desautels, Krause, and Burdick (2012); Daxberger and Low (2017).
3 Wu and Frazier (2016); Wu et al. (2017).
4 Shah and Ghahramani (2015)
5 Azimi, Fern, and Fern (2010); Azimi, Jalali, and Fern (2012); Gonzalez et al. (2016).
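The explicit joint distribution makes Monte Carlo treatments of batch acquisition straightforward. As an illustrative sketch (ours, not a specific published method), multi-point ei can be estimated by sampling the batch's values jointly from the gp posterior; mean and cov are the joint posterior at the batch locations, and eta is the incumbent best value:

import numpy as np

def batch_ei(mean, cov, eta, n_samples=4000, rng=None):
    """Monte Carlo multi-point EI: expected improvement of the best
    point in the batch, E[max(eta - min_b f(x_b), 0)], under the
    joint gp posterior N(mean, cov) over the batch's values."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(mean)))
    f = mean[:, None] + L @ rng.standard_normal((len(mean), n_samples))
    return np.maximum(eta - f.min(axis=0), 0.0).mean()

# a batch is then chosen by maximising batch_ei over candidate batches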

► 34.2 Reinforcement Learning

Bayesian optimisation has connections to, but distinctions from, reinforcement learning.6 Both reinforcement learning and Bayesian optimisation address a (partially-observed) Markov decision process. As a first point of distinction, however, reinforcement learning and Bayesian optimisation tackle slightly different problems. Reinforcement learning cares about the returned evaluation at every iteration of the procedure (typically by considering some discounted sum of evaluations as the objective), whereas optimisation should properly care only about the final returned value. In Bayesian optimisation, it is comparatively more acceptable to make an exploratory evaluation that has low expected value. In reinforcement learning, it is also much more common for the evaluation to change the state of the function: the agent actually modifies the objective function in its choice of evaluations. This is usually not the case for optimisation (for which there is no state).

6 Sutton and Barto (1998)
Second, the methods used in reinforcement learning and
Bayesian optimisation are distinct. Reinforcement learning usu­
ally attempts to learn a policy to govern the agent’s behaviour,
whereas Bayesian optimisation is more explicitly concerned
with a decision-theoretic approach.
Third, there’s a cultural difference. Reinforcement learning is
often used as the primary, outer system; Bayesian optimisation is
usually used internally within larger machine learning systems
(for instance, to tune their hyperparameters: see §34.3). Interestingly, Bayesian optimisation has been used within reinforcement learning: Paul et al. (2016) use a scheme that alternates between Bayesian quadrature to marginalise environmental variables and Bayesian optimisation for policy search.

► 34.3 Application: Automated Machine Learning

One of the most prominent current uses of Bayesian optimisation is to tune the configuration (particularly hyperparameters) of (other) machine learning algorithms.7 This broad field is known as Automated Machine Learning, or AutoML. Bayesian optimisation has many arguments in its favour for this application. The objective function is typically computationally expensive: it might be the validation loss or negative log-likelihood of a model that takes hours to train. Indeed, the objective itself might include some acknowledgement of that computational cost, e.g. the error achieved per second.8

7 Bergstra et al. (2011); Hutter, Hoos, and Leyton-Brown (2011); Snoek, Larochelle, and Adams (2012).
8 Snoek, Larochelle, and Adams (2012)

The arguments of the
objective, such as hyperparameters, might include regularisa-
tion penalties (parameters of the prior), architecture choices,
and the parameters of internal numerics procedures (such as
learning rates). Conveniently, there are often not more than 10
or 20 such hyperparameters that are known to be important: the
dimensionality of such problems is compatible with Bayesian
optimisation. In real-world cases, these hyperparameters have
been historically selected manually by practitioners: it is not
difficult to make the case for the automated alternative pro­
vided by Bayesian optimisation. As such, Bayesian optimisation
is a core tool in the quest for automated machine learning.9 As one example, Bayesian optimisation was used to tune the hyperparameters of AlphaGo for its high-profile match against Lee Sedol.10

9 See, e.g., www.ml4aad.org/automl and autodl.chalearn.org.
10 Chen et al. (2018)
Perhaps most interestingly, from a Probabilistic Numerics
perspective, there are many characteristics of the objective that
can be used to inform the choice of prior and loss function. First,
note that the relevance of some hyperparameters to an objective
function for hyperparameter tuning is often conditional on the
values of other hyperparameters. This includes examples in
which the objective has a variable number of hyperparameters:
we may wish to search over neural network architectures with a
variable number of layers. In that case, whether the number of
hidden units in the third layer will influence the objective will
be conditional on the value of the hyperparameter that specifies
the number of layers. The covariance function of a gp surrogate can be chosen to capture this structure, leading to improved Bayesian optimisation performance.11

11 Swersky et al. (2013)
Another common feature of hyperparameter tuning problems
is that the objective can return partial information even before its
computation is complete. Concretely, such computation is used
in training a machine learning model, typically through the
use of a local optimiser (see Chapter IV). Even before that local
optimisation has converged, its early stages can be predictive of
the ultimate value of the objective. This is observable through
the familiar decaying-exponential shape of so-called training
curves or learning curves, giving training loss as a function of
the number of iterations of local optimisation. If the early
parts of a training curve do not promise a competitive value
after more computation is spent (a value strongly correlated
with or equal to the associated objective function value), it
might make sense to abort the computation prematurely. This
intuition was incorporated into a Bayesian optimisation model

with some success by Swersky, Snoek, and Adams (2014). Their


approach built a joint model over training curves (internal to
each objective function evaluation) and the objective function
itself: an excellent example of Bayesian optimisation’s ability to
incorporate structure.
Relatedly, hyperparameter tuning problems often allow eval­
uations of variable fidelity. An observation of higher fidelity is
usually associated with higher computational cost. The canoni­
cal example is the choice of training set size. A larger training set
requires more computation to evaluate the loss on that training
set, but the result will be more informative about how effective
the associated model would be for the full data set. In this case,
the size of the training set can itself be taken as a variable for
optimisation. Given this variable, a bespoke Bayesian optimisa­
tion model (comprising both a novel surrogate model and loss
function capturing the variable cost) can be produced.12 Result­ Nickson et al. (2014); Klein et al. (2017);
12

McLeod, Osborne, and Roberts (2015).


ing algorithms can offer orders-of-magnitude acceleration in
finding effective hyperparameter settings.

> 34.3.1 Software

At the time of writing, dozens of open-source Bayesian optimisa­


tion libraries exist. Many implement all of the acquisition func­
tions introduced in this chapter. There are further open-source
Bayesian optimisation packages specialised to particular appli­
cations, notably hyperparameter tuning. We recommend the
open-source emukit package,13 which provides a full-featured sublibrary for Bayesian optimisation, if only because emukit additionally supports other pn methods.

13 Paleyes et al. (2019), available at emukit.github.io
Chapter VI
Solving Ordinary
Differential Equations
__________ 35
Key Points

The solution x(t), t ∈ [0, T], of an ordinary differential equation (ODE) is defined as the integral of a vector field f along its own path {x(s); 0 ≤ s ≤ t} up to time t; see Eq. (37.2). Accordingly, solving ODEs is often described as the "nonlinear" extension of univariate integration, since the subset of ODEs whose vector field is independent of x are quadrature problems over t. These linear instances can, therefore, be solved by all integration methods from Chapter II. Probabilistic solvers for ODEs, however, also require an iterative learning of x(t) (using its own previous estimates) and a tracking of its accumulated uncertainty over time. Nonetheless, those that model x by a GP should reduce to Bayesian quadrature (as they do!).
There is, however, another (perhaps less obvious) connection
with the above-presented material: on the one hand, ODEs are
the mechanistic model for a dynamical system without observa­
tions of x. On the other hand, time series (or temporal signals)
are, vice versa, the statistical description of a dynamical sys­
tem with available observations of x, but without a mechanistic
model - which can be estimated by filtering and smoothing, as
explained in §5.3. The probabilistic framework will naturally
unify these two complementary viewpoints by applying the lat­
ter to the former, that is filters and smoothers to ODEs - leading
to a wide class of ODE solvers called ODE filters and smoothers.
The topic of ODEs has been extensively studied by numerical
analysts. Likewise, probabilistic ODE solvers are, compared to
algorithms for other domains, relatively well-understood and
begin to match parts of the deep classic theory. On the proving
ground of inverse problems, they can already improve upon
classic methods. The principal take-aways of this chapter are:

• Approximating the solution of an ODE, x : [0, T] → Rd with initial value x(0) = x0 ∈ Rd, can be regarded as curve fitting of a time series using information about x'(t), 0 = t0 < t1 < · · · < tN = T, from evaluations of f. To this end, we can (a priori) jointly model [x, x'] with a GP and treat its approximation as a GP regression problem, with the usual cubic cost of O(N³) in the number of time steps {t1, ..., tN}. If we restrict our choice of GP priors to Gauss-Markov processes, then the GP posterior can be computed by Bayesian filters (and smoothers) in linear time O(N). This view engenders a broad family of ODE solvers, now known as ODE filters and smoothers, which, like in signal processing, span the entire spectrum of fast-and-Gaussian (extended) Kalman filters to expressive-but-slower particle filters.

• If there is an additional final condition on x(T) (i.e. if the ODE is a boundary value problem), similar fast-and-Gaussian solvers can be constructed.

• All modelling assumptions of such an ODE filter or smoother can be concisely captured by a probabilistic (Bayesian) state space model (ssm), consisting of a dynamic model (a prior) and a measurement model (a likelihood). Given such a ssm, an ODE filter or smoother computes the posterior distribution, which for some SSMs requires approximations. The choice of ssm and approximate inference schemes, therefore, completely defines an ODE filter or smoother, whose parameters (most importantly: the step size) can be adapted in a probabilistic manner. Notably, the integrated-Wiener-process (IWP) prior is the most suitable dynamic model for generic ODEs because it extrapolates with Taylor polynomials; if more is known, other ("biased") priors can, however, be more precise.

• Like in previous chapters, specific models yield ODE filters (of the fast-and-Gaussian type) that coincide with classic methods. Other choices lead to completely new solvers that encode new kinds of prior knowledge and output more detailed representations of the (potentially multimodal) set of plausible trajectories.

• The main analytical desiderata for ODE solvers are high polynomial convergence rates and numerical stability - as well as (additionally for probabilistic methods) a calibrated posterior uncertainty. For extended Kalman ODE filters with an IWP prior (and, locally, with other priors), global convergence rates on par with standard methods hold, namely of the same order as the number of derivatives contained in the ssm. In this setting, the expected numerical error (the posterior standard deviation) is well-calibrated in the sense that it asymptotically matches these convergence rates. For the MAP estimate, such rates even hold in more SSMs. Recent practical implementations have drastically improved the numerical stability of these methods.

• Unlike in previous chapters, there is an important line of probabilistic numerical methods for ODEs that fundamentally deviates from the philosophy of this book, i.e. from applying GP regression to numerics. These perturbative solvers do not compute a posterior distribution p(x | {f(x̂(ti))}_{i=1}^{N}), but instead design a stochastic simulator by perturbing some classic numerical method - such that the size of these stochastic perturbations is proportional to the numerical error. Its randomised simulations are then, in a similar vein to particle ODE filters, considered to be samples from the distribution over the set of numerically possible trajectories (given a numerical integrator and a discretisation). These nonparametric methods are more expressive, but need to simulate the ODE multiple times. Unlike Gaussian solvers, they can represent bifurcations and chaos.

• If their output distribution is used as a likelihood, both ODE filters (or smoothers) and perturbative solvers can prevent overconfidence in (otherwise likelihood-free) ODE inverse problems. If an extended Kalman ODE filter or smoother is employed, then this likelihood is even twice differentiable, and its gradients (and Hessians) can greatly increase the sample efficiency and, thereby, the overall speed of existing inverse-problem solvers. To date, the usefulness of probabilistic ODE solvers has also been demonstrated in a few additional settings.

• Efficient implementations of ODE filters and smoothers are available as part of the ProbNum Python package.1

1 Code at probnum.org. See the corresponding publication by Wenger et al. (2021).
36
Introduction

Since their invention by Isaac Newton and Gottfried Wilhelm


Leibniz in the seventeenth century, differential equations have
become the standard mathematical description of any (continuous) dynamical system with explicitly known mechanics.
Broadly defined as any equation that relates one (or more) functions and their derivatives, there are two main categories: ordinary differential equations (ODEs) and partial differential equations (PDEs).1 While ODEs contain derivatives in only one independent variable (e.g. time), PDEs contain partial derivatives in multiple independent variables (e.g. three-dimensional space). In this text, we restrict our attention to ODEs and only give some pointers to probabilistic solvers for PDEs in §41.4.

1 Stochastic differential equations (SDEs), which relate a stochastic process with its derivatives, are not included in this definition; see §5.3.
Over the centuries, mathematical analysts have provided a
deep and beautiful theory for ODEs which has been compiled
into many comprehensive books - such as the ones by Arnold
(1992) and Teschl (2012). Nonetheless, closed-form solutions
are only known for very few ODEs and numerical approxi­
mations are needed in all other cases. The numerics of ODEs
has therefore become an equally well-explored analytical topic.
The accumulated classical numerical theory is, for example, pre­
sented in the excellent textbooks by Hairer, Nørsett, and Wanner
(1993), Hairer and Wanner (1996), Deuflhard and Bornemann
(2002), and Butcher (2016). The following setup will allow us to
be undistracted by some analytical corner cases and focus on
the introduction of (probabilistic) numerical methods.

Definition 36.1. We say that an ordinary differential equation


(ODE) is a relation of the form

x'(t) = f(x(t))   for all t ∈ [0, T],    (36.1)

between a curve x : [0, T] → Rd on an interval [0, T] ⊂ R and a



vector field (aka dynamics) f : V → Rd on some non-empty open set V ⊂ Rd.

By this definition, we restrict our attention to first-order autonomous ODEs - but this comes without loss of generality, as higher-order ODEs can be reformulated as first-order2 and as we only exclude the (mathematically analogous) non-autonomous case f(x(t), t) to declutter our notation.3

Definition 36.2. We say that x : [0, T] → Rd is a solution of the initial value problem (IVP)

x'(t) = f(x(t)), for all t ∈ [0, T],    (36.2)

with initial value x(0) = x0 ∈ Rd.

If there is an additional final condition x(T) = xT ∈ Rd, then Eq. (36.2) turns into a boundary value problem (BVP).4

Note that, under this definition, IVPs can be ill-posed; in particular, some ODEs do not have a well-defined solution, either when Eq. (36.2) is satisfied for multiple5 choices of x or when a local solution x on [0, t], 0 < t < T, cannot be extended6 to the entire interval [0, T]. BVPs, of course, only admit a solution if xT equals the final value x(T) of a solution of the underlying IVP. Consequently, the solutions of many IVPs (and even more BVPs) are not well-defined. Since there is nothing to approximate in these cases, the whole concept of a numerical error loses its meaning. Fortunately, the next two theorems will enable us to exclude such cases by requiring the following assumptions.

Assumption 36.3. The set V ⊂ Rd is open and x0 ∈ V. The vector field f : V → Rd is locally Lipschitz continuous, i.e. for each open subset U ⊂ V there exists a constant LU > 0 such that

||f(x) − f(y)|| ≤ LU ||x − y||, for all x, y ∈ U.

While more general preconditions for uniqueness and existence exist in the literature,7 the following version of the Picard-Lindelöf theorem8 is adapted to Definition 36.1 and will suffice for our purposes.

Theorem 36.4 (Picard-Lindelöf theorem). Under Assumption 36.3, let us choose a δ > 0 such that Bδ(x0) := {x ∈ Rd : ||x − x0|| ≤ δ} ⊂ V and set M := sup_{x ∈ Bδ(x0)} ||f(x)||. Then there exists a unique local solution x : [0, T] → Rd of the IVP (36.2) for all T ∈ [0, δ/M]. More specifically, if V = Rd and M < ∞, then there exists a unique global solution x : [0, ∞) → Rd, i.e. for T = ∞.

2 Most classical methods can only solve first-order ODEs, which is no restriction to their applicability: any ODE of nth order, x^(n)(t) = f(x^(n−1)(t), ..., x'(t), x(t)), can be transformed into an ODE of first order by defining the new object x̄(t) := [x(t), x'(t), ..., x^(n−1)(t)]ᵀ, which implies a first-order ODE with the modified vector field f̄(x̄(t)) = [x'(t), ..., x^(n−1)(t), f(x̄(t))]ᵀ. This, however, hides the derivative-relation of the components of x̄(t). In Probabilistic Numerics, this structure can be explicitly modelled in a state-space model; see Exercise 38.2 and Bosch, Tronarp, and Hennig (2022), who considered second-order ODEs directly.
3 Note, however, that we ignore, as most textbooks do, the more general case of implicit ODEs 0 = F(x^(n)(t), ..., x(t)), because only little is known about them; see Eich-Soellner and Führer (1998).
4 Strictly speaking, this is only a special case of a BVP; the boundary condition can more generally be g(y(a), y(b)) = 0 for an arbitrary function g.
5 Example: Consider the ODE x'(t) = 2 sign(x(t)) √|x(t)|, which admits a unique solution for all x0 ≠ 0. However, if x0 = 0, then the curves x(t) = 0 for t ∈ [0, t0] and x(t) = ±(t − t0)² for t ∈ (t0, T] are solutions for all t0 ∈ [0, T].
6 Example: Consider the ODE x'(t) = x(t)². Its solution x(t) = x0/(1 − x0 t) only exists on [0, 1/x0] and cannot be extended beyond its singularity at t = 1/x0.
7 For the most general existence and uniqueness results we are aware of, see §2.3 in Teschl (2012).
8 Our Theorem 36.4 is a modified version of Theorem 2.2 from Teschl (2012), where a proof is provided.

To ensure sufficient regularity of such an x, we need an addi­


tional assumption.

Assumption 36.5. The vector field f is (q − 1)-times continuously differentiable, i.e. f ∈ Cq−1(V, Rd), for some q ∈ N.

Theorem 36.6 (Regularity of IVP solutions). Under Assumptions 36.3 and 36.5, the unique solution x is one order more regular than f, i.e. x ∈ Cq([0, T], Rd).9

9 Cf. Lemma 2.3 in Teschl (2012).

Proof. By induction over q ∈ N. Theorem 36.4 implies that x is well-defined; the base step (q = 1) then follows directly from the ODE (36.2) with continuous f by use of the fundamental theorem of calculus. The inductive step (q → q + 1) is obtained by differentiating the ODE for the (q + 1)th time. □

In the sequel, we will restrict our attention to IVPs with a well-defined solution x ∈ Cq([0, T], Rd) for some q ∈ N and some T > 0 - such as, amongst others, those satisfying Assumptions 36.3 and 36.5.10 Moreover, to simplify the notation, we will, w.l.o.g., set V = Rd.

With these preparations in place, we will from here on focus on numerics. A numerical IVP solver (aka ODE solver) is any algorithm that receives an IVP (f, x0, [0, T]) as inputs, and computes a (discrete or continuous-time) approximation of x : [0, T] → Rd as an output.11 A probabilistic IVP solver (or probabilistic ODE solver) outputs a probability distribution over x instead - either by iteratively updating a prior on x (§38) or by sampling from the set of numerically possible approximations of x (§40). But, first, we will give an intuitive introduction to classical ODE solvers.

10 This will allow probabilistic numerical solvers to model q derivatives in a state-space model; see §38.1. Note that this is no restriction of their applicability since, for q = 1, Assumption 36.5 is implied by Assumption 36.3, which is the standard assumption of classical numerical analysis. Thus, the probabilistic solvers (which we will introduce in §38 and §40) are as generically applicable as classical methods.
11 We will exclusively focus on IVPs for the most part, and only briefly treat the topic of BVPs in §41.1.
sampling from the set of numerically possible approximations topic of BVPs in §41.1.
of x (§40). But, first, we will give an intuitive introduction to
classical ODE solvers.
__________ 37
Classical ODE Solvers
as Regression Methods

From a physics viewpoint, the solution x ∈ Cq([0, T], Rd) of an IVP can be thought of as the flow of a massless particle in the force field f ∈ Cq−1(Rd, Rd) starting at x0. Hence, any estimate x̂(t) of x(t) always comes with a derivative estimate yt, written

yt := f(x̂(t)) ≈ f(x(t)) = x'(t),    (37.1)

where the approximation holds because x̂(t) ≈ x(t), and the equality is the ODE (36.1). This derivative estimate can be used to construct a local linearisation x̂(t + h) = x̂(t) + h yt to extrapolate forwards. This is the underlying principle of all ODE solvers. Euler's method represents this principle in its purest form as it simply iterates this linear extrapolation along time; see Figure 37.1 for a graphical depiction.
along time; see Figure 37.1 for a graphical depiction.
Figure 37.1 also highlights the important fact that, after the first step, a numerical solver essentially follows another IVP - with the same f, but initial value x̂(t) instead of x(t). This is why one often considers the flow of the ODE

Φt(a) := a + ∫₀ᵗ f(x(s)) ds,    (37.2)

where x(t) = Φt(a) is the solution of the IVP (36.2) with initial value x(0) = a.
Given any numerical estimate x̂(t), standard ODE solvers now extrapolate from t to t + h by approximating this flow Φh(x̂(t)) as precisely and cheaply as possible. In fact, Euler's method can simply be interpreted as using a first-order Taylor expansion, x̂(t + h) = x̂(t) + h yt, to approximate Φh(x̂(t)). Hence, it is only natural that Euler's method produces (by Taylor's theorem) a local error of O(h²), and (after N = T/h ∈ O(1/h) steps) a global error of O(h).
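Euler's method is short enough to state in full. The following sketch (the logistic ODE is an arbitrary example of ours) is exactly the iterated linear extrapolation of Eq. (37.1):

import numpy as np

def euler(f, x0, T, N):
    """Euler's method: iterate the local linearisation
    x(t + h) ~ x(t) + h * f(x(t)) with step size h = T / N."""
    h = T / N
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(N):
        xs.append(xs[-1] + h * f(xs[-1]))   # y_t := f(x_hat(t)), Eq. (37.1)
    return np.stack(xs)   # estimates at t = 0, h, 2h, ..., T

# example: logistic growth x'(t) = x(t) * (1 - x(t)), x(0) = 0.1
trajectory = euler(lambda x: x * (1 - x), 0.1, T=10.0, N=100)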

Figure 37.1: The underlying principle: an ODE solver receives an IVP, i.e. an initial value x0 and a rule f to get the derivatives at every point, which are depicted by the grey arrows (upper left). The true IVP solution is a trajectory of a massless particle along these derivatives (upper right). In the beginning we can extrapolate with the true derivative; Euler's method simply linearises with it (lower left). This procedure is iterated, but at a numerical estimate (solid circle) and with an imprecise derivative for extrapolation (lower right); see Eq. (37.1). [Four panels over time t; legend: true solution.]

To achieve higher polynomial rates than Euler, one just has to extrapolate with better approximations of Φh(x̂(t)). In most of classical numerics, this is achieved by either of two families of methods: single-step and multi-step methods. They both construct higher-order Taylor expansions of Φh(x̂(t)) - either by collecting additional information from Eq. (37.1) at multiple sub-steps of one step [t, t + h] (single-step) or by exploiting the already-collected information {y_{t−h} = f(x̂(t − h)), y_{t−2h} = f(x̂(t − 2h)), ...} from multiple previous steps (multi-step). In both cases, the underlying principle of Eq. (37.1) remains unchanged; but the thus-created information is then used to construct a better regression (in the form of a better polynomial extrapolation)1 of the flow Φ at every step. In the worst case, the convergence rate will then, of course, depend on how many summands of the Taylor series of Φh(x̂(t)) are matched in each step. Fortunately, this Taylor series is known2 to be

Φh(x̂(t)) = ∑_{i=0}^{∞} (hⁱ/i!) f^(i)(x̂(t)),    (37.3)

where the {f^(i)}_{i=0}^{∞} are recursively defined by f^(0)(a) := a, f^(1)(a) := f(a), and

f^(i)(a) := [∇x f^(i−1) ⊙ f](a),    (37.4)

where ⊙ denotes the elementwise product. Any solver that matches the first p ∈ N summands of Eq. (37.3) will accordingly have local convergence rates of O(h^{p+1}), and global rates of O(h^p). Such solvers are called pth-order methods.

1 In fact, Nordsieck (1962) already observed that "...all methods of numerical integration are equivalent to finding an approximating polynomial...".
2 Eq. (37.3) follows from dⁱΦt(x(t))/dtⁱ = f^(i)(x(t)), which is proved in Appendix E of Kersting, Sullivan, and Hennig (2020) by iterative application of the chain rule. This formula can also be obtained as a consequence of Faà di Bruno's formula, as given by Lemma 2.8 in Hairer, Nørsett, and Wanner (1993).
A numerical solver thus decomposes the numerical problem of solving ODEs - i.e. of fitting the curve x(t) = Φt(x0) - into a series of local regressions of Φh(x̂(t)). Moreover, a pth-order single-step solver (such as a pth-order Runge-Kutta) matches the first p derivatives {f^(i)(x̂(t)); i = 1, ..., p} of Φh(x̂(t)) at h = 0 in every step. This is to say that, at every step t → t + h, it locally performs Hermite interpolation3 of Φh(x̂(t)) with data

{Φ0(x̂(t)) =! x̂(t),  dⁱΦh(x̂(t))/dhⁱ|_{h=0} =! f^(i)(x̂(t));  i = 1, ..., p}.    (37.5)

However, since x̂(t) ≈ x(t) and thus Φt(x̂(t)) ≈ Φt(x(t)), it is unclear how these local regressions should be combined globally. Classical solvers simply pretend that estimates on the right-hand side of the data assignments (=!) in Eq. (37.5) relate to the true IVP solution, instead of to the flow map started at x̂(t). This would mean that, given a discretisation 0 = t0 < t1 < · · · < tN = T, the solver uses iterated Hermite interpolation of x(t + h) from the data

{x(t) =! x̂(t),  x^(i)(t) =! f^(i)(x̂(t));  i = 1, ..., p}.    (37.6)

This removal of the flow map Φ from Eq. (37.5) to Eq. (37.6) means that the solver, after every completed step, falsely assumes that its current estimate x̂(t) is the true x(t) - a property of classical numerics, referred to as uncertainty-unawareness.4 To satisfy this overly-optimistic internal assumption of the solver, one would have to replace x̂(t) by the exact x(t) in the entire data set, Eq. (37.6). Iterated local Hermite interpolation on this more informative (but, to the solver, inaccessible) data set indeed yields a more accurate regression of x - which is numerically demonstrated in Figure 37.2 for the popular fourth-order Runge-Kutta method (RK4).

3 Spitzbart (1960)
4 For more details on uncertainty-(un)awareness in numerics see §1 in Kersting (2020).
As a remedy, we can build more “uncertainty-aware” numer­
ical solvers by modelling the ignored uncertainty with proba­
bility distributions, that is by adding appropriate noise to the
Hermite extrapolation performed by classical ODE solvers (as
in the generic GP regression from §4.2.2).
But our probabilistic, regression-based view of numerics will
lead us further than that, beyond the conventional categories
of single-step and multi-step methods. To see how, let us first
recall that classical solvers iterate local Hermite interpolations
on t → t + h using the data set from Eq. (37.6) for each respective time t.

Figure 37.2: Comparison of the maximal error of fourth-order Runge-Kutta (RK) and iterated fourth-order Hermite interpolation with exact data (i.e. x(t) instead of x̂(t) in Eq. (37.6)) on the linear and the Van-der-Pol ODE. The linear system is given by x'(t) = x(t), x(0) = 1.0, and the Van-der-Pol system by (x1'(t), x2'(t)) = (μ(x1(t) − ⅓x1(t)³ − x2(t)), x1(t)/μ), x(0) = (2, −5), μ = 5. The graphs show the maximal error up to time t. As expected, Hermite interpolation produces a lower error.
But, as long as we have an x̂(t) when needed, nothing prevents us from using the union of these data sets for all times t visited by the solver. Second, we recall from above that even our first (direct and unaltered) interpretation of ODE solvers as iterated local regression with higher derivatives, from Eq. (37.5), is constructed from the sole principle of Eq. (37.1). Hence - if we trust our regression methods to aggregate the information about x' from Eq. (37.1) at least as skillfully as the classical ODE solvers - we can treat any evaluation f(x̂(t)) at a numerical estimate x̂(t) of x(t) as what it is: data on x'(t).

Via these two considerations, we thus regard both the local clustering of all derivative information from a finite number of steps (or sub-steps) and the ensuing local derivation of higher-derivative information as artificial constructs. After discarding them, we arrive at the insight that, given a discretisation 0 = t0 < t1 < · · · < tN = T, approximating the ODE solution x : [0, T] → Rd is nothing but a regression on the data set

{x(0) =! x0,  x'(tn) =! f(x̂(tn));  n = 0, ..., N},    (37.7)

where x̂(tn) is a numerical estimate of x(tn). This regression problem might appear to be defined in a circular way, since computing x̂(tn) is the very goal of the problem. But this will be no problem because ODE solvers - just like the Gauss-Markov regression tools from §5 that we will employ - go through time sequentially. Hence, at every tn, there is an x̂(tn) readily available (e.g. the predictive mean conditioned on the preceding steps). In other words, the global regression problem of Eq. (37.7) is revealed to the ODE solver one step at a time (i.e. as a time series).
Below, we will show how this regression formulation is rigorously realised in a state-space model (ssm), without the circular appearance of x̂ in the problem formulation. While it will (for full generality) employ a different kind of information zn (see Eq. (38.13)), the connection to yt from Eq. (37.1) will become evident via an intuitive (but less general) ssm in §38.3.4. Bayesian regression of x(t) in such a ssm is then performed by methods known as ODE filters and smoothers. The difference between a classical and a probabilistic solver is visualised in Figure 37.3.

Figure 37.3: Sketch contrasting classic and probabilistic ODE solvers. Four steps of size h, true solution in black. Top row: classical solvers construct an extrapolation x̂(ti) (solid black circle) which, in this example, is also used as the probing point to construct an observation yi = f(x̂(ti), ti). Bottom row: probabilistic solvers do the same, but return a probability measure p(x(t)) (grey delineated shading, with lightly coloured samples), rather than the point estimate x̂(t) (although, for example, the mean of p could be used as such an estimate). For a well-calibrated classic solver, the estimate x̂ should lie close to the true solution. The same applies for the mean (or mode) estimate of a Gaussian probabilistic solver; but additionally, the width (standard deviation) of the posterior measure should also be meaningfully related to the true error. In the case of the (nonparametric) perturbative solvers, the resulting samples should accurately capture the entire distribution of numerically possible trajectories, e.g. by covering both sides of a bifurcation (Figure 38.2).

► 37.1 A Brief History of Probabilistic ODE Solvers

John Skilling (1991) was the first to recognise that ODEs can,
and perhaps should, be treated as a Bayesian (GP) process re­
gression problem. But two decades passed by before, in parallel
development, Hennig and Hauberg (2014) and Chkrebtii et al.
(2016) set out to elaborate on his vision. While both papers used
GP regression as a foundation, the data generation differed.
Hennig and Hauberg (2014) generated data by evaluating f at
the posterior predictive mean, and Chkrebtii et al. (2016) by eval­
uating f at samples from the posterior predictive distribution,
i.e. at Gaussian perturbations of the posterior predictive mean.
This difference stemmed from separate motivations: Hennig
and Hauberg had the initial aim to deterministically reproduce
classical ODE solvers in a Bayesian model, as had been previ­
ously achieved in e.g. Bayesian quadrature. Chkrebtii et al., on
the other hand, intended to sample from the distribution of solution trajectories that are numerically possible given a Bayesian model and a discretisation. Thus, these two papers founded two distinct lines of work which we call ODE filters and smoothers and perturbative solvers;5 see §38 and §40 respectively.

5 This is not the only categorisation of probabilistic ODE solvers. Another possible distinction would be nonparametric vs Gaussian, or deterministic vs randomised, which would both group the particle ODE filter/smoother with the perturbative solvers. Note that the perturbative solvers have been called "sampling-based" solvers in several past publications.

The former approach, after an early success of reproducing Runge-Kutta methods (Schober, Duvenaud, and Hennig, 2014),

acquired its name from the recognition that Kalman filtering is


a much faster form for probabilistic ODE solvers than conventional GP regression (Schober, Särkkä, and Hennig, 2019). This
first filtering formulation was conceptually somewhat vague,
which was rectified when Tronarp et al. (2019) introduced a
rigorous state-space model that unlocks all Bayesian filters and
smoothers for ODEs (§38). Since then, several publications have
further developed the theory and practice of these ODE filters
and smoothers.
The perturbative approach, on the other hand, was next, in
a seminal paper by Conrad et al. (2017), divorced from its ini­
tial Bayesian formulation. Instead of imposing a prior, Conrad
et al. (2017) proposed to model the local numerical errors as
suitably-scaled random variables which are added after every
step of a classical solver. In contrast to Chkrebtii et al. (2016),
the resulting samples are therefore drawn from the distribu­
tion of solution trajectories that are numerically possible given
a classical solver and a discretisation. This non-Bayesian view­
point became the predominant approach in this line of work.
Since then, several publications have introduced and analysed
multiple perturbative solvers - not necessarily with additive per­
turbations (Abdulle and Garegnani, 2020) - that output samples
in this framework.
Due to the selection of authors, this text focuses on ODE
filters and smoothers in great detail - while only providing a
careful presentation of some of the most important perturbative
methods and theorems, as well as a comprehensive referencing
of the literature.
38
ODE Filters and Smoothers

As explained in the last section, an ODE solver makes use of evaluations of f as information about the derivative x'(ti) at discrete time points 0 = t0 < t1 < · · · < tN = T along the time interval [0, T] to infer the continuous-time signal x : [0, T] → Rd that solves the IVP (36.2). This led us to regard IVPs as regression problems with data on [x, x'] according to Eq. (37.7). While in principle all (time-series) regression methods are now applicable to ODEs by this reformulation, we will once again take a Bayesian view and use linear-time Gauss-Markov regression by Bayesian filtering and smoothing (as introduced in §5). To this end, we first (as always in Bayesian inference) need to provide a statistical model consisting of a prior p(x) and a likelihood for p(f(x̂) | x) - or (counter-intuitively, but more generally) a likelihood depending on f(x̂) to condition directly on the ODE x'(t) = f(x(t)). The latter will take the form of the general ssm of §38.1.2, and the former will emerge as a linear-Gaussian alternative model for one of the algorithms (the EKF0) in §38.3.4. These SSMs will then in §38.3 allow us to solve ODEs with Bayesian (Kalman) filters and smoothers. This chapter thus consists of a description of models for ODEs (§38.1 and §38.2) and a presentation of methods to do inference in these models (from §38.3 on).

► 38.1 State Space Models (SSMs) for ODEs

Before we can define algorithms, we need a model. But only


some models admit fast algorithms, and - as we are only able
to match the linear-in-time-steps cost, O(N = T/h), of classical
solvers in the Markovian state-space models (SSMs) of §5 - we
restrict ourselves to such SSMs. To this effect, we model the prior

on x : [0, T] → R^d by a Gauss-Markov process - which we refer
to as the (continuous-time) dynamic model because it models the
evolution of x along [0, T]. Our Bayesian model is completed
by adding a likelihood, aka a (continuous-time)1 measurement
model.

1 The continuous-time dynamic and measurement model will be replaced by their discrete-time versions in §38.1.2. While the discrete-time case is sufficient to define the algorithms, it is important to comprehend that they are obtained by discretising a continuous-time stochastic filtering problem.

> 38.1.1 ODEs as Continuous-Time Filtering Problems

We use a stochastic process X(t) (referred to as the system in the
filtering-theory literature) to jointly model x(t) and x'(t) such
that

x(t) — X(0)(t) := HоX(t), for some matrix H0 G RdxD, (38.1)


x'(t) - X(1)(t) := HX(t), for some matrix H G RdxD. (38.2)

The entire vector-valued function modelled by X(t) will be


denoted by the state vector X : [0, T] RD, i.e. — (t) — X(t). It
contains x and x' via

x (t) = H0 #r( t), and x' (t) = H #—( t). (38.3)

Hence, a prior p (x-) on X immediately implies a joint prior


p (x, x') whose marginal p (x) is the prior on the ODE solution
x. Due to —(tt) — X(t), this prior distribution p(—) is nothing
but the law of X(t) which (as in Definition 5.4) we define by a
linear time-invariant SDE, written

dX(t) = FX(t) dt + L dwt, (38.4)

with Gaussian initial condition

X(0) - N(m0, P0), (38.5)

where wt denotes a standard Wiener process. In the standard


case where x0 G Rd is known without uncertainty (i.e. P0 = 0),
the initial distribution is Dirac, namely p(x(0)) = 3(x(0) — x0).
From Eqs. (5.20) and (5.21), we already know that this yields

p (x( t ))= N (x( t); A (t) m0, A (t) P0 A (t )T + Q (t)), (38.6)

for all t G [0, T], with matrices

A(t) := exp(tF), and Q(t) : J eFrLLTeFTT dт,

as in Eq. (5.21). Now, with the aid of the nonlinear transforma­


tion

g : RD Rd, I ^ Hl — f (H01), (38.7)



we can define the state misalignment2 z(t) := g(x⃗(t)) which is
equal to 0 for all t ∈ [0, T], since x⃗(t) solves the ODE (36.1)
due to (38.3). Again, we model this feature by a (this time,
non-Gaussian) Markov process

z(t) ~ Z(t) := g(X(t)) = H X(t) − f(H_0 X(t)), (38.8)

called the observation process.3
By this construction, the law of Z(t) is equal to the push-
forward measure g_*(p(x⃗(t))). Observing that z(t) = 0 for all
t ∈ [0, T] under the likelihood4

p(z(t) | x⃗(t)) = δ(g(x⃗(t))) (38.9)

now amounts to conditioning on all information contained
in the IVP (36.2). Thus, the Picard-Lindelöf theorem (Theo-
rem 36.4) ensures that the solution to this continuous-time
filtering problem is indeed the true x : [0, T] → R^d - or, in other
words, that there is no posterior uncertainty: p(x | z = 0) =
δ(x − x(t)). While this continuous-data limit case already hints
at the favourable convergence properties of this approach (see
§39.1), we will first have to discretise the processes (X(t), Z(t)) -
which will lead to a discrete-time ssm and actual numerical
algorithms.

2 For more intuition on this state misalignment, see §4 in Kersting, Sullivan, and Hennig (2020).

3 The pair of system X(t) and observations Z(t) define a (continuous-time) stochastic filtering problem which consists of computing the distribution of X(t), given {Z(s); 0 ≤ s ≤ t}. Its general solution is the stochastic process E(X_t | G_t), where G_t is the σ-algebra generated by {Z(s); 0 ≤ s ≤ t}; see Thm. 6.1.2 in Øksendal (2003).

4 Recall that z(t) = g(x⃗(t)).

Exercise 38.1. In the filtering-theory literature, the observation process Z(t) is often defined by an SDE as well. Derive this SDE from Eqs. (38.8) and (38.4) by use of Itô's lemma. For which ODEs is this SDE a Gaussian process?

> 38.1.2 A Discrete-Time SSM for ODEs

In practice, we have only a finite computational budget. Hence,


the above continuous-time formulation of the ODE filtering
problem is intractable. We thus consider a discretisation 0 =
t_0 < · · · < t_N = T with step sizes h_n := t_n − t_{n−1}. By re-
stricting the continuous-time processes X(t) and Z(t) to these
time points {t_n}_{n=0}^N, we obtain discretised versions of the prior
(Eq. (38.6)) and the likelihood (Eq. (38.9)). From these we can
now, after denoting x⃗(t_n) and z(t_n) by x_n and z_n, assemble the
following discrete-time ssm:

p(x_0) = N(x_0; m_0, P_0), (38.10)
p(x_{n+1} | x_n) = N(x_{n+1}; A(h_{n+1}) x_n, Q(h_{n+1})), (38.11)
p(z_n | x_n) = δ(z_n − g(x_n)), (38.12)
with data z_n = 0. (38.13)

It resembles the linear-Gaussian ssm from Eqs. (5.8)-(5.9) - with
the only difference that now H_n x_n is replaced by g(x_n) and R_n = 0 in
Eq. (38.12), which makes it nonlinear.5 As this is a complete and
rigorous ssm, all Bayesian filters and smoothers can now, in
principle, be applied to ODEs right away.

5 It might, however, be advantageous to add a positive variance R > 0 to Eq. (38.12), e.g. to facilitate particle filtering or to account for an inexact vector field f. This leads to a likelihood of the form p(z_n | x_n) = N(z_n; g(x_n), R).
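To see in code why the constant data z_n = 0 is informative, the following minimal Python sketch (our illustration, not part of any package; names are ours) evaluates the state misalignment g of Eq. (38.7) along the true solution of a linear test IVP. The misalignment vanishes on the whole grid, which is exactly the statement that the likelihood (38.12) conditions on.

import numpy as np

# Linear test IVP: x'(t) = -0.5 x(t), x(0) = 1, with d = 1 and q = 1,
# so the state is x_vec(t) = [x(t), x'(t)], H0 = e_1, H = e_2.
f = lambda x: -0.5 * x
H0 = np.array([1.0, 0.0])
H = np.array([0.0, 1.0])

def g(xvec):
    """State misalignment of Eq. (38.7): g(xvec) = H xvec - f(H0 xvec)."""
    return H @ xvec - f(H0 @ xvec)

ts = np.linspace(0.0, 2.0, 5)
x = np.exp(-0.5 * ts)                  # true solution
xvec = np.stack([x, -0.5 * x])         # true state [x(t), x'(t)]
print([g(xvec[:, i]) for i in range(len(ts))])  # 0.0 at every grid point

Any trajectory that is not a solution of the ODE produces a nonzero misalignment, so conditioning on z_n = 0 discriminates between candidate trajectories even though the data itself is constant.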

Note, however, that this ssm is only known since its intro-
duction by Tronarp et al. (2019); all preceding publications
employed a less-general linear-Gaussian ssm which we will also
define below in Eqs. (38.31)-(38.33). The new nonlinear ssm
(38.10)-(38.13), instead, leaves the task of finding approxima-
tions to the inference algorithm and, in this way, engenders both
Gaussian (§38.3) and non-Gaussian inference methods (§38.4).
This difference between the SSMs in the literature is at the
heart of a common source of confusion over the new ssm (38.10)-
(38.13): To some readers, it might appear that the constant data
z_n = 0 contains no information whatsoever. But this is mistaken
because the use of information does not only depend on the data
but also on the likelihood. While in the regression-formulation
of classical solvers (§37) the data was an evaluation of f, this
dependence on f is now hidden in the likelihood via the def-
inition of g, Eq. (38.7). Since g(x_n) is by construction equal to
0 for the true x_n = x⃗(t_n), the observation of the constant data
z_n = 0 amounts by the form of the likelihood, Eq. (38.12), to
“conditioning on the ODE” by imposing that x'(t_n) = f(x(t_n)) -
which is similar to Eq. (37.7). In §38.3.4, we will explain how
the alternative linear-Gaussian ssm echoes the logic of classical
solvers.

Exercise 38.2 (SSMs for higher-order ODEs). Many ODEs are
originally formulated as a higher-order ODE; see note 3 of §36. Most
classical solvers reduce such a higher-order ODE to a system of first-
order ODEs, and solve it instead. One advantage of ODE filters is
that they can solve a higher-order ODE directly, by using a different
ssm. How does the ssm (38.10)-(38.13) have to be changed in order
to model the following nth-order ODE?

x^(n)(t) = f(x^(n−1)(t), ..., x'(t), x(t)).

How many derivatives (at a minimum) have to be included in this
modified ssm? (See Bosch, Tronarp, and Hennig (2022) for a solution
in the case of second-order ODEs.)

► 38.2 Choice of Prior

Before we get to the algorithms, we conclude our exposition of
SSMs for ODEs with a discussion of the dynamic model p(x_{n+1} |
x_n) of Eq. (38.11) and its underlying continuous-time prior
p(x⃗(t)) of Eq. (38.6). This exposition of the prior is completed
by a description of the initialisation of p(x_0) in §38.2.1. As
the likelihood (measurement model) of Eq. (38.12) is (prior
to approximations of it) fully determined by f, this prior is
our only available modelling choice in the ssm of Eqs. (38.10)-
(38.13).
Let us recall that this prior is nothing but the law of the
solution X(t) of the linear time-invariant SDE (38.4) - which
spans all standard Gauss-Markov priors, such as the ones from
§5.4 and §5.5. While we can freely choose from this class of
priors, some are more suitable for ODEs than others. One prior
stands out in particular: the q-times integrated Wiener process
(IWP),6 where q ∈ N is the number of modelled derivatives.
To simplify the notation, we will from now on w.l.o.g.7 assume
that d = 1.
It is natural to consider the subset of priors which model x
and the derivatives of x as its coordinates.8 To this effect, we
construct X = [X^(0), ..., X^(q)]ᵀ such that its coordinates are the
derivatives, i.e. x^(i) ~ X^(i) for all i = 0, ..., q.9 As this imposes
that dX^(i−1)(t) = X^(i)(t) dt for i = 1, ..., q, this construction
restricts the first q rows of the SDE (38.4) such that its drift and
diffusion matrices read

F = \begin{pmatrix} 0 & 1 & & & \\ & 0 & 1 & & \\ & & \ddots & \ddots & \\ & & & 0 & 1 \\ -a_0 & -a_1 & \cdots & -a_{q-1} & -a_q \end{pmatrix} and L = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \sigma \end{pmatrix}.

The last row can still be set flexibly by choosing the scale σ > 0
of the Wiener process and the non-negative drift coefficients
(a_0, ..., a_q) ≥ 0 - which parametrise the Matérn covariance
family with ν = q + 1/2, as we saw in §5.5.

6 The IWP is also referred to as the integrated Brownian motion in the literature.
7 See Appendix B of Kersting, Sullivan, and Hennig (2020) for why this comes with no loss of generality.
8 In fact, the only exception to this choice is the below-discussed Fourier ssm from Kersting and Mahsereci (2020) which models the Fourier coefficients instead.
9 Note that this implies that H_0 = [1, 0, ..., 0] and H = [0, 1, 0, ..., 0] ∈ R^{1×(q+1)} in Eqs. (38.1) and (38.2).
Although Matérn priors are popular for GP regression, they
have (in their general form) not yet been explored for ODE
filtering. Only the special case of (a_0, ..., a_{q−1}) = 0 has been
studied, where the only free parameter is a_q > 0. In this case,
X(t) is the q-times integrated Ornstein-Uhlenbeck process with
(mean-reverting) drift coefficient a_q. While this prior can be
advantageous for some exponentially decreasing curves (such as
radioactive decay),10 it is, to date, not known if these advantages
extend to more ODEs.

10 Magnani et al. (2017)
Meanwhile, the q-times IWP, which sets (a_0, ..., a_q) = 0, has
become the standard prior for ODEs because the q-times IWP
extrapolates (as we saw in §5.4) by use of polynomial splines of
degree q. And this polynomial extrapolation also takes place for
the derivatives: under the q-times IWP prior, the ith mean of the
dynamic model (38.11) is, by Eq. (5.24), for all i = 1, ..., q + 1
given by

[A(h_{n+1}) x_n]_i = \sum_{k=i}^{q+1} \frac{h_{n+1}^{k-i}}{(k-i)!} [x_n]_k, (38.14)

i.e. by a (q + 1 − i)th-order Taylor-polynomial extrapolation.
In particular, the solution state (i = 1) is predicted forward
by a qth-order Taylor expansion, which is by Taylor's theorem
(absent additional information about x) the best local model.11
Note that this is in keeping with classical solvers which - in light
of Eq. (37.3) - extrapolate forward along Taylor polynomials of
the flow of the ODE.12

11 This insight will be the basis of the convergence-rates analysis of §39.1.1.
12 Accordingly, all known equivalences with classical models hold for the IWP prior; see Schober, Sarkka, and Hennig (2019).

Therefore, it is only natural that the IWP is the standard prior
for ODEs, and that any deviation from it requires specific prior
knowledge on the solution x : [0, T] → R^d. Hence, the utility of
adapting prior-selection strategies from GP regression depends
on how much knowledge on x can be extracted from f. With
this in mind - to draw from the full inventory of GP priors -
one can even go beyond the Matérn class and use state-space
approximations of non-Markov covariance functions.13 The fre-
quent case of periodic ODEs (oscillators) can, following earlier
work on GP regression,14 be modelled by such a state-space
approximation of the periodic covariance functions.15 Remark-
ably, this model extrapolates with a Fourier (instead of a Taylor)
expansion - which is indeed how a periodic signal x is usually
approximated. Unfortunately, Fourier series are (unlike Taylor
series) global models, and therefore the utility of this periodic
model is (so far) limited to fast-and-rough extrapolations with
large step sizes after an initial learning period. It remains to
be seen whether this (or another) radical deviation from the
Taylor-expansion logic of classical numerics can give solvers
that can compete with probabilistic solvers that use the IWP
prior.

13 For a comprehensive overview of such approximations, see §12.3 in Sarkka and Solin (2019).
14 Solin and Sarkka (2014)
15 Kersting and Mahsereci (2020)
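Since the q-times IWP is the workhorse prior, it is worth seeing its discrete-time transition in code. The following sketch (our own minimal Python; the function name iwp_transition is hypothetical) assembles A(h) and Q(h) for the q-times IWP in d = 1. The rows of A(h) reproduce the Taylor-polynomial extrapolation of Eq. (38.14), and Q(h) uses the standard closed form of the IWP process-noise covariance.

import numpy as np
from math import factorial

def iwp_transition(q, h, sigma_sq=1.0):
    """A(h) and Q(h) for a q-times integrated Wiener process (d = 1).

    A[i, j] = h^(j-i) / (j-i)!  for j >= i  (Taylor extrapolation, Eq. (38.14));
    Q[i, j] = sigma^2 h^(2q+1-i-j) / ((2q+1-i-j) (q-i)! (q-j)!).
    Indices are 0-based, so the state is [x, x', ..., x^(q)].
    """
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(i, q + 1):
            A[i, j] = h ** (j - i) / factorial(j - i)
    for i in range(q + 1):
        for j in range(q + 1):
            p = 2 * q + 1 - i - j
            Q[i, j] = sigma_sq * h ** p / (p * factorial(q - i) * factorial(q - j))
    return A, Q

A, Q = iwp_transition(q=2, h=0.1)
print(A[0])  # [1, h, h^2/2]: the qth-order Taylor prediction of x

The first row of A(h) is exactly the Taylor polynomial that classical solvers also extrapolate along, which is the equivalence discussed above.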

> 38.2.1 Initialisation

To fully determine the prior, we now consider the Gaussian
initial distribution p(x_0) = N(x_0; m_0, P_0) of Eq. (38.10), which
corresponds to X(0) from Eq. (38.5) in continuous time. Here,
we restrict our attention to the standard case of SSMs that model
the first q derivatives of x.16 If x_0 is known (that is, [m_0]_1 = x_0),
then all q derivatives are determined by

x^(i)(0) = f^(i)(x_0), (38.15)

with the recursively defined f^(i) from Eq. (37.4). Hence, the
ideal initialisation is

m_0 = [x_0, f(x_0), f^(2)(x_0), ..., f^(q)(x_0)]ᵀ ∈ R^{q+1},
P_0 = 0 ∈ R^{(q+1)×(q+1)}.

Kramer and Hennig (2020, §3.1) recognised that this can be
efficiently achieved by Taylor-mode automatic differentiation
(AD).17 This way, the computational complexity grows at most
quadratically in the order of the approximation - instead of
exponentially for standard AD. In this case, P_0 is set to zero.

16 That is, we exclude the Fourier model, see note 8.
17 Bettencourt, Johnson, and Duvenaud (2019)
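As an illustration of this initialisation, here is a small sketch using jax.experimental.jet, the Taylor-mode AD of Bettencourt, Johnson, and Duvenaud (2019). The recursion exploits that the kth time-derivative of t ↦ f(x(t)) at 0 equals x^(k+1)(0). The function name taylor_mode_init is ours, and the sketch assumes an autonomous, jet-traceable f with d = 1.

import jax.numpy as jnp
from jax.experimental.jet import jet

def taylor_mode_init(f, x0, q):
    """Taylor coefficients m0 = [x(0), x'(0), ..., x^(q)(0)] for x' = f(x).

    jet propagates truncated Taylor series: given the derivative series of
    x(t) at 0, it returns the derivative series of f(x(t)) = x'(t), whose
    kth entry is x^(k+1)(0). Each loop pass therefore adds one derivative.
    """
    derivatives = [f(x0)]                      # x'(0) = f(x0)
    for _ in range(q - 1):
        _, higher = jet(f, (x0,), (derivatives,))
        derivatives.append(higher[-1])         # next derivative x^(k+1)(0)
    return jnp.stack([x0] + derivatives)

# Logistic ODE x' = x (1 - x): the coefficients can be checked by hand.
m0 = taylor_mode_init(lambda x: x * (1.0 - x), jnp.asarray(0.5), q=3)
print(m0)  # [0.5, 0.25, 0.0, -0.125]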

► 38.3 Gaussian ODE Filters and Smoothers

In the last section, a generally-applicable discrete-time ssm
for ODEs was introduced in Eqs. (38.10)-(38.13). By virtue of
this model, the ODE solution x : [0, T] → R^d can now be
inferred by Bayesian filtering and smoothing on this ssm. Recall
that we already encountered the most prototypical Bayesian
filters and smoothers - the Kalman filter and Rauch-Tung-
Striebel smoother (aka Kalman smoother) - in Algorithms 5.3
and 5.4. For ODEs we will, however, not limit ourselves to
these elementary methods, but include all applicable filters
and smoothers (including non-Gaussian ones).18 This leads to
Bayesian ODE filtering and Bayesian ODE smoothing which are
defined in Algorithms 38.1 and 38.2. To include all filters and
smoothers, the algorithmic descriptions here are more general
in that they only contain the high-level procedural steps, and
not the precise algebraic computations (which will differ from
method to method). Note that the model adaptation and step-
size selection in lines 4 and 5 of Algorithm 38.1 are optional
(but recommended). We therefore postpone their description to
§38.5.

18 The most important filters and smoothers can, e.g., be found in Sarkka (2013).
To define an ODE filter, we thus have to determine the pre­
diction and the update step, in lines 6 and 8 of Algorithm 38.1.
For an ODE smoother, only line 5 of Algorithm 38.2 has to
be defined additionally. An ODE filter or smoother is named
by adding “ODE” to its classical name; e.g. the “particle filter”
becomes the “particle ODE filter”. The term “Gaussian ODE
filters and smoothers” refers to any ODE filter or smoother that
outputs Gaussian approximations of the posterior. Since any
smoother is an extension of a filter, we sometimes use the word
“filters” instead of “filters and smoothers”.
Below, we will spell out these computations for the filters
and smoothers that we consider most useful for ODEs - ordered

Algorithm 38.1: Bayesian ODE filtering iteratively computes a sequence of predictive and filtering distributions. Recall from the graphical model of filtering (Figure 5.1) (with z instead of y) that the sequential form of this inference procedure (i.e. the for-loop) is legitimate. The form of the computations in lines 6-8 depends on the choice of filter. The initialisation (line 2) is explained in §38.2.1. The optional (but recommended) lines 4 and 5 are detailed in §38.5.

1  procedure ODE filter(f, x(0), p(x_{n+1} | x_n))
2      initialise p(x_0)                                // with available information about x(0)
3      for n = 0 : 1 : N − 1 do
4          optional: adapt dynamic model p(x_{n+1} | x_n)
5          optional: choose step size h_n > 0
6          predict p(x_{n+1} | z_{1:n}) from p(x_n | z_{1:n})      // by (38.11)
7          observe the ODE: z_{n+1} = 0                 // according to (38.13)
8          update p(x_{n+1} | z_{1:n+1}) from p(x_{n+1} | z_{1:n}) // by (38.12)
9      end for
10     return {p(x_n | z_{1:n}); n = 0, ..., N}
11 end procedure

by statistical complexity from Gaussian to Monte-Carlo approx-
imations, i.e. from extended Kalman ODE filters via iterated
extended Kalman ODE filters to particle ODE filters.19 But, to
prevent confusion, we will first explain why Gaussian ODE
filters are statistically admissible for nonlinear ODEs whose
true posterior is non-Gaussian.

19 As well as the corresponding smoothers (in all cases). Note that this is not a complete list of ODE filters discussed in the literature, as it e.g. leaves out the unscented Kalman ODE filter; see §2 in Tronarp et al. (2019). These, as well as completely new ODE filters, can be obtained by inserting the corresponding computations into Algorithm 38.1, or (for smoothers) into Algorithm 38.2.

> 38.3.1 The Cost-Accuracy Trade-off: Why Use Gaussian Solvers for Nonlinear ODEs?

The ssm from Eqs. (38.10)-(38.13) is nonlinear, and thus yields
non-Gaussian posteriors. It does so for good reasons: while the
rest of the book mostly uses Gaussians for their faster computa-
tional speed, ODE solutions are almost always highly nonlinear
(and often even chaotic). Hence, the arguments for non-Gaussian
posteriors are more compelling than they were for the other
numerical problems in this book. To satisfy this need, we will
later present solvers (the particle ODE filter and the perturba-
tive solvers) which output a Monte-Carlo representation of the
non-Gaussian posterior.20

20 Or, in the case of most perturbative solvers, of a set of numerically possible trajectories.

Nonetheless, the inevitable trade-off between computational
speed and statistical accuracy (which is central to numerics) is
equally real for ODEs. In this regard, it is instructive to remem-
ber that even in signal processing with real data - where the
need for fast algorithms can sometimes be less critical, since the

Algorithm 38.2: Bayesian ODE smoothing extends Alg. 38.1 by iteratively updating its output, the filtering distributions p(x_n | z_{1:n}), to the full posterior p(x_n | z_{1:N}). Note that, in line 3, the filter additionally returns the posterior predictive distributions p(x_n | z_{1:n−1}) which it, for all n, computes as an intermediate step anyway; see line 6 of Alg. 38.1.

1  procedure ODE smoother(f, x(0), p(x_{n+1} | x_n))
2      {p(x_n | z_{1:n}), p(x_n | z_{1:n−1})}_{n=0,...,N} =
3          ODE filter(f, x(0), p(x_{n+1} | x_n))
4      for n = N − 1 : −1 : 0 do
5          compute p(x_n | z_{1:N}) from p(x_{n+1} | z_{1:N})   // by (38.26)
6      end for
7  end procedure

[Figure 38.1: Depiction of the first step of the EKF0 with 2-times IWP prior initialised at x_0 and the implied derivatives as in (38.15). Two columns of panels ("prediction step", "update step") show the true solution, samples, mean, and uncertainty for x(t), x'(t) and x''(t) (from the top to the bottom row). In the prediction step (left column), the predictive distribution p(x_1) is computed by extrapolating forward in time along the dynamic model. The samples can be thought of as different possibilities for the trajectory. Then, in the update step (right column), the predictive distribution is conditioned on x'(t_1) = f(m⁻_1) (recall that the intuitive ssm can be used in the Gaussian case). This yields the filtering distribution p(x_1 | z_1) whose samples are now restricted to those with a first derivative of f(m⁻_1) at t_1. The uncertainty (dashed line) is drawn at two standard deviations in both directions of the mean, thus capturing a 95% credible interval. Note the reduction of uncertainty after conditioning on z_1. The same procedure is then repeated for t_1 → t_2, that is all possible trajectories (samples) are predicted forward in time, and then restricted to x'(t_2) = f(m⁻_2) in the subsequent update step. (Second step not depicted.)]

frequency of data cannot be increased by shortening the step
size as in numerics - Gaussian methods are often used first.21
Thus, it is only natural that we will recommend Gaussian ODE
filters (and smoothers) as a standard choice - unless nonpara-
metric uncertainty quantification is especially relevant.22

21 Sarkka (2013), §13.1
22 See §38.6 for more details.

> 38.3.2 The Extended Kalman Filters and Smoothers

To define an ODE filter in the framework of Algorithm 38.1,
we (as explained above) only have to specify the prediction
and update step. Since our dynamic model Eq. (38.11) is linear-
Gaussian, all Gaussian ODE filters can, given a Gaussian

p(x_n | z_{1:n}) = N(x_n; m_n, P_n), (38.16)

compute an exact predictive step, written

p(x_{n+1} | z_{1:n}) = N(x_{n+1}; m⁻_{n+1}, P⁻_{n+1}), with (38.17)

m⁻_{n+1} = A(h_{n+1}) m_n, P⁻_{n+1} = A(h_{n+1}) P_n A(h_{n+1})ᵀ + Q(h_{n+1}).

The update step is, however, more complicated as it involves the
nonlinear mapping f (via g). To stay in the Gaussian family, we
thus have to approximate f by a simpler function. As is common,23
we use Taylor approximations around the predictive mean m⁻_{n+1}
which make the update tractable. We will restrict
our attention to the most important cases of a zeroth and first-
order Taylor approximation. The resulting methods are known
as the (zeroth and first-order) extended Kalman ODE filter - or,
abbreviated, EKF0 and EKF1.24

23 Sarkka (2013), §5.1
24 Tronarp et al. (2019), §2.4

In terms of specific computations, the EKF0 and EKF1 thus
perform the following approximate update step using the data
z_{n+1} = 0, where the difference between the EKF0 and EKF1 lies
in the choice of Ĥ:

ẑ_{n+1} := f(H_0 m⁻_{n+1}) − H m⁻_{n+1},   (innovation residual) (38.18)
S_{n+1} := Ĥ P⁻_{n+1} Ĥᵀ + R_{n+1},        (innovation cov.) (38.19)
K_{n+1} := P⁻_{n+1} Ĥᵀ S⁻¹_{n+1},          (gain) (38.20)
m_{n+1} := m⁻_{n+1} + K_{n+1} ẑ_{n+1},     (38.21)
P_{n+1} := (I − K_{n+1} Ĥ) P⁻_{n+1}.       (38.22)

The resulting (posterior) filtering distribution of this step is -
just like at the end of the previous step, Eq. (38.16) - a Gaus-
sian p(x_{n+1} | z_{1:n+1}) = N(x_{n+1}; m_{n+1}, P_{n+1}). Thus, these filters
never leave the Gaussian exponential family, which is why they
are instances of Gaussian ODE filters.
If Ĥ = H, this is the EKF0,25 which is depicted in Fig-
ure 38.1. This case can be thought of as an exact update af-
ter replacing f by the constant function ξ ↦ f(H_0 m⁻_{n+1}).26
If instead Ĥ = H − J_f(H_0 m⁻_{n+1}) H_0, where J_f denotes the Ja-
cobian matrix of f, this is the EKF1; it can be interpreted
as an exact update after replacing f by its linearisation ξ ↦
f(H_0 m⁻_{n+1}) + J_f(H_0 m⁻_{n+1})[ξ − H_0 m⁻_{n+1}].

25 Cf. Eqs. (5.12)-(5.13) to see the similarity of the EKF0 with the standard Kalman filter.
26 Tronarp et al. (2019), Proposition 2
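To make the predict-update recursion concrete, here is a minimal NumPy sketch of one step, Eqs. (38.17)-(38.22), for a scalar ODE (d = 1). The function names and the plain (non-square-root) linear algebra are our own simplifications; it omits the numerical-stability measures of §39.2.

import numpy as np

def ekf_step(f, Jf, m, P, A, Q, R=0.0, ekf1=True):
    """One EKF0/EKF1 step for d = 1, state [x, x', ..., x^(q)]."""
    D = len(m)
    H0 = np.eye(D)[0]                      # selects x,  Eq. (38.1)
    H = np.eye(D)[1]                       # selects x', Eq. (38.2)
    # prediction step, Eq. (38.17)
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # EKF1 linearises g around m_pred; EKF0 uses H itself
    Hhat = H - Jf(H0 @ m_pred) * H0 if ekf1 else H
    # update step on the data z_{n+1} = 0, Eqs. (38.18)-(38.22)
    z_hat = f(H0 @ m_pred) - H @ m_pred    # innovation residual
    S = Hhat @ P_pred @ Hhat + R           # innovation covariance (scalar)
    K = P_pred @ Hhat / S                  # gain
    m_new = m_pred + K * z_hat
    P_new = (np.eye(D) - np.outer(K, Hhat)) @ P_pred
    return m_new, P_new

# Logistic ODE with a once-integrated Wiener process prior (q = 1):
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
Q = np.array([[h**3 / 3, h**2 / 2], [h**2 / 2, h]])
f = lambda x: x * (1.0 - x)
Jf = lambda x: 1.0 - 2.0 * x
m, P = np.array([0.5, f(0.5)]), np.zeros((2, 2))
for _ in range(10):
    m, P = ekf_step(f, Jf, m, P, A, Q)

Setting ekf1=False reproduces the EKF0, i.e. the zeroth-order Taylor replacement of f described above.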
Note that the variance R can - despite its absence from
the likelihood Eq. (38.12) - be chosen positive because the ap-
proximations of f cause the likelihood-data pair to be inaccu-
rate.27 We, however, recommend using R = 0 as the default
choice.

27 In the same way, the impact of an only-approximately known ODE can also be captured by an increase in R > 0; see Kersting, Sullivan, and Hennig (2020, §2.3).

The so-computed filtering distributions p(x_n | z_{1:n}) = N(x_n;
m_n, P_n) of the EKF0 and EKF1 can then be extended to the full-
posterior (smoothing) distributions p(x_n | z_{1:N}) = N(x_n; m^S_n, P^S_n)
by the following computations:

G_n := P_n A(h_{n+1})ᵀ (P⁻_{n+1})⁻¹,              (gain) (38.23)
m^S_n := m_n + G_n (m^S_{n+1} − m⁻_{n+1}),        (38.24)
P^S_n := P_n + G_n (P^S_{n+1} − P⁻_{n+1}) G_nᵀ.   (38.25)

These backward-recursion equations are, simply, an exact Gaus-
sian execution of the well-known smoothing equation:28

p(x_n | z_{1:N}) = p(x_n | z_{1:n}) ∫ [ p(x_{n+1} | x_n) p(x_{n+1} | z_{1:N}) / p(x_{n+1} | z_{1:n}) ] dx_{n+1}, (38.26)

which (in our ssm) does not differ between the EKF0 and EKF1
because their dynamic model p(x_{n+1} | x_n) is the same.

28 We already provided this equation in Eq. (5.7).

With this in mind, we define the (zeroth and first-order)
extended Kalman ODE smoothers, EKS0 and EKS1,29 as the in-
stances of Algorithm 38.2 that employ the EKF0 or EKF1 in line 3
and then compute line 5 by Eqs. (38.23)-(38.25). The resulting
smoothing-posterior distributions p(x_n | z_{1:N}) = N(x_n; m^S_n, P^S_n)
can be extended beyond the time grid {t_n}_{n=0}^N by interpolation
along the dynamic model, Eq. (38.11), and therefore contain
the same information as the full GP posterior of Eqs. (4.7) and
(4.6).30

29 In some recent publications, the EKS0 and EKS1 are referred to as EK0 and EK1 - because smoothing has become the default (see §38.6).
30 This was also discussed above for generic Gaussian smoothers in §5.2.
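A sketch of this backward pass (again our own minimal NumPy, directly transcribing Eqs. (38.23)-(38.25)) takes the filtering and predictive moments already computed by the forward filter; the plain matrix inverse is for clarity only, cf. the stability discussion in §39.2.

import numpy as np

def smooth_backward(ms, Ps, ms_pred, Ps_pred, A):
    """Backward recursion of Eqs. (38.23)-(38.25).

    ms, Ps:           filtering means/covariances,  n = 0, ..., N
    ms_pred, Ps_pred: predictive means/covariances, n = 0, ..., N
                      (index n holds p(x_n | z_{1:n-1}))
    """
    N = len(ms) - 1
    ms_s, Ps_s = list(ms), list(Ps)        # at n = N, filter = smoother
    for n in range(N - 1, -1, -1):
        G = Ps[n] @ A.T @ np.linalg.inv(Ps_pred[n + 1])             # (38.23)
        ms_s[n] = ms[n] + G @ (ms_s[n + 1] - ms_pred[n + 1])        # (38.24)
        Ps_s[n] = Ps[n] + G @ (Ps_s[n + 1] - Ps_pred[n + 1]) @ G.T  # (38.25)
    return ms_s, Ps_s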
Remark (Relation to Bayesian quadrature). Before introducing
more ODE filters, let us briefly clarify the relation to Bayesian quadra-
ture (BQ) - namely that the EKF0/EKS0 is a generalisation of
BQ in the following sense: if the ODE is really just an integral
(i.e. x'(t) = g(t)), then its solution is given by

x(t) = x_0 + ∫_0^t g(s) ds. (38.27)

Thus, computing x(t) by approximating the integral in Eq. (38.27)
with the Kalman-filter version of BQ (Algorithm 11.2 from §11.2) is
equivalent to solving the ODE

x'(s) = g(s), s ∈ [0, t], with initial value x(0) = x_0,

by the EKF0 or EKS0.31

31 See Proposition 1 in Tronarp et al. (2019) for the details of this equivalence.

Exercise 38.3 (Recognise the shared principle). Recall gradient
filtering for optimisation from Eq. (28.8). Prove that this gradient
filtering coincides with the EKF0, when applied to the gradient-flow
ODE x'(t) = ∇f(t), by showing that the filtering means m_n and
covariance matrices P_n are the same for all n = 0, ..., N.

> 38.3.3 Iterated Smoothers for MAP Inference

In the last section, we presented the extended Kalman ODE


filters and smoothers (EKF0/1 and EKS0/1) as the standard
method for approximate Gaussian inference of ODE solutions.
But almost nothing can be said about the true non-Gaussian
posterior that these Gaussians approximate, and therefore we
will (in the next section) introduce the particle ODE filter which

outputs a Monte-Carlo representation of the true posterior. Since
particle filtering comes at significantly higher cost than Gaussian
filtering, we present in this section how (as a compromise) one
can approximate the maximum a posteriori (MAP) estimate
by another Gaussian ODE smoother. This is a compromise
because the MAP estimate is the most likely value under the
true posterior, i.e. the most likely sample trajectory computed
by the particle ODE filter. Unlike the previous filtering and
smoothing mean, it is defined as the solution of the so-called
global MAP problem, a non-convex optimisation problem which
consists of maximising the posterior p(x⃗(t) | z_{1:N} = 0).32 Let
us denote this MAP estimate by

x⃗*(t) := arg min_{x⃗(t)} [ −log p(x⃗(t_{0:N}) | z_{1:N} = 0) ]. (38.28)

32 See §2.3 in Tronarp, Sarkka, and Hennig (2021). Also note that, according to Proposition 3 in the same paper, Eq. (38.29) is equivalent to optimisation in an rkhs.

Since our prior (38.6) is Markovian, its value at the discretisation
{t_i}_{i=0}^N can be written as

x⃗*(t_{0:N}) = arg min_{x⃗(t_{0:N})} [ ‖x⃗(t_0) − m_0‖²_{P_0} + ∑_{n=1}^N ‖x⃗(t_n) − A(h_n) x⃗(t_{n−1})‖²_{Q(h_n)} ], (38.29)

subject to z_{1:N} = 0. Here, for a fixed positive definite matrix P,
‖x‖²_P := xᵀ P⁻¹ x is the squared Mahalanobis norm associated with P. By
use of Eq. (38.1), we can extract a global MAP estimate

x*(t) := H_0 x⃗*(t), (38.30)

which includes the discrete-time MAP estimate x*(t_{0:N}) :=
H_0 x⃗*(t_{0:N}). While x*(t_{0:N}) is not directly related to the EKS0
H0 x- *(10:N). While x* (10:N) is not directly related to the EKS0
or EKS1, it can be approximated by more involved Gaussian 33 Bell (1994)
filters and smoothers: namely by the iterated extended Kalman
smoother (IEKS).33 The corresponding filter is the iterated ex­
tended Kalman filter (IEKF).34 These methods iterate the EKS1 34 Bell and Cathey (1993)
and EKF1, respectively, as we will see next.35 35In fact, we could also call them IEKF1
and IEKS1, but it is not necessary to dis­
Recall from Eq. (38.21) how the EKF1 computes the filtering
tinguish them from other iterated meth­
mean mn+1 from the predictive mean mn-+1, and recall from ods here.
Eq. (38.24) how the EKS1 computes the smoothing mean mS n
from mn .The IEKF and IEKS now simply iterate these computa­
tions, but with a re-linearisation around the new estimate. That
is, they linearise f around mn+1 and mS n and then repeat the
computations of Eqs. (38.21) and (38.24) with this new lineari­
sation. This procedure is then iterated until convergence toa
fixed point. The hope is that these fixed points will be related to
the MAP estimate: the IEKF, indeed, computes the local MAP
estimate (i.e. for one isolated step in time). The IEKS, on the
other hand, will (under some additional conditions) converge

to a local minimum of the non-convex global MAP problem. Al-
though this local minimum is not always a global minimum, the
IEKS is often considered to be a suitable estimator of the global
MAP estimate. It is, therefore, the probabilistic ODE solver
that approximates the true posterior best, while maintaining
fast Gaussian computations without sampling. More details
on the IEKF/IEKS and the MAP problem can be found in the
literature.36

36 Tronarp, Sarkka, and Hennig (2021), §3.2

We will return to this link between the IEKS and the global
MAP estimate in §39.1.1 which contains convergence rates for
the MAP estimate (and thus for the IEKS, if its fixed point is
the global MAP).

> 38.3.4 A Linear-Gaussian SSM for the EKF0

The above ssm (38.10)-(38.13) is a general and rigorous Bayesian
model for ODEs - and it is sufficient to understand it, hence this
section is optional reading. But since the general ssm is only known
since its introduction by Tronarp et al. (2019), the EKF0 (and
by extension also the EKS0) had been introduced before in
a different ssm - under the names “Kalman ODE filter”37 and
“Gaussian ODE filter”.38 This older ssm dealt with the necessary
approximations due to nonlinearity from f in a fundamentally
different way: instead of considering them as a part of the
inference method, the approximations were included in the ssm
itself - which is thus linear-Gaussian so that exact inference is
possible.

37 Schober, Sarkka, and Hennig (2019); Kersting and Hennig (2016).
38 Kersting, Sullivan, and Hennig (2020)

This is conceptually somewhat vague because approxima-
tions are (strictly speaking) part of the method, and not the
model. Nonetheless, it matches the common statistical practice
of modelling nonlinear or non-Gaussian systems with linear-
Gaussian ones, to enable faster methods. Moreover - as we will
explain next - it resembles the classical logic where ODE solvers
treat evaluations f(x̂(t)) as data on x'(t); see Eq. (37.1).
For the EKF0, this linear-Gaussian ssm is given by

p(x_0) = N(x_0; m_0, P_0), (38.31)
p(x_{n+1} | x_n) = N(x_{n+1}; A(h_{n+1}) x_n, Q(h_{n+1})),
p(y_n | x_n) = N(y_n; H x_n, R_n), (38.32)
with data y_n = f(H_0 m⁻_n), (38.33)

where m⁻_n is the mean of the predictive distribution p(x_n |
y_{1:n−1}). Since H_0 m⁻_n is the most likely estimate for x(t_n) given
y_{1:n−1}, the analogy between this y_n and the data f(x(t_n)) of

classical solvers from Eq. (37.7) should be evident. The likeli-
hood (38.32) resembles a probabilistic version of the (implicitly-
used) classical Dirac likelihood p(f(x(t_n)) | x(t_n)) = δ(x'(t_n) −
f(x(t_n))), where y_n and H m⁻_n play the roles of f(x(t_n)) and
x'(t_n) respectively.
Next, we will give an intuitive derivation of Eqs. (38.32) and
(38.33) for some given n ∈ {0, ..., N}. In a recursive manner,
we assume that the filtering distribution p(x_{n−1} | y_{1:n−1}) is a
Gaussian N(x_{n−1}; m_{n−1}, P_{n−1}) with mean m_{n−1} and covariance
P_{n−1}. Then, integration over x_{n−1} yields

p(x_n | y_{1:n−1}) = ∫ p(x_n | x_{n−1}) p(x_{n−1} | y_{1:n−1}) dx_{n−1} = N(x_n; m⁻_n, P⁻_n),

with predictive mean m⁻_n = A(h_n) m_{n−1} and covariance P⁻_n =
A(h_n) P_{n−1} A(h_n)ᵀ + Q(h_n). Recall that, by Eq. (38.3), the predic-
tive normal distributions over x(t_n) and x'(t_n) are now given
by

p(x(t_n) | y_{1:n−1}) = N(x(t_n); H_0 m⁻_n, H_0 P⁻_n H_0ᵀ), and
p(x'(t_n) | y_{1:n−1}) = N(x'(t_n); H m⁻_n, H P⁻_n Hᵀ). (38.34)

To generate data, we observe that - as for the classical solvers -
an estimate of x'(t_n) given the information y_{1:n−1} can be ex-
tracted by mapping our estimate of x(t_n) through f. This results in the non-Gaussian
pushforward measure f_*(p(x(t_n) | y_{1:n−1})). To ensure closed-
form Gaussian computations, we approximate it by moment-
matching which yields

f_*(p(x(t_n) | y_{1:n−1})) ≈ N(y_n; μ_n, V_n), with moments

μ_n := ∫ f(ξ) dN(ξ; H_0 m⁻_n, H_0 P⁻_n H_0ᵀ), and (38.35)

V_n := ∫ [f(ξ) − μ_n][f(ξ) − μ_n]ᵀ dN(ξ; H_0 m⁻_n, H_0 P⁻_n H_0ᵀ). (38.36)

Unfortunately, these integrals have to be numerically approxi-
mated by quadrature, i.e. by a weighted sum of evaluations as
in Eq. (10.8). To match the speed of standard classical solvers
which evaluate f only once for every t_n, we use only a single
evaluation for these integrals, namely the most-informative one
at the mean m⁻_n. This gives us the data y_n = f(H_0 m⁻_n). For the
likelihood, we simply insert the mean H m⁻_n of (38.34) (i.e. the ex-
pected derivative) as the expectation, and R_n ≥ 0 as the variance
which is supposed to capture the missing uncertainties - from
not computing the integrals in (38.35) and (38.36), as well as
from the predictive covariance H P⁻_n Hᵀ on x'(t_n). Note, however,

that the standard choice is R_n = 0, which ignores these uncer-
tainties. It is standard because it reproduces classical methods,
as Schober, Sarkka, and Hennig (2019) demonstrated.39 In view
of the uncertainty-unawareness of classical solvers (discussed
in §37) this is unsurprising.40

39 See §39.3 for these connections with classical methods.
40 To make the SSM (38.31)-(38.33) more uncertainty-aware, Kersting and Hennig (2016) proposed to capture the uncertainty from the quadrature approximation of (38.35) and (38.36) by Bayesian quadrature and add it to R_n in (38.32).

As the above construction shows, the linear-Gaussian ssm
(38.31)-(38.33) can be thought of as an application of GP re-
gression with data on the derivative41 to ODEs. It is related to
the EKF0 on the general nonlinear ssm (38.10)-(38.13) in that it
already includes the zeroth-order Taylor expansion of f in its
model. More precisely, Kalman filtering on the linear-Gaussian
ssm (38.31)-(38.33) with R_n = 0 is the same algorithm as the EKF0 -
as can easily be seen by comparing the Kalman-filtering update
step with Eqs. (38.18)-(38.22) in the EKF0 case (Ĥ = H).

41 Solak et al. (2002)

Nevertheless, this alternative ssm is not a rigorous model
since the data y_n - as in classical numerics; see Eq. (37.1) -
depends via m⁻_n on all previous computations (and thereby on
the prior, the likelihood, and previous data y_{1:n−1}).42 While
it is useful for intuition, the nonlinear ssm (38.10)-(38.13) has
become the standard and is used in the rest of this text.

42 In fact, Wang, Cockayne, and Oates (2018) showed that, at the point of their writing, no rigorous Bayesian ODE solver existed in that they do not satisfy the criteria specified by Cockayne et al. (2019b). This was, however, before the introduction of the rigorous ssm (38.10)-(38.13), which has not been examined in this regard.

Remark (Terminology). As noted above, the existing literature using
the Gaussian ssm (38.31)-(38.33) has different names for the EKF0
and EKF1. There, the terms “Kalman ODE filter” and “extended
Kalman ODE filter” refer to the EKF0 and EKF1 respectively.

► 38.4 Particle ODE Filters and Smoothers

As mentioned above, the update step (i.e. the conditioning on
the nonlinear ODE) renders all Gaussian posteriors inexact.
Fortunately, the general ssm (38.10)-(38.13) can accommodate
all filters and smoothers, whether Gaussian or not, in the form of
Algorithms 38.1 and 38.2. As the true posterior is nonparametric
for almost all ODEs, it can only be captured by a step-wise
sequence of Monte Carlo approximations at the discrete time
points {t_i}_{i=1}^N, i.e. by use of so-called sequential Monte Carlo
(SMC) methods or particle filters (and smoothers). As they are
only used here in this - otherwise Gaussian - book, we will now,
to give intuition, introduce them in a very dense way, tailored
for the context of this chapter.43

43 See e.g. Doucet, Freitas, and Gordon (2001) for a general introduction. A detailed collection of popular particle filters (and smoothers) is provided by Chapters 7-11 of Sarkka (2013). For ODEs, the original presentation is to be found in §2.7 of Tronarp et al. (2019).

To this end, we first recall from Algorithm 38.1 that there
are (parameter and model adaptation aside) only two com-
putational steps in ODE filtering: computing p(x_{n+1} | z_{1:n})
from p(x_n | z_{1:n}) (“predict”), and computing p(x_{n+1} | z_{1:n+1})

(“update”) from p(x_{n+1} | z_{1:n}). The predictive distribution
p(x_{n+1} | z_{1:n}) linking both steps is thus only a means to ob-
tain the filtering distribution p(x_{n+1} | z_{1:n+1}). In the Gaussian
case, this intermediate p(x_{n+1} | z_{1:n}) was Gaussian, which en-
sured that the updated p(x_{n+1} | z_{1:n+1}) was Gaussian as well.
In the non-Gaussian case, however, all three of these distribu-
tions are nonparametric and have to be represented by sets
of (weighted) samples. While p(x_{n+1} | z_{1:n}) cannot be used to
compute p(x_{n+1} | z_{1:n+1}) in closed form, it is still necessary to
provide a distribution that serves as a bridge from p(x_n | z_{1:n})
to p(x_{n+1} | z_{1:n+1}).
In other words, a proposal distribution (aka importance distri-
bution) π(x_{n+1} | x_n, z_{1:n+1}) is called for that is informed by
p(x_{n+1} | z_{1:n}) and serves as a “best guess” for p(x_{n+1} | z_{1:n+1}),
given a value (sample) of x_n. In a way, the predictive distribu-
tion p(x_{n+1} | z_{1:n}) is nothing but a computationally convenient
proposal distribution that allows for computing closed-form
Gaussian updates without sampling. An even more general
description of the ODE filter than the one in Algorithm 38.1
would therefore replace p(x_{n+1} | z_{1:n}) with a generic proposal
π(x_{n+1} | x_n, z_{1:n+1}) in lines 6 and 8. Then, the prediction step
can be thought of as the construction of the proposal distri-
bution, with the help of Eq. (38.11), and the update step as
importance sampling with this proposal. We will now detail
how such an update by importance sampling works.
Given a sampling representation of the preceding filtering
distribution, written

p(x_n | z_{1:n}) ≈ ∑_{i=1}^M w_n^(i) δ(x_n − x_n^(i)), (38.37)

we can, for any proposal distribution π, construct a sampling
representation for the new time point, written

p(x_{n+1} | z_{1:n+1}) ≈ ∑_{i=1}^M w_{n+1}^(i) δ(x_{n+1} − x_{n+1}^(i)), (38.38)

as follows. First, we observe by Bayes’ theorem that

p(x_{n+1} | x_n^(i), z_{1:n+1}) = p(x_{n+1} | z_{n+1}, x_n^(i), z_{1:n})
  ∝ p(z_{n+1} | x_{n+1}, x_n^(i), z_{1:n}) p(x_{n+1} | x_n^(i), z_{1:n}) (38.39)
  = [ p(z_{n+1} | x_{n+1}) p(x_{n+1} | x_n^(i)) / π(x_{n+1} | x_n^(i), z_{1:n+1}) ] π(x_{n+1} | x_n^(i), z_{1:n+1}),

where we used the Markov properties of the ssm (see Figure 5.1)
in the last step. Now, the importance-sampling step follows. We

draw a set of samples from the proposal distribution

x_{n+1}^(i) ~ π(x_{n+1} | x_n^(i), z_{1:n+1}), for all i = 1, ..., M,

whose respective (normalised) weights are, by virtue of Eq. (38.39),
recursively given by44

w_{n+1}^(i) ∝ w_n^(i) · p(z_{n+1} | x_{n+1}^(i)) p(x_{n+1}^(i) | x_n^(i)) / π(x_{n+1}^(i) | x_n^(i), z_{1:n+1}). (38.40)

44 Note that the recursion (38.40) is only valid under the modest additional assumption that π(x_{0:n+1} | z_{1:n+1}) = π(x_{n+1} | x_{0:n}, z_{1:n+1}) π(x_{0:n} | z_{1:n}); see Eq. (7.19) in Sarkka (2013).

The desired Monte-Carlo approximation (38.38) is thereby com-
pletely specified. This process of sequentially moving from
Eq. (38.37) to Eq. (38.38) on its own forms - when combined
with an initialisation x_0^(i) ~ p(x_0), w_0^(i) = 1/M - a well-defined
filter known as sequential importance sampling (SIS). To keep
the weights from “degenerating” to (nearly) zero, it is necessary to
regularly resample the particles from the sampling distribu-
tion (38.38), which resets the weights to 1/M.45 The addition
of this resampling to SIS completes the definition of the par-
ticle filter.46 When applied to ODEs, such a particle ODE filter
indeed (as we hoped) approximates the true nonparametric
posterior p(x_n | z_{1:n}) with the standard Monte-Carlo error-rate
of O(M^{−1/2}).47 Figure 38.2 depicts how a particle ODE filter
with 30 particles accurately captures a bifurcation.48

45 Sarkka (2013), §7.4
46 Sarkka (2013), §7.5
47 Tronarp et al. (2019), Thm. 1
48 Due to the resampling step, the particles tend to form clusters. Hence, the number of particles appears to decrease in time t, although it actually remains constant. A close look reveals that these clusters also branch out sometimes.

Whether applied to ODEs or not, the performance of a particle
filter depends - along with the modelling and approximation
choices already present in Gaussian filters - on the design of
the resampling and, most importantly, of the proposal distri-
bution π. For instance, simply setting π equal to the dynamic
model p(x_{n+1} | x_n) yields the bootstrap filter. When more com-
putational overhead is admissible, one can also compute closer
approximations of the optimal proposal, i.e. of p(x_n | x_{n−1}, z_n),
by use of the posterior of any Gaussian filter.49 As in the Gaus-
sian case, there are canonical ways to extend any particle ODE
filter to a particle ODE smoother.50

49 See Eqs. (24) and (25) in Tronarp et al. (2019).
50 Sarkka (2013), §11
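For concreteness, here is a minimal bootstrap particle ODE filter in NumPy (our own sketch, d = 1, and with a once-integrated Wiener process prior in the usage example instead of the 2-times IWP of Figure 38.2). Following note 5, it uses a small positive measurement variance R so that the weights under the data z_n = 0 are well-defined.

import numpy as np

def bootstrap_ode_filter(f, m0, P0, A, Q, N, M=100, R=1e-4, seed=0):
    """Bootstrap particle ODE filter (d = 1): proposal = dynamic model
    (38.11), weights from the relaxed likelihood N(0; g(x), R) of note 5,
    multinomial resampling after every step."""
    rng = np.random.default_rng(seed)
    D = len(m0)
    H0, H = np.eye(D)[0], np.eye(D)[1]
    X = rng.multivariate_normal(m0, P0, size=M)        # x_0^(i) ~ p(x_0)
    states = [X.copy()]
    for _ in range(N):
        # propose from the dynamic model, Eq. (38.11)
        X = X @ A.T + rng.multivariate_normal(np.zeros(D), Q, size=M)
        # weight with the data z = 0 under N(z; g(x), R)
        misalignment = X @ H - f(X @ H0)               # g(x) = Hx - f(H0 x)
        logw = -0.5 * misalignment**2 / R
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # resample (resets the weights to 1/M)
        X = X[rng.choice(M, size=M, p=w)]
        states.append(X.copy())
    return states

# Bernoulli ODE of Figure 38.2:
h = 0.4
A = np.array([[1.0, h], [0.0, 1.0]])
Q = np.array([[h**3 / 3, h**2 / 2], [h**2 / 2, h]])
f = lambda x: 1.25 * x * (1.0 - np.abs(x))
m0 = np.array([0.05, f(0.05)])
P0 = np.diag([0.25**2, 1e-10])
particles = bootstrap_ode_filter(f, m0, P0, A, Q, N=25, M=30)

Plotting the returned particle clouds over time reproduces the bimodal branching towards the attractors at ±1 described in Figure 38.2.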

► 38.5 Uncertainty Calibration and Step-Size Adaptation

We have now concluded the high-level introduction of ODE
filters and smoothers (Algorithms 38.1 and 38.2), and turn our
attention to the remaining details, namely the uncertainty cali-
bration and step-size adaptation in lines 4 and 5 of Algorithm
38.1. As we will see, the step-size selection will naturally follow
from the calibration of σ² via local error estimates. Like all pre-
vious publications on this topic, we only consider the EKF0 and
EKF1.51

51 Since smoothers inherit the uncertainty calibration and step sizes from their underlying filters, they do not require an additional treatment. All statements on filters in this section also hold for the corresponding smoothers.

[Figure 38.2: Bifurcation detection by use of particle ODE filtering; left column: “bifurcating ODE flow”, right column: “particle-filtering representation”. We consider the Bernoulli ODE x'(t) = r x(t)(1 − |x(t)|), with rate r = 1.25. Its flow bifurcates at 0, and then asymptotically concentrates at the attractors at ±1. We assume that we don't know which side of this bifurcation we are on at t = 0; more precisely, we set p(x_0) = N(x_0; 0.05, 0.25²). This incomplete knowledge can stem either from an unknown initial value or from numerical inaccuracy from previous computations for t < 0. Accordingly, in the notation of Eq. (37.2), the true distribution for each t > 0 is the push-forward measure [Φ^t]_*(p(x_0)), which we simulate in the left column. For a growing t, this is an increasingly bimodal distribution at ±1 with vanishing derivative x'(t). This non-Gaussian distribution is nicely captured by the particle ODE filter (right column). Here, we employed a bootstrap ODE filter with 30 particles, a 2-times IWP prior and step size h = 0.4. Note that a Gaussian (Kalman) ODE filter (just like any classical ODE solver) would only capture the middle sample solution of the left column and thus miss the bifurcation completely; cf. Figure 38.1.]
> 38.5.1 Global Uncertainty Calibration

The validity of the posterior distributions does not only de-
pend on the convergence rates of the mean (see §39.1.1 and
§39.1.2), but also on the width of the (co-)variance. This pos-
terior variance scales (in the Gaussian case: linearly) with the
prior variance σ² - as can be seen from Eq. (4.7) for general GP
regression.
The standard estimator for σ² is (absent prior knowledge)
the maximum-likelihood (ML) estimator, which maximises the
likelihood

p(z_{1:N} | σ²) = p(z_1 | σ²) ∏_{n=2}^N p(z_n | z_{1:n−1}, σ²).

Computing the factors p(z_n | z_{1:n−1}, σ²) exactly is, however, as
expensive as solving the ODE. Approximations, such as the
ones from Chapter 12 in Sarkka (2013), are therefore necessary.
Fortunately, Tronarp et al. (2019) provided an elegant quasi-
ML estimator for the EKF0 and EKF1, written

σ̂² = (1/N) ∑_{n=1}^N ẑ_nᵀ [S̃_n]⁻¹ ẑ_n, (38.41)

where S̃_n is the innovation covariance S_n of Eq. (38.19) if we set
σ := 1.52 Since σ̂² is based on the EKF0/EKF1 likelihood ap-
proximations, it is strictly speaking only a quasi-ML estimate.53
But, for the same reason, it comes at almost no additional cost,
and can be efficiently used to calibrate the width of the posterior.
We refer to Bosch, Hennig, and Tronarp (2021) for more details.

52 A multidimensional generalisation for d ≥ 2 was added by Bosch, Hennig, and Tronarp (2021), in the case of the EKF0.
53 Lindstrom, Madsen, and Nielsen (2015), §5.4.2

For particle filters, the adaptation of σ² is more difficult and
no simple estimator has been proposed. There is, however, a vast
literature on this topic that is summarised in §4.4 of Tronarp
et al. (2019).
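In code, the estimator (38.41) is a one-liner on top of the residuals and (unit-scale) innovation covariances that the filter computes anyway; the sketch below (ours, scalar measurements) also notes how the calibrated σ̂² rescales the posterior.

import numpy as np

def quasi_ml_sigma_sq(residuals, innovation_covs_unit):
    """Quasi-ML estimate of Eq. (38.41) for d = 1: residuals[n] is
    z_hat_n from Eq. (38.18), innovation_covs_unit[n] is S_n computed
    with sigma := 1."""
    z = np.asarray(residuals)
    S = np.asarray(innovation_covs_unit)
    return np.mean(z**2 / S)

# Since the Gaussian posterior scales linearly with sigma^2, calibration
# amounts to P_n <- sigma_sq_hat * P_n for all unit-scale covariances P_n.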

> 38.5.2 Local Uncertainty Calibration

A local version of the global Eq. (38.41), i.e. a quasi-ML estimate
for a single step of the EKF0, was already proposed earlier:54

σ̂_n² = ẑ_nᵀ [H Q̃(h_n) Hᵀ]⁻¹ ẑ_n, (38.42)

where Q̃(h_n) is the matrix Q(h_n) if we set σ² = 1. A multivari-
ate extension is also available.55 For any n ∈ {1, ..., N}, this
local estimate σ̂_n² is designed to capture the added numerical
uncertainty in the step t_{n−1} → t_n of size h_n. While it ignores
the accumulated uncertainty from previous steps, it lends itself
perfectly to local error estimation.

54 Schober, Sarkka, and Hennig (2019)
55 Bosch, Hennig, and Tronarp (2021)

> 38.5.3 Local Error Estimation

In classical numerics, the difference between the extrapolations
of a solver and a less-accurate one is commonly used as a local
error estimate. For example, the canonical Dormand-Prince
method56 (aka ode45 in Matlab) is a fifth-order Runge-Kutta
method whose error estimates are obtained by comparison with
a fourth-order RK method.

56 Dormand and Prince (1980)

While the EKF0/EKF1 does not provide such a comparison,
its probabilistic nature instead provides expected local errors in
the form of the standard deviations of the predictive distribution
p(x_n | z_{1:n−1}). More formally, in every step (t_{n−1} → t_n) of size
h_n, the covariance matrix of the expected additional error is,
by Eq. (38.11), Q̂(h_n), i.e. Q(h_n) with the calibrated σ̂_n from Eq. (38.42).
From the entries of this Q̂(h_n), one can now construct local
error estimates. At first glance, it might seem most natural to
use √(H_0 Q̂(h_n) H_0ᵀ), the expected additional error on x(t_n), as
such an error estimate. While this is indeed a valid choice, the
relevant publications57 recommend

D(h_n) := √(Ĥ Q̂(h_n) Ĥᵀ)

instead, with Ĥ as in Eq. (38.19). Indeed, this D(h_n) is arguably
better-calibrated because it is the standard deviation of the additional
residual ẑ_n in step (t_{n−1} → t_n), i.e. of the very quantity used
to calibrate σ̂² in Eq. (38.42). But regardless of the specifics of
its construction, this cheap probabilistic error estimate D(h_n) can
now be used in place of a classical error estimate in existing
step-size control methods.

57 Schober, Sarkka, and Hennig (2019); Bosch, Hennig, and Tronarp (2021).

> 38.5.4 Step-Size Selection

Step-size selection in ODE solvers aims to find step sizes that
are as large as possible (to limit the computational cost), and
as small as necessary (to limit the numerical error). To this end,
one usually sets a local error tolerance ε > 0 as an upper bound.
If indeed D(h_n) ≤ ε, the step h_n is accepted. If not, the step size
is decreased in a pre-specified way, and the decreased step is
tested in the same way - until a sufficiently small step is found.
This step-size selection scheme can be executed in differ-
ent ways. The most canonical way is the proportional control
algorithm,58 which employs the new reduced step size

h_new = h_n · ρ · (ε / D(h_n))^{1/(q+1)},

where q + 1 ∈ N is the local convergence rate (see Theorem 39.2)
and ρ ∈ (0, 1] is a safety factor. Other reduction methods (as well
as more details) are mentioned in the relevant publications.59
To date, step-size control has only been introduced for the EKF0
and EKF1.

58 Hairer, Nørsett, and Wanner (1993), §II.4
59 Schober, Sarkka, and Hennig (2019); Bosch, Hennig, and Tronarp (2021).

Another approach to local error estimation and step-size
adaptation in SSMs for probabilistic differential equation solvers was
derived from Bayesian statistical design60 for the perturbative
method by Chkrebtii et al. (2016).

60 Chkrebtii and Campbell (2019)
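A sketch of this accept/shrink loop is below (ours; the hypothetical error_estimate callback stands for computing D(h) via Eq. (38.42) and the definition above).

def controlled_step(error_estimate, h, eps, q, rho=0.9, h_min=1e-12):
    """Proportional step-size control: shrink h until D(h) <= eps.

    error_estimate(h) returns the local error estimate D(h);
    q + 1 is the local convergence rate of Theorem 39.2.
    """
    while h > h_min:
        D_h = error_estimate(h)
        if D_h <= eps:
            return h                                      # step accepted
        h = h * rho * (eps / D_h) ** (1.0 / (q + 1))      # proportional control
    raise RuntimeError("step size underflow")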

► 38.6 Which ODE Filter/Smoother Should I Choose?

We conclude our exposition of ODE filters and smoothers with
a brief recommendation on how to choose from them. We are
here concerned with the choice of method once an ssm has been
decided on.61 It is instructive to compare this section with §13.1
in Sarkka (2013) where the same question is considered for
filtering/smoothing with regular data.

61 The choice of prior was discussed in §38.2.
Which ODE filter/smoother should I choose? Short answer:
We recommend the EKS1. The reasons for this recommendation
are the following: First of all, the Gaussian filters and smoothers
are much faster and more stable than the non-Gaussian ones
(particle filtering). Among the Gaussian ones, the first-order
versions (EKF1 and EKS1) make use of the Jacobian of f (avail-
able by automatic differentiation).62 This tends to produce a
more precise mean with better-calibrated uncertainty. Moreover,
smoothing returns (unlike filtering) the full GP posterior distri-
bution which exploits the whole data set z_{1:N} along the entire
time axis [0, T] - while maintaining the O(N) complexity of fil-
tering, both in the number of steps and of function evaluations.
Therefore the EKS1 is, altogether, our default recommendation.

62 Griewank and Walther (2008), §13
But a longer answer would also involve other methods. As
a first alternative to the EKS1, both the EKF1 and the EKS0
recommend themselves. The EKF1 omits the smoothing pass
(38.23)-(38.25) backwards through time. It is therefore a bit
cheaper, i.e. its cost O(N) has a smaller constant. This can, e.g.,
be advantageous when only the distribution at the final time T
(where the filtering and smoothing distributions coincide) is of
interest.
The EKS0, on the other hand, does not require the Jacobian.
Compared with the EKS1, this again reduces the constant in the
O(N) cost. The Jacobian is beneficial to solve stiff ODEs and to
calibrate the posterior uncertainty accurately. But when rough
uncertainty estimates suffice, the EKS0 is an attractive cheaper
alternative for non-stiff ODEs.
Lastly, the EKF0 combines both of the modifications of the
EKF1 and EKS0, with respect to the EKS1. It is thus appropriate
for the intersection of cases where the EKF1 and EKS0 are
suitable.
The other above-mentioned ODE filters and smoothers are
more expensive and trickier to implement efficiently. Hence, we
recommend to only consider them in very specific cases. For
instance, if the MAP estimate is desired, the IEKS is best suited
to compute it. The particle ODE filter should only be used when
capturing non-Gaussian structures is crucial. It is thus not really
an alternative to Gaussian ODE filters and smoothers, but rather
to the perturbative solvers of §40.
Efficient implementations of our recommended choice (the
EKS1), and its next best alternatives (EKF1, EKS0, EKF0) are
readily available in the ProbNum package.

► 38.7 Implementation in the ProbNum Package


The ProbNum package63 contains efficient implementations of
ODE filters and smoothers in Python. New algorithmic im-
provements are continuously added. At the time of writing, it
contains the EKF0, the EKF1 and the unscented Kalman ODE
filter - as well as the corresponding smoothers: EKS0, EKS1, and
the unscented Kalman ODE smoother. The efficiency of the ProbNum
implementation is ensured by the recent improvements in nu-
merical stability (§39.2) and step-size control (§38.5.4) which
were published alongside experimental demonstrations of their
practical utility.64

63 Code at probnum.org. See the corresponding publication by Wenger et al. (2021).
64 Kramer and Hennig (2020); Bosch, Hennig, and Tronarp (2021).
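To close, a usage sketch. The call below follows the probsolve_ivp interface as documented for ProbNum 0.1 at probnum.org; treat the exact signature and attribute names as assumptions and consult the current documentation, since the package evolves.

import numpy as np
from probnum.diffeq import probsolve_ivp

# Logistic ODE x'(t) = 4 x (1 - x), solved with the EK1
# (the EKS1 of this text) and a 3-times IWP prior.
def f(t, y):
    return 4.0 * y * (1.0 - y)

sol = probsolve_ivp(
    f, t0=0.0, tmax=2.0, y0=np.array([0.15]),
    method="EK1", algo_order=3, adaptive=True, atol=1e-4, rtol=1e-4,
)
print(sol.states[-1].mean)  # posterior mean at tmax (attribute name assumed)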
39
Theory of ODE Filters and Smoothers

We will now focus on the theoretical properties of ODE fil-
ters and smoothers. In classical numerics, there are two main
analytical desiderata: convergence rates and numerical stability.
The convergence rates describe how quickly the numerical er-
ror ‖x̂ − x‖ converges to zero when the step size h goes to zero,
i.e. when the invested computational budget goes to infinity. As
we explained in §37, these rates are usually polynomial since
the (polynomial) Taylor expansion is, absent very specific restric-
tions, the optimal extrapolation. Accordingly, the convergence
rates of ODE filters and smoothers known today are also poly-
nomial; they are presented in §39.1.1 and §39.1.2. Notably, in
two specific settings, equivalence with classical methods gives
even higher polynomial convergence rates; see §39.3. Like all
probabilistic solvers, ODE filters raise the additional question of
their uncertainty (variance) calibration; §39.1.1 contains a result
relating thereto.
While the convergence rates capture the behaviour of an
ODE solver for “sufficiently small” step sizes h, it might exhibit
inconvenient behaviour for practical step sizes h > 0. In par-
ticular, the numerical solution x̂ can exhibit rapid variations
although the analytical solution x does not - a phenomenon
known as numerical instability. This usually happens in so-called
stiff ODEs,1 which are hard to define rigorously, but usually
contain some term in f that can cause rapid variations in x̂. In
§39.2, we will present a first A-stability result (for the EKF1)
as well as recent algorithmic improvements that increase the
stability of the necessary numerical linear algebra (for the EKF0
and EKF1). Note that we here employ a wider definition of
numerical stability (that includes its meaning in linear algebra)
than normally used for classical ODE solvers, which do not rely
on numerical linear algebra to the same extent.

1 Hairer and Wanner (1996)


Finally we will detail in §39.3 some links between ODE filters
and classical methods, namely how the EKF0 is connected to
the trapezoidal rule and Nordsieck methods. Throughout §39,
we will simplify the notation by assuming w.l.o.g. that the step
size h > 0 is constant.

► 39.1 Convergence Rates

ODE filters and smoothers apply techniques from signal pro-
cessing (or scattered-data approximation) to numerical analysis.
Accordingly, their convergence rates can be analysed with ei-
ther the tools of classical numerical analysis2 (i.e. by Gronwall's
inequalities) or of scattered-data approximation3 (i.e. by opti-
misation in an rkhs) - which we explore in §39.1.1 and §39.1.2
respectively. Both approaches yield similar, but distinct results:
namely global convergence rates of h^q if q derivatives are mod-
elled, but under different assumptions and restrictions.

2 Kersting, Sullivan, and Hennig (2020)
3 Tronarp, Sarkka, and Hennig (2021)

> 39.1.1 Classical Convergence Analysis

In this section, we follow Kersting, Sullivan, and Hennig (2020) in executing a classical convergence analysis of order q ∈ N. To this end, it is, just as for Runge-Kutta methods,4 necessary to ensure that q derivatives exist and that the remainder of the qth-order Taylor expansion of x is sufficiently small. This is secured by the following assumption.

4 Hairer, Nørsett, and Wanner (1993), §II.2

Assumption 39.1. Let f ∈ C^q(R^d, R^d) for some q ∈ N. Furthermore, let f be globally Lipschitz and let all its derivatives of order up to q be uniformly bounded and globally Lipschitz. In other words, we assume that there exists some L > 0 such that ‖D^α f‖_∞ ≤ L for all multi-indices α ∈ N_0^d with 1 ≤ Σ_i α_i ≤ q, and ‖D^α f(a) − D^α f(b)‖ ≤ L ‖a − b‖ for all multi-indices α ∈ N_0^d with 0 ≤ Σ_i α_i ≤ q.

Under these conditions, the (by Taylor's theorem: optimal) local convergence rate of O(h^{q+1}) is obtained in the following theorem.

Theorem 39.2. Let the prior be a q-times integrated Wiener process or q-times integrated Ornstein-Uhlenbeck process5 for some q ∈ N, let R > 0, and let m(h) be the filtering mean computed by one step of the EKF0. Then, under Assumption 39.1, there exists a constant C > 0 such that, for all sufficiently small h > 0,

    ‖m_0(h) − x(h)‖ ≤ C h^{q+1}.

5 See §38.2 for the definition of these priors.

Proof. By bounding all relevant quantities and applying Taylor's theorem. See the proof of Theorem 8 in Kersting, Sullivan, and Hennig (2020) for details. □

As in classical numerics, the implied6 global convergence rates are O(h^q), which are satisfied under some additional restrictions.

6 Strictly speaking, one can obtain even faster convergence rates because filters exchange information between adjacent steps via the derivatives - like multi-step methods do. §39.3 presents two settings (for q = 1 and q = 2) where the EKF0 indeed has global convergence rates of order h^{q+1}.

Theorem 39.3. Under the same assumptions as in Theorem 39.2, let us add the restrictions that q = 1, that the prior is a q-times integrated Wiener process, and that R ∈ O(h^q) is constant for all times t_n, n = 1, ..., N. Then, there exists a constant C(T) > 0, depending on the final time T > 0, such that for all sufficiently small h > 0

    ‖m(T) − x(T)‖ ≤ C(T) h^q,

where m(T) := H_0 m_N denotes by Eq. (38.3) the posterior mean estimate of x(T) computed by the EKF0. The same bound holds for the EKS0.

Proof. By a combination of fixed-point arguments and a specific version of the discrete Grönwall inequality; see the proof of Theorem 14 in Kersting, Sullivan, and Hennig (2020) in the case of the EKF0. The extension to the EKS0 follows because, at the final time T, the filtering and smoothing distributions are equal.7 □

7 This extension to smoothing was not contained in the original work by Kersting, Sullivan, and Hennig (2020), but was later pointed out by Krämer and Hennig (2020).

In the same setting, the posterior standard deviation is asymptotically well-calibrated in the sense that it globally converges to zero with the same rate.
Theorem 39.4. Under the same assumptions and restrictions as in Theorem 39.3, there exists a constant C(T) > 0 such that the final posterior standard deviation √P(T), with P(T) := H_0 P_N H_0ᵀ, is bounded by

    √P(T) ≤ C(T) h^q,

for both the EKF0 and EKS0.8

8 Recall, from Eq. (38.22), the notation P_N for the posterior variance.

Proof. See Theorem 15 in Kersting, Sullivan, and Hennig (2020). □
The main shortcoming of the global convergence results from Theorems 39.3 and 39.4 is their restriction to q = 1. Fortunately, they were published alongside experiments that demonstrated their validity for q ∈ {2, 3}.9 In the meantime, experimental evidence for choices up to q = 11 has been added.10 Hence, it is widely believed that Theorem 39.3 can be extended to general q ∈ N.

9 See §9 in Kersting, Sullivan, and Hennig (2020).
10 See §4 in Krämer and Hennig (2020).
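To make these rates tangible, here is a self-contained sketch (our own, not taken from the referenced papers) of the EKF0 with a once-integrated Wiener process prior (q = 1, R = 0), together with an empirical estimate of its convergence order on a linear test problem. It follows the generic prediction/update recursion of Eqs. (38.17)-(38.22); all concrete names are ours.

```python
import numpy as np

def ekf0_step(m, P, h, f, R=0.0, sigma2=1.0):
    """One EKF0 step with the once-integrated Wiener process prior (q = 1).

    The state is m = (x, x'); the step follows the prediction/update
    recursion of Eqs. (38.17)-(38.22) with data z_{n+1} = 0 (scalar ODE).
    """
    A = np.array([[1.0, h], [0.0, 1.0]])                 # transition matrix
    Q = sigma2 * np.array([[h**3 / 3, h**2 / 2],
                           [h**2 / 2, h       ]])        # diffusion matrix
    H0, H = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # projections
    m_pred, P_pred = A @ m, A @ P @ A.T + Q              # predict
    z = H @ m_pred - f(H0 @ m_pred)                      # EKF0 residual
    S = H @ P_pred @ H + R                               # innovation variance
    K = P_pred @ H / S                                   # Kalman gain
    return m_pred - K * z, P_pred - np.outer(K, H @ P_pred)

def solve(f, x0, T, h):
    m, P = np.array([x0, f(x0)]), np.zeros((2, 2))
    for _ in range(round(T / h)):
        m, P = ekf0_step(m, P, h, f)
    return m[0]

f = lambda x: -x                                         # x' = -x, x(0) = 1
errors = [abs(solve(f, 1.0, T=1.0, h=h) - np.exp(-1.0))
          for h in (0.1, 0.05, 0.025, 0.0125)]
print(np.log2(np.array(errors[:-1]) / np.array(errors[1:])))
# Prints empirical orders close to 2 = q + 1, i.e. even faster than the
# guaranteed h^q of Theorem 39.3 - in line with footnote 6 and §39.3.
```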
> 39.1.2 Convergence Analysis by Scattered-Data Interpolation

While the above classical analysis is based directly on the recursion equations of the EKF0, an alternative analysis obtains convergence rates directly from the state-space formulation of Eqs. (38.10)-(38.13). This analysis is due to Tronarp, Särkkä, and Hennig (2021); we follow their construction in this section.

First, recall from Eq. (38.28) that the global MAP estimate X*(t) maximises the posterior p(X(t) | z_{1:N} = 0) - i.e. it maximises the prior under the restriction that the information operator11 Z (which extracts the data z_n from the ODE) is zero for all {t_n}_{n=0}^N. More formally, by Eq. (38.18), this (nonlinear) operator is

    Z[·] := (d/dt)[·] − f([·]),    (39.1)

and ODE filters and smoothers infer x by conditioning on

    Z[X](t_i) = 0.

11 For a rigorous definition of information operators, see Cockayne et al. (2019b, §2.2).

By use of Z, Tronarp, Särkkä, and Hennig (2021) describe the approximation of the MAP estimate X*(t) as scattered-data interpolation in Sobolev spaces.12 In this framework, compared to Assumption 39.1, no Lipschitz or boundedness conditions, but one additional derivative, are required.

12 Wendland and Rieger (2005); Arcangéli, López de Silanes, and Torrens (2007).

Assumption 39.5. Let f ∈ C^{q+1}(R^d, R^d) for some q ∈ N.

The following theorem is a simplified version of Theorem 3 from Tronarp, Särkkä, and Hennig (2021).

Theorem 39.6. Under Assumption 39.5 and for any prior X(t) of smoothness q,13 there exists a constant C(T) > 0 such that

    sup_{t ∈ [0,T]} ‖ ∫_0^t Z[X*(s)] ds ‖ ≤ C(T) h^q,    (39.2)

where x*(t) = H_0 X*(t) is the MAP estimate of x(t) given a discretisation 0 = t_0 < t_1 < ··· < t_N = T.14

13 That is, for any process with a.s. q-times differentiable sample paths. In particular, this includes the Matérn family with ν = q + 1/2 (§38.2) and its special cases: the q-times integrated Wiener process and Ornstein-Uhlenbeck process. See §2.1 in Tronarp, Särkkä, and Hennig (2021) for an alternative definition of such priors by use of Green's functions.
14 Recall Eq. (38.30).

Proof. The proof idea is first to analyse (with the help of tools from nonlinear analysis) which regularities the information operator Z inherits from f under Assumption 39.5, and then to apply results from scattered-data interpolation in the Sobolev space associated with the prior X(t). Details in Tronarp, Särkkä, and Hennig (2021), Theorem 3. □

As F. Tronarp pointed out to us in personal communication, the convergence rates of the MAP estimate then follow as a corollary.
Corollary 39.7. If Assumption 39.5 holds and f is globally L-Lipschitz continuous for some constant L > 0, then, for any prior X(t) of smoothness q,13 there exists a constant C(T) > 0 such that

    sup_{t ∈ [0,T]} ‖x*(t) − x(t)‖ ≤ C(T) h^q,

where x*(t) = H_0 X*(t) is the MAP estimate of x(t) given a discretisation 0 = t_0 < t_1 < ··· < t_N = T.14 (NB: In particular, this uniform bound also holds for the discrete MAP estimate x*(t_{0:N}) which the IEKS aims to estimate; see §38.3.3.)

Proof. By the fundamental theorem of calculus (with x*(0) = x(0) = x_0), the triangle inequality, and Eq. (39.1), we have

    ‖x*(t) − x(t)‖ = ‖ ∫_0^t [ẋ*(s) − f(x(s))] ds ‖
                   ≤ ‖ ∫_0^t Z[X*(s)] ds ‖ + ‖ ∫_0^t [f(x*(s)) − f(x(s))] ds ‖
                   ≤ C(T) h^q + ∫_0^t L ‖x*(s) − x(s)‖ ds,

where the first summand is bounded by Eq. (39.2). Now, application of the integral form of Grönwall's inequality (see e.g. Lemma 2.7 in Teschl (2012)) concludes the proof. □

> 39.1.3 Discussion of Convergence-Rate Results

Like Theorem 39.3, Corollary 39.7 also proves global convergence rates of O(h^q) of an estimate to the true solution x(t). But apart from that, these two results are separated by multiple major differences - in addition to the aforementioned difference between Assumptions 39.1 and 39.5. To begin with, Corollary 39.7 is applicable to any prior with q derivatives, while Theorem 39.3 is restricted to an EKF0 with q-times IWP prior and q = 1. The main restriction of Corollary 39.7 is, instead, that it only provides convergence rates for the MAP estimate x*(t), which the standard (extended) Kalman filters and smoothers do not approximate for a nonlinear f. Fortunately, as we discussed in §38.3.3, the iterated extended Kalman smoother (IEKS) does converge to a local minimum of the non-convex discrete-time MAP problem from Eq. (38.29), but it is unclear when this local minimum coincides with the global minimum X*(t_{0:N}). Thus, Corollary 39.7 (strictly speaking) does not contain convergence rates for any particular algorithm.

Nonetheless, it is another indicator that (extended) Kalman ODE filters and smoothers - which, after all, compute MAP estimates given local approximations to f - estimate x with a global numerical error of O(h^q) if the first q derivatives of x exist and are appropriately modelled in the state space. In fact, the corresponding experiments15 report that not only the IEKS (but also the EKS0 and EKS1!) exhibit these qth-order convergence rates of the MAP estimate x*(t). Moreover, the difference between the EKS1 and the IEKS appears to be small in these experiments.

15 Tronarp, Särkkä, and Hennig (2021), §7

It will be up to future research to find the exact conditions on f and on the prior X(t) under which these convergence rates hold for each filter and smoother. To this end, both generalising Theorem 39.3 to q ∈ {2, 3, ...} and replacing x*(t) with the output of an actual algorithm in Corollary 39.7 seem like promising ways forward.

Compared to classical methods, the h^q global convergence rates are optimal and on par with Runge-Kutta methods, if we ignore the possibility of information sharing between steps - i.e. if we regard the ODE filter as a single-step method. This is because the rate of an iterated Taylor expansion with q derivatives is also h^q. But, since an ODE filter that models q derivatives has information from the previous steps stored in these derivatives,16 it is in a way more like a multi-step method, which can achieve even higher rates in some settings - as we will see in §39.3.

16 See Exercises 39.14 and 39.15.

► 39.2 Numerical Stability

The excellent polynomial convergence rates of ODE filters and smoothers will, however, only be obtained when their computations are numerically stable. This is of particular importance for stiff ODEs.17 To characterise the ability of an ODE solver to solve stiff equations, multiple notions of stability exist. The most common of these notions (A-stability) has been proved for the EKF1,18 but does not hold for the EKF0; see §39.2.1. But there are additional stability concerns for Gaussian ODE filters (and smoothers), as they rely heavily on numerical linear algebra.19 The resulting rounding errors can - in particular if many derivatives are included in the ssm - make ODE filters unstable, even for non-stiff ODEs. Fortunately, these linear-algebra instabilities were decisively rectified by a recent publication20 that we summarise in §39.2.2. As in the existing literature, we only consider the q-times integrated Wiener process (IWP) prior in this section.21

17 Hairer and Wanner (1996)
18 Tronarp et al. (2019), §3
19 Recall Eqs. (38.17)-(38.25). In particular, Eqs. (38.20) and (38.24) involve matrix inversions.
20 Krämer and Hennig (2020)
21 Recall from §38.2 why the IWP is the standard prior.
> 39.2.1 A-Stability

A-stability is the most common criterion for the numerical stability of an ODE solver. It is defined by the solver's asymptotic performance on the so-called Dahlquist test equation, written

    ẋ(t) = Λ x(t),   x(0) = x_0 ≠ 0,    (39.3)

for some real22 matrix Λ whose eigenvalues have negative real parts, i.e. for which lim_{t→∞} x(t) = 0. An ODE solver is said to be A-stable if and only if its numerical estimate x̂(t) also converges to zero (for a fixed step size h > 0) as t → ∞.23 Accordingly, a Gaussian ODE filter is A-stable if and only if its mean estimate H_0 m_n goes (for a fixed step size h > 0) to zero as n → ∞.

22 In the classical literature, Λ ∈ C^{d×d} is a complex matrix, but ODE filters are only designed for real-valued ODEs. Hence, we here use the real-valued analogue (39.3) instead; cf. Eq. (31) in Tronarp et al. (2019).
23 Dahlquist (1963)

The following recursion holds by Eqs. (38.17)-(38.21) for the predictive mean m⁻_n of both the EKF0 and EKF1 (but with different K_n):

    m⁻_{n+1} = [A(h) − A(h) K_n B] m⁻_n,    (39.4)

where B = H − H_0 Λ. In particular, we have B = [−Λ, 1, 0, ..., 0] in the case of the q-times IWP prior. If K_∞ = lim_{n→∞} K_n exists24 and if [A(h) − A(h) K_∞ B] has eigenvalues in the unit circle, then Eq. (39.4) implies that this ODE filter is A-stable.

24 The limit Kalman gain K_∞ indeed exists in some important cases; see §3.1 in Schober, Särkkä, and Hennig (2019) and Proposition 10 in Kersting, Sullivan, and Hennig (2020).
Theorem 39.8 (Tronarp et al., 2019). The EKF1 and EKS1 with a q-times IWP prior are A-stable.

Proof. Theorem 2 in Tronarp et al. (2019) shows (using filtering theory)25 that indeed, for the EKF1, K_∞ exists and [A(h) − A(h) K_∞ B] has eigenvalues in the unit circle. As explained above, the A-stability of the EKF1 follows. For the corresponding smoother (EKS1), the claim follows from the fact that, at the final time (and thus also for t → ∞), the smoothing mean coincides with the filtering mean. □

25 Anderson and Moore (1979), p. 77

Note that Tronarp et al. (2019) use a different terminology in their Theorem 2. The term “Kalman filter” there refers to the optimal estimator in the ssm (38.10)-(38.13), which here is linear-Gaussian due to f(x) = Λx. In other words, “Kalman filter” there signifies the method we call EKF1. In Tronarp et al. (2019), this equivalence is exploited in Corollary 1, which is essentially our Theorem 39.8. Importantly, the EKF0 is not A-stable (see Exercise 39.9).

Exercise 39.9. Show that the EKF0 is not A-stable. To this end, consider the one-dimensional ODE (39.3) with Λ = −a < 0, and show that (for any given step size h > 0) the filtering mean H_0 m_n does not converge to 0 as n → ∞ if a is large enough. How large does a have to be for this?

The A-stability of the EKS1 was recently demonstrated on a very stiff version of the Van der Pol ODE.26 There are other, less-common notions of stability which have (at the time of writing) not been discussed in the context of ODE filtering.

26 Bosch, Hennig, and Tronarp (2021), §6.1
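A quick numerical illustration of Exercise 39.9 (our own sketch): by Proposition 39.11 below, the EKF0 with 1-times IWP prior and R = 0 reproduces Heun's method, whose mean recursion on ẋ = −a x has amplification factor 1 − ah + (ah)²/2 per step. This factor exceeds 1 in magnitude as soon as ah > 2, so the filtering mean diverges for a fixed h once a is large enough.

```python
# Dahlquist test x' = -a x with fixed step size h: the EKF0 mean follows
# x_{n+1} = (1 - a h + (a h)^2 / 2) x_n (Heun's method, cf. §39.3.1).
h = 0.1
for a in (5.0, 30.0):            # a h = 0.5 (stable) vs a h = 3 (unstable)
    x = 1.0
    for _ in range(200):
        x *= 1.0 - a * h + (a * h) ** 2 / 2
    print(a, abs(x))             # decays for a h < 2, blows up for a h > 2
```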

> 39.2.2 Stability of Linear-Algebra Operations

As detailed in §3, Gaussian inference requires nothing but (numerical) linear algebra, i.e. nothing but matrix-vector and matrix-matrix operations. These are all (up to rounding errors) exact - with the exception of matrix inversions (in particular, if the state-space dimension D is large). The same is true for Gaussian ODE filters and smoothers, since they are in essence an iterative application of Gaussian inference.27 The required inversion of the (covariance) matrices can be facilitated by the Cholesky decomposition - or by the QR decomposition if the matrix is only positive semidefinite.28

27 Recall Eqs. (38.17)-(38.25).
28 Krämer and Hennig (2020), §3.4

But this does not solve an even more fundamental problem: the accumulation of rounding errors, which can be insignificant in other cases, but plays a decisive role here. To see why, consider the diffusion matrix Q(h) for step size h > 0 from Eq. (38.11). For the q-times IWP prior, its entries are29

    [Q(h)]_{ij} = σ² h^{2q+3−i−j} / [(2q+3−i−j) (q+1−i)! (q+1−j)!],

for i, j = 1, ..., q+1. Thus, Q(h) has entries that span 2q orders of magnitude, from O(h^{2q+1}) to O(h), which leads to ill-conditioned matrix computations for large q.30 To alleviate this problem, a recent publication31 suggested using the re-scaled variable T⁻¹X instead of X, with transformation matrix

    T := √h · diag(h^q/q!, h^{q−1}/(q−1)!, ..., h, 1).    (39.5)

29 This was already stated in Eq. (5.26). A detailed derivation (with zero-based indexing!) can be found in Appendix A of Kersting, Sullivan, and Hennig (2020).
30 In fact, this ill-conditioning has prevented earlier work from testing the expected O(h^q) convergence rates beyond q = 4. But Krämer and Hennig (2020) could, by virtue of their stabilising implementation, test these rates up to q = 11; see their Figure 3.
31 Krämer and Hennig (2020), §3.2

In the underlying continuous-time model, this variable transformation X̄ = T⁻¹X amounts to using

    dX̄(t) = T⁻¹ F T X̄(t) dt + T⁻¹ L dW_t,    (39.6)

instead of the original SDE (38.4).32 In the discrete-time ssm, this leads to an analogously transformed initial distribution and predictive distribution, written

    p(x̄_0) = N(x̄_0; T⁻¹ m_0, T⁻¹ P_0 T⁻ᵀ),
    p(x̄_{n+1} | x̄_n) = N(x̄_{n+1}; T⁻¹ A(h) T x̄_n, T⁻¹ Q(h) T⁻ᵀ),    (39.7)

instead of Eqs. (38.10) and (38.11).33 In other words, we replaced the original predictive matrices (A, Q) from Eq. (38.13) with the new ones (Ā := T⁻¹A(h)T, Q̄ := T⁻¹Q(h)T⁻ᵀ) to obtain Eq. (39.7). As desired, these new matrices are now scale-invariant:

    [Ā]_{ij} = I(j ≥ i) \binom{q+1-i}{q+1-j},   [Q̄]_{ij} = σ² / (2q+3−i−j),    (39.8)

for all i, j ∈ {1, ..., q+1}, where the entries of Ā are binomial coefficients. The new projection matrices from Eq. (38.3) are accordingly given by

    H̄_0 := H_0 T,   H̄ := H T.    (39.9)

After this transformation, the Gaussian ODE filter/smoother works in the re-scaled state T⁻¹[x(t), ..., x^{(q)}(t)]ᵀ instead of the original [x(t), ..., x^{(q)}(t)]ᵀ. This stabilises the computations in two ways: first, the entries (and thus the condition number) become independent of h.34 Second, (Ā, Q̄) can be pre-computed for varying step sizes, which is not possible for the step-size-dependent matrices (A, Q).

32 Cf. the alternative Nordsieck transformation of Eq. (39.14).
33 Note that, for notational simplicity, we here assumed a constant step size h. See Krämer and Hennig (2020) for a generalisation to variable step sizes {h_n}_{n=0}^N (in zero-based indexing!).
34 Krämer and Hennig (2020), Table 1
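The scale-invariance of (Ā, Q̄) is easy to verify numerically. The following sketch (our own; the one-based index formulas above are translated to zero-based Python) builds A(h), Q(h), and T and checks that the transformed matrices do not depend on h:

```python
import math
import numpy as np

def iwp_matrices(q, h, sigma2=1.0):
    # Transition A(h) and diffusion Q(h) of the q-times IWP, Eq. (38.11).
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(1, q + 2):          # one-based indices as in the text
        for j in range(1, q + 2):
            if j >= i:
                A[i - 1, j - 1] = h ** (j - i) / math.factorial(j - i)
            Q[i - 1, j - 1] = (sigma2 * h ** (2 * q + 3 - i - j)
                               / ((2 * q + 3 - i - j)
                                  * math.factorial(q + 1 - i)
                                  * math.factorial(q + 1 - j)))
    return A, Q

def rescaled(q, h):
    # Transformation matrix T from Eq. (39.5).
    T = math.sqrt(h) * np.diag([h ** k / math.factorial(k)
                                for k in range(q, -1, -1)])
    A, Q = iwp_matrices(q, h)
    Tinv = np.linalg.inv(T)
    return Tinv @ A @ T, Tinv @ Q @ Tinv.T

A1, Q1 = rescaled(q=4, h=0.1)
A2, Q2 = rescaled(q=4, h=0.001)
assert np.allclose(A1, A2) and np.allclose(Q1, Q2)   # Eq. (39.8): h-free
```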
We conclude this section by summarising the other major contribution35 of Krämer and Hennig (2020) to the numerical stability of the linear algebra in Gaussian ODE filters (and smoothers): the square-root implementation of Kalman filters, which, instead of the full covariance matrices, tracks their matrix square roots. The resulting square-root filters (of which multiple variants exist)36 have roughly the same computational complexity, but better numerical stability.

35 Note that we already covered their third contribution (initialisation by automatic differentiation) in §38.2.1.
36 Grewal and Andrews (2001), §6.5

There exist multiple definitions of the matrix square root √A of a matrix A. Here, it signifies any matrix such that A = √A √Aᵀ. This matrix √A is not always unique. For instance, the (lower-triangular) Cholesky factors of the (symmetric positive-definite) predictive covariance matrices {P⁻_n}_{n=1}^N are matrix square roots under our definition.
With this in mind, we follow Krämer and Hennig (2020) to propose the following square-root implementation for the EKF0 and EKF1. First, we observe, in view of Eq. (38.17), that the predictive covariance matrix is constructed via the factorisation

    P⁻_{n+1} = [A_n L_P, L_Q] [A_n L_P, L_Q]ᵀ,    (39.10)

where L_P and L_Q are the respective Cholesky factors of P_n and Q(h_{n+1}). This means that [A_n L_P, L_Q] ∈ R^{(q+1)×2(q+1)} is already a matrix square root of P⁻_{n+1}. But, as a symmetric positive-definite matrix, P⁻_{n+1} admits a lower-dimensional square root, namely the upper-triangular R from its Cholesky decomposition:37

    P⁻_{n+1} = Rᵀ R.    (39.11)

37 Note that this R has nothing to do with the R_{n+1} in Eq. (38.19).

Fortunately, this R can be obtained without assembling P⁻_{n+1} from its square-root factors as in Eq. (39.10), since it (as Exercise 39.10 reveals) is equal to the upper-triangular factor of the QR decomposition of [A_n L_P, L_Q]ᵀ. Hence, we may replace the original prediction step (39.10) by the lower-dimensional factorisation (39.11), in which the Cholesky factor R is efficiently obtained by a QR decomposition of [A_n L_P, L_Q]ᵀ (i.e. without ever computing P⁻_{n+1}).

Exercise 39.10. Prove the claim in the text, i.e. show that the upper-triangular matrix in the QR decomposition of [A_n L_P, L_Q]ᵀ is the transpose of the lower-triangular Cholesky factor of P⁻_{n+1}. (For a solution, see §3.3 in Krämer and Hennig (2020).)
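In code, this square-root prediction step is a single QR decomposition. The following sketch (ours, using random stand-ins for P_n and Q(h)) verifies the identity behind Exercise 39.10, up to the usual sign ambiguity of QR factors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
spd = lambda: (lambda M: M @ M.T + d * np.eye(d))(rng.standard_normal((d, d)))
P, Q = spd(), spd()                                  # stand-ins for P_n, Q(h)
L_P, L_Q = np.linalg.cholesky(P), np.linalg.cholesky(Q)

# QR-decompose the stacked factors [A L_P, L_Q]^T; the triangular factor R
# then satisfies R^T R = A P A^T + Q, so that the predictive covariance
# P^-_{n+1} never has to be assembled explicitly.
_, R = np.linalg.qr(np.vstack([(A @ L_P).T, L_Q.T]))
assert np.allclose(R.T @ R, A @ P @ A.T + Q)
```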
Thereby, the filter can summarise the predictive distribution of Eq. (38.17) by (m⁻_{n+1}, R) instead of (m⁻_{n+1}, P⁻_{n+1}). In the subsequent update step, the innovation-covariance matrix S of Eq. (38.19) can again be captured by its Cholesky factor, which is (via analogous reasoning) again available (without assembling S) in the form of the upper-triangular QR factor of (H Rᵀ)ᵀ. The conditioning on z_{n+1} from Eqs. (38.20)-(38.22) can then be executed solely by use of this Cholesky factor of S.38 Finally, the resulting filtering distribution N(m_{n+1}, P_{n+1}) is obtained directly from the previously computed Cholesky factors. Again, the computation of P_{n+1} is replaced by computing its Cholesky factor instead.

38 For details, see Appendix A.3 in Krämer and Hennig (2020).

Altogether, this square-root implementation represents all required covariance matrices (including the hidden, intermediate ones) by their Cholesky matrix square roots. Since the Cholesky factors of the matrices P⁻_{n+1}, S, and P_{n+1} can be obtained by QR decompositions of already-available matrix square roots, they never have to be assembled - which further reduces the computational cost.
We refer the reader to Appendix A39 of Krämer and Hennig (2020) for a complete description of this square-root implementation, and to the book by Grewal and Andrews (2001) for more implementation ideas which might help address the remaining practical challenges.40 All of the above-described tricks are included in the ProbNum Python package.41

39 In particular, the square-root versions of the EKS0 and EKS1 are detailed in their Appendix A.4.
40 These are mainly the efficient integration of high-dimensional and very stiff ODEs; see §5 in Krämer and Hennig (2020). In this regard, Krämer et al. (2021) recently published a further advance demonstrating that ODE filters can efficiently solve ODEs in very high dimensions.
41 Code at probnum.org. See the corresponding publication by Wenger et al. (2021).

► 39.3 Connection with Classical Solvers

The probabilistic reproduction of classical numerical methods has, especially in its early days, been a central strategy of pn for inventing practical probabilistic solvers. For ODE filtering, this changed when Tronarp et al. (2019) introduced the rigorous ssm of Eqs. (38.10)-(38.13) because, from then on, new research could also draw from the accumulated wisdom of signal processing (instead of numerical analysis). Since then, most ODE filters and smoothers have been designed directly from the first principles of Bayesian estimation in SSMs, without attempting to imitate classical numerical solvers. While some loose connections have been observed,42 it has not been studied in detail how the whole range of ODE filters and smoothers relates to classical methods. Nonetheless, earlier research43 has established one important connection - in the form of an equivalence between the EKF0 with IWP prior44 (more precisely, its filtering mean) and Nordsieck methods,45 which we will discuss in §39.3.2. But first, we will present another, more elementary, special case.46

42 For instance, it has been repeatedly pointed out that both the EKF1 and the classical Rosenbrock methods make use of the Jacobian matrix of f.
43 Schober, Särkkä, and Hennig (2019)
44 It is unsurprising that the equivalences are only known for the IWP prior, as it is the only one with Taylor predictions; see Eq. (38.14).
45 Nordsieck (1962)
> 39.3.1 Equivalence with the Explicit Trapezoidal Rule

46 Note that, even earlier, the pioneering work by Schober, Duvenaud, and Hennig (2014) showed an equivalence between a single Runge-Kutta step and GP regression with an IWP prior. However, as it relies on imitating the sub-step structure of Runge-Kutta methods, this equivalence cannot be naturally reproduced with ODE filters. Therefore, we do not discuss this result further here.

In the case of the 1-times IWP prior and R = 0, the Kalman gains {K_n}_{n=1}^N are the same for all n. In other words, the filter is always in its steady state K_∞ = lim_{n→∞} K_n. Therefore, the recursion for the Kalman-filtering means {m_n}_{n=0}^N is independent of n, which leads to the following equivalence.

Proposition 39.11 (Schober, Särkkä, and Hennig, 2019). The EKF0 with 1-times IWP prior and R = 0 is equivalent to the explicit trapezoidal rule (a.k.a. Heun's method). More precisely, its filtering mean estimates x̂_n := H_0 m_n of x(t_n) follow the explicit trapezoidal rule, written

    x̂_{n+1} = x̂_n + (h/2) [f(x̂_n) + f(x̃_{n+1})],    (39.12)

with x̃_{n+1} := x̂_n + h f(x̂_n).

Proof. See Proposition 1 in Schober, Särkkä, and Hennig (2019), where the explicit trapezoidal rule is referred to as “the P(EC)1 implementation of the trapezoidal rule”. □
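The equivalence is easy to check numerically for a single step. In the sketch below (our own), the steady-state EKF0 update with constant gain K = (h/2, 1) - the gain one obtains for the 1-times IWP prior with R = 0 when the filter is started with zero initial covariance - reproduces one Heun step exactly:

```python
import numpy as np

f = lambda x: np.sin(x) - x            # any scalar autonomous right-hand side
x, h = 0.7, 0.1

# One explicit trapezoidal (Heun) step, Eq. (39.12):
x_tilde = x + h * f(x)
x_heun = x + h / 2 * (f(x) + f(x_tilde))

# One EKF0 step from the state m = (x, f(x)) with constant gain K = (h/2, 1):
A = np.array([[1.0, h], [0.0, 1.0]])
m_pred = A @ np.array([x, f(x)])
z = m_pred[1] - f(m_pred[0])           # residual of the data z_{n+1} = 0
m_new = m_pred - np.array([h / 2, 1.0]) * z
assert np.isclose(m_new[0], x_heun)
```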

> 39.3.2 Equivalence with Nordsieck Methods

Classical introductions to Nordsieck methods can be found in most textbooks.47 To comprehend Nordsieck methods in our context, let us first recall from Eqs. (38.1)-(38.3) that ODE filters do not only model the ODE solution x(t), but a larger state vector X(t), which contains at least x(t) and x'(t). For the most important priors, X(t) is simply the concatenation of the first q derivatives, written

    X(t) = [x(t), x'(t), ..., x^{(q)}(t)]ᵀ.    (39.13)

47 We especially recommend §III.6 in Hairer, Nørsett, and Wanner (1993).

In this section, we again restrict our attention to the standard prior, the q-times IWP.48 Now, recall that the dynamics of X(t) are modelled by the solution of the SDE (38.4), which can be transformed as in Eq. (39.6). Previously, in Eq. (39.5), we chose the transformation matrix T such that the computations became less step-size dependent and more stable. As an alternative, we now use a different transformation matrix, written

    T_Nord := diag(1, 1!/h, 2!/h², ..., (q−1)!/h^{q−1}, q!/h^q).    (39.14)

48 See §38.2.

This choice is motivated by the insight (see §37) that standard ODE solvers construct an approximating polynomial for x(t) that (in every step) captures as many summands of its Taylor series (37.3) as possible. Indeed, the so-transformed system X_Nord(t) := T_Nord⁻¹ X(t) now models the re-scaled state-space vector

    X_Nord(t) = [x(t), h x'(t), (h²/2) x''(t), ..., (h^q/q!) x^{(q)}(t)]ᵀ,    (39.15)

instead of the original X(t) of Eq. (39.13). In the classical literature, X_Nord(t) is called the Nordsieck vector; its entries are the first q Taylor summands of x(t+h) = Φ_h(x(t)); see Exercise 39.12.

Exercise 39.12. In Eq. (37.3), a Taylor series of the ODE flow, started at a numerical estimate x̂(t), was given as an extrapolation model. In contrast, the definition of the Nordsieck vector X_Nord(t) does not depend on a numerical estimate. This difference stems from the “uncertainty-unawareness” of classical methods, discussed in §37. To understand this phenomenon, show that

    f^{(i)}(x(t)) = x^{(i)}(t),

where f^{(i)} is defined as in Eq. (37.4). (Hint: use induction over i ∈ N.) Then, show that it follows immediately that the qth-order Taylor expansion of Φ_h(x(t)) is equal to the sum of the entries of X_Nord(t).

Analogously to Eq. (39.8), we can now derive the Nordsieck form of the prediction matrices; in particular, for A we obtain the Pascal upper-triangular matrix Ā = Ā_Nord with entries

    [Ā_Nord]_{ij} = I(j ≥ i) \binom{j-1}{i-1}.
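A quick numerical check of this identity (our own snippet; note the zero-based indexing, under which the entry formula reads binom(j, i)):

```python
import math
import numpy as np

q, h = 3, 0.1
# Transition matrix of the q-times IWP: [A(h)]_{ij} = h^{j-i}/(j-i)!, j >= i.
A = np.array([[h ** (j - i) / math.factorial(j - i) if j >= i else 0.0
               for j in range(q + 1)] for i in range(q + 1)])
T_nord = np.diag([math.factorial(k) / h ** k for k in range(q + 1)])
A_nord = np.linalg.inv(T_nord) @ A @ T_nord
pascal = np.array([[math.comb(j, i) if j >= i else 0 for j in range(q + 1)]
                   for i in range(q + 1)])
assert np.allclose(A_nord, pascal)     # the Pascal upper-triangular matrix
```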

For H, we now have H̄ := [0, h⁻¹, 0, ..., 0] due to Eq. (39.9); H̄_0 = [1, 0, ..., 0] remains unchanged.49 The recursion of the means {m_n}_{n=0}^N (computed by the EKF0) is, after combining prediction (38.17) and update (38.21), given by

    m_{n+1} = [I − K_{n+1} H̄] Ā m_n + K_{n+1} f(H̄_0 Ā m_n),    (39.16)

where m_n and K_n are also re-scaled into Nordsieck form. Formulated without a statistical frame, a Nordsieck method for an estimate x̂(t) of X_Nord(t) executes the prediction and update in a single step:50

    x̂(t + h) = [I − l H] Ā x̂(t) + h l f(H_0 Ā x̂(t)),    (39.17)

where the weight vector l can be freely chosen to define the method. The structural resemblance of the recursions for filtering (39.16) and for Nordsieck methods (39.17) should be evident - with the main difference that the Kalman gain K_{n+1} plays the role of l, but can, unlike l, depend on n.51 And, in fact, these two recursions are the same in the steady state of the filter, i.e. after K_{n+1} has reached its limit K_∞ := lim_{n→∞} K_n.52 Note, however, that K_∞ depends on the ssm on which the EKF0 performs inference. While more equivalences with different Nordsieck methods should hold, the following result is so far the only one known.

49 Remember that, for the IWP prior, we have H_0 = [1, 0, ..., 0] and H = [0, 1, 0, ..., 0] ∈ R^{1×(q+1)}.
50 Cf. Eq. (28) in Schober, Särkkä, and Hennig (2019).
51 The h in front of the l in Eq. (39.17) comes from the re-scaled observation matrix H̄. Since the gain is re-scaled in the same way, it is hidden in K_{n+1} in Eq. (39.16).
52 For the details, see §3.1 in Schober, Särkkä, and Hennig (2019).

Theorem 39.13 (Schober, Särkkä, and Hennig, 2019). The EKF0 with 2-times IWP prior and R = 0 is a Nordsieck method of third order, after its Kalman gain has reached its steady state K_∞ = lim_{n→∞} K_n. In particular, if initialised in this steady state, we have

    ‖m(T) − x(T)‖ ≤ C(T) h³.

Proof. First, derive the steady-state Kalman gain K_∞ in closed form. Then, insert K_∞ as the Nordsieck weight vector l into Theorem 4.2 from Skeel (1979), which yields global convergence rates of order h³. For the details, including the explicit value of K_∞, see Theorem 1 in Schober, Särkkä, and Hennig (2019). □

Remark. For the q-times IWP prior with q = 2, this theorem gives h^{q+1} instead of the h^q convergence rates suggested by Theorem 39.3 and Corollary 39.7, but only in the steady state. These rates indeed hold in practice.53

In the same way, one could interpret any instance of the EKF0 (in its steady state) as a Nordsieck method. But there is no guarantee that the corresponding Nordsieck method is practically useful, for at least two reasons: first, the numerical stability could be insufficient if more than two derivatives are included (see §39.2.2); second, the order of such a method will depend on the entries of K_∞, according to Theorem 4.2 from Skeel (1979).

53 Schober, Särkkä, and Hennig (2019), Figure 4

Exercise 39.14 (Information sharing between adjacent steps). The equivalence with the trapezoidal rule (39.12) shows how the EKF0 uses both y_{n−1} and y_n in the step from t_{n−1} to t_n. It is therefore smarter than Euler's method x̂_n = x̂_{n−1} + h y_{n−1}, which only uses y_{n−1}. Which part of the recursion, the prediction (38.17) or the update (38.21), is responsible for going from Euler's method to the trapezoidal rule? In other words, how do you have to alter Eq. (38.17) or Eq. (38.21) to fit only Euler's method instead of the trapezoidal rule?

Exercise 39.15 (ODE filters in steady state as multi-step methods). Recall from Eq. (39.15) that the Nordsieck vector for the q-times IWP prior models x(t) and its first q derivatives. At any given discrete time point t_n (n = 0, ..., N), the filtering mean estimate for x(t_n), computed by the EKF0, will of course depend on all previous function evaluations {y_i = f(H_0 m⁻_i)}_{i=1}^n. But, in the steady state, the mean estimates for the q modelled derivatives [x'(t_n), ..., x^{(q)}(t_n)] will depend only on a finite number j ∈ N of these function evaluations, namely on {y_i = f(H_0 m⁻_i)}_{i=n−j+1}^n. What is j for a given q ∈ N?
40
Perturbative Solvers

So far in this chapter on ODEs, all methods (with the sole exception of the particle ODE filter) were probabilistic, but not stochastic. By this, we mean that they use probability distributions to approximate x(t), but are not randomised (i.e. they return the same output when run multiple times). This design choice stems from a conviction held by some in the pn community1 that it is never optimal to inject stochasticity into any deterministic approximation problem - except in an adversarial setting, where it can be provably optimal. This view is, however, not shared by others, who make the following arguments in favour of randomisation in ODE solvers.

1 See the corresponding discussion in §12.3.

Consider any classical single-step method, e.g. a Runge-Kutta method, whose local numerical error is in O(h^{q+1}) for some q ≥ 1 and whose global numerical error is consequently in O(h^q). After computing an estimate x̂(h) in the first step (0 → h), it goes on to compute the second step starting at x̂(h). In other words, the second step is equal to the first step for a different IVP, namely for the same ODE with the new initial value x_0 = x̂(h) (assuming that the ODE is autonomous). This means, in the notation of Eq. (37.2), that, after computing x̂(h), the ODE solver tries to follow the flow map Φ_{t−h}(x̂(h)) in order to approximate x(t). But, strictly speaking, the only thing that we know after the first step is that there is a constant C > 0 such that x(h) ∈ B_{Ch^{q+1}}(x̂(h)). In view of this, we might as well try to follow any one of the possible flow maps {Φ_{t−h}(a) : a ∈ B_{Ch^{q+1}}(x̂(h))} in order to compute x̂(t).2 For any later time s > h, the impact of the numerical error at time h on our knowledge of x(s) will thus depend on the spread of {Φ_{s−h}(a) : a ∈ B_{Ch^{q+1}}(x̂(h))}, in addition to all subsequent numerical errors. Hence, the uncertainty over later estimates will be larger if Φ_t(a) is highly sensitive to a - that is, if x(t) is highly sensitive to the initial value x_0.

2 Recall that, for this reason, we called classical solvers “uncertainty-unaware” in §37.
Fortunately, this sensitivity to initial values has been extensively studied.3 Since Edward Lorenz's seminal work on weather models,4 it has been known that even simple ODEs, such as the Lorenz equations,5 can be so sensitive to their initial values that the system's long-term behaviour is impossible to predict - a phenomenon known as chaos. Clearly, such chaotic long-term behaviour cannot be captured by Gaussians (or any other parametric family).6 But even in most non-chaotic ODEs, the long-term effect of the numerical errors is non-Gaussian.

3 See, e.g., §2.4 in Teschl (2012).
4 Lorenz (1963)
5 The Lorenz equations with constants (σ, ρ, β) are:

    ẋ_1 = σ(x_2 − x_1),
    ẋ_2 = x_1(ρ − x_3) − x_2,    (40.1)
    ẋ_3 = x_1 x_2 − β x_3.

6 Recall that, in Figure 38.2, we already observed an elementary instance of chaos in the form of a bifurcation at x = 0.

► 40.1 Randomisation by Locally Adding Noise

But how can we capture such a non-Gaussian distribution of possible trajectories, given a classical solver with local error in O(h^{q+1})? The first answer was provided by Conrad et al. (2017), who proposed to imitate the numerical error by adding a calibrated Gaussian random variable after every step.7 More formally, let Ψ_h : R^d → R^d be our classical deterministic solver of local order q+1 on a time mesh {t_n}_{n=0}^N, i.e. x̂(t_n) = Ψ_{h_n}(x̂(t_{n−1})), with nth step size h_n := t_n − t_{n−1}.

7 An even earlier perturbative solver was introduced by Chkrebtii et al. (2016). This method performs a GP regression on stochastically generated data; i.e. it shares the Bayesian GP-regression view with Gaussian ODE filters, while stochastically generating its “data” by evaluating f at a sample, i.e. at a perturbation of the predictive mean used in Eq. (38.33); see §2.3 in Schober, Särkkä, and Hennig (2019) for a detailed discussion of this difference. While an important early advance, this method uses the same vague formulation of GP regression as the early ODE filters in the linear-Gaussian ssm (38.31)-(38.33), and only converges with linear rate (q = 1) to the true x(t). Therefore, we here focus on later solvers without these shortcomings.

Assumption 40.1. The numerical method Ψ_h : R^d → R^d has uniform local numerical error of order q+1, i.e. there exists some constant C > 0 such that, for any step size h > 0:

    sup_{u ∈ R^d} ‖Ψ_h(u) − Φ_h(u)‖ ≤ C h^{q+1}.

Under this assumption, the local error in the nth step, ε_n(h_n) := Ψ_{h_n}(x(t_{n−1})) − Φ_{h_n}(x(t_{n−1})), is in O(h_n^{q+1}) for all n ∈ {1, ..., N}. This unknown error ε_n(h_n) is now modelled by a random variable ξ_n(h_n):

    ε_n(h_n) ≈ ξ_n(h_n) := ∫_0^{h_n} χ_n(s) ds,    (40.2)

where χ_n is a stochastic process on [0, h_n] that models the off-mesh error accumulation between t_{n−1} and t_n.8

8 Note the fundamental difference with ODE filters: while ODE filters put their model (the prior) on x(t) as in Eq. (38.4), Conrad et al. instead put their probabilistic model (40.2) over the numerical error.

Assumption 40.2. Let the error processes {χ_n}_{n=1}^N be independent. For each n, let χ_n be a zero-mean GP defined on [0, h_n] such that, for all t ∈ [0, h_n],

    E( ‖ξ_n(t) ξ_n(t)ᵀ‖_F ) ≤ C t^{2p+1},

where ‖·‖_F is the Frobenius norm and C > 0 is a constant independent of the step sizes {h_n}_{n=1}^N.9

9 This assumption can be relaxed; cf. Assumption 3.3 in Lie, Stuart, and Sullivan (2019).
Given the error model (40.2) and the numerical solver Ψ_h on a mesh {t_n}_{n=0}^N, the algorithm by Conrad et al. now outputs a sequence of random variables {X_n}_{n=0}^N according to the recursion

    X_n = Ψ_{h_n}(X_{n−1}) + ξ_n(h_n),   X_0 = x_0.    (40.3)

In this method, the assumed numerical error ξ_n(h_n) is added to every iteration of the solver Ψ_{h_n}, and each random sample of {X_n}_{n=0}^N is considered a possible numerical approximation.10

10 For a pictorial representation of this randomised method, see Figure 2 in Conrad et al. (2017).

Exercise 40.3. Choose Ψ_h as a fourth-order Runge-Kutta method and implement the perturbative solver (40.3) with a small fixed step size h > 0 and a Gaussian error model ξ_n(h) with an appropriate variance. Then, apply it to the Lorenz system (40.1) and draw some samples of {X_n}_{n=0}^N. What does the distribution of sampled trajectories look like as n → ∞?

Perturbing a carefully designed classical method Ψ_h as in Eq. (40.3) might lead to a higher numerical error for any fixed step size h > 0. But if the variance of ξ_n(h_n) is small enough, it might not lower the convergence rate - as the following theorem shows, where h := max_{n=1,...,N} h_n denotes the largest step size. But first, we need to assume some regularity on f.
Assumption 40.4. The vector field f admits a t* ∈ (0,1] and C > 1 such that, for 0 < t ≤ t*, the flow map Φ_t is globally Lipschitz with Lipschitz constant Lip(Φ_t) ≤ 1 + Ct.11 (This assumption is satisfied if f is globally Lipschitz.)

11 The flow map Φ_t is defined in Eq. (37.2).
iff is globally Lipschitz.)

Theorem 40.5 (Lie, Stuart, and Sullivan, 2019; generalised from Conrad et al., 2017).12,13,14 Suppose Assumptions 40.1, 40.2, and 40.4 hold and fix x_0 ∈ R^d. Furthermore, for all t ∈ [0, T], assume that Ψ_t(Y) ∈ L² for all random variables Y ∈ L². Then, there exists a constant C > 0 (independent of h) such that the randomised probabilistic solver (40.3) converges in mean square. That is,

    E( max_{n=0,...,N} ‖X_n − x(t_n)‖² ) ≤ C h^{2 min(p,q)}.

In particular, the expected global error is of rate min(p, q):

    E( max_{n=0,...,N} ‖X_n − x(t_n)‖ ) ≤ C h^{min(p,q)}.    (40.4)

12 Theorem 40.5 is a simplified version of the most general result, Theorem 3.4 in Lie, Stuart, and Sullivan (2019).
13 The first version of this result, Theorem 2.2 in Conrad et al. (2017), was more restrictive in that the maximum was outside the expectation and f was assumed to be globally Lipschitz.
14 Note that Lie, Stahn, and Sullivan (2022) provided an extension of this theorem to operator differential equations in Banach spaces.

Proof. See Theorem 3.4 in Lie, Stuart, and Sullivan (2019). □

This is an important result for the following reason. Recall from Assumptions 40.1 and 40.2 that the randomised solver (40.3) tries to mimic the local numerical error of order O(h^{q+1}) by adding Gaussian15 noise with standard deviation in O(h^{p+1/2}). Now, Eq. (40.4) ensures that the expected global error converges with the lower of the rates p and q. This means that, if the added noise is sufficiently small (i.e. p ≥ q), the random output X_{0:N} from Eq. (40.3) is (in expectation) “not worse” than the deterministic estimate x̂(t_{0:N}) computed by the classical method Ψ_h - in the sense that both the expected global error of the former and the fixed global error of the latter are in O(h^q). However, if the added noise is larger than that (i.e. p < q), then the expected global error is only in O(h^p), i.e. larger than without randomisation.

15 Note that we only assume Gaussianity of perturbations for simplicity. In fact, Theorem 3.4 by Lie, Stuart, and Sullivan (2019) also holds for a class of non-Gaussian perturbations; see their Assumption 3.3.

This is intuitive. Loosely speaking, it means that one can at most perturb the local error (in O(h^{q+1})) by a slightly larger16 additive noise (in O(h^{q+1/2})) without reducing the global convergence rate. Accordingly, Conrad et al. (2017) recommend choosing p := q, i.e. adding the maximum admissible stochasticity that still preserves the accuracy of the underlying deterministic method Ψ_h.

16 Due to the independence of the random variables ξ_n(h_n), n = 1, ..., N, an order of h^{q+1/2} in the local noise is already sufficiently small to give an expected global error of h^q; see Remark 8 in Abdulle and Garegnani (2020).
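A minimal implementation of the recursion (40.3), along the lines of Exercise 40.3, might look as follows (our own sketch; the noise scale is a free choice, and p = q = 4 matches the fourth-order Runge-Kutta integrator):

```python
import numpy as np

def lorenz(x, s=10.0, r=28.0, b=8.0 / 3.0):
    # Lorenz system, Eq. (40.1).
    return np.array([s * (x[1] - x[0]),
                     x[0] * (r - x[2]) - x[1],
                     x[0] * x[1] - b * x[2]])

def rk4_step(f, x, h):
    # Classical fourth-order Runge-Kutta step (q = 4 in Assumption 40.1).
    k1 = f(x); k2 = f(x + h / 2 * k1); k3 = f(x + h / 2 * k2); k4 = f(x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def perturbed_solve(f, x0, h, N, scale, rng):
    # Recursion (40.3): a Gaussian perturbation with std O(h^{p+1/2}), p = q,
    # is added after every deterministic step.
    x = np.array(x0, dtype=float)
    out = [x.copy()]
    for _ in range(N):
        x = rk4_step(f, x, h) + scale * h ** 4.5 * rng.standard_normal(3)
        out.append(x.copy())
    return np.array(out)

rng = np.random.default_rng(0)
samples = [perturbed_solve(lorenz, [1.0, 1.0, 1.0], h=0.01, N=2000,
                           scale=10.0, rng=rng) for _ in range(20)]
# The samples coincide at first and then spread over the strange attractor,
# visualising the accumulated effect of the local numerical errors.
```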

► 40.2 Randomised Step Sizes for Geometric Integrators

The algorithm (40.3), however, has two shortcomings. First, even if the convergence rate is maintained (i.e. if p ≥ q), the additive noise will usually increase the global error for any fixed h > 0. Second, if the underlying integrator Ψ_h is geometric17 - i.e. it preserves geometric properties such as mass conservation, symplecticity, or conservation of first integrals - then these properties are not maintained after adding noise. As a remedy, Abdulle and Garegnani (2020) proposed to randomise the step sizes {h_n}_{n=1}^N instead of adding noise. Formally, this amounts to replacing each h_n with a random variable H_n. The resulting method then follows the recursion

    X_n = Ψ_{H_n}(X_{n−1}),   X_0 = x_0,    (40.5)

i.e. the stochasticity is, compared to Conrad et al.'s method (40.3), transferred from the random additive noise ξ_n(h_n) to the random step size H_n.18 Here, X_n is treated as a stochastic simulation of x(t_n), at time t_n = Σ_{i=1}^n h_i. And, indeed, this alternative perturbative method comes with the same convergence guarantees as in the additive case (Theorem 40.5). The following assumption on {H_n}_{n=1}^N takes the place of Assumption 40.2.

17 Hairer, Lubich, and Wanner (2006)
18 For a pictorial representation of Eq. (40.5), see Figure 2 in Abdulle and Garegnani (2020).

Assumption 40.6. The random variables {H_n}_{n=1}^N satisfy the following properties, for all n ∈ {1, ..., N}:

(i) H_n > 0 a.s.;

(ii) there exists h > 0 such that E(H_n) = h;

(iii) there exist p > 1/2 and C > 0 independent of n such that

    E( |H_n − h|² ) ≤ C h^{2p+1};

(iv) (if the integrator Ψ_t is implicit) H_n is a.s. small enough for Ψ_t to be well-posed.

Theorem 40.7 (Abdulle and Garegnani, 2020). Suppose Assumptions 40.1 and 40.6 hold. Furthermore, let f be globally Lipschitz and t_n = nh for n = 1, 2, ..., N, where Nh = T. Then, there exists a constant C > 0 (independent of h) such that the randomised probabilistic solver (40.5) converges in mean square. That is,

    max_{n=0,...,N} E( ‖X_n − x(t_n)‖² ) ≤ C h^{2 min(p,q)}.

In particular, the global maximum of expected errors is of rate min(p, q):

    max_{n=0,...,N} E( ‖X_n − x(t_n)‖ ) ≤ C h^{min(p,q)}.

Proof. See Theorem 2 in Abdulle and Garegnani (2020). □

Just like Theorem 40.5, Theorem 40.7 shows that the local error rate of q + 1 and the local standard deviation of p + 1/2 combine to a convergence rate of min(p, q). For the same reasons as above, it is again recommended to choose p = q. Note that, as in a weaker, earlier version of Theorem 40.5 by Conrad et al., the maximum here is outside of the expectation, and f is assumed to be globally Lipschitz. For Theorem 40.5, these restrictions were later lifted by Lie, Stuart, and Sullivan (2019); maybe this is also possible for Theorem 40.7. Since the desired properties of geometric integrators hold for all h > 0, they a.s. carry over to a sample of {X_n} from Eq. (40.5).19

19 Abdulle and Garegnani (2020), Thm. 4
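For comparison with Eq. (40.3), here is an equally minimal sketch of the random-step recursion (40.5) (again our own, with explicit Euler, i.e. q = 1, as the underlying integrator and a uniform step-size perturbation satisfying Assumption 40.6):

```python
import numpy as np

def euler_step(f, x, h):
    return x + h * f(x)          # explicit Euler: q = 1 in Assumption 40.1

def random_step_solve(f, x0, h, N, p, rng):
    # Recursion (40.5): each step size H_n is random with E[H_n] = h and
    # Var[H_n] = O(h^{2p+1}), cf. Assumption 40.6; here H_n is uniform.
    x, t, path = float(x0), 0.0, [(0.0, float(x0))]
    for _ in range(N):
        H = h + h ** (p + 0.5) * rng.uniform(-0.5, 0.5)
        x, t = euler_step(f, x, H), t + H
        path.append((t, x))
    return path

rng = np.random.default_rng(0)
f = lambda x: -x                 # test ODE x' = -x
samples = [random_step_solve(f, 1.0, h=0.05, N=100, p=1.0, rng=rng)
           for _ in range(10)]
# Note that the (random) time grids differ between samples; since each
# sample is an exact run of the integrator with positive step sizes, the
# geometric properties of the integrator carry over to every sample.
```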
Notably, both of these methods can be thought of as frequentist, as they sample i.i.d. approximations of x(t). The particle ODE filter (from §38.4), on the other hand, is Bayesian, as it computes a dependent set of samples that approximate the true posterior distribution. While there are first experimental comparisons (Tronarp et al., 2019, §5.4), more research is needed to understand the differences between all of these nonparametric solvers. There are further important methods - such as the aforementioned one by Chkrebtii et al. (2016), as well as stochastic versions of linear multistep methods20 and of implicit solvers.21 Finally, note that Abdulle and Garegnani (2021) recently published an extension of their ODE solver (40.5) to PDEs by randomising the meshes in finite-element methods.

20 Teymur, Zygalakis, and Calderhead (2016)
21 Teymur et al. (2018)
Figure 40.1: Estimating an Arenstorf orbit. Each panel shows two large point masses (black and grey circle, representing earth and moon) orbiting around each other, along with the true (black curve) and estimated orbits (grey curves) of a spacecraft following the IVP (40.6)-(40.7). The estimated orbits are computed by different probabilistic-numerical methods: the additive-noise perturbative method (40.3) (left plot), the randomised-step perturbative method (40.5) (centre plot), and the EKF0 with 1-times IWP prior and R = 0 (§38.3.2) (right plot). Both perturbative methods use Heun's method (i.e. q = 2 in Assumption 40.1) as the underlying integrator Ψ, such that they are comparable with the filter due to Proposition 39.11. All solvers receive the same budget of 50,000 equidistant steps; the perturbative methods compute two samples, each with 25,000 equidistant steps. Already for two samples, we can see that the filtering mean (because it takes twice as many steps) is a much more accurate estimate than any of the samples. This difference will be even larger when the budget is split over more samples. Details in text.

► 40.3 Perturbative vs Gaussian Methods

Ultimately, all of these nonparametric solvers offer non-Gaussian uncertainty quantification at a higher computational cost than Gaussian ODE filters and smoothers - in the number of evaluations of f, not necessarily in wall-clock time, as further discussed below. Unfortunately, at the time of writing, there are no strong analytical tools available yet to explain or study the value of this structured uncertainty. Generally speaking, we expect perturbations to be practical in the same settings as particle ODE filtering; see the discussion in §38.6. But the utility of nonparametric (vs Gaussian) methods remains a hotly debated topic because of the inevitable trade-off between computational speed and statistical rigour.

While other opinions exist, proponents of fast-but-approximate methods (such as the authors of this text) have phrased this trade-off as follows. Assume that a perturbative solver would need S samples to capture the numerical distribution of trajectories “sufficiently well”, for a step size h > 0. Then, the budget of these samples could (absent parallelisation) equally well be
of these samples could (absent parallelisation) equally well be
spent on one sample with step size h/S. This single estimate
will be more precise, and in some (but not all)22 settings, one
precise sample might be preferable to a set of S less precise
samples. But, this sample will offer no uncertainty quantifica­ 22 See Figure 38.2 for an example where

multiple samples may be preferable.


tion whatsoever: it is just a perturbed classical method and has
no advantage over its unperturbed version. Extended Kalman
ODE filters, on the other hand, are designed to compute a good
estimate together with a calibrated standard variation as an
uncertainty estimate - which of course also causes overhead,
but not in the amount of evaluations of f .
For a depiction of this intuition, consider Figure 40.1. It
reproduces a classic three-body problem from astronomy: the
search for a periodic “Arenstorf orbit” of a spacecraft between
40 Perturbative Solvers 337

two astronomical objects.23 This concrete example setup is taken 23 Hairer, N0rsett, and Wanner (1993),
from Hairer, Norsett, and Wanner (1993, p. 129). Figure 40.1 p.129
shows two bodies (representing earth and moon) of mass 1 — ц
and ц, respectively, rotating in a plane around their common
centre of mass, with a third body of negligible mass orbiting
them in the same plane. The equations of motion are given by
the ODE system

    x_1'' = x_1 + 2x_2' − μ̄ (x_1 + μ)/D_1 − μ (x_1 − μ̄)/D_2,
    x_2'' = x_2 − 2x_1' − μ̄ x_2/D_1 − μ x_2/D_2,    (40.6)
    D_1 = ((x_1 + μ)² + x_2²)^{3/2},
    D_2 = ((x_1 − μ̄)² + x_2²)^{3/2},
    μ = 0.012277471 and μ̄ = 1 − μ.

It is known that there are initial values that give closed, periodic orbits. One example is

    x_1(0) = 0.994,   x_1'(0) = 0,   x_2(0) = 0,
    x_2'(0) = −2.00158510637908252240537862224,
    T = 17.0652165601579625588917206249.    (40.7)
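For readers who want to reproduce the setup, the following sketch (our own) encodes the right-hand side (40.6) with the initial values (40.7) and computes a tight reference solution with SciPy's classical, non-probabilistic solver:

```python
import numpy as np
from scipy.integrate import solve_ivp

mu = 0.012277471
mu_bar = 1.0 - mu

def arenstorf(t, u):
    # State u = (x1, x2, x1', x2'); right-hand side of Eq. (40.6).
    x1, x2, dx1, dx2 = u
    d1 = ((x1 + mu) ** 2 + x2 ** 2) ** 1.5
    d2 = ((x1 - mu_bar) ** 2 + x2 ** 2) ** 1.5
    return [dx1, dx2,
            x1 + 2 * dx2 - mu_bar * (x1 + mu) / d1 - mu * (x1 - mu_bar) / d2,
            x2 - 2 * dx1 - mu_bar * x2 / d1 - mu * x2 / d2]

T = 17.0652165601579625588917206249
u0 = [0.994, 0.0, 0.0, -2.00158510637908252240537862224]
ref = solve_ivp(arenstorf, (0.0, T), u0, rtol=1e-12, atol=1e-12)
# ref.y[0], ref.y[1] trace the (approximately closed) Arenstorf orbit;
# the tight tolerances are needed because the orbit is highly sensitive.
```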

Each plot shows the true solution (found by a Runge-Kutta solver with tiny step size) as a black curve, which is approximated by a different probabilistic ODE solver (grey curves) in each plot. The first plot depicts two samples from the additive-noise perturbative solver (40.3), the second two samples from the randomised-step perturbative solver (40.5), and the third the posterior mean of an EKF0 with 1-times IWP prior (all based on Heun's method). In the first plot, the employed additive-noise model is ξ_n(h) ∼ N(0, [10⁴ · h^{2q+1}] · I) for each step n; see Assumption 40.2. In the second plot, the randomised step H_n is drawn from a uniform distribution, H_n ∼ U(h − 10⁴ · h^{2q+1}, h + 10⁴ · h^{2q+1}), for each step n; see Assumption 40.6. For both perturbative solvers, the width of the distribution was scaled such that the samples are visibly different while maintaining the quality of the original integrator Ψ.

These plots visualise the trade-off between statistical flexibility and computational speed: given a fixed budget, since the perturbative methods can take fewer steps per sample, the precision of the samples is greatly reduced - compared to a single estimator with more steps (from ODE filtering or otherwise). This effect is already pronounced for two samples, and will only increase for a larger number of samples. Since a numerical error acts like a statistical bias, the quality of the empirical mean of the samples cannot be expected to rise as the number of (independent) samples goes up - to increase the quality, one has to increase the number of steps per solve, not the number of solves.

While the ODE filter appears more useful in this experiment, it should be noted that ODE filters also incur an overhead to compute their posterior uncertainty. While it stays inside the linear O(N) cost in the number of steps N, the constant is increased by a margin depending on the implementation. A comparison in wall-clock time could therefore be different, but is difficult to study in a principled way, as it will depend on the ODE (e.g. on the cost of evaluating f). Nonetheless, we emphasise that the wall-clock overhead of ODE filters can instead be spent on a limited number of samples - even in the fast implementation of ODE filters in the ProbNum package (§38.7).
Moreover, perturbative methods do pick up more structure than Gaussian solvers. While it is difficult to assess in general which additional structure, this can be recognised more clearly in specific situations. For example, Chkrebtii et al. (2016) and Conrad et al. (2017) demonstrated that their solvers pick up the strange attractor of the Lorenz equations (40.1). And in computational neuroscience, Oesterle et al. (2021) recognised that the numerical uncertainty from solving the ODEs of neuronal models, such as the Hodgkin-Huxley model, critically affects how accurately they are simulated. This effect can be sufficiently strong to change the behaviour of the simulation qualitatively, even adding or removing individual spikes from the simulation. Notably, they were able to capture this structure with the perturbative solvers of Conrad et al. (2017), Eq. (40.3), and of Abdulle and Garegnani (2020), Eq. (40.5). It seems difficult to achieve this with Gaussian methods.
Which type of probabilistic solver should be used, then, depends on the application and on the role of uncertainty quantification over point estimation in it. If the error estimate is more of a diagnostic for the quality of the point estimate, ODE filters may make better use of computational resources. If a structured uncertainty estimate is sought, then independent paths may be a more useful tool - in particular for chaotic or bifurcating ODEs like the one in Figure 40.1. However, we stress again that, at the time of writing, the meaning of the structure of the independent samples is poorly understood from an analytical perspective. Future research will hopefully uncover a clearer picture.
41
Further Topics

The preceding sections of this chapter on ODEs were mainly concerned with forward solutions of IVPs. While we aspired to provide a comprehensive treatment of these algorithms and their theory above, there is more related research that we did not cover. In this section, we touch upon some of these topics briefly and provide additional pointers to related publications.

► 41.1 Boundary Value Problems

In Definition 36.1, we defined a boundary value problem (BVP) as an IVP with an additional final value:1,2

    ẋ(t) = f(x(t), t) for all t ∈ [0, T],
    with boundary values x(0) = x_0 and x(T) = x_T.

1 We here inserted the argument t into f because it is relevant to explain the work by John, Heuveline, and Schober (2019); everywhere else, we could w.l.o.g. omit it.
2 Note that, in general, as mentioned in Definition 36.1, there is a more general boundary condition g(x(0), x(T)) = 0. Accordingly, the referenced probabilistic BVP solvers are not limited to our restrictive definition, which only serves to declutter the notation.

In the same way as for IVPs (§38.2), we can model the BVP solution x(t) and its derivative x'(t) by a GP prior p(X(t_{0:N})), such that x(t) = H_0 X(t) and x'(t) = H X(t) are contained in X(t). The BVP posterior distribution is now written

    p(X(t_{0:N}) | z_{1:N} = 0, x(T) = x_T),    (41.1)

i.e. it is (compared with IVPs) additionally conditioned on x(T) = x_T. If f is linear, i.e. if f(x, t) = M(t)x + s(t), the likelihood is linear, and the BVP posterior (41.1) for any Gaussian prior p(X(t)) can be computed in closed form by standard GP regression (§4.2). The resulting probabilistic BVP solver was proposed by John et al.,3 based on previous work on probabilistic solvers for linear PDEs.4 Furthermore, John et al. generalised this idea to a nonlinear f by use of quasi-linearisation, an application of Newton's method that partitions the BVP into a series of linear BVPs to which the linear solver is then applied.

3 John, Heuveline, and Schober (2019), §3
4 Cockayne et al. (2017a), §3
340 VI Solving Ordinary Differential Equations

Note, however, that these methods do not restrict the prior to


Markov processes, and therefore do not exploit the linear-time
state-space formulation of GPs that made ODE filtering and
smoothing possible for IVPs.
Fortunately, this gap was closed by constructing a ssm for BVPs.5 To see how, recall that our ssm (38.10)-(38.13) was designed for IVPs and thus does not include the final condition x(T) = x_T. But it can be extended to BVPs by adding the Dirac likelihood

p(x_T | X_N) = δ(x_T − H_0 X_N),

to incorporate x_T as data on x(T) = H_0 X_N (recall that T = Nh). Inference in this augmented ssm is now the same as in the original ssm - with the sole difference that, in the final step, the solver does not only condition on z_N = 0 but also on H_0 X_N = x_T.6 Consequently, as for IVPs, we can either use the EKS0 or EKS1,7 after linearising f, or employ an iterated extended Kalman smoother (IEKS) to compute the MAP estimate x*(t_{0:N}) from Eq. (38.30).8 In fact, Kramer and Hennig (2021) demonstrated that the latter solver (IEKS) converges quickly on test problems. It yields (due to its linear-time complexity) a significant speed-up over John, Heuveline, and Schober (2019), includes step-size selection as well as hyperparameter-calibration schemes, and is therefore the state of the art in pn for BVPs.
5 Kramer and Hennig (2021)
6 A graphical model for this augmented ssm with two likelihoods is presented in Figure 2 of Kramer and Hennig (2021). Note, however, that they also include the initial value x_0 with a likelihood - instead of regarding it as part of the prior, as we did in §38.2.1. Despite this, both approaches are equivalent in that they yield the same algorithms. In fact, Kramer and Hennig also take up our prior view by introducing the concept of a bridge prior, i.e. a prior that incorporates both the initial and final value.
7 For BVPs, both boundary values contain equal information. Hence, filtering is less natural as it moves only in one direction through time.
8 See §38.3.2 and §38.3.3 for these two approaches, respectively.
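To make the extra conditioning step concrete, here is a minimal sketch (ours, not the implementation of Kramer and Hennig (2021); all numbers are hypothetical) that conditions a Gaussian filtering state N(m, P) on the Dirac likelihood above, i.e. performs a noise-free Kalman update with the boundary datum x_T:

import numpy as np

def condition_on_boundary_value(m, P, H0, xT):
    """Condition the Gaussian state N(m, P) on H0 @ X = xT (no noise)."""
    S = H0 @ P @ H0.T                  # innovation covariance
    K = P @ H0.T @ np.linalg.inv(S)    # Kalman gain
    return m + K @ (xT - H0 @ m), P - K @ S @ K.T

# toy final state [x(T), x'(T)] with hypothetical moments:
m = np.array([1.2, -0.3])
P = np.array([[0.5, 0.1], [0.1, 0.2]])
H0 = np.array([[1.0, 0.0]])            # projects the state onto x(T)
m_new, P_new = condition_on_boundary_value(m, P, H0, np.array([1.0]))
print(m_new, P_new)                    # the x(T)-marginal collapses onto xT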
Such probabilistic BVP solvers have (in previous forms without the linear-time formulation) been applied to numerically approximate shortest paths and distances on Riemannian manifolds. This approach has concrete applications in fields such as Riemannian statistics and brain imaging, where quantifying the numerical uncertainty plays an important role for the final objective. This was first recognised by Hennig and Hauberg (2014) and further exploited by Schober et al. (2014). The current state of the art was provided by Arvanitidis et al. (2019). However, note that, in principle, all of the above (newer) probabilistic BVP solvers are applicable to this task as well.

► 41.2 Inverse Problems

One main purpose of pn is to consistently pass numerical un­


certainty along computational chains. This should enable an
efficient distribution of computational budget, in order to minimise the negative effect of all sources of uncertainty on the final objective.9
9 See paragraph "Pipelines of computation demand harmonisation" in Section "This Book" in the Introduction of this text.
A particularly well-studied chain involving ODEs is the ODE inverse problem.

Figure 41.1: Depiction of a generic inverse problem. The forward map (likelihood) carries a parameter θ ∈ Θ to a simulation F(θ) ∈ R^d; the inverse problem maps a noisy F(θ_0) back to θ_0. An ODE inverse problem is the special case where the output F(θ) is the solution x : [0, T] → R^d of the parametrised IVP, Eq. (41.2).

Inverse problems, in general, consist of inferring the parameter for which some forward map outputs some given data. More formally, assume that some forward map F : R^n → R^d is given and that we can simulate its output F(θ) ∈ R^d for any given parameter θ ∈ Θ ⊂ R^n. Given such a (potentially noisy) simulation F(θ_0), the inverse problem now conversely consists of recovering the unknown parameter θ_0 that gave rise to this data F(θ_0); see Figure 41.1. This estimation is complicated by the fact that many inverse problems are ill-posed, e.g. when there are two parameters θ_0 ≠ θ_0′ such that F(θ_0) = F(θ_0′).
Standard inverse problem solvers simply try out several parameters ("samples") {θ_i}_{i=0}^S and compare their simulations {F(θ_i)}_{i=0}^S with F(θ_0). Roughly speaking, the smaller ‖F(θ_i) − F(θ_0)‖, the more likely it is considered for θ_i to be close to θ_0. This process is usually guided by statistical principles such as MCMC. In fact, random-walk Metropolis methods remain the go-to solution for most applications.10 While this is valid in the limit of infinitely many simulations, the number of samples S required to infer θ_0 satisfactorily can be very large. Thus, entire research fields such as approximate Bayesian computation (ABC), simulation-based inference, and likelihood-free inference have taken aim at reducing this high computational cost.11
10 Tarantola (2005), §2.4
11 A recent survey of these closely related fields was provided by Cranmer, Brehmer, and Louppe (2020).

An ODE inverse problem is now simply an inverse problem whose forward map is the solution of an IVP (36.2) with added parameter θ ∈ Θ ⊂ R^n,12 i.e. of the IVP

ẋ(t) = f(x(t), θ)  for all t ∈ [0, T],  with initial value x(0) = x_0 ∈ R^d.  (41.2)

12 The initial value x_0 can also be added to the parameter vector θ.

Simulating the forward map F thus amounts to approximating the solution of the IVP (41.2), i.e. F(θ) = x_θ, for any chosen θ. The data, which we denote by z, is thus equal to x_{θ_0} = F(θ_0) (under additive, zero-mean Gaussian noise) at M discrete time points 0 ≤ t_1 < ⋯ < t_M ≤ T, i.e.

z(t_i) := x(t_i) + ξ_i ∈ R^d,  ξ_i ∼ N(0, σ²I_d),

which we stack to form a data vector z written

z = [z_1(t_1), …, z_1(t_M), …, z_d(t_1), …, z_d(t_M)]ᵀ ∈ R^{dM}.

Again, we will w.l.o.g. assume d = 1. Analogously, we combine the values of the true solution x_{θ_0} at {t_i}_{i=1}^M to form the stacked vector x_{θ_0}, such that the probability of observing z reads

p(z | x_{θ_0}) = N(z; x_{θ_0}, σ²I_M).  (41.3)

To obtain the likelihood of the forward problem, we now (given some θ) have to integrate over p(x_θ | θ). In classical methods (i.e. without uncertainty quantification), p(x_θ | θ) is assumed to be a.s. equal to the employed numerical estimate x̂_θ, i.e. p(x_θ | θ) := δ(x_θ − x̂_θ). This yields the uncertainty-unaware likelihood, written

p(z | θ) = ∫ p(z | x_θ) p(x_θ | θ) dx_θ  (41.4)
         = N(z; x̂_θ, σ²I_M)  by (41.3),  (41.5)

which is then tacitly taken for the "true" likelihood.


Fortunately, the output of any probabilistic ODE solver, when applied to the parametrised IVP (41.2) with fixed θ, is (by its very nature) a carefully designed version of p(x_θ | θ).13 It is thus natural to, after choosing a probabilistic ODE solver, insert its output distribution for p(x_θ | θ) in Eq. (41.4). This yields a different p(z | θ) than the conventional uncertainty-unaware likelihood (41.5). For large step sizes h > 0 (i.e. for large numerical uncertainty relative to the statistical noise σ²), the resulting uncertainty-aware likelihood indeed corrects for the overconfidence of the unaware likelihood, as is demonstrated in Figure 41.2 for the EKF0.
13 Remember that, due to multiple trade-offs (fast vs accurate, Gaussian vs non-parametric, choice of prior, etc.), there is no uniquely correct way to construct p(x_θ | θ) - just like there is no single true regression for any data. Nonetheless, the distinction between ignoring and accounting for the uncertainty is pivotal.

In this new statistical model, one can now do statistical inference of θ_0 as needed. Depending on the precise shape of p(z | θ) (stemming from the probabilistic solver), different inference schemes recommend themselves and have been explored in the literature - both for perturbative solvers14 and Gaussian ODE filters.15
14 The list of perturbative inverse-problem solvers includes the papers by Chkrebtii et al. (2016), Conrad et al. (2017), Teymur et al. (2018), Lie, Sullivan, and Teckentrup (2018), and Abdulle and Garegnani (2020).
15 Kersting et al. (2020)

In the case of perturbative solvers, most publications have followed the Bayesian approach to inverse problems,16 wherein a prior p(θ) over the parameter space Θ is added to the model. Solving the corresponding Bayesian inverse problem now consists of computing the posterior distribution

p(θ | z) ∝ p(θ) ∫ p(z | x_θ) p^h(x_θ | θ) dx_θ,  (41.6)

where we inserted Eq. (41.4) for the likelihood, with superscript h added to p(x_θ | θ) to highlight the dependence on the step size h.
16 Dashti and Stuart (2017)

Figure 41.2: Uncertainty-(un)aware likelihoods w.r.t. (θ_1, θ_2) of the Lotka-Volterra ODE ẋ_1 = θ_1x_1 − θ_2x_1x_2, ẋ_2 = −θ_3x_2 + θ_4x_1x_2, with fixed (θ_3, θ_4) = (0.05, 0.5) and x_0 = (20, 20). An EKF0 with 1-times IWP prior was used to construct p(x_θ | θ), leading to the uncertainty-aware likelihood (41.12) in the top row. The black cross marks the true parameter. The unaware likelihood (bottom row) is overconfident for the large step size (h = 0.2), i.e. for large P, while the aware likelihood is well-calibrated such that the true parameter has non-zero likelihood. For the small step size (h = 0.025) this effect is less pronounced since P is small. This figure is adapted from Figure 2 in Kersting et al. (2020). This removal of overconfidence has also been demonstrated for perturbative solvers, see e.g. Figures 5 and 6 in Conrad et al. (2017), or Figures 3 and 11 in Abdulle and Garegnani (2020).
The posterior, Eq. (41.6), is (in general) not available in closed form and has to be approximated by a suitable sampling scheme. The aforementioned reduction of overconfidence (see Figure 41.2) in this uncertainty-aware posterior has been demonstrated in multiple experiments.17 Due to the stochasticity of the forward map p^h(x_θ | θ), a pseudo-marginal MCMC approach is particularly suited to approximate the posterior.18 Convergence guarantees of such inference schemes to the numerical-error-less posterior (i.e. when indeed x̂_θ = x_θ for all θ), as h → 0, were proved by Lie, Sullivan, and Teckentrup (2018).
17 See e.g. Figures 5 and 6 in Conrad et al. (2017), or Figures 3 and 11 in Abdulle and Garegnani (2020).
18 Abdulle and Garegnani (2020), §8
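The sampling loop itself is compact. The following sketch (our illustration, not a reference implementation) uses a toy perturbative Euler solver in the spirit of Eq. (40.3) for ẋ = −θx as the stochastic forward map, and a random-walk Metropolis sampler that keeps the noisy likelihood estimate of the current state, which lends the scheme its pseudo-marginal flavour; all functions and numbers here are hypothetical:

import numpy as np

def solve_ode(theta, rng, x0=1.0, h=0.1, T=2.0):
    # toy perturbative Euler solver for x' = -theta * x, in the spirit of
    # Eq. (40.3): each step is perturbed at the local-error scale h^2
    x, out = x0, []
    for _ in range(int(T / h)):
        x = x + h * (-theta[0] * x) + h**2 * rng.standard_normal()
        out.append(x)
    return np.array(out)

def log_likelihood(theta, z, sigma, rng):
    x_hat = solve_ode(theta, rng)              # stochastic forward map
    return -0.5 * np.sum((z - x_hat)**2) / sigma**2

def metropolis(z, sigma, theta0, n_iter=5000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    logp = log_likelihood(theta, z, sigma, rng)   # flat prior assumed
    samples = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.shape)
        logp_prop = log_likelihood(prop, z, sigma, rng)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop      # keeping the noisy estimate
        samples.append(theta.copy())           # of the current state gives
    return np.array(samples)                   # the pseudo-marginal flavour

z = solve_ode(np.array([1.5]), np.random.default_rng(42))  # synthetic data
samples = metropolis(z, sigma=0.1, theta0=1.0)
print(samples.mean(), samples.std())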

In the case of ODE filters and smoothers, the (to date) only publi­
cation is Kersting et al. (2020). Like the perturbative approaches,
it managed to reduce the overconfidence in the likelihood by in­
serting an EKF0 in lieu of a classical ODE solver, as Figure 41.2
demonstrates. But, on top of that, it exploited the resulting,
more structured Gaussian form of the likelihood to estimate its
gradients and Hessian matrices in the following way.
First, let us assume w.l.o.g. that h > 0 is fixed, and recall, from Eqs. (38.1) and (38.2), that the functions x and ẋ are a priori jointly modelled by a Gauss-Markov process, written

p([x; ẋ]) = GP([x; ẋ]; [x_0; f(x_0)], [[k, k∂], [∂k, ∂k∂]]),

where the functions k∂ = ∂k(t, t′)/∂t′, ∂k = ∂k(t, t′)/∂t, and ∂k∂ = ∂²k(t, t′)/∂t∂t′ are derivatives of the covariance function k. The EKS019 thus computes (for a fixed θ ∈ Θ) a posterior GP whose marginals at the data time points {t_i}_{i=1}^M read

p(x_θ | θ) = N(x_θ; m_θ, P),  (41.7)

19 While Kersting et al. (2020) considered the EKF0, we here present the analogous construction for the EKS0 because it is easier to comprehend.

where the posterior mean is given by20

m_θ = x_0 1_M + k∂_{T_M T_N} [∂k∂_{T_N T_N}]⁻¹ [y_{1:N} − f(x_0, θ) 1_N],  (41.8)

with T_N = {jh}_{j=1}^N, y_{1:N} = [f(m⁻(h), θ), …, f(m⁻(Nh), θ)]ᵀ, and 1_M = [1, …, 1]ᵀ ∈ R^M. While the dependence of m⁻ on θ is at least as involved as the dependence of f(x, θ) on θ, the following simplifying assumption fortunately holds for many ODEs.21
20 We here, for its simplicity, use the notation of the linear-Gaussian ssm (38.31)-(38.33) to work with the EKF0 and EKS0. This is admissible because Kalman filtering (and smoothing) on this ssm is equal to the EKF0 and EKS0 on the general ssm (38.10)-(38.13), which yields Eq. (41.8); see §38.3.4. For notational simplicity we set R = 0, but a non-zero R > 0 can simply be added to ∂k∂_{T_N T_N} if needed.
21 In fact, most ODEs collected in Hull et al. (1972, Appendix I), a standard set of ODE benchmarking problems, satisfy Assumption 41.1 either immediately or after re-parametrisation. In particular, Assumption 41.1 does not restrict the numerical difficulty of the ODE since f inherits all inconveniences from the {f_i}_{i=1}^n, for which nothing (but the usual continuous differentiability) is assumed.

Assumption 41.1. The dependence of f on θ is linear. More precisely, we assume that f(x, θ) = ∑_{i=1}^n θ_i f_i(x), for some continuously differentiable f_i : R^d → R^d, for all i = 1, …, n. (Note that no linearity assumption is placed on the f_i.)

Under this assumption, Eq. (41.8) becomes22

m_θ = [1_M  J] [x_0; θ] = x_0 1_M + Jθ,  (41.9)

where the Jacobian-matrix estimator J = KY is the product of the kernel pre-factor23

K = k∂_{T_M T_N} [∂k∂_{T_N T_N}]⁻¹ ∈ R^{M×N}  (41.10)

and the data matrix Y ∈ R^{N×n} with entries

[Y]_{ij} = f_j(m⁻(ih)) − f_j(x_0),  (41.11)

22 Note that Eq. (41.9) reveals how easily x_0, if unknown, can be treated as an additional parameter.
23 K is called a pre-factor because it is (absent any adaptation of the ssm to θ) independent of θ and can thus be pre-computed for all forward solutions at once.

where f_j is as in Assumption 41.1. It is important to note that our notation omits that [Y]_{ij} actually depends on θ via the predictive mean m⁻(ih). For any i ∈ {1, …, N}, this m⁻(ih) changes with θ through the impact of θ on the previous steps up to time t = ih - which is a nonlinear and potentially highly sensitive dependence. Nonetheless, all local ("first-order") effects are captured by the linearisation, Eq. (41.9). Therefore, it is not unreasonable to hope that J will be a useful estimator of the true Jacobian of the map θ ↦ m_θ, despite ignoring the global ("higher-order") dependence of the true Jacobian on θ via Y.24 In any case, Kersting et al. (2020) proceeded to demonstrate the usefulness of this heuristic estimator J for the inference of θ, as we will see next.
24 The precise Jacobian matrix can be computed at an additional cost by sensitivity analysis (Rackauckas et al., 2018). The Jacobian estimator J, on the other hand, comes as an almost-free byproduct of the forward solution with EKF0 or EKS0: it is simply the product of the pre-computable K (41.10) and the previously collected function evaluations (41.11). Exact sensitivity analysis, in contrast, requires more expensive global computations along the entire time axis [0, T].
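As an illustration of Eqs. (41.9)-(41.11), the following sketch assembles the pre-factor K and the data matrix Y for a scalar toy problem. The RBF kernel is our assumption for illustration (the book's ssm corresponds to integrated-Wiener-type covariances), m_pred stands in for the predictive means m⁻(ih) that a real implementation would collect during the forward EKF0/EKS0 pass, and all numbers are made up:

import numpy as np

ell = 0.5                                   # assumed kernel length scale
k = lambda s, t: np.exp(-(s - t)**2 / (2 * ell**2))
k_d = lambda s, t: k(s, t) * (s - t) / ell**2                # k∂
dkd = lambda s, t: k(s, t) * (1/ell**2 - (s - t)**2/ell**4)  # ∂k∂

h, N = 0.1, 20
TN = h * np.arange(1, N + 1)                # solver grid {jh}
TM = np.array([0.5, 1.0, 1.5])              # data time points

# kernel pre-factor K of Eq. (41.10): independent of theta, precompute once
K_pre = k_d(TM[:, None], TN[None, :]) @ np.linalg.inv(
    dkd(TN[:, None], TN[None, :]) + 1e-10 * np.eye(N))

# data matrix Y of Eq. (41.11), for f(x, theta) = theta_1 x + theta_2 x^2
x0 = 1.0
m_pred = np.exp(-TN)                        # placeholder for m⁻(ih)
f_basis = [lambda x: x, lambda x: x**2]     # the f_i of Assumption 41.1
Y = np.stack([f(m_pred) - f(x0) for f in f_basis], axis=1)

J = K_pre @ Y                               # Jacobian estimate, shape (M, n)
print(J)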

To this end, we observe that, again, insertion of the probabilistic numerical likelihood (41.7) for p(x_θ | θ) in Eq. (41.4) yields an uncertainty-aware likelihood

p(z | θ) = N(z; m_θ, P + σ²I_M),  (41.12)

which is depicted in Figure 41.2. By plugging (41.9) into (41.12), we now obtain the estimators

∇_θ L(z) := −Jᵀ [P + σ²I_M]⁻¹ [z − m_θ],  and  (41.13)
∇²_θ L(z) := Jᵀ [P + σ²I_M]⁻¹ J  (41.14)

for the gradient and Hessian of the corresponding log-likelihood L(z) := log p(z | θ).25 Both of these estimators can then be inserted into any existing gradient-based sampling or optimisation method to infer θ_0. Classically, such gradient and Hessian estimators are not accessible without an additional sensitivity analysis (Rackauckas et al., 2018).
25 Note that the sizes of these gradients and Hessians scale, as desired, inversely with the combined uncertainty of the numerical error P and statistical noise σ²I_M. This means that they inherit the uncertainty-awareness of the probabilistic likelihood (41.12).
To summarise, we have seen how the EKF0 gives rise to an uncertainty-aware26 likelihood (41.12) with freely available estimators for its gradient (41.13) and Hessian (41.14). Our exposition followed Kersting et al. (2020) who, however, executed this strategy for the EKF0 (and not the EKS0). The filtering case is more complicated (because the equivalence with GP regression only holds locally), but essentially analogous. For the filtering case, the experiments in Kersting et al. (2020) demonstrate that, indeed, the resulting gradient-based inference schemes are significantly more efficient (i.e. they need fewer forward ODE solutions) than their gradient-free counterparts - both for MCMC sampling and for optimisation. Recently, Tronarp, Bosch, and Hennig (2022) introduced another method to solve ODE inverse problems by insertion of ODE filtering/smoothing into the likelihood.
26 See Figure 41.2.
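Given J and the likelihood moments of Eq. (41.12), the estimators (41.13) and (41.14) reduce to a few lines of linear algebra. The sketch below uses hypothetical inputs (e.g. the J from the previous snippet) and hints, in the final comment, at how they slot into a Newton-type update:

import numpy as np

def grad_hess_estimates(J, P, sigma, z, m_theta):
    """Gradient (41.13) and Hessian (41.14) estimates for log p(z | theta)."""
    S = P + sigma**2 * np.eye(P.shape[0])     # combined uncertainty (41.12)
    grad = -J.T @ np.linalg.solve(S, z - m_theta)
    hess = J.T @ np.linalg.solve(S, J)
    return grad, hess

# hypothetical numbers standing in for a forward EKF0/EKS0 pass:
J = np.array([[0.8, 0.1], [0.5, 0.3], [0.2, 0.4]])
P = 0.01 * np.eye(3)
z, m_theta = np.array([1.1, 0.7, 0.5]), np.array([1.0, 0.8, 0.45])
g, H = grad_hess_estimates(J, P, 0.1, z, m_theta)
# e.g. a Newton-type step on theta from these almost-free estimates:
# theta -= np.linalg.solve(H + 1e-8 * np.eye(2), g)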

Future research should try to link these two uncertainty-aware approaches, MCMC (by perturbations) and Gaussian approximations (by filtering). Notably, the particle ODE filter or smoother (§38.4) could alternatively be used to compute p^h(x_θ | θ) in the perturbative scheme of Eq. (41.6).
Finally, we want to point out that the numerical error in inverse problems can also be modelled without explicitly employing a probabilistic ODE solver - by instead modelling the discretisation errors as random variables and estimating their variance jointly with the ODE parameter (Matsuda and Miyatake, 2021).

► 41.3 Probabilistic Synthesis of Numerical Simulations and


Observational Data

Recall that one ensuing benefit of a well-designed probabilistic numerical method is the possibility to extend its probabilistic model to include observational data.27 Classical methods, in contrast, do not treat numerical information probabilistically and are thus not able to jointly exploit numerical and statistical information. pn envisions leveraging this larger set of information for better, faster inference.
27 For a more detailed description, recall the paragraph "pn consolidates numerical computation and statistical inference" from the Introduction.
For ODE filters and smoothers, this hope was recently put into practice.28 More precisely, they extended the ssm (38.10)-(38.13) to include additional linear observations y^{obs}_{1:M} = [y_1^{obs}, …, y_M^{obs}] of the ODE solution x at specific time points through another likelihood

p(y_n^{obs} | x_n) = N(y_n^{obs}; H^{obs} x_n, R^{obs}),  (41.15)

with H^{obs} ∈ R^{k×D} and 0 ≼ R^{obs} ∈ R^{k×k}, for some M ∈ N and k ∈ {1, …, D}. By this construction, this statistical data y^{obs} is incorporated in the same way as the numerical data z = [z_1, …, z_N] - in the sense that the observation likelihood (41.15) has the same form as the numerical likelihood (38.12). Hence, the resulting extended ssm is nothing but the previous ssm whose likelihood now also includes (41.15). This model is particularly advantageous because it is still a probabilistic ssm.29
28 Schmidt, Kramer, and Hennig (2021)
29 See Figure 3 in Schmidt, Kramer, and Hennig (2021) for the corresponding graphical model.
For the application of inferring the latent force model of an
ODE from observational data, Schmidt, Kramer, and Hennig
(2021) go on to detail how to add a dynamic model for the
latent force to the extended ssm, and how to then use the
EKF1/EKS1 to infer this latent force model from data - in
a single filtering/smoothing loop, i.e. jointly with the ODE
solution and in linear time. This effectively removes an entire
outer for-loop that is usually wrapped around classical solvers,
leading to drastic performance increases (in real, wall-clock
time, even accounting for the computational overhead of the
EKF1/EKS1 over classic packaged solvers).
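A minimal sketch of the resulting two-likelihood update follows: one Kalman step conditions the predicted state first on the numerical observation z_n = 0 (in an EKF0-flavoured linearisation) and then on a data likelihood of the form (41.15). All models and numbers here are placeholders, not the setup of Schmidt, Kramer, and Hennig (2021):

import numpy as np

def kalman_update(m, P, H, y, R):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return m + K @ (y - H @ m), P - K @ S @ K.T

m, P = np.zeros(2), np.eye(2)            # predicted moments of [x, x']
f = lambda x: -0.5 * x                   # the ODE vector field

# numerical likelihood: condition x' on f(x), evaluated at the predictive
# mean (an EKF0-flavoured simplification; cf. Eq. (38.12))
m, P = kalman_update(m, P, np.array([[0.0, 1.0]]),
                     np.array([f(m[0])]), np.zeros((1, 1)))

# observation likelihood (41.15): a noisy measurement of x itself
m, P = kalman_update(m, P, np.array([[1.0, 0.0]]),
                     np.array([0.3]), np.array([[0.1]]))
print(m, P)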
As a concrete example they use data from the Covid-19 pan­
demic, which provides an intuitive idea for the kind of data
and mechanistic knowledge typical for real-world problems. We
know that infection numbers are driven by a dynamic process
that can be described approximately by an ODE (the widely
used SIRD family of models), but that this model contains var­
ious unknown quantities - in their example, the time-varying

contact rate among the population, modelled by a latent Gaus­


sian process. But these infection numbers are also counted
empirically. Both the mechanistic knowledge about infection
dynamics and the empirical information provided by the case
counts provide information that the EKF1/EKS1 can directly
incorporate via an observation model as in Eq. (41.15). This
allows inference on the latent contact rate in a single forward
pass (rather than a laborious outer loop to fit the ODE solution),
directly followed by forward simulation.
This insight is crucial from the point of view of machine
learning. It is quite typical for real-world inference problems
involving dynamical systems to feature both some knowledge
about the underlying dynamics (up to some unknown parame­
ters or latent forces, for example) and to observe some physical
quantities. For such dynamical systems described by PDEs, Giro-
lami et al. (2021) presented a similar way to fuse numerical and
observational information by use of a probabilistic numerical
PDE solver.

► 41.4 Partial Differential Equations (PDEs)

PDEs are an equally well-established field of Probabilistic Nu­


merics, but are beyond the scope of this text. We here only point
to the relevant literature and link it to the above material on
ODEs.
For linear PDEs, it is possible to construct exact probabilistic
meshless methods by conditioning a (Gaussian) prior over the
PDE solution on evaluations of the right-hand side. The corre­
sponding Bayesian PDE solver was introduced by Cockayne et al.
(2017a).30
30 This approach is particularly related to this chapter because, on linear ODEs, they coincide with the EKS0.
These methods were then applied to PDE-constrained inverse problems by Cockayne et al. (2017b) and to a real-world engineering problem by Oates et al. (2019b). However, as in
the case of ODEs, exact inference is not possible for nonlinear
PDEs, and approximations are necessary. To this end, Wang
et al. (2021) introduced a first extension of the linear solvers to
nonlinear PDEs. This line of work is philosophically similar to
ODE filters and smoothers (§38). Importantly, Kramer, Schmidt,
and Hennig (2022) showed how ODE filters - in combination
with discretising space by a Gaussian process interpretation of
finite difference methods - can be used to solve time-dependent
PDEs.
There are also other probabilistic PDE solvers which fol­
low the paradigm of perturbative solvers (§40). Chkrebtii et
al. (2016) applied their ODE solver, without modifications, to

a parabolic PDE. Echoing their ODE solver (40.3), Conrad et


al. (2017) provided a distinct solver for linear elliptic PDEs by
perturbing a standard finite element method (FEM) with an
additive random field. By a similar extension of their ODE
solver (40.5), Abdulle and Garegnani (2021) introduced a FEM
with randomised meshes. Another probabilistic modification of
FEMs, which synthesises finite element models with data from
external measurements, was proposed by Girolami et al. (2021);
see §41.3.
There are further related approaches. For instance, Raissi,
Perdikaris, and Karniadakis (2017) solve PDEs by use of prob­
abilistic machine learning. Owhadi (2017) and Owhadi and
Zhang (2017) employ gamblets to solve PDEs with rough coeffi­
cients.
Chapter VII
The Frontier
42
So What?

You’ve read the book: what should you do now? As you’ve


learned, the future of Probabilistic Numerics (pn) is wide open,
and your contributions to its shaping would be welcome. Many
fundamental mathematical, engineering, and philosophical is­
sues within the field remain to be addressed. In this section, we
will highlight some open questions that are likely to influence
the academic discussion on pn for at least the coming decade.

► 42.1 Where Will Probabilistic Numerics Find Application?

The Probabilistic Numerical offerings of superior computation


usage and uncertainty quantification are broadly applicable.
However, to date, as detailed in Chapter V, pn has had substan­
tive impact within only one application: global optimisation,
particularly for hyperparameter tuning. Where will the next
breakthrough success for pn be found? Relatedly, what software
would best support the usage of pn?

► 42.2 Can Randomness be Banished from Computation?

In §12.3, we argued against the use of randomness in compu­


tation. However, the non-random alternatives to most existing
stochastic numerical algorithms are not practical. Can the Prob­
abilistic Numerical view help to construct such alternatives that
are both effective and lightweight?

► 42.3 Can we Scale Probabilistic Numerics?

In some domains, notably Bayesian optimisation and Bayesian


quadrature, pn remains limited by its practical restriction to

Figure 42.1: An adaption of Figure 1. A priori known structure is important to computational agents interacting with the world. Hard-coding knowledge, whether through architectural symmetries or priors, is superior to having to learn it from data. Identically, such structure is important to Probabilistic Numerical agents. (Schematic: WORLD and AGENT exchange Data and Actions; AGENT and NUMERICS exchange Evaluations and Structure.)

comparably low-dimensional problems. Can pn truly be made


to scale to higher-dimensional problems in these settings? So­
lutions are likely to entail both better prior models and smart
engineering. Note that this challenge has arguably already been
overcome in the domains of differential equations and linear
algebra, at least in the sense that, there, PN scales just as well
as competing classic methods, and achieves very comparable
runtimes. But for integration in particular, MCMC methods
continue to provide a formidable competitor.

► 42.4 How Can Numerics be Tailored?

This book has argued that structure is central to the perfor­


mance of numerical algorithms (as per Figure 42.1). Numerical
algorithms that are tailored to the needs of a problem will
perform much better than generic algorithms. Examples were
particularly rich in Chapter II (integration). Of course, Chapter
III (linear algebra) also underlined that there are constraints on
incorporating structure into algorithms that must be exceed­
ingly lightweight. A core open question remains: how is such
structure to be designed?
One approach is to leave the tailoring of a numerical algo­
rithm to a human designer. We have argued that a user familiar
with probabilistic modelling is well-equipped for such tailor­
ing. If such a user has additionally acquired experience with
a particular problem, there is likely to be no-one better placed
to build a performant numerical algorithm. This user’s time at
the coalface of the problem, their hard-won intuitions, can be
distilled into priors. These priors, the embodiment of human
work, will save much computation work.
However, humans cannot be expected to design a unique

numerical algorithm for each required instance. Many numerical


problems, hidden within nested layers of an intelligent system,
remain inscrutable to the system’s human overseer. Even if the
overseer were omniscient, no human would have the capacity
to design each and every numerical algorithm within a large
system. For these reasons, automation is essential to numerical
tailoring.
How should such automation be introduced? As we have
mentioned, there is an obvious source of prior information about
most numerical problems: its source code. Parsing this code
to infer the structure of a partner numerical algorithm seems
very useful. Of course, constraints on computation mean that
not all features of the source can be incorporated. How should
the decision of which structure to incorporate be automatically
made?
The domain of ODEs provides an interesting pointer outside
of the box. Here pn methods have recently shown drastic im­
provements in runtime over classic methods in situations where
the numerical problem (here, the ODE) provides mechanistic
structure to an otherwise empirical inference problem.1
1 Schmidt, Kramer, and Hennig (2021)
This
provides a hint that prior information in computation may take
a different form to the traditional notion in statistics. Perhaps it
is not so much that human users provide priors explicitly, but
that, by interacting more directly with each other through the
probabilistic framework, computational steps can inform each
other in a way that removes the need for some computations.

► 42.5 Can Probabilistic Numerical Models Be Identified at


Runtime?

Section 42.4 posed questions about how to construct Probabilis­


tic Numerical models. In addressing such questions, we might
draw from existing analysis of statistical models, for instance,
within statistical learning theory. Such analysis usually focusses
on the information content of samples with respect to the model.
Simply put, learning theorists ask whether a certain statistical
model can principally be identified from a certain data source,
and at which rate this identification is possible. When statistical
concepts are applied to numerical computations, the cost of per­
forming the computation enters the theoretical analysis as a new
quantity of interest. Given a particular numerical task, one may
try to find a solution by using computational resources in many
different ways. An elaborate prior distribution, perhaps con­
structed from an automated analysis of the source code defining

the numerical task, followed by a small number of well-chosen


“observations” of a function involved in the numerical task?
Or perhaps a very simple, lightweight prior, combined with a
large number of “observations”? The practical requirement that
numerical computations themselves have to be fundamentally
tractable on a binary computer imposes a structural constraint
on the model space that is currently not understood.

► 42.6 What Can We Say about Numerical Error given Finite


Computation?

Section 42.5 asked what kind of computational tasks can be


solved in principle. The natural extension is to ask to which
degree the error of a computation (the correct amount of un­
certainty over it) can be identified at runtime, and at what
computational cost. Evidently, as in all statistical settings, the
precise error (here of a computation) cannot be found live, at
runtime - if this were possible, one could simply subtract this
error from the current estimate, and would arrive at a precise,
tractable answer to the supposedly intractable numerical task.
Nonetheless, we have seen throughout this book that the statis­
tical toolbox does provide a useful variety of means of assessing
the error. In fact, the scale, and some structure of the error
can often be estimated at runtime with minimal or negligible
computational overhead! As elsewhere in statistics, meaningful
inference has to rely on assumptions of regularity and structure.
But more work is needed to understand the precise limits of
such statistical uncertainty calibration.

► 42.7 What Does Uncertainty Mean within Probabilistic Nu­


merics?

Some questions within pn cannot be separated from philosophy.


For example, there is an ongoing, deep, discussion about the
meaning of uncertainty in the result of a deterministic computation.2
2 Fillion and Corless (2014)
In many ways, this discussion mirrors general charges
levelled against probabilistic inference in the statistics and ma­
chine learning communities. By removing the ability to appeal
to “randomness of nature”, the numerical setting highlights
the philosophical challenges of the probabilistic framework. But
there are also some aspects that are genuinely unique to the
numerical setting. Not all complicate the argument: some even
simplify things. For example, prior assumptions can be argued
for or against with much more precision in the numerical setting,

because a numerical task is defined in a formal (programming)


language. Still, many members of the involved academic com­
munities remain skeptical of the notion of uncertainty over the
result of a computation. These central philosophical questions
find new evidence and urgency within pn.

► 42.8 Can Computational Pipelines Be Harmonised?

As argued in the Introduction, pn offers a framework for man­


aging pipelines of numerical algorithms. Constructing such a
framework promises both rigorous quantification of uncertainty
and computational savings. However, to date, little work on
Probabilistic Numerical pipelines has been completed. Promis­
ing foundations for such work are to be found in the study
of graphical models, particularly of message passing (featured
in §5.1). Graphical models are an exceedingly powerful con­
cept for the design of probabilistic models, and the automated
construction of efficient inference algorithms.3
3 Basic introductions can be found in the textbooks by Bishop (2006) and Barber (2012); an impressive, encyclopedic treatment is provided by Koller and Friedman (2009).

► 42.9 Are There Limits to the Probabilistic Synthesis of Numerical and Statistical Information?

The statistical model - used by a probabilistic numerical method to link numerical information to its quantity of interest - can sometimes be augmented to include observational data.4 This information fusion between statistics and numerics has been shown to yield significant performance increases.5
4 See paragraph "pn consolidates numerical computation and statistical inference" in the Introduction for the details of this argument.
5 See §41.3 for a concrete example application.
From a pure information- or decision-theoretic viewpoint, there should be no limits to this approach. No matter if truly noisy or just approximate, all (uncertain) information can be expressed by probability distributions and exploited by use of probabilistic inference, that is by pn.
Future research will determine how far this equivalence can
be taken, and whether a firm distinction between statistical
inference and numerical computation should really be maintained.
Chapter VIII
Solutions to Exercises

Selected Exercises from Chapter I “Mathematical Background”

Solution to Exercise 4.7. Define the short-hand w_i(x) = [K⁻¹_XX k_Xx]_i for the regression weights. To show the theorem, we insert the definition of m_x from Eq. (4.6) and use the reproducing property of k to write all instances of f(x), f(x_i) as an inner product:

s(x) := sup_{f∈H, ‖f‖≤1} (m_x − f_x)² = sup_{f∈H, ‖f‖≤1} ( ∑_{i=1}^N f(x_i) w_i(x) − f_x )²
      = sup_{f∈H, ‖f‖≤1} ⟨ ∑_i w_i(x) k(·, x_i) − k(·, x), f(·) ⟩²_H.

Consider the function

f_x(·) := ( ∑_i w_i(x) k(·, x_i) − k(·, x) ) / ‖ ∑_i w_i(x) k(·, x_i) − k(·, x) ‖.

By Definition 4.6, f_x is in H, and it clearly has unit norm. We thus have

⟨ ∑_i w_i(x) k(·, x_i) − k(·, x), f_x(·) ⟩²_H ≤ s(x),

because s(x) is a supremum over such rkhs functions. But we also have, from the Cauchy-Schwarz inequality,

s(x) ≤ ‖ ∑_i w_i(x) k(·, x_i) − k(·, x) ‖² · 1 = ⟨ ∑_i w_i(x) k(·, x_i) − k(·, x), f_x(·) ⟩².

Thus, s(x) is bounded above and below by that expression, so has to be equal to it. Using the reproducing property once again, we can rewrite:

s(x) = ⟨ ∑_i w_i(x) k(·, x_i) − k(·, x), f_x(·) ⟩²
     = ‖ ∑_i w_i(x) k(·, x_i) − k(·, x) ‖²_H
     = ∑_{ij} w_i(x) w_j(x) k(x_i, x_j) − 2 ∑_i w_i(x) k(x, x_i) + k(x, x)
     = k_xx − k_xX K⁻¹_XX k_Xx,

which completes the proof. □
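The identity is easy to check numerically; in the sketch below (our code, with an arbitrary RBF kernel), the squared RKHS norm of ∑_i w_i(x)k(·, x_i) − k(·, x), evaluated via Gram-matrix algebra, agrees with k_xx − k_xX K⁻¹_XX k_Xx:

import numpy as np

k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2)
X, x = np.array([0.0, 0.7, 1.3]), np.array([0.4])

K = k(X, X) + 1e-12 * np.eye(3)
kXx = k(X, x)
w = np.linalg.solve(K, kXx)                      # regression weights w(x)

norm_sq = w.T @ K @ w - 2 * w.T @ kXx + k(x, x)  # squared RKHS norm
post_var = k(x, x) - kXx.T @ np.linalg.solve(K, kXx)
print(norm_sq.item(), post_var.item())           # agree to rounding error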

Solution to Exercise 5.5. The main challenge in this exercise is to compute the matrix exponential exp(Fh) of the drift matrix F in dz(t) = Fz(t)dt + Ldω. The way to do this is to see that, by construction (from Eq. (5.28)), the three F matrices in Eqs. (5.29)-(5.30) (of size N × N for N = 1, 2, 3) have only the one degenerate eigenvalue λ = −ξ, and the corresponding eigenvector is given by u ∈ R^N with

[u]_i = (−1)^{N−1−i} ξ^{−(N−1−i)},  i = 1, …, N.

Thus, the eigenvalue of A = exp(Fh) is e^{−ξh}, and one can find the three forms:

A_0 = exp(−ξh),

A_1 = e^{−ξh} [ ξh + 1,   h ;
                −ξ²h,     1 − ξh ],

A_2 = exp( h [ 0, 1, 0 ;  0, 0, 1 ;  −ξ³, −3ξ², −3ξ ] )
    = e^{−ξh} [ ½(ξ²h² + 2ξh + 2),   h(ξh + 1),          ½h² ;
                −½ξ³h²,              −(ξ²h² − ξh − 1),   −½h(ξh − 2) ;
                ½ξ³h(ξh − 2),        ξ²h(ξh − 3),        ½(ξ²h² − 4ξh + 2) ].

The form for Q then just requires a comparably straightforward element-wise integral over polynomials in h.
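The closed form for A_2 can be checked numerically against a generic matrix exponential (our check; the values of ξ and h are arbitrary):

import numpy as np
from scipy.linalg import expm

xi, h = 1.7, 0.3
F = np.array([[0, 1, 0], [0, 0, 1], [-xi**3, -3 * xi**2, -3 * xi]])
A2 = np.exp(-xi * h) * np.array([
    [0.5 * (xi**2 * h**2 + 2 * xi * h + 2), h * (xi * h + 1), 0.5 * h**2],
    [-0.5 * xi**3 * h**2, -(xi**2 * h**2 - xi * h - 1),
     -0.5 * h * (xi * h - 2)],
    [0.5 * xi**3 * h * (xi * h - 2), xi**2 * h * (xi * h - 3),
     0.5 * (xi**2 * h**2 - 4 * xi * h + 2)],
])
print(np.max(np.abs(A2 - expm(F * h))))   # agrees to machine precision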

Solution to Exercise 5.6. For the estimation covariances P_i, the DARE corresponding to Eq. (5.31) is

P_{i+1} = Q̃ + Ã ((P_i)⁻¹ + H̃ᵀ R̃⁻¹ H̃)⁻¹ Ãᵀ.  (42.1)

Its parameters are transformations of those of the discrete-time system:

H̃ := HA,   Q̃ := (Q⁻¹ + Hᵀ R⁻¹ H)⁻¹,  (42.2)
R̃ := R + H Q Hᵀ,   Ã := Q̃ Q⁻¹ A.  (42.3)

For the smoothed covariances P_iˢ, let us write the corresponding DARE as

Pˢ_{i+1} = Q̄ + Ā ((Pˢ_i)⁻¹ + H̄ᵀ R̄⁻¹ H̄)⁻¹ Āᵀ.

Finding its parameters is more tricky if one tries to derive them from scratch, but becomes simpler once one has found the above DARE for P. Phrased in terms of the variables from Eqs. (42.2) and (42.3), they are

H̄ := H̃ Ã,   Q̄ := (Q̃⁻¹ + H̃ᵀ R̃⁻¹ H̃)⁻¹,
R̄ := R̃ + H̃ Q̃ H̃ᵀ,   Ā := Q̄ Q̃⁻¹ Ã.

A hint in case you are confused how to even reach a recursive statement for Pˢ: start by re-arranging the smoother update Eq. (5.15) so that it only contains terms in P and Pˢ (i.e. not in P⁻):

Pˢ_t = P_t + P_t Aᵀ ((A P_t Aᵀ + Q)⁻¹ Pˢ_{t+1} − I)(A P_t Aᵀ + Q)⁻¹ A P_t.

Now write down the same equation for t + 1, and replace all occurrences of P_{t+1} using (42.1). Then use the matrix inversion lemma, Eq. (15.9), to simplify the expression.

Selected Exercises from Chapter II “Integration”

Solution to Exercise 9.3. As an example of a pair f, p leading to pathological variance, consider the case where both the integrand f and the proposal p are centred Gaussian measures

f(x) = N(x; 0, s²) = (2πs²)^{−1/2} exp(−x²/(2s²)),  and
p(x) = N(x; 0, σ²),

and set s² > σ². This question is a variation on Exercise 29.13 from MacKay (2003). MacKay also suggests the following interesting extension: implement MC integration for this situation, and examine the empirical mean and variance of this estimator as the number of drawn samples increases.
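A sketch of that suggested experiment (our code and numbers): importance-sampling estimates of ∫f(x)dx = 1 with the narrow proposal, tracking the running empirical mean and variance of the weights f/p:

import numpy as np

rng = np.random.default_rng(0)
s2, sigma2 = 4.0, 1.0                      # s^2 > sigma^2: narrow proposal

def gauss(x, var):
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = rng.normal(0.0, np.sqrt(sigma2), size=10**6)   # draws from p
w = gauss(x, s2) / gauss(x, sigma2)                # weights f/p, E_p[w] = 1

for n in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(n, w[:n].mean(), w[:n].var())    # mean wanders around 1, while
                                           # the variance never settles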

Selected Exercises from Chapter III “Linear Algebra”

Solution to Exercise 15.1. Equations (15.4)-(15.7) follow directly from the definition in Eq. (15.2). For example, for Eq. (15.3):

[(A ⊗ B)C]_{ij} = ∑_{kℓ} A_{ik} B_{jℓ} C_{kℓ} = ∑_k A_{ik} [CBᵀ]_{kj} = [ACBᵀ]_{ij}.

It is advantageous to prove Eq. (15.6) next. It follows directly from Eq. (15.4) using the eigendecomposition A = V_A D_A V_A⁻¹ (remember that D_A and D_B are diagonal matrices, containing the eigenvalues λ_{A,i} and λ_{B,j} on their diagonals). To see that it is indeed the eigendecomposition of A ⊗ B, we simply use Eq. (15.4) and the definition (15.2) again to see that the matrix V_A ⊗ V_B has the property

[(A ⊗ B)(V_A ⊗ V_B)]_{ij,kℓ} = [(V_A ⊗ V_B)(D_A ⊗ D_B)]_{ij,kℓ}
  = ∑_{ab} V_{A,ia} V_{B,jb} δ_{ak} δ_{bℓ} λ_{A,a} λ_{B,b}
  = λ_{A,k} λ_{B,ℓ} (V_A ⊗ V_B)_{ij,kℓ}.

Hence, this matrix contains the eigenvectors of A ⊗ B in its columns, and the eigenvalues are λ_{A⊗B,ab} = λ_{A,a} λ_{B,b}. Using the previous result, we can now show Eq. (15.5), using properties of the determinant (|AB| = |A||B|) and the eigendecomposition (the determinant is the product of the eigenvalues):

|A ⊗ B| = |V_A ⊗ V_B| |D_A ⊗ D_B| |(V_A ⊗ V_B)⁻¹| = |D_A ⊗ D_B|.

Note again that D_A ⊗ D_B is a diagonal matrix with diagonal entries (D_A ⊗ D_B)_{ij,ij} = λ_{A,i} λ_{B,j}. So its determinant is

|A ⊗ B| = ∏_{ij} λ_{A,i} λ_{B,j} = ( ∏_i λ_{A,i} )^{N_B} ( ∏_j λ_{B,j} )^{N_A} = |A|^{N_B} |B|^{N_A},

where N_A and N_B denote the sizes of A and B.
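These identities are easy to spot-check numerically; the snippet below (our code, using numpy's row-major vec convention) verifies the vectorisation and determinant identities for random A and B:

import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 4
A, B = rng.normal(size=(N, N)), rng.normal(size=(M, M))
C = rng.normal(size=(N, M))

# (A ⊗ B) vec(C) = vec(A C Bᵀ), with row-major vectorisation
print(np.allclose(np.kron(A, B) @ C.reshape(-1), (A @ C @ B.T).reshape(-1)))

# |A ⊗ B| = |A|^M |B|^N
print(np.allclose(np.linalg.det(np.kron(A, B)),
                  np.linalg.det(A)**M * np.linalg.det(B)**N))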

Solution to Exercise 15.2. The fact that the Kronecker product of two spd matrices is spd follows directly from the results of Exercise 15.1. To see that Eq. (15.8) defines a matrix norm, simply note that it is a Euclidean vector-norm on A, and thus inherits all the properties defined in note 3, on p. 128.

Solution to Exercise 19.1. This is primarily an exercise in performing nested sums. Choosing Σ_{ij,kℓ} = δ_{ik} δ_{jℓ} W_{ij} gives

[(I ⊗ Sᵀ)Σ]_{nm,kℓ} = ∑_{ij} δ_{ni} S_{jm} δ_{ik} δ_{jℓ} W_{ij} = δ_{nk} W_{kℓ} S_{ℓm},

and thus the Gram matrix G has elements

[(I ⊗ Sᵀ)Σ(I ⊗ S)]_{nm,ab} = ∑_{kℓ} δ_{nk} W_{kℓ} S_{ℓm} δ_{ka} S_{ℓb} = δ_{na} ∑_ℓ W_{nℓ} S_{ℓm} S_{ℓb}.

The remaining sum, with its non-trivial structure, can be written as computing the elements of a tensor C_{nmb} ∈ R^{N×M×M}, whose computation requires O(N²M²) operations (note that it is "symmetric" under exchange of m and b). The Kronecker symbol in front of the sum indicates that G can be written in block-diagonal form, with N different blocks of M × M matrices, collected in C. So the inversion of G costs O(NM³) (even though this is less than the cost of computing the elements of G themselves). One tempting way out is to assume that W is a rank-one object. This is of course not really acceptable since it restricts the model class severely, but leads to a pleasing insight. For choosing W_{ni} = w_n w_i gives

G_{nm,ab} = δ_{na} w_n (Sᵀ(diag w)S)_{mb} = (diag w ⊗ (Sᵀ(diag w)S))_{nm,ab},

which is not a surprise, since this choice amounts to setting Σ = diag w ⊗ diag w.

Selected Exercises from Chapter IV “Local Optimisation”

Solution to Exercise 25.2. The first robot moves a distance of ‖x_1 − x_0‖ = 0.1 × ∇f(x_0) [m] = 0.5 m, to where the potential energy density is approximately

f(x_1) ≈ f(x_0) − ‖x_1 − x_0‖ ∇f(x_0) = 4473 J/kg − 0.5 m × 5 J/(kg·m) = 4470.5 J/kg.

The second robot makes the literally microscopic step 0.1 × ∇f(x_0) [ft] = 1.03 × 10⁻⁶ [ft] ≈ 0.3 μm, to where the potential energy density is

f(x_1) = 3.031 × 10⁻² Cal/oz − 1.03 × 10⁻⁶ ft × 1.03 × 10⁻⁵ Cal/(oz·ft) = 3.031 × 10⁻² Cal/oz = 4473 J/kg

(i.e., the energy density is unchanged, up to four significant digits). The point of this exercise is to show that gradient descent with a fixed step-size, although it may initially seem like the intuitive approach to optimisation, is not a well-defined algorithm. If the units of measure of f are [f] and the units of x are [x], then the units of ∇f are [f]/[x], so adding a constant multiple of ∇f to x makes no sense. To get a meaningful algorithm, the step size α has to have units itself, and they must be [x]²/[f]. This is the case, for example, for Newton's method.
The reason gradient descent with fixed step size has long been popular for some problem classes (like deep learning) is an implicit assumption that f has been smartly designed such that the Hessian's eigenvalues are all approximately unit, i.e. that the problem is well-conditioned. In this example, this holds for SI units: moving about half a metre (similar to the length of a human step) in between re-alignments is a plausible paradigm. The necessary pre-conditioning is sometimes conscious, sometimes unconscious. For example, the deep learning community has found many "tricks of the trade"6 to facilitate learning with neural networks (e.g. weight initialisation, data standardisation, etc.), many with the conscious or unconscious aim to condition the optimisation problem.
6 Hinton (2012); Goodfellow, Bengio, and Courville (2016).
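The effect is easy to reproduce in a toy version of the two robots (our illustration, with made-up numbers): the same quadratic potential, descended with the same dimensionless step size 0.1, once with x measured in metres and once in feet, ends up at different points:

import numpy as np

def descend(x0, grad, alpha=0.1, steps=5):
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# f(x) = c x^2 / 2 with curvature c = 5 J/(kg m^2); grad in J/(kg m)
grad_si = lambda x: 5.0 * x                      # x in metres
x_si = descend(10.0, grad_si)                    # start 10 m from optimum

# identical physics expressed in feet (1 ft = 0.3048 m):
# the curvature becomes c' = 5 * 0.3048^2 J/(kg ft^2)
grad_ft = lambda x: 5.0 * 0.3048**2 * x
x_ft = descend(10.0 / 0.3048, grad_ft) * 0.3048  # convert back to metres

print(x_si, x_ft)   # different points: "0.1" is not unit-free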

Solution to Exercise 26.1. We consider the finite set of function values f_X ∈ R^N. The exercise stipulates a Gaussian prior p(f_X) and likelihood p(Y | f_X). For Gaussians, the locations of the mean and maximum of the pdf coincide. Thus, the posterior mean m_X = E_{p(f_X|Y)}(f_X) is equal to the maximum of the product of prior and likelihood

m_X = arg max p(f_X) p(Y | f_X).

(The evidence is not a function of f_X, thus does not affect the location of the maximum and can be ignored.) The location of maxima is not affected by any monotonic re-scaling. So we can take the logarithm (a monotonic transformation) of the posterior. The logarithm of a Gaussian (again ignoring all constant scaling factors) is a quadratic form. Because we are interested in minimisation instead of maximisation, we also multiply by (−1), to get

m_X = arg min [ −log p(f_X) − log p(Y | f_X) ]
    = arg min [ ½ (f_X − u_X)ᵀ K⁻¹_XX (f_X − u_X) + ½ ‖Y − f_X‖² ]
    = arg min [ r(x) + ½ ∑_{i=1}^N (Y_i − f(x_i))² ],

with the regulariser r(x) := ½ ‖f_X − u_X‖²_{K⁻¹_XX}, which is clearly of the form of Eq. (26.2). MAP inference on the latent quantity w in probabilistic models (here, w = f_X) can generally be identified with regularised empirical risk minimisation if the data are conditionally independent given w. That is, if

p(Y | w) = ∏_{i=1}^N p(y_i | w).

A well-known theorem by de Finetti7 states that this is the case if the likelihood is exchangeable (invariant under permutations of the data). However, the opposite direction does not always work: not all regularisers r(w) and not all loss functions ℓ(y_i; w) can be interpreted as the negative logarithm of a prior and likelihood, respectively, since their exponential may not have finite integral, and thus cannot be normalised to become a probability measure. For example, the hinge loss used in support vector machines is not the logarithm of a likelihood.8
7 Diaconis and Freedman (1980)
8 For further discussion, see pp. 144-145 in Rasmussen and Williams (2006).

Discussion of Exercise 26.4. Some formal discussion of this problem can be found in §6.3 of Rasmussen and Williams (2006). The following argument gives an intuition: assume we do not want to encode the evaluation nodes t_i in the model a priori (i.e., we want the model to produce cubic spline mean functions in the noise-free, Λ → 0, limit for any choice of t_i). Recall again that this posterior mean is a weighted sum of the covariance function k(t_a, t_b) of our Gaussian process model. Hence, that kernel k has to be a piecewise cubic polynomial in t_a and t_b, with discontinuities in the second derivative only for t_a = t_b, at the evaluation nodes. That is, assuming our case of t_a, t_b > 0, it must be possible to write k as a polynomial in |t_a − t_b|, plus polynomial terms in t_a and t_b.9 Hence, the posterior marginal variance var(f(t)), which is also a weighted sum of terms k_{t,t_a} k_{t,t_b}, is always at most a polynomial containing terms |t_a − t_b|^ℓ with 0 ≤ ℓ ≤ 6. Thus, it is certainly possible to construct priors which, given Y with likelihood (26.12), assign a different absolute uncertainty to each input location t but still revert to the cubic spline mean in the limit Λ → 0. But their qualitative behaviour is equivalent in the sense that the marginal standard deviation (the "sausage of uncertainty" around the posterior mean) is locally cubic in t. This is again an instance of the deeper insight that, since a Gaussian process posterior mean and (co-)variance both involve the same kernel, a classic numerical estimate that is a particular least-squares MAP estimator is consistent with only a restricted set of probabilistic posterior error estimates in the sense of posterior standard-deviations.
9 This is not contradicted by the form of Eq. (26.13), since min(t_a, t_b) = ½(|t_a + t_b| − |t_a − t_b|).

Selected Exercises from Chapter V “Global Optimisation”

Solution to Exercise 32.1. We compute (with Φ(·; m, V) the Gaussian cdf):

α_EI(x_n) = ∫_{−∞}^{η} f(x_n) N(f(x_n); m(x_n), V(x_n)) df(x_n) − η ∫_{−∞}^{η} N(f(x_n); m(x_n), V(x_n)) df(x_n)

= ∫_{−∞}^{η} f(x_n) (2πV(x_n))^{−1/2} exp( −½ (f(x_n) − m(x_n))² / V(x_n) ) df(x_n) − η Φ(η; m(x_n), V(x_n))

= ∫_{−∞}^{η−m(x_n)} (z + m(x_n)) (2πV(x_n))^{−1/2} exp( −½ z² / V(x_n) ) dz − η Φ(η; m(x_n), V(x_n))

= −√(V(x_n)/2π) ∫_{−∞}^{η−m(x_n)} (−z / V(x_n)) exp( −½ z² / V(x_n) ) dz
  + m(x_n) Φ(η − m(x_n); 0, V(x_n)) − η Φ(η; m(x_n), V(x_n))

= −√(V(x_n)/2π) [ exp( −½ z² / V(x_n) ) ]_{z=−∞}^{η−m(x_n)} + (m(x_n) − η) Φ(η; m(x_n), V(x_n))

= −V(x_n) N(η; m(x_n), V(x_n)) + (m(x_n) − η) Φ(η; m(x_n), V(x_n)). □
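The closed form can be sanity-checked against a Monte Carlo estimate of E[(f − η) 1{f < η}] (our check; the values of m, V, and η are arbitrary):

import numpy as np
from scipy.stats import norm

m, V, eta = 0.3, 0.8, 0.1
sd = np.sqrt(V)
closed = -V * norm.pdf(eta, m, sd) + (m - eta) * norm.cdf(eta, m, sd)

rng = np.random.default_rng(2)
f = rng.normal(m, sd, size=10**7)
mc = np.mean(np.where(f < eta, f - eta, 0.0))   # E[(f - eta) 1{f < eta}]
print(closed, mc)                               # agree to Monte Carlo error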

Solution to Exercise 32.2. With a gp prior for the objective f(x), we have the multi-variate Gaussian prior for the two function values f_1 := f(x_1) and f_2 := f(x_2) (assuming zero covariance between f(x_1) and f(x_2)),

p([f_1; f_2]) = N( [f_1; f_2]; [m_1; m_2], [C_1, 0; 0, C_2] ),  with C := [C_1, 0; 0, C_2].

If we take an i.i.d. Gaussian noise model, y_i = f_i + ε_i, ε_i ∼ N(0, σ²), and using standard Gaussian identities (see §3), we can compute in closed form the joint distribution,

p([ε_1; ε_2; y_1; y_2]) = N( [ε_1; ε_2; y_1; y_2]; [0; 0; m_1; m_2],
  [σ², 0, σ², 0;  0, σ², 0, σ²;  σ², 0, C_1 + σ², 0;  0, σ², 0, C_2 + σ²] ).

We can now calculate the posterior for the noise contributions:

p([ε_1; ε_2] | y_1, y_2) = N( [ε_1; ε_2]; σ²(C + σ²I)⁻¹ [y_1 − m_1; y_2 − m_2], σ²I − σ⁴(C + σ²I)⁻¹ )
  = N( [ε_1; ε_2]; σ² [(C_1 + σ²)⁻¹(y_1 − m_1); (C_2 + σ²)⁻¹(y_2 − m_2)],
       [σ² − σ⁴(C_1 + σ²)⁻¹, 0;  0, σ² − σ⁴(C_2 + σ²)⁻¹] ).

Note that if an observation y_i is lower than its prior mean m_i, the expected value for the noise contribution is negative. We will now assume that C_1 = C_2 = E, and, finally, calculate the posterior for the difference between noise contributions:

p(ε_1 − ε_2 | y_1, y_2) = N( ε_1 − ε_2; σ²(E + σ²)⁻¹ (y_1 − y_2 − (m_1 − m_2)), 2σ² − 2σ⁴(E + σ²)⁻¹ ).

For clarity, we will now additionally assume that m_1 = m_2. It can hence be seen that the expected value of the difference between noise contributions is proportional (with positive constant of proportionality σ²(E + σ²)⁻¹) to the difference between observations. As such, and exploiting the linearity of expectation: if y_1 is smaller than y_2, the expected value for ε_1 is also smaller than that for ε_2. When combined with our note above, we conclude that if y_1 and y_2 are both lower than their prior means (as is typical for putative minima), the lower of the two is likely to have the noise contribution of larger magnitude.
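This conclusion can be checked by simple sampling (our sketch; the values are arbitrary): conditioning by selection on two observations below the shared prior mean, the lower observation indeed carries the more negative expected noise contribution:

import numpy as np

rng = np.random.default_rng(3)
E, sigma2, m = 1.0, 0.5, 0.0                    # C1 = C2 = E, m1 = m2 = m
n = 10**6
f = rng.normal(m, np.sqrt(E), size=(n, 2))      # latent function values
eps = rng.normal(0.0, np.sqrt(sigma2), size=(n, 2))
y = f + eps

mask = (y[:, 0] < y[:, 1]) & (y[:, 1] < m)      # y1 < y2 < prior mean
print(eps[mask, 0].mean(), eps[mask, 1].mean()) # eps1 is more negative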

Solution to Exercise 33.1. Our goal is to find a Λ(x_n, D_n) such that the myopic expected loss is equal to the ucb criterion. That is, we require a loss function Λ_ucb such that

∫ Λ_ucb(x_n, D_n) p(y_n | D_{n−1}) dy_n = −m(x_n) − β_n V(x_n)^{1/2}.

There are many such solutions: the most trivial is

Λ_ucb(x_n, D_n) = −m(x_n) − β_n V(x_n)^{1/2},  (42.4)

which expresses no dependence on the function value y_n at all. This is incompatible with the stated goal of optimisation to find low function values.
Another solution is the contrived loss function

Λ_ucb(x_n, D_n) = −y_n δ_{y_n}( m(x_n) + β_n V(x_n)^{1/2} ).  (42.5)

The expectation of this loss against p(y_n | D_{n−1}) will trivially produce an expected loss equal to the ucb acquisition function, (33.3).
Further solutions can be found by writing

∫ Λ_ucb(x_n, D_n) p(y_n | D_{n−1}) dy_n
  = −∫ y_n p(y_n | D_{n−1}) dy_n − β_n ( ∫ (y_n − m(x_n))² p(y_n | D_{n−1}) dy_n )^{1/2}
⇔ ∫ (Λ_ucb(x_n, D_n) + y_n) p(y_n | D_{n−1}) dy_n = −β_n ( ∫ (y_n − m(x_n))² p(y_n | D_{n−1}) dy_n )^{1/2},

and, multiplying both sides by β_n V(x_n)^{1/2},

β_n V(x_n)^{1/2} ∫ (Λ_ucb(x_n, D_n) + y_n) p(y_n | D_{n−1}) dy_n = −β_n² ∫ (y_n − m(x_n))² p(y_n | D_{n−1}) dy_n.

One possible solution has the integrand of the left-hand side equal to that of the right, so that

β_n V(x_n)^{1/2} (Λ_ucb(x_n, D_n) + y_n) = −β_n² (y_n − m(x_n))².

We can hence finally arrive at

Λ_ucb(x_n, D_n) = −β_n (y_n − m(x_n))² V(x_n)^{−1/2} − y_n.  (42.6)

Note that, as β_n > 0 and V(x_n) > 0, this loss function is a concave quadratic in y_n. This is inappropriate for optimisation: it neglects the fact that the loss should lead to an unambiguous preference for y_n to be as low (e.g. large and negative) as possible. Speaking informally, a user would not be delighted if y_n was returned as positive infinity, whereas Eq. (42.6) would award such an outcome infinitely negative loss.
Note that all loss functions (Eqs. (42.4), (42.5), and (42.6)) implicitly depend on the prior mean and covariance functions of the underlying gp, so as to be able to compute the gp posterior mean m(x_n) and variance V(x_n) given D_n. It is odd that the loss function should depend on the prior in this way; it implies that, even for identical outcomes, the loss will be different for different priors. In the case of Eq. (42.5), it is indefensible that the loss depends on y_n in no other way: that is, it is only if y_n = m(x_n) + β_n V(x_n)^{1/2} that Eq. (42.5) is non-zero.
More on this topic can be found in §6.1 of Garnett (2022).
References

Abdulle, A. and G. Garegnani. “Random time step probabilistic


methods for uncertainty quantification in chaotic and geo­
metric numerical integration”. Statistics and Computing 30.4
(2020), pp. 907-932.
- “A probabilistic finite element method based on random
meshes: A posteriori error estimators and Bayesian inverse
problems”. Computer Methods in Applied Mechanics and Engi­
neering 384 (2021), p. 113961.
Adler, R. The Geometry of Random Fields. Wiley, 1981.
- An introduction to continuity, extrema, and related topics for
general Gaussian processes. Vol. 12. Lecture Notes-Monograph
Series. Institute of Mathematical Statistics, 1990.
Ajne, B. and T. Dalenius. “Några tillämpningar av statistiska idéer på numerisk integration”. Nordisk Matematisk Tidskrift 8.4 (1960), pp. 145-152.
Akhiezer, N. I. and I. M. Glazman. Theory of linear operators in
Hilbert space. Vol. I & II. Courier Corporation, 2013.
Alizadeh, F., J.-P. A. Haeberly, and M. L. Overton. “Primal-Dual interior-point methods for semidefinite programming: Convergence rates, stability and numerical results”. SIAM Journal on Optimization 8.3 (1998), pp. 746-768.
Anderson, B. and J. Moore. Optimal Filtering. Prentice-Hall, 1979.
Anderson, E. et al. LAPACK Users’ Guide. 3rd edition. Society
for Industrial and Applied Mathematics (SIAM), 1999.
Arcangeli, R., M. C. Lopez de Silanes, and J. J. Torrens. “An
extension of a bound for functions in Sobolev spaces, with
applications to (m, s)-spline interpolation and smoothing”.
Numerische Mathematik 107.2 (2007), pp. 181-211.
Armijo, L. “Minimization of functions having Lipschitz contin­
uous first partial derivatives”. Pacific Journal of Mathematics
(1966), pp. 1-3.
Arnold, V. I. Ordinary Differential Equations. Universitext. Springer,
1992.

Arnoldi, W. “The principle of minimized iterations in the solu­


tion of the matrix eigenvalue problem”. Quarterly of Applied
Mathematics 9.1 (1951), pp. 17-29.
Aronszajn, N. “Theory of reproducing kernels”. Transactions of
the AMS (1950), pp. 337-404.
Arvanitidis, G. et al. “Fast and Robust Shortest Paths on Mani­
folds Learned from Data”. The 22nd International Conference on
Artificial Intelligence and Statistics, AISTATS. Vol. 89. Proceed­
ings of Machine Learning Research. PMLR, 2019, pp. 1506­
1515.
Azimi, J., A. Fern, and X. Z. Fern. “Batch Bayesian Optimization
via Simulation Matching”. Advances in Neural Information
Processing Systems, NeurIPS. Curran Associates, Inc., 2010,
pp. 109-117.
Azimi, J., A. Jalali, and X. Z. Fern. “Hybrid Batch Bayesian
Optimization”. Proceedings of the 29th International Conference
on Machine Learning, ICML. icml.cc / Omnipress, 2012.
Bach, F. “On the equivalence between kernel quadrature rules
and random feature expansions”. Journal of Machine Learning
Research (JMLR) 18.21 (2017), pp. 1-38.
Bach, F., S. Lacoste-Julien, and G. Obozinski. “On the Equiv­
alence between Herding and Conditional Gradient Algo­
rithms”. Proceedings of the 29th International Conference on Ma­
chine Learning, ICML. icml.cc / Omnipress, 2012.
Baker, C. The numerical treatment of integral equations. Oxford:
Clarendon Press, 1973.
Balles, L. and P. Hennig. “Dissecting Adam: The Sign, Magni­
tude and Variance of Stochastic Gradients”. Proceedings of
the 35th International Conference on Machine Learning, ICML.
Vol. 80. Proceedings of Machine Learning Research. PMLR,
2018, pp. 413-422.
Balles, L., J. Romero, and P. Hennig. “Coupling Adaptive Batch
Sizes with Learning Rates”. Proceedings of the Thirty-Third
Conference on Uncertainty in Artificial Intelligence, UAI. AUAI
Press, 2017.
Bapat, R. Nonnegative Matrices and Applications. Cambridge Uni­
versity Press, 1997.
Barber, D. Bayesian reasoning and machine learning. Cambridge
University Press, 2012.
Bardenet, R. and A. Hardy. “Monte Carlo with Determinantal
Point Processes”. Annals of Applied Probability (2019).
Bartels, S. et al. “Probabilistic linear solvers: a unifying view”.
Statistics and Computing 29.6 (2019), pp. 1249-1263.

Belhadji, A., R. Bardenet, and P. Chainais. “Kernel quadrature


with DPPs”. Advances in Neural Information Processing Systems,
NeurIPS. 2019, pp. 12907-12917.
Bell, B. M. “The Iterated Kalman Smoother as a Gauss—Newton
Method”. SIAM Journal on Optimization 4.3 (1994), pp. 626­
636.
Bell, B. M. and F. W. Cathey. “The iterated Kalman filter update
as a Gauss–Newton method”. IEEE Transactions on Automatic
Control 38.2 (1993), pp. 294-297.
Benoit. “Note sur une méthode de résolution des équations normales provenant de l’application de la méthode des moindres carrés à un système d’équations linéaires en nombre inférieur à celui des inconnues. Application de la méthode à la résolution d’un système défini d’équations linéaires (Procédé du Commandant Cholesky)”. Bulletin Géodésique (1924), pp. 67-77.
Berg, C., J. Christensen, and P. Ressel. Harmonic Analysis on
Semigroups — Theory of Positive Definite and Related Functions.
Springer, 1984.
Bergstra, J. et al. “Algorithms for Hyper-Parameter Optimiza­
tion”. Advances in Neural Information Processing Systems, NeurIPS.
2011, pp. 2546-2554.
Bertsekas, D. Nonlinear programming. Athena Scientific, 1999.
Bettencourt, J., M. Johnson, and D. Duvenaud. “Taylor-mode au­
tomatic differentiation for higher-order derivatives”. NeurIPS
2019 Workshop Program Transformations. 2019.
Bini, D., B. Iannazzo, and B. Meini. Numerical solution of algebraic
Riccati equations. SIAM, 2011.
Bishop, C. Pattern Recognition and Machine Learning. Springer,
2006.
Bjorck, A. Numerical Methods in Matrix Computations. Springer,
2015.
Blight, B. J. N. and L. Ott. “A Bayesian Approach to Model In­
adequacy for Polynomial Regression”. Biometrika 62.1 (1975),
pp. 79-88.
Borodin, A. N. and P. Salminen. Handbook of Brownian Motion -
Facts and Formulae. 2nd edition. Probability and Its Applica­
tions. Birkhäuser Basel, 2002.
Bosch, N., P. Hennig, and F. Tronarp. “Calibrated adaptive
probabilistic ODE solvers”. Artificial Intelligence and Statistics
(AISTATS). 2021, pp. 3466-3474.
Bosch, N., F. Tronarp, and P. Hennig. “Pick-and-Mix Information
Operators for Probabilistic ODE Solvers”. Artificial Intelligence
and Statistics (AISTATS). 2022.
Bottou, L., F. E. Curtis, and J. Nocedal. “Optimization Meth­
ods for Large-Scale Machine Learning”. arXiv:1606.04838
[stat.ML] (2016).
Bougerol, P. “Kalman filtering with random coefficients and
contractions”. SIAM Journal on Control and Optimization 31.4
(1993), pp. 942-959.
Boyd, S. and L. Vandenberghe. Convex Optimization. Cambridge
University Press, 2004.
Bretthorst, G. Bayesian Spectrum Analysis and Parameter Estimation.
Vol. 48. Lecture Notes in Statistics. Springer, 1988.
Briol, F. et al. “Frank-Wolfe Bayesian Quadrature: Probabilistic
Integration with Theoretical Guarantees”. Advances in Neural
Information Processing Systems, NeurIPS. 2015, pp. 1162-1170.
Briol, F.-X. et al. “Probabilistic Integration: A Role in Statistical
Computation?” Statistical Science 34.1 (2019), pp. 1-22.
Broyden, C. “A new double-rank minimization algorithm”. No­
tices of the AMS 16 (1969), p. 670.
Butcher, J. C. Numerical Methods for Ordinary Differential Equations.
3rd edition. John Wiley & Sons, 2016.
Calandra, R. et al. “Bayesian Gait Optimization for Bipedal
Locomotion”. Learning and Intelligent OptimizatioN (LION8).
2014a, pp. 274-290.
Calandra, R. et al. “An experimental comparison of Bayesian
optimization for bipedal locomotion”. Proceedings of the Inter­
national Conference on Robotics and Automation (ICRA). 2014b.
Cashore, J. M., L. Kumarga, and P. I. Frazier. “Multi-Step Bayesian
Optimization for One-Dimensional Feasibility Determination”
(2015).
Chai, H. R. and R. Garnett. “Improving Quadrature for Con­
strained Integrands”. The 22nd International Conference on Ar­
tificial Intelligence and Statistics, AISTATS. Vol. 89. Proceedings
of Machine Learning Research. PMLR, 2019, pp. 2751-2759.
Chen, R. T. Q. et al. “Self-Tuning Stochastic Optimization with
Curvature-Aware Gradient Filtering”. Proceedings on "I Can’t
Believe It’s Not Better!" at NeurIPS Workshops. Vol. 137. Pro­
ceedings of Machine Learning Research. PMLR, 2020, pp. 60­
69.
Chen, Y., M. Welling, and A. J. Smola. “Super-Samples from
Kernel Herding”. Proceedings of the Twenty-Sixth Conference
on Uncertainty in Artificial Intelligence, UAI. AUAI Press, 2010,
pp. 109-116.
Chen, Y. et al. “Bayesian optimization in AlphaGo”. arXiv:1812.06855
[cs.LG] (2018).
Chevalier, C. and D. Ginsbourger. “Fast Computation of the
Multi-Points Expected Improvement with Applications in
Batch Selection”. Learning and Intelligent Optimization. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 2013,
pp. 59-69.
Chkrebtii, O. A. and D. Campbell. “Adaptive step-size selection
for state-space probabilistic differential equation solvers”.
Statistics and Computing 29 (2019), pp. 1285-1295.
Chkrebtii, O. A. et al. “Bayesian solution uncertainty quantifi­
cation for differential equations”. Bayesian Analysis 11.4 (2016),
pp. 1239-1267.
Church, A. “On the concept of a random sequence”. Bulletin of
the AMS 46.2 (1940), pp. 130-135.
Cockayne, J. et al. “Probabilistic Numerical Methods for Par­
tial Differential Equations and Bayesian Inverse Problems”.
arXiv:1605.07811v3 [stat.ME] (2017a).
Cockayne, J. et al. “A Bayesian conjugate gradient method (with
discussion)”. Bayesian Analysis 14.3 (2019a), pp. 937-1012.
Cockayne, J. et al. “Bayesian Probabilistic Numerical Methods”.
SIAM Review 61.4 (2019b), pp. 756-789.
Cockayne, J. et al. “Probabilistic numerical methods for PDE-
constrained Bayesian inverse problems”. AIP Conference Pro­
ceedings 1853.1 (2017b), p. 060001.
Colas, C., O. Sigaud, and P.-Y. Oudeyer. “How Many Random
Seeds? Statistical Power Analysis in Deep Reinforcement
Learning Experiments”. arXiv:1806.08295 [cs.LG] (2018).
Conrad, P. R. et al. “Statistical analysis of differential equations:
introducing probability measures on numerical solutions”.
Statistics and Computing 27.4 (2017), pp. 1065-1082.
Cottle, R. “Manifestations of the Schur complement”. Linear
Algebra and its Applications 8 (1974), pp. 189-211.
Cox, R. “Probability, frequency and reasonable expectation”.
American Journal of Physics 14.1 (1946), pp. 1-13.
Cranmer, K., J. Brehmer, and G. Louppe. “The frontier of simulation­
based inference”. Proceedings of the National Academy of Sci­
ences (2020).
Cunningham, J., P. Hennig, and S. Lacoste-Julien. “Gaussian
Probabilities and Expectation Propagation”. arXiv:1111.6832
[stat.ML] (2011).
Dahlquist, G. G. “A special stability problem for linear multistep
methods”. BIT Numerical Mathematics 3 (1963), pp. 27-43.
Dangel, F., F. Kunstner, and P. Hennig. “BackPACK: Packing
more into Backprop”. 8th International Conference on Learning
Representations, ICLR. 2020.
Dashti, M. and A. M. Stuart. “The Bayesian Approach to Inverse
Problems”. Handbook of Uncertainty Quantification. Springer
International Publishing, 2017, pp. 311-428.
Davidon, W. Variable metric method for minimization. Tech. rep.
Argonne National Laboratories, Ill., 1959.
Davis, P. “Leonhard Euler’s Integral: A Historical Profile of
the Gamma Function.” American Mathematical Monthly 66.10
(1959), pp. 849-869.
Davis, P. and P. Rabinowitz. Methods of Numerical Integration.
2nd edition. Academic Press, 1984.
Dawid, A. “Some matrix-variate distribution theory: Notational
considerations and a Bayesian application”. Biometrika 68.1
(1981), pp. 265-274.
Daxberger, E. and B. Low. “Distributed Batch Gaussian Process
Optimization”. Proceedings of the 34th International Conference
on Machine Learning, ICML. PMLR, 2017, pp. 951-960.
Daxberger, E. et al. “Laplace Redux-Effortless Bayesian Deep
Learning”. Advances in Neural Information Processing Systems,
NeurIPS. Vol. 34. 2021.
Demmel, J. W. Applied Numerical Linear Algebra. SIAM, 1997.
Dennis, J. “On some methods based on Broyden’s secant ap­
proximations”. Numerical Methods for Non-Linear Optimization.
1971.
Dennis, J. E. and J. J. Moré. “Quasi-Newton methods, motivation
and theory”. SIAM Review 19.1 (1977), pp. 46-89.
Desautels, T., A. Krause, and J. W. Burdick. “Parallelizing Exploration­
Exploitation Tradeoffs with Gaussian Process Bandit Opti­
mization”. Proceedings of the 29th International Conference on
Machine Learning, ICML. icml.cc / Omnipress, 2012.
Deuflhard, P. and F. Bornemann. Scientific Computing with Ordi­
nary Differential Equations. Vol. 42. Springer Texts in Applied
Mathematics. Springer, 2002.
Diaconis, P. “Bayesian numerical analysis”. Statistical decision
theory and related topics IV (1988), pp. 163-175.
Diaconis, P. and D. Freedman. “Finite exchangeable sequences”.
The Annals of Probability (1980), pp. 745-764.
Diaconis, P. and D. Ylvisaker. “Conjugate priors for exponential
families”. The Annals of Statistics 7.2 (1979), pp. 269-281.
Dick, J., F. Y. Kuo, and I. H. Sloan. “High-dimensional integra­
tion: the quasi-Monte Carlo way”. Acta Numerica 22 (2013),
pp. 133-288.
Dixon, L. “Quasi-Newton algorithms generate identical points”.
Mathematical Programming 2.1 (1972a), pp. 383-387.
- “Quasi Newton techniques generate identical points II: The
proofs of four new theorems”. Mathematical Programming 3.1
(1972b), pp. 345-358.
Dormand, J. and P. Prince. “A family of embedded Runge-Kutta
formulae”. Journal of Computational and Applied Mathematics
(1980), pp. 19-26.
Doucet, A., N. de Freitas, and N. Gordon. “An Introduction
to Sequential Monte Carlo Methods”. Sequential Monte Carlo
Methods in Practice. Statistics for Engineering and Information
Science. Springer, New York, NY, 2001, pp. 3-14.
Driscoll, M. “The reproducing kernel Hilbert space structure of
the sample paths of a Gaussian process”. Probability Theory
and Related Fields 26.4 (1973), pp. 309-316.
Eich-Soellner, E. and C. Führer. “Implicit Ordinary Differen­
tial Equations”. Numerical Methods in Multibody Dynamics.
Vieweg+Teubner Verlag, 1998, pp. 139-192.
Einstein, A. “Zur Theorie der Brownschen Bewegung”. Annalen
der Physik (1906), pp. 371-381.
Fassbender, H. Symplectic methods for the symplectic eigenproblem.
Springer Science & Business Media, 2007.
Fillion, N. and R. M. Corless. “On the epistemological analysis
of modeling and computational error in the mathematical
sciences”. Synthese 191.7 (2014), pp. 1451-1467.
Fletcher, R. “A new approach to variable metric algorithms”.
The Computer Journal 13.3 (1970), p. 317.
- “Conjugate Gradient Methods for Indefinite Systems”. Dundee
Biennial Conference on Numerical Analysis. 1975, pp. 73-89.
Fletcher, R. and C. Reeves. “Function minimization by conjugate
gradients”. The Computer Journal (1964), pp. 149-154.
Fowler, D. and E. Robson. “Square root approximations in Old
Babylonian mathematics: YBC 7289 in context”. Historia Math-
ematica 25.4 (1998), pp. 366-378.
Frazier, P., W. Powell, and S. Dayanik. “The Knowledge-Gradient
Policy for Correlated Normal Beliefs”. INFORMS Journal on
Computing 21.4 (2009), pp. 599-613.
Fredholm, E. I. “Sur une classe d’équations fonctionnelles”. Acta
Mathematica 27 (1903), pp. 365-390.
Freitas, N. de, A. J. Smola, and M. Zoghi. “Exponential Regret
Bounds for Gaussian Process Bandits with Deterministic
Observations”. Proceedings of the 29th International Conference
on Machine Learning, ICML. icml.cc / Omnipress, 2012.
Freund, R. W. and N. M. Nachtigal. “QMR: a Quasi-Minimal
Residual Method for non-Hermitian Linear Systems”. Nu-
merische Mathematik 60.1 (1991), pp. 315-339.
Fröhlich, C. et al. “Bayesian Quadrature on Riemannian Data
Manifolds”. Proceedings of the 38th International Conference on
Machine Learning, ICML. Vol. 139. Proceedings of Machine
Learning Research. PMLR, 2021, pp. 3459-3468.
Garnett, R. Bayesian Optimization. Cambridge University Press,
2022.
Garnett, R., M. A. Osborne, and S. J. Roberts. “Bayesian opti­
mization for sensor set selection”. ACM/IEEE International
Conference on Information Processing in Sensor Networks. ACM.
2010, pp. 209-219.
Garnett, R. et al. “Bayesian Optimal Active Search and Survey­
ing”. Proceedings of the 29th International Conference on Machine
Learning, ICML. icml.cc / Omnipress, 2012.
Gauss, C. F. Theoria motus corporum coelestium in sectionibus conicis
solem ambientium. Perthes, F. and Besser, I.H., 1809.
- “Methodus nova integralium valores per approximationem
inveniendi”. Proceedings of the Royal Scientific Society of Göttin­
gen. Heinrich Dieterich, 1814.
Gautschi, W. Orthogonal Polynomials—Computation and Approxi­
mation. Oxford University Press, 2004.
Genz, A. “Numerical computation of rectangular bivariate and
trivariate normal and t probabilities”. Statistics and Computing
14.3 (2004), pp. 251-260.
Gerritsma, J., R. Onnink, and A. Versluis. “Geometry, resistance
and stability of the Delft systematic yacht hull series”. Inter­
national Shipbuilding Progress 28.328 (1981).
Ginsbourger, D., R. Le Riche, and L. Carraro. “A multi-points
criterion for deterministic parallel global optimization based
on Gaussian processes” (2008).
- “Kriging is well-suited to parallelize optimization”. Computa­
tional Intelligence in Expensive Optimization Problems 2 (2010),
pp. 131-162.
Girolami, M. et al. “The statistical finite element method (stat­
FEM) for coherent synthesis of observation data and model
predictions”. Computer Methods in Applied Mechanics and Engi­
neering 375 (2021), p. 113533.
Gittins, J. C. “Bandit processes and dynamic allocation indices”.
Journal of the Royal Statistical Society. Series B (Methodological)
(1979), pp. 148-177.
Goldberg, D. Genetic Algorithms in Search, Optimization, and Ma­
chine Learning. Addison-Wesley, 1989.
Goldfarb, D. “A family of variable metric updates derived by
variational means”. Mathematics of Computation 24.109 (1970),
pp. 23-26.
Golub, G. and C. Van Loan. Matrix computations. Johns Hopkins
University Press, 1996.
Gomez-Bombarelli, R. et al. “Automatic chemical design us­
ing a data-driven continuous representation of molecules”.
arXiv:1610.02415 [cs.LG] (2016).
Gonzalez, J., M. A. Osborne, and N. D. Lawrence. “GLASSES:
Relieving The Myopia Of Bayesian Optimisation”. Proceedings
of the 19th International Conference on Artificial Intelligence and
Statistics, AISTATS. Vol. 51. JMLR Workshop and Conference
Proceedings. JMLR.org, 2016, pp. 790-799.
Gonzalez, J. et al. “Batch Bayesian Optimization via Local Penal­
ization”. Proceedings of the 19th International Conference on Arti­
ficial Intelligence and Statistics, AISTATS. Vol. 51. JMLR Work­
shop and Conference Proceedings. JMLR.org, 2016, pp. 648­
657.
Goodfellow, I. J., Y. Bengio, and A. Courville. Deep Learning. MIT
Press, 2016.
Gradshteyn, I. and I. Ryzhik. Table of Integrals, Series, and Products.
7th edition. Academic Press, 2007.
Grcar, J. “Mathematicians of Gaussian elimination”. Notices of
the AMS 58.6 (2011), pp. 782-792.
Greenbaum, A. Iterative Methods for Solving Linear Systems. Vol. 17.
SIAM, 1997.
Greenstadt, J. “Variations on variable-metric methods”. Mathe­
matics of Computation 24 (1970), pp. 1-22.
Grewal, M. S. and A. P. Andrews. Kalman Filtering: Theory and
Practice Using MATLAB. John Wiley & Sons, Inc., 2001.
Griewank, A. Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation. Frontiers in Applied Mathematics.
SIAM, 2000.
Griewank, A. and A. Walther. Evaluating Derivatives. Cambridge
University Press, 2008.
Gunter, T. et al. “Sampling for Inference in Probabilistic Models
with Fast Bayesian Quadrature”. Advances in Neural Informa­
tion Processing Systems, NeurIPS. 2014, pp. 2789-2797.
Hager, W. “Updating the Inverse of a Matrix”. SIAM Review 31.2
(1989), pp. 221-239.
Hairer, E., C. Lubich, and G. Wanner. Geometric numerical inte­
gration: structure-preserving algorithms for ordinary differential
equations. Vol. 31. Springer Science & Business Media, 2006.
Hairer, E., S. Nørsett, and G. Wanner. Solving Ordinary Differen­
tial Equations I-Nonstiff Problems. 2nd edition. Vol. 8. Springer
Series in Computational Mathematics. Springer, 1993.
Hairer, E. and G. Wanner. Solving Ordinary Differential Equations
II-Stiff and Differential-Algebraic Problems. 2nd edition. Vol. 14.
Springer, 1996.
Hartikainen, J. and S. Särkkä. “Kalman filtering and smoothing
solutions to temporal Gaussian process regression models”.
IEEE International Workshop on Machine Learning for Signal
Processing (MLSP), 2010. 2010, pp. 379-384.
Helmert, F. “Über die Berechnung des wahrscheinlichen Fehlers
aus einer endlichen Anzahl wahrer Beobachtungsfehler”.
Zeitschrift für Mathematik und Physik 20 (1875), pp. 300-303.
Henderson, P. et al. “Deep Reinforcement Learning That Mat­
ters”. Proceedings of the Thirty-Second AAAI Conference on Arti­
ficial Intelligence. AAAI Press, 2018, pp. 3207-3214.
Hennig, P. “Fast Probabilistic Optimization from Noisy Gradi­
ents”. Proceedings of the 30th International Conference on Machine
Learning, ICML. Vol. 28. JMLR Workshop and Conference Pro­
ceedings. JMLR.org, 2013, pp. 62-70.
- “Probabilistic Interpretation of Linear Solvers”. SIAM Journal
on Optimization (2015), pp. 210-233.
Hennig, P. and S. Hauberg. “Probabilistic Solutions to Differen­
tial Equations and their Application to Riemannian Statistics”.
Proceedings of the Seventeenth International Conference on Artifi­
cial Intelligence and Statistics, AISTATS. Vol. 33. JMLR Work­
shop and Conference Proceedings. JMLR.org, 2014, pp. 347­
355.
Hennig, P. and M. Kiefel. “Quasi-Newton methods: A new
direction”. International Conference on Machine Learning, ICML.
2012.
Hennig, P., M. Osborne, and M. Girolami. “Probabilistic numer­
ics and uncertainty in computations”. Proceedings of the Royal
Society of London A: Mathematical, Physical and Engineering
Sciences 471.2179 (2015).
Hennig, P. and C. Schuler. “Entropy search for information­
efficient global optimization”. Journal of Machine Learning
Research 13.Jun (2012), pp. 1809-1837.
Hernández-Lobato, J. M. et al. “Predictive Entropy Search for
Bayesian Optimization with Unknown Constraints”. Proceed­
ings of the 32nd International Conference on Machine Learning,
ICML. Vol. 37. JMLR Workshop and Conference Proceedings.
JMLR.org, 2015, pp. 1699-1707.
Hestenes, M. and E. Stiefel. “Methods of conjugate gradients
for solving linear systems”. Journal of Research of the National
Bureau of Standards 49.6 (1952), pp. 409-436.
Hinton, G. “A practical guide to training restricted Boltzmann
machines”. Neural Networks: Tricks of the Trade. Springer, 2012,
pp. 599-619.
Hoffman, M., E. Brochu, and N. de Freitas. “Portfolio Alloca­
tion for Bayesian Optimization”. UAI 2011, Proceedings of the
Twenty-Seventh Conference on Uncertainty in Artificial Intelli­
gence. AUAI Press, 2011, pp. 327-336.
Hoffman, M. and Z. Ghahramani. “Output-Space Predictive
Entropy Search for Flexible Global Optimization”. NIPS work­
shop on Bayesian optimization. 2015.
Hoos, H. H. “Programming by optimization”. Communications
of the ACM 55.2 (2012), pp. 70-80.
Horst, R. and H. Tuy. Global optimization: Deterministic approaches.
Springer Science & Business Media, 2013.
Houlsby, N. et al. “Bayesian Active Learning for Classification
and Preference Learning”. arXiv:1112.5745 [stat.ML] (2011).
Hull, T. et al. “Comparing numerical methods for ordinary
differential equations”. SIAM Journal on Numerical Analysis
9.4 (1972), pp. 603-637.
Huszár, F. and D. Duvenaud. “Optimally-Weighted Herding is
Bayesian Quadrature”. Proceedings of the Twenty-Eighth Confer­
ence on Uncertainty in Artificial Intelligence, UAI. AUAI Press,
2012, pp. 377-386.
Hutter, F., H. Hoos, and K. Leyton-Brown. “Sequential Model­
Based Optimization for General Algorithm Configuration”.
Proceedings of LION-5. 2011, pp. 507-523.
Hutter, M. Universal Artificial Intelligence. Texts in Theoretical
Computer Science. Springer, 2010.
Ibragimov, I. and R. Has’minskii. Statistical Estimation: Asymp­
totic Theory. Springer, New York, 1981.
Ipsen, I. Numerical matrix analysis: Linear systems and least squares.
SIAM, 2009.
Islam, R. et al. “Reproducibility of Benchmarked Deep Reinforce­
ment Learning Tasks for Continuous Control”. arXiv:1708.04133
[cs.LG] (2017).
Isserlis, L. “On a formula for the product-moment coefficient of
any order of a normal frequency distribution in any number
of variables”. Biometrika 12.1/2 (1918), pp. 134-139.
Jaynes, E. and G. Bretthorst. Probability Theory: the Logic of Science.
Cambridge University Press, 2003.
Jiang, S. et al. “Efficient Nonmyopic Active Search”. Proceedings
of the 34th International Conference on Machine Learning, ICML.
Vol. 70. Proceedings of Machine Learning Research. PMLR,
2017, pp. 1714-1723.
Jiang, S. et al. “Efficient nonmyopic Bayesian optimization and
quadrature”. arXiv:1909.04568 [cs.LG] (2019).
- “BINOCULARS for efficient, nonmyopic sequential experi­
mental design”. Proceedings of the 37th International Conference
on Machine Learning, ICML. Vol. 119. Proceedings of Machine
Learning Research. PMLR, 2020, pp. 4794-4803.
John, D., V. Heuveline, and M. Schober. “GOODE: A Gaussian
Off-The-Shelf Ordinary Differential Equation Solver”. Proceed­
ings of the 36th International Conference on Machine Learning,
ICML. Vol. 97. Proceedings of Machine Learning Research.
PMLR, 2019, pp. 3152-3162.
Jones, D. “A taxonomy of global optimization methods based on
response surfaces”. Journal of Global Optimization 21.4 (2001),
pp. 345-383.
Jones, D., M. Schonlau, and W. Welch. “Efficient global opti­
mization of expensive black-box functions”. Journal of Global
Optimization 13.4 (1998), pp. 455-492.
Kadane, J. B. and G. W. Wasilkowski. “Average case epsilon­
complexity in computer science: A Bayesian view”. Bayesian
Statistics 2, Proceedings of the Second Valencia International Meet­
ing. 1985, pp. 361-374.
Kalman, R. “A New Approach to Linear Filtering and Prediction
Problems”. Journal of Fluids Engineering 82.1 (1960), pp. 35-45.
Kanagawa, M. and P. Hennig. “Convergence Guarantees for
Adaptive Bayesian Quadrature Methods”. Advances in Neural
Information Processing Systems, NeurIPS. 2019, pp. 6234-6245.
Kanagawa, M., B. K. Sriperumbudur, and K. Fukumizu. “Con­
vergence Analysis of Deterministic Kernel-Based Quadrature
Rules in Misspecified Settings”. Foundations of Computational
Mathematics 20.1 (2020), pp. 155-194.
Kanagawa, M. et al. “Gaussian Processes and Kernel Methods: A
Review on Connections and Equivalences”. arXiv:1807.02582
[stat.ML] (2018).
Karatzas, I. and S. E. Shreve. Brownian Motion and Stochastic
Calculus. Springer, 1991.
Karvonen, T. and S. Särkkä. “Classical quadrature rules via
Gaussian processes”. IEEE International Workshop on Machine
Learning for Signal Processing (MLSP). Vol. 27. 2017.
Kennedy, M. “Bayesian quadrature with non-normal approxi­
mating functions”. Statistics and Computing 8.4 (1998), pp. 365­
375.
Kersting, H. “Uncertainty-Aware Numerical Solutions of ODEs
by Bayesian Filtering”. PhD thesis. Eberhard Karls Universität
Tübingen, 2020.
Kersting, H. and P. Hennig. “Active Uncertainty Calibration
in Bayesian ODE Solvers”. Proceedings of the Thirty-Second
Conference on Uncertainty in Artificial Intelligence, UAI. AUAI
Press, 2016.
Kersting, H. and M. Mahsereci. “A Fourier State Space Model for
Bayesian ODE Filters”. Workshop on Invertible Neural Networks,
Normalizing Flows, and Explicit Likelihood Models, ICML. 2020.
Kersting, H., T. J. Sullivan, and P. Hennig. “Convergence Rates
of Gaussian ODE Filters”. Statistics and Computing 30.6 (2020),
pp. 1791-1816.
Kersting, H. et al. “Differentiable Likelihoods for Fast Inver­
sion of ’Likelihood-Free’ Dynamical Systems”. Proceedings of
the 37th International Conference on Machine Learning, ICML.
Vol. 119. Proceedings of Machine Learning Research. PMLR,
2020, pp. 5198-5208.
Kimeldorf, G. S. and G. Wahba. “A correspondence between
Bayesian estimation on stochastic processes and smoothing
by splines”. The Annals of Mathematical Statistics 41.2 (1970),
pp. 495-502.
Kingma, D. P. and J. Ba. “Adam: A Method for Stochastic Opti­
mization”. 3rd International Conference on Learning Representa­
tions, ICLR. 2015.
Kitagawa, G. “Non-Gaussian State-Space Modeling of Nonsta-
tionary Time Series”. Journal of the American Statistical Associa­
tion 82.400 (1987), pp. 1032-1041.
Klein, A. et al. “Fast Bayesian Optimization of Machine Learning
Hyperparameters on Large Datasets”. Proceedings of the 20th
International Conference on Artificial Intelligence and Statistics,
AISTATS. Vol. 54. Proceedings of Machine Learning Research.
PMLR, 2017, pp. 528-536.
Ko, C.-W., J. Lee, and M. Queyranne. “An exact algorithm for
maximum entropy sampling”. Operations Research 43.4 (1995),
pp. 684-691.
Kochenderfer, M. J. Decision Making Under Uncertainty: Theory
and Application. The MIT Press, 2015.
Koller, D. and N. Friedman. Probabilistic Graphical Models: Princi­
ples and Techniques. MIT Press, 2009.
Kolmogorov, A. “Zur Theorie der Markoffschen Ketten”. Mathe-
matische Annalen 112 (1 1936), pp. 155-160.
Kolmogorov, A. “Three approaches to the quantitative definition
of information”. International Journal of Computer Mathematics
2.1-4 (1968), pp. 157-168.
Krämer, N. and P. Hennig. “Stable Implementation of Proba­
bilistic ODE Solvers”. arXiv:2012.10106 [stat.ML] (2020).
Krämer, N. and P. Hennig. “Linear-Time Probabilistic Solutions
of Boundary Value Problems”. Advances in Neural Information
Processing Systems, NeurIPS. 2021.
Krämer, N., J. Schmidt, and P. Hennig. “Probabilistic Numerical
Method of Lines for Time-Dependent Partial Differential
Equations”. Artificial Intelligence and Statistics (AISTATS). 2022.
Krämer, N. et al. “Probabilistic ODE Solutions in Millions of
Dimensions”. arXiv:2110.11812 [stat.ML] (2021).
Krizhevsky, A. and G. Hinton. Learning multiple layers of features
from tiny images. Tech. rep. 2009.
Kushner, H. J. “A New Method of Locating the Maximum Point
of an Arbitrary Multipeak Curve in the Presence of Noise”.
Journal of Basic Engineering 86.1 (1964), pp. 97-106.
Kushner, H. J. “A versatile stochastic model of a function of
unknown and time varying form”. Journal of Mathematical
Analysis and Applications 5.1 (1962), pp. 150-167.
Lai, T. L. and H. Robbins. “Asymptotically efficient adaptive
allocation rules”. Advances in Applied Mathematics 6.1 (1985),
pp. 4-22.
Lanczos, C. “An iteration method for the solution of the eigen­
value problem of linear differential and integral operators”.
Journal of Research of the National Bureau of Standards 45 (1950),
pp. 255-282.
Laplace, P. Théorie Analytique des Probabilités. 2nd edition. V.
Courcier, Paris, 1814.
Larkin, F. “Gaussian measure in Hilbert space and applications
in numerical analysis”. Rocky Mountain Journal of Mathematics
2.3 (1972).
Lauritzen, S. “Time series analysis in 1880: A discussion of
contributions made by TN Thiele”. International Statistical
Review/Revue Internationale de Statistique (1981), pp. 319-331.
Lauritzen, S. and D. Spiegelhalter. “Local computations with
probabilities on graphical structures and their application to
expert systems”. Journal of the Royal Statistical Society. Series B
(Methodological) 50 (1988), pp. 157-224.
Le Cam, L. “Convergence of estimates under dimensionality
restrictions”. Annals of Statistics 1 (1973), pp. 38-53.
Lecomte, C. “Exact statistics of systems with uncertainties: An
analytical theory of rank-one stochastic dynamic systems”.
Journal of Sound and Vibration 332.11 (2013), pp. 2750-2776.
Lemaréchal, C. “Cauchy and the Gradient Method”. Docu­
menta Mathematica Extra Volume: Optimization Stories (2012),
pp. 251-254.
Lemieux, C. Monte Carlo and quasi-Monte Carlo sampling. Springer
Science & Business Media, 2009.
Lie, H. C., M. Stahn, and T. J. Sullivan. “Randomised one-step
time integration methods for deterministic operator differen­
tial equations”. Calcolo 59.1 (2022), p. 13.
Lie, H. C., A. M. Stuart, and T. J. Sullivan. “Strong conver­
gence rates of probabilistic integrators for ordinary differen­
tial equations”. Statistics and Computing 29.6 (2019), pp. 1265­
1283.
Lie, H. C., T. J. Sullivan, and A. L. Teckentrup. “Random For­
ward Models and Log-Likelihoods in Bayesian Inverse Prob­
lems”. SIAM/ASA Journal on Uncertainty Quantification 6.4
(2018), pp. 1600-1629.
Lindström, E., H. Madsen, and J. N. Nielsen. Statistics for Finance.
Texts in Statistical Science. Chapman and Hall/CRC, 2015.
Lorenz, E. N. “Deterministic Nonperiodic Flow”. Journal of At­
mospheric Sciences 20 (2 1963), pp. 130-141.
Loveland, D. “A new interpretation of the von Mises’ concept of
random sequence”. Mathematical Logic Quarterly 12.1 (1966),
pp. 279-294.
Luenberger, D. Introduction to Linear and Nonlinear Programming.
2nd edition. Addison Wesley, 1984.
Lütkepohl, H. Handbook of Matrices. Wiley, 1996.
MacKay, D. “The evidence framework applied to classification
networks”. Neural computation 4.5 (1992), pp. 720-736.
- “Introduction to Gaussian processes”. NATO ASI Series F
Computer and Systems Sciences 168 (1998), pp. 133-166.
- Information Theory, Inference, and Learning Algorithms. Cam­
bridge University Press, 2003.
- The Humble Gaussian Distribution. Tech. rep. Cavendish Labo­
ratory, Cambridge University, 2006.
Magnani, E. et al. “Bayesian Filtering for ODEs with Bounded
Derivatives”. arXiv:1709.08471 [cs.NA] (2017).
Mahsereci, M. “Probabilistic Approaches to Stochastic Opti­
mization”. PhD thesis. Eberhard Karls University of Tübingen,
2018.
Mahsereci, M. and P. Hennig. “Probabilistic Line Searches for
Stochastic Optimization”. Advances in Neural Information Pro­
cessing Systems, NeurIPS. 2015, pp. 181-189.
- “Probabilistic Line Searches for Stochastic Optimization”.
Journal of Machine Learning Research 18.119 (2017), pp. 1-59.
Mahsereci, M. et al. “Early Stopping without a Validation Set”.
arXiv:1703.09580 [cs.LG] (2017).
Mania, H., A. Guy, and B. Recht. “Simple random search pro­
vides a competitive approach to reinforcement learning”.
arXiv:1803.07055 [cs.LG] (2018).
Marchant, R. and F. Ramos. “Bayesian Optimisation for Intelli­
gent Environmental Monitoring”. NIPS workshop on Bayesian
Optimization and Decision Making. 2012.
Marchant, R., F. Ramos, and S. Sanner. “Sequential Bayesian
Optimisation for Spatial-Temporal Monitoring”. Proceedings
of the Thirtieth Conference on Uncertainty in Artificial Intelligence,
UAI. AUAI Press, 2014, pp. 553-562.
Markov, A. “Rasprostranenie zakona bol’shih chisel na velichiny,
zavisyaschie drug ot druga (A generalization of the law
of large numbers to variables that depend on each other)”.
Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom uni-
versitete (Proceedings of the Society for Physics and Mathematics
at Kazan University) 15.135-156 (1906), p. 18.
Marmin, S., C. Chevalier, and D. Ginsbourger. “Differentiat­
ing the multipoint Expected Improvement for optimal batch
design”. arXiv:1503.05509 [stat.ML] (2015).
- “Efficient batch-sequential Bayesian optimization with mo­
ments of truncated Gaussian vectors”. arXiv:1609.02700 [stat.ML]
(2016).
Martinez R., H. J. “Local and Superlinear Convergence of Struc­
tured Secant Methods from the Convex Class”. PhD thesis.
Rice University, 1988.
Matérn, B. “Spatial variation”. Meddelanden från Statens Skogs­
forskningsinstitut 49.5 (1960).
Matsuda, T. and Y. Miyatake. “Estimation of Ordinary Differ­
ential Equation Models with Discretization Error Quantifi­
cation”. SIAM/ASA Journal on Uncertainty Quantification 9.1
(2021), pp. 302-331.
Matsumoto, M. and T. Nishimura. “Mersenne twister: a 623-
dimensionally equidistributed uniform pseudo-random num­
ber generator”. ACM Transactions on Modeling and Computer
Simulation (TOMACS) 8.1 (1998), pp. 3-30.
McLeod, M., M. A. Osborne, and S. J. Roberts. “Practical Bayesian
Optimization for Variable Cost Objectives”. arXiv:1703.04335
[stat.ML] (2017).
Meister, A. Numerik Linearer Gleichungssysteme. Springer, 2011.
Minka, T. Deriving quadrature rules from Gaussian processes. Tech.
rep. Statistics Department, Carnegie Mellon University, 2000.
Mitchell, M. An introduction to genetic algorithms. MIT press, 1998.
Mitchell, T. M. The Need for Biases in Learning Generalizations.
Tech. rep. CBM-TR 5-110. Rutgers University, 1980.
Mockus, J., V. Tiesis, and A. Zilinskas. “The Application of
Bayesian Methods for Seeking the Extremum”. Toward Global
Optimization. Vol. 2. Elsevier, 1978.
Moore, E. “On the reciprocal of the general algebraic matrix,
abstract”. Bulletin of the American Mathematical Society 26 (1920),
pp. 394-395.
Moré, J. J. “Recent developments in algorithms and software for
trust region methods”. Mathematical Programming: The state of
the art. 1983, pp. 258-287.
Neal, R. “Annealed importance sampling”. Statistics and Com­
puting 11.2 (2001), pp. 125-139.
Nesterov, Y. “A method of solving a convex programming prob­
lem with convergence rate O(1/k²)”. Soviet Mathematics Dok-
lady. Vol. 27. 2. 1983, pp. 372-376.
Netzer, Y. et al. “Reading digits in natural images with unsuper­
vised feature learning”. NIPS workshop on deep learning and
unsupervised feature learning. 2. 2011, p. 5.
Nickson, T. et al. “Automated Machine Learning on Big Data us­
ing Stochastic Algorithm Tuning”. arXiv:1407.7969 [stat.ML]
(2014).
Nocedal, J. and S. Wright. Numerical Optimization. Springer Ver­
lag, 1999.
Nordsieck, A. “On numerical integration of ordinary differential
equations”. Mathematics of Computation 16.77 (1962), pp. 22­
49.
Novak, E. Deterministic and stochastic error bounds in numerical
analysis. Vol. 1349. Springer, 2006.
Nyström, E. “Über die praktische Auflösung von Integralgle-
ichungen mit Anwendungen auf Randwertaufgaben”. Acta
Mathematica 54.1 (1930), pp. 185-204.
O’Hagan, A. “Bayes-Hermite quadrature”. Journal of Statistical
Planning and Inference (1991), pp. 245-260.
- “Some Bayesian Numerical Analysis”. Bayesian Statistics (1992),
pp. 345-363.
O’Hagan, A. and J. F. C. Kingman. “Curve Fitting and Optimal
Design for Prediction”. Journal of the Royal Statistical Society.
Series B 40.1 (1978), pp. 1-42.
O’Neil, C. Weapons of math destruction: How big data increases
inequality and threatens democracy. Crown, 2016.
Oates, C. J. and T. J. Sullivan. “A Modern Retrospective on
Probabilistic Numerics”. Statistics and Computing 29.6 (2019),
pp. 1335-1351.
Oates, C. J., M. Girolami, and N. Chopin. “Control functionals
for Monte Carlo integration”. Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 79.3 (2017), pp. 695­
718.
Oates, C. J. et al. “Convergence rates for a class of estimators
based on Stein’s method”. Bernoulli 25.2 (2019a), pp. 1141­
1159.
Oates, C. J. et al. “Bayesian Probabilistic Numerical Methods
in Time-Dependent State Estimation for Industrial Hydrocy­
clone Equipment”. Journal of the American Statistical Association
114.528 (2019b), pp. 1518-1531.
Oesterle, J. et al. “Numerical uncertainty can critically affect
simulations of mechanistic models in neuroscience”. bioRxiv
(2021).
Øksendal, B. Stochastic Differential Equations: An Introduction with
Applications. 6th edition. Springer, 2003.
Ortega, J. and W. Rheinboldt. Iterative solution of nonlinear equa­
tions in several variables. Vol. 30. Classics in Applied Mathe­
matics. SIAM, 1970.
Osborne, M., R. Garnett, and S. Roberts. “Gaussian processes for
global optimization”. 3rd International Conference on Learning
and Intelligent Optimization (LION3). 2009.
Osborne, M. A. et al. “Active Learning of Model Evidence
Using Bayesian Quadrature”. Advances in Neural Information
Processing Systems, NeurIPS. 2012, pp. 46-54.
Owhadi, H. “Multigrid with Rough Coefficients and Multireso­
lution Operator Decomposition from Hierarchical Informa­
tion Games”. SIAM Review 59.1 (2017), pp. 99-149.
Owhadi, H. and C. Scovel. “Conditioning Gaussian measure on
Hilbert space”. arXiv:1506.04208 [math.PR] (2015).
- “Toward Machine Wald”. Springer Handbook of Uncertainty
Quantification. Springer, 2016, pp. 1-35.
Owhadi, H. and L. Zhang. “Gamblets for opening the complexity­
bottleneck of implicit schemes for hyperbolic and parabolic
ODEs/PDEs with rough coefficients”. Journal of Computational
Physics 347 (2017), pp. 99-128.
Packel, E. and J. Traub. “Information-based complexity”. Nature
328.6125 (1987), pp. 29-33.
Paleyes, A. et al. “Emulation of physical processes with Emukit”.
Second Workshop on Machine Learning and the Physical Sciences,
NeurIPS. 2019.
Parlett, B. The Symmetric Eigenvalue Problem. Prentice-Hall, 1980.
Pasquale, F. The black box society: The secret algorithms that control
money and information. Harvard University Press, 2015.
Patterson, D. “The Carbon Footprint of Machine Learning Train­
ing Will Plateau, Then Shrink” (2022).
Paul, S., M. A. Osborne, and S. Whiteson. “Fingerprint Policy
Optimisation for Robust Reinforcement Learning”. Proceed­
ings of the 36th International Conference on Machine Learning,
ICML. Vol. 97. Proceedings of Machine Learning Research.
PMLR, 2019, pp. 5082-5091.
Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan
Kaufmann, 1988.
Pitman, E. “Sufficient statistics and intrinsic accuracy”. Mathe­
matical Proceedings of the Cambridge Philosophical Society. Vol. 32.
04. 1936, pp. 567-579.
Poincaré, H. Calcul des Probabilités. Gauthier-Villars, 1896.
Polyak, B. T. “Some methods of speeding up the convergence
of iteration methods”. USSR Computational Mathematics and
Mathematical Physics 4.5 (1964), pp. 1-17.
Powell, M. J. D. “A new algorithm for unconstrained optimiza­
tion”. Nonlinear Programming. AP, 1970.
- “Convergence properties of a class of minimization algo­
rithms”. Nonlinear programming 2. 1975, pp. 1-27.
Press, W. et al. Numerical Recipes in Fortran 77: The Art of Scientific
Computing. Cambridge University Press, 1992.
Quiñonero-Candela, J. and C. Rasmussen. “A unifying view of
sparse approximate Gaussian process regression”. Journal of
Machine Learning Research 6 (2005), pp. 1939-1959.
Rackauckas, C. et al. “A Comparison of Automatic Differentia­
tion and Continuous Sensitivity Analysis for Derivatives of
Differential Equation Solutions”. arXiv:1812.01892 [math.NA]
(2018).
Rahimi, A. and B. Recht. “Random Features for Large-Scale
Kernel Machines”. Advances in Neural Information Processing
Systems, NeurIPS. Curran Associates, Inc., 2007, pp. 1177­
1184.
Raissi, M., P. Perdikaris, and G. E. Karniadakis. “Machine learn­
ing of linear differential equations using Gaussian processes”.
Journal of Computational Physics 348 (2017), pp. 683-693.
Rasmussen, C. and C. Williams. Gaussian Processes for Machine
Learning. MIT, 2006.
Rauch, H., C. Striebel, and F. Tung. “Maximum likelihood esti­
mates of linear dynamic systems”. Journal of the American Insti­
tute of Aeronautics and Astronautics (AIAA) 3.8 (1965), pp. 1445­
1450.
Reid, W. Riccati differential equations. Elsevier, 1972.
Riccati, J. “Animadversiones in aequationes differentiales se-
cundi gradus”. Actorum Eruditorum Supplementa 8 (1724),
pp. 66-73.
Ritter, K. Average-case analysis of numerical problems. Lecture
Notes in Mathematics 1733. Springer, 2000.
Robert, C. and G. Casella. Monte Carlo Statistical Methods. Springer
Science & Business Media, 2013.
Rontsis, N., M. A. Osborne, and P. J. Goulart. “Distributionally
Robust Optimization Techniques in Batch Bayesian Optimiza­
tion”. Journal of Machine Learning Research 21.149 (2020), pp. 1­
26.
Saad, Y. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
Saad, Y. and M. Schultz. “GMRES: A generalized minimal
residual algorithm for solving nonsymmetric linear systems”.
SIAM Journal on scientific and statistical computing 7.3 (1986),
pp. 856-869.
Sacks, J. and D. Ylvisaker. “Statistical designs and integral ap­
proximation”. Proc. 12th Bienn. Semin. Can. Math. Congr. 1970,
pp. 115-136.
- “Model robust design in regression: Bayes theory”. Proceed­
ings of the Berkeley Conference in Honor of Jerzy Neyman and Jack
Kiefer. Vol. 2. 1985, pp. 667-679.
Sard, A. Linear approximation. American Mathematical Society,
1963.
Särkkä, S. Bayesian filtering and smoothing. Cambridge University
Press, 2013.
Särkkä, S. and A. Solin. Applied Stochastic Differential Equations.
Cambridge University Press, 2019.
Särkkä, S., A. Solin, and J. Hartikainen. “Spatiotemporal learn­
ing via infinite-dimensional Bayesian filtering and smoothing:
A look at Gaussian process regression through Kalman fil­
tering”. IEEE Signal Processing Magazine 30.4 (2013), pp. 51­
61.
Schmidt, J., N. Kramer, and P. Hennig. “A Probabilistic State
Space Model for Joint Inference from Differential Equations
and Data”. Advances in Neural Information Processing Systems,
NeurIPS. 2021.
Schmidt, R. M., F. Schneider, and P. Hennig. “Descending through
a Crowded Valley - Benchmarking Deep Learning Optimiz­
ers”. Proceedings of the 38th International Conference on Machine
Learning. Vol. 139. Proceedings of Machine Learning Research.
PMLR, 2021, pp. 9367-9376.
Schneider, F., F. Dangel, and P. Hennig. “Cockpit: A Practical
Debugging Tool for the Training of Deep Neural Networks”.
Advances in Neural Information Processing Systems, NeurIPS.
2021.
Schober, M., D. Duvenaud, and P. Hennig. “Probabilistic ODE
Solvers with Runge-Kutta Means”. Advances in Neural Infor­
mation Processing Systems, NeurIPS. 2014, pp. 739-747.
Schober, M., S. Särkkä, and P. Hennig. “A probabilistic model
for the numerical solution of initial value problems”. Statistics
and Computing 29.1 (2019), pp. 99-122.
Schober, M. et al. “Probabilistic shortest path tractography in
DTI using Gaussian Process ODE solvers”. Medical Image
Computing and Computer-Assisted Intervention-MICCAI 2014.
Springer, 2014.
Schölkopf, B. “The Kernel Trick for Distances”. Advances in
Neural Information Processing Systems, NeurIPS. MIT Press,
2000, pp. 301-307.
Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press,
2002.
Schur, I. “Über Potenzreihen, die im Innern des Einheitskreises
beschränkt sind.” Journal für die reine und angewandte Mathe­
matik 147 (1917), pp. 205-232.
Schwartz, R. et al. “Green AI”. arXiv:1907.10597 [cs.CY] (2019).
Scieur, D. et al. “Integration Methods and Optimization Algo­
rithms”. Advances in Neural Information Processing Systems,
NeurIPS. 2017, pp. 1109-1118.
Shah, A. and Z. Ghahramani. “Parallel Predictive Entropy Search
for Batch Global Optimization of Expensive Objective Func­
tions”. Advances in Neural Information Processing Systems, NeurIPS.
2015, pp. 3330-3338.
Shahriari, B. et al. “An Entropy Search Portfolio for Bayesian
Optimization”. arXiv:1406.4625 [stat.ML] (2014).
Shahriari, B. et al. “Taking the human out of the loop: A review
of Bayesian optimization”. Proceedings of the IEEE 104.1 (2016),
pp. 148-175.
Shanno, D. “Conditioning of quasi-Newton methods for func­
tion minimization”. Mathematics of Computation 24.111 (1970),
pp. 647-656.
Skeel, R. “Equivalent Forms of Multistep Formulas”. Mathemat­
ics of Computation 33 (1979).
Skilling, J. “Bayesian solution of ordinary differential equations”.
Maximum Entropy and Bayesian Methods (1991).
Smola, A. et al. “A Hilbert space embedding for distributions”.
International Conference on Algorithmic Learning Theory. 2007,
pp. 13-31.
Snelson, E. and Z. Ghahramani. “Sparse Gaussian Processes us­
ing Pseudo-inputs”. Advances in Neural Information Processing
Systems, NeurIPS. 2005, pp. 1257-1264.
Snoek, J., H. Larochelle, and R. P. Adams. “Practical Bayesian
Optimization of Machine Learning Algorithms”. Advances in
Neural Information Processing Systems, NeurIPS. 2012, pp. 2960­
2968.
Snoek, J. et al. “Scalable Bayesian Optimization Using Deep
Neural Networks”. Proceedings of the 32nd International Confer­
ence on Machine Learning, ICML. Vol. 37. JMLR Workshop and
Conference Proceedings. JMLR.org, 2015, pp. 2171-2180.
Solak, E. et al. “Derivative Observations in Gaussian Process
Models of Dynamic Systems”. Advances in Neural Information
Processing Systems, NeurIPS. MIT Press, 2002, pp. 1033-1040.
Solin, A. and S. Särkkä. “Explicit Link Between Periodic Covari­
ance Functions and State Space Models”. Proceedings of the
Seventeenth International Conference on Artificial Intelligence and
Statistics, AISTATS. Vol. 33. JMLR Workshop and Conference
Proceedings. JMLR.org, 2014, pp. 904-912.
Sonneveld, P. “CGS, a fast Lanczos-type solver for nonsymmet-
ric linear systems”. SIAM Journal on Scientific and Statistical
Computing 10.1 (1989), pp. 36-52.
Spitzbart, A. “A generalization of Hermite’s Interpolation For­
mula”. The American Mathematical Monthly 67.1 (1960), pp. 42­
46.
Springenberg, J. T. et al. “Bayesian Optimization with Robust
Bayesian Neural Networks”. Advances in Neural Information
Processing Systems, NeurIPS. 2016, pp. 4134-4142.
Srinivas, N. et al. “Gaussian Process Optimization in the Bandit
Setting: No Regret and Experimental Design”. Proceedings of
the 27th International Conference on Machine Learning, ICML.
Omnipress, 2010, pp. 1015-1022.
Stein, M. Interpolation of spatial data: some theory for Kriging.
Springer Verlag, 1999.
Steinwart, I. “Convergence Types and Rates in Generic Karhunen-
Loève Expansions with Applications to Sample Path Proper­
ties”. Potential Analysis 51.3 (2019), pp. 361-395.
Steinwart, I. and A. Christmann. Support Vector Machines. Springer
Science & Business Media, 2008.
Streltsov, S. and P. Vakili. “A Non-myopic Utility Function for
Statistical Global Optimization Algorithms”. Journal of Global
Optimization 14.3 (1999), pp. 283-298.
Student. “The probable error of a mean”. Biometrika 6 (1 1908),
pp. 1-25.
Sul’din, A. V. “Wiener measure and its applications to approx­
imation methods. I”. Izvestiya Vysshikh Uchebnykh Zavedenii.
Matematika 6 (1959), pp. 145-158.
- “Wiener measure and its applications to approximation meth­
ods. II”. Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika 5
(1960), pp. 165-179.
Sullivan, T. Introduction to uncertainty quantification. Vol. 63. Texts
in Applied Mathematics. Springer, 2015.
Sutton, R. and A. Barto. Reinforcement Learning. MIT Press, 1998.
Swersky, K., J. Snoek, and R. P. Adams. “Freeze-Thaw Bayesian
Optimization”. arXiv:1406.3896 [stat.ML] (2014).
Swersky, K. et al. “Raiders of the lost architecture: Kernels
for Bayesian optimization in conditional parameter spaces”.
NIPS workshop on Bayesian Optimization in theory and practice
(BayesOpt’13). 2013.
Tarantola, A. Inverse Problem Theory and Methods for Model Param­
eter Estimation. SIAM, 2005.
Teschl, G. Ordinary Differential Equations and Dynamical Systems.
Vol. 140. Graduate Studies in Mathematics. American Mathe­
matical Society, 2012.
Teymur, O., K. Zygalakis, and B. Calderhead. “Probabilistic
Linear Multistep Methods”. Advances in Neural Information
Processing Systems, NeurIPS. 2016, pp. 4314-4321.
Teymur, O. et al. “Implicit Probabilistic Integrators for ODEs”.
Advances in Neural Information Processing Systems, NeurIPS.
2018, pp. 7255-7264.
Thaler, R. H. “Anomalies: The Winner’s Curse”. Journal of Eco­
nomic Perspectives 2.1 (1988), pp. 191-202.
Thiele, T. “Om Anvendelse af mindste Kvadraters Methode i
nogle Tilfælde, hvor en Komplikation af visse Slags uensart­
ede tilfældige Fejlkilder giver Fejlene en ‘systematisk’ Karak­
ter”. Det Kongelige Danske Videnskabernes Selskabs Skrifter -
Naturvidenskabelig og Mathematisk Afdeling (1880), pp. 381­
408.
Traub, J., G. Wasilkowski, and H. Woźniakowski. Information, Un­
certainty, Complexity. Addison-Wesley Publishing Company,
1983.
Trefethen, L. and D. Bau III. Numerical Linear Algebra. SIAM,
1997.
Tronarp, F., N. Bosch, and P. Hennig. “Fenrir: Physics-Enhanced
Regression for Initial Value Problems”. arXiv:2202.01287 [cs.LG]
(2022).
Tronarp, F., S. Särkkä, and P. Hennig. “Bayesian ODE solvers:
The maximum a posteriori estimate”. Statistics and Computing
31.3 (2021), pp. 1-18.
Tronarp, F. et al. “Probabilistic solutions to ordinary differential
equations as nonlinear Bayesian filtering: a new perspective”.
Statistics and Computing 29.6 (2019), pp. 1297-1315.
Turing, A. “Rounding-off errors in matrix processes”. Quar­
terly Journal of Mechanics and Applied Mathematics 1.1 (1948),
pp. 287-308.
Uhlenbeck, G. and L. Ornstein. “On the theory of the Brownian
motion”. Physical Review 36.5 (1930), p. 823.
van Loan, C. “The ubiquitous Kronecker product”. Journal of
Computational and Applied Mathematics 123 (2000), pp. 85-100.
Vijayakumar, S. and S. Schaal. “Locally Weighted Projection
Regression: Incremental Real Time Learning in High Dimen­
sional Space”. Proceedings of the Seventeenth International Con­
ference on Machine Learning, ICML. Morgan Kaufmann, 2000,
pp. 1079-1086.
Villemonteix, J., E. Vazquez, and E. Walter. “An informational
approach to the global optimization of expensive-to-evaluate
functions”. Journal of Global Optimization 44.4 (2009), pp. 509­
534.
Von Neumann, J. “Various techniques used in connection with
random digits”. Monte Carlo Method. Vol. 12. National Bureau
of Standards Applied Mathematics Series. 1951, pp. 36-38.
Wahba, G. Spline models for observational data. CBMS-NSF Re­
gional Conferences series in applied mathematics. SIAM,
1990.
Wang, J., J. Cockayne, and C. Oates. “On the Bayesian Solution
of Differential Equations”. Proceedings of the 38th International
Workshop on Bayesian Inference and Maximum Entropy Methods
in Science and Engineering (2018).
Wang, J. et al. “Parallel Bayesian Global Optimization of Expen­
sive Functions”. arXiv:1602.05149 [stat.ML] (2016).
Wang, J. et al. “Bayesian numerical methods for nonlinear partial
differential equations”. Statistics and Computing 31.55 (2021).
Wang, Z. and S. Jegelka. “Max-value Entropy Search for Efficient
Bayesian Optimization”. Proceedings of the 34th International
Conference on Machine Learning, ICML. Vol. 70. Proceedings of
Machine Learning Research. PMLR, 2017, pp. 3627-3635.
Warth, W. and J. Werner. “Effiziente Schrittweitenfunktionen
bei unrestringierten Optimierungsaufgaben”. Computing 19.1
(1977), pp. 59-72.
Weise, T. “Global optimization algorithms-theory and applica­
tion”. Self-published (2009), pp. 25-26.
Wendland, H. and C. Rieger. “Approximate Interpolation with
Applications to Selecting Smoothing Parameters”. Numerische
Mathematik 101.4 (2005), pp. 729-748.
Wenger, J. and P. Hennig. “Probabilistic Linear Solvers for Ma­
chine Learning”. Advances in Neural Information Processing
Systems, NeurIPS (2020).
Wenger, J. et al. “ProbNum: Probabilistic Numerics in Python”.
2021.
Werner, J. “Über die globale Konvergenz von Variable-Metrik-
Verfahren mit nicht-exakter Schrittweitenbestimmung”. Nu-
merische Mathematik 31.3 (1978), pp. 321-334.
Wiener, N. “Differential space”. Journal of Mathematical Physics 2
(1923), pp. 131-174.
Wilkinson, J. The Algebraic Eigenvalue Problem. Oxford University
Press, 1965.
Williams, C. K. I. and M. W. Seeger. “Using the Nyström Method
to Speed Up Kernel Machines”. Advances in Neural Information
Processing Systems, NeurIPS. MIT Press, 2000, pp. 682-688.
Wills, A. G. and T. B. Schön. “On the construction of probabilistic
Newton-type algorithms”. IEEE Conference on Decision and
Control (CDC). Vol. 56. 2017.
Winfield, D. H. “Function and functional optimization by in­
terpolation in data tables”. PhD thesis. Harvard University,
1970.
Wishart, J. “The generalised product moment distribution in
samples from a normal multivariate population”. Biometrika
(1928), pp. 32-52.
Wolfe, P. “Convergence conditions for ascent methods”. SIAM
Review (1969), pp. 226-235.
Wu, J. and P. I. Frazier. “The Parallel Knowledge Gradient
Method for Batch Bayesian Optimization”. Advances in Neural
Information Processing Systems, NeurIPS. 2016, pp. 3126-3134.
Wu, J. et al. “Bayesian Optimization with Gradients”. Advances in
Neural Information Processing Systems, NeurIPS. 2017, pp. 5267­
5278.
Zeiler, M. D. “ADADELTA: An Adaptive Learning Rate Method”.
arXiv:1212.5701 [cs.LG] (2012).
Index

A-stability, 322, 323
acquisition function, 254
Adam, 230
affine transformation, 23
agent, 3, 4, 11
aleatory uncertainty, 11, 71
analytic, 1
Arenstorf orbit, 336
Armijo condition, see line search
Arnoldi process, 135
atomic operation, 70
Automated Machine Learning, 276
average-case, 182
average-case analysis, 7

backpack package, see software, 219
Bayes’ theorem, 21
Bayesian, 8
Bayesian ODE filters and smoothers, see ODE filters and smoothers
Bayesian Optimisation, 251
Bayesian quadrature, 72
belief propagation, 43
BFGS, 236
bias, 11
bifurcation, 311
boundary value problem, 286, 339
Brownian motion, 50

calibration, 12
CARE, see Riccati equation
Cauchy-Schwarz inequality, 37
Chain graph, 42
chaos, 332
Chapman-Kolmogorov equation, 43, 45
Chebyshev polynomials, 103
Cholesky decomposition, 36, 138
code, see software
companion matrix, 53
conditional distribution, 21
conditional entropy, 272
conjugate gradients, 134, 137, 144, 145, 165, 166
  probabilistic, 165
conjugate prior, 55, 56
continuous time, 48, 296
continuous-time Riccati equation, see Riccati equation
convergence rate, 201
convex function, 199
covariance, 23
covariance function, see kernel
cubic splines, see splines
curse of dimensionality, 79

Dahlquist test equation, 323
DARE, see Riccati equation
data, 21
decision theory, 4
Dennis family, 236
detailed balance, 85
determinant lemma, 130
determinantal point process, 80
DFP, 237
Dirac delta, 27
discrete time, 42, 297
discrete-time algebraic Riccati equation, see Riccati equation
dynamic model, 45, 298

early stopping, 224
EKF0, EKF1, see ODE filters and smoothers
EKS0, EKS1, see ODE filters and smoothers
empirical risk minimisation, 37, 204
emukit package, see software
entropy, 23
epistemic uncertainty, 11, 70
equidistant grid, 92
ergodicity, 85
error analysis vs error estimation, 92
error function, 215
Euler’s method, 289
Eulerian integrals, 56
evidence, 99
expected improvement, 259
expensive evaluations, 256
exploration-exploitation trade-off, 247
exponential kernel, see Ornstein-Uhlenbeck process
exponentiated quadratic kernel, see Gaussian kernel

filter, 44
  Kalman, 46
  ODE, see ODE filters and smoothers
  optimal, 46
  particle, 309
forward-backward algorithm, see sum-product algorithm
Frobenius norm, see norm
function-space view, 28

Galerkin condition, 143
gamma distribution, 56
gamma function, 56
Gauss-Markov process, 41
Gauss-inverse-gamma, 56
Gauss-inverse-Wishart, 59
Gaussian
  elimination, 133
Gaussian distribution, 23
Gaussian kernel, 32, 40
Gaussian ODE filters and smoothers, see ODE filters and smoothers
Gaussian process, 30, 34
Gaussian quadrature, 102
Gaussian vs nonparametric, 314, 336
general linear model, see linear regression
generative model, 73
genetic algorithm, 200
Gittins indices, 255
GNU C library, 70
GPU, 1, 8
gradient descent, 201
Gram matrix, 36
graphical model, 43
Greenstadt’s method, 236

Hamiltonian matrix, 54
harmonic potential, 52
heavy ball, see momentum, 230
Hermite interpolation, 291
Hermite polynomials, 103
hierarchical inference, 55
hyperparameter, 55
Hölder continuity, 95

importance sampling, 71, 310
  sequential, 311
inference, 22
infill function, 254
information operator, 320
information theory, 13
information-based complexity, 7
informed prior, 74
initial value problem, see IVP
integrated Brownian motion, see integrated Wiener process
integrated Wiener process, 51, 299
interpolant, 52
inverse problems, 340
Itô integral, 49, 50
IVP, 286
  regularity, 287
  solver, see ODE solver

Jacobi polynomials, 103

Kalman filter, see filter
Kalman smoother, see smoother
kernel, 31
kernel herding, 76, 121
kernel mean embedding, 76
kernel ridge regression, 30, 37
knot, see node
knowledge vs assumptions, 95
Kronecker product, 127
  covariance, 152
  skew-symmetric, 157
  symmetric, 157
Krylov sequence, 144

Laguerre polynomials, 103
Lanczos process, 145
Laplace approximation, 55
learning curves, 277
Legendre polynomials, 29, 103
likelihood, 21
line search, 200, 205
  probabilistic, 211
linear differential equation, 48
linear Gaussian system, see LTI
linear regression, 27
linear spline, 88
linear splines, see splines
linear time-invariant system, see LTI
LTI, 44, 296
LU decomposition, 133

machine learning, 1, 348
MAP, see maximum a posteriori estimate
marginal, 21
marginal likelihood, 55
Markov
  chain, 42
  process, 41
  property, 42
Markov chain Monte Carlo, see MCMC
matrix exponential, 48
matrix inversion lemma, 129
matrix perturbation theory, 160
Matérn family, 52
maximum a posteriori estimate, 109, 305
MCMC, 70, 72, 108, 341
mean, 23
measurement matrix, 45
measurement model, 44, 45
Mercer kernel, see kernel
message passing, 42, 44
meta-criterion for Bayesian optimisation, 274
model, 73
modified Bessel function, 52
momentum, 230
multi-armed bandits, 247
multi-point expected improvement, 275
myopic Bayesian optimisation, 255

natural spline, 88
Nesterov acceleration, 230
neural network, 251
neuroscience, 338
nilpotent, 48
node, 102
nonparametric vs Gaussian, 314, 336
Nordsieck methods, 327
norm
  Euclidean, 128
  Frobenius, 128
  matrix, 128
normal distribution, see Gaussian distribution
numerical algorithm, 70
numerical stability, 131, 317, 322-324
Nyström approximation, 138

observation noise, 45
ODE, 285
  bifurcation, 311
  flow, 289
  higher-order, 286, 298
  inverse problem, 340
  Lorenz, 332
  periodic, 300
  regularity, 287
  sensitivity, 332
  stiff, 322
ODE filters and smoothers, 301, 314
  A-stability, 322
  bootstrap particle, 311
  convergence, 317, 318
    EKF0, 318
    global, 319, 321, 329
    IEKS, 321
    local, 318
    MAP, 320
  EK0, EK1, 305
  EKF0, EKF1, 304
  EKS0, EKS1, 305
  equivalences, 326
    Nordsieck, 327
    trapezoidal rule, 327
  error estimation, 313
  extended Kalman, 303-305
  Gaussian, 301-303, 336
  gradient estimator, 345
  Hessian estimator, 345
  IEKF, 306
  IEKS, 306
  implementation, 315
  inverse problems, 343
  iterated extended Kalman, 306
  Jacobian-matrix estimator, 344
  Kalman, 309
  MAP estimate, 305
  multi-step, 329
  Nordsieck form, 327
  numerical stability, 317, 322
    A-stability, 323
    linear algebra, 324
  particle, 309-311
  re-scaled variables, 324, 328
  square-root, 325
  steady state, 329
  step-size selection, 314
  terminology, 301, 305, 309
  uncertainty calibration
    asymptotic, 319
    scaling, 312, 313
ODE solver, 287, 289
  BVP, 339
  classical, 289, 291, 293
  geometric, 334
  multi-step, 290, 322, 329
  Nordsieck, 327
  probabilistic, 287, 293
  single-step, 290, 322
  trapezoidal, 327
ordinary differential equation, see ODE
Ornstein-Uhlenbeck process, 32, 52, 232, 299
orthogonal polynomials, 102

parametric
  model, 27
  regression, 27
partial differential equation, 285, 347
Pascal triangle, 50
perturbative solvers, 331, 336, 347
  additive noise, 332
  convergence, 333, 335
  inverse problems, 342
  random step sizes, 334
Picard-Lindelöf theorem, 286
Poincaré, Henri, 5
positive definite kernel, see kernel
positive definite matrix, 159
posterior, 21
Powell symmetric Broyden, 236
preconditioning, 145, 166
prediction-error decomposition, 60
prior, 3, 21, 298
  conjugate, 55, 56
ProbNum package, see software
process noise, 45
product rule, 21
proposal distribution, 310, 311
pseudoinverse, 171

quadrature, 69
  adaptive, 73
  Bayesian, 72
  Gaussian, 102
quasi-Newton method, 234
query selection function, 254

radial basis function, see Gaussian kernel
Radon-Nikodym derivative, 21
random, 11
random forest, 251
Rauch-Tung-Striebel smoother, see smoother
Rayleigh regression, 176
recommendation strategy for Bayesian optimisation, 255
regression in ODEs, 289
reinforcement learning, 247, 276
rejection sampling, 71
reproducing kernel Hilbert space, 36-38
Riccati equation, 53
Riemannian manifolds, 340
RKHS, see reproducing kernel Hilbert space
Runge-Kutta methods, 291

scattered-data interpolation, 320
Schur complement, 130
Schur’s product theorem, 33
secant method, see quasi-Newton method
semi-ring (of kernels), 33
sigma point, see node
Simpson’s rule, 95
singular value decomposition, 128
smoother, 44, 47
  Kalman, 47
  ODE, see ODE filters and smoothers
  particle, 311
Sobolev space, 52, 320
software
  backpack, 219
  emukit, 120, 278
  ProbNum, 121, 194, 315
splines
  cubic, 50, 212
  linear, 32, 35
  polynomial, 50
square exponential kernel, see Gaussian kernel
square-exponential, see Gaussian kernel
SR1, 236
state vector, 296
state-space model, 44, 297, 307, 340
statistical inference, 13
steady-state, 54
steepest descent, 201
step function, 29
stochastic differential equation, 48
stochastic filtering problem, 296
stochastic integral equation, 50
sufficient statistics, 57
sum rule, 21
sum-product algorithm, 43
surrogate function, 251
switch function, 29
symmetric matrix, skew-symmetric matrix, 156
symmetric positive (semi-) definite, 128
symplectic matrix, 54

time series, 42
training curves, 277
transition matrix, 45
trigonometric functions, 29
trust-region, 234
type-II likelihood, see marginal likelihood

uncertainty sampling, 83
uncertainty-awareness, 2, 291, 331, 342, 345
under-confidence, 95

Vandermonde matrix, 102
variable metric, see quasi-Newton method
vectorisation, 127

weak adaptivity, 85
weight space, 28
weight-space view, 28
white noise, 109
Wiener process, 32, 50
winner’s curse, 263
Wishart distribution, 59, 149
Wolfe conditions, see line search
Woodbury Identity, see matrix inversion lemma
worst-case error, 37
