Stochastic Simulation Book
O. Deniz Akyildiz
2023
Department of Mathematics
Imperial College London
PREFACE
This text is about stochastic simulation methods that underpin most of modern statistical inference,
machine learning, and engineering applications. The material of this course serves as an introduction to
wide-ranging areas and applications, from inference and estimation in a broad class of statistical models
to modern generative modelling applications.
The core of this course is about sampling from a probability distribution that may have an explicit,
implicit, complicated, or intractable form. This problem arises in many fields of science and engineering;
e.g., the probability distribution of interest can describe a posterior distribution in a statistical model or
an unknown data distribution (in a generative model) from which we are interested in generating more
samples. The sampling problem takes many forms, hence the solutions (sampling algorithms) are a broad
topic, and this course is an introduction to such methods. In order to develop tools to tackle such
problems, we will cover the basics of simulation, starting from something as basic as simulating an
independent random number from simple distributions (i.e. the pseudo-sampling methods that underlie
all stochastic simulations) to designing advanced sampling algorithms for more complicated models.
There will be one coursework, accounting for 25% of the credit. The upload date of this coursework
is as follows.
• Coursework (25%)
– Upload: 4 Dec. 2023 – Deadline: 11 Dec. 2023
• Final exam (75%)
The coursework will be the same for all students, and the exam will have an extra question for M4R students
(this will be clarified before the exam). The primary course material is the lecture notes and slides – however,
we will also assign additional (optional) readings or complementary chapters where necessary.
I hope that this course will strengthen your skills to conduct statistical research and make you well-versed
in the field of sampling, statistical inference, and generative modelling, no matter whether you want to
be an academic researcher or a practitioner!
O. Deniz Akyildiz
London, 2023
CONTENTS
Preface
1 Introduction
1.1 Introduction
1.1.1 Why is this course useful?
1.1.2 What will be covered in this course?
1.2 The Sampling Problem
1.2.1 Motivating example: Estimating π
1.3 Probability and Measure: Recap
1.3.1 Density notation
1.3.2 Basic Probability Definitions
1.3.3 Joint Probability
1.3.4 Conditional Probability
2.4.4 Sampling from Continuous Mixtures or Marginalisation
2.5 Sampling Multivariate Densities
2.5.1 Sampling a Multivariate Gaussian
5.4 Gibbs sampling
5.4.1 Metropolis-within-Gibbs
5.5 Langevin MCMC methods
5.5.1 Stochastic Gradient Langevin Dynamics
5.6 MCMC for Optimisation
5.6.1 Background
5.6.2 Simulated Annealing
5.6.3 Langevin MCMC for Optimisation
5.7 Monitoring and Postprocessing MCMC output
5.7.1 Trace plots
5.7.2 Autocorrelation plots
5.7.3 Effective sample size
5.7.4 Thinning the MCMC output
Bibliography
1 INTRODUCTION
We introduce in this section the general ideas of this course, notation, and setup. We will also
introduce the principles behind sampling and generative modelling, and try to answer the
existential question from the beginning: why is this course useful?
1.1 introduction
This course is about simulating random variables or, put differently, sampling from probability
distributions. This seemingly simple task arises in a large number of real-world scenarios
and is embedded in many critical applications. Furthermore, we look into generating samples from
dependent processes (e.g. stochastic processes) as well as sampling from intractable distributions.
• Statistical inference: Now imagine a random system that is not fully known. This could
be defined using a simulator or a generator as described above. If some variables
in this system are not known and we observe data from other variables, we can
condition on the data and infer the latent/hidden variables. Simulation and sampling
methods come in handy in these cases: by carefully formulating these estimation problems
as simulation/sampling problems, we can estimate the unknown variables. We will cover
this in detail in the course.
• Generative modelling: We are often interested in sampling from a completely unknown
probability measure p in the cases of generative models. Consider a given image dataset
{x1 , . . . , xn } and consider the problem of generating new images that mimic the properties
of the image dataset. This problem can be framed as a sampling procedure X ∼ p where
p is unknown but we have access to its samples {x1 , . . . , xn } in the form of a dataset.
Methods that address this challenge gave rise to very successful generative models, such as
DALLE-2 or ChatGPT. We will not cover generative models in this course, however. The
methods we introduce are the key foundations for working on these models.
In all of these settings, the task can be written as drawing samples
$$X^{(i)} \sim p, \quad i = 1, \ldots, N.$$
The main goal here is to draw these samples as accurately as possible, as often we may not have
access to an exact sampling scheme. More technical applications of sampling include
• Integration. The first reason sampling from a measure is interesting is that, even though one
may have access to the analytic expression of a density p, computing certain integrals with respect
to this density may be intractable. This is required, e.g., for estimating tail probabilities.
Sampling from a distribution provides a way to compute these integrals (called Monte Carlo
integration, which will be introduced later in this course). This motivation also holds for the
more general cases below.
• Intractable normalising constants. In many real-life modelling scenarios, we have
access to a function p̄ such that
$$p(x) = \frac{\bar{p}(x)}{Z},$$
where the normalising constant Z is unknown. In these cases, designing a sampler may be
non-trivial and this is a big research area.
• Generative models. Given a trained generative model, we still want to generate samples
as fast as possible.
We will first discuss exact sampling methods, which are only applicable to simple distributions.
However, these will be crucial for advanced sampling techniques – as all sampling methods rely
on being able to draw realistic samples from simple distributions such as the uniform or Gaussian
distribution. We will then describe cases where exact sampling is not possible and introduce
advanced sampling algorithms (such as Markov chain Monte Carlo (MCMC) methods) to tackle
this problem.
We can again easily imagine that if we had uniformly distributed points in this square, the ratio
of the number of points within the circle to the total number of points will be approximately
the same as the ratio of the areas. In other words, if we draw N points uniformly at random from
the square, and N_c of these points fall within the circle, then
$$\frac{N_c}{N} \approx \frac{\pi}{4}.$$
This is a very simple Monte Carlo estimate which comes with lots of guarantees; in particular, we
know that as N → ∞, the estimate 4N_c/N will converge to the true value of π.
To formalise this intuition, the trick at this stage is to phrase the question probabilistically.
Can we convert this problem into estimating the probability of a set? Let us say, for clarity of the
definition, this square is a set A = [−1, 1] × [−1, 1]. Let us define a 2-dimensional probability
distribution, uniform on this set, defined as P = Unif([−1, 1]^{⊗2}). This is a uniform distribution
on the square. Naturally, we have that P(A) = 1. Now we can see that, if the circle is defined as
another set
$$B = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \le 1\},$$
then the quantity of interest is the probability P(B) = π/4. The problem therefore reduces to writing
this probability as an expectation (integral) of the indicator function of B
and estimating this expectation (integral) using samples. We will see later that this simple Monte
Carlo procedure coincides with the intuitive solution: count the samples within the circle and
calculate the ratio to estimate π/4. The result of this estimation procedure is shown in Fig. 1.2, and a
minimal sketch of the estimator is given below.

Figure 1.2: Estimating π using the Monte Carlo method.
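As a minimal illustration of the above, here is a short numpy sketch of the estimator (the function name estimate_pi is ours, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_pi(n):
    # Draw n points uniformly on the square A = [-1, 1] x [-1, 1]
    points = rng.uniform(-1.0, 1.0, size=(n, 2))
    # Count how many fall inside the circle B = {x^2 + y^2 <= 1}
    n_c = np.sum(points[:, 0] ** 2 + points[:, 1] ** 2 <= 1.0)
    # N_c / N approximates pi / 4, so we return 4 * N_c / N
    return 4.0 * n_c / n

print(estimate_pi(100_000))  # close to 3.14159 for large n
```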
variable. This is done in order to stress the fact that the densities p_X and p_Y are different. However,
this becomes tedious when doing more complex modelling. For example, a simple case appears
in the Bayes update for conditionals. In the standard notation, this is written as
$$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{p_Y(y)}.$$
Now consider even more general cases involving three or more variables and various dependencies:
this notation quickly becomes infeasible.
Throughout these notes and the course, we will use p(x) generically as the probability density
of a random variable X. When we then write p(y), this will mean a different density of another
random variable (say Y). If we write the Bayes formula in these terms,
$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)},$$
which is much cleaner. Of course, here, p(x|y) and p(y|x) are different functions, just as p(x)
and p(y) are. We will however revert back to p_X and p_Y in cases where it is necessary, such as
transformations of random variables.
We denote the expectation of a random variable with E_p[X] (or E_p X when there is no
confusion). In general, we define the expectation of a function of a random variable X ∼ p as
$$p(\varphi) = \mathbb{E}_p[\varphi(X)] = \int \varphi(x)\, p(x)\, \mathrm{d}x.$$
Note that the notation p(φ) denotes this expectation; we will use it heavily in the sections on
Monte Carlo integration. Also note that we abuse notation by denoting measures and densities
with the same letter: for a probability measure p(dx), we denote its density with p(x). The
cumulative distribution function (CDF) will generally be denoted as F(x). In this section, we
review basic probability theory that will be required for the rest of the course.
X = {1, 2, 3, 4, 5, 6},
which could denote, for example, the possible outcomes of a die roll. Now we define the probability
mass function.
Definition 1.1 (Probability Mass Functions). When a random variable X is discrete, taking values
in a countable set X, its probability mass function can be defined as
$$p(x) = \mathbb{P}(X = x), \quad x \in \mathsf{X}.$$
We note that in the one-dimensional case, the probability mass function is typically represented
as a vector of probabilities when it comes to computations. Consider the following example.

x    P(X = x)
1    0.1
2    0.2
3    0.3
4    0.4

On a computer, this pmf can be stored as the vector
$$p = [0.1,\ 0.2,\ 0.3,\ 0.4]^\top,$$
indexed by the discrete values of the variable. Of course, one can also define a dictionary (Python data type) in
order to have more complicated states for the random variable.
Figure 1.3: A Gaussian example. Left: It can be seen that p(0) = 1.2615 while we know P(X = 0) = 0. This
shows that density values can be bigger than 1 as they are not probabilities without integration. Right: One
can only compute quantities like $\mathbb{P}(-0.1 \le X \le 0.1) = \int_{-0.1}^{0.1} p(x)\, \mathrm{d}x = 0.248$.
Next we define the probability density function in the case of continuous random variables.
Definition 1.2 (Measure and density). Assume X ⊂ R and X ∈ X (for simplicity). Given the
random variable X, we define the probability measure of X as P(A) = P(X ∈ A) for subsets A ⊆ X.
The reason P is called a measure is that it measures the probability of sets. We then have the probability
density function, which has the following relationship with the probability measure:
$$\mathbb{P}(x_1 \le X \le x_2) = \int_{x_1}^{x_2} p(x)\, \mathrm{d}x.$$
Note that this notion generalises to higher dimensions straightforwardly. In the estimating π
example, we basically tried to “measure” the probability of the 2D set that was a circle. Another
important note about the measure/density distinction is demonstrated in the following remark.
Remark 1.1. Note that this difference matters in the continuous case. Consider the probability of
a point x_0 ∈ R under a continuous probability distribution defined on R. Given its density p,
we can surely evaluate p(x_0) > 0, but this is not the probability of x_0 under the distribution. The
probability of a point x_0 is zero under a continuous distribution, which follows from Definition 1.2
by setting x_1 = x_2 = x_0. An example is demonstrated in Fig. 1.3.
1.3.3 joint probability
We now define the joint probability distribution of two random variables X and Y . We will also
focus on the discrete case first, and then move to the continuous case.
Definition 1.3 (Discrete Joint Probability Mass Function). Let X and Y be random variables and
X and Y be the sets they live on, where X and Y are at most countable. The joint probability mass
function of X and Y is
$$p(x, y) = \mathbb{P}(X = x, Y = y).$$
Example 1.2. Similar to the one dimensional case, we can now see the joint pmf p(x, y) as a
table of probabilities
Y =0 Y =1 Y =2 Y =3 pX (x)
X=0 1/6 1/6 0 0 2/6
X=1 1/6 0 1/6 0 2/6
X=2 0 0 1/6 0 1/6
X=3 0 0 0 1/6 1/6
pY (y) 2/6 1/6 2/6 1/6 1
Of course, on a computer we can represent this as a matrix of probabilities
$$\mathbf{P} = \begin{pmatrix} 1/6 & 1/6 & 0 & 0 \\ 1/6 & 0 & 1/6 & 0 \\ 0 & 0 & 1/6 & 0 \\ 0 & 0 & 0 & 1/6 \end{pmatrix}.$$
This allows us to perform simple computations for marginalisation simply as sums of rows or
columns. This is going to be a crucial tool when we study Markov models.
Let us finally define the probability density function p(x, y) for continuous variables.
Definition 1.4 (Continuous Joint Probability Density Function). Let X and Y be random variables
and X and Y be their ranges. We denote the joint probability measure as P(X ∈ A, Y ∈ B) and
the density function p(x, y) satisfies
Z Z
P(X ∈ A, Y ∈ B) = p(x, y) dx dy.
A B
The marginal probability densities from the joint density can be computed as
$$p(x) = \int_{\mathsf{Y}} p(x, y)\, \mathrm{d}y, \qquad p(y) = \int_{\mathsf{X}} p(x, y)\, \mathrm{d}x.$$
1.3.4 conditional probability
Definition 1.5 (Discrete Conditional Probability Mass Function). Let X and Y be random
variables and X and Y be their ranges. The conditional probability mass function of X given Y is
$$p(x|y) = \frac{p(x, y)}{p(y)}, \quad \text{whenever } p(y) > 0.$$
Example 1.3. Compute the conditional probability mass function p(y|x = 2) from the table of
probabilities of p(x, y) given below.
Solution. Let us say we would like to compute P(Y = i|X = 2) for i = 0, 1, 2, 3. We can do this
by simply dividing the joint probability mass function by the marginal probability mass function
of X. Consider the following table
p(x, y) X=0 X=1 X=2 X=3 pY (y)
Y =0 1/6 1/6 0 0 2/6
Y =1 1/6 0 1/6 0 2/6
Y =2 0 0 1/6 0 1/6
Y =3 0 0 0 1/6 1/6
pX (x) 2/6 1/6 2/6 1/6 1
where the X = 2 column contains the joint probabilities of Y and X = 2. We can write the conditional
probabilities as
$$\mathbb{P}(Y = 0|X = 2) = \frac{\mathbb{P}(Y = 0, X = 2)}{\mathbb{P}(X = 2)} = \frac{0}{2/6} = 0,$$
$$\mathbb{P}(Y = 1|X = 2) = \frac{\mathbb{P}(Y = 1, X = 2)}{\mathbb{P}(X = 2)} = \frac{1/6}{2/6} = 1/2,$$
$$\mathbb{P}(Y = 2|X = 2) = \frac{\mathbb{P}(Y = 2, X = 2)}{\mathbb{P}(X = 2)} = \frac{1/6}{2/6} = 1/2,$$
$$\mathbb{P}(Y = 3|X = 2) = \frac{\mathbb{P}(Y = 3, X = 2)}{\mathbb{P}(X = 2)} = \frac{0}{2/6} = 0.$$
As we can see, the conditional probability can also be represented as a vector, $[0,\ 1/2,\ 1/2,\ 0]^\top$.
One can compute conditional probability tables from the joint probability table.
Example 1.4. Derive the conditional probability table from the joint probability table given above.
We next define the continuous conditional density given p(x, y).
Definition 1.6 (Continuous Conditional Probability Density Function). Let X and Y be random
variables and X and Y be their ranges. The conditional probability density function of Y given X is
$$p(y|x) = \frac{p(x, y)}{p(x)}.$$
Similarly, the conditional probability density function of X given Y is
$$p(x|y) = \frac{p(x, y)}{p(y)}.$$
2 EXACT GENERATION OF RANDOM VARIATES
In this chapter, we will focus on exact sampling from certain classes of distributions, above all the uniform
distribution. This chapter aims at developing an understanding of the basis of all simulation
algorithms.
One of the central pillars of sampling algorithms is the ability to sample from the uniform
distribution. This may sound straightforward; however, it is surprisingly difficult to sample a real
uniform random number (Devroye, 1986). If the aim is to generate these numbers on a computer,
one has to “listen” to some randomness¹ (e.g. thermal noise in the circuits of the computer) and,
even then, these random numbers have no guarantee to follow a uniform distribution. Therefore,
much of the literature is devoted to pseudo-uniform random number generation.
Furthermore, generating random variables that follow popular distributions in statistics (such
as the Gaussian or exponential distribution) also requires pseudo-uniform random numbers, as we
will see in the next sections.
Why would we use pseudo-random numbers? They are (i) easier, quicker, and cheaper to generate,
and, more importantly, (ii) repeatable. This provides a crucial experimental advantage when it
comes to testing algorithms based on random numbers – we can re-run the same code (with the
same pseudo-random numbers), e.g., for debugging.
In what follows, we will describe different methods for pseudo-uniform random number
generators that can be used in practice.
¹ See https://www.random.org/ if you need real random numbers.
Figure 2.1: Top Left: Samples generated by an LCG with parameters m = 2048, a = 43, b = 0, x0 = 1. Top
Middle: The histogram of the samples, showing that it conforms to the uniform distribution. Top Right: However,
plotting (u_k, u_{k+1}) pairs shows that the samples are not random enough. Bottom Left: Samples generated by
an LCG with parameters m = 2^{32}, a = 1664525, b = 1013904223. Bottom Middle: The histogram of the
samples, showing that it conforms to the uniform distribution. Bottom Right: Plotting (u_k, u_{k+1}) pairs shows
no apparent structure in the samples.
A linear congruential generator (LCG) is defined by the recursion
$$x_{n+1} = (a x_n + b) \bmod m,$$
where x0 is the seed, m is the modulus of the recursion, b is the shift, and a is the multiplier. If
b = 0, then the LCG is called a multiplicative generator, and it is called a mixed generator when
b ≠ 0. We take m to be an integer and choose x0, a, b ∈ {0, . . . , m − 1}. Defined this way, the recursion
defines a class of generators and we have x_n ∈ {0, 1, . . . , m − 1}. The uniform numbers are then
generated as
$$u_n = \frac{x_n}{m} \in [0, 1) \quad \forall n.$$
The sequence (u_n)_{n≥0} is periodic with period T ≤ m (Martino et al., 2018). The goal is to choose
the parameters a and b so that the sequence has the largest possible period, ideally T = m
(full period). The choice of the modulus is determined by the precision, e.g., m ≈ 2^{32} for
single precision, and so on. A minimal sketch of such a generator is given below.
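The following short Python sketch implements the LCG recursion above; the parameter defaults are the ones used in the bottom row of Fig. 2.1, and the function name lcg is ours.

```python
def lcg(n, m=2**32, a=1664525, b=1013904223, x0=1):
    # x_{k+1} = (a * x_k + b) mod m, returned as u_k = x_k / m in [0, 1)
    us = []
    x = x0
    for _ in range(n):
        x = (a * x + b) % m
        us.append(x / m)
    return us

u_good = lcg(1000)                       # parameters of Fig. 2.1 (bottom row)
u_bad = lcg(1000, m=2048, a=43, b=0)     # the poor parameterisation (top row)
```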
There are known shortcomings of LCGs, in particular that they generate periodic sequences.
An illustration is given in Fig. 2.1. If care is not taken, this can cause significant problems in
applications. For a typical issue, we demonstrate the impact of poor random samples on other
random number generators. We will see in the following sections that, for example, drawing
Gaussian samples requires uniform random numbers. If we use the random samples generated
in the top row of Fig. 2.1, we will get bad samples from a “Gaussian distribution”; see the leftmost
two figures of Fig. 2.2. If, instead, we choose the better parameterisation m = 2^{32}, a = 1664525,
b = 1013904223, whose samples are shown in the bottom row of Fig. 2.1, then we get much better
samples, as can be seen from the rightmost two figures of Fig. 2.2.

Figure 2.2: Left two figures: Gaussian samples generated using the samples in the top row of Fig. 2.1. Right two
figures: Gaussian samples generated using the samples in the bottom row of Fig. 2.1 with the better parameterisation
m = 2^{32}, a = 1664525, b = 1013904223.
Unfortunately, LCGs fail to produce good samples in high dimensions. The standard method
implemented nowadays is the Mersenne–Twister algorithm, which is not an LCG. For the rest of
the course, we will use rng.uniform(0, 1) to draw uniform random numbers, where rng is a
properly initialised RNG (see the online companion), as a suboptimal implementation can impact
our simulation schemes.
The inversion technique starts from a uniform random number
$$U \sim \text{Unif}(0, 1),$$
which is then mapped through the inverse of the CDF so that the resulting values are distributed
according to the target probabilities. In other words, if we draw U ∼ Unif(0, 1) and invert it through
the CDF, we will choose x1, x2, . . . according to their probabilities (see Fig. 2.3).

Figure 2.3: The inversion technique. The black function is the probability mass function p whereas the red function
is the CDF F. One can see that drawing a uniform random number on the y-axis and following the inverse of the
CDF ensures that we draw the samples {1, . . . , 4} according to their probabilities.
This follows from a more general result called probability integral transform.
Theorem 2.1. Consider a random variable X with CDF F_X. Then the random variable $F_X^{-1}(U)$,
where U ∼ Unif(0, 1), has the same distribution as X.²
Sampling by computing $F_X^{-1}(U)$ is called the inversion method.
² Note that the above result is written for the case where $F_X^{-1}$ exists, i.e., the CDF is continuous and strictly
increasing. If this is not the case, one can define the generalised inverse function
$$F_X^{-}(u) = \min\{x : F_X(x) \ge u\},$$
and use it in place of $F_X^{-1}$.
Algorithm 1 Pseudocode for inverse transform sampling
1: Input: The number of samples n
2: for i = 1, . . . , n do
3:   Generate U_i ∼ Unif(0, 1)
4:   Set X_i = F_X^{-1}(U_i)
5: end for
Solution. In this case, the CDF is not continuous and the procedure is summarised in Fig. 2.3.
Let X ∼ p be our random variable to be sampled, taking values s_1, s_2, . . . Then the sampling procedure
described above becomes (with the generalised inverse):
• U_i ∼ Unif(0, 1)
• X_i = s_{k_i}, where $k_i = \min\{k : F_X(s_k) \ge U_i\}$.
This corresponds to something simple: sample U_i and find the first state s_k that gives F_X(s_k) ≥ U_i;
a sketch of this procedure is given below. Note that the Bernoulli distribution corresponds to a special
case of this with s_1 = 0, s_2 = 1 (see Fig. 2.3).
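Below is a minimal numpy sketch of this discrete inversion step (the function name sample_discrete is ours; the states and probabilities are taken from the earlier pmf table for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_discrete(states, probs, n):
    # Generalised inverse of the CDF: find the first state s_k with F(s_k) >= U
    cdf = np.cumsum(probs)
    u = rng.uniform(0.0, 1.0, size=n)
    idx = np.searchsorted(cdf, u)   # index of the first k with cdf[k] >= u
    return np.asarray(states)[idx]

samples = sample_discrete([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4], 10_000)
```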
Example 2.3. (Cauchy) Assume we want X ∼ Cauchy, where the probability density is given as
$$p_X(x) = \frac{1}{\pi (1 + x^2)}.$$
The CDF is $F_X(x) = \tfrac{1}{2} + \tfrac{1}{\pi}\arctan(x)$, with inverse $F_X^{-1}(u) = \tan\left(\pi\left(u - \tfrac{1}{2}\right)\right)$. The inversion sampler is therefore (see the sketch below):
• Generate U_i ∼ Unif(0, 1)
• Set $X_i = \tan\left(\pi\left(U_i - \tfrac{1}{2}\right)\right)$.
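A one-line numpy version of this Cauchy sampler (sample_cauchy is our name):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_cauchy(n):
    u = rng.uniform(0.0, 1.0, size=n)
    # F^{-1}(u) = tan(pi * (u - 1/2)) for the standard Cauchy
    return np.tan(np.pi * (u - 0.5))

x = sample_cauchy(10_000)
```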
Example 2.4. (Poisson) Assume we want X ∼ Pois(λ), where
$$\mathbb{P}(X = k) = \text{Pois}(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$
Describe the sampler using the inversion method.
• Sample U ∼ Unif(0, 1)
• Set X = k for the smallest k such that $\sum_{j=0}^{k} \text{Pois}(j; \lambda) \ge U$ (the generalised inverse of the CDF).
While this is a useful technique for sampling from many distributions, it is limited to the
cases where $F_X^{-1}$ is available, which is a very stringent condition. For example, consider the problem
of sampling a standard normal, i.e., X ∼ N(0, 1). We know that the CDF is
$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\, \mathrm{d}y.$$
We cannot find $F_X^{-1}$ in closed form. Fortunately, for certain special cases, we can use another
transformation to sample.
Designing g is the crucial aspect of this method and depends on the goal of the sampling method.
A crucial tool for understanding what happens under this kind of transformation is the
transformation-of-random-variables formula, described below.
Remark 2.1. The transformation of random variables formula is an important formula for us,
describing how probability densities transform when we transform random variables. In
other words, assume X ∼ p_X(x) and Y = g(X); then p_Y(y) is a density that is related
to the density of X. The exact formula depends on the dimension of the random variables. For the
one-dimensional case, the relationship is given by
$$p_Y(y) = p_X(g^{-1}(y)) \left| \frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} \right|. \tag{2.1}$$
This formula is simpler than it looks. One needs to explicitly find $g^{-1}$ first (this is the weakness
of the approach). Provided that, writing down $p_X(g^{-1}(y))$ is simple (just write down the density
of X, evaluated at $g^{-1}(y)$). The derivative is also often simple to compute, and so is its absolute value.
For multidimensional (say n-dimensional) random variables (we will see one 2D example
below), the formula is equally compact and simple; however, the computations might become more
involved. It is simply given as
$$p_Y(y) = p_X(g^{-1}(y))\, |\det J_{g^{-1}}(y)|. \tag{2.2}$$
While the first term on the r.h.s. is similar to the above, the last term now means the determinant of the
Jacobian. In this case, the Jacobian is given by
$$J_{g^{-1}} = \begin{pmatrix} \partial g_1^{-1}/\partial y_1 & \partial g_1^{-1}/\partial y_2 & \cdots & \partial g_1^{-1}/\partial y_n \\ \vdots & & \ddots & \vdots \\ \partial g_n^{-1}/\partial y_1 & \partial g_n^{-1}/\partial y_2 & \cdots & \partial g_n^{-1}/\partial y_n \end{pmatrix},$$
where $g^{-1} = (g_1^{-1}, \ldots, g_n^{-1})$ is a multivariate function. In this course, we will not need this
formula for more than 2D, and this case is exemplified in the examples.
Next, we consider the example where we develop the method to sample Gaussian random
variates. By the transformation formula (2.2),
$$p_{y_1, y_2}(y_1, y_2) = p_{x_1, x_2}(g^{-1}(y_1, y_2))\, |\det J_{g^{-1}}(y_1, y_2)|, \tag{2.3}$$
and
$$\frac{\sin x_2}{\cos x_2} = \frac{y_2}{y_1},$$
which leads to
$$x_2 = \arctan(y_2 / y_1).$$
Therefore, $g^{-1}: \mathbb{R}^2 \to \mathbb{R}^2$ is
$$g^{-1}(y_1, y_2) = (g_1^{-1}, g_2^{-1}) = \left(y_1^2 + y_2^2,\ \arctan(y_2 / y_1)\right).$$
Hence, the absolute value of the determinant is
$$|\det J_{g^{-1}}| = 2,$$
which concludes the derivation.
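As a sketch of the Gaussian sampler this example builds towards, here is the classical Box–Muller-type construction (assuming, as is standard, that x1 is exponential with mean 2 and x2 ∼ Unif(−π, π); the notes' exact parameterisation may differ, and box_muller is our name):

```python
import numpy as np

rng = np.random.default_rng(3)

def box_muller(n):
    # x1 ~ Exp with mean 2 (i.e. -2 log U), x2 ~ Unif(-pi, pi); then
    # y1 = sqrt(x1) cos(x2) and y2 = sqrt(x1) sin(x2) are independent N(0, 1).
    u = 1.0 - rng.random(n)                 # in (0, 1], avoids log(0)
    x1 = -2.0 * np.log(u)
    x2 = rng.uniform(-np.pi, np.pi, size=n)
    return np.sqrt(x1) * np.cos(x2), np.sqrt(x1) * np.sin(x2)

y1, y2 = box_muller(10_000)
```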
Example 2.6 (Uniform samples in the unit circle). Given
$$r \sim \text{Unif}(0, 1), \qquad \theta \sim \text{Unif}(-\pi, \pi),$$
describe a scheme to sample uniformly within the unit circle.
Solution. We will show now that, using the same formula derived in the previous proof, we can
describe a scheme to sample uniformly on a circle. We define
$$x_1 = \sqrt{r} \cos\theta, \qquad x_2 = \sqrt{r} \sin\theta.$$
By the transformation formula,
$$p_{x_1, x_2}(x_1, x_2) = p_{r, \theta}(g^{-1}(x_1, x_2))\, |\det J_{g^{-1}}(x_1, x_2)|.$$
We know that, since we use the same transformation as in Example 2.5, we have the Jacobian
$|\det J_{g^{-1}}| = 2$. We can then write
$$p_{x_1, x_2}(x_1, x_2) = 2\, \text{Unif}(x_1^2 + x_2^2; 0, 1)\, \text{Unif}(\arctan(x_2/x_1); -\pi, \pi).$$
If we pay attention to the first uniform distribution in the above formula, we see that it equals 1 when
$x_1^2 + x_2^2 < 1$. The second argument is an arctan, which takes values in $(-\pi/2, \pi/2)$, which
means we always have $\text{Unif}(\arctan(x_2/x_1); -\pi, \pi) = 1/2\pi$. This results in
$$p_{x_1, x_2}(x_1, x_2) = \frac{1}{\pi} \quad \text{for } x_1^2 + x_2^2 < 1,$$
and 0 otherwise, which is the uniform distribution within a circle. See Fig. 2.4 for some examples
(and alternatives discussed in the class); a short sketch is also given below.
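A minimal numpy sketch of this sampler (sample_disc is our name):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_disc(n):
    r = rng.uniform(0.0, 1.0, size=n)           # r ~ Unif(0, 1)
    theta = rng.uniform(-np.pi, np.pi, size=n)  # theta ~ Unif(-pi, pi)
    # Note the sqrt(r): without it the points concentrate near the origin.
    return np.sqrt(r) * np.cos(theta), np.sqrt(r) * np.sin(theta)

x1, x2 = sample_disc(5_000)
```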
Figure 2.4: On the left, one can see the samples with the correct scaling √r. Some other intuitive formulas result in
a non-uniform distribution.
Solution. This is a simple demonstration of the transformation of random variables formula (in
1 dimension). Let X ∼ N(x; 0, 1), where
$$\mathcal{N}(x; 0, 1) = p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \tag{2.4}$$
Define the transformation
$$g(x) = \sigma x + \mu.$$
This is intuitive as we first scale the random variable with σ to increase or decrease its variability
(variance) and then add some mean µ. The transformation formula in 1D is simpler:
$$p_Y(y) = p_X(g^{-1}(y)) \left| \frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} \right|, \tag{2.5}$$
where we have the absolute value of the derivative of the inverse function $g^{-1}(y)$. This is easy to
derive by solving for x starting from y = g(x) = σx + µ:
$$x = \frac{y - \mu}{\sigma} = g^{-1}(y).$$
The derivative is then given by
$$\frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} = \frac{1}{\sigma}.$$
Then, using Eq. (2.5), we obtain
$$p_Y(y) = p_X(g^{-1}(y)) \frac{1}{\sigma} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right),$$
as σ > 0, by using Eq. (2.4) and plugging in x = (y − µ)/σ. We can already recognise the
expression $p_Y(y) = \mathcal{N}(y; \mu, \sigma^2)$.
Theorem 2.2 (Fundamental Theorem of Simulation). (Martino et al., 2018, Theorem 2.2) Drawing
samples from a one-dimensional random variable X with a density $p(x) \propto \bar{p}(x)$ is equivalent to
sampling uniformly on the two-dimensional region defined by
$$\mathcal{A} = \{(x, y) \in \mathbb{R}^2 : 0 \le y \le \bar{p}(x)\}. \tag{2.6}$$
In other words, if $(x', y')$ is uniformly distributed on $\mathcal{A}$, then $x'$ is a sample from p(x).
Figure 2.5: On the left, you can see a mixture of Gaussians (we will cover mixture distributions later) and samples
uniformly distributed below the curve. Each black dot under the curve is an (x, y) pair, hence you could denote
those samples (X_i, Y_i). On the right, you can see the histogram of the x-marginal, which means only the X_i samples.
This is the empirical demonstration of Theorem 2.2.
Proof. Consider the pair (X, Y) uniformly distributed on the region $\mathcal{A}$. We denote their joint
density q(x, y) as
$$q(x, y) = \frac{1}{|\mathcal{A}|}, \quad \text{for } (x, y) \in \mathcal{A}. \tag{2.7}$$
Note that $|\mathcal{A}| = \int \bar{p}(x)\, \mathrm{d}x$, so the normalised density associated with $\bar{p}$ is
$$p(x) = \frac{\bar{p}(x)}{|\mathcal{A}|}.$$
We use the standard factorisation of the joint density, q(x, y) = q(y|x) q(x). Note that, since (X, Y)
is uniform on $\mathcal{A}$, for fixed x we have
$$q(y|x) = \frac{1}{\bar{p}(x)} \quad \text{for } (x, y) \in \mathcal{A}.$$
We therefore write
$$q(x, y) = q(y|x)\, q(x) = \frac{q(x)}{\bar{p}(x)} \quad \text{for } (x, y) \in \mathcal{A}. \tag{2.8}$$
We now consider (2.7) and (2.8), which are both valid on $(x, y) \in \mathcal{A}$. Combining them gives
$$q(x) = \frac{\bar{p}(x)}{|\mathcal{A}|} = p(x),$$
i.e., the x-marginal of the uniform distribution on $\mathcal{A}$ is exactly the normalised density p(x).
Figure 2.6: On the left, we plot the accepted samples (scattered) under the curve. On the right, we describe the
histogram of x-marginal of these samples (so we just keep the first dimension of the two dimensional array).
Theorem 2.2 suggests a quite intuitive sampling procedure: we can sample uniformly under the area of a
density (or even an unnormalised non-negative curve) and obtain samples from the (normalised)
probability density by keeping the samples on the x-axis (this is sometimes called the x-marginal).
One simple example that does this is described below.
Example 2.8 (Beta density under a box). Consider the Beta density
$$p(x) = \text{Beta}(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},$$
where Γ(n) = (n − 1)! for integers. Design a sampler that samples uniformly under the curve
p(x) = Beta(x; 2, 2).
Solution. In order to design a uniform sampler under the area of the Beta(2, 2), we can use its
special properties. For example, the Beta density is defined on [0, 1], which makes it easy to design
a uniform sampler. The simplest choice for this is to design a “box” over the density. In order to
design this box, we require the maximum of the density,
$$p^\star = \max_{x \in [0, 1]} \text{Beta}(x; 2, 2) = 1.5.$$
We are of course lucky to have this number, which could be difficult to find analytically in general.
In this case, we can design a box [0, 1] × [0, p^⋆] and draw uniform random samples in this box.
Let us suggestively denote these samples
$$X' \sim \text{Unif}(0, 1), \qquad U' \sim \text{Unif}(0, p^\star).$$
We can then check whether these samples are under the Beta density curve, which can be done
by checking
$$U' \le p(X'),$$
and accepting the sample if this condition holds. Fig. 2.6 shows the result of this procedure
together with the histogram; a short sketch is given below.
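A minimal numpy sketch of this box sampler (beta_box_sampler is our name; Beta(2, 2) has the closed form 6x(1 − x), so no special functions are needed):

```python
import numpy as np

rng = np.random.default_rng(5)

def beta_box_sampler(n_trials, p_star=1.5):
    # Propose uniformly in the box [0, 1] x [0, p_star] and keep the
    # x-coordinates of the points that fall under the Beta(2, 2) curve.
    x = rng.uniform(0.0, 1.0, size=n_trials)
    u = rng.uniform(0.0, p_star, size=n_trials)
    return x[u <= 6.0 * x * (1.0 - x)]   # Beta(x; 2, 2) = 6 x (1 - x)

samples = beta_box_sampler(10_000)   # roughly 2/3 of proposals are accepted
```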
2.3.1 rejection sampler
The box example is nice; however, it is not optimal: finding a box of that type might not be possible
in general and, for peaky densities, it could be horribly inefficient. We can, however, identify another
probability density we can sample from, denoted q(x), which may cover our target density much
better.
Rejection sampling is an algorithm that does just that: we identify a q(x) to cover our target
density p(x). Of course, because p(x) and q(x) are both probability densities, q(x) can never
entirely cover p(x). However, it will be sufficient for us if we can find an M such that
$$p(x) \le M q(x),$$
so that the scaled version of q(x) with M > 1 entirely covers p(x). Depending on the choice of
the proposal, the procedure can be much more efficient than simple boxing. Let us describe the
conceptual algorithm.
A proposed sample $X_i' \sim q$ is accepted with probability
$$a(X_i') = \frac{p(X_i')}{M q(X_i')} \le 1. \tag{2.9}$$
This algorithm might look simpler than what you would expect. Above, we mentioned drawing
samples uniformly under the curve; however, a simple look at the steps might not reveal that this
is precisely what this algorithm is doing. Let us look into this more carefully. The rejection sampler
first generates X' ∼ q(x); let us fix its value X' = x'.³ Then, in order to implement the accept step,
we generate U ∼ Unif(0, 1) and accept the sample if
$$U \le a(x') = \frac{p(x')}{M q(x')}.$$
This is what “accept with probability a(x')” means. A closer look reveals that we could also write this
(by rearranging the above inequality) as
$$M q(x')\, U \le p(x').$$
In other words, the l.h.s. of this inequality is a uniform random variable multiplied by M q(x'), so
we could define U' = M q(x') U, which satisfies
$$U' \sim \text{Unif}(0, M q(x')),$$
since U ∼ Unif(0, 1). Finally, you can see what the algorithm is doing behind the scenes:
• Sample X' ∼ q(x).
• Sample U' ∼ Unif(0, M q(X')).
• Accept if U' ≤ p(X').
This is exactly drawing an (X', U') pair and accepting the sample if it is under the curve of p.
By Theorem 2.2, this samples from the correct distribution!
³ This is usual in probability: capital letters are random variables, and it is better to fix their values after they are
sampled (they are then deterministic).
So far we have written a few different versions of the method. The implementation, however, is
done according to Algorithm 3.
Algorithm 4 Pseudocode for rejection sampling without normalising constants
1: Input: The number of iterations n and scaling factor M.
2: for i = 1, . . . , n do
3:   Generate X' ∼ q(x)
4:   U ∼ Unif(0, 1)
5:   if U ≤ p̄(X')/(M q(X')) then
6:     Accept X' (record the sample with the other accepted samples)
7:   end if
8: end for
9: Return accepted samples
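A short, generic numpy sketch of Algorithm 4 (rejection_sampler is our name; the target, proposal sampler, proposal density, and M are passed in by the user):

```python
import numpy as np

rng = np.random.default_rng(6)

def rejection_sampler(p_bar, q_sample, q_pdf, M, n_trials):
    # Accept X' ~ q whenever U <= p_bar(X') / (M * q(X'))
    x = q_sample(n_trials)
    u = rng.uniform(0.0, 1.0, size=n_trials)
    return x[u <= p_bar(x) / (M * q_pdf(x))]

# Example: Beta(2, 2) target with a uniform (box) proposal on [0, 1], M = 1.5
samples = rejection_sampler(
    p_bar=lambda x: 6.0 * x * (1.0 - x),
    q_sample=lambda n: rng.uniform(0.0, 1.0, size=n),
    q_pdf=lambda x: np.ones_like(x),
    M=1.5,
    n_trials=10_000,
)
```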
In this algorithm, a proposed sample is accepted whenever
$$U \le \frac{p(X')}{M q(X')},$$
where X' ∼ q(x). Let us denote the probability of acceptance (the acceptance rate) as $\hat{a}$. This
quantity is formalised below.
Proposition 2.1. When the target density p(x) is normalised and M is prechosen, the acceptance
rate is given by
$$\hat{a} = \frac{1}{M},$$
where M > 1 in order to satisfy the requirement that Mq covers p. For an unnormalised target density
$\bar{p}(x)$ with normalising constant $Z = \int \bar{p}(x)\, \mathrm{d}x$, the acceptance rate is given as
$$\hat{a} = \frac{Z}{M}.$$
Proof. We will prove that
$$\hat{a} = \mathbb{E}[a(X')] = \frac{1}{M} \tag{2.10}$$
in the normalised case. Similarly, we will also prove
$$\hat{a} = \mathbb{E}[a(X')] = \frac{Z}{M} \tag{2.11}$$
for the unnormalised case, where we use $\bar{p}(x)$ instead of p(x).
For the first fact, we can prove (2.10) by noting that
$$\hat{a} = \mathbb{P}(\text{accept}) = \mathbb{P}\left(U \le \frac{p(X')}{M q(X')}\right),$$
where the randomness here is w.r.t. U and X' jointly. We know, however, that for a given X' = x'
we accept with probability a(x'); averaging over X' gives
$$\mathbb{E}[a(X')] = \int a(x')\, q(x')\, \mathrm{d}x' = \int \frac{p(x')}{M q(x')} q(x')\, \mathrm{d}x' = \frac{1}{M} \int p(x')\, \mathrm{d}x' = \frac{1}{M}.$$
For the unnormalised case, we can prove (2.11) with a similar argument:
$$\hat{a} = \mathbb{E}[a(X')] = \int a(x')\, q(x')\, \mathrm{d}x' = \int \frac{\bar{p}(x')}{M q(x')} q(x')\, \mathrm{d}x' = \int \frac{Z\, p(x')}{M q(x')} q(x')\, \mathrm{d}x' = \frac{Z}{M} \int p(x')\, \mathrm{d}x' = \frac{Z}{M}.$$
Remark 2.2. Note the difference between the two quantities a and $\hat{a}$: while a(x) is the acceptance
ratio p(x)/(M q(x)) (or its unnormalised version), $\hat{a}$ is the expectation of this quantity, called the
acceptance rate.
It can be seen that a smaller M is theoretically desirable, as it means a higher acceptance
rate. Consider the following example.
Figure 2.7: A better proposal for p(x) = Beta(2, 2)
Example 2.9 (Beta(2, 2) density). We can go back to our Beta(2, 2) example from Example 2.8.
Instead of the box, we can now choose a Gaussian proposal q(x) (see Fig. 2.7) and M = 1.3 (this is
optimised visually by plotting – we will compute these quantities explicitly below). This results in
the graph shown in Fig. 2.7. Compare the acceptance rate of this algorithm with the box example
and demonstrate this numerically.
Solution. We can see that, in the box example, we chose $M_{\text{box}} = p^\star = 1.5$. By visual
inspection, for the Gaussian case we chose $M_{\text{Gauss}} = 1.3$. We can now compute the acceptance
rate for both cases. For the box example, we have
$$\hat{a}_{\text{box}} = \frac{1}{M_{\text{box}}} = \frac{1}{1.5} \approx 0.67,$$
and for the Gaussian case, we have
$$\hat{a}_{\text{Gauss}} = \frac{1}{M_{\text{Gauss}}} = \frac{1}{1.3} \approx 0.77.$$
These are the acceptance rates we should observe (up to random fluctuations) in numerical simulations.
Example 2.10 (Truncated Densities). Describe a rejection sampler for a truncated Gaussian target.
Solution. Truncated densities arise in a number of applications where we may want to model
something using a probability density p(x) we are familiar with, but where the variable X has strong
constraints (e.g. positivity or boundedness). For example, an age distribution could be restricted this
way with hard constraints. Imagine a Gaussian density N(x; 0, 1) and suppose we are interested in
sampling this density restricted to [−a, a]. We can write this truncated normal density as
$$p(x) = \frac{\mathcal{N}(x; 0, 1)\, 1_{[-a, a]}(x)}{Z}, \qquad Z = \int_{-a}^{a} \mathcal{N}(y; 0, 1)\, \mathrm{d}y. \tag{2.12}$$
Here are a few important things about this equation: $1_A(x)$ denotes the indicator function
$$1_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases}$$
Note that now we have access to our density evaluation in an unnormalised way: we can evaluate
$\bar{p}(x)$, which equals N(x; 0, 1) if −a ≤ x ≤ a and 0 otherwise. Rejection sampling is perfectly
suited to this task. Note here that we can choose
$$q(x) = \mathcal{N}(x; 0, 1)$$
anyway, and we have $\bar{p}(x) \le q(x)$ (i.e. we can take M = 1). Note a few interesting things for
this case. First of all, since p(x) is zero outside [−a, a] and $\bar{p}(x)/(M q(x)) = 1$ if x ∈ [−a, a], we
can simply reject any sample that is out of bounds and accept it if it is in the bounds (without
needing to sample U at all). Secondly, note that this is one of the rare cases where we have Z < 1.
We can also very intuitively calculate the acceptance rate:
$$\hat{a} = \frac{Z}{M} = Z,$$
where $Z = \int_{-a}^{a} \mathcal{N}(y; 0, 1)\, \mathrm{d}y$ as given in the density (2.12). It is very intuitive that the
acceptance rate is this integral: we literally accept a Gaussian sample from q(x) = N(x; 0, 1) if it
falls into [−a, a], and the probability of a sample falling into this interval is given by the integral
$\int_{-a}^{a} \mathcal{N}(y; 0, 1)\, \mathrm{d}y = Z$. A short sketch of this sampler is given below.
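A minimal numpy sketch of this truncated-Gaussian rejection sampler (truncated_normal is our name; a = 1 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def truncated_normal(a, n_trials):
    # q(x) = N(0, 1), M = 1: accept any proposal that falls inside [-a, a]
    x = rng.standard_normal(n_trials)
    return x[np.abs(x) <= a]

samples = truncated_normal(1.0, 10_000)
# The empirical acceptance rate is close to Z = P(-1 <= X <= 1) ~ 0.683
print(len(samples) / 10_000)
```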
choosing M
In the above example, we have seen that the choice of M is crucial for the acceptance rate. It is easy
to see that we should choose M such that M q(x) ≥ p(x) for all x. To choose the smallest such M,
we should find
$$M^\star = \sup_x \frac{p(x)}{q(x)}.$$
This ensures that $M^\star q(x) \ge p(x)$ for all x, and it is the smallest M that covers p(x). Needless
to say, for the unnormalised case, we just replace p with $\bar{p}$ in the above formula.
Consider, for example, the unnormalised Gaussian target $\bar{p}(x) = e^{-x^2/2}$ with a Cauchy proposal
$q(x) = \frac{1}{\pi(1 + x^2)}$, and compute
$$M = \sup_x \frac{\bar{p}(x)}{q(x)},$$
as usual. How? For this, let $R(x) = \bar{p}(x)/q(x)$ and note that we will first need
$$\log R(x) = \log \frac{\bar{p}(x)}{q(x)} = -\frac{x^2}{2} + \log(1 + x^2) + \log \pi,$$
and find the roots by taking the derivative and setting it to zero:
$$\frac{\mathrm{d}}{\mathrm{d}x} \log R(x) = -x + \frac{2x}{1 + x^2} = 0,$$
which gives
$$x = 0, \pm 1,$$
so we have three roots to decide between. Which one is the maximum? To answer this, we
need to check second derivatives. We compute the second derivative
$$\frac{\mathrm{d}^2}{\mathrm{d}x^2} \log \frac{\bar{p}(x)}{q(x)} = -1 + \frac{2(1 - x^2)}{(1 + x^2)^2}.$$
• When x = 0, the second derivative is positive, which means x = 0 is a minimum.
• At x = ±1, the second derivative is negative, so the maximisers are $x^\star = \pm 1$.
So we have
$$M = \frac{\bar{p}(1)}{q(1)} = 2\pi e^{-1/2}.$$
When the proposal comes from a parametric family $q_\theta(x)$, we look for $M_\theta$ such that
$$p(x) \le M_\theta q_\theta(x).$$
To maximise the acceptance rate $1/M_\theta$, we choose
$$\theta^\star = \operatorname*{argmin}_\theta \log M_\theta,$$
as maximising $\log(1/M_\theta) = -\log M_\theta$ is the same as minimising $\log M_\theta$.
Example 2.12. (A numerical example from Yıldırım (2017)) Say we are interested in sampling
$$X \sim \text{Gamma}(\alpha, 1),$$
with density⁵
$$p(x) = \frac{x^{\alpha - 1} e^{-x}}{\Gamma(\alpha)}, \quad x > 0.$$
⁵ We denote ln with log, so all logarithms are w.r.t. the natural base.
Figure 2.8: Two rejection sampling procedures for Example 2.12 with λ = 0.001 and optimal λ = 1/α (as derived
in the example) for n = 10000.
As the proposal, we use an exponential density
$$q_\lambda(x) = \lambda e^{-\lambda x}, \quad x > 0,$$
with 0 < λ < 1 (for λ > 1, the ratio $p/q_\lambda$ is unbounded). Derive the optimal rejection sampler
for this problem.
Solution. We should ensure that $p(x) \le M_\lambda q_\lambda(x)$; therefore, the standard choice for $M_\lambda$ is to
compute
$$M_\lambda = \sup_x \frac{p(x)}{q_\lambda(x)}.$$
We know that
$$\frac{p(x)}{q_\lambda(x)} = \frac{x^{\alpha - 1} e^{(\lambda - 1)x}}{\lambda \Gamma(\alpha)}.$$
In order to compute $M_\lambda$ in practice, one needs to first find
$$x^\star = \operatorname*{argmax}_x \frac{p(x)}{q_\lambda(x)},$$
and then identify $M_\lambda = p(x^\star)/q_\lambda(x^\star)$ for fixed λ. Denote $R_\lambda(x) = p(x)/q_\lambda(x)$. To find
such an $x^\star$, we can compute
$$x^\star = \operatorname*{argmax}_x \; \log p(x) - \log q_\lambda(x),$$
as log is monotonic. We can then compute
$$\log R_\lambda(x) = \log p(x) - \log q_\lambda(x) = (\alpha - 1)\log x + (\lambda - 1)x - \log\lambda - \log\Gamma(\alpha).$$
We then take the derivative of this,
$$\frac{\mathrm{d} \log R_\lambda(x)}{\mathrm{d}x} = \frac{\alpha - 1}{x} + \lambda - 1,$$
and set it to zero, which leads to
$$x^\star = \frac{\alpha - 1}{1 - \lambda}.$$
Placing $x = x^\star$ in the ratio $p(x)/q_\lambda(x)$, we obtain
$$M_\lambda = \frac{\left(\frac{\alpha - 1}{1 - \lambda}\right)^{\alpha - 1} e^{-(\alpha - 1)}}{\lambda \Gamma(\alpha)}.$$
This leads to the acceptance probability
$$\frac{p(x)}{M_\lambda q_\lambda(x)} = \left(\frac{x(1 - \lambda)}{\alpha - 1}\right)^{\alpha - 1} e^{(\lambda - 1)x + \alpha - 1}.$$
Now we have to minimise $M_\lambda$ with respect to λ so that $\hat{a} = 1/M_\lambda$ is maximised. It is
easy to show (exercise) that $M_\lambda$ is minimised by
$$\lambda^\star = \frac{1}{\alpha}.$$
Plugging this in, we compute
$$M_{\lambda^\star} = \frac{\alpha^\alpha e^{-(\alpha - 1)}}{\Gamma(\alpha)}.$$
So we have designed our optimised rejection sampler. In order to sample from Gamma(α, 1), we perform:
• Sample X' ∼ Exp(1/α) and U ∼ Unif(0, 1).
• If
$$U \le (X'/\alpha)^{\alpha - 1} e^{(1/\alpha - 1)X' + \alpha - 1},$$
accept X'; otherwise start again.
We can see the results of this algorithm in Fig. 2.8; a sketch is given below.
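A minimal numpy sketch of this optimised sampler (gamma_rejection is our name; α = 2.5 is an arbitrary illustrative value, and the derivation above requires α > 1):

```python
import numpy as np

rng = np.random.default_rng(8)

def gamma_rejection(alpha, n_trials):
    # Optimal exponential proposal q(x) = lam * exp(-lam * x) with lam = 1/alpha
    lam = 1.0 / alpha
    x = rng.exponential(scale=1.0 / lam, size=n_trials)   # scale = mean = alpha
    u = rng.uniform(0.0, 1.0, size=n_trials)
    # Acceptance probability (x/alpha)^(alpha-1) * exp((1/alpha - 1) x + alpha - 1)
    accept = u <= (x / alpha) ** (alpha - 1.0) * np.exp((lam - 1.0) * x + alpha - 1.0)
    return x[accept]

samples = gamma_rejection(alpha=2.5, n_trials=100_000)
```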
Figure 2.9: The density of a mixture of three Gaussians, $p(x) = \sum_{k=1}^{3} w_k \mathcal{N}(x; \mu_k, \sigma_k^2)$, with µ1 = −2, µ2 =
0, µ3 = 4, σ1 = 0.5, σ2 = 1, σ3 = 0.5, w1 = 0.2, w2 = 0.6, w3 = 0.2.
2.4 composition
When the probability density p(x) can be expressed as a composition of operations, we can still
sample from such densities straightforwardly, albeit it may look complex at first glance. In this
section, we focus on mixture densities, i.e., densities that can be written as a weighted mixture
of two or more probability densities. These objects are used to model subpopulations in a statistical
population, experimental error (e.g. localised in different regions), and heterogeneous
populations. We will start from a discrete mixture and then discuss the continuous case.
Consider a density of the form
$$p(x) = w_1 q_1(x) + w_2 q_2(x),$$
where w1 + w2 = 1 and q1 and q2 are probability densities. It is straightforward to verify that
p(x) is also a density:
$$\int p(x)\, \mathrm{d}x = w_1 \int q_1(x)\, \mathrm{d}x + w_2 \int q_2(x)\, \mathrm{d}x = w_1 + w_2 = 1.$$
An example can be seen in Fig. 2.9. We can generalise this idea and define a general mixture
distribution
$$p(x) = \sum_{k=1}^{K} w_k q_k(x),$$
with K components. Sampling from such distributions is extremely easy with the techniques we
know. We first sample from the probability mass function defined by the weights, p(k) = w_k with
$\sum_{k=1}^{K} p(k) = 1$ (using inversion, as we learned). This gives us an index k ∼ p(k); we then sample
from the associated density X' ∼ q_k(x), which gives us a sample from the mixture. For example,
sampling a mixture of Gaussians is easy: sample k ∼ p(k) from the PMF consisting of the weights w_k,
then sample from the selected Gaussian, as in the sketch below.
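A minimal numpy sketch of mixture sampling (sample_mixture is our name; the parameters are those of the three-component mixture in Fig. 2.9):

```python
import numpy as np

rng = np.random.default_rng(9)

def sample_mixture(weights, means, stds, n):
    # First sample the component index k ~ p(k) = w_k, then X ~ N(mu_k, sigma_k^2)
    k = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=np.asarray(means)[k], scale=np.asarray(stds)[k])

x = sample_mixture([0.2, 0.6, 0.2], [-2.0, 0.0, 4.0], [0.5, 1.0, 0.5], 10_000)
```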
where the mean (parameter) of the Gaussian is denoted within the density as a conditioning variable.
This notation is useful if one assumes x is also random (as we will see later). However, for fixed x,
the sampling is business as usual.
2.4.3 sampling from joint distributions
Sampling from a joint distribution p(x, y) sounds straightforward, but it might still not be obvious.
Assume we would like to draw
$$X, Y \sim p(x, y), \tag{2.13}$$
e.g., a two-dimensional sample from a 2D Gaussian. It is often the case that the standard factorisation
of joint densities,
$$p(x, y) = p(y|x)\, p(x),$$
suggests the sampling scheme
$$X \sim p(x),$$
$$Y | X = x \sim p(y|x).$$
Note the notation, which implies that things should be done in this order: once X is sampled,
it is fixed to X = x; after that, Y is sampled conditioned on that specific x sample.
In particular, the idea can be generalised to n variables if one knows the full conditionals.
For example, consider a joint distribution p(x1, . . . , xn); then any joint distribution of n variables
satisfies the following factorised sampling scheme:
$$X_1 \sim p(x_1)$$
$$X_2 | X_1 = x_1 \sim p(x_2 | x_1)$$
$$X_3 | X_1 = x_1, X_2 = x_2 \sim p(x_3 | x_2, x_1)$$
$$\vdots$$
$$X_n | X_1 = x_1, X_2 = x_2, \ldots, X_{n-1} = x_{n-1} \sim p(x_n | x_1, \ldots, x_{n-1}).$$
Of course, the difficulty with this is that it is often impossible to know the conditional distributions
described above.
Remark 2.3. This idea can be generalised greatly; in fact, it is often the core of complex
simulations. The core idea of probabilistic modelling is to factorise some complex joint distribution
p(x1, . . . , xn) according to the (conditional independence) assumptions of the model. Simulation
methods can then be used to sample these variables in the order that is assumed in the
model and generate synthetic data.
2.4.4 sampling from continuous mixtures or marginalisation
It is a common case that a density can be written as an integral instead of a sum (as in the discrete
mixture case). Consider the fact that
$$p(y) = \int p(x, y)\, \mathrm{d}x,$$
for any joint density. This operation is called marginalisation, and it is often of interest to compute
marginal densities (and, of course, to sample from them).
For example, using the formula p(x, y) = p(y|x)p(x) and given a conditional density p(y|x)
and a marginal p(x), we can derive
$$p(y) = \int p(x, y)\, \mathrm{d}x = \int p(y|x)\, p(x)\, \mathrm{d}x.$$
Surprisingly enough, sampling Y is pretty straightforward: sample from the joint p(x, y)
using the method above (i.e. X ∼ p(x) and Y | X = x ∼ p(y|x)), then just keep the Y samples.
They will be distributed according to p(y)! Let us see an example.
Example 2.13. Consider a Gaussian prior on the mean,
$$p(x) = \mathcal{N}(x; \mu, \sigma^2),$$
and the conditional density
$$p(y|x) = \mathcal{N}(y; x, 1).$$
To sample from the marginal p(y), we proceed as follows:
• Sample X ∼ N(x; µ, σ²),
• Sample Y | X = x ∼ N(y; x, 1),
and keep the Y samples.
Figure 2.10: The data simulated from (2.15)–(2.16) using a = 0.5 and b = 0.5 with three different values for σ 2 .
As can be seen from the figures, the generated (x, y) pairs exhibit a clear linear relationship (as intended) with
variance changing depending on our modelling choice.
Example 2.14 (Linear Model). Linear models are of utmost importance in many fields of science.
Describe a method to simulate (x, y) pairs that have a linear relationship.
Solution. We know that we can sample x, y ∼ p(x, y) by sampling x ∼ p(x) and y|x ∼ p(y|x),
as discussed in the previous section. We will now use this for a linear example.
To start intuitively, a typical linear relationship is described as
y = ax + b, (2.14)
which describes a line where a is the slope and b is the intercept. In order to obtain a probabilistic
model and generate data, we have to simulate both x and y variables. Since, from the equation, it
is clear that y is generated given x, we should start from defining x. Now this depends on the
application. For example, x can be a variable that may be uniform or a Gaussian. We denote
its density as p(x). The typical task is also to formulate p(y|x). The linear equation suggests a
deterministic relationship, however, real data often contains noise. To generate realistic data, we
will instead assume
y = ax + b + n
where n ∼ N(0, σ²) is noise (often with small σ²). Note that, given that the noise is zero mean and
ax + b is a deterministic number (given x), we can write our full model as
$$p(x) = \text{Unif}(x; -10, 10), \tag{2.15}$$
$$p(y|x) = \mathcal{N}(y; ax + b, \sigma^2), \tag{2.16}$$
where we chose our p(x) distribution to be uniform on [−10, 10]. As a result, we have a full
model to simulate variables with a linear relationship:
$$X_i \sim p(x),$$
$$Y_i | X_i = x_i \sim p(y|x_i),$$
where p(x) could be a uniform, Gaussian, truncated Gaussian, etc., depending on the nature of
the modelled variable. The results of this generation can be seen in the scatter plot in Fig. 2.10;
a short simulation sketch is given below.
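A minimal numpy sketch of the data-generation scheme (2.15)–(2.16) (simulate_linear is our name; a = b = 0.5 follow the Fig. 2.10 caption, while the σ² value here is one arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(10)

def simulate_linear(n, a=0.5, b=0.5, sigma2=1.0):
    # X_i ~ Unif(-10, 10), then Y_i | X_i = x_i ~ N(a x_i + b, sigma2)
    x = rng.uniform(-10.0, 10.0, size=n)
    y = a * x + b + np.sqrt(sigma2) * rng.standard_normal(n)
    return x, y

x, y = simulate_linear(200, sigma2=0.5)
```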
2.5 sampling multivariate densities
2.5.1 sampling a multivariate gaussian
To sample from a multivariate Gaussian Y ∼ N(µ, Σ), we can draw a standard normal vector
X ∼ N(0, I_d) and set
$$Y = \Sigma^{1/2} X + \mu.$$
The computation of Σ^{1/2} is done using a Cholesky decomposition⁶. The algorithm is provided in
Algorithm 6.
Algorithm 6 Sampling Multivariate Gaussian
1: Input: The number of samples n.
2: for i = 1, . . . , n do
3:   Compute L such that Σ = L L^⊤ (Cholesky decomposition).
4:   Draw d univariate independent normals v_k ∼ N(0, 1) to form the vector v = [v_1, . . . , v_d]^⊤.
5:   Generate x_i = µ + L v.
6: end for
⁶ You do not need to know how to implement or compute this; it is perfectly fine to use
numpy.linalg.cholesky.
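A minimal numpy sketch of Algorithm 6 (sample_mvn is our name; the mean and covariance below are arbitrary illustrative values, and the Cholesky factor is computed once rather than inside the loop):

```python
import numpy as np

rng = np.random.default_rng(11)

def sample_mvn(mu, Sigma, n):
    # Sigma = L L^T (Cholesky); each sample is x_i = mu + L v with v ~ N(0, I_d)
    L = np.linalg.cholesky(Sigma)
    v = rng.standard_normal(size=(n, len(mu)))
    return mu + v @ L.T

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = sample_mvn(mu, Sigma, 1_000)
```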
3 PROBABILISTIC MODELLING AND INFERENCE
In this chapter, we will cover probabilistic modelling in more detail and then talk about inference.
We will also review probability basics and a large range of applications the Bayesian viewpoint
enables.
3.1 introduction
In the previous chapter, we have seen how to generate data from a probabilistic model. Despite
we have only simulated from a linear model as an example, the idea is general. We will see more
about simulating models in other parts of the course. We have seen that
Xi ∼ p(x), (3.1)
Yi |Xi = xi ∼ p(y|xi ), (3.2)
generates the data according to the model p(x, y) = p(y|x)p(x). It is important to stress that this
can describe a very general situation: x variable can be multivariate (and even be time dependent),
and y can describe any other process. We will see, though, that in Bayesian modelling (I use it
simultaneously with probabilistic modelling), x generally denotes the latent (hidden) states or
parameters of a model (or both) . The variable y typically denotes the observed data. So seeing the
model (3.1) as a generative model, simulating from it can be seen as a way of generating synthetic
data1 .
Definition 3.1 (Bayes Theorem). Let X and Y be random variables with associated probability
density functions p(x) and p(y), respectively. The Bayes rule is given by
$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}. \tag{3.3}$$
Note that the formula holds for continuous random variables as well as discrete random
variables. Its importance comes from the fact that it provides us a natural way to incorporate or
synthesise data into a probabilistic model. In this interpretation, we have three key concepts.
• Prior: In the formula (3.3), p(x) is called the prior probability of X. Here X can be
interpreted as a parameter of p(y|x) or a hidden (unobserved) variable. The probability
distribution p(x) encodes our prior knowledge about this variable we cannot observe
directly. This could be simple constraints, a distribution dictated by a real application
(e.g. a physical variable can be only positive). In time series applications, p(x) can be the
distribution over an entire time series, it can even encode physical laws.
• Likelihood: p(y|x) is called the likelihood of Y given X. This is the probability model
of the process of observation – in other words, it describes how the underlying parameter
or hidden variable is observed. For example, if Y is the number of observed cases of a
disease in a population, then p(y|x) is the probability of observing y cases given that the
true number of cases is x.
• Posterior: p(x|y) is called the posterior distribution of X given Y = y. This is the updated
probability distribution after we see the observation y and update our prior knowledge p(x)
into p(x|y).
Remark 3.1. Note the difference between simulation and inference. We can write down our
model (sometimes we will call the forward model) p(x) and p(y|x) to describe the data generation
process and can generate toy (synthetic) data with it as we have seen. But the essential goal of
Bayes rule (also called Bayesian or probabilistic inference) is to infer the posterior distribution
conditioned on already observed data. In other words, we can use a probabilistic model for two
purposes:
• Inference: We can infer the posterior distribution (implied by the model structure we
impose) of a parameter or hidden variable given observed data.
Example 3.1. Suppose we have two fair dice, each with six faces. Suppose that we throw both
dice and observe the sum as y = 9. Derive the posterior distribution of the first die X1 and the
second die X2 given Y = 9.
Solution. Let us first write down the joint distribution of the two dice. Define the outcome of
the first die as X1 and the outcome of the second die as X2 . We can then describe their joint
probability table as
p(x1 , x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 2 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 3 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 4 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 5 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 6 1/36 1/36 1/36 1/36 1/36 1/36
i.e., each combination is equally probable. Note that this is also the table of p(x1 )p(x2 ) due to
independence. Suppose that we can only observe the sum of the two dice, Y = X1 + X2 . This
would result in a likelihood
$$p(y|x_1, x_2) = \begin{cases} 1 & \text{if } y = x_1 + x_2, \\ 0 & \text{otherwise.} \end{cases}$$
We can also denote this as an indicator function, i.e., let 1{y=x1 +x2 } (x1 , x2 ) be the indicator
function of the event y = x1 + x2 , then we have p(y|x1 , x2 ) = 1{y=x1 +x2 } (x1 , x2 ). Suppose now
we observe Y = 9 and would like to infer the posterior distribution of X1 and X2 given Y = 9.
We can use the Bayes rule to write
$$p(x_1, x_2 | y = 9) = \frac{p(y = 9 | x_1, x_2)\, p(x_1, x_2)}{p(y = 9)} = \frac{p(y = 9 | x_1, x_2)\, p(x_1)\, p(x_2)}{p(y = 9)}.$$
Let us first write out p(y = 9|x1 , x2 ) as a table
p(y = 9|x1 , x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1
X2 = 4 0 0 0 0 1 0
X2 = 5 0 0 0 1 0 0
X2 = 6 0 0 1 0 0 0
This is just the likelihood. In order to get the full joint (numerator of the Bayes theorem), we
need to multiply the likelihood with the joint prior p(x1 , x2 ) = p(x1 )p(x2 ). Multiplying this
table with the joint probability table of X1 and X2 gives
43
p(y = 9|x1 , x2 )p(x1 )p(x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1/36
X2 = 4 0 0 0 0 1/36 0
X2 = 5 0 0 0 1/36 0 0
X2 = 6 0 0 1/36 0 0 0
This is just the numerator in the Bayes theorem, we now need to compute the probability p(y = 9)
in order to finally arrive at the posterior distribution. We can compute this as
$$p(y = 9) = \sum_{x_1, x_2} p(y = 9 | x_1, x_2)\, p(x_1)\, p(x_2) = \sum_{x_1, x_2} 1(9 = x_1 + x_2) \times \frac{1}{6} \times \frac{1}{6} = \frac{1}{36} \sum_{x_1, x_2} 1(9 = x_1 + x_2) = \frac{1}{36} \times 4 = \frac{1}{9}.$$
Now we are ready to normalise p(y = 9|x1 , x2 )p(x1 )p(x2 ) to obtain the posterior distribution as
a table
p(x1 , x2 |y = 9) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1/4
X2 = 4 0 0 0 0 1/4 0
X2 = 5 0 0 0 1/4 0 0
X2 = 6 0 0 1/4 0 0 0
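The whole example can be reproduced with a few lines of numpy, treating the prior, likelihood, and posterior as 6 × 6 tables (the variable names are ours):

```python
import numpy as np

# Joint prior p(x1, x2): each of the 36 outcomes has probability 1/36
prior = np.full((6, 6), 1.0 / 36.0)

# Likelihood p(y = 9 | x1, x2): indicator of x1 + x2 = 9 (values 1..6)
x1, x2 = np.meshgrid(np.arange(1, 7), np.arange(1, 7), indexing="ij")
likelihood = (x1 + x2 == 9).astype(float)

joint = likelihood * prior        # numerator of the Bayes rule
evidence = joint.sum()            # p(y = 9) = 4/36 = 1/9
posterior = joint / evidence      # p(x1, x2 | y = 9): 1/4 on four cells
print(evidence, posterior.max())
```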
Example 3.2. Consider a Gaussian prior and a Gaussian likelihood,
$$p(x) = \mathcal{N}(x; \mu_0, \sigma_0^2), \qquad p(y|x) = \mathcal{N}(y; x, \sigma^2),$$
where µ0 and σ0² are the prior mean and variance, respectively, and σ² is the variance of the
likelihood. Derive the posterior distribution p(x|y) using the Bayes rule.
Solution. We have seen a similar example before (without the proof), where we computed the
marginal likelihood p(y). In this example, we will instead derive the posterior distribution p(x|y).
Let us write
$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}.$$
In order to derive the posterior, we first derive p(y|x)p(x) as
$$p(y|x)\, p(x) = \mathcal{N}(y; x, \sigma^2)\, \mathcal{N}(x; \mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - x)^2}{2\sigma^2}\right) \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(x - \mu_0)^2}{2\sigma_0^2}\right) = \frac{1}{2\pi \sigma \sigma_0} \exp\left(-\frac{(y - x)^2}{2\sigma^2} - \frac{(x - \mu_0)^2}{2\sigma_0^2}\right).$$
We know that
$$p(x|y) \propto p(y|x)\, p(x) \propto \exp\left(-\frac{(y - x)^2}{2\sigma^2} - \frac{(x - \mu_0)^2}{2\sigma_0^2}\right).$$
Recall now that we want a density in x. Let us expand this exponential:
$$p(x|y) \propto \exp\left(-\frac{y^2}{2\sigma^2} + \frac{xy}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{x^2}{2\sigma_0^2} + \frac{x\mu_0}{\sigma_0^2} - \frac{\mu_0^2}{2\sigma_0^2}\right)$$
$$= \exp\left(-\left(\frac{1}{2\sigma^2} + \frac{1}{2\sigma_0^2}\right)x^2 + \left(\frac{y}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)x - \frac{y^2}{2\sigma^2} - \frac{\mu_0^2}{2\sigma_0^2}\right)$$
$$\propto \exp\left(-\frac{1}{2}\frac{\sigma^2 + \sigma_0^2}{\sigma^2\sigma_0^2}\left(x^2 - 2x\,\frac{y\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2 + \sigma^2}\right)\right)$$
$$\propto \exp\left(-\frac{1}{2}\,\frac{\left(x - \frac{y\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2 + \sigma^2}\right)^2}{\frac{\sigma^2\sigma_0^2}{\sigma^2 + \sigma_0^2}}\right).$$
Hence the posterior is
$$p(x|y) = \mathcal{N}\!\left(x;\ \frac{y\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2 + \sigma^2},\ \frac{\sigma^2\sigma_0^2}{\sigma^2 + \sigma_0^2}\right).$$
Figure 3.1: Posterior distribution of x given σ = 1, σ = 0.7 and σ = 0.5 respectively. One can see that as we
shrink the likelihood variance, the posterior distribution becomes more peaked towards the observation y = 0.5.
Old posteriors are also plotted in the second and third figure for comparison (in transparent blue).
Figure 3.2: On the left, we plot all distributions of interest: the prior, the likelihood (with y = 0.5, as a function of x),
the posterior, the unnormalised posterior, and the proposal. Note that the proposal only needs to cover the
unnormalised posterior, even if the normalising constant is less than one. On the right, we plot the samples against the
same quantities. One can see that we have sampled exactly from the correct posterior.
This is an example of a conjugate prior, where the posterior distribution is of the same form
as the prior. In the solved examples section, we will see more examples of this. As you have seen,
the derivation of the posterior took some work. As opposed to this conjugate case, in the general
case, we will not be able to derive the posterior. Let us now see an example of how we can avoid
computing the normalised posterior but still sample from it.
Example 3.3. Assume that we have a prior distribution p(x) = N (x; µ0 , σ02 ) and a likelihood
p(y|x) = N (y; x, σ 2 ). We want to sample the posterior distribution p(x|y) without going
through the derivation. Derive the rejection sampler for this purpose.
Recall that we would like to sample from the posterior p(x|y) without necessarily computing the
Bayes rule. We can pose this problem as a rejection sampling problem. We would like to sample
from the posterior distribution conditioned on y. In our case, the unnormalised posterior is
46
given by
$$\bar{p}(x|y) = p(y|x)\, p(x).$$
Note that we evaluate the likelihood at the observation y and hence it becomes a function of x.
Below, for clarity, we will use the r.h.s. of the above equation in the acceptance ratio, instead of
$\bar{p}(x)$ as we usually did before. For this example, we also set µ0 = 0, σ0 = 1, and σ = 0.5. Next,
we need to design a proposal distribution q(x). This could be tricky as we do not know the posterior.
For now, we can choose another simple Gaussian (we could also optimise this),
$$q(x) = \mathcal{N}(x; \mu_q, \sigma_q^2).$$

Figure 3.3: Illustration of the prior, posterior, likelihood, and the proposal distribution.

Let us choose µ_q = 0 and σ_q = 1 (note again that this is the standard deviation!) and M = 1.
An illustration of this is shown in Fig. 3.2. We can now sample from the posterior:
• Sample X' ∼ q(x).
• Sample U ∼ Unif(0, 1).
• If $U \le \frac{p(y|X')\, p(X')}{M q(X')}$, accept X'. Otherwise, reject X' and go back to step 1.
We can see the results of this procedure in Fig. 3.2; a short sketch is given below. As seen from the
figure, we sample exactly from the posterior p(x|y = 0.5) without ever computing the normalised
posterior. We have also plotted the correct posterior in the figure for comparison.
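A minimal numpy sketch of this posterior rejection sampler (normal_pdf and sample_posterior are our names; the values y = 0.5, µ0 = 0, σ0 = 1, σ = 0.5, M = 1 are the ones used in the example):

```python
import numpy as np

rng = np.random.default_rng(12)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def sample_posterior(y, sigma=0.5, mu0=0.0, sigma0=1.0, M=1.0, n_trials=100_000):
    # Unnormalised posterior p_bar(x | y) = p(y | x) p(x); proposal q = N(0, 1)
    x = rng.normal(loc=0.0, scale=1.0, size=n_trials)
    u = rng.uniform(0.0, 1.0, size=n_trials)
    p_bar = normal_pdf(y, x, sigma) * normal_pdf(x, mu0, sigma0)
    return x[u <= p_bar / (M * normal_pdf(x, 0.0, 1.0))]

samples = sample_posterior(y=0.5)
```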
Example 3.4. Assume that we have a Poisson observation model:
xy e−x
p(y|x) = Pois(y; x) = ,
y!
and a Gamma prior:
β α α−1 −βx
p(x) = Gamma(x; α, β) = x e .
Γ(α)
We want to sample from the posterior distribution p(x|y). Derive the posterior distribution
p(x|y).
\begin{align*}
p(x \mid y) &\propto p(y \mid x)\, p(x) \\
&= \text{Pois}(y; x)\,\text{Gamma}(x; \alpha, \beta) \\
&\propto x^{\alpha - 1 + y}\, e^{-\beta x - x},
\end{align*}
where we ignored all the normalising constants. We can see that the posterior is also a Gamma
density:
\[
p(x \mid y) = \text{Gamma}(x; \alpha + y, \beta + 1).
\]
Let us sample from this posterior with rejection sampling as we did before for the Gaussian.
Figure 3.4: Histogram of the samples drawn using rejection sampling.
Aiming at optimising the ratio p̄(x|y)/qλ(x) w.r.t. x, we first compute its log:
\begin{align*}
\log \frac{\bar{p}(x \mid y)}{q_\lambda(x)} &= \log x^{\alpha - 1 + y} + \log e^{-(2-\lambda)x} - \log \lambda \\
&= (\alpha - 1 + y)\log x - (2 - \lambda)x - \log \lambda.
\end{align*}
We now take the derivative of this w.r.t. x:
\[
\frac{\mathrm{d}}{\mathrm{d}x}\left[(\alpha - 1 + y)\log x - (2 - \lambda)x - \log \lambda\right] = \frac{\alpha - 1 + y}{x} - (2 - \lambda),
\]
and set it to zero:
\[
\frac{\alpha - 1 + y}{x} - (2 - \lambda) = 0.
\]
This gives us the maximiser
\[
x^\star = \frac{\alpha - 1 + y}{2 - \lambda}.
\]
We can now compute M_λ:
\begin{align*}
M_\lambda &= \frac{\bar{p}(x^\star \mid y)}{q_\lambda(x^\star)} \\
&= \frac{(x^\star)^{\alpha - 1 + y}\, e^{-(2-\lambda)x^\star}}{\lambda} \\
&= \frac{1}{\lambda}\left(\frac{\alpha - 1 + y}{2 - \lambda}\right)^{\alpha - 1 + y} e^{-(2-\lambda)\frac{\alpha - 1 + y}{2 - \lambda}} \\
&= \frac{1}{\lambda}\left(\frac{\alpha - 1 + y}{2 - \lambda}\right)^{\alpha - 1 + y} e^{-(\alpha - 1 + y)}.
\end{align*}
We can now optimise this further to choose our optimal proposal. We will first compute the log
of Mλ :
\begin{align*}
\log M_\lambda &= \log\frac{1}{\lambda} + (\alpha - 1 + y)\log\frac{\alpha - 1 + y}{2 - \lambda} - (\alpha - 1 + y) \\
&= -\log\lambda + (\alpha - 1 + y)\log\frac{\alpha - 1 + y}{2 - \lambda} - (\alpha - 1 + y).
\end{align*}
Taking the derivative of this w.r.t. λ, we obtain
\[
\frac{\mathrm{d}}{\mathrm{d}\lambda}\log M_\lambda = -\frac{1}{\lambda} + \frac{\alpha - 1 + y}{2 - \lambda}.
\]
Setting this to zero, we obtain
\[
\frac{1}{\lambda} = \frac{\alpha - 1 + y}{2 - \lambda},
\]
which implies that
\[
\lambda^\star = \frac{2}{\alpha + y}.
\]
Therefore, we can choose our optimal proposal in terms of α and y; the optimal rate depends on the observed
sample. See Fig. 3.4 for the histogram of the samples drawn using rejection sampling.
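As a sketch of how this could look in code, the snippet below implements the rejection sampler with the optimal exponential proposal derived above, under the assumption β = 1 (so that the exponent of the target is −2x, matching the derivation) and with hypothetical values α = 2 and y = 3.

```python
import numpy as np

# Hypothetical setting consistent with the derivation: Gamma(alpha, beta) prior with beta = 1,
# Poisson likelihood with observation y, and exponential proposal q_lambda(x) = lambda*exp(-lambda*x).
alpha, y = 2.0, 3.0
lam = 2.0 / (alpha + y)                       # optimal rate lambda* = 2 / (alpha + y)
c = alpha - 1.0 + y

def p_bar(x):
    # unnormalised posterior x^(alpha - 1 + y) * exp(-2 x), constants dropped
    return x ** c * np.exp(-2.0 * x)

def q(x):
    return lam * np.exp(-lam * x)

# M_lambda = (1/lambda) * (c / (2 - lambda))^c * exp(-c), as derived above
M = (1.0 / lam) * (c / (2.0 - lam)) ** c * np.exp(-c)

rng = np.random.default_rng(1)
samples, n_proposed = [], 0
while len(samples) < 20_000:
    x = rng.exponential(1.0 / lam)            # numpy parameterises by the scale 1/lambda
    n_proposed += 1
    if rng.uniform() <= p_bar(x) / (M * q(x)):
        samples.append(x)

samples = np.array(samples)
# The posterior is Gamma(alpha + y, beta + 1): its mean is (alpha + y)/(beta + 1) = 2.5 here
print("acceptance rate:", len(samples) / n_proposed, "sample mean:", samples.mean())
```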
3.3 conditional independence
The step forward from the simple Bayes rule to modelling complex dependencies and interactions
is to understand the notion of conditional independence. Simply put, conditional independence
is a notion of independence of two random variables conditioned on a third random variable. Of
course, this can be extended to an arbitrary number of variables, defining a full probabilistic model.
It is important to note that these models appear everywhere in science and engineering.
Let us first define the notion of conditional independence.
Definition 3.2. Let X, Y and Z be random variables. We say that X and Y are conditionally
independent given Z if
\[
p(x, y \mid z) = p(x \mid z)\, p(y \mid z).
\]
This definition is of course the same as plain independence, just written in terms of conditional
probabilities. Note that, in general, X and Y are not independent if we do not condition on Z.
We note the important corollary.
Corollary 3.1. If X and Y are conditionally independent given Z, then
p(x|y, z) = p(x|z),
and
p(y|x, z) = p(y|z).
Proof. See Exercise 4.2 solution.
We can now describe the notion of conditional independence in terms of joint distributions.
Proposition 3.1. Let X, Y and Z be random variables. If X and Y are conditionally independent
given Z, then
p(x, y, z) = p(x|z)p(y|z)p(z).
Proof. Recall that we have described the chain rule for conditional probabilities in Sec. 2.4.3
p(x1 , . . . , xn ) = p(xn |xn−1 , . . . , x1 )p(xn−1 |xn−2 , . . . , x1 ) · · · p(x2 |x1 )p(x1 ).
This relationship is as important in inference as it is in simulation. We can now use this to show
that
p(x, y, z) = p(x|y, z)p(y|z)p(z)
= p(x|z)p(y|z)p(z),
where the last line follows from Corollary 3.1.
This idea can be extended to an arbitrary number of variables. Such factorisations are
at the core of probabilistic modelling. In other words, a probabilistic modeller (scientist) poses
a set of conditional independence assumptions which then allows them to factorise the joint
distribution into a product of conditional distributions. From then on, the modeller can use the
conditional distributions to compute any desired marginal or conditional distributions. This is
the essence of probabilistic modelling.
Plugging this back into the Bayes update (3.4), we can see that the posterior is proportional to
the product
\[
p(x \mid y_{1:n}) \propto p(x) \prod_{i=1}^{n} p(y_i \mid x).
\]
Again, in many occasions, we will not be able to compute the normalising constant. However, we
can still sample from the posterior. In this particular example, let us continue with the Gaussian
prior and likelihood. In this case, we can exactly compute the posterior too.
Example 3.6 (Gaussian Bayes update for conditionally independent observations). Let us assume
the following probabilistic model²
\[
x \sim \mathcal{N}(x; \mu_0, \sigma_0^2), \qquad y_i \mid x \sim \mathcal{N}(y_i; x, \sigma^2), \quad i = 1, \ldots, n,
\]
where the observations y_{1:n} are conditionally independent given x. Derive the posterior p(x|y_{1:n}).

² We define the following notation. Let y1, . . . , yn be n observations. We collectively denote these variables as
y1:n := (y1, . . . , yn). This will also be used in the following sections.
Figure 3.5: Bayes update for conditionally independent observations.
Solution. Here each observation is assumed to be conditionally independent given x. Note that
this model is very different from the one where we simulated (Xi, Yi) pairs in Example 2.14. The
point in Example 2.14 was to simulate pairs exhibiting a linear relationship; each (Yi, Xi) was an
independent draw from the joint distribution. Here, we assume that the observations are sampled
conditioned on a single x – in essence, the sequence y1, . . . , yn is dependent; the observations are only
conditionally independent given x.
Having observed y1 , . . . , yn , we would like to compute the posterior p(x|y1 , . . . , yn ). Let us
first compute the likelihood
\begin{align*}
p(y_{1:n} \mid x) &= \prod_{i=1}^{n} p(y_i \mid x) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x)^2}{2\sigma^2}\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x)^2\right).
\end{align*}
Using the same derivations as in Example 3.2, we can compute the posterior
\begin{align*}
p(x \mid y_{1:n}) &= \frac{p(y_{1:n} \mid x)\, p(x)}{p(y_{1:n})} \\
&= \frac{p(y_{1:n} \mid x)\, p(x)}{\int p(y_{1:n} \mid x)\, p(x)\, \mathrm{d}x},
\end{align*}
where p(x|y_{1:n}) = N(x; µp, σp²), with (Murphy, 2007)
\begin{align*}
\mu_p &= \frac{\sigma_0^2 \sum_{i=1}^{n} y_i + \sigma^2 \mu_0}{\sigma_0^2 n + \sigma^2}, \\
\sigma_p^2 &= \frac{\sigma_0^2 \sigma^2}{\sigma_0^2 n + \sigma^2}.
\end{align*}
The posterior conditioned on the data can be seen in Fig. 3.5.
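A small helper along the following lines computes µp and σp² directly from the formulas above; the data-generating values used to test it are hypothetical.

```python
import numpy as np

def gaussian_posterior(y, mu0, sigma0, sigma):
    """Posterior N(mu_p, sigma_p^2) of x given n conditionally independent
    observations y_i ~ N(x, sigma^2) and prior x ~ N(mu0, sigma0^2)."""
    y = np.asarray(y)
    n = y.size
    denom = sigma0**2 * n + sigma**2
    mu_p = (sigma0**2 * y.sum() + sigma**2 * mu0) / denom
    sigma_p2 = (sigma0**2 * sigma**2) / denom
    return mu_p, sigma_p2

# Hypothetical check: simulate data from a known x and inspect the posterior.
rng = np.random.default_rng(2)
x_true, sigma = 1.5, 0.8
y = rng.normal(x_true, sigma, size=50)
print(gaussian_posterior(y, mu0=0.0, sigma0=1.0, sigma=sigma))
```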
Proposition 3.2. Given X, Y, Z without any conditional independence assumptions, the conditional
Bayes rule is
\[
p(x \mid y, z) = \frac{p(y \mid x, z)\, p(x \mid z)}{p(y \mid z)}.
\]
For fixed y, the interpretation of this term is that it gives us the probability of the data y under
the model.³ For more complicated models (where x can be multiple variables or multiple other

³ Aside from its usual interpretation as the normalising constant.
Figure 3.6: Marginal likelihood for model comparison. For observed data, we can compute the marginal likelihood
for each model. The model with the highest marginal likelihood is the best model for the observed data.
distributions may exist), the quantity p(y) becomes crucial for determining the quality of the model
for the observed data. While by itself it does not mean much, it gives us a comparative measure to
compare different models. We will discuss this with an example.
Example 3.7 (Marginal likelihood for two Gaussian models). Consider two different models:
\[
\text{Model 0:} \quad x \sim \mathcal{N}(x; \mu_0, \sigma_0^2), \quad y \mid x \sim \mathcal{N}(y; x, \sigma_y^2),
\]
and
\[
\text{Model 1:} \quad x \sim \mathcal{N}(x; \mu_1, \sigma_1^2), \quad y \mid x \sim \mathcal{N}(y; x, \sigma_y^2).
\]
Consider observing y (a single data point). How can you find out which model is more likely?
Solution. Recall that, for these models, we have computed p(y) analytically before. We can
compute it for both models:
\begin{align*}
p_0(y) &= \int p(y \mid x)\, p_0(x)\, \mathrm{d}x \\
&= \int \mathcal{N}(y; x, \sigma_y^2)\, \mathcal{N}(x; \mu_0, \sigma_0^2)\, \mathrm{d}x \\
&= \mathcal{N}(y; \mu_0, \sigma_0^2 + \sigma_y^2),
\end{align*}
and
\begin{align*}
p_1(y) &= \int p(y \mid x)\, p_1(x)\, \mathrm{d}x \\
&= \int \mathcal{N}(y; x, \sigma_y^2)\, \mathcal{N}(x; \mu_1, \sigma_1^2)\, \mathrm{d}x \\
&= \mathcal{N}(y; \mu_1, \sigma_1^2 + \sigma_y^2).
\end{align*}
We will say Model 1 is better than Model 0 if p1(y) > p0(y) for the observed y and, similarly, we will say
Model 0 is better if p1(y) < p0(y). This, however, as you can see, depends on the various parameters.
Let us choose σy = 1, µ0 = −4, σ0 = 2, and µ1 = 1, σ1 = 0.5. The computed marginal
likelihoods can be seen in Fig. 3.6. It can be seen that Model 1 is a much better fit to the data
than Model 0.
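A short sketch of this comparison in code, using the closed-form marginal likelihoods above (the value y = 0.5 below is a hypothetical observation):

```python
import numpy as np
from scipy.stats import norm

# Parameters as chosen in the example; sigma_y = 1 is the likelihood standard deviation.
sigma_y = 1.0
mu0, sigma0 = -4.0, 2.0
mu1, sigma1 = 1.0, 0.5

def marginal_likelihood(y, mu, sigma_prior):
    # p(y) = N(y; mu, sigma_prior^2 + sigma_y^2) for this conjugate Gaussian model
    return norm.pdf(y, loc=mu, scale=np.sqrt(sigma_prior**2 + sigma_y**2))

y_obs = 0.5  # hypothetical observation
print("Model 0:", marginal_likelihood(y_obs, mu0, sigma0))
print("Model 1:", marginal_likelihood(y_obs, mu1, sigma1))
```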
Figure 3.7: The curse of dimensionality for the sampling example for rejection sampling.
The generalisation of this process goes as follows. Assume that we have p(x|y_{1:n−1}), i.e., the
posterior distribution given the first n − 1 observations. Upon receiving yn, we can update our
posterior as
\[
p(x \mid y_{1:n}) \propto p(y_n \mid x)\, p(x \mid y_{1:n-1}),
\]
which is the sequential Bayesian updating rule. This is a very important result as it allows
us to update our posterior sequentially without reprocessing the data. This is especially useful in
online learning scenarios where we would like to update our posterior as we receive new data
points. This will be of crucial use when we consider sequential Monte Carlo towards the end of
the course. We will have one exercise about a Gaussian example in this setting.
3.6 conclusion
In this section, we briefly discussed the Bayes rule and its application to probabilistic inference.
This is a vast topic and we have only scratched the surface. If you are curious about the topic,
Bishop (2006) is a good book to read. Some other very nice ones are Barber (2012) and Murphy
(2022). We will finish this chapter by discussing why rejection samplers, as we introduced them, would
not be appropriate candidates for sampling in the more complicated models we discussed in this
chapter.
Given all these derivations, it is natural to ask whether we can use rejection samplers for
Bayesian inference. Let us assume that we have y1 , . . . , yn observed and our unnormalised
posterior is given by
\[
\bar{p}(x \mid y_{1:n}) = p(x) \prod_{i=1}^{n} p(y_i \mid x).
\]
Let us assume that we have a proposal distribution q(x) and assume that we have been lucky to
identify some M such that
p̄(x|y1:n ) ≤ M q(x).
1. Sample X′ ∼ q(x).
2. Sample U ∼ Unif(0, 1).
3. If
\[
U \leq \frac{\bar{p}(X' \mid y_{1:n})}{M\, q(X')} = \frac{p(X') \prod_{i=1}^{n} p(y_i \mid X')}{M\, q(X')},
\]
then accept X′.
What could be an immediate problem as n grows? The multiplication \prod_{i=1}^{n} p(y_i \mid X') would
not be numerically stable. This would result in numerical underflow as the product of
small probabilities gets smaller and smaller. In order to mitigate this, one solution is to work
with log-probabilities. This means that we can still perform rejection sampling (provided that
p̄(x|y_{1:n}) ≤ M q(x)) as follows:

1. Sample X′ ∼ q(x).
2. Sample U ∼ Unif(0, 1).
3. If
\[
\log U \leq \log p(X') + \sum_{i=1}^{n} \log p(y_i \mid X') - \log M - \log q(X'),
\]
then accept X′.
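A minimal sketch of this log-domain acceptance test is given below; the function name and the specific numbers are only illustrative.

```python
import numpy as np

def accept_log_domain(log_prior, log_lik_terms, log_q, log_M, rng):
    """Rejection-sampling accept test computed entirely with log-probabilities,
    so that the product of many small likelihood terms does not underflow."""
    log_p_bar = log_prior + np.sum(log_lik_terms)   # log p(x) + sum_i log p(y_i | x)
    log_u = np.log(rng.uniform())
    return log_u <= log_p_bar - log_M - log_q

# Hypothetical numbers: 500 likelihood terms around exp(-3) would underflow if multiplied directly.
rng = np.random.default_rng(12)
print(accept_log_domain(-0.9, np.full(500, -3.0), -1.2, 0.0, rng))
```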
This is also not the only failure mode of rejection sampling. It is often the case that rejection
sampling is very inefficient in high dimensions even if one manages to find a good proposal q.
Consider the rejection sampling in 2D for sampling the circle within a square. The acceptance
probability for this case is
\[
\hat{a} = \frac{\text{area of the circle}}{\text{area of the square}} = \frac{\pi}{4} \approx 0.78.
\]
Next, consider the same sampler for the sphere and the cube (3D). The acceptance probability for
this case is
\[
\hat{a} = \frac{\text{volume of the sphere}}{\text{volume of the cube}} = \frac{\pi}{6} \approx 0.52.
\]
If we were doing this in d dimensions, the acceptance rate would be the ratio of the volume of the
unit d-ball to the volume of the cube [−1, 1]^d,
\[
\hat{a} = \frac{\pi^{d/2}}{2^d\, \Gamma(d/2 + 1)}.
\]
However, this ratio goes to zero incredibly fast as d grows (see Fig. 3.7). In other words, rejection
samplers have very poor acceptance rates in high dimensions. This will lead us to look at other
sampling methods.
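The decay of this acceptance rate can be computed directly from the formula above, for example as follows (working in log-space to avoid overflowing the Gamma function):

```python
import numpy as np
from scipy.special import gammaln

def acceptance_rate(d):
    # volume of the unit d-ball over the volume of the cube [-1, 1]^d,
    # computed in log-space so that Gamma(d/2 + 1) does not overflow
    log_ratio = (d / 2) * np.log(np.pi) - d * np.log(2.0) - gammaln(d / 2 + 1)
    return np.exp(log_ratio)

for d in [2, 3, 5, 10, 20, 50]:
    print(d, acceptance_rate(d))
```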
4 MONTE CARLO INTEGRATION
In this section, we introduce Monte Carlo integration and importance sampling in detail. We will
show how these ideas can be applied to a variety of problems such as computing integrals, computing
expectations, sampling from complex distributions, and computing marginal likelihoods.
We are interested in computing expectations of the form
\[
\int \varphi(x)\, p(x)\, \mathrm{d}x,
\]
where ϕ(x) is called a test function. For example, ϕ(x) = x would give us the mean, ϕ(x) = x² the
second moment, and ϕ(x) = − log p(x) would give us the (differential) entropy. For example, given
X^{(1)}, . . . , X^{(N)} ∼ p i.i.d., we know that (intuitively, at this point) the mean estimator is given by
\[
\mathbb{E}_p[X] = \int x\, p(x)\, \mathrm{d}x \approx \frac{1}{N}\sum_{i=1}^{N} X^{(i)},
\]
which is simply the empirical average of the samples. While this can be intuitive, it underlies a
certain choice about the approximation of the probability distribution p using its samples. In
order to do this, we build an empirical distribution of the samples, using
\[
p^N(x)\,\mathrm{d}x = \frac{1}{N}\sum_{i=1}^{N}\delta_{X^{(i)}}(x)\,\mathrm{d}x. \tag{4.1}
\]
In order to understand how this works, we first need to understand the Dirac delta measure δx .
The Dirac delta measure is defined as the measure satisfying
\[
f(y) = \int f(x)\, \delta_y(x)\, \mathrm{d}x. \tag{4.2}
\]
Here, the Dirac can be thought of as a point mass at y. In other words, the Dirac delta measure is a
measure which is concentrated at a single point. To understand it intuitively, the object δy(x) can
be informally thought of as a function centred at y (and only takes value 1 at y)¹
\[
\delta_y(x) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases}
\]
One can see that pN is then a sample-based approximation of p, where the samples are equally
weighted. While we may never use this particular approximation of a density, it is useful for building
estimates of expectations. Generalising the above scenario, let us consider the estimation of the
general expectation
\[
\bar{\varphi} = \mathbb{E}_p[\varphi(X)] = \int \varphi(x)\, p(x)\, \mathrm{d}x.
\]
Given samples X^{(1)}, . . . , X^{(N)}, we can build pN as in (4.1) and approximate this expectation as
\begin{align*}
\bar{\varphi} = \mathbb{E}_p[\varphi(X)] &= \int \varphi(x)\, p(x)\, \mathrm{d}x \\
&\approx \int \varphi(x)\, p^N(x)\, \mathrm{d}x \\
&= \int \varphi(x)\, \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)\, \mathrm{d}x \\
&= \frac{1}{N}\sum_{i=1}^{N}\int \varphi(x)\, \delta_{X_i}(x)\, \mathrm{d}x \\
&= \frac{1}{N}\sum_{i=1}^{N}\varphi(X_i) =: \hat{\varphi}^N, \tag{4.3}
\end{align*}
where we have used (4.2) in the approximate integral to arrive at the final expression. Note that
this generalises the example above about the mean (which was ϕ(x) = x case). In this course,
we will also be interested in the properties of these estimators.
1
This is not correct rigorously – just for intuition! Note that the Diracs always make sense with an integral
attached to them.
Remark 4.1. As we can see, the Monte Carlo estimator can be used to estimate expectations.
We can also use this idea to estimate integrals. Consider a standard integration problem
\[
I = \int f(x)\, \mathrm{d}x,
\]
where f(x) is a function. We can use the Monte Carlo (MC) estimator to estimate this integral as
\begin{align*}
I &= \int \frac{f(x)}{p(x)}\, p(x)\, \mathrm{d}x \\
&\approx \int \frac{f(x)}{p(x)}\, p^N(x)\, \mathrm{d}x, \quad \text{where } p^N(x)\,\mathrm{d}x = \frac{1}{N}\sum_{i=1}^{N}\delta_{X^{(i)}}(x)\,\mathrm{d}x \\
&= \frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p(X_i)}, \quad \text{using (4.1)}.
\end{align*}
In this case, we have ϕ(x) = f(x)/p(x). This is particularly easy for integrals of the type
\[
I = \int_0^1 f(x)\, \mathrm{d}x,
\]
where f (x) is a function. In this case, we can use the uniform distribution as the base distribution
p and use the Monte Carlo estimator to estimate the integral without needing to compute any
ratios.
In the following, we prove some results about the properties of the Monte Carlo estimator
(4.3) when samples are i.i.d from p.
Proposition 4.1. Let X1, . . . , XN be i.i.d samples from p. Then, the Monte Carlo estimator
\[
\hat{\varphi}^N = \frac{1}{N}\sum_{i=1}^{N}\varphi(X_i)
\]
is unbiased, i.e.,
\[
\mathbb{E}_p[\hat{\varphi}^N] = \bar{\varphi}.
\]
Proof. We have
\begin{align*}
\mathbb{E}_p[\hat{\varphi}^N] &= \mathbb{E}_p\left[\frac{1}{N}\sum_{i=1}^{N}\varphi(X_i)\right] \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_p[\varphi(X_i)] \\
&= \frac{1}{N}\sum_{i=1}^{N}\int \varphi(x)\, p(x)\, \mathrm{d}x \\
&= \int \varphi(x)\, p(x)\, \mathrm{d}x \\
&= \bar{\varphi}.
\end{align*}
Proposition 4.2. Let X1, . . . , XN be i.i.d samples from p. Then, the Monte Carlo estimator
\[
\hat{\varphi}^N = \frac{1}{N}\sum_{i=1}^{N}\varphi(X_i)
\]
has variance
\[
\operatorname{var}_p[\hat{\varphi}^N] = \frac{1}{N}\operatorname{var}_p[\varphi(X)],
\]
where
\[
\operatorname{var}_p[\varphi(X)] = \int (\varphi(x) - \bar{\varphi})^2\, p(x)\, \mathrm{d}x.
\]
Proof. We have
\begin{align*}
\operatorname{var}_p[\hat{\varphi}^N] &= \operatorname{var}_p\left[\frac{1}{N}\sum_{i=1}^{N}\varphi(X_i)\right] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{var}_p[\varphi(X_i)] \quad \text{(by independence)} \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\int (\varphi(x) - \bar{\varphi})^2\, p(x)\, \mathrm{d}x \\
&= \frac{1}{N}\operatorname{var}_p[\varphi(X)] = \frac{\sigma_\varphi^2}{N}.
\end{align*}
Provided that var_p[ϕ(X)] < ∞, the estimator is consistent as N → ∞. This proves the result.
Remark 4.2. The expression var_p[ϕ̂N] is the variance of the MC estimator, but this expression
requires the true mean ϕ̄ to be known. In practice, we do not know the true mean, but we do
have an MC estimator for it. We can plug this estimator into the variance in order to obtain an
empirical variance estimator. Note that
\begin{align*}
\operatorname{var}_p[\hat{\varphi}^N] &= \frac{1}{N}\operatorname{var}_p[\varphi(X)] \\
&= \frac{1}{N}\int (\varphi(x) - \bar{\varphi})^2\, p(x)\, \mathrm{d}x \\
&\approx \frac{1}{N^2}\sum_{i=1}^{N}(\varphi(X_i) - \hat{\varphi}^N)^2 \\
&=: \sigma_{\varphi,N}^2.
\end{align*}
This estimator can then be used to estimate the variance of the MC estimator.
We can therefore obtain a central limit theorem for our estimator, i.e.,
\[
\frac{\hat{\varphi}^N - \bar{\varphi}}{\sigma_{\varphi,N}} \to \mathcal{N}(0, 1) \quad \text{as } N \to \infty.
\]
This can be used to build empirical confidence intervals for the estimators. However, this is not
a principled estimate and may not be valid in many scenarios. We can also see that we have a
standard deviation estimate (which follows from the variance estimate) given by
\[
\operatorname{std}_p[\hat{\varphi}^N] = \sqrt{\operatorname{var}_p[\hat{\varphi}^N]} = \frac{\sigma_\varphi}{\sqrt{N}}.
\]
This is a typical display of a convergence rate O(1/\sqrt{N}).
Remark 4.3. One of the most common applications of sampling is to estimate probabilities. We
have seen that different choices of ϕ can lead to estimating different quantities such as the mean
and nth moments. However, the MC estimators can also be used to estimate probabilities. In
order to see this, assume that we would like to estimate P(X ∈ A) where X ∼ p. We know that
this is given as
\[
\mathbb{P}(X \in A) = \int_A p(x)\, \mathrm{d}x,
\]
see, e.g., Definition 1.2. For example, A can simply be an interval. Given the definition above, we
can write
\[
\mathbb{P}(X \in A) = \int_A p(x)\, \mathrm{d}x = \int \mathbb{1}_A(x)\, p(x)\, \mathrm{d}x,
\]
Figure 4.1: Estimating π using the Monte Carlo method.
where 1_A(x) is the indicator function of A. We can therefore set ϕ(x) = 1_A(x) and, given the
samples from p, we can build an estimator
\begin{align*}
\mathbb{P}(X \in A) &= \int \mathbb{1}_A(x)\, p(x)\, \mathrm{d}x \\
&\approx \int \mathbb{1}_A(x)\, p^N(x)\, \mathrm{d}x \\
&= \int \mathbb{1}_A(x)\, \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)\, \mathrm{d}x \\
&= \frac{1}{N}\sum_{i=1}^{N}\int \mathbb{1}_A(x)\, \delta_{X_i}(x)\, \mathrm{d}x \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_A(X_i).
\end{align*}
We can now return to the example of estimating π using the Monte Carlo method.
Example 4.1. Recall the problem of estimating π using the Monte Carlo method. Frame it as a
Monte Carlo integration problem and derive the algorithm rigorously.
Figure 4.2: Relative error (see next section) of the Monte Carlo estimate provided by sampling within the circle.
Solution. The logic that was used in this example was to estimate the area of a circle that lies
within a square. To be precise, consider the square [−1, 1] × [−1, 1] and define the uniform
distribution on this square as p(x, y) = Unif([−1, 1] × [−1, 1]). We can simply phrase the
problem as estimating the area of the circle which we define as A ⊂ [−1, 1] × [−1, 1]. The set A
is given as
A = {(x, y) ∈ [−1, 1] × [−1, 1] | x2 + y 2 ≤ 1}.
We can then formalise this problem as estimating the probability that a point lies within the circle.
This is given as
\[
\mathbb{P}(X \in A) = \int_A p(x, y)\, \mathrm{d}x\, \mathrm{d}y = \int \mathbb{1}_A(x, y)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y.
\]
Sampling (Xi, Yi) ∼ p(x, y) (a uniform sample within the square), we can estimate this integral
using the standard MC method. More formally, we can write
\[
\mathbb{P}(A) = \int \mathbb{1}_A(x, y)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y \approx \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_A(X_i, Y_i) \to \frac{\pi}{4} \quad \text{as } N \to \infty.
\]
A trajectory of the estimation procedure for π can be seen in Fig. 4.1 as the sample size varies.
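A minimal sketch of this estimator in code:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
# Uniform samples on the square [-1, 1] x [-1, 1]
x, y = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)
inside = (x**2 + y**2 <= 1.0)          # indicator of the circle A
pi_hat = 4.0 * inside.mean()           # P(A) is approximately pi / 4, so multiply by 4
print(pi_hat)
```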
Figure 4.3: Monte Carlo integration of h(x) = [cos(50x) + sin(20x)]2
Nonasymptotic results showing the convergence rate of O(1/√N) are also available (see,
e.g., Akyildiz (2019, Corollary 2.1)) – see Fig. 4.2 for a demonstration.
Example 4.2 (Example 3.4 from Robert and Casella (2004)). Let us consider an example of
estimating an integral:
\[
I = \int_0^1 h(x)\, \mathrm{d}x = \int_0^1 [\cos(50x) + \sin(20x)]^2\, \mathrm{d}x.
\]
The exact value of this integral is 0.965. Describe a Monte Carlo method to estimate this integral.
Solution. We can just choose p(x) = Unif(0, 1) and set ϕ(x) = h(x). We can then write
\[
I = \int_0^1 h(x)\, \mathrm{d}x = \int_0^1 \varphi(x)\, p(x)\, \mathrm{d}x,
\]
and apply the standard MC estimator. The results (together with the empirical variance estimate)
can be seen from Fig. 4.3.
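As a sketch, the estimator and its empirical standard error can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(4)
h = lambda x: (np.cos(50 * x) + np.sin(20 * x)) ** 2

x = rng.uniform(0.0, 1.0, 100_000)             # p(x) = Unif(0, 1), so the estimator averages h(x)
vals = h(x)
estimate = vals.mean()
std_err = vals.std(ddof=1) / np.sqrt(x.size)   # empirical standard error of the estimator
print(estimate, "+/-", std_err)                # the exact value is about 0.965
```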
Example 4.3. Consider X ∼ N (0, 1) and we would like to estimate the probability that X > 2.
Describe the MC method.
Figure 4.4: Monte Carlo estimation of the tail probability X > 2. The “true value” is computed via numerical
integration.
Solution. The MC estimator is
\[
\mathbb{P}(X > 2) \approx \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{\{X_i > 2\}},
\]
where X1, . . . , XN ∼ N(0, 1). The results can be seen in Fig. 4.4.
The bias of an estimator ϕ̂N is defined as
\[
\operatorname{bias}(\hat{\varphi}^N) = \mathbb{E}[\hat{\varphi}^N] - \bar{\varphi},
\]
where ϕ̄ is the true value. We call an estimator unbiased if the bias is zero. In the case where we
sample i.i.d from p(x), we can build unbiased estimators of expectations and integrals. We recall
the variance
\[
\operatorname{var}[\hat{\varphi}^N] = \mathbb{E}\left[\left(\hat{\varphi}^N - \mathbb{E}[\hat{\varphi}^N]\right)^2\right].
\]
If the estimator is unbiased, we can then replace E[ϕ̂N] with ϕ̄. Next, we define the mean squared
error (MSE)
\[
\operatorname{MSE}(\hat{\varphi}^N) = \mathbb{E}\left[\left(\hat{\varphi}^N - \bar{\varphi}\right)^2\right].
\]
One can see that the MSE and the variance coincide if the estimator is unbiased. We also have
the following decomposition of the MSE:
\[
\operatorname{MSE}(\hat{\varphi}^N) = \operatorname{bias}(\hat{\varphi}^N)^2 + \operatorname{var}[\hat{\varphi}^N].
\]
The root mean squared error and the relative absolute error are
\[
\operatorname{RMSE}(\hat{\varphi}^N) = \sqrt{\operatorname{MSE}(\hat{\varphi}^N)}, \tag{4.8}
\]
\[
\operatorname{RAE}(\hat{\varphi}^N) = \frac{|\hat{\varphi}^N - \bar{\varphi}|}{|\bar{\varphi}|}. \tag{4.9}
\]
We usually plot the absolute error of the estimator, as we only run the experiment once in general².
We note that this absolute error |ϕ̂N − ϕ̄| is a random variable (since no expectations are taken).
However, this quantity provably converges with a rate of O(1/√N) (see, e.g., Akyildiz (2019,
Corollary 2.1)). More precisely, we can write
\[
|\hat{\varphi}^N - \bar{\varphi}| \leq \frac{V}{\sqrt{N}}, \tag{4.10}
\]
where V is an almost surely finite random variable. This error rate will be displayed empirically
in the following sections (see also Fig. 4.2).
Example 4.4 (Marginal Likelihood estimation). Recall that, given a prior p(x) and a likelihood
p(y|x), we can compute the marginal likelihood p(y) as
\[
p(y) = \int p(y \mid x)\, p(x)\, \mathrm{d}x.
\]
Solution. This defines a nice integration problem that we can solve using MC. Assume that we
are given the following model
2
However, if you were to do a proper experimentation, then you would have to run the same experiment M
times (Monte Carlo runs) and average the error to estimate the RMSE.
Figure 4.5: Estimating the marginal likelihood p(y) for y = 1. One can clearly see the displayed error rate of
O(1/√N).
where we can set ϕ(x) = p(y = 1|x). We can then compute the integral using the MC estimation
procedure as
\[
p^N(y = 1) = \frac{1}{N}\sum_{i=1}^{N} p(y = 1 \mid X_i), \qquad X_i \sim p(x).
\]
We will next consider another example of estimating a probability where we show how to
quantify the variance using the true value.
Figure 4.6: Cauchy density of Example 4.5.
We can compute
\[
\mathbb{P}(X > 2) = \int_2^\infty p(x)\, \mathrm{d}x = \int \mathbb{1}_{\{x > 2\}}(x)\, p(x)\, \mathrm{d}x.
\]
We can also compute the real value of this integral as (see Example 2.3 for the CDF of this density)
\[
I = \bar{\varphi} = \int_2^\infty p(x)\, \mathrm{d}x = F_X(\infty) - F_X(2) = \frac{1}{2} - \frac{1}{\pi}\tan^{-1}(2) = 0.1476.
\]
Let us compute the variance of the Monte Carlo estimator for N = 10 samples:
\[
\operatorname{var}(\hat{\varphi}^N) = \frac{\operatorname{var}_p(\varphi)}{N}.
\]
So we need to compute:
\begin{align*}
\operatorname{var}_p(\varphi) &= \int \varphi(x)^2\, p(x)\, \mathrm{d}x - \left(\int \varphi(x)\, p(x)\, \mathrm{d}x\right)^2 \\
&= \int \mathbb{1}_{\{x>2\}}(x)^2\, p(x)\, \mathrm{d}x - \left(\int \mathbb{1}_{\{x>2\}}(x)\, p(x)\, \mathrm{d}x\right)^2 \\
&= \int \mathbb{1}_{\{x>2\}}(x)\, p(x)\, \mathrm{d}x - \left(\int \mathbb{1}_{\{x>2\}}(x)\, p(x)\, \mathrm{d}x\right)^2 \\
&= 0.1476 - 0.1476^2 = 0.125.
\end{align*}
The variance of the estimator is then
\[
\operatorname{var}(\hat{\varphi}^N) = \frac{0.125}{10} = 0.0125.
\]
Could we do better? An idea is to use the fact that the density is symmetric around zero: this
means P(X > 2) = P(X < −2) (see Fig. 4.6). So we could compute:
\[
I = \frac{1}{2}\int_{|x|>2} p(x)\, \mathrm{d}x = \int \frac{1}{2}\mathbb{1}_{\{|x|>2\}}(x)\, p(x)\, \mathrm{d}x.
\]
Now define the test function
\[
\varphi(x) = \frac{1}{2}\mathbb{1}_{\{|x|>2\}}(x).
\]
As before, we need to compute var_p(ϕ):
\begin{align*}
\operatorname{var}_p(\varphi) &= \int \varphi(x)^2\, p(x)\, \mathrm{d}x - \left(\int \varphi(x)\, p(x)\, \mathrm{d}x\right)^2 \\
&= \frac{1}{4}\int \mathbb{1}_{\{|x|>2\}}\, p(x)\, \mathrm{d}x - \left(\frac{1}{2}\int \mathbb{1}_{\{|x|>2\}}\, p(x)\, \mathrm{d}x\right)^2 \\
&= \frac{1}{4} \times 2 \times 0.1476 - \frac{1}{4}(2 \times 0.1476)^2 \\
&= 0.052.
\end{align*}
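The two estimators can be compared empirically with a short script like the following (the sample size and seed are arbitrary); the estimated variances should be roughly in the ratio computed above.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
x = rng.standard_cauchy(N)

phi1 = (x > 2).astype(float)             # plain indicator estimator
phi2 = 0.5 * (np.abs(x) > 2)             # symmetry-based estimator, same expectation

for name, phi in [("plain", phi1), ("symmetric", phi2)]:
    est = phi.mean()
    var_est = phi.var(ddof=1) / N        # estimated variance of the MC estimator
    print(name, est, var_est)
```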
• A typical problem arises when computing tail probabilities (also called rare events). We
may have access to samples directly from p(x), however, sampling from the tail of p(x)
might be extremely difficult. For example, consider the Gaussian random variable X
with mean 0 and variance 1. The probability of X being larger than 4 is very small, i.e.,
P(X > 4) ≈ 0.00003. Sampling from the tail of this density directly would be very
inefficient without further tricks.
• Another typical scenario is where we want to compute expectations with respect to
p(x) when we do not have direct samples from it. The standard example for this is the
Bayesian setting. Given a prior p(x) and a likelihood p(y|x), we may want to compute the
expectations w.r.t. the posterior density p(x|y), i.e., Ep(x|y) [ϕ(X)]. In this case, we do not
have access to samples from p(x|y) so we need to employ other strategies.
A strategy we will pursue in this section is specific to Monte Carlo integration. In other words, we
will next describe a strategy where we can compute integrals and expectations w.r.t. a probability
density without having access to samples from it. This is slightly different than directly aiming at
sampling from the density (which can also be used to estimate integrals). While we will look at
sampling methods in the following chapters, it is important to note that importance sampling is
primarily an integration technique.
In this section, as opposed to previous sections, we assume that we cannot sample from p directly
(or exactly, e.g., using rejection sampling3 ). However, we assume (in this section) we can evaluate
the density p(x) pointwise. We can still estimate this expectation (and compute integrals more
generally), using samples from an instrumental, proposal distribution q. In other words, we
can sample from a proposal and we can repurpose these samples to estimate expectations w.r.t.
p(x). This resembles the rejection sampling where we have also used a proposal to accept-reject
samples. However, in this case, we will employ a different strategy of weighting samples and will
not throw any of the samples away. The weights we will compute will weight samples so that the
integral estimate gets closer to the true integral. In order to see how to do this, we compute
In order to see how to do this, we compute
\begin{align*}
\bar{\varphi} &= \int \varphi(x)\, p(x)\, \mathrm{d}x \\
&= \int \varphi(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x \qquad \text{(“identity trick”)} \tag{4.11} \\
&= \int \varphi(x)\, w(x)\, q(x)\, \mathrm{d}x, \tag{4.12}
\end{align*}
where w(x) = p(x)/q(x) (which is called the weight function). We know from Section 4.1 that
we can estimate the integral in (4.12) using samples from q. Let Xi ∼ q be i.i.d samples from q
for i = 1, . . . , N. We can then estimate the integral in (4.12), hence the expectation ϕ̄, using
\[
\bar{\varphi} = \int \varphi(x)\, w(x)\, q(x)\, \mathrm{d}x \approx \frac{1}{N}\sum_{i=1}^{N} w_i\, \varphi(X_i) =: \hat{\varphi}^N_{\text{IS}}, \tag{4.13}
\]
3
Recall that rejection sampling draws i.i.d samples from the density, not approximate.
where wi = w(Xi ) = p(Xi )/q(Xi ) are called the weights. The weights will play a crucial role
throughout this section. The key idea of importance sampling is that, instead of throwing away
the samples by rejection, we could reweight them according to their importance. This is why this
strategy is called importance sampling (IS).
The importance sampling algorithm for this case can then be described relatively straightforwardly.
Given p(x) (which we can evaluate), we choose a proposal q(x). Then, we sample
Xi ∼ q(x) for i = 1, . . . , N and compute the IS estimator as
\[
\hat{\varphi}^N_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} w_i\, \varphi(X_i),
\]
where w_i = p(X_i)/q(X_i) for i = 1, . . . , N are the importance weights. We summarise the method in
Algorithm 7. In what follows, we will discuss some details of the method.
Remark 4.4. Unlike rejection sampling, in importance sampling, the proposal does not have
to dominate the target density. Instead, the crucial requirement for IS is that the support of
the proposal should cover the support of the density. More precisely, we need q(x) > 0
whenever p(x) > 0. This is far less restrictive than the requirement of rejection sampling. Of
course, the choice of proposal can still affect the performance of IS. We will discuss this in
more detail.
From Fig. 4.7, one can see an example plot of the target density p(x), the proposal q(x) and
the associated weight function w(x). See the caption for more details and intuition.
Example 4.6. Consider the problem of estimating P(X > 4) for X ∼ N (0, 1). Describe a
potential problem of using the naive MC method.
Solution. While we can sample exactly from this density, given that P(X > 4) is tiny (recall P(X > 4) ≈ 3 × 10⁻⁵),
Figure 4.7: An example of target density p(x), the proposal q(x) and the associated weight function w(x). One can
see that if q(x) < p(x) (which means fewer samples would be drawn from q(x) in this region), then w(x) > 1 to
account for this effect. The opposite is also true, since if q(x) > p(x), this means that we would draw more samples
than necessary, which should be downweighted, hence w(x) < 1 in these regions.
it will be the case that very few of the samples from the exact distribution will fall into this tail (note
that, while we know the exact value in this case, we will not know it in general – this is just
a demonstrative example). In fact, a standard run with N = 10000 gives exactly zero samples
that satisfy Xi > 4, hence provides the estimate zero! It is obvious that this is not a great way
to estimate the probability and we can use importance sampling for this. Consider a proposal
q(x) = N(6, 1). This will draw a lot of samples from the region X > 4 and we can reweight these
samples w.r.t. the target density using the IS estimator in (4.13). A standard run in this case with
N = 10000 results in
\[
\hat{\varphi}^N_{\text{IS}} = 3.18 \times 10^{-5}.
\]
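A minimal sketch of this importance sampling estimator (with the N(6, 1) proposal) is given below; the seed and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
N = 10_000
x = rng.normal(6.0, 1.0, N)                         # samples from the proposal q = N(6, 1)

w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 6.0, 1.0)   # importance weights p(x)/q(x)
phi = (x > 4).astype(float)

is_estimate = np.mean(w * phi)                      # IS estimate of P(X > 4)
naive = np.mean(rng.normal(0.0, 1.0, N) > 4)        # naive MC, almost surely zero here
print("IS:", is_estimate, "naive MC:", naive, "true:", 1 - norm.cdf(4.0))
```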
Proposition 4.3 (Unbiasedness of IS). The IS estimator (4.13) is unbiased, i.e.,
\[
\mathbb{E}_{q(x)}[\hat{\varphi}^N_{\text{IS}}] = \bar{\varphi}.
\]
Proof. We simply write
\begin{align*}
\mathbb{E}_q[\hat{\varphi}^N_{\text{IS}}] &= \mathbb{E}_q\left[\frac{1}{N}\sum_{i=1}^{N} w_i\, \varphi(X_i)\right] \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_q\left[\frac{p(X_i)}{q(X_i)}\varphi(X_i)\right] \\
&= \frac{1}{N}\sum_{i=1}^{N}\int \frac{p(x)}{q(x)}\varphi(x)\, q(x)\, \mathrm{d}x \qquad \text{since } X_i \sim q(x) \\
&= \int \varphi(x)\, p(x)\, \mathrm{d}x = \bar{\varphi},
\end{align*}
which completes the proof.
An important quantity in IS is the variance of the estimator ϕ̂N_IS. The variance of the estimator
is a measure of how much the estimator fluctuates around its expected value. The variance of the
IS estimator (4.13) is given by the following proposition.

Proposition 4.4. The variance of the IS estimator (4.13) is
\[
\operatorname{var}_q[\hat{\varphi}^N_{\text{IS}}] = \frac{1}{N}\left(\mathbb{E}_q[w^2(X)\varphi^2(X)] - \bar{\varphi}^2\right).
\]
Figure 4.8: The importance sampling estimator ϕ̂N_IS is plotted against the number of samples N for the example
in Fig. 4.7, for ϕ(x) = x². This demonstrates that the random error in the IS case also satisfies the O(1/√N)
convergence rate.
An error bound analogous to (4.10) also holds here, i.e.,
\[
|\hat{\varphi}^N_{\text{IS}} - \bar{\varphi}| \leq \frac{V}{\sqrt{N}},
\]
where V is an almost surely finite random variable. As in the perfect MC case, we will not prove
this result as it is beyond our scope, but the curious reader can refer to Corollary 2.2 in Akyildiz (2019)
(which also holds for the self-normalised case which will be introduced below). A demonstration
of this rate for importance sampling can be seen in Fig. 4.8.
We can see that the variance of the IS estimator is finite if
\[
\mathbb{E}_q[w^2(X)\varphi^2(X)] < \infty.
\]
This implies that we need
\[
\int w^2(x)\varphi^2(x)\, q(x)\, \mathrm{d}x = \int \frac{p(x)}{q(x)}\varphi^2(x)\, p(x)\, \mathrm{d}x < \infty.
\]
In other words, for our importance sampling estimate to be well-defined, the ratio
\[
\frac{p^2(x)}{q(x)}\varphi^2(x)
\]
has to be integrable. We will see next an example where this condition is not satisfied.
Example 4.7 (Infinite variance IS, Example 3.8 from Robert and Casella (2010)). Consider the
target
\[
p(x) = \frac{1}{\pi}\frac{1}{1 + x^2},
\]
Figure 4.9: Estimating P(2 < X < 6) where X is Cauchy with q(x) = N (0, 1). The true value is plotted in red
and the estimator value in black.
Remark 4.5 (Optimal proposal). We can try to inspect the variance expression to figure out
which proposals can give us variance reduction. From Prop. 4.4, it follows that we have
\[
\operatorname{var}_q[\hat{\varphi}^N_{\text{IS}}] = \frac{1}{N}\operatorname{var}_q[w(X)\varphi(X)].
\]
This means that minimising the variance of the IS estimator is the same as minimising the variance
of the function w(x)ϕ(x). Moreover, looking at the expression
\[
\operatorname{var}_q[\hat{\varphi}^N_{\text{IS}}] = \frac{1}{N}\left(\mathbb{E}_q[w^2(X)\varphi^2(X)] - \bar{\varphi}^2\right),
\]
we can see that, since ϕ̄² > 0 (which is independent of the proposal), we should choose a proposal
that minimises E_q[w²(X)ϕ²(X)]. We can lower bound this quantity using Jensen's inequality:
\[
\mathbb{E}_q[w^2(X)\varphi^2(X)] \geq \mathbb{E}_q[w(X)|\varphi(X)|]^2,
\]
where we used the fact that (·)² is a convex function (for a convex function f, Jensen's inequality
states that E_q[f(X)] ≥ f(E_q[X])). Using w(x) = p(x)/q(x), we arrive at the following lower
bound:
\[
\mathbb{E}_q[w^2(X)\varphi^2(X)] \geq \mathbb{E}_p[|\varphi(X)|]^2. \tag{4.14}
\]
Now let us expand the term E_q[w²(X)ϕ²(X)] and write
\begin{align*}
\mathbb{E}_q[w^2(X)\varphi^2(X)] &= \mathbb{E}_q\left[\frac{p^2(X)}{q^2(X)}\varphi^2(X)\right] \\
&= \int \frac{p^2(x)}{q^2(x)}\varphi^2(x)\, q(x)\, \mathrm{d}x \\
&= \int \frac{p(x)}{q(x)}\varphi^2(x)\, p(x)\, \mathrm{d}x \\
&= \mathbb{E}_p[w(X)\varphi^2(X)]. \tag{4.15}
\end{align*}
The last equation, eq. (4.15), suggests that we can choose a proposal such that we attain the lower
bound (4.14) (which means that it would be the minimiser). In particular, if we choose a proposal
q(x) such that
\[
w(x) = \frac{p(x)}{q(x)} = \frac{\mathbb{E}_p[|\varphi(X)|]}{|\varphi(x)|}
\]
is satisfied, then (4.15) would be equal to the lower bound (4.14). This implies that
\[
q_\star(x) = p(x)\,\frac{|\varphi(x)|}{\mathbb{E}_p[|\varphi(X)|]} \tag{4.16}
\]
would minimise the variance of the importance sampling estimator.
Choosing q⋆ as the proposal, one can see that the variance of the IS estimator satisfies
\begin{align*}
\operatorname{var}_{q_\star}[\hat{\varphi}^N_{\text{IS}}] &= \frac{1}{N}\mathbb{E}_p[|\varphi(X)|]^2 - \frac{1}{N}\bar{\varphi}^2 \\
&\leq \frac{1}{N}\mathbb{E}_p[\varphi^2(X)] - \frac{1}{N}\bar{\varphi}^2 \\
&= \operatorname{var}_p[\hat{\varphi}^N_{\text{MC}}],
\end{align*}
therefore we obtain
\[
\operatorname{var}_{q_\star}[\hat{\varphi}^N_{\text{IS}}] \leq \operatorname{var}_p[\hat{\varphi}^N_{\text{MC}}],
\]
i.e., a variance reduction. In fact, one can show that, if ϕ(x) ≥ 0 for all x ∈ R, then the variance
of the IS estimator with the optimal proposal q⋆ is equal to zero.
We note that this optimal construction of the proposal (4.16) is not possible to implement in
practice. It requires the knowledge of the very quantity we want to estimate, namely, Ep [|ϕ(X)|]!
But in general, we can choose proposals that minimise the variance of the IS estimator where
possible. This idea has been used in the literature to construct proposals that minimise the
variance of the estimator, see, e.g., Akyildiz and Míguez (2021) and references therein. Within
the context of this course, we will construct some simple examples for this purpose later.
where we use the fact that p(x) = p̄(x)/Z. This gives us two separate integration problems, one
to estimate the numerator and one to estimate the denominator. We will estimate both quantities
using samples from q(x).⁴
Let us now introduce the unnormalised weight function
\[
W(x) = \frac{\bar{p}(x)}{q(x)},
\]
which is analogous to the normalised weight function w(x) in the previous section. Using
Xi ∼ q(x) and building the Monte Carlo estimator of the numerator and the denominator, we arrive
at the following estimator of (4.17):
\begin{align*}
\hat{\varphi}^N_{\text{SNIS}} &= \frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(X_i)\, W(X_i)}{\frac{1}{N}\sum_{i=1}^{N} W(X_i)} \\
&= \frac{\sum_{i=1}^{N}\varphi(X_i)\, W(X_i)}{\sum_{i=1}^{N} W(X_i)} \\
&= \sum_{i=1}^{N}\bar{w}_i\, \varphi(X_i), \tag{4.18}
\end{align*}
4
We do not have to, see, e.g., Lamberti et al. (2018).
Algorithm 8 Pseudocode for self-normalised importance sampling
1: Input: The number of samples N.
2: for i = 1, . . . , N do
3:   Sample Xi ∼ q(x).
4:   Compute the unnormalised weight W(Xi) = p̄(Xi)/q(Xi).
5: end for
6: Normalise:
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)}.
\]
7: Report the estimator
\[
\hat{\varphi}^N_{\text{SNIS}} = \sum_{i=1}^{N}\bar{w}_i\, \varphi(X_i).
\]
where
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)}
\]
are the normalised weights. This estimator (4.18) is called the self-normalised importance sam-
pling (SNIS) estimator.
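A generic sketch of the SNIS estimator of Algorithm 8 could look as follows; the target, proposal, and test function used to exercise it are hypothetical choices.

```python
import numpy as np

def snis(p_bar, q_sample, q_pdf, phi, N, rng):
    """Self-normalised importance sampling estimate of E_p[phi(X)], where only the
    unnormalised density p_bar can be evaluated (cf. Algorithm 8)."""
    x = q_sample(N, rng)                  # X_i ~ q
    W = p_bar(x) / q_pdf(x)               # unnormalised weights
    w_bar = W / W.sum()                   # normalised weights
    return np.sum(w_bar * phi(x)), W.mean()   # estimate and marginal-constant estimate

# Hypothetical example: unnormalised standard Gaussian target, N(1, 2^2) proposal.
rng = np.random.default_rng(7)
p_bar = lambda x: np.exp(-0.5 * x**2)                  # missing constant 1/sqrt(2*pi)
q_sample = lambda n, r: r.normal(1.0, 2.0, n)
q_pdf = lambda x: np.exp(-0.5 * ((x - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
est, Z_hat = snis(p_bar, q_sample, q_pdf, lambda x: x**2, 50_000, rng)
print("E[X^2] estimate:", est, "normalising constant estimate:", Z_hat)
```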
Remark 4.6 (Bias and variance). As opposed to the normalised case, the estimator ϕ̂N_SNIS is biased.
The reason for this bias can be seen by recalling the integral (4.17). By sampling from q(x), we
can construct unbiased estimates of the numerator and the denominator. However, the ratio of these
two quantities is biased in general. Nevertheless, it can be shown that the bias of the SNIS estimator
decreases with a rate O(1/N) (Agapiou et al., 2017).
Since the SNIS estimator is biased, we cannot use the same variance formula as in the previous
section. It makes sense to consider the MSE(ϕ̂N_SNIS) instead. However, this quantity is challenging
to control in general without bounded test functions. With bounded test functions, it is possible
to show that the MSE(ϕ̂N_SNIS) is controlled with a rate O(1/N) (Agapiou et al., 2017; Akyildiz
and Míguez, 2021). We will not go into the details of this result here.
We can now describe the estimation procedure using SNIS. Given an unnormalised density
p̄(x), we first sample N samples from a proposal, X1, . . . , XN ∼ q(x), and then compute the
normalised weights
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)},
\]
where W(x) = p̄(x)/q(x). Finally, we compute the estimator
\[
\hat{\varphi}^N_{\text{SNIS}} = \sum_{i=1}^{N}\bar{w}_i\, \varphi(X_i).
\]
Assume that we choose q(x) and decide to perform SNIS. We first sample X1, . . . , XN ∼ q(x)
and construct
\[
W_i = \frac{\bar{p}(X_i \mid y)}{q(X_i)} = \frac{p(y \mid X_i)\, p(X_i)}{q(X_i)}.
\]
We can now normalise these weights and obtain
\[
\bar{w}_i = \frac{W_i}{\sum_{j=1}^{N} W_j}.
\]
It is useful to recall that this estimator is biased, since it is an SNIS estimator. However, as a
byproduct of this estimator, we can also obtain an unbiased estimate of the marginal likelihood
p(y). Note that this is already provided by the SNIS procedure:
\[
p(y) \approx \frac{1}{N}\sum_{i=1}^{N} W_i.
\]
This estimate is unbiased, as we show next.
Proof. We can easily show this:
\begin{align*}
\mathbb{E}_q\left[\sum_{i=1}^{N} W_i\right] &= \mathbb{E}_q\left[\sum_{i=1}^{N}\frac{p(y \mid X_i)\, p(X_i)}{q(X_i)}\right] \\
&= N\, \mathbb{E}_q\left[\frac{p(y \mid X)\, p(X)}{q(X)}\right] \\
&= N\int \frac{p(y \mid x)\, p(x)}{q(x)}\, q(x)\, \mathrm{d}x \\
&= N\, p(y).
\end{align*}
As we have seen before, a number of interesting problems require computing normalising
constants, including model selection and prediction. SNIS estimators are very useful in the sense
that they provide an unbiased estimate of it. Let us now look at an example using the optimal
importance proposal as introduced in Remark 4.5.
Example 4.8 (Marginal likelihood using optimal importance proposal). We have seen that
we can get unbiased estimates of the marginal likelihood. Find the optimal proposal q? for
estimating the marginal likelihood.
Solution. We will now see how to find the optimal importance proposal to compute the marginal
likelihood. Note that we have
\[
p(y) = \int p(y \mid x)\, p(x)\, \mathrm{d}x,
\]
for some prior p(x) and likelihood p(y|x). Note, as we mentioned before, that in this case p(x) can
be seen as the distribution to sample from and ϕ(x) = p(y|x), so that we obtain the standard problem of
integrating ∫ϕ(x)p(x)dx. A naive way to approximate this quantity (as we have seen before) is
to sample i.i.d from p(x) and approximate the integral, i.e., X1, . . . , XN ∼ p(x) and write
\[
p^N_{\text{MC}}(y) = \frac{1}{N}\sum_{i=1}^{N}\varphi(X_i) = \frac{1}{N}\sum_{i=1}^{N} p(y \mid X_i).
\]
Recall from Remark 4.5 that the optimal proposal is
\[
q_\star(x) = p(x)\,\frac{|\varphi(x)|}{\mathbb{E}_p[|\varphi(X)|]}.
\]
In this case, however, we have ϕ(x) = p(y|x) (and |ϕ(x)| = ϕ(x) since the likelihood is positive
everywhere). We can now write
\[
q_\star(x) = p(x)\,\frac{p(y \mid x)}{\mathbb{E}_p[p(y \mid X)]} = p(x)\,\frac{p(y \mid x)}{p(y)}.
\]
In other words, the optimal proposal is the posterior itself! Now we can compute the IS estimator
variance where we plug in q⋆ = p(x|y). To explore the variance, we write
\[
\operatorname{var}_{q_\star}[p^N_{\text{IS}}(y)] = \frac{1}{N}\left(\mathbb{E}_{q_\star}\left[\left(\frac{p(y \mid x)\, p(x)}{q_\star(x)}\right)^2\right] - p(y)^2\right).
\]
Since q⋆(x) = p(y|x)p(x)/p(y), the ratio inside the expectation is constant and equal to p(y). Plugging
this back into the above variance expression, we obtain
\[
\operatorname{var}_{q_\star}[p^N_{\text{IS}}(y)] = \frac{1}{N}\left(p(y)^2 - p(y)^2\right) = 0.
\]
It can be seen that we can achieve zero variance, but, as we mentioned before, this required us to
know the posterior density.
Figure 4.10: The density of the exponential distribution pλ , the proposal qµ and K = 6
Consider a parametric family of proposals qθ(x) with weight function
\[
w_\theta(x) = \frac{p(x)}{q_\theta(x)}.
\]
Note that 1/N does not change the location of the minimum and ϕ̄² is independent of θ. Therefore,
we drop these terms and see that, in order to minimise the variance w.r.t. θ, we need to solve the
following optimisation problem:
\[
\theta_\star = \operatorname*{argmin}_\theta\ \mathbb{E}_{q_\theta}\left[w_\theta^2(X)\,\varphi^2(X)\right].
\]
Assume that our target is the exponential density
\[
p_\lambda(x) = \lambda \exp(-\lambda x),
\]
and we want to compute P(X > K). For example, for λ = 2 and K = 6, we can analytically compute
P(X > 6) = e^{-12} ≈ 6.144 × 10⁻⁶. Therefore, the standard MC estimator would be very inefficient
for computing this probability. In order to mitigate the problem, we would like to use another
exponential proposal qµ(x) = µ exp(−µx), which may have a higher probability concentration around
K. We would like to design our proposal using the minimum variance criterion (see Remark 4.5). Find
the optimal µ⋆ that would make the estimator variance minimal.
In this case, note that we have ϕ(x) = 1_{\{x>K\}}(x). In order to proceed, we write
\begin{align*}
\mathbb{E}_{q_\mu}[w^2(X)\varphi^2(X)] &= \int \frac{p_\lambda(x)^2}{q_\mu(x)^2}\, q_\mu(x)\, \varphi^2(x)\, \mathrm{d}x \\
&= \int_K^\infty \frac{p_\lambda(x)}{q_\mu(x)}\, p_\lambda(x)\, \mathrm{d}x \\
&= \int_K^\infty \frac{\lambda^2 e^{-2\lambda x}}{\mu e^{-\mu x}}\, \mathrm{d}x \\
&= \frac{\lambda^2}{\mu}\int_K^\infty e^{-(2\lambda - \mu)x}\, \mathrm{d}x.
\end{align*}
Note at this stage that, in order for this integral to be finite, we need 2λ − µ > 0. Therefore, we
restrict µ ∈ (0, 2λ). In order to compute the integral, we can multiply and divide by (2λ − µ) and
obtain
\[
\mathbb{E}_{q_\mu}[w^2(X)\varphi^2(X)] = \frac{\lambda^2}{\mu(2\lambda - \mu)}\int_K^\infty (2\lambda - \mu)e^{-(2\lambda - \mu)x}\, \mathrm{d}x,
\]
and using the CDF of the exponential distribution, we obtain
\[
g(\mu) = \mathbb{E}_{q_\mu}[w^2(X)\varphi^2(X)] = \frac{\lambda^2}{\mu(2\lambda - \mu)}\left[1 - \left(1 - e^{-(2\lambda - \mu)K}\right)\right] = \frac{\lambda^2}{\mu(2\lambda - \mu)}\, e^{-(2\lambda - \mu)K}. \tag{4.19}
\]
Now we optimise g(µ) w.r.t. µ. As usual, we first compute the log (and drop the terms unrelated to
µ as they will not matter in the optimisation):
\[
\log g(\mu) = c - \log\mu - \log(2\lambda - \mu) + \mu K.
\]
Computing the derivative,
\[
\frac{\mathrm{d}}{\mathrm{d}\mu}\log g(\mu) = -\frac{1}{\mu} + \frac{1}{2\lambda - \mu} + K,
\]
and setting this to zero, we obtain
\begin{align*}
-\frac{1}{\mu} + \frac{1}{2\lambda - \mu} + K &= 0, \\
\Rightarrow\ -(2\lambda - \mu) + \mu + K\mu(2\lambda - \mu) &= 0, \\
\Rightarrow\ K\mu^2 - 2K\mu\lambda + 2\lambda - 2\mu &= 0, \\
\Rightarrow\ K\mu^2 - 2(K\lambda + 1)\mu + 2\lambda &= 0.
\end{align*}
This is a quadratic equation, therefore we will have two solutions:
\[
\mu = \frac{2(K\lambda + 1) \pm \sqrt{(2K\lambda + 2)^2 - 8K\lambda}}{2K} = \frac{2(K\lambda + 1) \pm \sqrt{4K^2\lambda^2 + 4}}{2K}.
\]
If we inspect this solution, choosing µ to be the root with the plus sign gives µ > 2λ, which violates
the condition we imposed for the integral to be finite. Therefore, we arrive at
\[
\mu_\star = \frac{2(K\lambda + 1) - \sqrt{4K^2\lambda^2 + 4}}{2K}.
\]
After this tedious computation, we can now verify the reduction in variance and the estimation
quality. Let us now set K = 6 and λ = 2. See Fig. 4.10 for a plot of pλ, K = 6 and qµ⋆ (the optimal
exponential proposal). We can see that the proposal puts much higher mass to the right of K.
A standard run with N = 10⁵ samples gives us zero samples in the region X > 6, therefore
the standard MC estimate is zero! Compared to ϕ̂N = 0, using qµ⋆ as a proposal we obtain
ϕ̂N_IS = 6.08 × 10⁻⁶, which is a much better estimate.
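A sketch of this importance sampler with the optimal rate µ⋆ derived above:

```python
import numpy as np

lam, K = 2.0, 6.0
# Optimal proposal rate derived above: mu* = (2(K*lam + 1) - sqrt(4 K^2 lam^2 + 4)) / (2K)
mu_star = (2 * (K * lam + 1) - np.sqrt(4 * K**2 * lam**2 + 4)) / (2 * K)

rng = np.random.default_rng(8)
N = 100_000
x = rng.exponential(1.0 / mu_star, N)              # samples from q_mu (numpy uses the scale 1/rate)

w = (lam * np.exp(-lam * x)) / (mu_star * np.exp(-mu_star * x))   # weights p_lambda / q_mu
phi = (x > K).astype(float)
print("IS estimate:", np.mean(w * phi), "true value:", np.exp(-lam * K))
```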
Let us compare the theoretical variances of the two estimators. The variance of the plain MC estimator ϕ̂N is
\[
\operatorname{var}_p(\hat{\varphi}^N) = \frac{1}{N}\operatorname{var}_p(\varphi(X)),
\]
where
\[
\operatorname{var}_p(\varphi(X)) = \int \varphi(x)^2 p_\lambda(x)\, \mathrm{d}x - \left(\int \varphi(x) p_\lambda(x)\, \mathrm{d}x\right)^2 = \int_K^\infty p_\lambda(x)\, \mathrm{d}x - \left(\int_K^\infty p_\lambda(x)\, \mathrm{d}x\right)^2.
\]
Using CDFs, we can compute this quantity and hence obtain the variance of ϕ̂N. Now set µ = µ⋆. The
variance of ϕ̂N_IS is given by (see Prop. 4.4)
\[
\operatorname{var}_q[\hat{\varphi}^N_{\text{IS}}] = \frac{1}{N}\left(\mathbb{E}_q[w^2(X)\varphi^2(X)] - \bar{\varphi}^2\right).
\]
We have already computed the term E_q[w²(X)ϕ²(X)] in Eq. (4.19). The second term is the true
integral, which we also summarised how to compute above, i.e., ϕ̄ = ∫_K^∞ pλ(x)dx, which can be
computed using the CDF.
4.6 implementation, algorithms, diagnostics
When implementing the IS or SNIS, there are several numerical considerations that need to be
taken into account. Especially for SNIS, where the weight normalisation takes place, several
numerical problems can arise for complex distributions that would prevent us from implementing
them successfully.
A first issue is that the unnormalised weights W(Xi) can easily overflow or underflow. For this
reason, we work with log-weights and subtract their maximum, i.e., we compute
\[
\log \widetilde{W}_i = \log W(X_i) - \max_{j} \log W(X_j).
\]
This ensures that the maximum (shifted) log-weight is 0 and all the others are negative. We can now
exponentiate the weights and normalise them:
\[
\bar{w}_i = \frac{\exp(\log \widetilde{W}_i)}{\sum_{j=1}^{N}\exp(\log \widetilde{W}_j)}.
\]
Note that this does not change the result of the computation; it is done purely for numerical stability.
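A minimal sketch of this normalisation step, applied to log-weights that would underflow if exponentiated directly:

```python
import numpy as np

def normalise_log_weights(log_W):
    """Normalise weights given on the log scale, subtracting the maximum first
    so that exponentiation cannot overflow (the largest shifted log-weight is 0)."""
    log_W = np.asarray(log_W, dtype=float)
    shifted = log_W - np.max(log_W)       # max weight becomes 0, the rest are negative
    w = np.exp(shifted)
    return w / w.sum()

# Example with log-weights far outside the range where exp() is representable.
print(normalise_log_weights([-1000.0, -1002.0, -1001.0]))
```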
A second useful observation is that the weighted samples define an empirical distribution
\[
p^N(x) = \sum_{i=1}^{N}\bar{w}_i\, \delta_{X_i}(x).
\]
This pN can be seen as a weighted distribution. By drawing samples from this distribution, we
may also approximately sample from p(x) (recall that the IS-based ideas here have so far been introduced only for
integration). We can then draw⁵
\[
k \sim \text{Discrete}(\bar{w}_1, \ldots, \bar{w}_N),
\]
⁵ Note that here the weights w̄i are normalised. Even in the basic IS case, we need to normalise the weights (just for
resampling) as they do not naturally sum up to one.
Figure 4.11: Top left shows the target density, the proposal, and the weight function. Top right shows the samples
with their respective weights. Bottom left shows that these samples are indeed approximately distributed according to q(x)
(just with attached weights). Bottom right shows that we can resample these existing samples to obtain a new set of
samples X̄i that are distributed (approximately) according to p(x).
and set X̄i = Xk.
This amounts to resampling the existing samples w.r.t. their weights. A demonstration of this
idea can be seen in Figure 4.11.
Definition 4.1 (Effective Sample Size). To measure the sample efficiency, one measure that is used
in the literature is the effective sample size (ESS), which is given by
\[
\text{ESS}^N = \frac{1}{\sum_{i=1}^{N}\bar{w}_i^2}.
\]
In order to see the meaning of the ESS, consider the case where w̄i = 1/N, i.e., we have an
equally weighted sample. This means all samples are equally considered, and in this case we have
ESS^N = N. On the other hand, if we have a sample Xi with w̄i = 1 and, hence, w̄j = 0 for
every j ≠ i, we obtain ESS^N = 1. This means we effectively have a single sample, which defeats
the purpose of using N samples in the estimator. The ESS is used to assess importance samplers and
importance sampling-based estimators in the literature (Elvira et al., 2018). Note that ESS^N takes
values between 1 and N, i.e., 1 ≤ ESS^N ≤ N.
One can also use a mixture proposal of the form
\[
q_\alpha(x) = \sum_{k=1}^{K}\alpha_k\, q_k(x),
\]
where αk ≥ 0 and \sum_{k=1}^{K} αk = 1. In this version of the method, we just sample from the mixture
proposal Xi ∼ qα(x) and then, given an unnormalised target p̄, compute the importance weights
as
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)}, \qquad \text{where} \qquad W(X_i) = \frac{\bar{p}(X_i)}{\sum_{k=1}^{K}\alpha_k q_k(X_i)}.
\]
Computational concerns may arise in this situation too, as the denominator is a sum of
densities and its log can be tricky to compute. In these cases, we can use the log-sum-exp trick to
compute the log of the denominator.
5 MARKOV CHAIN MONTE CARLO
In this chapter, we introduce Markov chains and then Markov Chain Monte Carlo (MCMC) methods.
These methods are at the heart of Bayesian statistics, generative modelling, statistical physics, and
many other fields. We will introduce the Metropolis-Hastings algorithm and then introduce the
celebrated Gibbs sampler and, if time permits, some others.
In this chapter, we introduce a new sampling methodology - namely using Markov chains
for sampling problems. This is a very powerful and widely used idea in statistics and machine
learning. The idea is to set up Markov chains with prescribed stationary distributions. These
distributions will be our target distributions.
In this chapter, we will adapt our notation and modify it to suit the new setting. From now
on, we denote stationary/invariant distributions of Markov chains (which also coincide with
our target distributions) as p⋆. We will introduce discrete space Markov chains next.
Definition 5.1 (Markov chain). A discrete Markov chain is a sequence of random variables
X0, X1, . . . , Xn such that
\[
\mathbb{P}(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = \mathbb{P}(X_n = x_n \mid X_{n-1} = x_{n-1}).
\]
In other words, a Markov chain is a sequence of random variables such that a given state at
time n is conditionally independent of all previous states given the state at time n − 1. One can
see that this describes many systems in the real world – as evolution of many systems can be
summarised with the current state of the system and the evolution law.
An important quantity in the study of Markov chains is the transition matrix (or kernel in the
continuous space case). This matrix defines the evolution structure of the chain and determines
all of its properties. The transition matrix is defined as follows.
Definition 5.2 (Transition matrix). The transition matrix of a Markov chain is a matrix M such
that
\[
M_{ij} = \mathbb{P}(X_n = j \mid X_{n-1} = i).
\]
A usual way to depict Markov chains is the following conditional independence structure,
which sums up the structure of the Markov chain:
\[
X_0 \to X_1 \to X_2 \to \cdots \to X_n.
\]
We note that we will only consider the case where the transition matrix is time-homogeneous, i.e., the
transition matrix is the same for all times. We can see then that a Markov chain's behaviour is completely
determined by its initial distribution and the transition matrix. We will denote the initial
distribution of the chain as p0 and note that this is a discrete distribution over the state space X (in this
case). The transition matrix M is a matrix of size d × d where d = |X|.
Example 5.1 (Discrete space Markov chain). Consider the state transition diagram
[State transition diagram of a three-state Markov chain on {1, 2, 3}; the edge labels give the transition probabilities.]
of a Markov chain. Write out the transition matrix of this chain and describe the simulation
procedure.

Solution. The transition matrix M is read off the diagram: its (i, j)th entry is the probability attached
to the edge from state i to state j (one of its rows, for instance, reads (0, 0.3, 0.7)). To simulate the
chain, we first sample X0 ∼ p0 and then, at each step t, sample Xt ∼ M_{x_{t−1},·},
where the notation M_{x_{t−1},·} denotes the x_{t−1}th row of the transition matrix M (where x_{t−1} ∈
{1, 2, 3} naturally).
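A sketch of this simulation procedure is given below; the transition matrix used is a hypothetical three-state example (any row-stochastic matrix, such as the one read off the diagram, can be substituted).

```python
import numpy as np

# A hypothetical 3-state transition matrix (rows sum to one).
M = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.0, 0.3, 0.7]])
p0 = np.array([1.0, 0.0, 0.0])            # start deterministically in the first state

rng = np.random.default_rng(9)
n_steps = 10_000
x = rng.choice(3, p=p0)
states = [x]
for _ in range(n_steps):
    x = rng.choice(3, p=M[x])             # sample the next state from row M[x, :]
    states.append(x)

# Empirical state frequencies approach the invariant distribution for large n
print(np.bincount(states, minlength=3) / len(states))
```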
Markov chains can be seen everywhere in the real world, hence we will skip their applications.
But one exciting example can be provided from the field of language modelling. Let us describe a
character-level Markov model of English.
Example 5.2 (Character-level Markov model of English). Consider the task of modelling
an English language text as a Markov model. Describe a possible way of doing this on a
character-level.
Solution. We can model the English language as a Markov chain on a character-level. In this
case, we can define the state space as X = {a, b, . . . , z, .} (where for simplicity the character
dot models the space character and no other special characters are considered). We can then
define the transition matrix as follows. Let xt−1 be the previous character and xt be the current
character. Of course, writing out the transition matrix by hand is an impossible task, hence we
can learn it from data. For this, we simply choose a large corpus of English text and count the
number of times a character is followed by another character. We have provided an example code
for this in our online companion. By estimating the transition matrix from data, we can then
simulate English text by sampling from this Markov chain. For example, one simulated text reads
as
g.l.se.t.bin.s.lese.ry..wolked.t.hered.e.ly.hr.impefatrt.mofe.mouroreand
and goes on (see the lecture for a longer simulation). We can see that while this captures some very
vague structure, it is not a good model of English. However, this character-level model can still
be useful for some applications, because we can use it to estimate the probability of a given text.
For example, we can compute the probability of a string x1:n as
\[
\mathbb{P}(x_{1:n}) = p_0(x_1)\prod_{t=2}^{n} M_{x_{t-1}, x_t},
\]
by breaking the text down into characters and ‘reading’ the transition matrix. This is a very
simple example of a Markov chain model of English.
Let us look at a more intelligent way of using Markov chains to model language.
Example 5.3 (Word-level Markov model of English). Consider the task of modelling an English
language text as a Markov model. Describe a possible way of doing this on a word-level.
Solution. We can model the English language as a Markov chain on a word-level. As you can
see, there is no single state-space for this (as we can expand our word list (state-space) possibly
indefinitely). A good idea is to pick a book or a large text corpus and count all possible word
transitions. Note that modern language models do not use words, but rather sub-word units,
called tokens. Here we will stick to words and we will pick the book ‘A Study in Scarlet’ by Sir
Arthur Conan Doyle. We have provided an example code for this in our online companion. By
estimating the transition matrix from data, we can then simulate English text by sampling from
this Markov chain. For example, one simulated text reads as
Sherlock Holmes, “because you will place where you remember, in his
cudgel, only one in summer are grey with a vague shadowy terrors which
hung over my hand, and eyeing the strange contrast to the name a walking
in time before twelve, you are.”
While this is much more intelligible and seems to follow some English rules, it is nothing but
a simulation from a Markov chain given a transition matrix as estimated in our online companion.
Modern language models, however, do not use Markov models either – their structure is much
more complex and they are based on neural networks.
Therefore, M^{(n)} = M^n, the nth power of the transition matrix, where M^{(n)}_{ij} = P(Xn = j | X0 = i)
denotes the n-step transition probabilities. Note that we can compute, in general, the conditional
distributions of the Markov chain by summing out the variables in the middle. For example, in order to
compute P(Xn+2 = xn+2 | Xn = xn), we can write
\[
\mathbb{P}(X_{n+2} = x_{n+2} \mid X_n = x_n) = \sum_{x_{n+1}}\mathbb{P}(X_{n+2} = x_{n+2} \mid X_{n+1} = x_{n+1})\,\mathbb{P}(X_{n+1} = x_{n+1} \mid X_n = x_n).
\]
This leads us to the Chapman–Kolmogorov equation, which is a generalisation of the
n-step transition matrix:
\begin{align*}
M^{(m+n)}_{ij} = \mathbb{P}(X_{m+n} = j \mid X_0 = i) &= \sum_{k}\mathbb{P}(X_{m+n} = j \mid X_n = k)\,\mathbb{P}(X_n = k \mid X_0 = i) \\
&= \sum_{k} M^{(n)}_{ik} M^{(m)}_{kj}.
\end{align*}
Therefore, we can write M^{(m+n)} = M^m M^n.
It is also important to define the evolution of the chain. Note that we defined our initial
distribution as p0, and it is important to quantify how this distribution evolves over time. We
denote the distribution at time n as pn. Then, the distribution of the chain at time n is given by:
\begin{align*}
p_n(i) = \mathbb{P}(X_n = i) &= \sum_{k}\mathbb{P}(X_n = i, X_{n-1} = k) \\
&= \sum_{k}\mathbb{P}(X_n = i \mid X_{n-1} = k)\,\mathbb{P}(X_{n-1} = k) \\
&= \sum_{k} M_{ki}\, p_{n-1}(k).
\end{align*}
This implies that
pn = pn−1 M.
Therefore,
pn = p0 M n .
These are important equations, which will have corresponding equations in the continuous case
(however, they will be integrals).
Since we have expressed our interest in Markov chains because of their potential utility in
sampling, we will now discuss the properties we need to ensure that we can use Markov chains
for sampling. In short, we need Markov chains that have (i) invariant distributions, (ii) their
convergence to invariant distributions are ensured, (iii) the invariant distribution is unique. We
will now discuss the properties we need to ensure these in detail.
5.1.1 irreducibility
The first property we need to ensure is that the Markov chain is irreducible. This means that
there is a path from any state to any other state. To be precise, let x, x′ ∈ X be any two states.
We write x ⇝ x′ if there is a path from x to x′, i.e., if there exists n ≥ 1 such that P(Xn = x′ | X0 = x) > 0.
If x ⇝ x′ and x′ ⇝ x, then we say that x and x′ communicate. We then define a communication
class C ⊂ X, which is a set of states such that x ∈ C and x′ ∈ C if and only if x ⇝ x′ and x′ ⇝ x.
A chain is irreducible if X is a single communication class. This simply means that there is a
positive probability of moving from any state to every other state. This makes sense: without ensuring
this, we would not be sampling from the full support.
5.1.2 recurrence
We also need the notion of recurrence. Define the first return time to state i as
\[
\tau_i = \inf\{n \geq 1 : X_n = i\}.
\]
A state i is called recurrent if P(τi < ∞ | X0 = i) = 1; in other words, the probability of the waiting
time being finite is 1. If a chain is not recurrent, it is said to be transient. We can also further define
positive recurrence, which is a slightly stronger (better) condition. We say that i is positively recurrent if
\[
\mathbb{E}[\tau_i \mid X_0 = i] < \infty.
\]
This means that the expected waiting time is finite. If a chain is recurrent but not positively
recurrent, then it is called null recurrent.
5.1.3 invariant distributions
In the discrete time case, a distribution p⋆ is called invariant if
\[
p_\star = p_\star M.
\]
This means that the chain has reached stationarity, i.e., evolving it further (via M) does not change the
distribution. We then have the following theorem (Yıldırım, 2017).
Theorem 5.1. If M is irreducible, then M has a unique invariant distribution if and only if it is
positive recurrent.
This is encouraging; however, for actual convergence of the chain to this distribution, we will
need more conditions.
A related and useful condition is detailed balance: we say that M satisfies detailed balance w.r.t. p⋆ if
\[
p_\star(i)\, M_{ij} = p_\star(j)\, M_{ji} \quad \text{for all } i, j \in \mathsf{X}.
\]
This trivially implies that p⋆ = p⋆M, hence the invariance of p⋆. We will have a more detailed
discussion of this condition in the continuous state space case.
A state i is called aperiodic if the set of possible return times
\[
\{n \geq 1 : \mathbb{P}(X_n = i \mid X_0 = i) > 0\}
\]
has no common divisor other than 1. A Markov chain is called aperiodic if all states are aperiodic.
An irreducible Markov chain is called ergodic if it is positive recurrent and aperiodic. If (Xn)n∈N
is an ergodic Markov chain with initial distribution p0 and invariant distribution p⋆, then pn → p⋆ as n → ∞.
Moreover, for i, j ∈ X,
\[
M^{(n)}_{ij} = \mathbb{P}(X_n = j \mid X_0 = i) \to p_\star(j) \quad \text{as } n \to \infty.
\]
In other words, the chain will converge to its invariant distribution from every state.
5.2 continuous state space markov chains
Our main interest is in the continuous case. However, it is important to understand the definitions
above – as we will not go into analogous definitions in the continuous case. The reason for this
is that, in continuous cases, the individual states have zero probability (i.e. a point has zero
probability) and all the notions above are defined using sets and measure theoretic concepts. We
focus on simulation methods within this course, therefore, we will not go into reviewing this
material. A couple of very good books for this are Douc et al. (2018) and Douc et al. (2013).
Let X be an uncountable set from now on, e.g., X = R or X = Rd . We denote the initial
density as p0 (x) as usual, the transition kernel with K(x|x0 ), the marginal density of the chain at
time n as pn (x).
We can write the Markov property in this case as follows. For any measurable set A,
\[
\mathbb{P}(X_n \in A \mid X_{0:n-1} = x_{0:n-1}) = \mathbb{P}(X_n \in A \mid X_{n-1} = x_{n-1}).
\]
This implies that if we write down the joint distribution of X_{0:n}, then the following factorisation
holds:
\[
p(x_0, \ldots, x_n) = \prod_{k=0}^{n} p(x_k \mid x_{k-1}),
\]
where p(x0|x−1) := p0(x0). We also assume that the transition kernel has a density, which we
denote as K(xn|xn−1) at time n. Similarly to the discrete case, we will assume that the kernel is
time-homogeneous (i.e. the same for every n). Note that the transition density is a density in its first
variable, i.e.,
\[
\int_{\mathsf{X}} K(x_n \mid x_{n-1})\, \mathrm{d}x_n = 1.
\]
Example 5.4 (Simulation of a Markov process). Consider the following Markov chain with
X0 = 0 and transition kernel
\[
K(x_n \mid x_{n-1}) = \mathcal{N}(x_n; a\, x_{n-1}, 1), \tag{5.1}
\]
with 0 < a < 1. Describe the simulation procedure for this chain in terms of a recursion.
X1 ∼ N (0, 1)
X2 ∼ N (aX1 , 1)
X3 ∼ N (aX2 , 1)
..
.
Xn ∼ N (aXn−1 , 1).
How do we simulate this? We note that Eq. (5.1) can also be expressed as
\[
X_n = a X_{n-1} + \epsilon_n,
\]
where εn ∼ N(0, 1). This is also called an AR(1) process. From the last equation, it must be clear
how to simulate this, as you only need a for loop and samples from N(0, 1).
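A minimal sketch of this recursion, together with an empirical check of its long-run variance (which, for |a| < 1, is 1/(1 − a²)):

```python
import numpy as np

a, n_steps = 0.9, 10_000
rng = np.random.default_rng(10)

x = np.zeros(n_steps + 1)                 # X_0 = 0
for n in range(1, n_steps + 1):
    x[n] = a * x[n - 1] + rng.normal()    # X_n = a X_{n-1} + eps_n, eps_n ~ N(0, 1)

# For |a| < 1 the chain has invariant distribution N(0, 1 / (1 - a^2))
print("empirical variance:", x[1000:].var(), "theoretical:", 1.0 / (1.0 - a**2))
```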
Similar to the discrete case, we can define the distribution of Xn given a past variable
Xn−k by integrating out the variables in between. It is important to note that Xn is independent
of the past variables if (and this is a big if) Xn−1 = xn−1 is given. Otherwise, we can write down the
densities as
\[
p(x_n \mid x_{n-k}) = \int \cdots \int K(x_n \mid x_{n-1})\, K(x_{n-1} \mid x_{n-2}) \cdots K(x_{n-k+1} \mid x_{n-k})\, \mathrm{d}x_{n-1} \cdots \mathrm{d}x_{n-k+1}.
\]
We now provide the definition of invariance in this context, w.r.t. the transition kernel. A density
p⋆ is said to be K-invariant if
\[
p_\star(x') = \int K(x' \mid x)\, p_\star(x)\, \mathrm{d}x. \tag{5.2}
\]
It can be seen that p⋆ being invariant means that the kernel operating on p⋆ results in the
same distribution p⋆ (the integral against the kernel can be seen as a transformation, similar to
the matrix product in the discrete case). Finally, we get to the detailed balance condition.
Definition 5.4 (Detailed balance). A transition kernel K is said to satisfy detailed balance with respect to p⋆ if
\[
p_\star(x)\, K(x' \mid x) = p_\star(x')\, K(x \mid x').
\]
Proposition 5.1 (Detailed balance implies stationarity). If K satisfies detailed balance, then p? is
the invariant distribution.
Proof. The proof is a one-liner:
\[
\int p_\star(x)\, K(x' \mid x)\, \mathrm{d}x' = \int p_\star(x')\, K(x \mid x')\, \mathrm{d}x',
\]
which is just integrating both sides of the detailed balance condition over x′. The l.h.s. of this
equation is p⋆(x), since K(x′|x) integrates to 1, which leaves us with the definition of K-invariance
as given in (5.2).
Let us see an example of a continuous space Markov chain (or rather go back to the AR(1)
example).

Example 5.5. Consider again the Markov chain with the transition kernel
\[
K(x_n \mid x_{n-1}) = \mathcal{N}(x_n; a\, x_{n-1}, 1),
\]
which we can also describe, as mentioned before, via the recursion
\[
X_n = a X_{n-1} + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, 1).
\]
Show that the m-step transition kernel is given by
\[
K^{(m)}(x_{m+n} \mid x_n) = \mathcal{N}\!\left(x_{m+n};\ a^m x_n,\ \frac{1 - a^{2m}}{1 - a^2}\right).
\]
Solution. The proofs of these results will be asked as exercises (as usual, solutions will be posted).
We can note that the last result trivially implies that
\[
K^{(m)}(x \mid x_0) \to \mathcal{N}\!\left(x;\ 0,\ \frac{1}{1 - a^2}\right) \quad \text{as } m \to \infty,
\]
for any x0. In other words, starting from any x0, the chain will reach stationarity.
designing Markov kernels for specific probability distributions and provides a generic way to
design samplers that will target any measure we want. The algorithm relies on the idea of using
local proposals q(x0 |x) and accepting them with a certain acceptance ratio. The acceptance ratio
is designed so that the resulting samples X1 , . . . , Xn from the method form a Markov chain
that leaves p⋆ invariant. We will provide the algorithm below, as seen from Algorithm 9, where the acceptance probability is
α(Xn−1, X′) = min{1, p⋆(X′) q(Xn−1|X′) / (p⋆(Xn−1) q(X′|Xn−1))}.
Note, as mentioned in the lecture, the last step of the method: when a sample is rejected, we do not
sample again – we set Xn = Xn−1 and continue sampling the next sample. This means that, if the
rejection rate is high, there will be a lot of duplicated samples, and this is the expected behaviour.
Another important note is about the burnin period. Any Markov chain started at a random point
will take some time to reach stationarity (the whole magic is to be able to make them converge
faster). Therefore, we discard the first burnin samples and only return the remaining ones. This
is a common practice in MCMC methods.
We define the acceptance ratio as
r(x, x′) = p⋆(x′) q(x|x′) / (p⋆(x) q(x′|x)). (5.4)
We also note that, in the practical algorithm, one does not need to implement the min operation.
To accept with a certain probability (as in rejection sampling), we draw U ∼ Unif(0, 1)
and check whether U ≤ α(Xn−1, X′). If the ratio r(Xn−1, X′) is greater than 1, the sample
is always going to be accepted anyway. The min operation is, however, important for the theoretical
properties of the kernel to hold.
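As a concrete illustration, a minimal random walk Metropolis–Hastings sampler might look as follows (a sketch; the target log-density, proposal scale, and burn-in length are placeholder choices):

```python
import numpy as np

def random_walk_mh(log_p_bar, x0, n_samples=10_000, sigma_q=1.0, burnin=1000, rng=None):
    """Random walk MH: propose x' ~ N(x, sigma_q^2) and accept with probability min(1, r)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_samples)
    for n in range(n_samples):
        x_prop = x + sigma_q * rng.standard_normal()
        # log acceptance ratio; the symmetric proposal terms cancel
        log_r = log_p_bar(x_prop) - log_p_bar(x)
        if np.log(rng.uniform()) < log_r:   # no explicit min needed in practice
            x = x_prop
        chain[n] = x                        # on rejection we duplicate the current state
    return chain[burnin:]                   # discard the burn-in period

# Example usage with an unnormalised Gaussian target N(1, 2^2)
samples = random_walk_mh(lambda x: -0.5 * (x - 1.0) ** 2 / 4.0, x0=0.0)
```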
As we mentioned above, the algorithm provides us with an implicit kernel K(xn |xn−1 ) – if
you think about it, it is just a way to get Xn given Xn−1 . The specific structure of the algorithm,
however, ensures that we leave the right kind of distribution invariant – i.e. p? – that is our target
measure. We elucidate this in the following proposition.
Proposition 5.2. The kernel induced by the Metropolis–Hastings algorithm satisfies detailed balance w.r.t. the target p⋆, and hence leaves p⋆ invariant.
Proof. We first define the kernel induced by the MH algorithm. This can be seen by inspecting
the algorithm:
K(x′|x) = α(x, x′) q(x′|x) + (1 − a(x)) δ_x(x′),
where δ_x is the Dirac delta function and
a(x) = ∫_X α(x, x′) q(x′|x) dx′
is the probability of accepting a sample (hence 1 − a(x) is the probability of rejecting a new
sample while at point x). See Sec. 2.3.1 of Douc et al. (2018) for a rigorous derivation. Given this,
we write
p⋆(x) K(x′|x) = p⋆(x) q(x′|x) α(x, x′) + p⋆(x)(1 − a(x)) δ_x(x′)
= p⋆(x) q(x′|x) min{1, p⋆(x′) q(x|x′) / (p⋆(x) q(x′|x))} + p⋆(x)(1 − a(x)) δ_x(x′)
= min{p⋆(x) q(x′|x), p⋆(x′) q(x|x′)} + p⋆(x)(1 − a(x)) δ_x(x′)
= min{p⋆(x) q(x′|x) / (p⋆(x′) q(x|x′)), 1} p⋆(x′) q(x|x′) + p⋆(x′)(1 − a(x′)) δ_{x′}(x)
= K(x|x′) p⋆(x′),
which shows that detailed balance holds! (In the fourth line, the Dirac term can be rewritten because δ_x(x′) forces x = x′.)
One can see that the algorithm works just the same with unnormalised densities, i.e., recall
p⋆(x) = p̄⋆(x) / Z,
where Z is the normalisation constant. In this case, the acceptance ratio becomes
r(x, x′) = p̄⋆(x′) q(x|x′) / (p̄⋆(x) q(x′|x)),
without any change, as the normalising constants cancel out in (5.4). We will next describe certain
classes of proposals to sample from various kinds of distributions and assess their performance.
Example 5.6 (Independent Gaussian proposal). Consider a Gaussian (artificial) target
p⋆(x) = N(x; µ, σ²)
and an independent proposal q(x) = N(x; µ_q, σ_q²), which does not depend on the current state. The acceptance ratio is then
r(x, x′) = p⋆(x′) q(x) / (p⋆(x) q(x′))
= N(x′; µ, σ²) N(x; µ_q, σ_q²) / (N(x; µ, σ²) N(x′; µ_q, σ_q²))
= [exp(−(x′ − µ)²/(2σ²)) exp(−(x − µ_q)²/(2σ_q²))] / [exp(−(x − µ)²/(2σ²)) exp(−(x′ − µ_q)²/(2σ_q²))]
= exp(−[(x′ − µ)² − (x − µ)²]/(2σ²)) exp(−[(x − µ_q)² − (x′ − µ_q)²]/(2σ_q²)),
where the normalising constants cancel in the third line.
Example 5.7 (Random walk Gaussian proposal). Consider a mixture of Gaussians target
p⋆(x) = w1 N(x; µ1, σ1²) + w2 N(x; µ2, σ2²),
with some w1, w2 > 0 and w1 + w2 = 1. Assume we want to use MH to sample from it. Choose
a random walk proposal
q(x′|x) = N(x′; x, σ_q²).
[Figure panels: the density and its histogram (left) and the Markov chain trace (right).]
Figure 5.2: Random walk Gaussian proposal for a mixture of two Gaussians.
Since the random walk proposal is symmetric, i.e., q(x|x′) = q(x′|x), the acceptance ratio simplifies to
r(x, x′) = p⋆(x′) q(x|x′) / (p⋆(x) q(x′|x))
= p⋆(x′) / p⋆(x)
= [w1 N(x′; µ1, σ1²) + w2 N(x′; µ2, σ2²)] / [w1 N(x; µ1, σ1²) + w2 N(x; µ2, σ2²)].
Therefore, without doing much more than what we are already doing (using the unnormalised
density), we can inform the proposal by using the gradient of the target distribution:
q(x′|x) = N(x′; x + γ∇ log p⋆(x), 2γ I).
This algorithm is widely popular in statistics and especially in machine learning. This
approach is called the Metropolis-adjusted Langevin algorithm (MALA), and we will return to it later.
5.3.4 bayesian inference with metropolis-hastings
We can finally use the Metropolis-Hastings method for Bayesian inference. In what follows, we
will provide some examples for this and visualisations resulting from the sampling procedures.
Recall that, with conditionally independent observations y1, . . . , yn, we have the Bayes theorem as
p(x|y1:n) = p(y1:n|x) p(x) / p(y1:n) = [∏_{i=1}^{n} p(yi|x)] p(x) / p(y1:n).
We write
p(x|y1:n) ∝ ∏_{i=1}^{n} p(yi|x) p(x),
and set
p̄⋆(x) = ∏_{i=1}^{n} p(yi|x) p(x),
which is the unnormalised posterior density. We can then use the Metropolis-Hastings algorithm
to sample from this posterior density. A generic Metropolis-Hastings method for Bayesian
inference is described in Algorithm 10.
Example 5.8 (Source localisation). This example is taken from Cemgil (2014). Consider the
problem of source localisation in the presence of three sensors with three noisy observations.
The setup in this example can be seen from the left part of Fig. 5.3. We have three sensors
surrounding an object we are trying to locate. The sensors receive noisy observations on R2 . We
are trying to locate the object based on these observations. We define our prior rather broadly:
p(x) = N (x; µ, σ 2 I) where µ = [0, 0] and σ 2 = 20. We assume that the observations are coming
from
where si is the location of the ith sensor on R², for i = 1, 2, 3. We assume that the observations
are independent and that the noise is independent of the location of the object (of course, for
the sake of the example, we simulate our observations from the true model, which is not the
case in the real world). Devise an MH sampler to sample from the posterior density of x, i.e., the
distribution over the location of the hidden object.
This sort of Bayes update follows from the conditional Bayes rule introduced in Prop. 3.2. In
order to design the MH scheme, we therefore only need to evaluate the likelihood and the prior.
We choose a random walk proposal
q(x′|x) = N(x′; x, σ_q² I₂).
An example solution to this problem can be seen from Fig. 5.3 and the code can be accessed from
our online companion.
Example 5.9 (Gaussian with unknown mean and variance; Example 5.13 in Yıldırım (2017)).
Assume that we observe
Y1, . . . , Yn | z, s ∼ N(yi; z, s),
where we do not know the mean z and the variance s. Assume an independent prior p(z)p(s) = N(z; m, κ²) IG(s; α, β), where IG denotes the inverse Gamma density
IG(s; α, β) = [β^α / Γ(α)] s^{−α−1} exp(−β/s).
Design an MH sampler to sample from the posterior of (z, s).
[Figure panels: the object, the sensors and the posterior (left; legend: Samples, True, Sensor), and path traces of the 2D Markov chain (right).]
Figure 5.3: Solution of the source localisation problem.
Let us call our unnormalised posterior as p̄? (z, s|y1:n ). In order to do this, we need to design
proposals over z and s. We choose a random walk proposal for z:
q(z 0 |z) = N (z 0 ; z, σq2 ).
and an independent proposal for s:
q(s0 ) = IG(s0 ; α, β).
The joint proposal therefore is
q(z 0 , s0 |z, s) = N (z 0 ; z, σq2 )IG(s0 ; α, β).
When we design the MH algorithm, we see that the acceptance ratio is
r(z, s, z′, s′) = p̄⋆(z′, s′|y1:n) q(z, s|z′, s′) / (p̄⋆(z, s|y1:n) q(z′, s′|z, s))
= [p(z′) p(s′) (∏_{k=1}^{n} N(yk; z′, s′)) N(z; z′, σ_q²) p(s)] / [p(z) p(s) (∏_{k=1}^{n} N(yk; z, s)) N(z′; z, σ_q²) p(s′)]
= [N(z′; m, κ²) ∏_{k=1}^{n} N(yk; z′, s′)] / [N(z; m, κ²) ∏_{k=1}^{n} N(yk; z, s)],
since the symmetric random walk terms and the inverse Gamma terms cancel.
[Figure panels: the target distribution (left) and the random walk Metropolis histogram (right).]
Figure 5.4: Banana density estimation using Random walk metropolis and plotting the histogram.
Example 5.10 (Banana density). Consider the banana-shaped density below. This is only available in unnormalised form and it is an excellent test problem
on which many algorithms fail. Design an MH algorithm for this density.
Solution. We have
p̄⋆(x, y) = exp(−x²/10 − y⁴/10 − 2(y − x²)²).
Using a symmetric random walk proposal on (x, y), the acceptance ratio is
r(x, y, x′, y′) = p̄⋆(x′, y′) / p̄⋆(x, y),
and we implement the acceptance step by drawing U ∼ Unif(0, 1) and accepting if log U <
log r(x, y, x′, y′). The result can be seen from Fig. 5.4.
5.4 gibbs sampling
We will now go into another major class of MCMC samplers, called Gibbs samplers. The idea of
Gibbs samplers is that, given a joint distribution of many variables p(x1 , . . . , xd ), one can build a
Markov chain that samples from this distribution by sampling from the conditional distributions.
This will also allow us to straightforwardly sample from high-dimensional distributions. The
downside of this approach is that one has to derive the conditional distributions, which can be
difficult. However, if one can do this, then the Gibbs sampler can be a very efficient method.
In this chapter, we denote our target similarly as p? (x) where x ∈ Rd and define the full
conditional distributions as
pm,? (xm |x1 , . . . , xm−1 , xm+1 , . . . , xd ) = pm,? (xm |x1:m−1 , xm+1:d ) = pm,? (xm |x−m ),
where x−m = (x1 , . . . , xm−1 , xm+1 , . . . , xd ) is the vector of all variables except xm . For a mo-
ment, assume that the full conditionals are available. Also assume, we obtain Xn−1 ∈ Rd at the
n − 1’th iteration of the algorithm. To denote individual components, we use Xn−1,m to denote
the m’th component of Xn−1 . Of course, the key aspect of the Gibbs sampler is to derive the
full conditional distributions. We will come back to this point, but we will first investigate why
the Gibbs sampling approach provides us a valid MCMC kernel, in other words, how the Gibbs
sampler satisfies the detailed balance.
Let us denote x = (xm, x−m). It is easy to see from Algorithm 11 that each iteration of the Gibbs sampler
(at time n) is defined as d separate operations, each sampling from a full conditional
distribution. We can first look at what goes on in each of these d updates. It is easy to see
that the kernel defined by the mth update is given as
Km(x′|x) = pm,⋆(x′_m | x_−m) δ_{x_−m}(x′_−m),
where δ_{x_−m}(x′_−m) is the Dirac delta function. Intuitively, each step samples from the full conditional
pm,⋆(·|x_−m) for the mth dimension, where m ∈ {1, . . . , d}, and leaves the others unchanged,
which is enforced by the term δ_{x_−m}(x′_−m). One can then see that the entire Gibbs kernel can be
written as
K = K1 K2 · · · Kd.
Note that each kernel is an integral operator – therefore the above equation is almost symbolic: it
denotes the composition of the kernels, not their pointwise multiplication. We will now show that the Gibbs kernel satisfies the
detailed balance.
Proposition 5.3. The Gibbs kernel K leaves the target distribution p? invariant.
Proof. We first show that each kernel Km satisfies the detailed balance condition:
p⋆(x) Km(x′|x) = p⋆(x) pm,⋆(x′_m|x_−m) δ_{x_−m}(x′_−m)
= p⋆(x_−m) pm,⋆(x_m|x_−m) pm,⋆(x′_m|x_−m) δ_{x_−m}(x′_−m)
= p⋆(x′_−m) pm,⋆(x′_m|x′_−m) pm,⋆(x_m|x′_−m) δ_{x′_−m}(x_−m)
= p⋆(x′) Km(x|x′).
The steps of this derivation follow from the fact that the Dirac delta allows us to exchange the
variables x_−m and x′_−m. This shows that Km satisfies the detailed balance condition; therefore, we have
∫ Km(x′|x) p⋆(x) dx = p⋆(x′),
for all m = 1, . . . , d. One can then see that K2(K1 p⋆) = K2 p⋆ = p⋆, and so on.
Therefore, the application of the d kernels K1, . . . , Kd leaves p⋆ invariant.
We can also see why Gibbs sampling works by relating it to the Metropolis-Hastings algorithm.
Recall that we can see our sampling from the conditional as a proposal, i.e.,
qm (x0 |x) = pm,? (x0m |x−m )δx−m (x0−m ).
If we calculate the acceptance ratio for this proposal,
αm(x′|x) = min{1, p⋆(x′) qm(x|x′) / (p⋆(x) qm(x′|x))}
= min{1, p⋆(x′) Km(x|x′) / (p⋆(x) Km(x′|x))}.
We see that this is equal to 1 as the detailed balance is satisfied for qm (which is Km – see the
proof of Proposition 5.3).
As we noted before, we have shown that the kernel K would leave p? invariant, but this would
not give us proper convergence guarantees. Note that the version of the algorithm we presented
is called deterministic scan Gibbs sampler. The reason for this is that the algorithm in Alg. 11
is implemented so that we sample x1 , . . . , xd in order, scanning the variables deterministically.
It turns out, while this sampler’s convergence guarantees cannot be established easily, there is
an algorithmic fix which results in a procedure that is also guaranteed to converge. Instead of
scanning the variables deterministically, we can sample them in a random order. This is called
the random scan Gibbs sampler. The algorithm is given in Alg. 12. We will now see an example.
Algorithm 12 Random scan Gibbs sampler
1: Input: The number of samples N, and starting point X0 ∈ R^d.
2: for n = 1, . . . , N do
3:   Sample an index j uniformly from {1, . . . , d}.
4:   Sample Xn,j ∼ pj,⋆(xj | Xn−1,−j) and set Xn,−j = Xn−1,−j.
5: end for
Figure 5.5: Denoising of an image using Gibbs sampler. The left column shows the original image, the middle
column shows the noisy image, and the right column shows the denoised image. I used σ = 1, J = 4 for this and
the Gibbs sampler scanned the entire image only 10 times.
Example 5.11 (Image denoising). A biologist knocked on your door as some of the images from the
microscope were too noisy. You decide to help and use the Gibbs sampler for this. The model is
given as follows.
Consider a set of random variables Xij for i = 1, . . . , m and j = 1, . . . , n. This is a matrix
modelling an m × n image. We assume that we have an image that takes values Xij ∈ {−1, 1} –
note that this is an "unusual" image, as images usually take values in [0, 255] (or [0, 1]).
We assume that the image is corrupted by noise, i.e., we have a noisy image
Yij = Xij + σ εij,
where εij ∼ N(0, 1) and σ is the standard deviation of the noise. We assume that the noise is
independent of the image. We want to recover the image Xij from the noisy image Yij and utilise
the Gibbs sampler for this purpose.
The goal is to obtain (conceptually) p(X|Y), i.e., samples from p(X|Y) given Y. For this, we
need to specify a prior p(X). We take this from the literature and place as a prior a smooth
Markov random field (MRF) assumption. This is formalised as
p(Xij | X−ij) = (1/Z) exp(J Xij Wij),
where Wij is the sum of the Xkl's in the neighbourhood of Xij, i.e.,
Wij = Σ_{kl: neighbourhood of (i,j)} Xkl = Xi−1,j + Xi+1,j + Xi,j−1 + Xi,j+1.
This is an intuitive model of the image, making the current value of the pixel depend on the
values of its neighbours. The exercise is to design the Gibbs sampler for this problem.
Solution. We aim at using a Gibbs sampler approach for sampling the posterior p(X|Y). Note
that now we need to sample from full conditionals, e.g., for each (i, j), we need to sample from
Xij ∼ p(Xij | X−ij, Yij). We derive the full conditional as
p(Xij = k | X−ij, Yij) ∝ p(Yij | Xij = k) (1/Z) exp(J k Wij), for k ∈ {−1, 1},
where p(Yij | Xij = k) = N(Yij; k, σ²) is the likelihood of the noisy pixel given the value of the
pixel. We can easily compute these probabilities since each term in the Bayes rule is computable
(and 1/Z cancels). Therefore, we can get explicit expressions for q = p(Xij = 1 | X−ij, Yij) and
1 − q = p(Xij = −1 | X−ij, Yij). We can then sample from the full conditional as
Xij = 1 with probability q, and Xij = −1 with probability 1 − q.
We can now loop over (i, j) and sample from each full conditional in turn. This is the Gibbs
sampler algorithm as described above. The results of this procedure can be seen from Fig. 5.5.
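A sketch of this sampler might look as follows (assuming a {−1, 1}-valued noisy image Y has already been constructed; the values of J, σ, and the number of sweeps are placeholder choices):

```python
import numpy as np

def gibbs_denoise(Y, J=4.0, sigma=1.0, n_sweeps=10, rng=None):
    """Single-site Gibbs sampler for the MRF denoising model above."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = Y.shape
    X = np.where(Y >= 0, 1, -1)                       # initialise at the sign of the data
    for _ in range(n_sweeps):
        for i in range(m):
            for j in range(n):
                # sum of the four neighbours (outside the image treated as zero)
                W = 0.0
                if i > 0:     W += X[i - 1, j]
                if i < m - 1: W += X[i + 1, j]
                if j > 0:     W += X[i, j - 1]
                if j < n - 1: W += X[i, j + 1]
                # unnormalised log full-conditional for X_ij = +1 and X_ij = -1
                log_p_plus = J * W - 0.5 * (Y[i, j] - 1.0) ** 2 / sigma**2
                log_p_minus = -J * W - 0.5 * (Y[i, j] + 1.0) ** 2 / sigma**2
                q = 1.0 / (1.0 + np.exp(log_p_minus - log_p_plus))
                X[i, j] = 1 if rng.uniform() < q else -1
    return X
```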
Example 5.12 (Beta-Binomial model). Consider the following model, with a Beta prior on the success probability θ,
p(θ) = Beta(θ; α, β) = [Γ(α + β) / (Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1},
and a Binomial likelihood
p(x|θ) = Bin(x; n, θ) = [n! / (x!(n − x)!)] θ^x (1 − θ)^{n−x}.
Derive the Gibbs sampler to sample from the joint distribution p(x, θ).
Solution. We know that, for this, we need full conditionals, i.e., we need p(x|θ) and p(θ|x). We
can see that p(x|θ) is already provided in the definition of the model. Therefore, we only need to
derive the posterior. We can write the joint distribution as
p(x, θ) = p(x|θ) p(θ) = [n! / (x!(n − x)!)] θ^x (1 − θ)^{n−x} [Γ(α + β) / (Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}
= [Γ(α + β) / (Γ(α)Γ(β))] [n! / (x!(n − x)!)] θ^{x+α−1} (1 − θ)^{n−x+β−1}.
For Bayes' theorem p(θ|x) = p(x|θ) p(θ) / p(x), we also need to compute p(x). This is given by
p(x) = ∫₀¹ p(x, θ) dθ = ∫₀¹ [Γ(α + β) / (Γ(α)Γ(β))] [n! / (x!(n − x)!)] θ^{x+α−1} (1 − θ)^{n−x+β−1} dθ
= [n! / (x!(n − x)!)] [Γ(α + β) / (Γ(α)Γ(β))] [Γ(x + α)Γ(n − x + β) / Γ(n + α + β)].
Dividing the joint by this marginal, we obtain
p(θ|x) = p(x, θ) / p(x) = Beta(θ; x + α, n − x + β).
Therefore we can sample from p(θ|x) using any method to simulate a Beta variable. The Gibbs
sampler is then defined as follows:
• Initialise x0 , θ0
• For k = 1, 2, . . .:
– Sample θk ∼ p(θ|xk−1 )
– Sample xk ∼ p(x|θk )
• Return (xk, θk) for k = 1, 2, . . .
We also note that the simulated xk are approximately distributed according to p(x), which also gives us a way to approximate p(x).
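A minimal sketch of this two-block Gibbs sampler might be (n, α, β, and the number of iterations are placeholder values):

```python
import numpy as np

def gibbs_beta_binomial(n=20, alpha=2.0, beta=3.0, n_iters=5000, rng=None):
    """Gibbs sampler alternating theta|x ~ Beta(x + alpha, n - x + beta) and x|theta ~ Bin(n, theta)."""
    rng = np.random.default_rng() if rng is None else rng
    xs = np.empty(n_iters, dtype=int)
    thetas = np.empty(n_iters)
    x = rng.integers(0, n + 1)                        # initialise x_0
    for k in range(n_iters):
        theta = rng.beta(x + alpha, n - x + beta)     # theta_k ~ p(theta | x_{k-1})
        x = rng.binomial(n, theta)                    # x_k ~ p(x | theta_k)
        thetas[k], xs[k] = theta, x
    return xs, thetas

xs, thetas = gibbs_beta_binomial()
# The x-samples are approximately distributed according to the marginal p(x).
```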
5.4.1 metropolis-within-gibbs
One remarkable feature of the Gibbs sampler is that when we cannot derive the full conditionals
(or are too lazy to do so), we can instead target each full conditional with a single Metropolis step at
each iteration. This is called the Metropolis-within-Gibbs algorithm and, remarkably, it samples
from the correct posterior!
Consider a generic target p(x, y). In many situations, it is easier to write unnormalised full
conditionals, i.e., we would have access to p̄(x|y) and p̄(y|x), but not p(x|y) and p(y|x). In this
case, we can use the Metropolis-within-Gibbs algorithm – meaning that instead of sampling
from the full conditional, we take a Metropolis step targeting the full conditional. The
algorithm can be summarised as follows. Given (xn−1, yn−1),
1. Sample x′ ∼ qx(x′|xn−1) and accept x′ with probability min{1, rx}, where
rx = p̄(x′|yn−1) qx(xn−1|x′) / (p̄(xn−1|yn−1) qx(x′|xn−1)),
i.e., set xn = x′ with probability min{1, rx} and xn = xn−1 otherwise.
2. Sample y′ ∼ qy(y′|yn−1) and accept y′ with probability min{1, ry}, where
ry = p̄(y′|xn) qy(yn−1|y′) / (p̄(yn−1|xn) qy(y′|yn−1)),
i.e., set yn = y′ with probability min{1, ry} and yn = yn−1 otherwise.
Example 5.13 (Metropolis-within-Gibbs for Example 5.9). Let us return to Example 5.9. To
recall the model, assume that we observe
Y1 , . . . , Yn |z, s ∼ N (yi ; z, s)
where we do not know z and s. Assume we have an independent prior on z and s:
p(z)p(s) = N (z; m, κ2 )IG(s; α, β).
where IG(s; α, β) is the inverse Gamma distribution
IG(s; α, β) = [β^α / Γ(α)] s^{−α−1} exp(−β/s).
In other words, we have
p(z)p(s) = [1/√(2πκ²)] exp(−(z − m)²/(2κ²)) [β^α / Γ(α)] s^{−α−1} exp(−β/s).
We are after the posterior distribution
p(z, s | y1, . . . , yn) ∝ p(y1, . . . , yn | z, s) p(z) p(s)
= ∏_{i=1}^{n} N(yi; z, s) N(z; m, κ²) IG(s; α, β).
Let us call our unnormalised posterior p̄⋆(z, s|y1:n). Now, instead of MH or a plain Gibbs sampler
(which requires us to derive the full conditionals), derive the Metropolis-within-Gibbs algorithm.
The unnormalised full conditionals are
p̄⋆(z|s, y1:n) = ∏_{i=1}^{n} N(yi; z, s) N(z; m, κ²),
and
p̄⋆(s|z, y1:n) = ∏_{i=1}^{n} N(yi; z, s) IG(s; α, β).
In order to do this, we need to design proposals over z and s to target p̄(z|s, y1:n ) and p̄(s|z, y1:n )
respectively. This step will be a standard Metropolis as if we are solving each problem indepen-
dently. We choose a random walk proposal for z:
q(z 0 |z) = N (z 0 ; z, σq2 ).
and an independent proposal for s:
q(s0 ) = IG(s0 ; α, β).
Therefore, Metropolis-within-Gibbs can be implemented as follows.
• Initialise z0, s0.
• For k = 1, 2, . . .:
– Sample z′ ∼ q(z′|zk−1) = N(z′; zk−1, σ_q²) and set zk = z′ with probability
min{1, p̄⋆(z′|sk−1, y1:n) / p̄⋆(zk−1|sk−1, y1:n)}
(the symmetric random walk terms cancel); otherwise set zk = zk−1.
– Sample s′ ∼ q(s′) = IG(s′; α, β) and set sk = s′ with probability
min{1, p̄⋆(s′|zk, y1:n) q(sk−1) / (p̄⋆(sk−1|zk, y1:n) q(s′))};
otherwise set sk = sk−1.
• Return (zk, sk) for k = 1, 2, . . .
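A compact sketch of this scheme might look as follows (the data y, the prior and proposal hyperparameters, and the iteration count are placeholder choices):

```python
import numpy as np
from scipy import stats

def mwg_gaussian(y, m=0.0, kappa2=10.0, alpha=2.0, beta=2.0,
                 sigma_q=0.5, n_iters=5000, rng=None):
    """Metropolis-within-Gibbs for the Gaussian model with unknown mean z and variance s."""
    rng = np.random.default_rng() if rng is None else rng

    def log_cond_z(z, s):   # log pbar(z | s, y) up to a constant
        return -0.5 * np.sum((y - z) ** 2) / s - 0.5 * (z - m) ** 2 / kappa2

    def log_cond_s(s, z):   # log pbar(s | z, y) up to a constant
        n = len(y)
        return (-0.5 * n * np.log(s) - 0.5 * np.sum((y - z) ** 2) / s
                + stats.invgamma.logpdf(s, alpha, scale=beta))

    z, s = np.mean(y), np.var(y)
    chain = np.empty((n_iters, 2))
    for k in range(n_iters):
        # z-step: random walk proposal (symmetric terms cancel)
        z_prop = z + sigma_q * rng.standard_normal()
        if np.log(rng.uniform()) < log_cond_z(z_prop, s) - log_cond_z(z, s):
            z = z_prop
        # s-step: independent IG(alpha, beta) proposal
        s_prop = stats.invgamma.rvs(alpha, scale=beta, random_state=rng)
        log_r = (log_cond_s(s_prop, z) - log_cond_s(s, z)
                 + stats.invgamma.logpdf(s, alpha, scale=beta)
                 - stats.invgamma.logpdf(s_prop, alpha, scale=beta))
        if np.log(rng.uniform()) < log_r:
            s = s_prop
        chain[k] = z, s
    return chain
```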
MALA is based on the overdamped Langevin SDE
dXt = ∇ log p⋆(Xt) dt + √2 dBt, (5.5)
whose stationary distribution is p⋆. To use it as a sampler, one needs to numerically solve this SDE, which can be done with a variety of numerical methods (akin to ODE
solvers). One caveat in this situation is that, while the SDE would target p⋆, its discretisation
would incur bias. This is why MALA is "Metropolised".
Let us recall the MALA algorithm. We start with a point X0 and then define the proposal
q(xn |xn−1 ) = N (xn ; xn−1 + γ∇ log p? (xn−1 ), 2γId ), (5.6)
where γ > 0 is a step size. We then sample from q to obtain Xn . We then accept Xn with
probability
α(Xn, Xn−1) = min{1, p⋆(Xn) q(Xn−1|Xn) / (p⋆(Xn−1) q(Xn|Xn−1))}. (5.7)
Recall that the MALA proposal does not define a symmetric proposal; therefore, we need
to compute the full ratio. Now, one can see that Eq. (5.6) can be equivalently written as
Xn = Xn−1 + γ∇ log p⋆(Xn−1) + √(2γ) Wn, (5.8)
where Wn ∼ N(0, I). The relationship of Eq. (5.8) to (5.5) can be seen by noting that the
discretisation of the SDE in (5.5) would exactly take the form of Eq. (5.8). Therefore, MALA uses
this discretised Langevin SDE as the proposal and then accepts or rejects its samples. This has the beneficial effect
of correcting the bias of the discretisation.
However, the Metropolis step can become computationally problematic in higher dimensions, just
as in the case of rejection sampling: higher dimensional problems cause the acceptance rate to
vanish, which results in slow convergence. To remedy this situation, a common approach is to
simply drop the Metropolis step and use the following iteration:
Xn = Xn−1 + γ∇ log p⋆(Xn−1) + √(2γ) Wn. (5.9)
This is a simple (and biased) MCMC method, which is called the unadjusted Langevin algorithm
(ULA). The ULA is a discretisation of the SDE and, as such, its stationary measure is not p⋆.
However, under various conditions, it can be shown that the limiting distribution of ULA, denoted p⋆^γ, can
be made arbitrarily close to p⋆ as γ → 0. This means that the ULA can be a viable alternative.
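A minimal ULA sketch might be (the gradient function, step size, and iteration count are placeholder choices):

```python
import numpy as np

def ula(grad_log_p, x0, gamma=1e-2, n_iters=10_000, rng=None):
    """ULA: X_n = X_{n-1} + gamma * grad log p(X_{n-1}) + sqrt(2 gamma) W_n."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iters, x.size))
    for n in range(n_iters):
        x = x + gamma * grad_log_p(x) + np.sqrt(2 * gamma) * rng.standard_normal(x.size)
        chain[n] = x
    return chain

# Example usage: target N(mu, sigma^2), so grad log p(x) = -(x - mu) / sigma^2
mu, sigma2 = 1.0, 2.0
chain = ula(lambda x: -(x - mu) / sigma2, x0=0.0, gamma=0.05)
```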
Example 5.14. Consider the target p? (x) = N (x; µ, σ 2 ). Write down the ULA for this target
and derive the stationary distribution of the chain.
Solution. Since ∇ log p⋆(x) = −(x − µ)/σ², the ULA iteration reads
Xn = Xn−1 − (γ/σ²)(Xn−1 − µ) + √(2γ) Wn,
where Wn ∼ N(0, 1). In this simple case, we can compute the stationary distribution of the
chain and analyse its relationship to the true target p⋆. Let
a = 1 − γ/σ², b = (γ/σ²) µ.
We can now write the iterates beginning at x0 as
x1 = a x0 + b + √(2γ) W1,
x2 = a x1 + b + √(2γ) W2 = a² x0 + ab + b + a√(2γ) W1 + √(2γ) W2,
x3 = a x2 + b + √(2γ) W3 = a³ x0 + a²b + ab + b + a²√(2γ) W1 + a√(2γ) W2 + √(2γ) W3,
...
xn = a^n x0 + Σ_{k=0}^{n−1} a^k b + Σ_{k=0}^{n−1} a^k √(2γ) Wn−k.
Since 0 < a < 1 for small enough γ, as n → ∞ the term a^n x0 vanishes and the mean converges to
Σ_{k=0}^{∞} a^k b = b/(1 − a) = µ. The variance of the noise term converges to
2γ Σ_{k=0}^{∞} a^{2k} = 2γ / (1 − a²) = 2γ / (2γ/σ² − γ²/σ⁴) = 2σ⁴ / (2σ² − γ).
Therefore, we obtained the target measure of ULA as
p⋆^γ(x) = N(x; µ, 2σ⁴ / (2σ² − γ)),
which is different than p⋆. Note that in this particular case, the means of p⋆^γ and p⋆ agree. It
can be seen that the bias enters the picture through the variance, but this vanishes as γ → 0.
As another illustration, consider again the banana-shaped density of Example 5.10 and let us write down the ULA iterates for it. Recall that ∇ log p̄⋆(x, y) = ∇ log p⋆(x, y). Therefore, we can directly compute the unnormalised
gradients:
∇ log p̄⋆(x, y) = [ −x/5 + 8x(y − x²),  −2y³/5 − 4(y − x²) ]ᵀ.
The ULA iterates are then
[xn+1, yn+1]ᵀ = [xn, yn]ᵀ + γ∇ log p̄⋆(xn, yn) + √(2γ) Vn,
where Vn ∼ N(0, I₂).
Example 5.16 (Bayesian inference with ULA). Show that we can also straightforwardly perform
Bayesian inference in this setting by deriving ULA for a generic posterior density given p(y|x) and
p(x).
Solution. Recall the target posterior density in this setting,
p̄⋆(x|y) = p(x) ∏_{k=1}^{n} p(yk|x),
where p(x) is the prior density and p(yk|x) is the likelihood, where observations are conditionally
i.i.d. given x. We can write the ULA iterates as
Xn = Xn−1 + γ∇ log p̄⋆(Xn−1|y) + √(2γ) Vn
= Xn−1 + γ (∇ log p(Xn−1) + Σ_{k=1}^{n} ∇ log p(yk|Xn−1)) + √(2γ) Vn.
A common problem arising in machine learning and statistics is big data, where the number
of observations n is large. In this case, both ULA and MALA become expensive, as both require the
iterates above to be evaluated, e.g., each iteration involves summing n terms. If n is of the order of
millions, this is computationally infeasible. In this case, we can use stochastic gradients. This
is only applicable in the setting of ULA, as we will see below, which is one reason why ULA-type
methods are more popular than MALA-type methods.
Example 5.17 (Large scale Bayesian inference). Recall the problem setting in Example 5.16:
Xn = Xn−1 + γ (∇ log p(Xn−1) + Σ_{k=1}^{n} ∇ log p(yk|Xn−1)) + √(2γ) Vn.
Design the stochastic gradient sampler for this case.
Solution. Assume we sample indices i1, . . . , iK uniformly from {1, . . . , n}; we can then approximate the
sum as
Σ_{k=1}^{n} ∇ log p(yk|Xn−1) ≈ (n/K) Σ_{k=1}^{K} ∇ log p(y_{ik}|Xn−1).
Note that the (n/K) factor appears here because the sum itself does not have a (1/n) term (as opposed to the
sum example above). Therefore, the stochastic gradient Langevin dynamics (SGLD) iterate can
be written as
Xn = Xn−1 + γ (∇ log p(Xn−1) + (n/K) Σ_{k=1}^{K} ∇ log p(y_{ik}|Xn−1)) + √(2γ) Vn.
This is also called data subsampling, as one can see that the gradient only uses a subset of the data.
Every iteration is cheap and computable as we only need to compute K terms. This is a very
popular method in Bayesian inference and is used in many applications.
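A sketch of SGLD for such a posterior might look as follows (the gradient functions, minibatch size, and step size are placeholder choices):

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, data, x0, gamma=1e-4, K=32,
         n_iters=10_000, rng=None):
    """Stochastic gradient Langevin dynamics with minibatches of size K."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iters, x.size))
    for it in range(n_iters):
        idx = rng.integers(0, n, size=K)              # subsample K indices uniformly
        # unbiased estimate of the full gradient: prior term + (n/K) * minibatch sum
        grad = grad_log_prior(x) + (n / K) * sum(grad_log_lik(data[i], x) for i in idx)
        x = x + gamma * grad + np.sqrt(2 * gamma) * rng.standard_normal(x.size)
        chain[it] = x
    return chain
```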
5.6.1 background
It is important to note that a sampler can be used as an optimiser in the following context.
Consider the target density
pβ? (x) ∝ exp(−βf (x)),
where β > 0 is a parameter. It is known in the literature that the density pβ? (x) concentrates
around the minima of f as β → ∞ (Hwang, 1980). This connection between probability
distributions and optimisation spurred the development of MCMC methods for optimisation. In
what follows, we describe two methods that exploit this connection.
Algorithm 13 Simulated Annealing
1: Input: starting point X0.
2: for t = 1, 2, . . . do
3:   Sample X′ ∼ q(x|Xt−1) (symmetric proposal, e.g., random walk).
4:   Set βt (e.g. βt = 1 + √t).
5:   Set Xt = X′ with probability
       min{1, p̄⋆^{βt}(X′) / p̄⋆^{βt}(Xt−1)};
     otherwise set Xt = Xt−1.
6: end for
Since
p̄⋆^{βt}(X′) / p̄⋆^{βt}(Xt−1) = exp(−βt f(X′)) / exp(−βt f(Xt−1)) = exp(βt (f(Xt−1) − f(X′))),
we can see that the acceptance ratio is a function of the difference in the objective function values.
If f(X′) ≤ f(Xt−1), this ratio will take higher values, possibly bigger than 1 depending on
the improvement. If, however, f(X′) ≥ f(Xt−1), the acceptance ratio will be small, as it should
be. The scheduling of (βt)_{t≥0} is a design problem that depends on the specific cost function under
consideration.
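A generic sketch of the procedure might be (the objective f, proposal scale, and iteration count are placeholder choices; the schedule matches the one above):

```python
import numpy as np

def simulated_annealing(f, x0, sigma_q=0.1, n_iters=10_000, rng=None):
    """Simulated annealing with a random walk proposal and schedule beta_t = 1 + sqrt(t)."""
    rng = np.random.default_rng() if rng is None else rng
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for t in range(1, n_iters + 1):
        beta_t = 1.0 + np.sqrt(t)
        x_prop = x + sigma_q * rng.standard_normal()
        f_prop = f(x_prop)
        # accept with probability min(1, exp(beta_t * (f(x) - f(x'))))
        if np.log(rng.uniform()) < beta_t * (fx - f_prop):
            x, fx = x_prop, f_prop
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Example usage with a nonconvex objective possessing several local minima
x_star, f_star = simulated_annealing(lambda x: np.sin(5 * x) + 0.5 * x**2, x0=0.0)
```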
Example 5.18. Consider the function f given in Eq. (5.10) (plotted in Fig. 5.6). This is a function with multiple local minima and is nonconvex. The function has one global
minimum and we aim at finding it. Describe the simulated annealing method with a schedule
βt = 1 + √t.
Solution. We see that βt → ∞ as t grows. We use a random walk proposal with a standard
deviation of σq = 0.1 and implement the acceptance step in the log domain. We initialise X0 ∼ Unif(−1, 1).
The algorithm is implemented as follows: given Xt−1,
• Sample X′ ∼ N(Xt−1, σq²) and u ∼ Unif(0, 1).
• Accept and set Xt = X′ if log u < βt (f(Xt−1) − f(X′)); otherwise set Xt = Xt−1.
Figure 5.6: Simulated annealing for the function in Eq. 5.10. The red line shows the final estimate of SA algorithm.
The result can be seen from Figure 5.6. We can see that the algorithm is able to find the global
minimum.
Example 5.19 (ULA for Optimisation). Assume that we try to solve the following problem:
min_{x∈R} f(x), where f(x) = (x − µ)² / (2σ²),
and µ and σ are known. Of course, (i) we do not really need the
scaling factor 1/(2σ²) and (ii) we can simply solve this problem exactly (no surprise, the minimiser
is µ). For the sake of this example, design a Langevin MCMC method to optimise this cost
function and show its properties.
Solution. We can convert the optimisation problem into a sampling problem by defining the
target density as
p⋆^β(x) ∝ exp(−β f(x)).
One can argue that we will not know the normalising constant – which is exactly true. In general,
to solve the problem min_{x∈R^d} f(x), one constructs a target density p⋆^β(x) ∝ e^{−βf(x)}, where β > 0 is a parameter called the inverse temperature, and samples from it, here using the Langevin iteration
Xn = Xn−1 − γ∇f(Xn−1) + √(2γ/β) Wn.
[Figure panels (two rows, σq = 0.5 and σq = 4): the density and its histogram, the Markov chain trace, and the autocorrelation.]
Figure 5.7: Random walk Metropolis-Hastings for a mixture of two Gaussians. The top panel shows the situation
where σq = 0.5, so the chain gets stuck in the modes. This causes a high autocorrelation, and as such, this sampler is
not considered to be a good one. When we set σq = 4, the chain exhibits low autocorrelation and is a good
sampler.
We can see, following the same logic as in Example 5.14, that we have a target distribution
p⋆^β(x) = N(x; µ, 2σ⁴ / (β(2σ² − γ))).
One can see that as β → ∞, we have pβ? (x) → δµ (x), i.e., the target distribution is a Dirac delta
at µ. This is an example of a more general result where sampling from pβ? (x) ∝ exp(−βf (x))
(as it is what the sampler is doing) leads to distributions that concentrate on the minima of f (x)
as β → ∞.
In our case, for large β, the distribution would be concentrated around µ, which is the minimiser of f.
Therefore, samples from this distribution would be very close to µ. The error can be verified and
quantified in a number of challenging and nonconvex settings (Zhang et al., 2023).
5.7 monitoring and postprocessing mcmc output
There are a number of ways to monitor the MCMC samples to ensure that the algorithm is
working as expected. We will discuss a few of them here.
The first diagnostic is the autocorrelation function, ρk = Corr(Xn, Xn+k). This can be empirically computed from the samples of the Markov chain (xk)_{k∈N}. Since
the aim of MCMC is to obtain nearly independent samples from a target p⋆, we expect a good
MCMC chain to exhibit low autocorrelation. A bad chain which is not mixing well will exhibit
high autocorrelation. An example can be seen from Fig. 5.7; see its caption for more details.
One way to choose the proposal variance is to ensure that the chain has a low autocorrelation.
This is a very simple way to monitor the chain.
Another useful diagnostic is the effective sample size (ESS),
ESS = N / (1 + 2 Σ_{k=1}^{∞} ρk),
where ρk is the autocorrelation function. The ESS is an approximate measure of the number of
independent samples that we have. For example, if the chain exhibits no autocorrelation, then
the ESS is equal to the number of samples. If the chain exhibits high autocorrelation, then the
ESS will be very low, as the sum in the denominator will be large.
The computation of effective sample size in MCMC is usually done by software packages. We
will not go into the details of this computation here.
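For illustration only, a naive empirical computation might look as follows (a sketch; the lag truncation is a pragmatic choice, and dedicated packages implement more careful estimators):

```python
import numpy as np

def autocorrelation(chain, max_lag=100):
    """Empirical autocorrelation rho_k of a 1D chain for lags k = 0, ..., max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    return np.array([np.mean(x[:len(x) - k] * x[k:]) / var for k in range(max_lag + 1)])

def effective_sample_size(chain, max_lag=100):
    """ESS = N / (1 + 2 * sum_k rho_k), truncating at max_lag or the first negative rho_k."""
    rho = autocorrelation(chain, max_lag)[1:]
    neg = np.where(rho < 0)[0]
    if neg.size > 0:
        rho = rho[: neg[0]]        # truncate to stabilise the estimate
    return len(chain) / (1.0 + 2.0 * rho.sum())
```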
[Figure panels: autocorrelation of the full chain and of every 10th sample; the reported acceptance rate is ≈ 0.247.]
Figure 5.8: Thinning of MCMC samples. We keep every 10th sample for the same mixture of Gaussians example
with σq = 2. It can be seen that the thinned MCMC chain exhibits significantly lower autocorrelation.
A final common practice is thinning: we discard some of the samples and keep only a subset of them. This is done by keeping every kth sample.
Since autocorrelation in an MCMC chain decays naturally over time, after reaching stationarity,
we can keep every kth sample and discard the rest. This will still give us a chain with the same
stationary measure but with a lower autocorrelation. A demonstration of this can be seen from
Fig. 5.8.
6 SEQUENTIAL MONTE CARLO
In this chapter, we introduce sequential Monte Carlo (SMC) methods. These methods are used to
approximate a sequence of target distributions rather than just a single, fixed target. This can have
a number of applications, including filtering and smoothing in state space models. We will briefly
introduce state-space models, SMC and its connection to importance sampling, and application of
SMC to filtering in state-space models, which is also called particle filtering.
6.1 introduction
In this section, we depart from our standard setting where we have a single, fixed target p? (x). In
many problems in the real world, the target distributions are evolving over time. For example,
consider the example of tracking a target, a straightforward extension of the source localisation
problem we discussed in Example 5.8. Instead of a fixed target and fixed measurements, we could
easily have the case of a moving target and fixed or moving sensors. In this case, we could recompute
our posterior every time we get a new measurement; however, this quickly becomes prohibitive
(imagine having to run a new MCMC chain every time you receive new data!). The
applications of this framework are not limited to simple localisation examples; it broadly generalises
to many dynamical systems. A few examples are volatility estimation in financial time series,
robotics (tracking and control of moving arms), infectious disease modelling (tracking the spread
of a disease), and many more. The idea of an evolving sequence of distributions can also be used to
target static problems, as we have seen in the example of simulated annealing.
Our running example in this section will be state-space models. A good example within this
setting is the target tracking problem, which captures the notion of a hidden state and a
sequence of observations. However, it is crucial to observe that the setting generalises to any
situation where a hidden, evolving quantity is to be estimated (out in the wild) and a stream of data
is received to update our latest belief about the state of the object.
[Figure: graphical model of a state-space model, with hidden states x0, x1, x2, . . . , xt and observations y1, y2, . . . , yt.]
A state-space model is specified by an initial density µ(x0), a transition density f(xt|xt−1) for the hidden states, and an observation (likelihood) density g(yt|xt), so that the joint distribution of states and observations is
π̄t(x0:t, y1:t) = µ(x0) ∏_{k=1}^{t} f(xk|xk−1) g(yk|xk).
This is going to be important. To see the analogy with standard Bayesian inference, recall
that with a given prior p(x) and a likelihood p(y|x), we could write the joint distribution as
p(x, y) = p(y|x)p(x). The joint distribution above has a similar structure.
Recall also that we have used joint distributions as unnormalised densities throughout. This
will be the same in this section, where the joint distribution above will act as an unnormalised
density. Another way to look at it is to consider y1:t completely fixed. This also makes it clear
that we have distributions of the form πt(x0:t) (normalised) and π̄t(x0:t) (unnormalised). We
will then apply what we have covered in Chapter 4 to this setting. The final thing to note is the
definition of the marginal likelihood in this setting. Let p(y1:t) denote the marginal likelihood of
the observations y1:t; then we can write
p(y1:t) = ∫ µ(x0) ∏_{k=1}^{t} f(xk|xk−1) g(yk|xk) dx0:t. (6.1)
6.2.1 the filtering problem
Given a sequence of observations y1:t , a typical problem is to estimate the conditional distributions
of the signal process (Xt )t≥0 given the observed data. We denote this distribution with πt (xt |y1:t )
which is called the filtering distribution. The problem of sequentially updating the sequence of
filtering distributions (πt (xt |y1:t ))t≥1 is called the filtering problem.
To introduce the idea intuitively, consider the scenario of tracking a target. We denote the
states of the target with (xt )t≥0 which may include positions and velocities. We assume that
the target moves in space w.r.t. f , i.e., the transition model of the target is given by f (xt |xt−1 ).
Observations may consist of the locations of the target on R2 or power measurements with
associated sensors (which may result in high-dimensional observations). At each time t, we
receive a measurement vector yt conditional on the true state of the system xt . The likelihood of
each observation is assumed to follow g(yt |xt ).
We now provide a simple recursion to demonstrate one possible solution to the filtering
problem. Assume that we are given the distribution at time t − 1 (to define our sequential
recursion) and would like to incorporate a recent observation yt . One way to do so is to first
perform prediction,
ξt(xt|y1:t−1) = ∫ f(xt|xt−1) πt−1(xt−1|y1:t−1) dxt−1, (6.2)
and then perform the update,
πt(xt|y1:t) = ξt(xt|y1:t−1) g(yt|xt) / p(yt|y1:t−1), (6.3)
where p(yt|y1:t−1) = ∫ ξt(xt|y1:t−1) g(yt|xt) dxt is the incremental marginal likelihood.
Remark 6.1. We remark that the celebrated Kalman filter (Kalman, 1960) exactly implements
recursions (6.2)–(6.3) in the case of
µ(x0 ) = N (x0 ; µ0 , Σ0 ),
f (xt |xt−1 ) = N (xt ; Axt−1 , Q),
g(yt |xt ) = N (yt ; Cxt , R).
For this Gaussian system, computing the integral (6.2) and the update (6.3) is analytically tractable,
which results in Kalman filtering recursions of the mean and the covariance of the filtering
distribution πt (xt |y1:t ). We skip the update rules of the Kalman filter, as our main aim is to focus
on sequential Monte Carlo in this course.
Finally, we can move on to show how to update the joint filtering distribution of the states x0:t.
To see this, note the recursion
πt(x0:t|y1:t) = π̄t(x0:t, y1:t) / p(y1:t)
= [π̄t−1(x0:t−1, y1:t−1) / p(y1:t−1)] × [f(xt|xt−1) g(yt|xt) / p(yt|y1:t−1)]
= πt−1(x0:t−1|y1:t−1) f(xt|xt−1) g(yt|xt) / p(yt|y1:t−1).
This recursion will be behind the sequential Monte Carlo method we use for filtering in the next
sections.
Consider now the generic problem of estimating an expectation Eπ[ϕ(X)] under a target density π(x). We assume that sampling from this density is not possible and that we can only evaluate the
unnormalised density π̄(x). One way to estimate this expectation is to sample from a proposal
measure q and rewrite the integral as
Eπ[ϕ(X)] = ∫ ϕ(x)π(x)dx
= [∫ ϕ(x) (π̄(x)/q(x)) q(x)dx] / [∫ (π̄(x)/q(x)) q(x)dx]
≈ [(1/N) Σ_{i=1}^{N} ϕ(x^(i)) π̄(x^(i))/q(x^(i))] / [(1/N) Σ_{i=1}^{N} π̄(x^(i))/q(x^(i))], x^(i) ∼ q, i = 1, . . . , N. (6.4)
Defining the weight function
W(x) = π̄(x)/q(x), (6.5)
and W^(i) = W(x^(i)), which are called the unnormalised weights, we can obtain the estimator
in a more convenient form,
ϕ̂_IS^N = Σ_{i=1}^{N} w^(i) ϕ(x^(i)),
where w^(i) = W^(i) / Σ_{j=1}^{N} W^(j) are the normalised weights.
In the following section, we will derive the importance sampler aiming at building particle
approximations of πt(x0:t|y1:t) for a state-space model.
Recall that the unnormalised target here is π̄(x0:t, y1:t), which is simply the joint distribution of all variables (x0:t, y1:t). Just as in the regular importance
sampling case in eq. (6.5), we write
W0:t(x0:t) = π̄(x0:t, y1:t) / q(x0:t).
Obviously, given samples from the proposal x0:t^(i) ∼ q(x0:t), one can easily build the same weighted
measure as in (6.7) on the path space by evaluating the weights W0:t^(i) = W0:t(x0:t^(i)) for i = 1, . . . , N
and building a particle approximation
π^N(x0:t) dx0:t = Σ_{i=1}^{N} w0:t^(i) δ_{x0:t^(i)}(x0:t) dx0:t,
where
w0:t^(i) = W0:t^(i) / Σ_{j=1}^{N} W0:t^(j).
However, this would be an undesirable scheme: We would need to store all variables in memory
which is infeasible as t grows. Furthermore, with the arrival of a new observation yt+1 , this
would have to be re-done, as this importance sampling procedure does not take into account
the dynamic properties of the SSM. Therefore, implementing this sampler to build estimators
sequentially is out of the question.
Fortunately, we can design our proposal in certain ways so that this process can be done
sequentially, starting from 0 to t. Furthermore, this would allow us to run the filter online and
incorporate new observations. The clever choices of the proposal here lead to a variety of different
particle filters as we shall see next. Let us consider a decomposition of the proposal
q(x0:t) = q(x0) ∏_{k=1}^{t} q(xk | x0:k−1).
Note that, based on this, we can build a recursion for the function W(x0:t) by writing
W0:t(x0:t) = π̄(x0:t, y1:t) / q(x0:t)
= [π̄(x0:t−1, y1:t−1) / q(x0:t−1)] × [f(xt|xt−1) g(yt|xt) / q(xt|x0:t−1)]
= W0:t−1(x0:t−1) f(xt|xt−1) g(yt|xt) / q(xt|x0:t−1)
= W0:t−1(x0:t−1) Wt(x0:t). (6.9)
That is, under this scenario, the weights can be computed recursively – given the weights at time
t − 1, one can evaluate W0:t(x0:t) and update the weights. However, this would not solve the
infeasibility problem mentioned earlier, as the cost of evaluating the weight using the whole path of samples
is still prohibitive. Finally, to remedy this, we can further simplify our proposal to
q(x0:t) = q(x0) ∏_{k=1}^{t} q(xk | xk−1),
by removing the dependence on the past, essentially choosing a Markov process as the proposal. This
allows us to obtain a purely recursive weight computation,
W0:t(x0:t) = π̄(x0:t, y1:t) / q(x0:t) (6.10)
= [π̄(x0:t−1, y1:t−1) / q(x0:t−1)] × [f(xt|xt−1) g(yt|xt) / q(xt|xt−1)] (6.11)
= W0:t−1(x0:t−1) f(xt|xt−1) g(yt|xt) / q(xt|xt−1) (6.12)
= W0:t−1(x0:t−1) Wt(xt, xt−1), (6.13)
using only the samples from time t − 1 and time t. The advantage of this scheme is explicit in the
notation: Note that the final weight function Wt only depends on (xt , xt−1 ), but not the whole
past as in (6.9). The function Wt (xt , xt−1 ) is called the incremental weight function.
6.3.3 sequential importance sampling
We can now see how the one-step update of this sampler works given a new observation. Assume
that we have computed the unnormalised weights W1:t−1^(i) = W(x0:t−1^(i)) recursively and obtained
samples x0:t−1^(i). As we mentioned earlier, we only need the last sample xt−1^(i) to obtain the weight
update given in (6.13). Note also that W1:t−1^(i) for i = 1, . . . , N are just numbers; they do not
require the storage of previous samples. Given this, we can now sample from the Markov proposal
xt^(i) ∼ q(xt|xt−1^(i)) and compute the weights of the path sampler at time t as
W1:t^(i) = W1:t−1^(i) × Wt^(i),
where
Wt^(i) = f(xt^(i)|xt−1^(i)) g(yt|xt^(i)) / q(xt^(i)|xt−1^(i)).
In other words, given the samples xt−1^(i), we first perform the sampling step
xt^(i) ∼ q(xt|xt−1^(i)) and then perform the weighting step above.
The full scheme is given in Algorithm 14. This method is called sequential importance sampling
(SIS). This is not very popular in the literature due to the well known weight degeneracy problem.
We next introduce a resampling step to this method and will obtain the first particle filter in this
lecture.
Algorithm 14 Sequential Importance Sampling (SIS)
1: Sample x0^(i) ∼ q(x0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:   Sample: xt^(i) ∼ q(xt | xt−1^(i)),
4:   Compute weights:
       Wt^(i) = f(xt^(i)|xt−1^(i)) g(yt|xt^(i)) / q(xt^(i)|xt−1^(i)),
     and update
       W1:t^(i) = W1:t−1^(i) × Wt^(i).
     Normalise weights,
       w1:t^(i) = W1:t^(i) / Σ_{j=1}^{N} W1:t^(j).
5:   Report
       πt^N(x0:t) dx0:t = Σ_{i=1}^{N} w1:t^(i) δ_{x0:t^(i)}(x0:t) dx0:t.
6: end for
The SIS scheme without resampling easily degenerates, i.e., after some time, only a single weight approaches
1 and the others approach 0, rendering the method a point estimate. To keep the particle diversity, a
resampling step is introduced in between the weighting and sampling steps. This step does not
introduce a systematic bias, although it adds additional terms to the overall Lp error.
With the additional resampling step, the sequential importance sampling with resampling
(SISR) takes the form given in Algorithm 15. We note that, effectively, the resampling step sets
W1:t−1^(i) = 1/N for i = 1, . . . , N. Therefore, we only need to compute the last incremental
weight and weight our particles with the current weight. Also, note that the resampling step does
introduce extra error but does not induce bias, since the moments of πt^N do not change.
Algorithm 15 Sequential Importance Sampling with Resampling (SISR)
1: Sample x0^(i) ∼ q(x0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:   Sample: x̃t^(i) ∼ q(xt | xt−1^(i)),
4:   Compute weights:
       Wt^(i) = f(x̃t^(i)|xt−1^(i)) g(yt|x̃t^(i)) / q(x̃t^(i)|xt−1^(i)).
     Normalise weights,
       wt^(i) = Wt^(i) / Σ_{j=1}^{N} Wt^(j).
5:   Report
       πt^N(xt) dxt = Σ_{i=1}^{N} wt^(i) δ_{x̃t^(i)}(xt) dxt.
6:   Resample:
       xt^(i) ∼ Σ_{j=1}^{N} wt^(j) δ_{x̃t^(j)}(xt) dxt.
7: end for
The bootstrap particle filter (BPF), given in Algorithm 16, is obtained by using the transition density f(xt|xt−1) itself as the proposal in the SISR scheme. The intuition behind the BPF, however, goes beyond
the derivation we provided based on importance sampling here. It can be most generally thought of
as an evolutionary method. To uncover some of this intuition, see Fig. 6.2.
To elaborate the interpretation, consider a set of particles xt−1^(i) representing the state of the
system at time t − 1. If our state-space transition model f(xt|xt−1) is well specified (that is,
if the underlying system we aim at tracking does indeed move according to f), then the first
intuitive step we can do to predict where the state would be at time t is to move the particles
according to f, that is, sampling x̃t^(i) ∼ f(xt|xt−1^(i)), which is the first step of the BPF. This gives us
a predictive distribution which consists of x̃t^(i) for i = 1, . . . , N. The prediction step (naturally)
does not require observing the data point yt. Once we observe the data point yt, we can then
use it to evaluate a fitness measure for our particles. In other words, if a predictive
particle x̃t^(i) is a good fit to the observation, we would expect its likelihood g(yt|x̃t^(i)) to be high.
Otherwise, this likelihood would be low. Thus, it intuitively makes sense to use our likelihood
evaluations as "weights", that is, to compute a measure of fitness for each particle. That is exactly
what the BPF does at the second step by computing weights using the likelihood evaluations. The
final step is then to use these relative weights to resample – a step that is used to refine the cloud
of particles we have. Simply, the resampling step removes some of the particles with low weights
(that are bad fits to the observation) and regenerates the particles with high weights.
The connection to evolutionary methods is clearer within this interpretation. The sampling step
in the BPF can be seen as a "mutation" step that introduces changes to an individual particle according
to some mutation mechanism (in our case, the dynamics). Then, weighting and resampling
correspond to the "selection" step, where individual particles are evaluated w.r.t. a fitness measure
coming from the environment (defined by an observation) and individuals are reproduced in a
random manner w.r.t. their fitness.
Figure 6.2: Intuitive model of the BPF (figure courtesy of Víctor Elvira).
Algorithm 16 Bootstrap particle filter (BPF)
1: Sample x0^(i) ∼ q(x0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:   Sample: x̃t^(i) ∼ f(xt | xt−1^(i)),
4:   Compute weights:
       Wt^(i) = g(yt | x̃t^(i)).
     Normalise weights,
       wt^(i) = Wt^(i) / Σ_{j=1}^{N} Wt^(j).
5:   Report
       πt^N(xt) dxt = Σ_{i=1}^{N} wt^(i) δ_{x̃t^(i)}(xt) dxt.
6:   Resample:
       xt^(i) ∼ Σ_{j=1}^{N} wt^(j) δ_{x̃t^(j)}(xt) dxt.
7: end for
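A sketch of the BPF for a generic model might look as follows (sample_x0, sample_f, and loglik_g stand for user-supplied functions for the initial sampler, the transition sampler f(xt|xt−1), and the log-likelihood log g(yt|xt); these names, shapes, and the multinomial resampling choice are assumptions of this sketch):

```python
import numpy as np

def bootstrap_pf(ys, sample_x0, sample_f, loglik_g, N=1000, rng=None):
    """Bootstrap particle filter; returns filtering means and a log-marginal likelihood estimate."""
    rng = np.random.default_rng() if rng is None else rng
    x = sample_x0(N, rng)                         # (N, d) array of initial particles
    log_Z = 0.0
    means = []
    for y in ys:
        x = sample_f(x, rng)                      # propagate: x_t^(i) ~ f(. | x_{t-1}^(i))
        logw = loglik_g(y, x)                     # weight: log g(y_t | x_t^(i)), shape (N,)
        # logsumexp for the incremental marginal likelihood log p^N(y_t | y_{1:t-1})
        m = logw.max()
        log_Z += m + np.log(np.mean(np.exp(logw - m)))
        w = np.exp(logw - m)
        w /= w.sum()
        means.append((w[:, None] * x).sum(axis=0))
        idx = rng.choice(N, size=N, p=w)          # multinomial resampling
        x = x[idx]
    return np.array(means), log_Z
```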
In order to recursively compute p(y1:t ), we need to estimate p(yt |y1:t−1 ) using the BPF. This is
possible via the use of the predictive density
p(yt|y1:t−1) = ∫ g(yt|xt) ξ(xt|y1:t−1) dxt. (6.14)
We note that the predictive density can be built using the particles after the sampling step (as explained
above). In short, we can build the predictive density
ξ^N(xt|y1:t−1) dxt = (1/N) Σ_{i=1}^{N} δ_{x̃t^(i)}(xt) dxt,
if the resampling is done at every iteration (otherwise, the weights from the previous iteration
have to be used). Plugging this back into Eq. (6.14), we arrive at the empirical estimate of the
predictive density
p^N(yt|y1:t−1) = (1/N) Σ_{i=1}^{N} g(yt|x̃t^(i)),
and the marginal likelihood estimate follows as
p^N(y1:t) = ∏_{s=1}^{t} p^N(ys|y1:s−1). (6.15)
Remarkably, this estimate is unbiased (see Lemma 2 of Crisan et al. (2018) for a proof). This is
extremely useful for many things, including model selection.
Note that the computation of the marginal likelihood requires numerical care, as the product
of likelihoods can easily underflow. To avoid this, we can use the logsumexp trick and accumulate the
log-marginal likelihood as
log p^N(y1:t) = log p^N(y1:t−1) + log Σ_{i=1}^{N} exp(log g(yt|x̃t^(i))) − log N
= log p^N(y1:t−1) + log Σ_{i=1}^{N} exp(log Wt^(i)) − log N.
Now let us see how we can use this for parameter inference in state-space models. Assume that
we have a parameter θ that we would like to infer in the SSM. We further have a prior distribution
p(θ). The existence of parameter means that the transition density fθ (xt |xt−1 ) and the likelihood
gθ (yt |xt ) are now parameterised by θ. Let us for now fix θ and write the marginal likelihood of
the SSM as given in (6.1) for fixed θ as
p(y1:T|θ) = ∫ π̄θ(x0:T, y1:T) dx0:T.
It is obvious that for every θ, we can estimate this using the BPF and the estimator in (6.15). To
be completely explicit, for every θ, we can get estimators p^N(y1:T|θ) of the marginal likelihood
such that
p(y1:T|θ) = E[p^N(y1:T|θ)].
Assume next for simplicity that we have the knowledge of p(y1:T |θ) and p(θ) and want to design
a Metropolis-Hastings sampler for the posterior p(θ|y1:T ). As we have seen before, we can do
this via choosing a proposal q(θ0 |θ) and writing the Metropolis-Hastings method as
• Sample θ′ ∼ q(θ′|θn).
• Compute the acceptance ratio
α(θn, θ′) = min{1, p(y1:T|θ′) p(θ′) q(θn|θ′) / (p(y1:T|θn) p(θn) q(θ′|θn))}.
• With probability α(θn, θ′), accept θ′ and set θn+1 = θ′.
• Otherwise, reject θ′ and set θn+1 = θn.
We have everything we want to sample from the parameter posterior p(θ|y1:T ) except the fact that
we do not have p(y1:T |θ) in closed form. However, we can use the unbiased estimator pN (y1:T |θ)
instead. This would still leave the target invariant (Andrieu and Roberts, 2009; Andrieu et al.,
2010). Assume now that we have run the particle filter and obtained pN (y1:T |θn ) for the current
parameter θn . Then the particle MH algorithm can be summarised as
• Sample θ′ ∼ q(θ′|θn).
• Run a particle filter (e.g. the BPF) with the model parameterised by θ′ to obtain the estimate p^N(y1:T|θ′).
• With probability
min{1, p^N(y1:T|θ′) p(θ′) q(θn|θ′) / (p^N(y1:T|θn) p(θn) q(θ′|θn))},
set θn+1 = θ′ and keep p^N(y1:T|θ′); otherwise set θn+1 = θn and keep p^N(y1:T|θn).
Remarkably, this can work out of the box, in the sense that it functions as a valid MCMC method
and will sample from the marginal p(θ|y1:T) asymptotically!
6.5 examples
We will next consider some examples of the BPF in action.
Example 6.1 (Tracking a moving target in 2D). Let us assume that we would like to track a
2D moving target. In this experiment, we consider a tracking scenario
where a target is observed through sensors collecting radio signal strength (RSS) measurements
contaminated with additive heavy-tailed noise. The target dynamics are described by the model
xt = A xt−1 + ut,
where xt ∈ R⁴ denotes the target state, consisting of its position rt ∈ R² and its velocity vt ∈ R²,
hence xt = [rt, vt]ᵀ ∈ R⁴. Each element of the sequence {ut}_{t∈N} is a zero-mean Gaussian random
vector with covariance matrix Q. The parameters A and Q are selected as
A = [ I₂  κI₂ ; 0  0.99 I₂ ],
and
Q = [ (κ³/3) I₂  (κ²/2) I₂ ; (κ²/2) I₂  κ I₂ ],
where I2 is the 2 × 2 identity matrix and κ = 0.04. The observation model is given by
yt = Hxt + vt ,
where yt ∈ R2 is the measurement assumed to be noisy. Implement the particle filter for this
problem (see the code companion).
Example 6.2 (Tracking Lorenz 63 system). In this example, we consider the tracking of a discre-
tised stochastic Lorenz 63 system given by
x1,t = x1,t−1 − γ s (x1,t − x2,t) + √γ ξ1,t,
x2,t = x2,t−1 + γ (r x1,t − x2,t − x1,t x3,t) + √γ ξ2,t,
x3,t = x3,t−1 + γ (x1,t x2,t − b x3,t) + √γ ξ3,t,
where γ = 0.01, r = 28, b = 8/3, s = 10, and ξ1,t , ξ2,t , ξ3,t ∼ N (0, 1) are independent Gaussian
random variables. The observation model is given by
yt = [1, 0, 0]xt + ηt ,
where ηt ∼ N (0, σy2 ) is a Gaussian random variable. Implement the particle filter for this problem
(see the code companion).
BIBLIOGRAPHY
Agapiou, Sergios; Papaspiliopoulos, Omiros; Sanz-Alonso, Daniel; and Stuart, Andrew M. 2017.
Importance sampling: Intrinsic dimension and computational cost. In Statistical Science, pp.
405–431. Cited on p. 81.
Akyildiz, Omer Deniz. March 2019. Sequential and adaptive Bayesian computation for inference
and optimization. Ph.D. thesis, Universidad Carlos III de Madrid. Can be accessed from:
http://akyildiz.me/works/thesis.pdf. Cited on pp. 67, 69, and 77.
Akyildiz, Ömer Deniz and Míguez, Joaquín. 2021. Convergence rates for optimised adaptive
importance samplers. In Statistics and Computing, vol. 31, no. 2, pp. 1–17. Cited on pp. 80
and 81.
Andrieu, Christophe; Doucet, Arnaud; and Holenstein, Roman. 2010. Particle Markov chain
Monte Carlo methods. In Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342. Cited on pp. 137 and 138.
Andrieu, Christophe and Roberts, Gareth O. 2009. The pseudo-marginal approach for efficient
Monte Carlo computations. In . Cited on pp. 137 and 138.
Barber, David. 2012. Bayesian reasoning and machine learning. Cambridge University Press.
Cited on p. 57.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer. Cited on
p. 57.
Cemgil, A Taylan. 2014. A tutorial introduction to Monte Carlo methods, Markov Chain Monte
Carlo and particle filtering. In Academic Press Library in Signal Processing, vol. 1, pp. 1065–1114.
Cited on p. 105.
Crisan, Dan; Míguez, Joaquín; and Ríos-Muñoz, Gonzalo. 2018. On the performance of parallelisation schemes for particle filtering. In EURASIP Journal on Advances in Signal Processing,
vol. 2018, pp. 1–18. Cited on p. 137.
Douc, Randal; Moulines, Eric; Priouret, Pierre; and Soulier, Philippe. 2018. Markov chains.
Springer. Cited on pp. 98 and 102.
Douc, Randal; Moulines, Éric; and Stoffer, David. 2013. Nonlinear Time Series: Theory, Methods
and Applications with R Examples. Chapman & Hall. Cited on p. 98.
Doucet, Arnaud; Godsill, Simon; and Andrieu, Christophe. 2000. On sequential Monte Carlo
sampling methods for Bayesian filtering. In Statistics and computing, vol. 10, no. 3, pp. 197–208.
Cited on p. 127.
Elvira, Víctor; Martino, Luca; and Robert, Christian P. 2018. Rethinking the effective sample size.
In International Statistical Review. Cited on p. 90.
Hwang, Chii-Ruey. 1980. Laplace’s method revisited: weak convergence of probability measures.
In The Annals of Probability, pp. 1177–1182. Cited on p. 120.
Kalman, Rudolph Emil. 1960. A new approach to linear filtering and prediction problems. In
Journal of Fluids Engineering, vol. 82, no. 1, pp. 35–45. Cited on p. 128.
Lamberti, Roland; Petetin, Yohan; Septier, François; and Desbouvries, François. 2018. A double
proposal normalized importance sampling estimator. In 2018 IEEE Statistical Signal Processing
Workshop (SSP), pp. 238–242. IEEE. Cited on p. 80.
Martino, Luca; Luengo, David; and Míguez, Joaquín. 2018. Independent random sampling
methods. Springer. Cited on pp. 13, 22, 26, and 40.
Murphy, Kevin P. 2007. Conjugate Bayesian analysis of the Gaussian distribution. In def, vol. 1,
no. 2σ2, p. 16. Cited on pp. 44 and 54.
———. 2022. Probabilistic machine learning: an introduction. MIT press. Cited on p. 57.
Owen, Art B. 2013. Monte Carlo theory, methods and examples. Cited on p. 90.
Robert, Christian P and Casella, George. 2004. Monte Carlo statistical methods. Springer. Cited
on pp. 67 and 88.
———. 2010. Introducing Monte Carlo methods with R, vol. 18. Springer. Cited on p. 77.
Yıldırım, Sinan. 2017. Sabanci University IE 58001 Lecture notes: Simulation Methods for
Statistical Inference. Cited on pp. 32, 97, and 106.
Zhang, Ying; Akyildiz, Ömer Deniz; Damoulas, Theodoros; and Sabanis, Sotirios. 2023.
Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in
nonconvex optimization. In Applied Mathematics & Optimization, vol. 87, no. 2, p. 25. Cited on
p. 123.