
Stochastic Simulation

From Uniform Random Numbers to Generative Models

O. Deniz Akyildiz

2023
Department of Mathematics
Imperial College London
PREFACE

This text is about stochastic simulation methods that underpin most of modern statistical inference, machine learning, and engineering applications. The material of this course serves as an introduction to wide-ranging areas and applications, from inference and estimation in a broad class of statistical models to modern generative modelling applications.
The core of this course is about sampling from a probability distribution that may have an explicit, implicit, complicated, or intractable form. This problem arises in many fields of science and engineering; e.g., the probability distribution of interest can describe a posterior distribution in a statistical model or an unknown data distribution (in a generative model) from which we are interested in generating more samples. The sampling problem takes many forms, hence the solutions (sampling algorithms) are a broad topic, and this course is an introduction to such methods. In order to develop tools to tackle such problems, we will cover the basics of simulation, starting from something as basic as simulating an independent random number from simple distributions (i.e. the pseudo-sampling methods that underlie all stochastic simulations) and moving on to designing advanced sampling algorithms for more complicated models.
There will be one coursework, accounting for 25% of the credit. The upload date of this coursework
is as follows.
• Coursework (25%)
– Upload: 4 Dec. 2023 – Deadline: 11 Dec. 2023
• Final exam (75%)
Coursework will be the same for all students and the exam will have an extra question for M4R students
(will be clarified before the exam). The primary course material is the lecture notes and slides – however,
we will also assign additional (optional) readings or complementary chapters where necessary.
I hope that this course will strengthen your skills for conducting statistical research and help you become well-versed in the fields of sampling, statistical inference, and generative modelling, whether you want to be an academic researcher or a practitioner!

O. Deniz Akyildiz
London, 2023

CONTENTS

Preface i

Contents ii

1 Introduction 1
1.1 Introduction 1
1.1.1 Why is this course useful? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 What will be covered in this course? . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The Sampling Problem 2
1.2.1 Motivating example: Estimating π . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Probability and Measure: Recap 4
1.3.1 Density notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Basic Probability Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.4 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Exact Generation of Random Variates 12


2.1 Generating uniform random variates 13
2.2 Transformation Methods 14
2.2.1 Inverse Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Transformation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Box-Müller Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Rejection Sampling 22
2.3.1 Rejection sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Acceptance Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.3 Designing the optimal rejection sampler . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Composition 35
2.4.1 Sampling from Discrete Mixture Densities . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 Sampling from Conditional Densities . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.3 Sampling from Joint Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.4.4 Sampling from Continuous Mixtures or Marginalisation . . . . . . . . . . . . . . . 38
2.5 Sampling Multivariate Densities 40
2.5.1 Sampling a Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 Probabilistic Modelling and Inference 41


3.1 Introduction 41
3.2 The Bayes Rule and its Uses 41
3.3 Conditional Independence 51
3.3.1 Bayes Rule for Conditionally Independent Observations . . . . . . . . . . . . . . . 52
3.3.2 Conditional Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Marginal Likelihood 54
3.5 Sequential Bayesian Updating 56
3.6 Conclusion 57

4 Monte Carlo Integration 60


4.1 Introduction to Monte Carlo Integration 60
4.2 Error Metrics 68
4.3 Importance Sampling 72
4.3.1 Basic Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.2 Self-normalised importance sampling . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Bayesian Inference with Importance Sampling 82
4.5 Choosing the optimal importance sampling proposal within a family 84
4.6 Implementation, Algorithms, Diagnostics 88
4.6.1 Computing Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.2 Sampling Importance Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.3 Diagnostics for Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6.4 Mixture Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5 Markov Chain Monte Carlo 91


5.1 Discrete state space Markov chains 91
5.1.1 Irreducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.2 Recurrence and transience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.3 Invariant distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.1.4 Reversibility and Detailed Balance . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.1.5 Convergence to invariant distribution . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Continuous state space Markov chains 98
5.3 Metropolis-Hastings Algorithm 100
5.3.1 Independent proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2 Random walk (symmetric) proposals . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.3 Gradient based (Langevin) proposals . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.4 Bayesian inference with Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . 105

5.4 Gibbs sampling 109
5.4.1 Metropolis-within-Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5 Langevin MCMC methods 115
5.5.1 Stochastic Gradient Langevin Dynamics . . . . . . . . . . . . . . . . . . . . . . . 119
5.6 MCMC for Optimisation 120
5.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6.2 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6.3 Langevin MCMC for Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7 Monitoring and Postprocessing MCMC output 124
5.7.1 Trace plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.7.2 Autocorrelation plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.7.3 Effective sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.7.4 Thinning the MCMC output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6 Sequential Monte Carlo 126


6.1 Introduction 126
6.2 State-space models 127
6.2.1 The filtering problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 Sequential Monte Carlo for filtering 129
6.3.1 Importance sampling: Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3.2 Importance sampling for state-space models: The emergence of the general particle filter 130
6.3.3 Sequential importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3.4 Sequential importance sampling with resampling: The general particle filter . . . . . 132
6.3.5 The bootstrap particle filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.6 Practical implementation of the BPF . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3.7 Marginal Likelihood Computation with BPF . . . . . . . . . . . . . . . . . . . . . 135
6.4 Particle Markov chain Monte Carlo 137
6.5 Examples 138

Bibliography 141

1 INTRODUCTION
We introduce in this section the general ideas of this course, the notation, and the setup. We will also introduce the principles behind sampling and generative modelling, and try to answer the existential question from the beginning: why is this course useful?

1.1 introduction
This course is about simulating random variables or, put differently, sampling from probability distributions. This seemingly simple task arises in a large number of real-world scenarios and is embedded in many critical applications. Furthermore, we look into generating samples from dependent processes (e.g. stochastic processes) as well as sampling from intractable distributions.

1.1.1 why is this course useful?


This course is about simulating structured randomness. The course is a standard part of a Mathematics/Statistics curriculum, sometimes named differently, e.g., Monte Carlo methods or Computational Statistics. This course will teach you the essential skills of simulating random systems, and as such it can be very useful in a multitude of settings:

• Simulation of complex systems: It is often of interest to outright simulate systems to understand their behaviour. This is straightforward for simple systems, but it can get complicated when one wants to model intricate systems. This sort of “forward simulation” approach consists of designing the system and then simulating it. This is a very common approach in engineering and physics.

• Statistical inference: Now imagine a random system that is not fully known. This could be defined using a simulator or a generator as described above. But if some variables in this system are not known and we have some data on other variables, we can condition on the data and infer the latent/hidden variables. Simulation and sampling methods come in handy in these cases: by carefully formulating these estimation problems as simulation/sampling problems, we can estimate the unknown variables. We will cover this in detail in the course.
• Generative modelling: We are often interested in sampling from a completely unknown
probability measure p in the cases of generative models. Consider a given image dataset
{x1 , . . . , xn } and consider the problem of generating new images that mimic the properties
of the image dataset. This problem can be framed as a sampling procedure X ∼ p where
p is unknown but we have access to its samples {x1 , . . . , xn } in the form of a dataset.
Methods that address this challenge gave rise to very successful generative models, such as
DALLE-2 or ChatGPT. We will not cover generative models in this course, however. The
methods we introduce are the key foundations for working on these models.

1.1.2 what will be covered in this course?


We will first go through exact sampling methods. That is, we will learn the mechanisms that are usually under the hood when you call a function to generate random numbers. To begin, we will learn to generate random samples from simple distributions, e.g., uniform, Gaussian, and exponential. We will then move on to integration problems, one of the fundamental applications of sampling methods, and will learn to use these samplers to estimate integrals. This will be useful in the context of Bayesian models and inference. Then, we will move on to Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods, which are designed for intractable distributions that do not have a simple form.

1.2 the sampling problem


Given a probability distribution p, we are often interested in sampling from this distribution.
This is simply denoted as drawing

$$X^{(i)} \sim p, \qquad i = 1, \ldots, N.$$

The main goal here is to draw these samples as accurately as possible, as often we may not have
access to an exact sampling scheme. More technical applications of sampling include
• Integration. The first reason sampling from a measure is interesting is that, even though one may have access to the analytic expression of the density p, computing certain integrals with respect to this density may be intractable. This is required, e.g., for estimating tail probabilities. Sampling from a distribution provides a way to compute these integrals (via Monte Carlo integration, which will be introduced later in this course). This motivation also holds for the more general cases below.
• Intractable normalising constants. In many real-life modelling scenarios, we have access to a function p̄ such that
$$p(x) = \frac{\bar{p}(x)}{Z},$$
where the normalising constant Z is unknown. In these cases, designing a sampler may be non-trivial and this is a big research area.

• Generative models. Given a trained generative model, we still want to generate samples
as fast as possible.

We will first discuss exact sampling methods, which are only applicable to simple distributions. However, these will be crucial for advanced sampling techniques, as all sampling methods rely on being able to draw realistic samples from simple distributions such as the uniform or Gaussian distribution. We will then describe cases where exact sampling is not possible and introduce advanced sampling algorithms (such as Markov chain Monte Carlo (MCMC) methods) to tackle this problem.

1.2.1 motivating example: estimating π


We mentioned repeatedly above that we can use random sampling methods for various computational tasks. We can now demonstrate an intuitive application of this and use the idea of random variate generation to estimate π. This example is simple enough for us to understand the idea even if we have not seen any sampling methods yet.
Suppose that we want to obtain an estimate of π using computation. Normally, this number is available in any given software package; you can just query it, e.g., np.pi in Python. Let us assume that we do not have access to this number and we want to estimate it using a computer. We can do this through a neat geometric idea: Consider Fig. 1.1. We know that the area of a circle is given by $\pi r^2$ for any r. If we fit a square around the circle, then the square will have sides of length 2r and hence an area of $4r^2$. Given this, we can easily conclude that
$$\frac{\text{area of circle}}{\text{area of square}} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}.$$

Figure 1.1: The circle and square.

We can again easily imagine that if we had uniformly distributed points in this square, the ratio of the number of points within the circle to the total number of points would be approximately the same as the ratio of the areas. In other words, if we draw N points uniformly at random from the square, and $N_c$ of these points fall within the circle, then
$$\frac{N_c}{N} \approx \frac{\pi}{4}.$$
This is a very simple Monte Carlo estimate which comes with lots of guarantees; in particular, we know that as N → ∞, the estimate will converge to the true value of π.
To formalise this intuition, the trick at this stage is to phrase the question probabilistically. Can we convert this problem into estimating the probability of a set? Let us say, for clarity of the definition, the square is the set A = [−1, 1] × [−1, 1]. Let us define a 2-dimensional probability distribution, uniform on this set, as P = Unif([−1, 1]⊗2). This is a uniform distribution on the square. Naturally, we have that P(A) = 1. Now we can see that, if the circle is defined as another set
$$B = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \leq 1\},$$
then the probability of this set is
$$\mathbb{P}(B) = \frac{\text{area of circle}}{\text{area of square}} = \frac{\pi}{4}.$$
We then need to find a way to formally estimate the probability of this set using samples from P. This is formally done by converting the probability into an expectation,
$$\mathbb{P}(B) = \mathbb{E}_{\mathbb{P}}[\mathbf{1}_B(X)],$$
and estimating this expectation (integral) using samples. We will see later that this simple Monte Carlo procedure coincides with the intuitive solution: count the samples within the circle and compute the ratio to estimate π/4. The result of this estimation procedure is shown in Fig. 1.2.

Figure 1.2: Estimating π using the Monte Carlo method.
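To make this concrete, here is a minimal Python sketch of the estimator described above (the sample size N, the seed, and the use of NumPy's default generator are illustrative choices, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
# Draw N points uniformly on the square A = [-1, 1] x [-1, 1].
points = rng.uniform(-1.0, 1.0, size=(N, 2))
# Indicator of the set B = {(x, y) : x^2 + y^2 <= 1}.
in_circle = (points ** 2).sum(axis=1) <= 1.0
# Monte Carlo estimate of P(B) = pi / 4, hence of pi.
pi_hat = 4.0 * in_circle.mean()
print(pi_hat)  # close to np.pi for large N
```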

1.3 probability and measure: recap


Before we go and recap some of the necessary notions for this course, we briefly describe the
notation.

1.3.1 density notation


We will use p(x) as a generic probability distribution. Normally, in probability textbooks, the notation used is something like pX(x) for one random variable X and pY(y) for another random variable. This is done in order to stress the fact that the densities pX and pY are different. However, this becomes tedious when doing more complex modelling. For example, a simple case appears in the Bayes update for conditionals. In the usual notation, this is written as
$$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{p_Y(y)}.$$

Now consider even more general cases involving three or more variables and various dependences: this notation quickly becomes infeasible.
Throughout these notes and the course, we will use p(x) generically as a probability density of a random variable X. When we then write p(y), this will mean a different density of another random variable (say Y). If we write Bayes' formula in these terms,
$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)},$$
it is much cleaner. Of course, here, p(x|y) and p(y|x) are different functions, just as p(x) and p(y) are. We will however revert back to pX and pY in cases where it is necessary, such as for transformations of random variables.
We denote the expectation of a random variable by Ep[X] (or Ep X when there is no confusion). In general, we define the expectation of a function of a random variable X ∼ p as
$$p(\varphi) = \mathbb{E}_p[\varphi(X)] = \int \varphi(x)\, p(x)\, \mathrm{d}x.$$
Note that the notation p(ϕ) denotes this expectation; we will use it heavily in the sections on Monte Carlo integration. Also note that we abuse notation by denoting measures and densities with the same letter: for a probability measure p(dx), we denote its density by p(x). The cumulative distribution function (CDF) will generally be denoted as F(x). In this section, we review the basic probability theory that will be required for the rest of the course.

1.3.2 basic probability definitions


Let X be a random variable that takes values in a set X. We do not limit ourselves to any specific set X for now. We call this random variable a discrete random variable if X is a discrete set, i.e., a set with a finite or countable number of elements. We call it a continuous random variable if X is a set that is not countable. We will next define the associated probability distributions.
Let us start from the case where a random variable is discrete. This means the set X is either
finite or countable. A simple example is

X = {1, 2, 3, 4, 5, 6},

which could denote, for example, the possible outcomes of a die roll. Now we define the probability
mass function.

Definition 1.1 (Probability Mass Functions). When a random variable is discrete, the probability
mass function can be defined as

p(x) = P(X = x),

where x ∈ X. We call p(x) the probability mass function of X.

We note that, in the one-dimensional case, the probability mass function is typically represented as a vector of probabilities when it comes to computations. Consider the following example.

Example 1.1. Assume that X = {1, 2, 3, 4} and
$$p(x) = \begin{cases} 0.1 & \text{if } x = 1,\\ 0.2 & \text{if } x = 2,\\ 0.3 & \text{if } x = 3,\\ 0.4 & \text{if } x = 4.\end{cases}$$
Describe how you would represent such a distribution on a computer.

Solution: We can see this as a table of probabilities

X    P(X = x)
1    0.1
2    0.2
3    0.3
4    0.4

We can then define its states as the vector $s = (1, 2, 3, 4)^\top$ and its probabilities as the vector $p = (0.1, 0.2, 0.3, 0.4)^\top$, indexed by the discrete states. Of course, one can also define a dictionary (a Python data type) in order to have more complicated states for the random variable.
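As a small illustrative sketch (the array and variable names below are ours, not prescribed by the text), this representation in Python could look as follows:

```python
import numpy as np

# States and their probabilities, stored as parallel arrays.
s = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])
assert np.isclose(p.sum(), 1.0)

# A dictionary works equally well and allows richer state labels.
pmf = dict(zip(s.tolist(), p.tolist()))
print(pmf[3])  # 0.3
```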

Figure 1.3: A Gaussian example. Left: It can be seen that p(0) = 1.2615 while we know P(X = 0) = 0. This shows that density values can be bigger than 1, as they are not probabilities without integration. Right: One can only compute quantities like $P(-0.1 \leq X \leq 0.1) = \int_{-0.1}^{0.1} p(x)\,\mathrm{d}x = 0.248$.

Next we define the probability density function in the case of continuous random variables.

Definition 1.2 (Measure and density). Assume X ⊂ R and X ∈ X (for simplicity). Given the
random variable X, we define the measure of X as

P(x1 ≤ X ≤ x2 ) = P(X ∈ (x1 , x2 )).

The reason P called a measure is that it measures the probability of sets. We have then the probability
density function which has the following relationship with the probability measure
Z x2
P(x1 ≤ X ≤ x2 ) = p(x) dx.
x1

We call p(x) the probability density function of X.

Note that this notion generalises to higher dimensions straightforwardly. In the estimating π
example, we basically tried to “measure” the probability of the 2D set that was a circle. Another
important note about the measure/density distinction is demonstrated in the following remark.

Remark 1.1. Note that this difference in the continuous case matters. Consider the probability of
a point x0 ∈ R under a continuous probability distribution defined on R. Given its density p(x0 ),
we can surely evaluate p(x0 ) > 0, but this is not the probability of x0 under the distribution. The
probability of a point x0 is zero under a continuous distribution which follows from Definition 1.2
as x1 = x2 = x0 . An example is demonstrated in Fig. 1.3.

1.3.3 joint probability
We now define the joint probability distribution of two random variables X and Y . We will also
focus on the discrete case first, and then move to the continuous case.

Definition 1.3 (Discrete Joint Probability Mass Function). Let X and Y be random variables and
X and Y be the sets they live on. X and Y are at most countable sets. The joint probability mass
function of X and Y is

p(x, y) = P(X = x, Y = y).

We call p(x, y) the joint probability mass function of X and Y .

Example 1.2. Similar to the one dimensional case, we can now see the joint pmf p(x, y) as a
table of probabilities

Y =0 Y =1 Y =2 Y =3 pX (x)
X=0 1/6 1/6 0 0 2/6
X=1 1/6 0 1/6 0 2/6
X=2 0 0 1/6 0 1/6
X=3 0 0 0 1/6 1/6
pY (y) 2/6 1/6 2/6 1/6 1
Of course, on a computer we can represent this as a matrix of probabilities
$$\mathbf{P} = \begin{pmatrix} 1/6 & 1/6 & 0 & 0 \\ 1/6 & 0 & 1/6 & 0 \\ 0 & 0 & 1/6 & 0 \\ 0 & 0 & 0 & 1/6 \end{pmatrix}.$$
This allows us to perform simple computations for marginalisation, simply as sums of rows or columns. This is going to be a crucial tool when we study Markov models.
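A hedged sketch of this bookkeeping (the matrix below is the joint table of Example 1.2, with rows indexed by x and columns by y; everything else is illustrative):

```python
import numpy as np

# Joint pmf p(x, y): rows indexed by x = 0..3, columns by y = 0..3.
P = np.array([
    [1/6, 1/6, 0,   0  ],
    [1/6, 0,   1/6, 0  ],
    [0,   0,   1/6, 0  ],
    [0,   0,   0,   1/6],
])

p_x = P.sum(axis=1)   # marginal p(x): sum each row over y
p_y = P.sum(axis=0)   # marginal p(y): sum each column over x
print(p_x)            # [2/6, 2/6, 1/6, 1/6]
print(p_y)            # [2/6, 1/6, 2/6, 1/6]
```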

Let us finally define the probability density function p(x, y) for continuous variables.

Definition 1.4 (Continuous Joint Probability Density Function). Let X and Y be random variables
and X and Y be their ranges. We denote the joint probability measure as P(X ∈ A, Y ∈ B) and
the density function p(x, y) satisfies
$$\mathbb{P}(X \in A, Y \in B) = \int_A \int_B p(x, y)\, \mathrm{d}y\, \mathrm{d}x.$$

The marginal probability densities can be computed from the joint density as
$$p(x) = \int_{\mathsf{Y}} p(x, y)\, \mathrm{d}y, \qquad p(y) = \int_{\mathsf{X}} p(x, y)\, \mathrm{d}x.$$
From Fig. 1.4, you can see visualisations of a joint density p(x, y) and its marginals p(x) and p(y). Note that, given the joint, it is well defined to compute the marginals; given the marginals, however, it is not a trivial task to compute the joint density, and this is a big research area.

Figure 1.4: Visualisation of marginal densities and the joint density.

1.3.4 conditional probability


We now define the conditional probability of a random variable X given another random variable Y. As usual, we will first focus on the discrete case, and then move to the continuous case.

Definition 1.5 (Discrete Conditional Probability Mass Function). Let X and Y be random
variables and X and Y be their ranges. The conditional probability mass function of X given Y is

p(x | y) = P(X = x | Y = y).

We call p(x | y) the conditional probability mass function of X given Y .

Example 1.3. Compute the conditional probability mass function p(y|x = 2) from the table of
probabilities of p(x, y) given below.

X=0 X=1 X=2 X=3 pY (y)


Y =0 1/6 1/6 0 0 2/6
Y =1 1/6 0 1/6 0 2/6
Y =2 0 0 1/6 0 1/6
Y =3 0 0 0 1/6 1/6
pX (x) 2/6 1/6 2/6 1/6 1

Solution. Let us say we would like to compute P(Y = i | X = 2) for i = 0, 1, 2, 3. We can do this by simply dividing the joint probability mass function by the marginal probability mass function of X. Consider the following table

p(x, y)   X = 0   X = 1   X = 2   X = 3   pY (y)
Y = 0     1/6     1/6     0       0       2/6
Y = 1     1/6     0       1/6     0       2/6
Y = 2     0       0       1/6     0       1/6
Y = 3     0       0       0       1/6     1/6
pX (x)    2/6     1/6     2/6     1/6     1

where the entries in the X = 2 column are the joint probabilities of Y and X = 2. We can write the conditional probabilities as
$$\mathbb{P}(Y = 0 | X = 2) = \frac{\mathbb{P}(Y = 0, X = 2)}{\mathbb{P}(X = 2)} = \frac{0}{2/6} = 0,$$
$$\mathbb{P}(Y = 1 | X = 2) = \frac{\mathbb{P}(Y = 1, X = 2)}{\mathbb{P}(X = 2)} = \frac{1/6}{2/6} = 1/2,$$
$$\mathbb{P}(Y = 2 | X = 2) = \frac{\mathbb{P}(Y = 2, X = 2)}{\mathbb{P}(X = 2)} = \frac{1/6}{2/6} = 1/2,$$
$$\mathbb{P}(Y = 3 | X = 2) = \frac{\mathbb{P}(Y = 3, X = 2)}{\mathbb{P}(X = 2)} = \frac{0}{2/6} = 0.$$
As we can see, the conditional probability can also be represented as a vector
p = [0, 1/2, 1/2, 0]
for implementation purposes.

One can compute conditional probability tables from the joint probability table.

Example 1.4. Derive the conditional probability table from the joint probability table given above.

Solution. The conditional probability table is given as


p(y|x) X=0 X=1 X=2 X=3
Y =0 1/2 1 0 0
Y =1 1/2 0 1/2 0
Y =2 0 0 1/2 0
Y =3 0 0 0 1
Similarly, we can compute p(x|y) as
p(x|y) X=0 X=1 X=2 X=3
Y =0 1/2 1/2 0 0
Y =1 1/2 0 1/2 0
Y =2 0 0 1 0
Y =3 0 0 0 1
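A brief sketch of how these conditional tables can be computed by normalising the joint matrix (here the rows of P are indexed by y and the columns by x, matching the tables above; the code itself is illustrative):

```python
import numpy as np

# Joint pmf p(x, y): rows indexed by y = 0..3, columns by x = 0..3.
P = np.array([
    [1/6, 1/6, 0,   0  ],
    [1/6, 0,   1/6, 0  ],
    [0,   0,   1/6, 0  ],
    [0,   0,   0,   1/6],
])

p_x = P.sum(axis=0)                 # marginal p(x), one entry per column
p_y = P.sum(axis=1)                 # marginal p(y), one entry per row

cond_y_given_x = P / p_x            # p(y|x): divide each column by p(x)
cond_x_given_y = P / p_y[:, None]   # p(x|y): divide each row by p(y)

print(cond_y_given_x[:, 2])         # p(y | x = 2) = [0, 1/2, 1/2, 0]
```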

We next define the continuous conditional density given p(x, y).

Definition 1.6 (Continuous Conditional Probability Density Function). Let X and Y be random variables and X and Y be their ranges. The conditional probability density function of Y given X is
$$p(y|x) = \frac{p(x, y)}{p(x)},$$
and, similarly, the conditional probability density function of X given Y is
$$p(x|y) = \frac{p(x, y)}{p(y)}.$$

2 EXACT GENERATION OF RANDOM VARIATES
In this section, we will focus on exact sampling from certain classes of distributions, above all the uniform distribution. This chapter aims to develop an understanding of the basis of all simulation algorithms.

One of the central pillars of sampling algorithms is the ability to sample from the uniform distribution. This may sound straightforward; however, it is surprisingly difficult to sample a real uniform random number (Devroye, 1986). If the aim is to generate these numbers on a computer, one has to “listen” to some randomness1 (e.g. thermal noise in the circuits of the computer) and, even then, these random numbers have no guarantee of following a uniform distribution. Therefore, much of the literature is devoted to pseudo-uniform random number generation. Furthermore, the generation of random variables that follow popular distributions in statistics (such as the Gaussian or exponential distribution) also requires pseudo-uniform random numbers, as we will see in the next sections.

Definition 2.1. A sequence of pseudo-random numbers u1, u2, . . . is a deterministic sequence of numbers whose statistical properties match those of a sequence of random numbers from a desired distribution.

Why would we use pseudo-random numbers? They are (i) easier, quicker, and cheaper to generate, and, more importantly, (ii) repeatable. This provides a crucial experimental advantage when it comes to testing algorithms based on random numbers, as we can re-run the same code (with the same pseudo-random numbers), e.g., for debugging.
In what follows, we will describe different methods for pseudo-uniform random number generation that can be used in practice.
1 See https://www.random.org/ if you need real random numbers.

Figure 2.1: Top Left: Samples generated by an LCG with parameters m = 2048, a = 43, b = 0, x0 = 1. Top Middle: The histogram of the samples, showing that it conforms to the uniform distribution. Top Right: However, plotting (u_k, u_{k+1}) pairs shows that the samples are not random enough. Bottom Left: Samples generated by an LCG with parameters m = 2^32, a = 1664525, b = 1013904223. Bottom Middle: The histogram of the samples, showing that it conforms to the uniform distribution. Bottom Right: Plotting (u_k, u_{k+1}) pairs shows no apparent structure. (The three columns show the sequence, its histogram, and u_{k+1} versus u_k.)

2.1 generating uniform random variates


The most popular (in practice) uniform random number generator is called the linear congruential generator (LCG). This method generates random numbers using a linear recursion
$$x_{n+1} \equiv a x_n + b \pmod{m},$$
where x0 is the seed, m is the modulus of the recursion, b is the shift, and a is the multiplier. If b = 0, the LCG is called a multiplicative generator, and it is called a mixed generator when b ≠ 0. We set m to be an integer and choose x0, a, b ∈ {0, . . . , m − 1}. Defined this way, the recursion defines a class of generators and we have xn ∈ {0, 1, . . . , m − 1}. The uniform numbers are then generated as
$$u_n = \frac{x_n}{m} \in [0, 1) \quad \forall n.$$
The sequence (u_n)_{n≥0} is periodic with period T ≤ m (Martino et al., 2018). The goal is to choose the parameters a and b so that the sequence has the largest possible period, ideally T = m (full period). The choice of the modulus is determined by the precision, e.g., m ≈ 2^32 for single precision, and so on. A minimal implementation is sketched below.
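The following is a minimal sketch of such a generator (the function name is ours; the default parameters are the ones quoted for the bottom row of Fig. 2.1):

```python
def lcg(n, m=2**32, a=1664525, b=1013904223, x0=1):
    """Generate n pseudo-uniform numbers in [0, 1) with a linear congruential generator."""
    x = x0
    us = []
    for _ in range(n):
        x = (a * x + b) % m   # the linear recursion modulo m
        us.append(x / m)      # map the state to [0, 1)
    return us

print(lcg(5))
```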
There are known shortcomings of LCGs, in particular that they generate periodic sequences; an illustration is given in Fig. 2.1. If care is not taken, this can cause significant problems in applications. As a typical issue, we demonstrate the impact of poor random samples on other random number generators. We will see in the following sections that, for example, drawing Gaussian samples requires uniform random numbers. If we use the random samples generated in the top row of Fig. 2.1, we will get bad samples from a “Gaussian distribution”; see the leftmost two figures of Fig. 2.2. Similarly, if we choose the better parameterisation m = 2^32, a = 1664525, b = 1013904223, whose samples are shown in the bottom row of Fig. 2.1, then we get much better samples, as can be seen from the rightmost two figures of Fig. 2.2.

Figure 2.2: Left two figures: Gaussian samples generated using the samples in the top row of Fig. 2.1. Right two figures: Gaussian samples generated using the samples in the bottom row of Fig. 2.1 with the better parameterisation m = 2^32, a = 1664525, b = 1013904223.
Unfortunately, LCGs fail to produce good samples in high dimensions. The standard implemented method nowadays is the Mersenne–Twister algorithm, which is not an LCG. For the rest of the course, we will use rng.uniform(0, 1) to draw uniform random numbers, where rng is a properly initialised RNG (see the online companion), as a suboptimal implementation can impact our simulation schemes.

2.2 transformation methods


Given a pseudo-uniform random number generator, we can now describe some exact sampling methods. In this section, we will describe transformation methods, where a uniform random number
$$U \sim \text{Unif}(0, 1)$$
is transformed into a prescribed random variable Y = g(U) using a deterministic transform g. Note that, in almost all of our exact sampling methods for nonuniform densities, we need a uniform random number generator; hence, the samplers above are of crucial importance for stochastic simulation applications.
We will next start with the inversion method.

2.2.1 inverse transform


This method considers the cumulative distribution function (CDF) F of a density p to draw
samples from p given access to uniform random numbers. The intuition of the method is best
seen on a discrete example first. Assume p is a discrete distribution on some finite set X taking
values x1, x2, x3, . . .. The CDF is an increasing staircase function whose spacing on the y-axis reflects the probabilities. In other words, if we draw U ∼ Unif(0, 1) and invert it through the CDF, we will choose x1, x2, . . . according to their probabilities (see Fig. 2.3).

Figure 2.3: The inversion technique. The black function is the probability mass function p, whereas the red function is the CDF F. One can see that drawing a uniform random number on the y-axis and following the inverse of the CDF ensures that we draw the samples {1, . . . , 4} according to their probabilities.
This follows from a more general result called the probability integral transform.

Theorem 2.1. Consider a random variable X with CDF F_X. Then the random variable $F_X^{-1}(U)$, where U ∼ Unif(0, 1), has the same distribution as X.

Proof. The proof is one line:
$$\mathbb{P}(F_X^{-1}(U) \leq x) = \mathbb{P}(U \leq F_X(x)) = F_X(x),$$
which is the CDF of the target distribution.2

2 Note that the above result is written for the case where $F_X^{-1}$ exists, i.e., the CDF is continuous and strictly increasing. If this is not the case, one can define the generalised inverse $F_X^{-1}(u) = \min\{x : F_X(x) \geq u\}$, for which the result holds.

This then suggests a sampling procedure for distributions where we can compute $F_X^{-1}$.

Example 2.1. (Discrete distribution) Let p be a discrete probability distribution defined on the set S = {s1, . . . , sK} (the states) with probabilities p(s_k) = w_k and $\sum_{k=1}^{K} w_k = 1$. Provide the inversion method.

Algorithm 1 Pseudocode for inverse transform sampling
1: Input: The number of samples n
2: for i = 1, . . . , n do
3:   Generate U_i ∼ Unif(0, 1)
4:   Set $X_i = F_X^{-1}(U_i)$
5: end for

Solution. In this case, the CDF is not continuous and the procedure is summarised in Fig. 2.3. Let us call X our random variable to be sampled, X ∼ p. Then the sampling procedure described above becomes (with the generalised inverse):

• Generate U_i ∼ Unif(0, 1).
• Find $k_i = \min\{k : F_X(s_k) \geq U_i\}$.
• Set $X_i = s_{k_i}$.

This corresponds to something simple: sample U_i and find the first state s_k that gives $F_X(s_k) \geq U_i$. Note that the Bernoulli distribution corresponds to a special case of this with s1 = 0, s2 = 1 (see Fig. 2.3).
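A small sketch of this discrete inversion (names are illustrative; np.searchsorted returns the first index whose cumulative weight is at least u):

```python
import numpy as np

rng = np.random.default_rng(1)

s = np.array([1, 2, 3, 4])           # states s_1, ..., s_K
w = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities w_1, ..., w_K
cdf = np.cumsum(w)                   # F_X(s_k)

def sample_discrete(n):
    u = rng.uniform(0.0, 1.0, size=n)
    idx = np.searchsorted(cdf, u)    # smallest k with F_X(s_k) >= u
    return s[idx]

print(sample_discrete(10))
```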

Example 2.2. (Exponential) Describe a sampler to generate X ∼ Exp(λ), where Exp(x; λ) = λe^{−λx}.

Solution. We calculate the CDF:
$$F_X(x) = \int_0^x p(x')\, \mathrm{d}x' = \int_0^x \lambda e^{-\lambda x'}\, \mathrm{d}x' = \lambda \left[-\frac{1}{\lambda} e^{-\lambda x'}\right]_{x'=0}^{x} = 1 - e^{-\lambda x}.$$
Given $F_X(x) = 1 - e^{-\lambda x}$, we derive the inverse:
$$U = 1 - e^{-\lambda X} \;\implies\; X = -\frac{1}{\lambda}\log(1 - U) \;\implies\; F_X^{-1}(U) = -\lambda^{-1} \log(1 - U).$$
The algorithm to draw samples from the exponential distribution is then:

• Generate U_i ∼ Unif(0, 1).
• Set $X_i = -\lambda^{-1} \log(1 - U_i)$.
Example 2.3. (Cauchy) Assume we want X ∼ Cauchy, where the probability density is given as
$$p_X(x) = \frac{1}{\pi (1 + x^2)}.$$
Describe the sampler using the inversion method.

Solution. The CDF is analytically available and given as
$$F_X(x) = \int_{-\infty}^{x} p_X(y)\, \mathrm{d}y = \frac{1}{2} + \frac{1}{\pi} \tan^{-1} x.$$
Furthermore, the inverse is also available:
$$F_X^{-1}(u) = \tan\left(\pi \left(u - \frac{1}{2}\right)\right).$$
Given this, we can provide the algorithm (this should be obvious now!).

• Generate U_i ∼ Unif(0, 1).
• Set $X_i = \tan\left(\pi \left(U_i - \frac{1}{2}\right)\right)$.

Example 2.4. (Poisson) Consider the Poisson distribution
$$\mathbb{P}(X = k) = \text{Pois}(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$
Describe the sampler using the inversion method.

Solution. The CDF is given as
$$F_X(k) = \mathbb{P}(X \leq k) = e^{-\lambda} \sum_{i=0}^{k} \frac{\lambda^i}{i!}.$$
This is similar to the discrete case.

• Sample U ∼ Unif(0, 1).
• Find the smallest k such that $F_X(k) \geq U$;

then k ∈ N is our sample.
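A minimal sketch of this search for a single Poisson draw, accumulating the CDF term by term (the function name and the pmf recursion are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_poisson(lam):
    """Draw one sample from Pois(lam) by inverting the CDF."""
    u = rng.uniform(0.0, 1.0)
    k = 0
    pmf = np.exp(-lam)        # Pois(0; lam)
    cdf = pmf
    while cdf < u:            # find the smallest k with F_X(k) >= u
        k += 1
        pmf *= lam / k        # Pois(k; lam) = Pois(k - 1; lam) * lam / k
        cdf += pmf
    return k

samples = [sample_poisson(3.0) for _ in range(10_000)]
print(np.mean(samples))       # close to lam = 3.0
```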

While this is a useful technique for sampling from many distributions, it is limited to cases where $F_X^{-1}$ is available in closed form, which is a very stringent condition. For example, consider the problem of sampling a standard normal, i.e., X ∼ N(0, 1). We know that the CDF is
$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} y^2}\, \mathrm{d}y.$$
We cannot write $F_X^{-1}$ in closed form. Fortunately, for certain special cases, we can use another transformation to sample.

2.2.2 transformation method


The transformation method is a generalisation of the inversion method: one can generalise the idea of sampling U and passing it through $F_X^{-1}$ to using a more general transform g. In this case, we can describe the sampling procedure as in the following algorithm.

Algorithm 2 Pseudocode for transformation method


1: Input: The number of samples n.
2: for i = 1, . . . , n do
3: Generate Ui ∼ Unif(0, 1)
4: Set Xi = g(Ui )
5: end for

Of course, designing g is the crucial aspect of this method, and it depends on the goal of the sampling procedure. A crucial tool for understanding what happens under this kind of transformation is the transformation-of-random-variables formula, described below.

Remark 2.1. The transformation of random variables formula is an important formula for us, describing how probability densities change when we transform random variables. In other words, assume X ∼ pX(x) and Y = g(X); then pY(y) is a density related to the density of X. The exact formula depends on the dimension of the random variables. For the one-dimensional case, the relationship is given by
$$p_Y(y) = p_X(g^{-1}(y)) \left| \frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} \right|. \tag{2.1}$$
This formula is simpler than it looks. One needs to explicitly find g^{-1} first (here is the weakness of this approach). Provided that, writing down $p_X(g^{-1}(y))$ is simple (just write down the density of X, evaluated at g^{-1}(y)). The derivative is also often simple to compute, and so is its absolute value.
For multidimensional (say n-dimensional) random variables (we will see one 2D example below), the formula is equally compact and simple; however, the computations might become more involved. It is simply given as
$$p_Y(y) = p_X(g^{-1}(y))\, \left|\det J_{g^{-1}}(y)\right|, \tag{2.2}$$
where the first term on the r.h.s. is similar to the above and the last term is the determinant of the Jacobian. In this case, the Jacobian is given as
$$J_{g^{-1}} = \begin{pmatrix} \partial g_1^{-1}/\partial y_1 & \cdots & \partial g_1^{-1}/\partial y_n \\ \vdots & \ddots & \vdots \\ \partial g_n^{-1}/\partial y_1 & \cdots & \partial g_n^{-1}/\partial y_n \end{pmatrix},$$
where $g^{-1} = (g_1^{-1}, \ldots, g_n^{-1})$ is a multivariate function. In this lecture, we will not need this formula for more than 2D, and this case is exemplified in the examples.

Next, we consider the example where we develop the method to sample Gaussian random
variates.

Example 2.5. Let X1 and X2 be independent random variables where
$$X_1 \sim \text{Exp}\left(\tfrac{1}{2}\right), \qquad X_2 \sim \text{Unif}(-\pi, \pi).$$
Then, prove that $Y_1 = \sqrt{X_1} \cos X_2$ and $Y_2 = \sqrt{X_1} \sin X_2$ are independent and standard normal.

Solution. This is a transformation method with
$$(y_1, y_2) = g(x_1, x_2) = (\sqrt{x_1} \cos x_2, \sqrt{x_1} \sin x_2).$$
We use the transformation of random variables formula (from standard probability)
$$p_{y_1, y_2}(y_1, y_2) = p_{x_1, x_2}(g^{-1}(y_1, y_2))\, \left|\det J_{g^{-1}}(y_1, y_2)\right|, \tag{2.3}$$
where $J_{g^{-1}}$ is the Jacobian of the inverse. We next derive $g^{-1}$ by writing
$$x_1 = y_1^2 + y_2^2, \qquad \text{as } \cos^2 x_2 + \sin^2 x_2 = 1,$$
and
$$\frac{\sin x_2}{\cos x_2} = \frac{y_2}{y_1},$$
which leads to
$$x_2 = \arctan(y_2 / y_1).$$
Therefore, $g^{-1} : \mathbb{R}^2 \to \mathbb{R}^2$ is
$$g^{-1}(y_1, y_2) = (g_1^{-1}, g_2^{-1}) = \left(y_1^2 + y_2^2,\; \arctan(y_2/y_1)\right).$$
We now compute the Jacobian
$$J_{g^{-1}} = \begin{pmatrix} \partial g_1^{-1}/\partial y_1 & \partial g_1^{-1}/\partial y_2 \\ \partial g_2^{-1}/\partial y_1 & \partial g_2^{-1}/\partial y_2 \end{pmatrix} = \begin{pmatrix} 2 y_1 & 2 y_2 \\ \frac{1}{1 + (y_2/y_1)^2} \frac{-y_2}{y_1^2} & \frac{1}{1 + (y_2/y_1)^2} \frac{1}{y_1} \end{pmatrix}.$$
Hence, the absolute value of the determinant is
$$\left|\det J_{g^{-1}}\right| = 2.$$
From the transformation of random variables formula,
$$p_{y_1, y_2}(y_1, y_2) = \text{Exp}(g_1^{-1}; 1/2)\, \text{Unif}(g_2^{-1}; -\pi, \pi)\, \left|\det J_{g^{-1}}\right| = \frac{1}{2} e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, \frac{1}{2\pi}\, 2 = \mathcal{N}(y_1; 0, 1)\, \mathcal{N}(y_2; 0, 1),$$
which concludes the proof.

Example 2.6. (Sampling uniformly on a circle) Consider drawing
$$r \sim \text{Unif}(0, 1), \qquad \theta \sim \text{Unif}(-\pi, \pi),$$
and setting $x_1 = \sqrt{r} \cos\theta$, $x_2 = \sqrt{r} \sin\theta$. Prove that this results in a uniform distribution within the unit circle.

Solution. Using the same formula derived in the previous proof, we can describe a scheme to sample uniformly within a circle. We define
$$x_1 = \sqrt{r} \cos\theta, \qquad x_2 = \sqrt{r} \sin\theta.$$
Using Eq. (2.3), we derive
$$p_{x_1, x_2}(x_1, x_2) = p_{r, \theta}(g^{-1}(x_1, x_2))\, \left|\det J_{g^{-1}}(x_1, x_2)\right|.$$
Since we use the same transformation as in Example 2.5, we have the Jacobian $|\det J_{g^{-1}}| = 2$. We can then write
$$p_{x_1, x_2}(x_1, x_2) = \text{Unif}(x_1^2 + x_2^2; 0, 1)\, \text{Unif}(\arctan(x_2/x_1); -\pi, \pi)\; 2.$$
If we pay attention to the first uniform distribution in the above formula, we see that it equals 1 when $x_1^2 + x_2^2 < 1$. The second factor involves the arctan, which takes values in (−π/2, π/2), which means we always have $\text{Unif}(\arctan(x_2/x_1); -\pi, \pi) = 1/2\pi$. This results in
$$p_{x_1, x_2}(x_1, x_2) = \frac{1}{\pi} \quad \text{for } x_1^2 + x_2^2 < 1,$$
and 0 otherwise, which is the uniform distribution within a circle. See Fig. 2.4 for some examples (and alternatives discussed in the class).


Figure 2.4: On the left, one can see the samples with the correct scaling √r. Some other intuitive formulas result in a non-uniform distribution.
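As a short sketch of this example (sample size and seed are arbitrary), comparing the correct √r scaling with the naive scaling by r:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 5_000
r = rng.uniform(0.0, 1.0, size=n)
theta = rng.uniform(-np.pi, np.pi, size=n)

# Correct scaling: radius sqrt(r) gives uniform samples within the unit circle.
x1, x2 = np.sqrt(r) * np.cos(theta), np.sqrt(r) * np.sin(theta)

# Naive scaling: radius r concentrates the samples near the centre.
y1, y2 = r * np.cos(theta), r * np.sin(theta)
```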

We next consider the Gaussian example.

Example 2.7. Describe a transformation method to sample Y ∼ N(y; µ, σ²) using X ∼ N(x; 0, 1), and prove that Y ∼ N(y; µ, σ²) as desired.

Solution. This is a simple demonstration of the transformation of random variables formula (in one dimension). Let X ∼ N(x; 0, 1), where
$$\mathcal{N}(x; 0, 1) = p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \tag{2.4}$$
Now let Y = σX + µ, which is our transformation, i.e.,
$$g(x) = \sigma x + \mu.$$
This is intuitive, as we first scale the random variable by σ to increase or decrease its variability (variance) and then add the mean µ. The transformation formula in 1D is simpler:
$$p_Y(y) = p_X(g^{-1}(y)) \left| \frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} \right|, \tag{2.5}$$
where we have the absolute value of the derivative of the inverse function g^{-1}(y). This is easy to derive by solving for x starting from y = g(x) = σx + µ:
$$x = \frac{y - \mu}{\sigma} = g^{-1}(y).$$
The derivative is then given by
$$\frac{\mathrm{d} g^{-1}(y)}{\mathrm{d} y} = \frac{1}{\sigma}.$$
Then, using Eq. (2.5), we obtain
$$p_Y(y) = p_X(g^{-1}(y))\, \frac{1}{\sigma} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right),$$
as σ > 0, by using Eq. (2.4) and plugging in x = (y − µ)/σ. We recognise the expression $p_Y(y) = \mathcal{N}(y; \mu, \sigma^2)$.

2.2.3 box-müller transform


The Box-Müller transform is related to the transformation above, but it provides a way to sample Gaussians directly from uniforms. In this case, we will just provide the algorithm.
Let U1, U2 ∼ Unif(0, 1) be independent. Then
$$Z_1 = \sqrt{-2 \log U_1}\, \cos(2\pi U_2), \qquad Z_2 = \sqrt{-2 \log U_1}\, \sin(2\pi U_2)$$
are independent standard normal random variables, i.e., Z1, Z2 ∼ N(0, 1). This transformation is called the Box-Müller transform and is a standard way to draw Gaussian samples in practice. For correlated samples, we can use the Cholesky decomposition of the covariance matrix to transform the samples, as will be covered in Section 2.5.
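A minimal sketch of the transform (sample size and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 100_000
u1 = rng.uniform(0.0, 1.0, size=n)
u2 = rng.uniform(0.0, 1.0, size=n)

z1 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)
z2 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)

print(z1.mean(), z1.std())   # approximately 0 and 1
```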

2.3 rejection sampling


Inversion and the more general transformation method depend on the existence of specific
transformations. Given a general p(x), we may not have a specific transformation derived from
simpler distributions or an inverse transform. We can still devise sampling methods in this case
(in fact, there are hundreds of them). In this section, we will look into one specific class called
rejection samplers.
This specific class of methods is powered by a principle: if one can sample (x, y) pairs that are uniformly distributed under the area of p(x), then the x-marginal of these samples comes exactly from p(x). We formalise this intuition in the next theorem, adapted from Martino et al. (2018).

Theorem 2.2 (Fundamental Theorem of Simulation). (Martino et al., 2018, Theorem 2.2) Drawing samples from a one-dimensional random variable X with a density p(x) ∝ p̄(x) is equivalent to sampling uniformly on the two-dimensional region defined by
$$A = \{(x, y) \in \mathbb{R}^2 : 0 \leq y \leq \bar{p}(x)\}. \tag{2.6}$$
In other words, if (x′, y′) is uniformly distributed on A, then x′ is a sample from p(x).

Figure 2.5: On the left, you can see a mixture of Gaussians (we will cover mixture distributions later) and samples uniformly distributed below the curve. Each black dot under the curve is an (x, y) pair, hence you could denote those samples (X_i, Y_i). On the right, you can see the histogram of the x-marginal, i.e., of the X_i samples only. This is an empirical demonstration of Theorem 2.2.

Proof. Consider the pair (X, Y) uniformly distributed on the region A, with joint density
$$q(x, y) = \frac{1}{|A|} \quad \text{for } (x, y) \in A, \tag{2.7}$$
where |A| is the area of the set A. We note that
$$p(x) = \frac{\bar{p}(x)}{|A|}.$$
We use the standard factorisation of the joint density, q(x, y) = q(y|x) q(x). Note that, since (X, Y) is uniform on A, for fixed x we have
$$q(y|x) = \frac{1}{\bar{p}(x)} \quad \text{for } (x, y) \in A.$$
We therefore write
$$q(x, y) = q(y|x)\, q(x) = \frac{q(x)}{\bar{p}(x)} \quad \text{for } (x, y) \in A. \tag{2.8}$$
We now consider (2.7) and (2.8), which are both valid on (x, y) ∈ A. Combining them gives
$$q(x) = \frac{\bar{p}(x)}{|A|},$$
which means q(x) = p(x).


An illustration of this theorem can be seen in Fig. 2.5. This theorem extends to cases where
we do not have the p(x) exactly, but only have access to an unnormalised version (a very practical
issue, as we will see in the following sections).
We will now describe some numerical methods which utilise the fact that if we manage to sample uniformly under the area of a curve, then we can sample from the corresponding probability density. Theorem 2.2 suggests a quite intuitive sampling procedure: we can sample uniformly under the area of a density (or even an unnormalised non-negative curve) and obtain samples from the (normalised) probability density by keeping the samples on the x-axis (this is sometimes called the x-marginal). One simple example that does this is described below.

Figure 2.6: On the left, we plot the accepted samples (scattered) under the curve. On the right, we show the histogram of the x-marginal of these samples (so we just keep the first dimension of the two-dimensional array).

Example 2.8 (Beta density under a box). Consider the Beta density
$$p(x) = \text{Beta}(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},$$
where Γ(n) = (n − 1)! for integers. Design a sampler that samples uniformly under the curve p(x) = Beta(x; 2, 2).

Solution. In order to design a uniform sampler under the area of Beta(2, 2), we can use its special properties. For example, the Beta density is defined on [0, 1], which makes it easy to design a uniform sampler. The simplest choice is to design a “box” over the density. In order to design this box, we require the maximum of the density,
$$p^\star = \max_x \text{Beta}(x; 2, 2) = 1.5.$$
We are of course lucky to have this number, which could be difficult to find analytically in general. In this case, we can design the box $[0, 1] \times [0, p^\star]$ and draw uniform random samples in this box. Let us suggestively denote these samples
$$(X', U') \sim \text{Unif}([0, 1] \times [0, p^\star]).$$
We can then check whether these samples are under the Beta density curve, which can be done by checking
$$U' \leq p(X'),$$
and accepting the sample if this condition holds. Fig. 2.6 shows the result of this procedure together with the histogram.
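A short sketch of this boxed sampler (the number of proposals and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

def beta22(x):
    # Beta(2, 2) density: Gamma(4) / (Gamma(2) Gamma(2)) = 6
    return 6.0 * x * (1.0 - x)

p_star = 1.5                                # max of Beta(x; 2, 2), attained at x = 1/2
n = 10_000
x_prop = rng.uniform(0.0, 1.0, size=n)      # X' uniform on [0, 1]
u_prop = rng.uniform(0.0, p_star, size=n)   # U' uniform on [0, p_star]
accepted = x_prop[u_prop <= beta22(x_prop)]

print(len(accepted) / n)                    # fraction kept, roughly 1 / 1.5
```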

2.3.1 rejection sampler
The box example is nice; however, it is not optimal: it might be difficult to find a box of that type, and for peaky densities this could be horribly inefficient. We can, however, identify another probability density that we can sample from, denoted q(x), which may cover our target density much better.
Rejection sampling is an algorithm that does just that: we identify a q(x) to cover our target density p(x). Of course, because p(x) and q(x) are both probability densities, q(x) can never entirely cover p(x). However, it will be sufficient for us if we can find an M such that
$$p(x) \leq M q(x),$$
so that the scaled version of q(x) with M > 1 entirely covers p(x). Depending on the choice of the proposal, the procedure will be much more efficient than simple boxing. Let us describe the conceptual algorithm.

• Generate $X_i' \sim q(x)$.
• Accept with probability
$$a(X_i') = \frac{p(X_i')}{M q(X_i')} \leq 1. \tag{2.9}$$

This algorithm might look simpler than what you would expect. Above we mentioned drawing samples uniformly under the curve; however, a simple look at the steps might not reveal the fact that this is precisely what this algorithm is doing. Let us look into this more carefully. The rejection sampler first generates X′ ∼ q(x); let us fix its value X′ = x′.3 Then, in order to implement the accept step, we generate U ∼ Unif(0, 1) and accept the sample if
$$U \leq a(x') = \frac{p(x')}{M q(x')}.$$
This is what “accept with probability a(x′)” means. A closer look reveals that we could also write this (by rearranging the above inequality) as
$$M q(x')\, U \leq p(x').$$
In other words, the left-hand side of this inequality is a uniform random variable multiplied by M q(x′), so we could define U′ = M q(x′) U with
$$U' \sim \text{Unif}(0, M q(x')),$$
since U ∼ Unif(0, 1). Finally, you can see what the algorithm is doing behind the scenes:

• Sample X′ ∼ q(x′).
• Sample U′ ∼ Unif(0, M q(X′)).
• Accept if U′ ≤ p(X′).

This is exactly drawing an (X′, U′) pair and accepting the sample if it lies under the curve of p. By Theorem 2.2, this samples from the correct distribution!
So far we have written a few different versions of the method. The implementation, however, is done according to Algorithm 3.

3 This is usual in probability: capital letters are random variables, and it is better to fix their values after they are sampled (they are then deterministic).

Algorithm 3 Pseudocode for rejection sampling
1: Input: The number of iterations n and the scaling factor M.
2: for i = 1, . . . , n do
3:   Generate X′ ∼ q(x′)
4:   Generate U ∼ Unif(0, 1)
5:   if U ≤ p(X′) / (M q(X′)) then
6:     Accept X′  ▷ This should record the sample with the other accepted samples
7:   end if
8: end for
9: Return accepted samples
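As a hedged illustration, a direct translation of Algorithm 3 into Python, written for a generic target density, proposal sampler, proposal density and constant M (all names and the Beta(2, 2) test case below are ours):

```python
import numpy as np

rng = np.random.default_rng(7)

def rejection_sampler(p, q_sample, q_density, M, n):
    """Algorithm 3: propose from q, accept with probability p(x) / (M q(x))."""
    accepted = []
    for _ in range(n):
        x = q_sample()
        u = rng.uniform(0.0, 1.0)
        if u <= p(x) / (M * q_density(x)):
            accepted.append(x)
    return np.array(accepted)

# Example: Beta(2, 2) target with a Unif(0, 1) proposal and M = 1.5 (the box sampler).
p = lambda x: 6.0 * x * (1.0 - x)
samples = rejection_sampler(p, lambda: rng.uniform(), lambda x: 1.0, 1.5, 10_000)
print(len(samples) / 10_000)   # acceptance rate, roughly 1 / M
```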

rejection sampling with unnormalised densities


So far, we have assumed that we have access to the density p(x) we want to sample from: we could evaluate it, hence use it for rejection under the curve. However, imagine we know a probability density only up to a normalising constant. This is one of the most common problems in computational statistics (Google: estimating normalising constants) and it arises in multiple situations which will be described shortly.
We denote the unnormalised density associated to p(x) as p̄(x). Usually, we write
$$p(x) = \frac{\bar{p}(x)}{Z},$$
where $Z = \int \bar{p}(x)\, \mathrm{d}x$. Typically, in physics, engineering, machine learning, and optimisation, we do not start from densities; instead, one defines
$$p(x) \propto e^{-f(x)}$$
for some function f (generally called a potential). f usually comes from a rule which determines how probability mass should be spread (e.g. a multimodal function). However, in order to convert this into probabilities, we need to normalise $e^{-f(x)}$.
The surprising fact about rejection sampling is that it works in the same way for unnormalised densities p̄. In other words, more generally, Theorem 2.2 holds for p̄: as long as we sample uniformly under p̄, the x-marginal of the samples is distributed w.r.t. p(x) (Martino et al., 2018). This gives rejection samplers a massive advantage. Of course, needless to say, in this case one should ensure that
$$\bar{p}(x) \leq M q(x),$$
i.e. the unnormalised density is covered by our scaled proposal M q(x). We describe the algorithm for the unnormalised case in Algorithm 4.

Algorithm 4 Pseudocode for rejection sampling without normalising constants
1: Input: The number of iterations n and the scaling factor M.
2: for i = 1, . . . , n do
3:   Generate X′ ∼ q(x′)
4:   Generate U ∼ Unif(0, 1)
5:   if U ≤ p̄(X′) / (M q(X′)) then
6:     Accept X′  ▷ This should record the sample with the other accepted samples
7:   end if
8: end for
9: Return accepted samples

2.3.2 acceptance rate


An important aspect of this algorithm is the concept of the acceptance rate, that is, the ratio of the number of accepted samples to the total number of samples. When the algorithm has a low acceptance rate, this hints that the proposal is poorly designed and most of the computational effort (sampling) goes unused and is wasted.4
Let us consider the normalised case first, with a probability density p(x). Note that the acceptance probability is a function of X′, defined as a(X′) in (2.9). We accept a sample when U ≤ a(X′), in other words when
$$U \leq \frac{p(X')}{M q(X')},$$
where X′ ∼ q(x′). Let us denote the probability of acceptance (the acceptance rate) as â. This quantity is formalised below.

4 The acceptance rate will also be a crucial notion when we later study Markov chain Monte Carlo (MCMC) methods.

Proposition 2.1. When the target density p(x) is normalised and M is prechosen, the acceptance rate is given by
$$\hat{a} = \frac{1}{M},$$
where M > 1 in order to satisfy the requirement that q covers p. For an unnormalised target density p̄(x) with normalising constant $Z = \int \bar{p}(x)\, \mathrm{d}x$, the acceptance rate is given as
$$\hat{a} = \frac{Z}{M}.$$

Proof. We will first prove that
$$\hat{a} = \mathbb{E}[a(X')] = \frac{1}{M} \tag{2.10}$$
in the normalised case. Similarly, we will also prove that
$$\hat{a} = \mathbb{E}[a(X')] = \frac{Z}{M} \tag{2.11}$$
in the unnormalised case, where we use p̄(x) instead of p(x).
For the first fact, we can prove (2.10) by noting that
$$\hat{a} = \mathbb{P}(\text{accept}) = \mathbb{P}\left(U \leq \frac{p(X')}{M q(X')}\right),$$
where the randomness here is w.r.t. U and X′ jointly. We know, however, that for a given X′ = x′ we accept with probability
$$\mathbb{P}(\text{accept} \mid X' = x') = \mathbb{P}\left(U \leq \frac{p(x')}{M q(x')}\right) = \int_0^{p(x')/M q(x')} \mathrm{d}u = \frac{p(x')}{M q(x')} = a(x').$$
This means that we can compute the unconditional acceptance probability as
$$\mathbb{P}(\text{accept}) = \int \mathbb{P}(\text{accept} \mid X' = x')\, q(x')\, \mathrm{d}x' = \mathbb{E}[a(X')].$$
We can then easily show that
$$\mathbb{E}[a(X')] = \int a(x')\, q(x')\, \mathrm{d}x' = \int \frac{p(x')}{M q(x')}\, q(x')\, \mathrm{d}x' = \frac{1}{M} \int p(x')\, \mathrm{d}x' = \frac{1}{M}.$$
For the unnormalised case, we can prove (2.11) with a similar argument:
$$\hat{a} = \mathbb{E}[a(X')] = \int a(x')\, q(x')\, \mathrm{d}x' = \int \frac{\bar{p}(x')}{M q(x')}\, q(x')\, \mathrm{d}x' = \int \frac{Z\, p(x')}{M q(x')}\, q(x')\, \mathrm{d}x' = \frac{Z}{M} \int p(x')\, \mathrm{d}x' = \frac{Z}{M}.$$

Remark 2.2. Note the difference between the two quantities a and â: while a(x) is the acceptance ratio p(x)/(M q(x)) (or its unnormalised version), â is the expectation of this quantity, called the acceptance rate.

It can be seen that a smaller M is theoretically useful for us, as it means a higher acceptance rate. Consider the following example.

Figure 2.7: A better proposal for p(x) = Beta(2, 2)

Example 2.9 (Beta(2, 2) density). We can go back to our example Beta(2, 2) density from Example 2.8. Instead of the box, we can now choose the proposal
$$q(x) = \mathcal{N}(x; 0.5, 0.25),$$
and M = 1.3 (this is optimised visually by plotting; we will compute such quantities explicitly below). This results in the graph shown in Fig. 2.7. Compare the acceptance rate of this algorithm with that of the box example, and demonstrate this numerically.

Solution. In the box example we chose $M_{\text{box}} = p^\star = 1.5$. By visual inspection, for the Gaussian case we chose $M_{\text{Gauss}} = 1.3$. We can now compute the acceptance rate for both cases. For the box example, we have
$$\hat{a}_{\text{box}} = \frac{1}{M_{\text{box}}} = \frac{1}{1.5} \approx 0.67,$$
and for the Gaussian case, we have
$$\hat{a}_{\text{Gauss}} = \frac{1}{M_{\text{Gauss}}} = \frac{1}{1.3} \approx 0.77.$$
These are the numbers that the numerical simulations should give us (up to Monte Carlo error).
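A small numerical check of these two rates (a sketch: here N(x; 0.5, 0.25) is read as a Gaussian with standard deviation 0.25, so that 1.3 q(x) covers Beta(2, 2) on [0, 1]; this reading, the seed, and the sample size are assumptions of the sketch, not statements from the text):

```python
import numpy as np

rng = np.random.default_rng(8)

def acceptance_rate(p, q_sample, q_density, M, n=200_000):
    x = q_sample(n)
    u = rng.uniform(0.0, 1.0, size=n)
    return np.mean(u <= p(x) / (M * q_density(x)))

# Beta(2, 2) density (zero outside [0, 1]).
p = lambda x: 6.0 * x * (1.0 - x) * ((x >= 0) & (x <= 1))

# Box proposal: q = Unif(0, 1), M = 1.5.
rate_box = acceptance_rate(p, lambda n: rng.uniform(0.0, 1.0, size=n),
                           lambda x: np.ones_like(x), 1.5)

# Gaussian proposal with mean 0.5 and standard deviation 0.25 (assumed), M = 1.3.
q_gauss = lambda x: np.exp(-0.5 * ((x - 0.5) / 0.25) ** 2) / (0.25 * np.sqrt(2.0 * np.pi))
rate_gauss = acceptance_rate(p, lambda n: rng.normal(0.5, 0.25, size=n), q_gauss, 1.3)

print(rate_box, rate_gauss)   # approximately 0.67 and 0.77
```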

Another example can be seen as follows.

Example 2.10 (Truncated Densities). Describe a rejection sampler for a Truncated Gaussian

29
target:

p̄(x) = N (x; 0, 1)1|x|≤a (x).

Solution. The truncated densities arise in a number of applications where we may want to model
something we know with a probability density p(x) we are familiar with. However, it could also
be the case that this variable X has strong constraints (e.g. positivity or boundedness). For
example, we could consider an age distribution could be restricted this way with hard constraints.
Imagine a Gaussian density N (x; 0, 1) and suppose we are interested in sampling this density
between [−a, a]. We can write this truncated normal density as

p̄(x) N (x; 0, 1)1|x|≤a (x)


p(x) = = Ra . (2.12)
Z −a N (y; 0, 1)dy

Here are a few important things about this equation: 1A (x) denotes a function where
(
1 if x ∈ A
1A (x) =
0 otherwise.

Note that now we have access to our density evaluation in an unnormalised way: we can evaluate
p̄(x), which equals N(x; 0, 1) if −a ≤ x ≤ a and 0 otherwise. Rejection sampling is well suited
to this task. Note here that we can choose

q(x) = N(x; 0, 1)

anyway, and we have p̄(x) ≤ q(x) (i.e. we can take M = 1). Note a few interesting things about
this case: First of all, since p(x) is zero outside [−a, a] and p̄(x)/(M q(x)) = 1 if x ∈ [−a, a], we
can simply reject any sample that is out of bounds and accept those within the bounds (without
needing to sample U at all). Secondly, note that this is one of the rare cases where we have Z < 1.
Note that we can also calculate the acceptance rate very intuitively:

â = Z/M = Z,

where Z = ∫_{−a}^{a} N(y; 0, 1) dy as given in the density (2.12). It is very intuitive that the
acceptance rate is this integral, as we literally accept a Gaussian sample from q(x) = N(x; 0, 1) if it
falls into [−a, a], and the probability of a sample falling into this interval is given by the integral
∫_{−a}^{a} N(y; 0, 1) dy = Z.
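The following is a minimal numerical sketch of this sampler (not part of the original example; the value a = 1, the sample size and the seed are arbitrary choices for illustration). Since M = 1 and the proposal equals the untruncated Gaussian, the acceptance step reduces to an interval check, and the empirical acceptance rate should match Z.

import numpy as np
from scipy.stats import norm

def truncated_gaussian_rejection(a, n, seed=0):
    """Sample n points from N(0,1) truncated to [-a, a] by rejection sampling
    with proposal q(x) = N(0,1) and M = 1 (no uniform U is needed)."""
    rng = np.random.default_rng(seed)
    samples, trials = [], 0
    while len(samples) < n:
        x = rng.standard_normal()
        trials += 1
        if abs(x) <= a:          # accept iff the proposal falls in [-a, a]
            samples.append(x)
    return np.array(samples), n / trials

a = 1.0
xs, acc_rate = truncated_gaussian_rejection(a, 50_000)
Z = norm.cdf(a) - norm.cdf(-a)   # theoretical acceptance rate is â = Z
print(f"empirical acceptance rate: {acc_rate:.3f}, Z = {Z:.3f}")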

2.3.3 designing the optimal rejection sampler


In the above examples, we have seen that choosing the right proposal q(x) and M can improve the
sampler quality significantly. In this section, we will discuss how to choose the optimal proposal
q(x) and M for a given target density p(x) (or p̄(x)).

choosing M

In the above example, we have seen that choosing M is crucial for the acceptance rate. It is easy
to see that we should choose M such that M q(x) ≥ p(x) for all x. To choose the smallest such M,
we should find the number M⋆ such that

M⋆ = sup_x p(x)/q(x).

This will ensure that M⋆ q(x) ≥ p(x) for all x. This is the optimal choice of M and we can see
that it is the smallest M that covers p(x). Needless to say, for the unnormalised case, we just
replace p with p̄ in the above formula.

Example 2.11 (Choosing M). Let our target be

p̄(x) = e^{−x²/2},

i.e., an unnormalised Gaussian, and our proposal be

q(x) = (1/π) · 1/(1 + x²).

Compute M for the optimal rejection sampler.

Solution. We need to compute

M = sup_x p̄(x)/q(x),

as usual. How? For this, let R(x) = p̄(x)/q(x) and note that we will first need to find

x⋆ = argmax_x R(x) = argmax_x log R(x),

as log is monotonic. Then we can find M = R(x⋆). Let us now compute

log R(x) = log p̄(x) − log q(x) = −x²/2 + log(1 + x²) + log π,

and find the stationary points by taking the derivative and setting it to zero:

(d/dx) log R(x) = −x + 2x/(1 + x²) = 0  ⟹  x = 0, ±1,

so we have three candidate points. Which one is the maximum? To answer this, we
need to check second derivatives. We compute the second derivative

(d²/dx²) log R(x) = −1 + 2(1 − x²)/(1 + x²)².

• When x = 0, the second derivative is positive – which means x = 0 is a minimum.

• When x = ±1, the second derivative is negative – which means x = ±1 are maxima.

• Hence x⋆ = ±1.

So we have

M = p̄(1)/q(1) = 2π e^{−1/2}.
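As a quick sanity check (not part of the original example; the grid, sample size and seed below are arbitrary choices), we can verify the value of M numerically and run the corresponding rejection sampler. The acceptance rate should be close to Z/M ≈ 0.66, since Z = ∫ e^{−x²/2} dx = √(2π).

import numpy as np

# Unnormalised Gaussian target and (normalised) Cauchy proposal
p_bar = lambda x: np.exp(-x**2 / 2)
q     = lambda x: 1.0 / (np.pi * (1.0 + x**2))

# Numerical check of M = sup_x p_bar(x)/q(x); should match 2*pi*exp(-1/2)
grid = np.linspace(-10, 10, 100_001)
M_numerical = np.max(p_bar(grid) / q(grid))
M_analytic = 2 * np.pi * np.exp(-0.5)
print(M_numerical, M_analytic)     # both approximately 3.81

# Rejection sampler: propose from the Cauchy, accept with prob p_bar/(M q)
rng = np.random.default_rng(1)
n = 50_000
x = rng.standard_cauchy(n)
u = rng.uniform(size=n)
accepted = x[u <= p_bar(x) / (M_analytic * q(x))]
print(len(accepted) / n)           # acceptance rate ~ sqrt(2*pi)/M ~ 0.66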

optimising the proposal

Let us in this section parameterise our proposal as q_θ, where θ is the parameter of the proposal
density. We have seen above that the choice of M will depend on q_θ, so we denote it by M_θ.
We would like to achieve the smallest M_θ such that

p(x) ≤ M_θ q_θ(x).

We have seen that we can choose M_θ as

M_θ = sup_x p(x)/q_θ(x).

Note that this is a function of θ, which allows us to optimise our proposal to maximise
the acceptance rate. Therefore, we can now solve the problem

θ⋆ = argmax_θ 1/M_θ.

We always perform this optimisation in log-space⁵ as the quantities we obtain are more
tractable; therefore, the problem often becomes

θ⋆ = argmin_θ log M_θ,

as log(1/M_θ) = − log M_θ.

Example 2.12. (A numerical example from Yıldırım (2017)) Say we are interested in sampling

X ∼ Gamma(α, 1),

with α > 1. The density is given by

p(x) = x^{α−1} e^{−x} / Γ(α),   x > 0.

⁵ We denote ln by log, so all logarithms are taken with respect to the natural base.

Figure 2.8: Two rejection sampling procedures for Example 2.12 with λ = 0.001 and optimal λ = 1/α (as derived
in the example) for n = 10000.

As a proposal, let us choose the exponential

q_λ(x) = Exp(x; λ) = λ e^{−λx},   x > 0,

with 0 < λ < 1 (for λ ≥ 1, the ratio p/q_λ is unbounded). Derive the optimal rejection sampler
for this problem.

Solution. We should ensure that p(x) ≤ M_λ q_λ(x); therefore, the standard choice for M_λ is to
compute

M_λ = sup_x p(x)/q_λ(x).

We know that

p(x)/q_λ(x) = x^{α−1} e^{(λ−1)x} / (λ Γ(α)).

In order to compute M_λ, in practice, one needs to first find

x⋆ = argmax_x p(x)/q_λ(x),

and then identify M_λ = p(x⋆)/q_λ(x⋆) for fixed λ. Denote R_λ(x) = p(x)/q_λ(x). To find such
an x⋆, we can compute

x⋆ = argmax_x [log p(x) − log q_λ(x)],

as log is monotonic. We can then compute

log R_λ(x) = log p(x) − log q_λ(x) = (α − 1) log x + (λ − 1)x − log λ − log Γ(α).

We then take the derivative of this,

(d/dx) log R_λ(x) = (α − 1)/x + λ − 1,

and set it to zero, which leads to

x⋆ = (α − 1)/(1 − λ).

Plugging x = x⋆ into the ratio p(x)/q_λ(x), we obtain

M_λ = ((α − 1)/(1 − λ))^{α−1} e^{−(α−1)} / (λ Γ(α)).

This leads to the acceptance probability

p(x)/(M_λ q_λ(x)) = (x(1 − λ)/(α − 1))^{α−1} e^{(λ−1)x + α − 1}.

Now, we have to minimise M_λ with respect to λ so that â = 1/M_λ is maximised. It is
easy to show (exercise) that M_λ is minimised by

λ⋆ = 1/α.

Plugging this in and computing, we obtain

M_{λ⋆} = α^α e^{−(α−1)} / Γ(α).
So we have designed our optimised rejection sampler. In order to sample from Gamma(α, 1), we perform:

• Sample X′ ∼ Exp(1/α) and U ∼ Unif(0, 1).

• If

U ≤ (X′/α)^{α−1} e^{(1/α − 1)X′ + α − 1},

accept X′; otherwise start again.

We can see the results of this algorithm in Fig. 2.8.
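A minimal sketch of this optimised sampler in Python (assuming numpy; the value α = 3 and the sample size are arbitrary choices for illustration) could read as follows. The mean and variance of the accepted samples should both be close to α, as expected for Gamma(α, 1).

import numpy as np

def gamma_rejection(alpha, n, seed=0):
    """Rejection sampler for Gamma(alpha, 1), alpha > 1, with the optimal
    exponential proposal Exp(lambda), lambda = 1/alpha, derived above."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / alpha
    out = []
    while len(out) < n:
        x = rng.exponential(scale=1.0 / lam)   # X' ~ Exp(lambda), mean 1/lambda
        u = rng.uniform()
        # acceptance probability (x/alpha)^(alpha-1) * exp((1/alpha - 1)x + alpha - 1)
        if u <= (x / alpha) ** (alpha - 1) * np.exp((lam - 1.0) * x + alpha - 1.0):
            out.append(x)
    return np.array(out)

alpha = 3.0
xs = gamma_rejection(alpha, 20_000)
print(xs.mean(), xs.var())   # both close to alpha for Gamma(alpha, 1)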

Figure 2.9: The density of a mixture of three Gaussians: p(x) = Σ_{k=1}^{3} w_k N(x; µ_k, σ_k²) with µ₁ = −2, µ₂ = 0, µ₃ = 4, σ₁ = 0.5, σ₂ = 1, σ₃ = 0.5, w₁ = 0.2, w₂ = 0.6, w₃ = 0.2.

2.4 composition

When the probability density p(x) can be expressed as a composition of operations, we can still
sample from such densities straightforwardly, although it may look complex at first glance. In this
section, we focus on mixture densities, i.e., densities that can be written as a weighted mixture
of two or more probability densities. These objects are used to model subpopulations in a statistical
population, experimental error (e.g. localised in different regions), and heterogeneous
populations. We will start from a discrete mixture and then discuss the continuous case.

2.4.1 sampling from discrete mixture densities


To start simple, consider the following probability density

p(x) = w1 q1 (x) + w2 q2 (x),

where w₁ + w₂ = 1 and q₁ and q₂ are probability densities. It is straightforward to verify that
p(x) is also a density:

∫ p(x) dx = w₁ ∫ q₁(x) dx + w₂ ∫ q₂(x) dx = w₁ + w₂ = 1.

An example can be seen in Fig. 2.9. We can generalise this idea and define a general mixture
distribution

p(x) = Σ_{k=1}^{K} w_k q_k(x),

with K components. Sampling from such distributions is extremely easy with the techniques we
know. We first sample from the probability mass function defined by the weights, p(k) = w_k, where
Σ_{k=1}^{K} p(k) = 1 (using inversion as we learned). This gives us an index k ∼ p(k); then we sample
from the associated density X′ ∼ q_k(x), which gives us a sample from the mixture. For example,

Algorithm 5 Sampling discrete mixtures


1: Input: The number of samples n.
2: for i = 1, . . . , n do
3: Generate k ∼ p(k)
4: Generate Xi ∼ qk (x)
5: end for

sampling a mixture of Gaussians is easy: sample k ∼ p(k) from the PMF consisting of the weights w_k,
then sample from the selected Gaussian.
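A minimal sketch of Algorithm 5 for a Gaussian mixture (assuming numpy; the mixture parameters below are those of Fig. 2.9) could look as follows.

import numpy as np

def sample_gaussian_mixture(weights, means, stds, n, seed=0):
    """Sample n points from the mixture sum_k w_k N(mu_k, sigma_k^2) by first drawing
    the component index k ~ p(k) = w_k, then X ~ N(mu_k, sigma_k^2)."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)   # component indices
    return rng.normal(loc=np.asarray(means)[ks], scale=np.asarray(stds)[ks])

w = [0.2, 0.6, 0.2]
mu = [-2.0, 0.0, 4.0]
sigma = [0.5, 1.0, 0.5]
xs = sample_gaussian_mixture(w, mu, sigma, 100_000)
print(xs.mean())   # should be close to sum_k w_k mu_k = 0.4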

2.4.2 sampling from conditional densities

Before we move on to the continuous mixture case, we clarify how one can sample from conditional
distributions, denoted, generally, as p(y|x). For every fixed x this is a density in y;
therefore, conditioned on x, sampling is the same as any other sampling problem. For example,
consider

p(y|x) = N(y; x, 1),

where the mean (parameter) of the Gaussian is the variable we condition on. This
notation is useful if one assumes x is also random (as we will see later). However, for fixed x,
sampling is business as usual:

y ∼ p(y|x) = N(y; x, 1)

is sampling a Gaussian with a fixed mean x.

2.4.3 sampling from joint distributions

Sampling from a joint distribution p(x, y) sounds straightforward, but how to do it might not be obvious.
Assume, we would like to draw

X, Y ∼ p(x, y) (2.13)

e.g., a two-dimensional sample from a 2D Gaussian. It is often the case that the standard factorisation
of joint densities,

p(x, y) = p(y|x)p(x),

can be used. In order to realise (2.13), one can employ

X ∼ p(x),
Y |X = x ∼ p(y|x).

Note the notation which implies that things should be done in this order. Once X is sampled,
then it is fixed X = x. After that, Y is sampled conditioned on that specific x sample.
In particular, the idea can be generalised to n variables if one knows the full conditionals.
For example, consider a joint distribution p(x₁, . . . , xₙ); any joint distribution of n variables
satisfies the following factorisation.

p(x1 , . . . , xn ) = p(xn |xn−1 , . . . , x1 )p(xn−1 |xn−2 , . . . , x1 ) · · · p(x2 |x1 )p(x1 ).

Therefore, simulating from a joint distribution can be done

X1 ∼ p(x1 )
X2 |X1 = x1 ∼ p(x2 |x1 )
X3 |X1 = x1 , X2 = x2 ∼ p(x3 |x2 , x1 )
..
.
Xn |X1 = x1 , X2 = x2 , . . . , Xn−1 = xn−1 ∼ p(xn |x1 , . . . , xn−1 ).

Of course, the difficulty with this is that it is often impossible to know the conditional distributions
described above.

Remark 2.3. This idea can be generalised considerably; in fact, it is often the core of complex
simulations. The core idea of probabilistic modelling is to factorise (assuming independence)
some complex joint distribution p(x₁, . . . , xₙ) with respect to the modelling assumptions. Simulation
methods can then be used to sample these variables in the order that is assumed in the
model and generate synthetic data.

2.4.4 sampling from continuous mixtures or marginalisation

It is a common case that a density can be written as an integral, instead of a sum (as in the discrete
mixture case). Consider the fact that

p(y) = ∫ p(x, y) dx,

for any joint density. This operation is called marginalisation and it is often of interest to compute
marginal densities (and, of course, to sample from them).
For example, using the formula p(x, y) = p(y|x)p(x) and given a conditional density p(y|x)
and a marginal p(x), we can derive

p(y) = ∫ p(x, y) dx = ∫ p(y|x) p(x) dx.

Surprisingly enough, sampling Y ∼ p(y) is pretty straightforward: sample from the joint p(x, y)
using the method above (i.e. X ∼ p(x) and Y | X = x ∼ p(y|x)), then just keep the Y samples.
They will be distributed according to p(y)! Let us see an example.

Example 2.13. Let

p(x) = N (x; µ, σ 2 )

and

p(y|x) = N (y; x, 1).

Then it can be shown that (we will do this in future exercises):

p(y) = N (y; µ, σ 2 + 1).

Numerically verify that this is the case.

Solution. This can be verified by implementing two procedures. First,

• Sample X ∼ N (x; µ, σ 2 ),

• Sample Y |X = x ∼ N (y; x, 1)

and comparing resulting Y samples to

• Y ∼ p(y) = N (y; µ, σ 2 + 1).

The online companion contains the result.
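A minimal numerical sketch of this verification (not the online companion itself; µ = 1, σ = 0.5 and the sample size are arbitrary choices) is given below; comparing moments (or histograms) of the two sample sets suffices.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5
n = 200_000

# Route 1: sample the joint and keep only the Y samples
x = rng.normal(mu, sigma, size=n)          # X ~ N(mu, sigma^2)
y_marg = rng.normal(x, 1.0)                # Y | X = x ~ N(x, 1)

# Route 2: sample directly from the claimed marginal N(mu, sigma^2 + 1)
y_direct = rng.normal(mu, np.sqrt(sigma**2 + 1.0), size=n)

# The two sample sets should have matching moments (and histograms)
print(y_marg.mean(), y_direct.mean())      # both ~ mu
print(y_marg.var(),  y_direct.var())       # both ~ sigma^2 + 1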

Another example can be seen as follows.

Figure 2.10: The data simulated from (2.15)–(2.16) using a = 0.5 and b = 0.5 with three different values for σ 2 .
As can be seen from the figures, the generated (x, y) pairs exhibit a clear linear relationship (as intended) with
variance changing depending on our modelling choice.

Example 2.14 (Linear Model). Linear models are of utmost importance in many fields of science.
Describe a method to simulate (x, y) pairs that have a linear relationship.

Solution. We know that we can sample x, y ∼ p(x, y) by sampling x ∼ p(x) and y|x ∼ p(y|x)
from the last chapter. We will now use this for a linear example.
To start intuitively, a typical linear relationship is described as
y = ax + b, (2.14)
which describes a line where a is the slope and b is the intercept. In order to obtain a probabilistic
model and generate data, we have to simulate both x and y variables. Since, from the equation, it
is clear that y is generated given x, we should start from defining x. Now this depends on the
application. For example, x can be a variable that may be uniform or a Gaussian. We denote
its density as p(x). The typical task is also to formulate p(y|x). The linear equation suggests a
deterministic relationship, however, real data often contains noise. To generate realistic data, we
will instead assume
y = ax + b + n
where n ∼ N(0, σ²) is noise (often with small σ²). Note that, since the noise is zero mean and ax + b
is a deterministic number (given x), we can then write our full model as
p(x) = Unif(x; −10, 10) (2.15)
p(y|x) = N (y; ax + b, σ 2 ). (2.16)
where we chose our p(x) distribution to be uniform on [−10, 10]. As a result, we have a full
model to simulate variables with a linear relationship
Xi ∼ p(x),
Yi |Xi = xi ∼ p(y|xi ),

Algorithm 6 Sampling Multivariate Gaussian
1: Input: The number of samples n.
2: for i = 1, . . . , n do
3: Compute L such that Σ = LL> . (Cholesky decomposition)
4: Draw d univariate independent normals vk ∼ N (0, 1) to form the vector v =
[v1 , . . . , vd ]>
5: Generate xi = µ + Lv.
6: end for

where p(x) could be a uniform, Gaussian, truncated Gaussian etc. depending on the nature of
the modelled variable. The results of this generation can be seen in the scatter plot in Fig. 2.10.
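A minimal sketch of this data-generation procedure (assuming numpy; the least-squares check at the end is only an illustration, not part of the model) could look as follows.

import numpy as np

def simulate_linear_pairs(a, b, sigma, n, seed=0):
    """Simulate (x_i, y_i) pairs from the model (2.15)-(2.16):
    X ~ Unif(-10, 10), Y | X = x ~ N(a x + b, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=n)
    y = rng.normal(a * x + b, sigma)
    return x, y

x, y = simulate_linear_pairs(a=0.5, b=0.5, sigma=1.0, n=1000)
# A least-squares fit should recover a and b approximately
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)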

2.5 sampling multivariate densities

Finally, we will discuss sampling methods for multivariate distributions. As we saw earlier,
a multivariate density is nothing but a joint density in d dimensions, i.e., it can be defined as
p(x₁, . . . , x_d). The techniques mentioned in Sec. 2.4.3 (sampling from conditionals) can be used
if the conditionals are known. Also, the samplers we have seen, e.g., rejection sampling, generalise
to many dimensions, as you have already seen from the circle example. But, of course, most of the
time the conditionals are not known; more general independent sampling techniques are available, see
Martino et al. (2018, Chapter 6). We will only cover one specific case.

2.5.1 sampling a multivariate gaussian


Define x ∈ R^d and a multivariate Gaussian density:

p(x) = (2π)^{−d/2} |det Σ|^{−1/2} exp( −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) ),
where µ ∈ Rd is the mean vector and Σ ∈ Rd×d is a d × d symmetric positive definite matrix.
Recall that, in the univariate case, Y = µ + σX (where µ, σ are scalars) gave us a sample from
N (µ, σ 2 ). The same idea works here, however, since now we have the covariance instead of
variance, we need to find a notion of a “square-root” of the covariance matrix Σ. In short, let
X ∼ N (x; 0, I) where now X ∈ Rd and the mean is a zero vector, and finally, I is the identity
matrix. Such a multivariate Gaussian can be drawn by drawing each entry independently. We
then can sample Y ∼ N (y; µ, Σ) where µ ∈ Rd the mean vector and Σ ∈ Rd×d the symmetric
covariance matrix, as

Y = Σ1/2 X + µ.

The computation of Σ1/2 is done using a Cholesky decomposition6 . The algorithm is provided in
Algorithm 6.
⁶ You do not need to know how to implement or compute this; it is perfectly fine to use
numpy.linalg.cholesky.
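A minimal sketch of Algorithm 6 using numpy (with an arbitrary two-dimensional µ and Σ for illustration; note that the Cholesky factor only needs to be computed once, outside the sampling loop) could look as follows.

import numpy as np

def sample_mvn(mu, Sigma, n, seed=0):
    """Draw n samples from N(mu, Sigma) using Algorithm 6: x = mu + L v,
    where Sigma = L L^T (Cholesky) and v is a vector of i.i.d. N(0,1) draws."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    v = rng.standard_normal((n, len(mu)))
    return mu + v @ L.T            # row-wise: x_i = mu + L v_i

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
xs = sample_mvn(mu, Sigma, 100_000)
print(xs.mean(axis=0))             # ~ mu
print(np.cov(xs.T))                # ~ Sigma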

3 PROBABILISTIC MODELLING AND INFERENCE
In this chapter, we will cover probabilistic modelling in more detail and then talk about inference.
We will also review probability basics and a large range of applications the Bayesian viewpoint
enables.

3.1 introduction

In the previous chapter, we have seen how to generate data from a probabilistic model. Although
we have only simulated from a linear model as an example, the idea is general. We will see more
about simulating models in other parts of the course. We have seen that

Xi ∼ p(x),                                   (3.1)
Yi | Xi = xi ∼ p(y|xi),                       (3.2)

generates the data according to the model p(x, y) = p(y|x)p(x). It is important to stress that this
can describe a very general situation: the x variable can be multivariate (and even time dependent),
and y can describe any other process. We will see, though, that in Bayesian modelling (used here
interchangeably with probabilistic modelling), x generally denotes the latent (hidden) states or
parameters of a model (or both). The variable y typically denotes the observed data. Seeing the
model (3.1)–(3.2) as a generative model, simulating from it can be seen as a way of generating synthetic
data¹.

3.2 the bayes rule and its uses


In this section, we will discuss the Bayes rule in depth and its uses. The Bayesian formula is at the
heart of many probabilistic modelling approaches. We start with the definition of the Bayes rule.
¹ This is a big deal in industry. Search, for example, for synthetic data startups.

Definition 3.1 (Bayes Theorem). Let X and Y be random variables with associated probability
density functions p(x) and p(y), respectively. The Bayes rule is given by

p(x|y) = p(y|x) p(x) / p(y).                    (3.3)

Note that the formula holds for continuous random variables as well as discrete random
variables. Its importance comes from the fact that it provides us a natural way to incorporate or
synthesise data into a probabilistic model. In this interpretation, we have three key concepts.

• Prior: In the formula (3.3), p(x) is called the prior probability of X. Here X can be
interpreted as a parameter of p(y|x) or a hidden (unobserved) variable. The probability
distribution p(x) encodes our prior knowledge about this variable we cannot observe
directly. This could be simple constraints, a distribution dictated by a real application
(e.g. a physical variable can be only positive). In time series applications, p(x) can be the
distribution over an entire time series, it can even encode physical laws.

• Likelihood: p(y|x) is called the likelihood of Y given X. This is the probability model
of the process of observation – in other words, it describes how the underlying parameter
or hidden variable is observed. For example, if Y is the number of observed cases of a
disease in a population, then p(y|x) is the probability of observing y cases given that the
true number of cases is x.

• Posterior: p(x|y) is called the posterior distribution of X given Y = y. This is the updated
  probability distribution after we observe y and update our prior knowledge p(x) into p(x|y).

We will see a number of examples where these quantities make sense.

Remark 3.1. Note the difference between simulation and inference. We can write down our
model (sometimes we will call the forward model) p(x) and p(y|x) to describe the data generation
process and can generate toy (synthetic) data with it as we have seen. But the essential goal of
Bayes rule (also called Bayesian or probabilistic inference) is to infer the posterior distribution
conditioned on already observed data. In other words, we can use a probabilistic model for two
purposes:

• Simulation: We can generate synthetic data with a probabilistic model.

• Inference: We can infer the posterior distribution (implied by the model structure we
impose) of a parameter or hidden variable given observed data.

Let us see the Bayes’ rule on a discrete example.

Example 3.1. Suppose we have two fair dice, each with six faces. Suppose that we throw both
dice and observe the sum as y = 9. Derive the posterior distribution of the first die X1 and the
second die X2 given Y = 9.

Solution. Let us first write down the joint distribution of the two dice. Define the outcome of
the first die as X1 and the outcome of the second die as X2 . We can then describe their joint
probability table as
p(x1 , x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 2 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 3 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 4 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 5 1/36 1/36 1/36 1/36 1/36 1/36
X2 = 6 1/36 1/36 1/36 1/36 1/36 1/36
i.e., each combination is equally probable. Note that this is also the table of p(x1 )p(x2 ) due to
independence. Suppose that we can only observe the sum of the two dice, Y = X1 + X2 . This
would result in a likelihood

1 if y = x1 + x2 ,
p(y|x1 , x2 ) = 
0 otherwise.

We can also denote this as an indicator function, i.e., let 1{y=x1 +x2 } (x1 , x2 ) be the indicator
function of the event y = x1 + x2 , then we have p(y|x1 , x2 ) = 1{y=x1 +x2 } (x1 , x2 ). Suppose now
we observe Y = 9 and would like to infer the posterior distribution of X1 and X2 given Y = 9.
We can use the Bayes rule to write

p(x₁, x₂ | y = 9) = p(y = 9 | x₁, x₂) p(x₁, x₂) / p(y = 9)
                  = p(y = 9 | x₁, x₂) p(x₁) p(x₂) / p(y = 9).
Let us first write out p(y = 9|x1 , x2 ) as a table
p(y = 9|x1 , x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1
X2 = 4 0 0 0 0 1 0
X2 = 5 0 0 0 1 0 0
X2 = 6 0 0 1 0 0 0
This is just the likelihood. In order to get the full joint (numerator of the Bayes theorem), we
need to multiply the likelihood with the joint prior p(x1 , x2 ) = p(x1 )p(x2 ). Multiplying this
table with the joint probability table of X1 and X2 gives

p(y = 9|x1 , x2 )p(x1 )p(x2 ) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1/36
X2 = 4 0 0 0 0 1/36 0
X2 = 5 0 0 0 1/36 0 0
X2 = 6 0 0 1/36 0 0 0

This is just the numerator in the Bayes theorem; we now need to compute the probability p(y = 9)
in order to finally arrive at the posterior distribution. We can compute this as

p(y = 9) = Σ_{x₁,x₂} p(y = 9 | x₁, x₂) p(x₁) p(x₂)
         = Σ_{x₁,x₂} 1(y = x₁ + x₂) p(x₁) p(x₂)
         = Σ_{x₁,x₂} 1(y = x₁ + x₂) × (1/6) × (1/6)
         = (1/36) Σ_{x₁,x₂} 1(y = x₁ + x₂)
         = (1/36) × 4
         = 1/9.
Now we are ready to normalise p(y = 9|x1 , x2 )p(x1 )p(x2 ) to obtain the posterior distribution as
a table
p(x1 , x2 |y = 9) X1 = 1 X1 = 2 X1 = 3 X1 = 4 X1 = 5 X1 = 6
X2 = 1 0 0 0 0 0 0
X2 = 2 0 0 0 0 0 0
X2 = 3 0 0 0 0 0 1/4
X2 = 4 0 0 0 0 1/4 0
X2 = 5 0 0 0 1/4 0 0
X2 = 6 0 0 1/4 0 0 0

Let us next see a continuous example adapted from Murphy (2007).

Example 3.2. Let

p(x) = N (x; µ0 , σ02 ),


p(y|x) = N (y; x, σ 2 ),

where µ0 and σ02 are the prior mean and variance, respectively, and σ 2 is the variance of the
likelihood. Derive the posterior distribution p(x|y) using the Bayes’ rule.

Solution. We have seen a similar example before (without the proof), where we computed the
marginal likelihood p(y). In this example, we will instead derive the posterior distribution p(x|y).
Now let us write

p(x|y) = p(y|x) p(x) / p(y).

In order to derive the posterior, we first derive p(y|x)p(x) as

p(y|x)p(x) = N(y; x, σ²) N(x; µ₀, σ₀²)
           = (1/√(2πσ²)) exp(−(y − x)²/(2σ²)) · (1/√(2πσ₀²)) exp(−(x − µ₀)²/(2σ₀²))
           = (1/(2π σ σ₀)) exp(−(y − x)²/(2σ²) − (x − µ₀)²/(2σ₀²)).

We know that

p(x|y) ∝ p(y|x)p(x) ∝ exp(−(y − x)²/(2σ²) − (x − µ₀)²/(2σ₀²)).

Recall now that we want a density in x. Let us expand the exponent:

p(x|y) ∝ exp(−y²/(2σ²) + xy/σ² − x²/(2σ²) − x²/(2σ₀²) + xµ₀/σ₀² − µ₀²/(2σ₀²))
       = exp(−(1/(2σ²) + 1/(2σ₀²)) x² + (y/σ² + µ₀/σ₀²) x − y²/(2σ²) − µ₀²/(2σ₀²))
       ∝ exp(−(1/2) ((σ² + σ₀²)/(σ² σ₀²)) x² + ((y σ₀² + µ₀ σ²)/(σ₀² σ²)) x)
       = exp(−(1/2) ((σ² + σ₀²)/(σ² σ₀²)) (x² − 2x (y σ₀² + µ₀ σ²)/(σ₀² + σ²)))
       ∝ exp(−(x − (y σ₀² + µ₀ σ²)/(σ₀² + σ²))² / (2 σ² σ₀²/(σ² + σ₀²))).

This can be recognised as a Gaussian with

µ_p = (σ² µ₀ + σ₀² y)/(σ² + σ₀²),
σ_p² = σ² σ₀²/(σ² + σ₀²).

This gives us our Gaussian posterior. See Fig. 3.1 for an illustration.

Figure 3.1: Posterior distribution of x given σ = 1, σ = 0.7 and σ = 0.5 respectively. One can see that as we
shrink the likelihood variance, the posterior distribution becomes more peaked towards the observation y = 0.5.
Old posteriors are also plotted in the second and third figure for comparison (in transparent blue).

Figure 3.2: On the left, we plot all distributions of interest: the prior, the likelihood (with y = 0.5, as a function of x),
the posterior, the unnormalised posterior, and the proposal. Note that the proposal only needs to cover the
unnormalised posterior, even if the normalising constant is less than one. On the right, we plot the samples against the
same quantities. One can see that we sampled exactly from the correct posterior.

This is an example of a conjugate prior, where the posterior distribution is of the same form
as the prior. In the solved examples section, we will see more examples of this. As you have seen,
the derivation of the posterior took some work. As opposed to this conjugate case, in the general
case we will not be able to derive the posterior analytically. Let us now see an example of how we can avoid
computing the normalised posterior but still sample from it.

Example 3.3. Assume that we have a prior distribution p(x) = N (x; µ0 , σ02 ) and a likelihood
p(y|x) = N (y; x, σ 2 ). We want to sample the posterior distribution p(x|y) without going
through the derivation. Derive the rejection sampler for this purpose.

Solution. We know the posterior is given by

p(x|y) ∝ p(y|x)p(x) = N (x; µ0 , σ02 )N (y; x, σ 2 ).

Recall that we would like to sample from the posterior p(x|y) without necessarily computing the
Bayes rule. We can pose this problem as a rejection sampling problem. We would like to sample
from the posterior distribution conditioned on y. In our case, the unnormalised posterior is
given by

p̄(x|y) = p(y|x)p(x).

Figure 3.3: Illustration of the prior, posterior, likelihood, and the proposal distribution.

Note that we evaluate the likelihood at the observation y and hence it becomes a function of x.
Below, for clarity, we will use the r.h.s. of the above equation in the acceptance ratio, instead of p̄(x) as we
usually did before. For this example, we also set µ₀ = 0, σ₀ = 1, and σ = 0.5. Next, we need to
design a proposal distribution q(x). This could be tricky as we do not know the posterior. For
now, we can choose another simple Gaussian (we could also optimise this):

q(x) = N (x; µq , σq2 ).

Let us choose µq = 0 and σq = 1 (note again that this is the standard deviation!) and M = 1.
An illustration of this is shown in Fig 3.2. We can now sample from the posterior

• Sample X 0 ∼ q(x)

• Sample U ∼ Unif(0, 1)

• If U ≤ p(y|X′) p(X′) / (M q(X′)), accept X′. Otherwise, reject X′ and go back to step 1.

We can see the results of this procedure from Fig 3.2. As seen from the figure, we exactly sample
from the posterior p(x|y = 0.5) without ever computing the correct posterior. We have also
plotted the correct posterior in the figure for comparison.
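A minimal sketch of this procedure (assuming numpy and scipy; the sample size and seed are arbitrary) is given below, together with a comparison against the analytic conjugate posterior of Example 3.2. The bound M = 1 works here because the likelihood factor is bounded by 1/(σ√(2π)) < 1.

import numpy as np
from scipy.stats import norm

# Model and observation from the example: prior N(0,1), likelihood N(y; x, 0.5^2), y = 0.5
mu0, sigma0, sigma, y = 0.0, 1.0, 0.5, 0.5
p_bar = lambda x: norm.pdf(y, loc=x, scale=sigma) * norm.pdf(x, loc=mu0, scale=sigma0)

q_pdf = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # proposal N(0, 1)
M = 1.0

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, size=n)
u = rng.uniform(size=n)
posterior_samples = x[u <= p_bar(x) / (M * q_pdf(x))]

# Compare with the analytic conjugate posterior from Example 3.2
mu_p = (sigma**2 * mu0 + sigma0**2 * y) / (sigma**2 + sigma0**2)
var_p = sigma**2 * sigma0**2 / (sigma**2 + sigma0**2)
print(posterior_samples.mean(), mu_p)     # both ~ 0.4
print(posterior_samples.var(), var_p)     # both ~ 0.2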

Let us see another example.

Example 3.4. Assume that we have a Poisson observation model:

p(y|x) = Pois(y; x) = x^y e^{−x} / y!,

and a Gamma prior:

p(x) = Gamma(x; α, β) = (β^α / Γ(α)) x^{α−1} e^{−βx}.

We want to sample from the posterior distribution p(x|y). Derive the posterior distribution
p(x|y).

Solution. We know that the posterior is proportional to

p(x|y) ∝ p(y|x)p(x) = Pois(y; x) Gamma(x; α, β) ∝ x^{α−1+y} e^{−(β+1)x},

where we ignored all the normalising constants. We can see that the posterior is also a Gamma
density:

p(x|y) = Gamma(x; α + y, β + 1).

Let us sample from this posterior with rejection sampling as we did before for the Gaussian.

Example 3.5. Assume that we have a Gamma prior:

p(x) = Gamma(x; α, 1) = (1/Γ(α)) x^{α−1} e^{−x},

with α > 0. Next, we define our Poisson observation model as before:

p(y|x) = Pois(y; x) = x^y e^{−x} / y!.

Derive the rejection sampler for the posterior without explicitly computing the posterior
distribution.

Solution. We know that the posterior is proportional to

p(x|y) ∝ p(y|x)p(x) = Pois(y; x) Gamma(x; α, 1) ∝ x^{α−1+y} e^{−2x}.

Figure 3.4: Histogram of the samples drawn using rejection sampling.

In short, we will choose this as our unnormalised posterior:

p̄(x|y) = x^{α−1+y} e^{−2x}.

Now we will design our proposal distribution. We choose the proposal as an exponential distribution:

q_λ(x) = Exp(x; λ) = λ e^{−λx}.

Now we derive the acceptance probability. As usual, we need to first find

M_λ = sup_x p̄(x|y) / q_λ(x).

First we need to optimise the ratio:

p̄(x|y) / q_λ(x) = x^{α−1+y} e^{−2x} / (λ e^{−λx}) = x^{α−1+y} e^{−(2−λ)x} / λ.

Aiming at optimising this w.r.t. x, we first compute its log:

log [p̄(x|y) / q_λ(x)] = (α − 1 + y) log x − (2 − λ)x − log λ.

We now take the derivative of this w.r.t. x:

(d/dx) [(α − 1 + y) log x − (2 − λ)x − log λ] = (α − 1 + y)/x − (2 − λ),

and set it to zero:

(α − 1 + y)/x − (2 − λ) = 0.

This gives us the maximiser

x⋆ = (α − 1 + y)/(2 − λ).

We can now compute M_λ:

M_λ = p̄(x⋆|y) / q_λ(x⋆)
    = (x⋆)^{α−1+y} e^{−(2−λ)x⋆} / λ
    = (1/λ) ((α − 1 + y)/(2 − λ))^{α−1+y} e^{−(α−1+y)}.

We can now optimise this further to choose our optimal proposal. We first compute the log
of M_λ:

log M_λ = log(1/λ) + (α − 1 + y) log((α − 1 + y)/(2 − λ)) − (α − 1 + y)
        = −log λ + (α − 1 + y) log((α − 1 + y)/(2 − λ)) − (α − 1 + y).

Taking the derivative of this w.r.t. λ, we obtain

(d/dλ) log M_λ = −1/λ + (α − 1 + y)/(2 − λ).

Setting this to zero, we obtain

1/λ = (α − 1 + y)/(2 − λ),

which implies that

λ⋆ = 2/(α + y).

Therefore, the optimal proposal is expressed in terms of α and y, i.e., it depends on the observed
data point. See Fig. 3.4 for the histogram of the samples drawn using rejection sampling.
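A minimal sketch of this sampler (assuming numpy; α = 2 and y = 3 are arbitrary illustrative choices), working in log-space for stability, could look as follows. The mean of the accepted samples should be close to (α + y)/2, the mean of the exact Gamma(α + y, 2) posterior.

import numpy as np

def poisson_gamma_posterior_rejection(alpha, y, n, seed=0):
    """Rejection sampler for the unnormalised posterior p_bar(x|y) = x^(alpha-1+y) e^(-2x)
    using the optimal exponential proposal with lambda* = 2/(alpha + y) derived above."""
    rng = np.random.default_rng(seed)
    lam = 2.0 / (alpha + y)
    a = alpha - 1 + y                               # shorthand for the exponent
    log_p_bar = lambda x: a * np.log(x) - 2.0 * x
    log_q = lambda x: np.log(lam) - lam * x
    log_M = -np.log(lam) + a * np.log(a / (2.0 - lam)) - a
    out = []
    while len(out) < n:
        x = rng.exponential(scale=1.0 / lam)
        if np.log(rng.uniform()) <= log_p_bar(x) - log_M - log_q(x):
            out.append(x)
    return np.array(out)

alpha, y = 2.0, 3
xs = poisson_gamma_posterior_rejection(alpha, y, 20_000)
# The exact posterior is Gamma(alpha + y, 2), whose mean is (alpha + y)/2
print(xs.mean(), (alpha + y) / 2.0)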

3.3 conditional independence

The step forward from the simple Bayes rule to modelling complex dependencies and interactions
is to understand the notion of conditional independence. Simply put, conditional independence
is a notion of independence of two random variables conditioned on a third random variable. Of
course, this can be extended to an arbitrary number of variables, defining a full probabilistic model.
It is important to note that these models appear everywhere in science and engineering.
Let us first define the notion of conditional independence.

Definition 3.2. Let X, Y and Z be random variables. We say that X and Y are conditionally
independent given Z if

p(x, y|z) = p(x|z)p(y|z).

This definition is of course the same as plain independence, just written in terms of conditional
probabilities. Note that, in general, X and Y are not independent if we do not condition on Z.
We note the important corollary.
Corollary 3.1. If X and Y are conditionally independent given Z, then
p(x|y, z) = p(x|z),
and
p(y|x, z) = p(y|z).
Proof. See Exercise 4.2 solution. 
We can now describe the notion of conditional independence in terms of joint distributions.

Proposition 3.1. Let X, Y and Z be random variables. If X and Y are conditionally independent
given Z, then

p(x, y, z) = p(x|z)p(y|z)p(z).

Proof. Recall that we have described the chain rule for conditional probabilities in Sec. 2.4.3:

p(x₁, . . . , xₙ) = p(xₙ|xₙ₋₁, . . . , x₁) p(xₙ₋₁|xₙ₋₂, . . . , x₁) · · · p(x₂|x₁) p(x₁).

This relationship is as important in inference as it is in simulation. We can now use it to show
that

p(x, y, z) = p(x|y, z) p(y|z) p(z)
           = p(x|z) p(y|z) p(z),

where the last line follows from Corollary 3.1. □

This idea can be extended to an arbitrary number of variables. These kinds of factorisations are
at the core of probabilistic modelling. In other words, a probabilistic modeller (scientist) poses
a set of conditional independence assumptions which then allows them to factorise the joint
distribution into a product of conditional distributions. From then on, the modeller can use the
conditional distributions to compute any desired marginal or conditional distributions. This is
the essence of probabilistic modelling.

3.3.1 bayes rule for conditionally independent observations


So far, we have seen an example of prior to posterior update for a single observation in Sec. 3.2
and the definition of conditional independence. We can now combine these two ideas to obtain
the Bayes update for conditionally independent observations. This is a standard use case for
conditional independence: Typically, given an unobserved variable x, we can obtain multiple
measurements related to a single latent variable x.
Let us define the general Bayes update for this case. Assume that we have observed y₁, . . . , yₙ ∼
p(y|x) (these can be thought of as conditionally i.i.d. samples from the likelihood)². Given a
prior on x, denoted p(x), we want to compute the posterior p(x|y₁, . . . , yₙ). We know that the
posterior is given by

p(x|y_{1:n}) = p(y_{1:n}|x) p(x) / p(y_{1:n}).                        (3.4)

Under the conditional independence assumption on the observations, we can just use Definition 3.2
to arrive at

p(y_{1:n}|x) = ∏_{i=1}^{n} p(yᵢ|x).

Plugging this back into the Bayes update (3.4), we can see that the posterior is proportional to
the product

p(x|y₁, . . . , yₙ) ∝ p(y₁, . . . , yₙ|x) p(x) = ∏_{i=1}^{n} p(yᵢ|x) p(x).

Again, on many occasions, we will not be able to compute the normalising constant. However, we
can still sample from the posterior. In this particular example, let us continue with the Gaussian
prior and likelihood. In this case, we can exactly compute the posterior too.

Example 3.6 (Gaussian Bayes update for conditionally independent observations). Let us assume
the following probabilistic model:

X ∼ N(x; µ₀, σ₀²) = p(x),
Yᵢ | X = x ∼ N(yᵢ; x, σ²) = p(yᵢ|x),   i = 1, . . . , n,

² We define the following notation. Let y₁, . . . , yₙ be n observations. We collectively denote these variables as
y_{1:n} := (y₁, . . . , yₙ). This will also be used in the following sections.

Figure 3.5: Bayes update for conditionally independent observations.

where Yi are conditionally independent given X = x. Derive the posterior distribution


p(x|y1 , . . . , yn ).

Solution. Here each observation is assumed to be conditionally independent given x. Note that
this model is very different than the one where we simulated (Xi , Yi ) pairs in Example 2.14. The
point in Example 2.14 was to simulate pairs exhibiting linear relationship, each (Yi , Xi ) was an
independent draw from the joint distribution. Here, we assume that the observations are sampled
conditioned on a single x – in essence, the sequence y1 , . . . , yn are dependent. They are only
conditionally independent given x.
Having observed y₁, . . . , yₙ, we would like to compute the posterior p(x|y₁, . . . , yₙ). Let us
first compute the likelihood:

p(y_{1:n}|x) = ∏_{i=1}^{n} p(yᵢ|x)
             = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(yᵢ − x)²/(2σ²))
             = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (yᵢ − x)²).

Using the same derivations as in Example 3.2, we can compute the posterior

p(x|y_{1:n}) = p(y_{1:n}|x) p(x) / p(y_{1:n})
             = p(y_{1:n}|x) p(x) / ∫ p(y_{1:n}|x) p(x) dx,

where p(x|y_{1:n}) = N(x; µ_p, σ_p²), with (Murphy, 2007)

µ_p = (σ₀² Σ_{i=1}^{n} yᵢ + σ² µ₀) / (σ₀² n + σ²),
σ_p² = σ₀² σ² / (σ₀² n + σ²).

The posterior conditioned on the data can be seen in Fig. 3.5.
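A minimal numerical sketch of this update (assuming numpy; the true value x = 1.5, the prior parameters and the number of observations are arbitrary choices) could look as follows.

import numpy as np

def gaussian_posterior(y, mu0, sigma0, sigma):
    """Closed-form posterior N(mu_p, sigma_p^2) for the model
    X ~ N(mu0, sigma0^2), Y_i | X = x ~ N(x, sigma^2) i.i.d., given data y (array)."""
    n = len(y)
    mu_p = (sigma0**2 * np.sum(y) + sigma**2 * mu0) / (sigma0**2 * n + sigma**2)
    var_p = sigma0**2 * sigma**2 / (sigma0**2 * n + sigma**2)
    return mu_p, var_p

# Simulate data from a "true" x and check that the posterior concentrates around it
rng = np.random.default_rng(0)
x_true, sigma, mu0, sigma0 = 1.5, 1.0, 0.0, 2.0
y = rng.normal(x_true, sigma, size=50)
mu_p, var_p = gaussian_posterior(y, mu0, sigma0, sigma)
print(mu_p, np.sqrt(var_p))   # posterior mean near 1.5, posterior std around 0.14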

3.3.2 conditional bayes rule

It is important to realise that the Bayes rule can also be used conditionally. Consider three variables
X, Y, Z without specifying any conditional independence assumptions. In this case, the Bayes
rule for p(x|y, z) can be written entirely conditioned on z. We can write in this case the
conditional Bayes rule.

Proposition 3.2. Given X, Y, Z without any conditional independence assumptions, the conditional
Bayes rule is

p(x|y, z) = p(y|x, z) p(x|z) / p(y|z).

Proof. See the solution of Exercise 4.1. □

The same rule of course holds if we instead condition on x or y.

3.4 marginal likelihood

The notion of the marginal likelihood has been left unexplored so far and we will now investigate it. We
can go back to the Bayes theorem and write

p(x|y) = p(y|x) p(x) / p(y).

In this formula, we have discussed the posterior p(x|y), the prior p(x), and the likelihood
p(y|x) in past sections. However, the normalising constant, which we assumed to be intractable,
is also of interest. This quantity, p(y), is called the marginal likelihood and it is given by

p(y) = ∫ p(y|x) p(x) dx.

For fixed y, the interpretation of this term is that it gives us the probability of the data y under the
model³. For more complicated models (where x can consist of multiple variables or multiple other
distributions may be involved), the quantity p(y) becomes crucial for assessing the quality of the model
for the observed data. While its value on its own does not mean much, it gives us a comparative measure
to compare different models. We will discuss this with an example.

³ Aside from its usual interpretation as the normalising constant.

Figure 3.6: Marginal likelihood for model comparison. For observed data, we can compute the marginal likelihood
for each model. The model with the highest marginal likelihood is the best model for the observed data.

Example 3.7 (Marginal likelihood for two Gaussian models). Consider two different models:

p₀(x) = N(x; µ₀, σ₀²),
p(y|x) = N(y; x, σ_y²),

and

p₁(x) = N(x; µ₁, σ₁²),
p(y|x) = N(y; x, σ_y²).

Consider observing y (a single data point). How can you find out which model is more likely?

Solution. Recall that, for these models, we have computed p(y) analytically before. We can
compute it for both models:

p₀(y) = ∫ p(y|x) p₀(x) dx = ∫ N(y; x, σ_y²) N(x; µ₀, σ₀²) dx = N(y; µ₀, σ₀² + σ_y²)

and

p₁(y) = ∫ p(y|x) p₁(x) dx = ∫ N(y; x, σ_y²) N(x; µ₁, σ₁²) dx = N(y; µ₁, σ₁² + σ_y²).

We will say Model 1 is better than Model 0 if p₁(y) > p₀(y) for the fixed y and, similarly, we will say
Model 0 is better if p₁(y) < p₀(y). This, as you can see, depends on the various parameters.
Let us choose σ_y = 1, µ₀ = −4, σ₀ = 2, and µ₁ = 1, σ₁ = 0.5. The computed marginal
likelihoods can be seen in Fig. 3.6. It can be seen that Model 1 is a much better fit to the data
than Model 0.
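A minimal numerical sketch of this comparison (assuming numpy and scipy; the observed value y = 0.5 is an arbitrary illustrative choice, not taken from the text) could look as follows.

import numpy as np
from scipy.stats import norm

# Model 0 and Model 1 from the example, with sigma_y = 1
sigma_y = 1.0
mu0, sigma0 = -4.0, 2.0
mu1, sigma1 = 1.0, 0.5

def marginal_likelihood(y, mu, sigma_prior):
    # p(y) = N(y; mu, sigma_prior^2 + sigma_y^2) for this conjugate Gaussian model
    return norm.pdf(y, loc=mu, scale=np.sqrt(sigma_prior**2 + sigma_y**2))

y_obs = 0.5
p0 = marginal_likelihood(y_obs, mu0, sigma0)
p1 = marginal_likelihood(y_obs, mu1, sigma1)
print(p0, p1, "Model 1 preferred" if p1 > p0 else "Model 0 preferred")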

3.5 sequential bayesian updating

Before concluding this section, we will pay special attention to a case where the conditional Bayes
rule is used together with many conditionally i.i.d. observations. Consider a scenario where we
would like to obtain the distribution of a random variable X given a sequence of data points y_{1:n},
i.e., p(x|y_{1:n}). To repeat, we can see that this can be done using Eq. (3.4), i.e.,

p(x|y_{1:n}) = p(y_{1:n}|x) p(x) / p(y_{1:n}).

However, in many cases, we are interested in (yₙ)_{n≥1} arriving sequentially, that is, one at a time.
In this case, we can use the conditional Bayes rule to update the posterior sequentially.
Assume that we start from the prior p(x) before we observe any data. If we then observe y₁,
one can construct the posterior of x as

p(x|y₁) = p(y₁|x) p(x) / p(y₁),

as we have seen before. Suppose now that we observe a new data point y₂ and would like to
obtain p(x|y_{1:2}) without reprocessing y₁. We can do this using the Bayes rule as

p(x|y_{1:2}) = p(y_{1:2}|x) p(x) / p(y_{1:2})
             = p(y₂|x, y₁) p(y₁|x) p(x) / (p(y₂|y₁) p(y₁))
             = p(y₂|x) p(y₁|x) p(x) / (p(y₂|y₁) p(y₁))     (by conditional independence)
             = p(y₂|x) p(x|y₁) / p(y₂|y₁).

One can see that this is equivalent to using p(x|y₁) as the prior. This is also the conditional Bayes
update conditioned on y₁.

Figure 3.7: The curse of dimensionality for the sampling example for rejection sampling.

The generalisation of this process goes as follows. Assume that we have p(x|y_{1:n−1}), i.e., the
posterior distribution conditioned on n − 1 observations. Upon receiving yₙ, we can update our
posterior as

p(x|y_{1:n}) = p(x|yₙ, y_{1:n−1})
             = p(yₙ|x, y_{1:n−1}) p(x|y_{1:n−1}) / p(yₙ|y_{1:n−1})
             = p(yₙ|x) p(x|y_{1:n−1}) / p(yₙ|y_{1:n−1}),

where we obtain the sequential Bayesian updating rule. This is a very important result as it allows
us to update our posterior sequentially without reprocessing the data. This is especially useful in
online learning scenarios where we would like to update our posterior as we receive new data
points. This will be of crucial use when we consider sequential Monte Carlo towards the end of
the course. We will have one exercise about a Gaussian example in this setting.
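A minimal sketch of this sequential update for the Gaussian model (assuming numpy; the true value, prior parameters and number of observations are arbitrary choices) is given below. Processing the observations one at a time should give the same posterior as the batch formula of Example 3.6 on the same data.

import numpy as np

def sequential_gaussian_update(mu, var, y, sigma):
    """One sequential Bayes update for the current posterior N(mu, var) (used as a prior)
    and a new observation y ~ N(x, sigma^2); returns the updated posterior parameters."""
    new_var = var * sigma**2 / (var + sigma**2)
    new_mu = (sigma**2 * mu + var * y) / (var + sigma**2)
    return new_mu, new_var

rng = np.random.default_rng(0)
x_true, sigma = 1.5, 1.0
mu, var = 0.0, 4.0                              # prior N(0, 2^2)
for y in rng.normal(x_true, sigma, size=50):    # observations arrive one at a time
    mu, var = sequential_gaussian_update(mu, var, y, sigma)
print(mu, var)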

3.6 conclusion

In this section, we briefly discussed the Bayes rule and its application to probabilistic inference.
This is a vast topic and we have only scratched the surface. If you are curious about the topic,
Bishop (2006) is a good book to read. Some other very nice ones are Barber (2012) and Murphy
(2022). We will finish this chapter by discussing why rejection samplers, as we have introduced them,
would not be an appropriate candidate for sampling in the more complicated models discussed in this
chapter.

Given all these derivations, it is natural to ask whether we can use rejection samplers for
Bayesian inference. Let us assume that we have observed y₁, . . . , yₙ and our unnormalised
posterior is given by

p̄(x|y_{1:n}) = p(x) ∏_{i=1}^{n} p(yᵢ|x).

Let us assume that we have a proposal distribution q(x) and assume that we have been lucky enough to
identify some M such that

p̄(x|y_{1:n}) ≤ M q(x).

We can now perform rejection sampling as follows:

1. Sample X′ ∼ q(x)

2. Sample U ∼ Unif(0, 1)

3. If U ≤ p̄(X′|y_{1:n}) / (M q(X′)) = p(X′) ∏_{i=1}^{n} p(yᵢ|X′) / (M q(X′)), then accept X′

4. Otherwise reject X′ and go back to step 1.

What could be an immediate problem as n grows? The product ∏_{i=1}^{n} p(yᵢ|X′) would
not be numerically stable. This would result in numerical underflow as the product of
small probabilities gets smaller and smaller. In order to mitigate this, one solution is to work
with log-probabilities. This means that we can still perform rejection sampling (provided that
p̄(x|y_{1:n}) ≤ M q(x)) as follows:

1. Sample X′ ∼ q(x)

2. Sample U ∼ Unif(0, 1)

3. Compute the log-acceptance probability

log a(X′) = log [p̄(X′|y_{1:n}) / (M q(X′))] = log [p(X′) ∏_{i=1}^{n} p(yᵢ|X′) / (M q(X′))]
          = log p(X′) + Σ_{i=1}^{n} log p(yᵢ|X′) − log M − log q(X′).

4. If log U ≤ log a(X′) then accept X′

However, this would also often not solve our issues, as

• It is often impossible to find M and q(x) such that p̄(x|y_{1:n}) ≤ M q(x).

• Bounds on the log-unnormalised posterior can be very loose, leading to extremely low acceptance
  probabilities.

This is also not the only failure mode of rejection sampling. It is often the case that rejection
sampling is very inefficient in high dimensions, even if one manages to find a good proposal q.
Consider the rejection sampler in 2D for sampling the circle within the square [−1, 1]². The acceptance
probability for this case is

â = (area of the circle) / (area of the square) = π/4 ≈ 0.78.

Next, consider the same sampler for the sphere within the cube [−1, 1]³ (3D). The acceptance
probability for this case is

â = (volume of the sphere) / (volume of the cube) = π/6 ≈ 0.52.

If we were doing this in d dimensions, the acceptance rate would be

â = (volume of the unit ball) / (volume of the cube [−1, 1]^d) = π^{d/2} / (2^d Γ(d/2 + 1)).

This ratio goes to zero incredibly fast as d grows (see Fig. 3.7). In other words, rejection
samplers have very poor acceptance rates in high dimensions. This will lead us to look at other
sampling methods.
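A minimal numerical sketch of this decay (assuming numpy and scipy) could look as follows; the computation is done in log-space to avoid overflow of the Gamma function.

import numpy as np
from scipy.special import gammaln

def ball_in_cube_acceptance(d):
    """Acceptance rate of the 'sample the unit ball inside [-1,1]^d' rejection sampler:
    volume of the unit ball divided by 2^d, computed in log-space for stability."""
    log_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return np.exp(log_ball - d * np.log(2.0))

for d in [2, 3, 5, 10, 20, 50]:
    print(d, ball_in_cube_acceptance(d))
# d = 2 gives ~0.785, d = 3 gives ~0.524, d = 20 is already ~2.5e-8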

4 MONTE CARLO INTEGRATION
In this section, we introduce Monte Carlo integration and importance sampling in detail. We will
show how these ideas can be applied to a variety of problems such as computing integrals, computing
expectations, sampling from complex distributions, and computing marginal likelihoods.

4.1 introduction to monte carlo integration

We have repeatedly highlighted that we are interested in sampling from a probability measure p.
One reason we are interested in this is to estimate expectations with respect to certain measures, i.e., we can
estimate moments of distributions. Of course, so far, we have been considering drawing samples
from known distributions (for which moments might be readily available). However, in most sampling
applications the primary goal is to compute expectations for distributions which are not available
to us in closed form.
We will call this task Monte Carlo integration. Briefly, given a probability distribution p, we
are interested in computing expectations of the form

ϕ̄ = E_p[ϕ(X)] = ∫ ϕ(x) p(x) dx,

where ϕ(x) is called a test function. For example, ϕ(x) = x would give us the mean, ϕ(x) = x²
the second moment, and ϕ(x) = −log p(x) would give us the entropy. For example, given
X⁽¹⁾, . . . , X⁽ᴺ⁾ ∼ p i.i.d., we know that (intuitively, at this point) the mean estimator is given by

E_p[X] = ∫ x p(x) dx ≈ (1/N) Σ_{i=1}^{N} X⁽ⁱ⁾,

which is simply the empirical average of the samples. While this is intuitive, it relies on a
particular choice of approximation of the probability distribution p using its samples.

In order to do this, we build an empirical distribution of the samples,

p^N(x) dx = (1/N) Σ_{i=1}^{N} δ_{X⁽ⁱ⁾}(x) dx.                        (4.1)

In order to understand how this works, we first need to understand the Dirac delta measure δ_y.
The Dirac delta measure is defined via

f(y) = ∫ f(x) δ_y(x) dx.                                              (4.2)

Here, the Dirac can be thought of as a point mass at y. In other words, the Dirac delta measure is a
measure which is concentrated at a single point. To understand it intuitively, the object δ_y(x) can
be informally thought of as a function centred at y (and taking the value 1 only at y)¹:

δ_y(x) = 1 if x = y, and 0 otherwise.

One can see that p^N is then a sample-based approximation of p, where the samples are equally
weighted. While we may never use this particular approximation of a density directly, it is useful for
building estimates of expectations. Generalising the above scenario, let us consider the estimation of the
general expectation

ϕ̄ = E_p[ϕ(X)] = ∫ ϕ(x) p(x) dx.

Given samples X⁽¹⁾, . . . , X⁽ᴺ⁾, we can build p^N as in (4.1) and approximate this expectation as

ϕ̄ = E_p[ϕ(X)] = ∫ ϕ(x) p(x) dx
  ≈ ∫ ϕ(x) p^N(x) dx
  = ∫ ϕ(x) (1/N) Σ_{i=1}^{N} δ_{Xᵢ}(x) dx
  = (1/N) Σ_{i=1}^{N} ∫ ϕ(x) δ_{Xᵢ}(x) dx
  = (1/N) Σ_{i=1}^{N} ϕ(Xᵢ) =: ϕ̂^N,                                   (4.3)

where we have used (4.2) in the approximate integral to arrive at the final expression. Note that
this generalises the example above about the mean (which was the ϕ(x) = x case). In this course,
we will also be interested in the properties of these estimators.

¹ This is not rigorously correct – just for intuition! Note that the Diracs always make sense with an integral
attached to them.

Remark 4.1. As we can see, the Monte Carlo estimator can be used to estimate expectations.
We can also use this idea to estimate integrals. Consider a standard integration problem

I = ∫ f(x) dx,

where f(x) is a function. We can use the Monte Carlo (MC) estimator to estimate this integral as

I = ∫ [f(x)/p(x)] p(x) dx
  ≈ ∫ [f(x)/p(x)] p^N(x) dx,   where p^N(x) dx = (1/N) Σ_{i=1}^{N} δ_{X⁽ⁱ⁾}(x) dx as in (4.1),
  = (1/N) Σ_{i=1}^{N} f(Xᵢ)/p(Xᵢ).

In this case, we have ϕ(x) = f(x)/p(x). This is particularly easy for integrals of the type

I = ∫₀¹ f(x) dx,

where f(x) is a function. In this case, we can use the uniform distribution as the base distribution
p and use the Monte Carlo estimator to estimate the integral without needing to compute any
ratios.

In the following, we prove some results about the properties of the Monte Carlo estimator
(4.3) when the samples are i.i.d. from p.

Proposition 4.1. Let X₁, . . . , X_N be i.i.d. samples from p. Then, the Monte Carlo estimator

ϕ̂^N = (1/N) Σ_{i=1}^{N} ϕ(Xᵢ)

is unbiased, i.e.,

E_p[ϕ̂^N] = ϕ̄.

Proof. We have

E_p[ϕ̂^N] = E_p[(1/N) Σ_{i=1}^{N} ϕ(Xᵢ)]
         = (1/N) Σ_{i=1}^{N} E_p[ϕ(Xᵢ)]
         = (1/N) Σ_{i=1}^{N} ∫ ϕ(x) p(x) dx
         = ∫ ϕ(x) p(x) dx
         = ϕ̄,

which proves the result. □


Next, we can also compute the variance of the Monte Carlo estimator.

Proposition 4.2. Let X₁, . . . , X_N be i.i.d. samples from p. Then, the Monte Carlo estimator

ϕ̂^N = (1/N) Σ_{i=1}^{N} ϕ(Xᵢ)

has variance

var_p[ϕ̂^N] = (1/N) var_p[ϕ(X)],

where

var_p[ϕ(X)] = ∫ (ϕ(x) − ϕ̄)² p(x) dx.

Proof. We have

var_p[ϕ̂^N] = var_p[(1/N) Σ_{i=1}^{N} ϕ(Xᵢ)]
           = (1/N²) Σ_{i=1}^{N} var_p[ϕ(Xᵢ)]
           = (1/N²) Σ_{i=1}^{N} ∫ (ϕ(x) − ϕ̄)² p(x) dx
           = (1/N) var_p[ϕ(X)] = σ_ϕ²/N.

Provided that var_p[ϕ(X)] < ∞, the variance vanishes as N → ∞ and the estimator is consistent.
This proves the result. □


Remark 4.2. The expression var_p[ϕ̂^N] is the variance of the MC estimator, but this expression
requires the true mean ϕ̄ to be known. In practice, we do not know the true mean, but we also
have an MC estimator for it. We can plug this estimator into the variance in order to obtain an
empirical variance estimator. Note that

var_p[ϕ̂^N] = (1/N) var_p[ϕ(X)]
           = (1/N) ∫ (ϕ(x) − ϕ̄)² p(x) dx
           ≈ (1/N²) Σ_{i=1}^{N} (ϕ(Xᵢ) − ϕ̂^N)²
           =: σ²_{ϕ,N}.

This estimator can then be used to estimate the variance of the MC estimator.

We can therefore obtain a central limit theorem for our estimator, i.e.,

(ϕ̂^N − ϕ̄)/σ_{ϕ,N} → N(0, 1)   as N → ∞.

This can be used to build empirical confidence intervals for the estimators. However, this is not
a principled estimate and may not be valid in many scenarios. We can also see that we have a
standard deviation estimate (which follows from the variance estimate) given by

std_p[ϕ̂^N] = √(var_p[ϕ̂^N]) = σ_ϕ/√N.

This is a typical display of the convergence rate O(1/√N).

Remark 4.3. One of the most common applications of sampling is to estimate probabilities. We
have seen that different choices of ϕ can lead to estimating different quantities such as the mean
and higher moments. However, the MC estimators can also be used to estimate probabilities. In
order to see this, assume that we would like to estimate P(X ∈ A) where X ∼ p. We know that
this is given as

P(X ∈ A) = ∫_A p(x) dx,

see, e.g., Definition 1.2. For example, A can simply be an interval. Given the definition above, we
can write

P(X ∈ A) = ∫_A p(x) dx = ∫ 1_A(x) p(x) dx,

where 1_A(x) is the indicator function of A. We can therefore set ϕ(x) = 1_A(x) and, given the
samples from p, we can build an estimator

P(X ∈ A) = ∫ 1_A(x) p(x) dx
         ≈ ∫ 1_A(x) p^N(x) dx
         = ∫ 1_A(x) (1/N) Σ_{i=1}^{N} δ_{Xᵢ}(x) dx
         = (1/N) Σ_{i=1}^{N} ∫ 1_A(x) δ_{Xᵢ}(x) dx
         = (1/N) Σ_{i=1}^{N} 1_A(Xᵢ).

This estimator also leads to an intuitive procedure: we sample X₁, . . . , X_N from p, and we
effectively just count the samples in A and divide by N.

Figure 4.1: Estimating π using the Monte Carlo method.

We can now return to the example of estimating π using the Monte Carlo method.

Example 4.1. Recall the problem of estimating π using the Monte Carlo method. Frame it as a
Monte Carlo integration problem and derive the algorithm rigorously.

Figure 4.2: Relative error (see next section) of the Monte Carlo estimate provided by sampling within the circle.

Solution. The logic used in this example was to estimate the area of a circle that lies
within a square. To be precise, consider the square [−1, 1] × [−1, 1] and define the uniform
distribution on this square as p(x, y) = Unif([−1, 1] × [−1, 1]). We can simply phrase the
problem as estimating the area of the circle, which we define as the set A ⊂ [−1, 1] × [−1, 1]
given by

A = {(x, y) ∈ [−1, 1] × [−1, 1] : x² + y² ≤ 1}.

We can then formalise the problem as estimating the probability that a point lies within the circle.
This is given as

P(X ∈ A) = ∫_A p(x, y) dx dy = ∫ 1_A(x, y) p(x, y) dx dy.

Sampling (Xᵢ, Yᵢ) ∼ p(x, y) (a uniform sample within the square), we can estimate this integral
using the standard MC method. More formally, we can write

P(A) = ∫_A p(x, y) dx dy
     = ∫ 1_A(x, y) p(x, y) dx dy
     ≈ (1/N) Σ_{i=1}^{N} 1_A(Xᵢ, Yᵢ) → π/4   as N → ∞.

A trajectory of the resulting estimate of π as the sample size varies can be seen in Fig. 4.1.
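A minimal sketch of this estimator (assuming numpy; the sample size and seed are arbitrary) could look as follows.

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
x = rng.uniform(-1.0, 1.0, size=N)
y = rng.uniform(-1.0, 1.0, size=N)
inside = (x**2 + y**2 <= 1.0)                 # indicator 1_A(X_i, Y_i)
pi_hat = 4.0 * inside.mean()                  # MC estimate of P(A) = pi/4, scaled by 4
print(pi_hat, abs(pi_hat - np.pi))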

Figure 4.3: Monte Carlo integration of h(x) = [cos(50x) + sin(20x)]2


Nonasymptotic results showing the convergence rate of O(1/ N ) are also available (see,
e.g., Akyildiz (2019, Corollary 2 .1)) – see Fig. 4.2 for a demonstration.

Example 4.2 (Example 3.4 from Robert and Casella (2004)). Let us consider an example of
estimating an integral:

I = ∫₀¹ h(x) dx = ∫₀¹ [cos(50x) + sin(20x)]² dx.

The exact value of this integral is 0.965. Describe a Monte Carlo method to estimate this integral.

Solution. We can just choose p(x) = Unif(0, 1) and set ϕ(x) = h(x). We can then write

I = ∫₀¹ h(x) dx = ∫₀¹ ϕ(x) p(x) dx,

and apply the standard MC estimator. The results (together with the empirical variance estimate)
can be seen in Fig. 4.3.

Finally, we provide an example of estimating the probability of a random variable.

Example 4.3. Consider X ∼ N (0, 1) and we would like to estimate the probability that X > 2.
Describe the MC method.

Figure 4.4: Monte Carlo estimation of the tail probability X > 2. The “true value” is computed via numerical
integration.

Solution. The way to do this is to choose

p(x) = N(0, 1),   ϕ(x) = 1_{x>2}(x).

We can then write

P(X > 2) = ∫_{−∞}^{∞} 1_{x>2}(x) N(x; 0, 1) dx ≈ (1/N) Σ_{i=1}^{N} 1_{Xᵢ>2}(Xᵢ),

where X₁, . . . , X_N ∼ N(0, 1). The results can be seen in Fig. 4.4.
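A minimal sketch of this estimator (assuming numpy and scipy; the sample size is arbitrary) could look as follows; the comparison value is obtained from the Gaussian CDF.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1_000_000
x = rng.standard_normal(N)
p_hat = (x > 2.0).mean()                      # MC estimate of P(X > 2)
print(p_hat, 1.0 - norm.cdf(2.0))             # compare with the numerically computed value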

4.2 error metrics

In order to quantify convergence, we will use a number of error metrics in the following sections.
We start by defining the bias as

bias(ϕ̂^N) = E[ϕ̂^N] − ϕ̄,                                              (4.4)

where ϕ̄ is the true value. We call an estimator unbiased if the bias is zero. In the case where we
sample i.i.d. from p(x), we can build unbiased estimators of expectations and integrals. We recall
the variance

var(ϕ̂^N) = E[(ϕ̂^N − E[ϕ̂^N])²].                                        (4.5)

If the estimator is unbiased, we can then replace E[ϕ̂^N] with ϕ̄. Next, we define the mean squared
error (MSE)

MSE(ϕ̂^N) = E[(ϕ̂^N − ϕ̄)²].                                             (4.6)

One can see that the MSE and the variance coincide if the estimator is unbiased. We also have
the following decomposition of the MSE:

MSE(ϕ̂^N) = bias(ϕ̂^N)² + var(ϕ̂^N).                                     (4.7)

We define the root mean square error (RMSE) as

RMSE(ϕ̂^N) = √(MSE(ϕ̂^N)).                                              (4.8)

Finally, we define the relative absolute error (RAE) as

RAE(ϕ̂^N) = |ϕ̂^N − ϕ̄| / |ϕ̄|.                                           (4.9)

We usually plot the absolute error of the estimator, as we generally only run the experiment once².
We note that this absolute error |ϕ̂^N − ϕ̄| is a random variable (since no expectations are taken).
However, this quantity provably converges with a rate of O(1/√N) (see, e.g., Akyildiz (2019,
Corollary 2.1)). More precisely, we can write

|ϕ̂^N − ϕ̄| ≤ V/√N,                                                      (4.10)

where V is an almost surely finite random variable. This error rate will be displayed empirically
in the following sections (see also Fig. 4.2).

Example 4.4 (Marginal likelihood estimation). Recall that, given a prior p(x) and a likelihood
p(y|x), we can compute the marginal likelihood p(y) as

p(y) = ∫ p(y|x) p(x) dx.

Describe the MC method to estimate p(y).

Solution. This defines a nice integration problem that we can solve using MC. Assume that we
are given the following model:

p(x) = N(x; µ₀, σ₀²),
p(y|x) = N(y; x, σ²).

² However, if you were to do proper experimentation, then you would have to run the same experiment M
times (Monte Carlo runs) and average the error to estimate the RMSE.

Figure 4.5: Estimating the marginal likelihood p(y) for y = 1. One can clearly see the displayed error rate of O(1/√N).

Assume that µ₀ = 0, σ₀ = 1, σ = 2, and y = 1. For this fixed y = 1, we want to
estimate p(y = 1). This integral becomes

p(y = 1) = ∫ p(y = 1|x) p(x) dx,

where we can set ϕ(x) = p(y = 1|x). We can then compute the integral using the MC estimation
procedure as

p^N(y = 1) = (1/N) Σ_{i=1}^{N} p(y = 1|Xᵢ),

where X₁, . . . , X_N ∼ p(x). The results can be seen in Fig. 4.5.
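A minimal sketch of this estimator (assuming numpy and scipy) could look as follows; since the model is conjugate, we can also compare against the analytic marginal likelihood.

import numpy as np
from scipy.stats import norm

mu0, sigma0, sigma, y = 0.0, 1.0, 2.0, 1.0
rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(mu0, sigma0, size=N)                     # X_i ~ p(x)
p_hat = norm.pdf(y, loc=x, scale=sigma).mean()          # (1/N) sum_i p(y = 1 | X_i)
p_true = norm.pdf(y, loc=mu0, scale=np.sqrt(sigma0**2 + sigma**2))  # analytic check
print(p_hat, p_true)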

We will next consider another example of estimating a probability where we show how to
quantify the variance using the true value.

Example 4.5. Consider the following density:

p(x) = 1/(π(1 + x²)).

We would like to compute the probability of X ∼ p(x) being larger than 2, i.e., P(X > 2).
Describe the method and suggest an improvement.

Solution. We can compute this probability using MC estimation, as we can set

ϕ(x) = 1_{x>2}(x).

Figure 4.6: Cauchy density of Example 4.5.

We can compute

P(X > 2) = ∫₂^∞ p(x) dx = ∫ 1_{x>2}(x) p(x) dx.

We can also compute the true value of this integral (see Example 2.3 for the CDF of this density):

I = ϕ̄ = ∫₂^∞ p(x) dx = F_X(∞) − F_X(2) = 1/2 − (1/π) tan⁻¹(2) ≈ 0.1476.

Let us compute the variance of the Monte Carlo estimator for N = 10 samples:

var(ϕ̂^N) = var_p(ϕ)/N.

So we need to compute:

var_p(ϕ) = ∫ ϕ(x)² p(x) dx − (∫ ϕ(x) p(x) dx)²
         = ∫ 1_{x>2}(x)² p(x) dx − (∫ 1_{x>2}(x) p(x) dx)²
         = ∫ 1_{x>2}(x) p(x) dx − (∫ 1_{x>2}(x) p(x) dx)²
         = 0.1476 − 0.1476² ≈ 0.125.

The variance of the estimator is then

var(ϕ̂^N) = 0.125/10 = 0.0125.

Could we do better? An idea is to use the fact that the density is symmetric around zero: this
means P(X > 2) = P(X < −2) (see Fig. 4.6). So we could compute

P(|X| > 2) = P(X > 2) + P(X < −2) = 2I.

Therefore, our new problem is I = (1/2) P(|X| > 2). Let us write it as

I = (1/2) ∫_{|x|>2} p(x) dx = (1/2) ∫ 1_{|x|>2}(x) p(x) dx.

Now define the test function

ϕ(x) = (1/2) 1_{|x|>2}(x).

As before, we need to compute var_p(ϕ):

var_p(ϕ) = ∫ ϕ(x)² p(x) dx − (∫ ϕ(x) p(x) dx)²
         = (1/4) ∫ 1_{|x|>2}(x) p(x) dx − (1/4) (∫ 1_{|x|>2}(x) p(x) dx)²
         = (1/4) × 2 × 0.1476 − (1/4) × (2 × 0.1476)²
         ≈ 0.052.

Therefore, the variance of the estimator for N = 10 samples is

var(ϕ̂^N) = 0.052/10 = 0.0052.

This is an improvement over the previous estimator! This kind of variance improvement is crucial in
safety-critical applications.
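A minimal numerical sketch of this variance comparison (assuming numpy; the number of repeated runs is an arbitrary choice used only to estimate the two variances empirically) could look as follows.

import numpy as np

rng = np.random.default_rng(0)
N = 10
runs = 100_000

x = rng.standard_cauchy((runs, N))              # Cauchy samples, N per run
est_plain = (x > 2.0).mean(axis=1)              # phi(x) = 1{x > 2}
est_sym = 0.5 * (np.abs(x) > 2.0).mean(axis=1)  # phi(x) = 0.5 * 1{|x| > 2}

print(est_plain.var(), est_sym.var())           # ~0.0125 vs ~0.0052, as computed above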

4.3 importance sampling

While the estimators constructed using samples coming exactly from p have desirable properties,
as we have seen above, in the majority of cases we need to employ more complex sampling
strategies. A few cases where we need this are summarised below.

• A typical problem arises when computing tail probabilities (also called rare events). We
  may have access to samples directly from p(x); however, sampling from the tail of p(x)
  might be extremely difficult. For example, consider the Gaussian random variable X
  with mean 0 and variance 1. The probability of X being larger than 4 is very small, i.e.,
  P(X > 4) ≈ 0.00003. Sampling from the tail of this density directly would be very
  inefficient without further tricks.
• Another typical scenario is where we want to compute expectations with respect to
  p(x) but do not have direct samples from it. The standard example of this is the
  Bayesian setting. Given a prior p(x) and a likelihood p(y|x), we may want to compute
  expectations w.r.t. the posterior density p(x|y), i.e., E_{p(x|y)}[ϕ(X)]. In this case, we do not
  have access to samples from p(x|y), so we need to employ other strategies.
A strategy we will pursue in this section is specific to Monte Carlo integration. In other words, we
will next describe a strategy where we can compute integrals and expectations w.r.t. a probability
density without having access to samples from it. This is slightly different than directly aiming at
sampling from the density (which can also be used to estimate integrals). While we will look at
sampling methods in the following chapters, it is important to note that importance sampling is
primarily an integration technique.

4.3.1 basic importance sampling


Consider the basic task of estimating the integral
\[
\bar{\varphi} = \mathbb{E}_p[\varphi(X)] = \int \varphi(x)\, p(x)\, \mathrm{d}x.
\]

In this section, as opposed to previous sections, we assume that we cannot sample from p directly
(or exactly, e.g., using rejection sampling3 ). However, we assume (in this section) we can evaluate
the density p(x) pointwise. We can still estimate this expectation (and compute integrals more
generally), using samples from an instrumental, proposal distribution q. In other words, we
can sample from a proposal and we can repurpose these samples to estimate expectations w.r.t.
p(x). This resembles rejection sampling, where we also used a proposal to accept or reject
samples. However, in this case we will employ a different strategy: we will weight the samples and will
not throw any of them away. The weights are designed so that the weighted sample average gets
close to the true integral. In order to see how to do this, we compute
\[
\bar{\varphi} = \int \varphi(x)\, p(x)\, \mathrm{d}x
= \int \varphi(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x \quad \text{(“identity trick”)} \quad (4.11)
\]
\[
= \int \varphi(x)\, w(x)\, q(x)\, \mathrm{d}x, \quad (4.12)
\]

where w(x) = p(x)/q(x) (which is called the weight function). We know from Section 4.1 that
we can estimate the integral in (4.12) using samples from q. Let Xi ∼ q be i.i.d. samples from q
for i = 1, . . . , N . We can then estimate the integral in (4.12), hence the expectation ϕ̄, using
\[
\bar{\varphi} = \int \varphi(x)\, w(x)\, q(x)\, \mathrm{d}x
\approx \frac{1}{N} \sum_{i=1}^{N} w_i\, \varphi(X_i) = \hat{\varphi}^N_{\mathrm{IS}}, \quad (4.13)
\]
³Recall that rejection sampling draws i.i.d. samples from the density itself, not approximate ones.

where wi = w(Xi ) = p(Xi )/q(Xi ) are called the weights. The weights will play a crucial role
throughout this section. The key idea of importance sampling is that, instead of throwing away
the samples by rejection, we could reweight them according to their importance. This is why this
strategy is called importance sampling (IS).
The importance sampling algorithm for this case can then be described relatively straight-
forwardly. Given p(x) (which we can evaluate), we choose a proposal q(x). Then, we sample
Xi ∼ q(x) for i = 1, . . . , N and compute the IS estimator as
\[
\hat{\varphi}^N_{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^{N} w_i\, \varphi(X_i),
\]
where wi = p(Xi)/q(Xi) for i = 1, . . . , N are the importance weights. We summarise the method in
Algorithm 7. In what follows, we will discuss some details of the method.

Algorithm 7 Pseudocode for basic importance sampling

1: Input: The number of samples N.
2: for i = 1, . . . , N do
3:    Sample Xi ∼ q(x).
4:    Compute the weight wi = p(Xi)/q(Xi).
5: end for
6: Report the estimator
\[
\hat{\varphi}^N_{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^{N} w_i\, \varphi(X_i).
\]
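A minimal sketch of Algorithm 7 (assuming NumPy; the function names and the Gaussian target–proposal pair are illustrative choices, not fixed by the algorithm) might look as follows.

```python
import numpy as np

def importance_sampling(phi, p_pdf, q_pdf, q_sample, N, rng):
    """Basic IS estimator of E_p[phi(X)] using samples from a proposal q (Algorithm 7)."""
    X = q_sample(rng, N)                   # X_i ~ q
    w = p_pdf(X) / q_pdf(X)                # importance weights w_i = p(X_i) / q(X_i)
    return np.mean(w * phi(X))

# Illustrative usage: estimate E[X^2] under p = N(0, 1) with proposal q = N(0, 2^2).
rng = np.random.default_rng(1)
p_pdf = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))
q_sample = lambda rng, n: 2.0 * rng.standard_normal(n)
print(importance_sampling(lambda x: x**2, p_pdf, q_pdf, q_sample, 100_000, rng))  # approx. 1
```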

Remark 4.4. Unlike rejection sampling, in importance sampling the proposal does not have
to dominate the target density. Instead, the crucial requirement for IS is that the support of
the proposal covers the support of the target density. More precisely, we need q(x) > 0
whenever p(x) > 0. This is far less restrictive than the requirement of rejection sampling. Of
course, the choice of proposal can still affect the performance of IS. We will discuss this in
more detail.

From Fig. 4.7, one can see an example plot of the target density p(x), the proposal q(x) and
the associated weight function w(x). See the caption for more details and intuition.

Example 4.6. Consider the problem of estimating P(X > 4) for X ∼ N (0, 1). Describe a
potential problem of using the naive MC method.

Solution. While we can exactly sample from this density, given that

P(X > 4) = 3.16 × 10−5 ,

Figure 4.7: An example of target density p(x), the proposal q(x) and the associated weight function w(x). One can
see that if q(x) < p(x) (which means fewer samples would be drawn from q(x) in this region), then w(x) > 1 to
account for this effect. The opposite is also true, since if q(x) > p(x), this means that we would draw more samples
than necessary, which should be downweighted, hence w(x) < 1 in these regions.

it will be the case that very few of the samples from the exact distribution will fall into this tail. (Note
that, while we know the exact value in this case, we will not know it in general – this is just
a demonstrative example.) In fact, a standard run with N = 10000 gives exactly zero samples
that satisfy Xi > 4, hence provides the estimate zero! This is clearly not a good way
to estimate the probability, and we can use importance sampling instead. Consider a proposal
q(x) = N (6, 1). This will draw a lot of samples from the region X > 4 and we can reweight these
samples w.r.t. the target density using the IS estimator in (4.13). A standard run in this case with
N = 10000 results in
\[
\hat{\varphi}^N_{\mathrm{IS}} = 3.18 \times 10^{-5},
\]
which is obviously a much closer number to the truth.
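A short sketch of this experiment (illustrative code, assuming NumPy; the seed and sample size are arbitrary) is given below.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000

# Naive MC: sample from the target N(0, 1) directly
X = rng.standard_normal(N)
naive = np.mean(X > 4.0)                       # almost always exactly 0

# IS with proposal q = N(6, 1): most samples land near the tail region of interest
Y = 6.0 + rng.standard_normal(N)
log_w = (-0.5 * Y**2) - (-0.5 * (Y - 6.0)**2)  # log p(y) - log q(y); constants cancel
is_est = np.mean(np.exp(log_w) * (Y > 4.0))
print(naive, is_est)                           # is_est should be close to 3.16e-5
```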

One can next prove that the estimator ϕ̂N_IS is unbiased.

Proposition 4.3. The estimator ϕ̂N_IS is unbiased, i.e.,
\[
\mathbb{E}_{q(x)}[\hat{\varphi}^N_{\mathrm{IS}}] = \bar{\varphi}.
\]

Proof. We simply write
\[
\mathbb{E}_q[\hat{\varphi}^N_{\mathrm{IS}}]
= \mathbb{E}_q\left[\frac{1}{N}\sum_{i=1}^{N} w_i\, \varphi(X_i)\right]
= \mathbb{E}_q\left[\frac{1}{N}\sum_{i=1}^{N} \frac{p(X_i)}{q(X_i)}\, \varphi(X_i)\right]
= \frac{1}{N}\sum_{i=1}^{N} \int \frac{p(x)}{q(x)}\, \varphi(x)\, q(x)\, \mathrm{d}x \quad \text{(since } X_i \sim q(x)\text{)}
= \int \varphi(x)\, p(x)\, \mathrm{d}x = \bar{\varphi},
\]
which completes the proof. □
An important quantity in IS is the variance of the estimator ϕ̂N_IS. The variance of the estimator
is a measure of how much the estimator fluctuates around its expected value. The variance of the
IS estimator (4.13) is given by the following proposition.

Proposition 4.4. The variance of the estimator ϕ̂N_IS is given by
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\left(\mathbb{E}_q[w^2(X)\, \varphi^2(X)] - \bar{\varphi}^2\right).
\]

Proof. We write out the estimator ϕ̂N_IS in (4.13):
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}]
= \mathrm{var}_q\left[\frac{1}{N}\sum_{i=1}^{N} w_i\, \varphi(X_i)\right]
= \frac{1}{N^2}\, \mathrm{var}_q\left[\sum_{i=1}^{N} w(X_i)\, \varphi(X_i)\right]
= \frac{1}{N}\, \mathrm{var}_q[w(X)\, \varphi(X)] \quad \text{where } X \sim q(x)
\]
\[
= \frac{1}{N}\left(\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right] - \mathbb{E}_q[w(X)\, \varphi(X)]^2\right)
= \frac{1}{N}\left(\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right] - \bar{\varphi}^2\right),
\]
which concludes the proof. We have used the fact that the variance of a sum of independent
random variables is the sum of the variances. □
One can see that this easily leads to the bound stdq[ϕ̂N_IS] ≤ O(1/√N) for the standard deviation.
Also, we still have the result for the relative absolute error as
\[
|\hat{\varphi}^N_{\mathrm{IS}} - \bar{\varphi}| \leq \frac{V}{\sqrt{N}},
\]

Figure 4.8: The importance sampling estimator ϕ̂N_IS plotted against the number of samples N for the example
in Fig. 4.7, for ϕ(x) = x². This demonstrates that the random error in the IS case also satisfies the O(1/√N)
convergence rate.

where V is an almost surely finite random variable. As in the perfect MC case, we will not prove
this result as it is beyond our scope, but the curious reader can refer to Corollary 2.2 in Akyildiz (2019)
(which also holds for the self-normalised case introduced below). A demonstration
of this rate for importance sampling can be seen in Fig. 4.8.
We can see that the variance of the IS estimator is finite if
\[
\mathbb{E}_q[w^2(X)\, \varphi^2(X)] < \infty.
\]
This implies that
\[
\int w^2(x)\, \varphi^2(x)\, q(x)\, \mathrm{d}x = \int \frac{p(x)}{q(x)}\, \varphi^2(x)\, p(x)\, \mathrm{d}x < \infty.
\]
In other words, for our importance sampling estimate to be well-defined, the ratio
\[
\frac{p^2(x)}{q(x)}\, \varphi^2(x)
\]
has to be integrable. We will next see an example where this condition is not satisfied.

Example 4.7 (Infinite variance IS, Example 3.8 from Robert and Casella (2010)). Consider the
target
\[
p(x) = \frac{1}{\pi}\, \frac{1}{1 + x^2},
\]

Figure 4.9: Estimating P(2 < X < 6) where X is Cauchy with q(x) = N (0, 1). The true value is plotted in red
and the estimator value in black.

which is the Cauchy density. Let us choose the proposal
\[
q(x) = \mathcal{N}(x; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).
\]
Discuss whether the IS estimator is well-defined.

Solution. We can compute the ratio p(x)/q(x) ∝ exp(x²/2)/(1 + x²) and see that it is explosive
(precisely, its integral is infinite). This can result in problematic situations even if ϕ ensures that
the variance is finite. For example, consider the problem of estimating P(2 < X < 6). One
example run for this case can be seen in Fig. 4.9. One can see that the estimator in this case is
unstable and cannot be reliably used.

Remark 4.5 (Optimal proposal). We can inspect the variance expression to figure out
which proposals give us variance reduction. From Prop. 4.4, it follows that we have
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\, \mathrm{var}_q[w(X)\, \varphi(X)].
\]
This means that minimising the variance of the IS estimator is the same as minimising the variance
of the function w(x)ϕ(x). Moreover, looking at the expression
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\left(\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right] - \bar{\varphi}^2\right),
\]
we can see that since ϕ̄² > 0 (which is independent of the proposal), we should choose a proposal
that minimises Eq[w²(X)ϕ²(X)]. We can lower bound this quantity using Jensen’s inequality:
\[
\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right] \geq \mathbb{E}_q[w(X)\, |\varphi(X)|]^2,
\]
where we used the fact that (·)² is a convex function (for a convex function f , Jensen’s inequality
states that Eq[f (X)] ≥ f (Eq[X])). Using w(x) = p(x)/q(x), we arrive at the following lower
bound:
\[
\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right] \geq \mathbb{E}_p[|\varphi(X)|]^2. \quad (4.14)
\]
Now let us expand the term Eq[w²(X)ϕ²(X)] and write
\[
\mathbb{E}_q\left[w^2(X)\, \varphi^2(X)\right]
= \mathbb{E}_q\left[\frac{p^2(X)}{q^2(X)}\, \varphi^2(X)\right]
= \int \frac{p^2(x)}{q^2(x)}\, \varphi^2(x)\, q(x)\, \mathrm{d}x
= \int \frac{p(x)}{q(x)}\, \varphi^2(x)\, p(x)\, \mathrm{d}x
= \mathbb{E}_p\left[w(X)\, \varphi^2(X)\right]. \quad (4.15)
\]
The last equation, eq. (4.15), suggests that we can choose a proposal such that we attain the lower
bound (4.14) (which means that it would be the minimiser). In particular, if we
choose a proposal q(x) such that
\[
w(x) = \frac{p(x)}{q(x)} = \frac{\mathbb{E}_p[|\varphi(X)|]}{|\varphi(x)|}
\]
is satisfied, then (4.15) is equal to the lower bound (4.14). This implies that
\[
q_\star(x) = p(x)\, \frac{|\varphi(x)|}{\mathbb{E}_p[|\varphi(X)|]} \quad (4.16)
\]
would minimise the variance of the importance sampling estimator.
Choosing q⋆ as the proposal, one can see that the variance of the IS estimator satisfies
\[
\mathrm{var}_{q_\star}[\hat{\varphi}^N_{\mathrm{IS}}]
= \frac{1}{N}\, \mathbb{E}_p[|\varphi(X)|]^2 - \frac{1}{N}\, \bar{\varphi}^2
\leq \frac{1}{N}\, \mathbb{E}_p\left[\varphi^2(X)\right] - \frac{1}{N}\, \bar{\varphi}^2
= \mathrm{var}_p[\hat{\varphi}^N_{\mathrm{MC}}],
\]
therefore we obtain
\[
\mathrm{var}_{q_\star}[\hat{\varphi}^N_{\mathrm{IS}}] \leq \mathrm{var}_p[\hat{\varphi}^N_{\mathrm{MC}}],
\]
i.e., a variance reduction. In fact, one can show that if ϕ(x) ≥ 0 for all x ∈ R, then the variance
of the IS estimator with the optimal proposal q⋆ is equal to zero.

We note that this optimal construction of the proposal (4.16) is not possible to implement in
practice. It requires the knowledge of the very quantity we want to estimate, namely, Ep [|ϕ(X)|]!
But in general, we can choose proposals that minimise the variance of the IS estimator where
possible. This idea has been used in the literature to construct proposals that minimise the
variance of the estimator, see, e.g., Akyildiz and Míguez (2021) and references therein. Within
the context of this course, we will construct some simple examples for this purpose later.

4.3.2 self-normalised importance sampling


As mentioned several times in past chapters, in many scenarios, we have access to the unnormalised
density, i.e., given p, we can evaluate it up to a normalising constant. As usual, we denote this
density p̄(x) and recall that it is related to p by
\[
p(x) = \frac{\bar{p}(x)}{Z},
\]
where Z = ∫ p̄(x) dx. In the context of Bayesian inference, we usually have an unnormalised
posterior density p̄(x|y) ∝ p(y|x)p(x). In the previous section, we have built an importance
sampling estimator for the case where we have access to the normalised density. In this section,
we will generalise the idea and assume we only have access to the unnormalised density.
Consider, again, the problem of estimating expectations of a given density p. For the case
where we can only evaluate p̄(x), one way to estimate this expectation is to sample from a proposal
distribution q and rewrite the integral as
\[
\bar{\varphi} = \int \varphi(x)\, p(x)\, \mathrm{d}x
= \frac{\int \varphi(x)\, \frac{\bar{p}(x)}{q(x)}\, q(x)\, \mathrm{d}x}{\int \frac{\bar{p}(x)}{q(x)}\, q(x)\, \mathrm{d}x}, \quad (4.17)
\]
where we use the fact that p(x) = p̄(x)/Z. This gives us two separate integration problems, one
to estimate the numerator and one to estimate the denominator. We will estimate both quantities
using samples from q(x)⁴.
Let us now introduce the unnormalised weight function
\[
W(x) = \frac{\bar{p}(x)}{q(x)},
\]
which is analogous to the normalised weight function w(x) in the previous section. Using
Xi ∼ q(x) and building the Monte Carlo estimators of the numerator and denominator, we arrive
at the following estimator of (4.17):
\[
\hat{\varphi}^N_{\mathrm{SNIS}}
= \frac{\frac{1}{N}\sum_{i=1}^{N} \varphi(X_i)\, W(X_i)}{\frac{1}{N}\sum_{i=1}^{N} W(X_i)}
= \frac{\sum_{i=1}^{N} \varphi(X_i)\, W(X_i)}{\sum_{i=1}^{N} W(X_i)}
= \sum_{i=1}^{N} \bar{w}_i\, \varphi(X_i), \quad (4.18)
\]
⁴We do not have to; see, e.g., Lamberti et al. (2018).

Algorithm 8 Pseudocode for self-normalised importance sampling

1: Input: The number of samples N.
2: for i = 1, . . . , N do
3:    Sample Xi ∼ q(x).
4:    Compute the unnormalised weight W(Xi) = p̄(Xi)/q(Xi).
5: end for
6: Normalise:
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)}, \qquad i = 1, \ldots, N.
\]
7: Report the estimator
\[
\hat{\varphi}^N_{\mathrm{SNIS}} = \sum_{i=1}^{N} \bar{w}_i\, \varphi(X_i).
\]

where
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)}
\]
are the normalised weights. This estimator (4.18) is called the self-normalised importance sam-
pling (SNIS) estimator.

Remark 4.6 (Bias and variance). As opposed to the plain IS estimator, the estimator ϕ̂N_SNIS is biased.
The reason for this bias can be seen by recalling the integral (4.17). By sampling from q(x), we
can construct unbiased estimates of the numerator and denominator; however, the ratio of these
two quantities is biased in general. It can be shown that the bias of the SNIS estimator
decreases at a rate O(1/N) (Agapiou et al., 2017).
Since the SNIS estimator is biased, we cannot use the same variance formula as in the previous
section; it makes more sense to consider MSE(ϕ̂N_SNIS) instead. This quantity is challenging
to control in general without bounded test functions. With bounded test functions, it is possible
to show that MSE(ϕ̂N_SNIS) is controlled at a rate O(1/N) (Agapiou et al., 2017; Akyildiz
and Míguez, 2021). We will not go into the details of this result here.

We can now describe the estimation procedure using SNIS. Given an unnormalised density
p̄(x), we first sample N samples from a proposal, X1 , . . . , XN ∼ q(x), and then compute the
normalised weights
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)},
\]
where W(x) = p̄(x)/q(x). Finally, we compute the estimator
\[
\hat{\varphi}^N_{\mathrm{SNIS}} = \sum_{i=1}^{N} \bar{w}_i\, \varphi(X_i).
\]
The full procedure is summarised in Algorithm 8.
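A minimal sketch of the SNIS procedure of Algorithm 8 (assuming NumPy; the unnormalised Gaussian target and the proposal are illustrative choices) might look as follows.

```python
import numpy as np

def snis(phi, log_p_bar, log_q, q_sample, N, rng):
    """Self-normalised IS estimate of E_p[phi(X)] when only p_bar = Z * p can be evaluated."""
    X = q_sample(rng, N)
    log_W = log_p_bar(X) - log_q(X)             # log unnormalised weights
    log_W -= log_W.max()                         # stabilisation (see Section 4.6.1)
    w_bar = np.exp(log_W) / np.exp(log_W).sum()  # normalised weights
    return np.sum(w_bar * phi(X))

# Illustrative usage: unnormalised target p_bar(x) = exp(-0.5 * (x - 1)^2), proposal q = N(0, 2^2).
rng = np.random.default_rng(3)
log_p_bar = lambda x: -0.5 * (x - 1.0) ** 2
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))
q_sample = lambda rng, n: 2.0 * rng.standard_normal(n)
print(snis(lambda x: x, log_p_bar, log_q, q_sample, 100_000, rng))   # target mean, approx. 1
```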

4.4 bayesian inference with importance sampling


In this section, we demonstrate the application of the Monte Carlo and importance sampling
estimators to Bayesian inference. Self-normalised IS is a natural choice for Bayesian inference.
Assume that we have a prior p(x) and a likelihood p(y|x). The posterior is given by
\[
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int p(y \mid x)\, p(x)\, \mathrm{d}x}.
\]
Let p̄(x|y) := p(y|x)p(x) as usual and design an importance sampler to estimate expectations of
the form
\[
\mathbb{E}_{p(x|y)}[\varphi(X)] = \int \varphi(x)\, p(x \mid y)\, \mathrm{d}x.
\]

Assume that we choose q(x) and decide to perform SNIS. We first sample X1 , . . . , XN ∼ q(x)
and construct
\[
W_i = \frac{\bar{p}(X_i \mid y)}{q(X_i)} = \frac{p(y \mid X_i)\, p(X_i)}{q(X_i)}.
\]
We can now normalise these weights and obtain
\[
\bar{w}_i = \frac{W_i}{\sum_{j=1}^{N} W_j},
\]
which will give us the Monte Carlo estimator:
\[
\mathbb{E}_{p(x|y)}[\varphi(X)] \approx \sum_{i=1}^{N} \bar{w}_i\, \varphi(X_i).
\]
It is useful to recall that this estimator is biased, since it is a SNIS estimator. However, as a
byproduct of this estimator, we can also obtain an unbiased estimate of the marginal likelihood
p(y). Note that this is already provided by the SNIS computation:
\[
p(y) \approx \frac{1}{N} \sum_{i=1}^{N} W_i.
\]

Proposition 4.5. The marginal likelihood estimator given by
\[
p^N(y) = \frac{1}{N} \sum_{i=1}^{N} W_i
\]
is an unbiased estimator of the marginal likelihood p(y).

Proof. We can easily show this:
\[
\mathbb{E}_q\left[\sum_{i=1}^{N} W_i\right]
= \mathbb{E}_q\left[\sum_{i=1}^{N} \frac{p(y \mid X_i)\, p(X_i)}{q(X_i)}\right]
= N\, \mathbb{E}_q\left[\frac{p(y \mid X)\, p(X)}{q(X)}\right]
= N \int \frac{p(y \mid x)\, p(x)}{q(x)}\, q(x)\, \mathrm{d}x
= N\, p(y). \qquad \square
\]
As we have seen before, a number of interesting problems require computing normalising
constants, including model selection and prediction. SNIS estimators are very useful in the sense
that they provide an unbiased estimate of this constant as a byproduct. Let us now look at an example using the optimal
importance proposal introduced in Remark 4.5.

Example 4.8 (Marginal likelihood using optimal importance proposal). We have seen that
we can get unbiased estimates of the marginal likelihood. Find the optimal proposal q? for
estimating the marginal likelihood.

Solution. We will now see how to find the optimal importance proposal to compute the marginal
likelihood. Note that we have
\[
p(y) = \int p(y \mid x)\, p(x)\, \mathrm{d}x,
\]
for some prior p(x) and likelihood p(y|x). As we mentioned before, in this case p(x) can
be seen as the distribution to sample from and ϕ(x) = p(y|x), which gives the standard problem of
integrating ∫ ϕ(x)p(x)dx. A naive way to approximate this quantity (as we have seen before) is
to sample i.i.d. from p(x) and approximate the integral, i.e., X1 , . . . , XN ∼ p(x) and write
\[
p^N_{\mathrm{MC}}(y) = \frac{1}{N} \sum_{i=1}^{N} \varphi(X_i) = \frac{1}{N} \sum_{i=1}^{N} p(y \mid X_i).
\]

We can now look at the variance of this estimator:
\[
\mathrm{var}_p\left[p^N_{\mathrm{MC}}(y)\right] = \frac{1}{N}\, \mathrm{var}_{p(x)}[p(y \mid x)].
\]
This quantity may depend on the prior–likelihood selection and can be large.
Let us take Remark 4.5 seriously, as the question suggests, and search for the optimal proposal
q⋆. From (4.16), we can see that
\[
q_\star(x) = p(x)\, \frac{|\varphi(x)|}{\mathbb{E}_p[|\varphi(X)|]}.
\]

In this case, however, we have ϕ(x) = p(y|x) (and |ϕ(x)| = ϕ(x) since the likelihood is positive
everywhere). We can now write
\[
q_\star(x) = p(x)\, \frac{p(y \mid x)}{\mathbb{E}_p[p(y \mid x)]} = p(x)\, \frac{p(y \mid x)}{p(y)}.
\]
In other words, the optimal proposal is the posterior itself! Now we can compute the IS estimator
variance where we plug in q⋆ = p(x|y). To explore the variance, we write
\[
\mathrm{var}_{q_\star}[p^N_{\mathrm{IS}}(y)]
= \frac{1}{N}\left(\mathbb{E}_{q_\star}\left[\left(\frac{p(x)}{q_\star(x)}\, p(y \mid x)\right)^2\right] - p(y)^2\right).
\]

We compute the first term in brackets:
\[
\mathbb{E}_{q_\star}\left[\left(\frac{p(x)}{q_\star(x)}\, p(y \mid x)\right)^2\right]
= \int \frac{p^2(x)}{q_\star^2(x)}\, p(y \mid x)^2\, q_\star(x)\, \mathrm{d}x
= \int \frac{p^2(x)}{q_\star(x)}\, p(y \mid x)^2\, \mathrm{d}x
= \int \frac{p^2(x)\, p(y)}{p(y \mid x)\, p(x)}\, p(y \mid x)^2\, \mathrm{d}x
= p(y) \int p(x)\, p(y \mid x)\, \mathrm{d}x
= p(y)^2.
\]

Plugging this back into the variance expression for var_{q⋆}[pN_IS(y)], we obtain
\[
\mathrm{var}_{q_\star}[p^N_{\mathrm{IS}}(y)] = \frac{1}{N}\left(p(y)^2 - p(y)^2\right) = 0.
\]
It can be seen that we can achieve zero variance, but as we mentioned before, this required us to
know the posterior density.

4.5 choosing the optimal importance sampling proposal within a family

While we may not be able to choose the optimal proposal q⋆ as shown above (as it requires us to
know the target density), we can still optimise the proposal within a family of distributions. Recall
the discussion for this in the case of rejection sampling, where we optimised the acceptance rate
â in order to choose the optimal parameter of a proposal. In the case of importance sampling,
the target is to minimise the variance. Let qθ be a parametric family of proposals. Recall that in
Figure 4.10: The density of the exponential distribution pλ , the proposal qµ and K = 6

Prop. 4.4 we defined the variance of the IS estimator as
\[
\mathrm{var}_{q_\theta}[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\left(\mathbb{E}_{q_\theta}\left[w_\theta^2(X)\, \varphi^2(X)\right] - \bar{\varphi}^2\right),
\]
where ϕ̄ = Ep[ϕ(X)] and
\[
w_\theta(x) = \frac{p(x)}{q_\theta(x)}.
\]
Note that 1/N does not change the location of the minimum and ϕ̄² is independent of θ. Therefore,
we drop these terms and see that, in order to minimise this variance w.r.t. θ, we need to solve the
following optimisation problem:
\[
\theta_\star = \operatorname*{argmin}_{\theta}\; \mathbb{E}_{q_\theta}\left[w_\theta^2(X)\, \varphi^2(X)\right].
\]

We will now demonstrate this with an example.

Example 4.9 (Minimum variance IS). We are given an exponential distribution
\[
p_\lambda(x) = \lambda \exp(-\lambda x), \qquad x \geq 0,
\]
and want to compute P(X > K). For example, for λ = 2 and K = 6 we can analytically compute
P(X > 6) = 6.144 × 10⁻⁶. Therefore, the standard MC estimator would be very inefficient
for computing this probability. In order to mitigate the problem, we would like to use another
exponential proposal qµ (x) which may have higher probability concentration around 6. We
would like to design our proposal using the minimum variance criterion (see Remark 4.5). Find
the optimal µ? that would make the estimator variance minimal.

Solution. Accordingly, we would like to find µ such that
\[
\mu_\star = \operatorname*{argmin}_{\mu}\; \mathbb{E}_{q_\mu}\left[w^2(X)\, \varphi^2(X)\right].
\]
In this case, note that we have ϕ(x) = 1{x>K}(x). We write
\[
\mathbb{E}_{q_\mu}\left[w^2(X)\, \varphi^2(X)\right]
= \int \frac{p_\lambda(x)^2}{q_\mu(x)^2}\, \varphi^2(x)\, q_\mu(x)\, \mathrm{d}x
= \int_K^{\infty} \frac{p_\lambda(x)}{q_\mu(x)}\, p_\lambda(x)\, \mathrm{d}x
= \int_K^{\infty} \frac{\lambda^2 e^{-2\lambda x}}{\mu e^{-\mu x}}\, \mathrm{d}x
= \frac{\lambda^2}{\mu} \int_K^{\infty} e^{-(2\lambda - \mu)x}\, \mathrm{d}x.
\]
Note at this stage that, in order for this integral to be finite, we need 2λ − µ > 0. Therefore,
we restrict µ ∈ (0, 2λ). In order to compute this integral, we can multiply and divide by (2λ − µ) and
obtain
\[
\mathbb{E}_{q_\mu}\left[w^2(X)\, \varphi^2(X)\right] = \frac{\lambda^2}{\mu(2\lambda - \mu)} \int_K^{\infty} (2\lambda - \mu)\, e^{-(2\lambda - \mu)x}\, \mathrm{d}x,
\]
and, using the CDF of the exponential distribution, we obtain
\[
g(\mu) = \mathbb{E}_{q_\mu}\left[w^2(X)\, \varphi^2(X)\right] = \frac{\lambda^2}{\mu(2\lambda - \mu)}\left[1 - \left(1 - e^{-(2\lambda - \mu)K}\right)\right] = \frac{\lambda^2}{\mu(2\lambda - \mu)}\, e^{-(2\lambda - \mu)K}. \quad (4.19)
\]
Now we optimise g(µ) w.r.t. µ. As usual, we first take the log (and drop the terms unrelated to
µ as they will not matter in the optimisation):
\[
\log g(\mu) = c - \log \mu - \log(2\lambda - \mu) + \mu K.
\]
Computing
\[
\frac{\mathrm{d}}{\mathrm{d}\mu} \log g(\mu) = -\frac{1}{\mu} + \frac{1}{2\lambda - \mu} + K,
\]
and setting this to zero, we obtain
\[
-\frac{1}{\mu} + \frac{1}{2\lambda - \mu} + K = 0
\;\Rightarrow\; -(2\lambda - \mu) + \mu + K\mu(2\lambda - \mu) = 0
\;\Rightarrow\; K\mu^2 - 2K\lambda\mu + 2\lambda - 2\mu = 0
\;\Rightarrow\; K\mu^2 - 2(K\lambda + 1)\mu + 2\lambda = 0.
\]

This is a quadratic equation, therefore we will have two solutions:
\[
\mu = \frac{2(K\lambda + 1) \pm \sqrt{(2K\lambda + 2)^2 - 8K\lambda}}{2K}
= \frac{2(K\lambda + 1) \pm \sqrt{4K^2\lambda^2 + 4}}{2K}.
\]
If we inspect this solution, choosing the root with the plus sign gives µ > 2λ, which violates the
condition we imposed for the integral to be finite. Therefore, we arrive at
\[
\mu_\star = \frac{2(K\lambda + 1) - \sqrt{4K^2\lambda^2 + 4}}{2K}.
\]
After this tedious computation, we can now verify the reduction in variance and the estimation
quality. Let us now set K = 6 and λ = 2. See Fig. 4.10 for a plot of pλ, K = 6 and q_{µ⋆} (the optimal
exponential proposal). We can see that the proposal puts much higher mass to the right of K.
A standard run with N = 10⁵ samples gives zero samples in the region X > 6, therefore
the standard MC estimate is zero! Compared to ϕ̂N = 0, using q_{µ⋆} as a proposal we obtain
ϕ̂N_IS = 6.08 × 10⁻⁶, which is a much better estimate.
Let us compare the theoretical variances of the two estimators. The standard variance of ϕ̂N is
\[
\mathrm{var}_p(\hat{\varphi}^N) = \frac{1}{N}\, \mathrm{var}_p(\varphi(X)),
\]
where
\[
\mathrm{var}_p(\varphi(X)) = \int \varphi(x)^2\, p_\lambda(x)\, \mathrm{d}x - \left(\int \varphi(x)\, p_\lambda(x)\, \mathrm{d}x\right)^2
= \int_K^{\infty} p_\lambda(x)\, \mathrm{d}x - \left(\int_K^{\infty} p_\lambda(x)\, \mathrm{d}x\right)^2.
\]
Using CDFs, we can compute this quantity and hence obtain the variance of ϕ̂N.
Now set µ = µ⋆. The variance of ϕ̂N_IS is given by (see Prop. 4.4)
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\left(\mathbb{E}_q[w^2(X)\, \varphi^2(X)] - \bar{\varphi}^2\right).
\]
We have already computed the term Eq[w²(X)ϕ²(X)] in Eq. (4.19). The second term is the true
integral, which we also summarised how to compute above, i.e., ϕ̄ = ∫_K^∞ pλ(x)dx, which can be
computed using the exponential CDF. In this particular case, we compute
\[
\mathrm{var}_q[\hat{\varphi}^N_{\mathrm{IS}}] = \frac{1}{N}\left(g(\mu_\star) - \bar{\varphi}^2\right).
\]
The theoretical variance of the naive MC estimator is 6.14 × 10⁻⁷ for N = 10 samples,
whereas the IS estimator variance is 6.04 × 10⁻¹¹ for the same number of samples. This is a huge
improvement in the variance of the estimation.
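This example can also be checked numerically; the sketch below (illustrative code, assuming NumPy; the seed and sample size are arbitrary) reproduces the formula for µ⋆ above and compares the naive MC and IS estimates of P(X > 6).

```python
import numpy as np

rng = np.random.default_rng(4)
lam, K, N = 2.0, 6.0, 100_000

# Optimal exponential proposal rate from the quadratic solution above
mu_star = (2 * (K * lam + 1) - np.sqrt(4 * K**2 * lam**2 + 4)) / (2 * K)

# Naive MC: sample from p_lambda directly
X = rng.exponential(scale=1.0 / lam, size=N)
naive = np.mean(X > K)                                  # almost always 0

# IS with proposal q_mu(x) = mu * exp(-mu * x)
Y = rng.exponential(scale=1.0 / mu_star, size=N)
w = (lam * np.exp(-lam * Y)) / (mu_star * np.exp(-mu_star * Y))
is_est = np.mean(w * (Y > K))
print(mu_star, naive, is_est)                           # is_est should be close to 6.144e-6
```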

4.6 implementation, algorithms, diagnostics
When implementing IS or SNIS, there are several numerical considerations that need to be
taken into account. Especially for SNIS, where the weight normalisation takes place, several
numerical problems can arise for complex distributions and prevent a successful implementation.

4.6.1 computing weights


In the case of SNIS, we have stated that the weights are computed as
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)},
\]
where W(x) = p̄(x)/q(x). However, this formula can be numerically ill-behaved for complex distri-
butions. To mitigate this, the weighting step is implemented as follows. We first compute the log
unnormalised weights:
\[
\log W_i = \log \bar{p}(X_i) - \log q(X_i).
\]
However, directly exponentiating and normalising these can still lead to numerical problems.
Since we will normalise the weights anyway (multiplying all of them by a constant does not change
the result), we can use this to our advantage. A common numerical trick is to subtract the
maximum log weight from all log weights:
\[
\log \widetilde{W}_i = \log \bar{p}(X_i) - \log q(X_i) - \max_{j = 1, \ldots, N} \log W_j.
\]

This ensures that the maximum log weight is 0 and all other log weights are negative. We can now
exponentiate the weights and normalise them:
\[
\bar{w}_i = \frac{\exp(\log \widetilde{W}_i)}{\sum_{j=1}^{N} \exp(\log \widetilde{W}_j)}.
\]
Note that this does not change the result; it is done purely for numerical stability.
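A minimal sketch of this stabilised normalisation (an illustrative helper, assuming NumPy) is given below.

```python
import numpy as np

def normalise_log_weights(log_W):
    """Turn log unnormalised weights into normalised weights in a numerically stable way."""
    log_W = np.asarray(log_W, dtype=float)
    log_W_shifted = log_W - log_W.max()      # subtract the maximum log weight
    W = np.exp(log_W_shifted)                # safe: the largest entry is exp(0) = 1
    return W / W.sum()

# Example: very negative log weights that would underflow if exponentiated directly
print(normalise_log_weights([-1000.0, -1001.0, -1002.0]))
# -> approximately [0.665, 0.245, 0.090]
```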

4.6.2 sampling importance resampling


We can also use the SNIS estimator as a sampler (Robert and Casella, 2004). Recall that the SNIS
procedure provides us with an estimator of the distribution p(x) as
\[
p(x)\, \mathrm{d}x \approx \tilde{p}^N(x)\, \mathrm{d}x = \sum_{i=1}^{N} \bar{w}_i\, \delta_{X_i}(x)\, \mathrm{d}x.
\]
This p̃N can be seen as a weighted distribution. By drawing samples from this distribution, we
may also approximately sample from p(x) (recall that the IS-based ideas so far were introduced only for
integration). We can then draw⁵
\[
k \sim \text{Discrete}(\bar{w}_1, \ldots, \bar{w}_N),
\]
⁵Note that here the weights w̄i are normalised. Even in the basic IS case, we need to normalise the weights (just for
resampling) as they do not naturally sum up to one.

Figure 4.11: Top left shows the target density, the proposal, and the weight function. Top right shows the samples
with their respective weights. Bottom left shows that these samples are indeed approximately distributed w.r.t. q(x)
(just with attached weights). Bottom right shows that we can resample these existing samples to obtain a new set of
samples X̄i that are distributed (approximately) according to p(x).

and set new samples

X̄i = Xk .

This amounts to resampling the existing samples w.r.t. their weights. A demonstration of this
idea can be seen from Figure 4.11.
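The resampling step itself takes a couple of lines; an illustrative sketch (assuming NumPy, and reusing samples X and normalised weights w_bar from an SNIS run) is given below.

```python
import numpy as np

def resample(X, w_bar, rng):
    """Sampling importance resampling: draw indices according to the normalised weights."""
    idx = rng.choice(len(X), size=len(X), replace=True, p=w_bar)
    return X[idx]     # approximately distributed according to the target p(x)

# Illustrative usage, given arrays X and w_bar produced by an SNIS run:
# X_bar = resample(X, w_bar, np.random.default_rng(5))
```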

4.6.3 diagnostics for importance sampling


It is important to have a good intuition and diagnostic tools to understand the performance of
the IS and SNIS estimators. We first start with the effective sample size (ESS).

Definition 4.1 (Effective Sample Size). To measure the sample efficiency, one measure that is used
in the literature is the effective sample size (ESS), which, for the SNIS estimator, is given by
\[
\mathrm{ESS}_N = \frac{1}{\sum_{i=1}^{N} \bar{w}_i^2}.
\]

In order to see the meaning of the ESS, consider the case where w̄i = 1/N, i.e., we have an
equally weighted sample. This means all samples are equally considered and in this case we have
ESS_N = N. On the other hand, if we have a sample Xi with w̄i = 1 and, hence, w̄j = 0 for
every j ≠ i, we obtain ESS_N = 1. This means we effectively have only one sample contributing to
the estimator. The ESS is used to assess importance samplers and importance sampling-based
estimators in the literature (Elvira et al., 2018). Note that ESS_N takes values between 1 and
N, i.e., 1 ≤ ESS_N ≤ N.
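In code, the ESS is a one-liner; the sketch below (an illustrative helper, assuming NumPy) computes it from the normalised weights.

```python
import numpy as np

def effective_sample_size(w_bar):
    """ESS_N = 1 / sum_i w_bar_i^2 for normalised weights w_bar."""
    w_bar = np.asarray(w_bar, dtype=float)
    return 1.0 / np.sum(w_bar**2)

print(effective_sample_size(np.full(100, 1 / 100)))       # equally weighted: ESS = 100
print(effective_sample_size(np.eye(1, 100).ravel()))      # one weight equal to 1: ESS = 1
```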

4.6.4 mixture importance sampling


Sometimes the target density p(x) can be multimodal, therefore it is beneficial to use mixture
densities as proposals (Owen, 2013). We have seen in previous chapters how to sample from a
mixture. Let us define a proposal
\[
q_\alpha(x) = \sum_{k=1}^{K} \alpha_k\, q_k(x),
\]
where αk ≥ 0 and ∑_{k=1}^K αk = 1. In this version of the method, we simply sample from the mixture
proposal Xi ∼ qα(x) and then, given an unnormalised target p̄, compute the importance weights
as
\[
\bar{w}_i = \frac{W(X_i)}{\sum_{j=1}^{N} W(X_j)},
\qquad \text{where} \qquad
W(X_i) = \frac{\bar{p}(X_i)}{\sum_{k=1}^{K} \alpha_k\, q_k(X_i)}.
\]

The computational concerns may arise in this situation too, as the denominator as a sum of
densities and its log can be tricky to compute. In these cases, we can use the log-sum-exp trick to
compute the log of the denominator.
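For instance, the log of the mixture denominator can be computed stably as sketched below (illustrative code; scipy.special.logsumexp could equally be used for the inner computation).

```python
import numpy as np

def log_mixture_density(x, log_alphas, log_q_components):
    """Stable log q_alpha(x) = log sum_k alpha_k q_k(x) via the log-sum-exp trick.

    log_q_components: list of functions returning log q_k(x) for an array x.
    """
    # Stack log(alpha_k) + log q_k(x) into an array of shape (K, len(x))
    terms = np.stack([la + log_qk(x) for la, log_qk in zip(log_alphas, log_q_components)])
    m = terms.max(axis=0)
    return m + np.log(np.sum(np.exp(terms - m), axis=0))
```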

5 MARKOV CHAIN MONTE CARLO
In this chapter, we introduce Markov chains and then Markov Chain Monte Carlo (MCMC) methods.
These methods are at the heart of Bayesian statistics, generative modelling, statistical physics, and
many other fields. We will introduce the Metropolis-Hastings algorithm, then the
celebrated Gibbs sampler and, if time permits, some others.

In this chapter, we introduce a new sampling methodology - namely using Markov chains
for sampling problems. This is a very powerful and widely used idea in statistics and machine
learning. The idea is to set up Markov chains with prescribed stationary distributions. These
distributions will be our target distributions.
In this chapter, we will adapt our notation and modify it to suit the new setting. From now
on, we denote stationary/invariant distributions of Markov chains (which also coincide with
our target distributions) as p⋆. We will introduce discrete space Markov chains next.

5.1 discrete state space markov chains


A good setting for an introduction to Markov chains is the discrete space setting. In this setting,
we have a finite state space X, i.e., the cardinality of X is finite. We first define the Markov
chain in this context.

Definition 5.1 (Markov chain). A discrete Markov chain is a sequence of random variables
X0 , X1 , . . . , Xn such that

P(Xn+1 = xn+1 | X0 = x0 , . . . , Xn = xn ) = P(Xn+1 = xn+1 | Xn = xn ).

In other words, a Markov chain is a sequence of random variables such that a given state at
time n is conditionally independent of all previous states given the state at time n − 1. One can
see that this describes many systems in the real world – as evolution of many systems can be
summarised with the current state of the system and the evolution law.
An important quantity in the study of Markov chains is the transition matrix (or kernel in the
continuous space case). This matrix defines the evolution structure of the chain and determines
all of its properties. The transition matrix is defined as follows.

Definition 5.2 (Transition matrix). The transition matrix of a Markov chain is a matrix M such
that

Mij = P(Xn+1 = j | Xn = i).

A usual way to depict Markov chains is the following conditional independence structure
which sums the structure of the Markov chain up. We note that we will only consider the case

X0 X1 X2 ··· Xn

Figure 5.1: A Markov chain with n states.

where the transition matrix is time-homogeneous, i.e., the transition matrix is the same for all
times. We can then see that a Markov chain’s behaviour is completely determined by its initial
distribution and the transition matrix. We will denote the initial distribution of the chain as p0
and note that this is a discrete distribution over the state space X (in this case)¹. The transition
matrix M is a matrix of size d × d where d = |X|:

\[
M = \begin{pmatrix}
M_{11} & M_{12} & \cdots & M_{1d} \\
M_{21} & M_{22} & \cdots & M_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
M_{d1} & M_{d2} & \cdots & M_{dd}
\end{pmatrix}.
\]

We note that this matrix is stochastic, i.e. each row sums to 1:
\[
\sum_{j=1}^{d} M_{ij} = 1,
\]
since Mij = P(Xn+1 = j|Xn = i) and
\[
\sum_{j=1}^{d} P(X_{n+1} = j \mid X_n = i) = 1.
\]

Next we will consider an example.


¹When we move to continuous spaces, we will use the same notation for densities.

Example 5.1 (Discrete space Markov chain). Consider the state transition diagram

[State transition diagram of a three-state Markov chain; the edge probabilities are those given in the transition matrix M below.]

of a Markov chain. Write out the transition matrix of this chain and describe the simulation
procedure.

Solution. We can write the transition matrix as
\[
M = \begin{pmatrix}
0.6 & 0.2 & 0.2 \\
0.3 & 0.5 & 0.2 \\
0 & 0.3 & 0.7
\end{pmatrix}, \qquad \text{where } \mathsf{X} = \{1, 2, 3\}.
\]
The state transition diagram of this matrix is the one given above.


This Markov chain can be simulated using the following idea. Given the above diagram, we
can denote its transition matrix as a table:
M Xt = 1 Xt = 2 Xt = 3
Xt−1 = 1 0.6 0.2 0.2
Xt−1 = 2 0.3 0.5 0.2
Xt−1 = 3 0 0.3 0.7
Given X0 = 1, how to simulate this chain? This boils down to just selecting the correct row from
this matrix and then sampling using the discrete distribution given by that row. For example, if
we sample from the first row, we get X1 = 1 with probability 0.6, X1 = 2 with probability 0.2
and X1 = 3 with probability 0.2. We can then repeat this process for X2 and so on. This is a
simple way to simulate a Markov chain. The precise sampler is given below.

\[
X_t \mid X_{t-1} = x_{t-1} \sim \text{Discrete}(M_{x_{t-1}, \cdot}),
\]
where the notation M_{x_{t-1}, ·} denotes the x_{t-1}th row of the transition matrix M (where x_{t-1} ∈
{1, 2, 3} naturally).
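This simulation idea takes only a few lines; the sketch below (illustrative code, assuming NumPy, with states indexed 0, 1, 2 rather than 1, 2, 3) simulates the chain of this example.

```python
import numpy as np

M = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.0, 0.3, 0.7]])

def simulate_chain(M, x0, n_steps, rng):
    """Simulate a discrete Markov chain: at each step, sample from the row of the current state."""
    states = np.arange(M.shape[0])
    path = [x0]
    for _ in range(n_steps):
        path.append(rng.choice(states, p=M[path[-1]]))
    return np.array(path)

path = simulate_chain(M, x0=0, n_steps=1000, rng=np.random.default_rng(6))
# For long runs, the empirical state frequencies approach the invariant distribution of M.
print(np.bincount(path, minlength=3) / len(path))
```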

Markov chains can be seen everywhere in the real world, hence we will skip their applications.
But one exciting example can be provided from the field of language modelling. Let us describe a
character-level Markov model of English.

Example 5.2 (Character-level Markov model of English). Consider the task of modelling
an English language text as a Markov model. Describe a possible way of doing this on a
character-level.

Solution. We can model the English language as a Markov chain on a character-level. In this
case, we can define the state space as X = {a, b, . . . , z, .} (where for simplicity the character
dot models the space character and no other special characters are considered). We can then
define the transition matrix as follows. Let xt−1 be the previous character and xt be the current
character. Of course, writing out the transition matrix by hand is an impossible task, hence we
can learn it from data. For this, we simply choose a large corpus of English text and count the
number of times a character is followed by another character. We have provided an example code
for this in our online companion. By estimating the transition matrix from data, we can then
simulate English text by sampling from this Markov chain. For example, one simulated text reads
as
g.l.se.t.bin.s.lese.ry..wolked.t.hered.e.ly.hr.impefatrt.mofe.mouroreand
and goes on (see the lecture for a longer simulation). We can see that while this captures very
vague structure, it is not a good model of English. However, this character-level model can still
be useful for some applications, because we can use it to estimate the probability of a
given text. For example, we can compute

P(the quick brown fox jumps over the lazy dog)

by breaking down this text into characters and ‘reading’ the transition matrix. This is a very
simple example of a Markov chain model of English.
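A sketch of how such a transition matrix could be estimated from a corpus is given below (illustrative code, assuming NumPy; the corpus string is a stand-in for a real text file, and the add-one smoothing is our own simplification).

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz."          # '.' stands in for the space character
idx = {c: i for i, c in enumerate(alphabet)}

def estimate_transition_matrix(corpus):
    """Count character bigrams and normalise each row to obtain the transition matrix."""
    counts = np.ones((len(alphabet), len(alphabet)))   # add-one smoothing avoids zero rows
    text = [c if c in idx else "." for c in corpus.lower()]
    for a, b in zip(text[:-1], text[1:]):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

corpus = "the quick brown fox jumps over the lazy dog"   # stand-in corpus
M = estimate_transition_matrix(corpus)
print(M.shape)   # (27, 27); each row is a distribution over the next character
```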

Let us look at a more intelligent way of using Markov chains to model language.

Example 5.3 (Word-level Markov model of English). Consider the task of modelling an English
language text as a Markov model. Describe a possible way of doing this on a word-level.

Solution. We can model the English language as a Markov chain on a word-level. As you can
see, there is no single state-space for this (as we can expand our word list (state-space) possibly
indefinitely). A good idea is to pick a book or a large text corpus and count all possible word
transitions. Note that modern language models do not use words, but rather sub-word units,
called tokens. Here we will stick to words and we will pick the book ‘A Study in Scarlet’ by Sir
Arthur Conan Doyle. We have provided an example code for this in our online companion. By
estimating the transition matrix from data, we can then simulate English text by sampling from
this Markov chain. For example, one simulated text reads as
Sherlock Holmes, “because you will place where you remember, in his
cudgel, only one in summer are grey with a vague shadowy terrors which
hung over my hand, and eyeing the strange contrast to the name a walking
in time before twelve, you are.”

While this is much more intelligible and seems to follow some English rules, it is nothing but
a simulation from a Markov chain given a transition matrix as estimated in our online companion.
Modern language models, however, do not use Markov models either – their structure is much
more complex and they are based on neural networks.

We can also compute the n-step transition matrix
\[
M^{(n)}_{ij} = P(X_n = j \mid X_0 = i),
\]
where M(n) is a matrix of size d × d. To see this, note that the n-step transition matrix can be written as:
\[
M^{(n)}_{ij} = P(X_n = j \mid X_0 = i)
= \sum_k P(X_n = j, X_1 = k \mid X_0 = i)
= \sum_k P(X_n = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i)
\]
\[
= \sum_k P(X_n = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i)
= \sum_k M_{ik}\, M^{(n-1)}_{kj}.
\]
Therefore, M(n) = M^n, the nth power of the transition matrix. Note that we can compute,
in general, the conditional distributions of the Markov chain by summing out the variables in the
middle. For example, in order to compute P(Xn+2 = xn+2 |Xn = xn ), we can write
\[
P(X_{n+2} = x_{n+2} \mid X_n = x_n) = \sum_{x_{n+1}} P(X_{n+2} = x_{n+2} \mid X_{n+1} = x_{n+1})\, P(X_{n+1} = x_{n+1} \mid X_n = x_n).
\]

This leads us to the Chapman-Kolmogorov equation, which is a generalisation of the
n-step transition matrix:
\[
M^{(m+n)}_{ij} = P(X_{m+n} = j \mid X_0 = i)
= \sum_k P(X_{m+n} = j \mid X_n = k)\, P(X_n = k \mid X_0 = i)
= \sum_k M^{(n)}_{ik}\, M^{(m)}_{kj}.
\]
Therefore, we can write M^{m+n} = M^m M^n.
It is also important to describe the evolution of the chain. Note that we defined our initial
distribution as p0, and it is important to quantify how this distribution evolves over time. We
denote the distribution at time n as pn (treated as a row vector). The distribution of the chain at time n is given by:
\[
p_n(i) = P(X_n = i)
= \sum_k P(X_n = i, X_{n-1} = k)
= \sum_k P(X_n = i \mid X_{n-1} = k)\, P(X_{n-1} = k)
= \sum_k M_{ki}\, p_{n-1}(k).
\]
This implies that
\[
p_n = p_{n-1} M .
\]
Therefore,
\[
p_n = p_0 M^n .
\]

These are important equations, which will have corresponding equations in the continuous case
(however, they will be integrals).
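These relations are easy to verify numerically; the sketch below (illustrative code, assuming NumPy) checks pn = p0 M^n for the chain of Example 5.1.

```python
import numpy as np

M = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.0, 0.3, 0.7]])
p0 = np.array([1.0, 0.0, 0.0])      # start deterministically in the first state

# Evolve step by step: p_n = p_{n-1} M
p = p0.copy()
for _ in range(50):
    p = p @ M

# Compare with the closed form p_n = p_0 M^n
print(p, p0 @ np.linalg.matrix_power(M, 50))
# For large n both approach the invariant distribution satisfying p_star = p_star M.
```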
Since we have expressed our interest in Markov chains because of their potential utility in
sampling, we will now discuss the properties needed to use Markov chains
for sampling. In short, we need Markov chains that (i) have an invariant distribution, (ii) are
guaranteed to converge to that invariant distribution, and (iii) have a unique invariant distribution. We
will now discuss the properties that ensure these in detail.

5.1.1 irreducibility
The first property we need to ensure is that the Markov chain is irreducible. This means that
there is a path from any state to any other state. To be precise, let x, x′ ∈ X be any two states.
We write x ⇝ x′ if there is a path from x to x′:
\[
\exists\, n > 0 \text{ s.t. } P(X_n = x' \mid X_0 = x) > 0.
\]
If x ⇝ x′ and x′ ⇝ x, then we say that x and x′ communicate. We then define a communication
class C ⊂ X as a set of states such that x ∈ C and x′ ∈ C if and only if x ⇝ x′ and x′ ⇝ x.
A chain is irreducible if X is a single communication class. This simply means that there is a
positive probability of moving from any state to every other state. This makes sense, as without ensuring
this we would not be sampling from the full support.

5.1.2 recurrence and transience


We now define the notions of recurrence and transience. A state i ∈ X is recurrent if the chain
started at i returns to i with probability one. In order to define this, consider the return time
\[
\tau_i = \inf\{n \geq 1 : X_n = i\}.
\]
We say that the state i is recurrent if
\[
P(\tau_i < \infty \mid X_0 = i) = 1.
\]
In other words, the probability that the return time is finite is 1. If a state is not recurrent, it is
said to be transient. We can further define positive recurrence, which is a slightly stronger
(better) condition. We say that i is positively recurrent if
\[
\mathbb{E}[\tau_i \mid X_0 = i] < \infty.
\]
This means that the expected return time is finite. If a chain is recurrent but not positively
recurrent, then it is called null recurrent.

5.1.3 invariant distributions
In the discrete case, a distribution p⋆ is called invariant if
\[
p_\star = p_\star M .
\]
This means that the chain has reached stationarity, i.e., evolving further (via M) does not change the
distribution. We then have the following theorem (Yıldırım, 2017).

Theorem 5.1. If M is irreducible, then M has a unique invariant distribution if and only if it is
positive recurrent.

This is encouraging; however, for actual convergence of the chain to this distribution we will
need more conditions.

5.1.4 reversibility and detailed balance


We define the detailed balance condition as

p? (i)Mij = p? (j)Mji .

This trivially implies that p? = p? M , hence the invariance of p? . We will have a more detailed
discussion of this condition in the continuous state space case.

5.1.5 convergence to invariant distribution


Finally, we need the ergodicity condition to ensure that the chain converges to the invariant
distribution. For this, we require the chain to be aperiodic, which is defined as follows. A state i
is called aperiodic if

{n > 0 : P(Xn+1 = i|X1 = i) > 0}

has no common divisor other than 1. A Markov chain is called aperiodic if all states are aperiodic.
An irreducible Markov chain is called ergodic if it is positive recurrent and aperiodic. If (Xn )n∈N
is an ergodic Markov chain with initial distribution p0 and invariant distribution p⋆, then
\[
\lim_{n \to \infty} p_n(i) = p_\star(i).
\]
Moreover, for i, j ∈ X,
\[
\lim_{n \to \infty} P(X_n = i \mid X_1 = j) = p_\star(i).
\]
In other words, the chain will converge to its invariant distribution from every state.

5.2 continuous state space markov chains
Our main interest is in the continuous case. However, it is important to understand the definitions
above – as we will not go into analogous definitions in the continuous case. The reason for this
is that, in continuous cases, the individual states have zero probability (i.e. a point has zero
probability) and all the notions above are defined using sets and measure theoretic concepts. We
focus on simulation methods within this course, therefore, we will not go into reviewing this
material. A couple of very good books for this are Douc et al. (2018) and Douc et al. (2013).
Let X be an uncountable set from now on, e.g., X = R or X = Rd . We denote the initial
density as p0 (x) as usual, the transition kernel with K(x|x0 ), the marginal density of the chain at
time n as pn (x).
We can write the Markov property in this case as follows. For any measurable A

P(Xn ∈ A|X1 = x1 , . . . , Xn−1 = xn−1 ) = P(Xn ∈ A|Xn−1 = xn−1 ).

This implies that, if we write down the joint distribution of X0:n, then the following factorisation
holds:
\[
p(x_0, \ldots, x_n) = \prod_{k=0}^{n} p(x_k \mid x_{k-1}),
\]
where p(x0 |x−1 ) := p0 (x0 ). We also assume that the transition kernel has a density, which we
denote as K(xn |xn−1 ) at time n. Similarly to the discrete case, we will assume that the kernel is
time-homogeneous (i.e. the same for every n). Note that the transition density is a density in its first
variable, i.e.,
\[
\int_{\mathsf{X}} K(x_n \mid x_{n-1})\, \mathrm{d}x_n = 1.
\]

It is a function of xn−1 otherwise. We give an example of a continuous state-space Markov chain


in what follows.

Example 5.4 (Simulation of a Markov process). Consider the following Markov chain with
X0 = 0

Xn |Xn−1 = xn−1 ∼ N (xn ; axn−1 , 1), (5.1)

with 0 < a < 1. Describe the simulation procedure for this chain in terms of a recursion.

Solution. We can simulate this chain by
\[
X_1 \sim \mathcal{N}(0, 1), \quad
X_2 \sim \mathcal{N}(aX_1, 1), \quad
X_3 \sim \mathcal{N}(aX_2, 1), \quad \ldots, \quad
X_n \sim \mathcal{N}(aX_{n-1}, 1).
\]

How do we do this? We also note that Eq. (5.1) can be expressed as
\[
X_n = a X_{n-1} + \epsilon_n,
\]
where εn ∼ N (0, 1). This is also called an AR(1) process. From the last equation, it should be clear
how to simulate this, as you only need a for loop and samples from N (0, 1).
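A short sketch of this recursion (illustrative code, assuming NumPy) is given below; it also checks that, for 0 < a < 1, the marginal variance approaches 1/(1 − a²).

```python
import numpy as np

def simulate_ar1(a, n_steps, rng, x0=0.0):
    """Simulate X_n = a * X_{n-1} + eps_n with eps_n ~ N(0, 1)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(1, n_steps + 1):
        x[n] = a * x[n - 1] + rng.standard_normal()
    return x

path = simulate_ar1(a=0.9, n_steps=10_000, rng=np.random.default_rng(7))
# The empirical variance (after discarding the initial transient) should be close to 1 / (1 - a^2).
print(path[1000:].var(), 1 / (1 - 0.9**2))
```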

Similarly to the discrete case, we can define the distribution of Xn given a past variable
Xn−k by integrating out the variables in between. It is important to note that Xn is independent
of past variables if (and it is a big if) Xn−1 = xn−1 is given. Otherwise, we can write down the
densities as
\[
p(x_n \mid x_{n-k}) = \int \cdots \int K(x_n \mid x_{n-1})\, K(x_{n-1} \mid x_{n-2}) \cdots K(x_{n-k+1} \mid x_{n-k})\, \mathrm{d}x_{n-1} \cdots \mathrm{d}x_{n-k+1}.
\]
We define the m-step transition kernel as
\[
K^{(m)}(x_{m+n} \mid x_n) = \int_{\mathsf{X}} \cdots \int_{\mathsf{X}} K(x_{m+n} \mid x_{m+n-1}) \cdots K(x_{n+1} \mid x_n)\, \mathrm{d}x_{m+n-1} \cdots \mathrm{d}x_{n+1}.
\]
X

We now provide the definition of invariance in this context, w.r.t. to the transition kernel.

Definition 5.3 (K-invariance). A probability measure p? is called K-invariant if


Z
p? (x) = K(x|x0 ) p? (x0 )dx0 . (5.2)
X

It can be seen that p? being invariant means that the kernel operating on p? results in the
same distribution p? (the integral agains the kernel can be seen as a transformation, similar to
the matrix product in the discrete case). Finally, we get to the detailed balance condition.

Definition 5.4 (Detailed balance). A transition kernel K is said to satisfy detailed balance if

K(x0 |x)p? (x) = K(x|x0 )p? (x0 ). (5.3)

We note that this is a sufficient condition for stationarity of p? .

Proposition 5.1 (Detailed balance implies stationarity). If K satisfies detailed balance, then p? is
the invariant distribution.

Proof. The proof is a one-liner:
\[
\int p_\star(x)\, K(x' \mid x)\, \mathrm{d}x' = \int p_\star(x')\, K(x \mid x')\, \mathrm{d}x',
\]
which is just integrating both sides of the detailed balance condition over x′. The lhs of this
equation is p⋆(x), since K(x′|x) integrates to 1, which leaves us with the definition of K-invariance
as given in (5.2). □
Let us see an example of a continuous space Markov chain (or rather go back to AR(1)
example).

Example 5.5. Consider again the Markov chain with the following transition kernel

K(xn |xn−1 ) = N (xn ; axn−1 , 1).

We can also describe the evolution of this chain as a recursion, as mentioned before:
\[
X_n = a X_{n-1} + \epsilon_n,
\]
where εn ∼ N (0, 1). Show that
\[
p_\star(x) = \mathcal{N}\!\left(x;\, 0,\, \frac{1}{1 - a^2}\right)
\]
by checking the detailed balance condition (we will prove this directly later). Prove that the
m-step transition kernel is given by
\[
K^{(m)}(x_{m+n} \mid x_n) = \mathcal{N}\!\left(x_{m+n};\; a^m x_n,\; \frac{1 - a^{2m}}{1 - a^2}\right).
\]
Solution. The proofs of these results will be asked as exercises (as usual, solutions will be posted).
We can note that the last result trivially implies that
\[
p_\star(x) = \lim_{m \to \infty} K^{(m)}(x \mid x_0)
\]
for any x0. In other words, starting from any x0, the chain will reach stationarity.

We have now almost everything we need to move on to discuss Metropolis-Hastings method.

5.3 metropolis-hastings algorithm


We finally have all the ingredients to define the celebrated Metropolis-Hastings algorithm. We
will not need more technicalities in defining it.
The Metropolis-Hastings (MH) algorithm is a remarkable method which allows us to define
transition kernels (defined implicitly via the algorithm) where the detailed balance is satisfied
w.r.t. any p? we wish to sample from. I call this remarkable because it rids us of the need of

designing Markov kernels for specific probability distributions and provides a generic way to
design samplers that will target any measure we want. The algorithm relies on the idea of using
local proposals q(x0 |x) and accepting them with a certain acceptance ratio. The acceptance ratio
is designed so that the resulting samples X1 , . . . , Xn from the method form a Markov chain
that leaves p? invariant. We will provide the algorithm below, as seen from Algorithm 9. Note,

Algorithm 9 Pseudocode for the Metropolis-Hastings method

1: Input: The number of samples N and a starting point X0.
2: for n = 1, . . . , N do
3:    Propose (sample): X′ ∼ q(x′|Xn−1).
4:    Accept the sample, setting Xn = X′, with probability
\[
\alpha(X_{n-1}, X') = \min\left\{1, \frac{p_\star(X')\, q(X_{n-1} \mid X')}{p_\star(X_{n-1})\, q(X' \mid X_{n-1})}\right\}.
\]
5:    Otherwise reject the sample and set Xn = Xn−1.
6: end for
7: Discard the first burn-in samples and return the remaining samples.

as mentioned in the lecture, the rejection step of the method: when a sample is rejected, we do not
sample again – we set Xn = Xn−1 and move on to the next iteration. This means that, if the
rejection rate is high, there will be a lot of duplicated samples; this is the expected behaviour.
Another important note is about the burnin period. Any Markov chain started at a random point
will take some time to reach stationarity (the whole magic is to be able to make them converge
faster). Therefore, we discard the first burnin samples and only return the remaining ones. This
is a common practice in MCMC methods.
We define the acceptance ratio as
\[
r(x, x') = \frac{p_\star(x')\, q(x \mid x')}{p_\star(x)\, q(x' \mid x)}. \quad (5.4)
\]
We also note that in the practical algorithm, one does not need to implement the min operation.
For accepting with a certain probability (like in the rejection sampling), we draw U ∼ Unif(0, 1)
and check if U ≤ α(Xn−1 , X 0 ). However, if the ratio r(Xn−1 , X 0 ) is greater than 1, this sample
is always going to be accepted anyway. The min operation is important however for theoretical
properties of the kernel to hold.
As we mentioned above, the algorithm provides us with an implicit kernel K(xn |xn−1 ) – if
you think about it, it is just a way to get Xn given Xn−1 . The specific structure of the algorithm,
however, ensures that we leave the right kind of distribution invariant – i.e. p? – that is our target
measure. We elucidate this in the following proposition.

Proposition 5.2 (Metropolis-Hastings satisfies detailed balance). The Metropolis-Hastings algorithm
satisfies detailed balance w.r.t. p⋆, i.e.,
\[
p_\star(x)\, K(x' \mid x) = p_\star(x')\, K(x \mid x'),
\]
where K is the kernel defined by the MH algorithm.

Proof. We first define the kernel induced by the MH algorithm. This can be seen by inspecting
the algorithm:
\[
K(x' \mid x) = \alpha(x, x')\, q(x' \mid x) + (1 - a(x))\, \delta_x(x'),
\]
where δx is the Dirac delta and
\[
a(x) = \int_{\mathsf{X}} \alpha(x, x')\, q(x' \mid x)\, \mathrm{d}x'
\]
is the probability of accepting a sample (hence 1 − a(x) is the probability of rejecting a new
sample while at point x). See Sec. 2.3.1 of Douc et al. (2018) for a rigorous derivation. Given this,
we write
\[
p_\star(x)\, K(x' \mid x) = p_\star(x)\, q(x' \mid x)\, \alpha(x, x') + p_\star(x)\, (1 - a(x))\, \delta_x(x')
\]
\[
= p_\star(x)\, q(x' \mid x)\, \min\left\{1, \frac{p_\star(x')\, q(x \mid x')}{p_\star(x)\, q(x' \mid x)}\right\} + p_\star(x)\, (1 - a(x))\, \delta_x(x')
\]
\[
= \min\left\{p_\star(x)\, q(x' \mid x),\; p_\star(x')\, q(x \mid x')\right\} + p_\star(x)\, (1 - a(x))\, \delta_x(x')
\]
\[
= \min\left\{\frac{p_\star(x)\, q(x' \mid x)}{p_\star(x')\, q(x \mid x')},\, 1\right\} p_\star(x')\, q(x \mid x') + p_\star(x')\, (1 - a(x'))\, \delta_{x'}(x)
\]
\[
= K(x \mid x')\, p_\star(x'),
\]
which shows that detailed balance holds! □
One can see that the algorithm works just the same with unnormalised densities, i.e., recall
\[
p_\star(x) = \frac{\bar{p}_\star(x)}{Z},
\]
where Z is the normalisation constant. In this case, the acceptance ratio becomes
\[
r(x, x') = \frac{\bar{p}_\star(x')\, q(x \mid x')}{\bar{p}_\star(x)\, q(x' \mid x)},
\]
without any change, as the normalising constants cancel out in (5.4). We will next describe certain
classes of proposals to sample from various kinds of distributions and assess their performance.

5.3.1 independent proposals


An important class of proposals used in practice is the independent proposal. Note that in
general we denoted our proposal by q(x′|x); in particular, we would sample from q(x′|xn−1),
implying that the proposal uses the current state of the chain. This does not have to be
the case and we can just as well choose an independent proposal q(x′) to ease computations. The
acceptance ratio in this specific case becomes
\[
r(x, x') = \frac{\bar{p}_\star(x')\, q(x)}{\bar{p}_\star(x)\, q(x')}.
\]
In the algorithm, this means that we compute
\[
\alpha(X_{n-1}, X') = \min\left\{1, \frac{\bar{p}_\star(X')\, q(X_{n-1})}{\bar{p}_\star(X_{n-1})\, q(X')}\right\}.
\]
We will see one example as follows.

Example 5.6 (Independent Gaussian proposal). Consider a Gaussian (artificial) target:

p? (x) = N (x; µ, σ 2 )

Assume we want to use MH to sample from it. Choose a proposal

q(x) = N (x; µq , σq2 ).

Compute the acceptance ratio.

Solution. The acceptance ratio can be computed in this case as:
\[
r(x, x') = \frac{p_\star(x')\, q(x)}{p_\star(x)\, q(x')}
= \frac{\mathcal{N}(x'; \mu, \sigma^2)\, \mathcal{N}(x; \mu_q, \sigma_q^2)}{\mathcal{N}(x; \mu, \sigma^2)\, \mathcal{N}(x'; \mu_q, \sigma_q^2)}
\]
\[
= \frac{\exp\left(-\frac{(x' - \mu)^2}{2\sigma^2}\right) \exp\left(-\frac{(x - \mu_q)^2}{2\sigma_q^2}\right)}{\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \exp\left(-\frac{(x' - \mu_q)^2}{2\sigma_q^2}\right)}
\qquad \text{(the normalising constants cancel)}
\]
\[
= \exp\left(-\frac{1}{2\sigma^2}\left[(x' - \mu)^2 - (x - \mu)^2\right]\right) \exp\left(-\frac{1}{2\sigma_q^2}\left[(x - \mu_q)^2 - (x' - \mu_q)^2\right]\right).
\]

5.3.2 random walk (symmetric) proposals


Another important class of proposals is the random walk proposal. In this case, the proposal does
use the current state Xn−1 to define a proposal q(x0 |xn−1 ). These proposals in the random walk
(and more generally symmetric) case result in a density that is symmetric, i.e., q(x0 |x) = q(x|x0 ).
This leads to a considerable simplification in the acceptance ratio calculations. We will see some
examples below.

Example 5.7 (Random walk Gaussian proposal). Consider a mixture Gaussian target:

p? (x) = w1 N (x; µ1 , σ12 ) + w2 N (x; µ2 , σ22 ),

with some w1 , w2 > 0 and w1 + w2 = 1. Assume we want to use MH to sample from it. Choose
a proposal

q(x0 |x) = N (x0 ; x, σq2 ),

Figure 5.2: Random walk Gaussian proposal for a mixture of two Gaussians.

and derive the acceptance ratio.

Solution. This proposal is symmetric, so we can write
\[
r(x, x') = \frac{p_\star(x')\, q(x \mid x')}{p_\star(x)\, q(x' \mid x)}
= \frac{p_\star(x')}{p_\star(x)}
= \frac{w_1 \mathcal{N}(x'; \mu_1, \sigma_1^2) + w_2 \mathcal{N}(x'; \mu_2, \sigma_2^2)}{w_1 \mathcal{N}(x; \mu_1, \sigma_1^2) + w_2 \mathcal{N}(x; \mu_2, \sigma_2^2)},
\]
which is a considerable simplification. See Fig. 5.2 for a demonstration.

5.3.3 gradient based (langevin) proposals


One powerful proposal alternative is to choose the proposal based on the gradient of the
target distribution p⋆. Note that we can compute ∇ log p⋆(x) without necessarily needing the
normalising constant, since
\[
\nabla \log p_\star(x) = \nabla \log \bar{p}_\star(x) - \underbrace{\nabla \log Z}_{=\, 0}.
\]
Therefore, without doing much more than what we are already doing (using the unnormalised
density), we can inform the proposal by using the gradient of the target distribution:
\[
q(x' \mid x) = \mathcal{N}(x';\; x + \gamma \nabla \log p_\star(x),\; 2\gamma I).
\]
This algorithm is widely popular in statistics and especially in machine learning. This
approach is called the Metropolis adjusted Langevin algorithm (MALA).
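A sketch of a single MALA step is given below (illustrative code, assuming NumPy; note that this proposal is not symmetric, so the proposal densities must be kept in the acceptance ratio).

```python
import numpy as np

def mala_step(x, log_p_bar, grad_log_p_bar, gamma, rng):
    """One Metropolis-adjusted Langevin step targeting an unnormalised density p_bar."""
    def log_q(dst, src):
        # log N(dst; src + gamma * grad_log_p_bar(src), 2 * gamma * I); constants cancel in the ratio
        mean = src + gamma * grad_log_p_bar(src)
        return -np.sum((dst - mean) ** 2) / (4.0 * gamma)

    x_prop = x + gamma * grad_log_p_bar(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
    log_alpha = (log_p_bar(x_prop) + log_q(x, x_prop)) - (log_p_bar(x) + log_q(x_prop, x))
    if np.log(rng.uniform()) < log_alpha:
        return x_prop    # accept
    return x             # reject: stay at the current point
```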

5.3.4 bayesian inference with metropolis-hastings
We can finally use the Metropolis-Hastings method for Bayesian inference. In what follows, we
will provide some examples for this and visualisations resulting from the sampling procedures.
Recall that, with conditionally independent observations y1 , . . . , yn , we have Bayes' theorem as
\[
p(x \mid y_{1:n}) = \frac{p(y_{1:n} \mid x)\, p(x)}{p(y_{1:n})} = \frac{\prod_{i=1}^{n} p(y_i \mid x)\, p(x)}{p(y_{1:n})}.
\]
We write
\[
p(x \mid y_{1:n}) \propto \prod_{i=1}^{n} p(y_i \mid x)\, p(x),
\]
and set
\[
\bar{p}_\star(x) = \prod_{i=1}^{n} p(y_i \mid x)\, p(x),
\]
which is the unnormalised posterior density. We can then use the Metropolis-Hastings algorithm
to sample from this posterior density. A generic Metropolis-Hastings method for Bayesian
inference is described in Algorithm 10.

Algorithm 10 Pseudocode for the Metropolis-Hastings method for Bayesian inference

1: Input: The number of samples N and a starting point X0.
2: for n = 1, . . . , N do
3:    Propose (sample): X′ ∼ q(x′|Xn−1).
4:    Accept the sample, setting Xn = X′, with probability
\[
\alpha(X_{n-1}, X') = \min\left\{1, \frac{\bar{p}_\star(X')\, q(X_{n-1} \mid X')}{\bar{p}_\star(X_{n-1})\, q(X' \mid X_{n-1})}\right\}.
\]
5:    Otherwise reject the sample and set Xn = Xn−1.
6: end for
7: Discard the first burn-in samples and return the remaining samples.

Example 5.8 (Source localisation). This example is taken from Cemgil (2014). Consider the
problem of source localisation in the presence of three sensors with three noisy observations.
The setup in this example can be seen from the left part of Fig. 5.3. We have three sensors
surrounding an object we are trying to locate. The sensors receive noisy observations on R2 . We
are trying to locate the object based on these observations. We define our prior rather broadly:
p(x) = N (x; µ, σ 2 I) where µ = [0, 0] and σ 2 = 20. We assume that the observations are coming
from

p(yi |x, si ) = N (yi ; kx − si k, σy2 ),

where si is the location of the ith sensor on R2 for i = 1, 2, 3. We assume that the observations
are independent and that the noise is independent of the location of the object (of course, for
the sake of the example, we simulate our observations from the true model which is not the
case in the real world). Devise a MH sampler to sample from the posterior density of x, i.e., the
distribution over the location of the hidden object.

Solution. We can write the posterior density as

p(x|y1 , y2 , y3 , s1 , s2 , s3 ) ∝ p(y1 , y2 , y3 |x, s1 , s2 , s3 )p(x),

and given conditional independence we have
\[
p(x \mid y_1, y_2, y_3, s_1, s_2, s_3) \propto p(x) \prod_{i=1}^{3} p(y_i \mid x, s_i).
\]

This sort of Bayes update follows from the conditional Bayes rule introduced in Prop. 3.2. In
order to design the MH scheme, therefore, we need to just evaluate the likelihood and the prior
for MH. We choose a random walk proposal:

q(x0 |x) = N (x0 ; x, σ 2 I).

This is symmetric so the acceptance ratio is:

\[
r(x, x') = \frac{p(x')\, p(y_1 \mid x', s_1)\, p(y_2 \mid x', s_2)\, p(y_3 \mid x', s_3)}{p(x)\, p(y_1 \mid x, s_1)\, p(y_2 \mid x, s_2)\, p(y_3 \mid x, s_3)}.
\]

An example solution to this problem can be seen from Fig. 5.3 and the code can be accessed from
our online companion.

Example 5.9 (Gaussian with unknown mean and variance, Example 5.13 in Yıldırım (2017)).
Assume that we observe
\[
Y_1, \ldots, Y_n \mid z, s \sim \mathcal{N}(y_i;\, z,\, s),
\]
where we do not know z and s. Assume we have an independent prior on z and s:
\[
p(z)\, p(s) = \mathcal{N}(z; m, \kappa^2)\, \mathcal{IG}(s; \alpha, \beta),
\]
where IG(s; α, β) is the inverse Gamma distribution
\[
\mathcal{IG}(s; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, s^{-\alpha - 1} \exp\left(-\frac{\beta}{s}\right).
\]
Design an MH algorithm to sample from the posterior distribution p(z, s|y1:n).

Figure 5.3: Solution of the source localisation problem.

Solution. We can explicitly write the priors as


β α −α−1
! !
1 (z − m)2 β
p(z)p(s) = √ exp − s exp − .
2πκ2 2κ2 Γ(α) s
We are after the posterior distribution
p(z, s|y_1, . . . , y_n) ∝ p(y_1, . . . , y_n|z, s) p(z) p(s)
                        = ∏_{i=1}^{n} N(y_i; z, s) N(z; m, κ²) IG(s; α, β).

Let us denote the unnormalised posterior by p̄_⋆(z, s|y_{1:n}). In order to design the sampler, we need to design proposals over z and s. We choose a random walk proposal for z:

q(z'|z) = N(z'; z, σ_q²),

and an independent proposal for s:

q(s') = IG(s'; α, β).

The joint proposal therefore is

q(z', s'|z, s) = N(z'; z, σ_q²) IG(s'; α, β).
When we design the MH algorithm, we see that the acceptance ratio is

r(z, s, z', s') = [p̄_⋆(z', s'|y_{1:n}) q(z, s|z', s')] / [p̄_⋆(z, s|y_{1:n}) q(z', s'|z, s)]
                = [p(z') p(s') (∏_{k=1}^{n} N(y_k; z', s')) N(z; z', σ_q²) p(s)] / [p(z) p(s) (∏_{k=1}^{n} N(y_k; z, s)) N(z'; z, σ_q²) p(s')]
                = [N(z'; m, κ²) ∏_{k=1}^{n} N(y_k; z', s')] / [N(z; m, κ²) ∏_{k=1}^{n} N(y_k; z, s)].

Figure 5.4: Banana density estimation using random walk Metropolis: the target distribution (left) and the histogram of the samples (right).

Example 5.10 (The banana density). Consider the following density:


p(x, y) ∝ exp(−x²/10 − y⁴/10 − 2(y − x²)²).

This density is only available in unnormalised form and it is an excellent test problem on which many algorithms fail. Design an MH algorithm for this density.

Solution. We have

p̄_⋆(x, y) = exp(−x²/10 − y⁴/10 − 2(y − x²)²),

and let us choose the proposal

q(x', y'|x, y) = N(x'; x, σ_q²) N(y'; y, σ_q²).

This is a symmetric proposal so the acceptance ratio is

r(x, y, x', y') = p̄_⋆(x', y') / p̄_⋆(x, y).

Note that it makes sense to compute only the log-acceptance ratio here,

log r(x, y, x', y') = log p̄_⋆(x', y') − log p̄_⋆(x, y),

and implement the acceptance step by drawing U ∼ Unif(0, 1) and accepting if log U < log r(x, y, x', y'). The result can be seen from Fig. 5.4.
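A minimal Python sketch of this sampler, working entirely in the log domain, is given below; the proposal standard deviation and the chain length are illustrative assumptions.

# Sketch of random walk Metropolis for the banana density (Example 5.10),
# using the log-acceptance ratio. sigma_q and the chain length are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def log_p_bar(x, y):
    # log of the unnormalised banana density
    return -x**2 / 10.0 - y**4 / 10.0 - 2.0 * (y - x**2)**2

sigma_q, N = 1.0, 50000
chain = np.zeros((N, 2))
x, y = 0.0, 0.0
for n in range(N):
    x_new = x + sigma_q * rng.normal()
    y_new = y + sigma_q * rng.normal()
    log_r = log_p_bar(x_new, y_new) - log_p_bar(x, y)   # symmetric proposal
    if np.log(rng.uniform()) < log_r:
        x, y = x_new, y_new
    chain[n] = (x, y)
# A 2D histogram of `chain` (after burn-in) qualitatively reproduces Fig. 5.4.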

5.4 gibbs sampling
We will now go into another major class of MCMC samplers, called Gibbs samplers. The idea of
Gibbs samplers is that, given a joint distribution of many variables p(x1 , . . . , xd ), one can build a
Markov chain that samples from this distribution by sampling from the conditional distributions.
This will also allow us to sample from high-dimensional distributions in a straightforward way. The downside of this approach is that one has to derive the conditional distributions, which can be difficult. However, if one can do this, then the Gibbs sampler can be a very efficient method.
In this chapter, we denote our target similarly as p? (x) where x ∈ Rd and define the full
conditional distributions as

pm,? (xm |x1 , . . . , xm−1 , xm+1 , . . . , xd ) = pm,? (xm |x1:m−1 , xm+1:d ) = pm,? (xm |x−m ),

where x_{−m} = (x_1, . . . , x_{m−1}, x_{m+1}, . . . , x_d) is the vector of all variables except x_m. For a moment, assume that the full conditionals are available. Also assume that we obtain X_{n−1} ∈ R^d at the (n − 1)'th iteration of the algorithm. To denote individual components, we use X_{n−1,m} for the m'th component of X_{n−1}. Of course, the key aspect of the Gibbs sampler is to derive the

Algorithm 11 Pseudocode for the Gibbs sampler


1: Input: The number of samples N , and starting point X0 ∈ Rd .
2: for n = 1, . . . , N do
3: Sample
      X_{n,1} ∼ p_{1,⋆}(X_{n,1} | X_{n−1,2}, . . . , X_{n−1,d})
      X_{n,2} ∼ p_{2,⋆}(X_{n,2} | X_{n,1}, X_{n−1,3}, . . . , X_{n−1,d})
      X_{n,3} ∼ p_{3,⋆}(X_{n,3} | X_{n,1}, X_{n,2}, X_{n−1,4}, . . . , X_{n−1,d})
      ⋮
      X_{n,d} ∼ p_{d,⋆}(X_{n,d} | X_{n,1}, X_{n,2}, . . . , X_{n,d−1})

4: end for
5: Discard first burnin samples and return the remaining samples.

full conditional distributions. We will come back to this point, but we will first investigate why the Gibbs sampling approach provides us with a valid MCMC kernel, in other words, how the Gibbs sampler satisfies detailed balance.
Let us denote x = (x_m, x_{−m}). It is easy to see from Algorithm 11 that each iteration of the Gibbs sampler (at time n) is defined as d separate operations, each sampling from a conditional distribution. We can first look at what goes on in each of these d updates. It is also easy to see that the kernel defined in each of these d updates is given as

Km (x0 |x) = pm,? (x0m |x−m )δx−m (x0−m ),

where δx−m (x0−m ) is the Dirac delta function. Intuitively, each step samples from the full con-
ditional pm,? (·|x−m ) for mth dimension where m ∈ {1, . . . , d} and leaves others unchanged,
which is enforced by the term δx−m (x0−m ). One can then see that the entire Gibbs kernel can be
written as

K = K1 K2 . . . Kd .

Note that each kernel is an integral operator; therefore the above equation is almost symbolic, denoting the successive application (composition) of the kernels rather than a pointwise product. We will now show that the Gibbs kernel satisfies detailed balance.

Proposition 5.3. The Gibbs kernel K leaves the target distribution p? invariant.

Proof. We first show that each kernel Km satisfies the detailed balance condition:
p? (x)Km (x0 |x) = p? (x)pm,? (x0m |x−m )δx−m (x0−m )
= p? (x−m )pm,? (xm |x−m )pm,? (x0m |x−m )δx−m (x0−m )
= p? (x0−m )pm,? (x0m |x0−m )pm,? (xm |x0−m )δx0−m (x−m )
= p? (x0 )Km (x|x0 ).
The steps of this derivation follow from the fact that the Dirac delta allows us to exchange the variables x and x'. This shows that K_m satisfies the detailed balance condition; therefore, we have

∫ K_m(x'|x) p_⋆(x) dx = p_⋆(x'),

i.e., K_m leaves p_⋆ invariant. Let us now denote this integral operation by

(K_m, p_⋆) = ∫ K_m(x'|x) p_⋆(x) dx = p_⋆.

One can see that we have (K_2, (K_1, p_⋆)) = (K_2, p_⋆) = p_⋆, and the same holds for all m = 1, . . . , d. Therefore, applying the d kernels K_1, . . . , K_d in succession leaves p_⋆ invariant. □
We can also see why Gibbs sampling works by relating it to the Metropolis-Hastings algorithm.
Recall that we can see our sampling from the conditional as a proposal, i.e.,
qm (x0 |x) = pm,? (x0m |x−m )δx−m (x0−m ).
If we calculate the acceptance ratio for this proposal,

α_m(x'|x) = min{1, [p_⋆(x') q_m(x|x')] / [p_⋆(x) q_m(x'|x)]} = min{1, [p_⋆(x') K_m(x|x')] / [p_⋆(x) K_m(x'|x)]}.
We see that this is equal to 1 as the detailed balance is satisfied for qm (which is Km – see the
proof of Proposition 5.3).
As we noted before, we have shown that the kernel K leaves p_⋆ invariant, but this alone does not give us proper convergence guarantees. Note that the version of the algorithm we presented is called the deterministic scan Gibbs sampler. The reason for this is that the algorithm in Alg. 11 is implemented so that we sample x_1, . . . , x_d in order, scanning the variables deterministically. It turns out that, while this sampler's convergence guarantees cannot be established easily, there is an algorithmic fix which results in a procedure that is also guaranteed to converge. Instead of scanning the variables deterministically, we can sample them in a random order. This is called the random scan Gibbs sampler. The algorithm is given in Alg. 12. We will now see an example.

Algorithm 12 Random scan Gibbs sampler
1: Input: The number of samples N , and starting point X0 ∈ Rd .
2: for n = 1, . . . , N do
3:    Sample j uniformly from {1, . . . , d} and sample

         X_{n,j} ∼ p_{j,⋆}(X_{n,j} | X_{n−1,−j}),

      setting X_{n,−j} = X_{n−1,−j}.
4: end for
Figure 5.5: Denoising of an image using the Gibbs sampler. The left column shows the original image, the middle column shows the noisy image, and the right column shows the denoised image. I used σ = 1, J = 4 for this and the Gibbs sampler scanned the entire image only 10 times.

Example 5.11 (Image denoising). A biologist knocked on your door because some of the images from the microscope were too noisy. You decide to help by using the Gibbs sampler. The model is given as follows.
given as follows.
Consider a set of random variables Xij for i = 1, . . . , m and j = 1, . . . , n. This is a matrix
modeling an m × n image. We assume that we have an image that takes values Xij ∈ {−1, 1} –
note that this is an “unusual” image, as the images usually take values between [0, 255] (or [0, 1]).
We assume that the image is corrupted by noise, i.e., we have a noisy image

Y_{ij} = X_{ij} + σ ε_{ij},

where ε_{ij} ∼ N(0, 1) and σ is the standard deviation of the noise. We assume that the noise is
independent of the image. We want to recover the image Xij from the noisy image Yij and utilise
Gibbs sampler for this purpose.
The goal is to obtain (conceptually) p(X|Y), i.e., samples from p(X|Y) given Y. For this, we
need to specify a prior p(X). We take this from the literature and place as a prior a smooth
Markov random field (MRF) assumption. This is formalised as
p(X_{ij}|X_{−ij}) = (1/Z) exp(J X_{ij} W_{ij}),
where W_{ij} is the sum of the X_{kl}'s in the neighbourhood of X_{ij}, i.e.,

W_{ij} = ∑_{(k,l) ∈ neighbourhood of (i,j)} X_{kl} = X_{i−1,j} + X_{i+1,j} + X_{i,j−1} + X_{i,j+1}.

This is an intuitive model of the image, making the current value of the pixel depend on the
values of its neighbours. The exercise is to design the Gibbs sampler for this problem.

Solution. We aim at using a Gibbs sampler approach for sampling from the posterior p(X|Y). Note that we now need to sample from full conditionals, e.g., for each (i, j), we need to sample from X_{ij} ∼ p(X_{ij}|X_{−ij}, Y_{ij}). We derive the full conditional as

p(X_{ij} = k|X_{−ij}, Y_{ij}) = p(Y_{ij}|X_{ij} = k) p(X_{ij} = k|X_{−ij}) / ∑_{k' ∈ {−1,1}} p(Y_{ij}|X_{ij} = k') p(X_{ij} = k'|X_{−ij}),

where p(Yij |Xij = k) = N (Yij ; k, σ 2 ) is the likelihood of the noisy image given the value of the
pixel. We can easily compute these probabilities since each term in the Bayes rule is computable
(and 1/Z cancels). Therefore, we can get explicit expressions for q = p(Xij = 1|X−ij , Yij ) and
1 − q = p(Xij = −1|X−ij , Yij ). We can then sample from the full conditional as

X_{ij} ∼ {  1   with probability q;   −1   with probability 1 − q }.

We can now loop over (i, j) and sample each pixel from its full conditional. This is the Gibbs sampler algorithm as described above; the results of this procedure can be seen from Fig. 5.5, and a small sketch is given below.
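A minimal Python sketch of this Gibbs sweep follows. The toy image is a made-up assumption, while σ = 1, J = 4 and 10 sweeps follow the caption of Fig. 5.5; giving boundary pixels fewer neighbours is also an implementation assumption.

# Sketch of the Gibbs sampler for the image denoising model of Example 5.11.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma, J, n_sweeps = 1.0, 4.0, 10

# A toy {-1, +1} image: a square on a background (assumed, for illustration)
X_true = -np.ones((40, 40))
X_true[10:30, 10:30] = 1.0
Y = X_true + sigma * rng.normal(size=X_true.shape)   # noisy observation

X = np.where(Y >= 0, 1.0, -1.0)    # initialise at the thresholded noisy image
m, n = X.shape
for _ in range(n_sweeps):
    for i in range(m):
        for j in range(n):
            W = 0.0                                   # W_ij: sum of the neighbours
            if i > 0:     W += X[i - 1, j]
            if i < m - 1: W += X[i + 1, j]
            if j > 0:     W += X[i, j - 1]
            if j < n - 1: W += X[i, j + 1]
            # unnormalised full conditional p(X_ij = k | X_-ij, Y_ij) for k = +1, -1
            p_plus = norm.pdf(Y[i, j], loc=1.0, scale=sigma) * np.exp(J * W)
            p_minus = norm.pdf(Y[i, j], loc=-1.0, scale=sigma) * np.exp(-J * W)
            q = p_plus / (p_plus + p_minus)
            X[i, j] = 1.0 if rng.uniform() < q else -1.0

print("fraction of pixels recovered:", np.mean(X == X_true))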

Let us consider another example.

Example 5.12 (Beta-Binomial Gibbs sampler). Consider the following model

p(θ) = Beta(θ; α, β) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1},
and
p(x|θ) = Bin(x; n, θ) = (n choose x) θ^x (1 − θ)^{n−x}.

Derive the Gibbs sampler to sample from the joint distribution p(x, θ).

Solution. We know that, for this, we need full conditionals, i.e., we need p(x|θ) and p(θ|x). We
can see that p(x|θ) is already provided in the definition of the model. Therefore, we only need to
derive the posterior. We can write the joint distribution as
p(x, θ) = p(x|θ) p(θ) = (n choose x) θ^x (1 − θ)^{n−x} · [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}
        = [Γ(α + β)/(Γ(α)Γ(β))] (n choose x) θ^{x+α−1} (1 − θ)^{n−x+β−1}.

To apply Bayes' theorem, p(θ|x) = p(x|θ)p(θ)/p(x), we also need to compute p(x). This is given by

p(x) = ∫_0^1 p(x, θ) dθ = [Γ(α + β)/(Γ(α)Γ(β))] (n choose x) ∫_0^1 θ^{x+α−1} (1 − θ)^{n−x+β−1} dθ
     = (n choose x) [Γ(α + β)/(Γ(α)Γ(β))] [Γ(x + α)Γ(n − x + β)/Γ(n + α + β)].

Therefore, we can compute the posterior as

p(θ|x) = p(x, θ)/p(x)
       = [(n choose x) (Γ(α + β)/(Γ(α)Γ(β))) θ^{x+α−1} (1 − θ)^{n−x+β−1}] / [(n choose x) (Γ(α + β)/(Γ(α)Γ(β))) Γ(x + α)Γ(n − x + β)/Γ(n + α + β)]
       = [Γ(n + α + β)/(Γ(x + α)Γ(n − x + β))] θ^{x+α−1} (1 − θ)^{n−x+β−1}
       = Beta(θ; x + α, n − x + β).

Therefore we can sample from p(θ|x) using any method to simulate a Beta variable. The Gibbs
sampler is then defined as follows:

• Initialise x0 , θ0

• For k = 1, 2, . . .:

– Sample θk ∼ p(θ|xk−1 )
– Sample xk ∼ p(x|θk )

• Return xk , θk for k = 1, 2, . . ..

We also note that the simulated x_k are approximately distributed according to p(x), which also gives us a way to approximate p(x). A minimal sketch of this sampler is given below.
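The following Python sketch implements this Gibbs sampler; the values of α, β, n and the number of iterations are illustrative assumptions.

# Sketch of the Beta-Binomial Gibbs sampler of Example 5.12.
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, n = 2.0, 5.0, 20      # assumed hyperparameters
n_iters = 10000

theta = rng.beta(alpha, beta)      # initialise theta_0 from the prior
x = rng.binomial(n, theta)         # initialise x_0
xs = np.zeros(n_iters, dtype=int)
thetas = np.zeros(n_iters)
for k in range(n_iters):
    theta = rng.beta(x + alpha, n - x + beta)    # theta_k ~ p(theta | x_{k-1})
    x = rng.binomial(n, theta)                   # x_k ~ p(x | theta_k)
    xs[k], thetas[k] = x, theta

# The x-samples approximate the Beta-Binomial marginal p(x):
print(np.bincount(xs, minlength=n + 1) / n_iters)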

5.4.1 metropolis-within-gibbs
One remarkable feature of the Gibbs sampler is that, when we cannot derive the full conditionals (or are too lazy to do so), we can instead target each full conditional with a single Metropolis step at each iteration. This is called the Metropolis-within-Gibbs algorithm and, remarkably, it samples from the correct posterior!
Consider a generic target p(x, y). In many situations, it is easier to write unnormalised full
conditionals, i.e., we would have access to p̄(x|y) and p̄(y|x), but not p(x|y) and p(y|x). In this
case, we can use the Metropolis-within-Gibbs algorithm – meaning that instead of sampling
from the full conditional, we can take a Metropolis step to sample from the full conditional. The
algorithm can be summarised as follows. Given (x_{n−1}, y_{n−1}):

1. Sample x' ∼ q_x(x'|x_{n−1}) and accept x' with probability

      r_x = [p̄(x'|y_{n−1}) q_x(x_{n−1}|x')] / [p̄(x_{n−1}|y_{n−1}) q_x(x'|x_{n−1})],

   i.e., set x_n = x' with probability r_x and x_n = x_{n−1} otherwise.

2. Sample y' ∼ q_y(y'|y_{n−1}) and accept y' with probability

      r_y = [p̄(y'|x_n) q_y(y_{n−1}|y')] / [p̄(y_{n−1}|x_n) q_y(y'|y_{n−1})],

   i.e., set y_n = y' with probability r_y and y_n = y_{n−1} otherwise.

Example 5.13 (Metropolis-within-Gibbs for Example 5.9). Let us return to Example 5.9. To
recall the model, assume that we observe
Y1 , . . . , Yn |z, s ∼ N (yi ; z, s)
where we do not know z and s. Assume we have an independent prior on z and s:
p(z)p(s) = N (z; m, κ2 )IG(s; α, β).
where IG(s; α, β) is the inverse Gamma distribution
IG(s; α, β) = (β^α / Γ(α)) s^{−α−1} exp(−β/s).
In other words, we have
p(z)p(s) = (1/√(2πκ²)) exp(−(z − m)²/(2κ²)) · (β^α/Γ(α)) s^{−α−1} exp(−β/s).
We are after the posterior distribution
p(z, s|y_1, . . . , y_n) ∝ p(y_1, . . . , y_n|z, s) p(z) p(s)
                        = ∏_{i=1}^{n} N(y_i; z, s) N(z; m, κ²) IG(s; α, β).

Let us denote the unnormalised posterior by p̄_⋆(z, s|y_{1:n}). Now, instead of using MH or defining a Gibbs sampler (which requires us to derive the full conditionals), derive the Metropolis-within-Gibbs algorithm.

Solution. For this, note the unnormalised full conditionals:


p̄_⋆(z|s, y_{1:n}) = ∏_{i=1}^{n} N(y_i; z, s) N(z; m, κ²),

and

p̄_⋆(s|z, y_{1:n}) = ∏_{i=1}^{n} N(y_i; z, s) IG(s; α, β).

In order to do this, we need to design proposals over z and s to target p̄_⋆(z|s, y_{1:n}) and p̄_⋆(s|z, y_{1:n}) respectively. Each step will be a standard Metropolis step, as if we are solving each problem independently. We choose a random walk proposal for z:

q(z'|z) = N(z'; z, σ_q²),

and an independent proposal for s:

q(s') = IG(s'; α, β).

Therefore, the Metropolis-within-Gibbs sampler can be implemented as follows.

• Initialise z0 , s0

• For k = 1, 2, . . .:

• Metropolis step for z-marginal:

– Sample z 0 ∼ q(z 0 |zk−1 )


– Accept z' and set z_k = z' with probability

      r_z = p̄_⋆(z'|s_{k−1}, y_{1:n}) / p̄_⋆(z_{k−1}|s_{k−1}, y_{1:n}),

  which is simplified due to the symmetric proposal.


– Otherwise set zk = zk−1 .

• Metropolis step for s-marginal:

– Sample s0 ∼ q(s0 )
– Accept s' and set s_k = s' with probability

      r_s = [p̄_⋆(s'|z_k, y_{1:n}) q(s_{k−1})] / [p̄_⋆(s_{k−1}|z_k, y_{1:n}) q(s')]
          = [∏_{i=1}^{n} N(y_i; z_k, s')] / [∏_{i=1}^{n} N(y_i; z_k, s_{k−1})].

– Otherwise set sk = sk−1 .

• Return zk , sk for k = 1, 2, . . ..
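A minimal Python sketch of this Metropolis-within-Gibbs sampler is given below. The simulated data, the prior hyperparameters, the proposal scale and the number of iterations are illustrative assumptions.

# Sketch of the Metropolis-within-Gibbs sampler of Example 5.13.
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(4)
y = rng.normal(1.0, np.sqrt(2.0), size=100)   # simulated data (assumed true z=1, s=2)
m, kappa2 = 0.0, 10.0                         # prior on z: N(m, kappa2)
alpha, beta = 2.0, 2.0                        # prior on s: IG(alpha, beta)
sigma_q, n_iters = 0.3, 20000                 # assumed proposal std and chain length

def log_cond_z(z, s):
    # log p_bar(z | s, y) up to an additive constant
    return np.sum(norm.logpdf(y, z, np.sqrt(s))) - 0.5 * (z - m)**2 / kappa2

def log_lik(z, s):
    return np.sum(norm.logpdf(y, z, np.sqrt(s)))

z, s = 0.0, 1.0
chain = np.zeros((n_iters, 2))
for k in range(n_iters):
    # Metropolis step for z (symmetric random walk proposal)
    z_prop = z + sigma_q * rng.normal()
    if np.log(rng.uniform()) < log_cond_z(z_prop, s) - log_cond_z(z, s):
        z = z_prop
    # Metropolis step for s (independent IG(alpha, beta) proposal);
    # the prior and the proposal cancel, leaving the likelihood ratio r_s
    s_prop = invgamma.rvs(alpha, scale=beta, random_state=rng)
    if np.log(rng.uniform()) < log_lik(z, s_prop) - log_lik(z, s):
        s = s_prop
    chain[k] = (z, s)

print(chain[n_iters // 2:].mean(axis=0))   # approximate posterior means of (z, s)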

5.5 langevin mcmc methods


We briefly introduced Metropolis-adjusted Langevin algorithm (MALA) in Sec. 5.3.3. MALA
is just one example of a more general class of samplers, called Langevin MCMC algorithms,
which are based on Langevin dynamics. These approaches are at the forefront of modern MCMC
methods, used in a variety of settings, including sampling from high-dimensional targets, deep
learning, sampling from Bayesian neural networks, and so on. We will now introduce the
Langevin MCMC methods and see how they work.
Consider again our target p? defined on Rd . It turns out, we can use stochastic differential
equations (SDEs) to sample from p? (recall that SDEs are differential equations with a stochastic
term). Consider the following SDE:

dX_t = ∇ log p_⋆(X_t) dt + √2 dB_t,   (5.5)
where Bt is a standard Brownian motion. It turns out the marginal distributions of Xt driven by
this SDE converge to p? as t → ∞. In other words, in many suitable metrics, we can quantify the
convergence d(pt , p? ) → 0 as t → ∞. This means that all we need to draw samples from p? is to

numerically solve this SDE, which can be done with a variety of numerical methods (akin to ODE solvers). One caveat in this situation is that, while the SDE targets p_⋆, its discretisation would incur bias. This is why MALA is “Metropolised”.
Let us recall the MALA algorithm. We start with a point X0 and then define the proposal
q(xn |xn−1 ) = N (xn ; xn−1 + γ∇ log p? (xn−1 ), 2γId ), (5.6)
where γ > 0 is a step size. We then sample from q to obtain Xn . We then accept Xn with
probability
α(X_n, X_{n−1}) = min{1, [p_⋆(X_n) q(X_{n−1}|X_n)] / [p_⋆(X_{n−1}) q(X_n|X_{n−1})]}.   (5.7)
Recall that the MALA proposal does not define a symmetric proposal; therefore, we need to compute the ratio. Now one can see that Eq. (5.6) can be equivalently written as

X_n = X_{n−1} + γ ∇ log p_⋆(X_{n−1}) + √(2γ) W_n,   (5.8)
where Wn ∼ N (0, I). The relationship of Eq. (5.8) to (5.5) can be seen by noting that the
discretisation of the SDE in (5.5) would exactly take the form of Eq. (5.8). Therefore, MALA uses
this Langevin SDE as the proposal and then accept/reject its samples. This has a beneficial effect
of correcting the bias of the discretisation.
However, the Metropolis step can be computationally infeasible in higher dimensions, just
as in the case of rejection sampling. Higher dimensional problems cause the acceptance rate to
vanish, which results in slow convergence. To remedy this situation, a common approach is to
simply drop the Metropolis step and use the following iteration:

X_n = X_{n−1} + γ ∇ log p_⋆(X_{n−1}) + √(2γ) W_n.   (5.9)

This is a simple (and biased) MCMC method, called the unadjusted Langevin algorithm (ULA). The ULA is a discretisation of the SDE and, as such, its stationary measure is not p_⋆. However, under various conditions, it can be shown that the limiting distribution of ULA, p_⋆^γ, can be made arbitrarily close to p_⋆ as γ → 0. This means that the ULA can be a viable alternative.

Example 5.14. Consider the target p? (x) = N (x; µ, σ 2 ). Write down the ULA for this target
and derive the stationary distribution of the chain.

Solution. We first compute the derivative of the log target,

(∂/∂x) log p_⋆(x) = −(x − µ)/σ².

We can then write the iterates of ULA as

X_n = X_{n−1} + γ (∂/∂x) log p_⋆(X_{n−1}) + √(2γ) W_n
    = X_{n−1} − γ (X_{n−1} − µ)/σ² + √(2γ) W_n
    = (1 − γ/σ²) X_{n−1} + (γ/σ²) µ + √(2γ) W_n,

where Wn ∼ N (0, 1). In this simple case, we can compute the stationary distribution of the
chain and analyse its relationship to the true target p? . Let
a = 1 − γ/σ²,   b = (γ/σ²) µ.
We can now write the iterates beginning at x_0 as

x_1 = a x_0 + b + √(2γ) W_1,
x_2 = a x_1 + b + √(2γ) W_2 = a² x_0 + a b + a √(2γ) W_1 + b + √(2γ) W_2,
x_3 = a x_2 + b + √(2γ) W_3 = a³ x_0 + a² b + a² √(2γ) W_1 + a b + a √(2γ) W_2 + b + √(2γ) W_3,
⋮
x_n = a^n x_0 + ∑_{k=0}^{n−1} a^k b + ∑_{k=0}^{n−1} a^k √(2γ) W_{n−k}.

We can compute the expected value

E[X_n] = a^n x_0 + ∑_{k=0}^{n−1} a^k b,

since the W_k are zero mean. As n → ∞, we have

µ_∞ = lim_{n→∞} E[X_n] = ∑_{k=0}^{∞} a^k b = b/(1 − a) = µ,
since 0 < a < 1. The variance of the iterates as n → ∞ can also be computed. Note that for
finite n, we have
var(x_n) = var( ∑_{k=0}^{n−1} a^k √(2γ) W_{n−k} )
         = 2γ ∑_{k=0}^{n−1} (a²)^k
         = 2γ (1 − a^{2n}) / (1 − a²).
Therefore, we obtain the limiting variance as
lim_{n→∞} var(x_n) = 2γ / (1 − a²) = 2γ / [1 − (1 − γ/σ²)²] = 2γ / (2γ/σ² − γ²/σ⁴) = 2σ⁴ / (2σ² − γ).

Therefore, we obtain the target measure of ULA as

p_⋆^γ(x) = N(x; µ, 2σ⁴/(2σ² − γ)),

which is different from p_⋆. Note that in this particular case, the means of p_⋆^γ and p_⋆ agree. It can be seen that the bias enters the picture through the variance, but this vanishes as γ → 0.
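The calculation above can be checked numerically; the following is a small Python sketch where µ, σ², γ and the number of iterations are illustrative assumptions.

# Numerical check of Example 5.14: ULA targeting N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, gamma = 2.0, 1.0, 0.1
n_iters = 200000

x = 0.0
samples = np.zeros(n_iters)
for n in range(n_iters):
    grad = -(x - mu) / sigma2                        # grad log p_star(x)
    x = x + gamma * grad + np.sqrt(2 * gamma) * rng.normal()
    samples[n] = x

emp_var = samples[10000:].var()
theory_var = 2 * sigma2**2 / (2 * sigma2 - gamma)    # limiting variance derived above
print(emp_var, theory_var)                            # the two should be close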

We can also derive the ULA for the Banana density.

Example 5.15. Consider the Banana density


p(x, y) ∝ exp(−x²/10 − y⁴/10 − 2(y − x²)²).

Derive ULA for this density.

Solution. Note that the density is only available in unnormalised form:

p̄_⋆(x, y) = exp(−x²/10 − y⁴/10 − 2(y − x²)²).

Recall that ∇ log p̄_⋆(x, y) = ∇ log p_⋆(x, y). Therefore, we will directly compute the unnormalised gradients:

∇ log p_⋆(x, y) = [ −x/5 + 8x(y − x²),   −2y³/5 − 4(y − x²) ]ᵀ.

Therefore, the update in 2D is given by

[x_{n+1}, y_{n+1}]ᵀ = [x_n, y_n]ᵀ + γ ∇ log p_⋆(x_n, y_n) + √(2γ) V_n,

where V_n ∼ N(0, I₂).

Example 5.16 (Bayesian inference with ULA). Show that we can also straightforwardly perform Bayesian inference in this setting by deriving ULA for a generic posterior density given p(y|x) and p(x).

Solution. Recall the target posterior density in this setting,

p̄_⋆(x|y) = p(x) ∏_{k=1}^{n} p(y_k|x),

where p(x) is the prior density and p(y_k|x) is the likelihood, with observations conditionally i.i.d. given x. We can write the ULA iterates as

X_n = X_{n−1} + γ ∇ log p̄_⋆(X_{n−1}|y) + √(2γ) V_n
    = X_{n−1} + γ ( ∇ log p(X_{n−1}) + ∑_{k=1}^{n} ∇ log p(y_k|X_{n−1}) ) + √(2γ) V_n.

A common problem arising in machine learning and statistics is big data, where the number of observations n is large. In this case, both ULA and MALA can become infeasible as both require the iterates above to be evaluated, i.e., each iteration involves summing n terms. If n is of the order of millions, this is computationally prohibitive. In this case, we can use stochastic gradients. This is only applicable in the setting of ULA, as we will see below, which is one reason why ULA-type methods are more popular than MALA-type methods.

5.5.1 stochastic gradient langevin dynamics


The problem of a large number of data points arises in the setting of ULA as a sum; therefore, we should look at estimating large sums with something cheaper. Consider the following average of (arbitrary) numbers:

g = (1/n) ∑_{i=1}^{n} g_i.
If n is simply too large to compute this sum efficiently, we can instead resort to unbiased estimates
of it. This can be done by sampling i1 , . . . , iK ∼ {1, . . . , n} uniformly and constructing
ĝ = (1/K) ∑_{k=1}^{K} g_{i_k}.
This estimate is an unbiased estimate of g, i.e.,

E[ĝ] = E[ (1/K) ∑_{k=1}^{K} g_{i_k} ] = (1/K) ∑_{k=1}^{K} E[g_{i_k}] = g.
This idea can be used to construct stochastic gradients. We provide some examples below.

Example 5.17 (Large scale Bayesian inference). Recall the problem setting in Example 5.16:
X_n = X_{n−1} + γ ( ∇ log p(X_{n−1}) + ∑_{k=1}^{n} ∇ log p(y_k|X_{n−1}) ) + √(2γ) V_n.

Design the stochastic gradient sampler for this case.

Solution. Assume we sample i_1, . . . , i_K uniformly from {1, . . . , n}; we can then approximate the sum as

∑_{k=1}^{n} ∇ log p(y_k|X_{n−1}) ≈ (n/K) ∑_{k=1}^{K} ∇ log p(y_{i_k}|X_{n−1}).

Note that the (n/K) factor appears here because the sum itself does not have a (1/n) term (as opposed to the example above). Therefore, the stochastic gradient Langevin dynamics (SGLD) iterate can be written as

X_n = X_{n−1} + γ ( ∇ log p(X_{n−1}) + (n/K) ∑_{k=1}^{K} ∇ log p(y_{i_k}|X_{n−1}) ) + √(2γ) V_n.

This is also called data subsampling as one can see that the gradient only uses a subset of the data.
Every iteration is cheap and computable as we only need to compute K terms. This is a very
popular method in Bayesian inference and is used in many applications.
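A minimal Python sketch of SGLD for a toy Gaussian model is given below; the data set, prior variance, minibatch size K, step size and iteration count are illustrative assumptions.

# Sketch of SGLD (Example 5.17) for y_k ~ N(x, 1) with prior x ~ N(0, 10).
import numpy as np

rng = np.random.default_rng(6)
n = 100000                                 # a large (assumed) data set
y = rng.normal(1.0, 1.0, size=n)
prior_var = 10.0
K, gamma, n_iters = 100, 1e-6, 5000        # assumed minibatch size and step size

x = 0.0
samples = np.zeros(n_iters)
for t in range(n_iters):
    idx = rng.integers(0, n, size=K)                   # subsample i_1, ..., i_K
    grad_prior = -x / prior_var
    grad_lik = (n / K) * np.sum(y[idx] - x)            # (n/K) * sum of grad log p(y_i|x)
    x = x + gamma * (grad_prior + grad_lik) + np.sqrt(2 * gamma) * rng.normal()
    samples[t] = x

# The chain mean roughly recovers the posterior mean (close to the sample mean of y).
print(samples[1000:].mean())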

5.6 mcmc for optimisation


MCMC methods were originally motivated by optimisation problems. These methods are a good
candidate to solve challenging, nonconvex optimisation problems with multiple minima due to
the intrinsic noise in the algorithms. In this section, we will briefly look at two MCMC methods
that can be used for optimisation: (i) simulated annealing and (ii) Langevin MCMC.

5.6.1 background
It is important to note that a sampler can be used as an optimiser in the following context.
Consider the target density
pβ? (x) ∝ exp(−βf (x)),
where β > 0 is a parameter. It is known in the literature that the density pβ? (x) concentrates
around the minima of f as β → ∞ (Hwang, 1980). This connection between probability
distributions and optimisation spurred the development of MCMC methods for optimisation. In
what follows, we describe two methods that exploit this connection.

5.6.2 simulated annealing


Consider now a sequence of target distributions defined as
p_⋆^{β_t}(x) ∝ exp(−β_t f(x)),

where β_t > 0 is an increasing sequence of parameters. This algorithm anneals the target distribution so that p_⋆^{β_t}(x) becomes concentrated around the minima of f. At the same time, the method uses

Algorithm 13 Simulated Annealing
1: Input: starting point X_0.
2: for t = 1, 2, . . . do
3:    X' ∼ q(x|X_{t−1}) (symmetric proposal, e.g., random walk)
4:    Set β_t (e.g. β_t = √(1 + t))
5:    Accept X_t = X' with probability

         min{ 1, p̄_⋆^{β_t}(X') / p̄_⋆^{β_t}(X_{t−1}) }.

6:    Otherwise set X_t = X_{t−1}.
7: end for

each of these distributions within an accept-reject mechanism. Drawing on the previous section's MH algorithm, we can easily see the simulated annealing (SA) algorithm in Algorithm 13.
One can see that this simulated annealing method takes a particularly intuitive form for optimisation. If we look at the acceptance ratio

p̄_⋆^{β_t}(X') / p̄_⋆^{β_t}(X_{t−1}) = exp(−β_t f(X')) / exp(−β_t f(X_{t−1})) = exp(β_t (f(X_{t−1}) − f(X'))),

we can see that the acceptance ratio is a function of the difference in the objective function values. If f(X') ≤ f(X_{t−1}), this ratio takes higher values, possibly bigger than 1 depending on the improvement. If, however, f(X') ≥ f(X_{t−1}), the acceptance ratio will be small, as it should be. The scheduling of (β_t)_{t≥0} is a design problem that depends on the specific cost function under consideration.

Example 5.18. Consider the following challenging cost function

f (x) = − (cos(50x) + sin(20x))2 exp(−5x2 ), x ∈ [−1, 1]. (5.10)

This is a nonconvex function with multiple local minima. The function has one global minimum and we aim at finding it. Describe the simulated annealing method with the schedule β_t = √(1 + t).

Solution. We see that β_t → ∞ as t grows. We use a random walk proposal with a standard deviation of σ_q = 0.1 and implement the accept-reject step in the log domain. We initialise X_0 ∼ Unif(−1, 1). The algorithm is implemented as follows: given X_{t−1},

• X 0 ∼ q(x|Xt−1 ) = N (x; Xt−1 , σq2 )

• Sample u ∼ Unif(0, 1)

• Accept if

log(u) < βt (f (Xt−1 ) − f (X 0 ))

Figure 5.6: Simulated annealing for the function in Eq. (5.10). The red line shows the final estimate of the SA algorithm.

• Otherwise set Xt = Xt−1 .

The result can be seen from Figure 5.6. We can see that the algorithm is able to find the global minimum.
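A minimal Python sketch of this simulated annealing run is given below. The proposal standard deviation σ_q = 0.1 and the schedule β_t = √(1 + t) follow the example; the number of iterations and the rejection of proposals outside [−1, 1] are assumptions.

# Sketch of simulated annealing (Example 5.18) for the cost in Eq. (5.10).
import numpy as np

rng = np.random.default_rng(7)

def f(x):
    return -(np.cos(50 * x) + np.sin(20 * x))**2 * np.exp(-5 * x**2)

sigma_q, n_iters = 0.1, 5000
x = rng.uniform(-1.0, 1.0)
for t in range(1, n_iters + 1):
    beta_t = np.sqrt(1 + t)
    x_prop = x + sigma_q * rng.normal()
    if -1.0 <= x_prop <= 1.0:          # keep proposals inside the domain (assumption)
        if np.log(rng.uniform()) < beta_t * (f(x) - f(x_prop)):
            x = x_prop

print(x, f(x))    # x should be near the global minimiser of f on [-1, 1]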

5.6.3 langevin mcmc for optimisation


The family of MCMC methods can be used for optimisation as well. We will showcase one
example.

Example 5.19 (ULA for Optimisation). Assume that we try to solve the following problem:

arg min_{x∈R} f(x),

where f(x) = (1/(2σ²))(x − µ)² and µ and σ are known. Of course, (i) we do not really need the scaling factor 1/(2σ²) and (ii) we can simply solve this problem exactly (no surprise, the minimiser is µ). For the sake of this example, design a Langevin MCMC method to optimise this cost function and show its properties.

Solution. We can convert the optimisation problem into a sampling problem by defining the
target density as

p? (x) ∝ exp(−βf (x)).

One can argue that we will not know the normalising constant – which is exactly true. In general,
to solve the problem minx∈Rd f (x), one constructs a target density p? (x) ∝ e−βf (x) . To sample

Figure 5.7: Random walk Metropolis-Hastings for a mixture of two Gaussians. Each row shows the density and its histogram, the Markov chain trace, and the autocorrelation. The top panel shows the situation where σ_q = 0.5, so the chain gets stuck in modes. This causes a high autocorrelation, and as such, this sampler is not considered to be a good one. When we set σ_q = 4, the chain exhibits low autocorrelation and is a good sampler.

from this density, we resort to a modified version

X_{n+1} = X_n + γ ∇ log p_⋆(X_n) + √(2γ/β) V_n,

where β is a parameter called the inverse temperature. Following the same logic as in Example 5.14, we can see that we have the target distribution

p_⋆^β(x) = N(x; µ, 2σ⁴/(β(2σ² − γ))).

One can see that as β → ∞, we have p_⋆^β(x) → δ_µ(x), i.e., the target distribution is a Dirac delta at µ. This is an example of a more general result where sampling from p_⋆^β(x) ∝ exp(−βf(x)) (which is what the sampler is doing) leads to distributions that concentrate on the minima of f(x) as β → ∞.
In our case, for large β, the distribution is concentrated around µ, which is the minimiser of f (and the mode of the target density). Therefore, samples from this distribution will be very close to µ. The error can be verified and quantified in a number of challenging and nonconvex settings (Zhang et al., 2023).

5.7 monitoring and postprocessing mcmc output
There are a number of ways to monitor the MCMC samples to ensure that the algorithm is
working as expected. We will discuss a few of them here.

5.7.1 trace plots


The simplest way to monitor the MCMC output is to plot the trace of the samples. This is a plot of
the samples against the iteration number. This is what we have been doing in previous examples.
If the trace plot shows that the chain is still “moving” (e.g., drifting towards a region of the space rather than fluctuating around it), then you can conclude that the chain has not yet converged. On the other hand, a trace plot from MH can also show that the chain is stuck. It is therefore straightforward to diagnose simple convergence issues from trace plots.

5.7.2 autocorrelation plots


The autocorrelation plot is a plot of the autocorrelation function of the samples. The autocorrela-
tion function is defined as
ρ_k = Cov(X_t, X_{t+k}) / Var(X_t).

This can be empirically computed on the samples coming from the Markov chain (xk )k∈N . Since
the aim of MCMC is to obtain nearly independent samples from a target p? , we expect a good
MCMC chain to exhibit low autocorrelation. A bad chain which is not mixing well will exhibit
high autocorrelation. An example can be seen from Fig. 5.7 and see its caption for more details.
One way to choose the proposal variance is to ensure that the chain has a low autocorrelation.
This is a very simple way to monitor the chain.

5.7.3 effective sample size


There is a notion of effective sample size for MCMC methods. However, its computation is
trickier than the IS one and it is usually implemented using software packages. The definition of
the ESS for MCMC chains is given as

ESS = N / (1 + 2 ∑_{k=1}^{∞} ρ_k),

where ρk is the autocorrelation function. The ESS is an approximate measure of the number of
independent samples that we have. For example, if the chain exhibits no autocorrelation, then
the ESS is equal to the number of samples. If the chain exhibits high autocorrelation, then the
ESS will be very low, as the sum in denominator will be large.
The computation of effective sample size in MCMC is usually done by software packages. We
will not go into the details of this computation here.
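Nevertheless, a minimal Python sketch of the empirical autocorrelation and the ESS formula above is given below; truncating the autocorrelation sum at the first negative lag is a common heuristic and an assumption here, as is the AR(1) test chain.

# Sketch: empirical autocorrelation and effective sample size of an MCMC chain.
import numpy as np

def autocorrelation(samples, max_lag=100):
    x = np.asarray(samples) - np.mean(samples)
    var = np.var(samples)
    n = len(x)
    return np.array([np.mean(x[:n - k] * x[k:]) / var for k in range(max_lag + 1)])

def effective_sample_size(samples, max_lag=100):
    rho = autocorrelation(samples, max_lag)[1:]
    neg = np.where(rho < 0)[0]
    if len(neg) > 0:                      # truncate at first negative lag (heuristic)
        rho = rho[:neg[0]]
    return len(samples) / (1.0 + 2.0 * np.sum(rho))

# Example usage on an AR(1) chain with known autocorrelation:
rng = np.random.default_rng(8)
a, N = 0.9, 10000
chain = np.zeros(N)
for t in range(1, N):
    chain[t] = a * chain[t - 1] + rng.normal()
print(effective_sample_size(chain))       # roughly N * (1 - a) / (1 + a), about 526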

5.7.4 thinning the mcmc output


One way to reduce the autocorrelation of the MCMC samples is to thin them. This is a post-
processing step that is done after the MCMC chain has been generated. The idea is to discard

Figure 5.8: Thinning of MCMC samples (acceptance rate ≈ 0.25). We keep every 10th sample for the same mixture of Gaussians example with σ_q = 2. It can be seen that the thinned MCMC chain exhibits significantly lower autocorrelation.

some of the samples and keep only a subset of them. This is done by keeping every kth sample. Since the autocorrelation in an MCMC chain decays naturally over time, after reaching stationarity we can keep every kth sample and discard the rest. This will still give us a chain with the same stationary measure but with a lower autocorrelation. A demonstration of this can be seen from Fig. 5.8.

SEQUENTIAL MONTE CARLO
6
In this chapter, we introduce sequential Monte Carlo (SMC) methods. These methods are used to
approximate a sequence of target distributions rather than just a single, fixed target. This can have
a number of applications, including filtering and smoothing in state space models. We will briefly
introduce state-space models, SMC and its connection to importance sampling, and application of
SMC to filtering in state-space models, which is also called particle filtering.

6.1 introduction
In this section, we depart from our standard setting where we have a single, fixed target p? (x). In
many problems in the real world, the target distributions are evolving over time. For example,
consider tracking a target, a straightforward extension of the source localisation problem we discussed in Example 5.8. Instead of a fixed target and fixed measurements, we could easily have the case of a moving target and fixed or moving sensors. In this case, we could recompute our posterior every time we get a new measurement; however, this could become very prohibitive (imagine that every time you get new data, you need to run a new MCMC chain!). However, the applications of this framework are not limited to simple localisation examples; it generalises broadly to many dynamical systems. A few examples are volatility estimation in financial time series, robotics (tracking and control of moving arms), infectious disease modelling (tracking the spread of a disease), and many more. The idea of an evolving sequence of distributions can also be used to target static problems, as we have seen in the example of simulated annealing.
Our running example in this section will be state-space models. A good example within this setting is the target tracking example, which summarises the notion of a hidden state and a sequence of observations. However, it is crucial to observe that the example generalises to any situation where a hidden, evolving quantity is to be estimated (out in the wild) and a stream of data is received to update our latest belief on the state of the object.

Figure 6.1: The conditional independence structure of a state-space model: a hidden Markov chain x_0 → x_1 → · · · → x_t, with each observation y_k depending only on the corresponding state x_k.

6.2 state-space models


Consider a Markov process (Xt )t≥0 defined on the measurable space X with X ⊂ Rdx . This
process denotes the signal of interest, e.g., the state of an object, the velocity field of a partial
differential equation (PDE), hence we call it the signal process. Similarly, we define another
sequence of random variables (yt )t≥1 , defined on Y ⊂ Rdy , to denote our observation sequence,
or the observation process. This sequence denotes the observed data coming from the signal
process and it typically consists of noisy sensor measurements or noisy observations. Based on these two sequences, we can define a model which we call a state-space model. This is typically specified by three distributions (Doucet et al., 2000):
X0 ∼ µ(x0 )
Xt |{Xt−1 = xt−1 } ∼ f (xt |xt−1 ),
Yt |{Xt = xt } ∼ g(yt |xt ),
where µ is called the prior distribution, f is a Markov transition kernel defined on X, and g is called the likelihood function. For convenience, we always assume the densities exist in this document
but a general construction is possible. See Fig. 6.1 for the conditional independence structure of
this class of models.
An important consequence of the conditional independence structure described above is that
we can write the joint distribution as
π̄_t(x_{0:t}, y_{1:t}) = µ(x_0) ∏_{k=1}^{t} f(x_k|x_{k−1}) g(y_k|x_k).

This is going to be important. To see the analogy with standard Bayesian inference, recall
that with a given prior p(x) and a likelihood p(y|x), we could write the joint distribution as
p(x, y) = p(y|x)p(x). This is a similar structure above for the joint distribution.
Recall also that we have used joint distributions as unnormalised densities throughout. This
will be the same in this section where the joint distribution above will act like an unnormalised
density. Another way to look at it is to consider y1:t completely fixed. This also makes it clear
that we have distributions of the form πt (x0:t ) (normalised) and π̄t (x0:t ) (unnormalised). We
will then apply what we have covered in Chapter 4 to this setting. The final thing to note is the
definition of the marginal likelihood in this setting. Let p(y1:t ) denote the marginal likelihood of
the observations y1:t , then we can write
p(y_{1:t}) = ∫ µ(x_0) ∏_{k=1}^{t} f(x_k|x_{k−1}) g(y_k|x_k) dx_{0:t}.   (6.1)

6.2.1 the filtering problem
Given a sequence of observations y1:t , a typical problem is to estimate the conditional distributions
of the signal process (Xt )t≥0 given the observed data. We denote this distribution with πt (xt |y1:t )
which is called the filtering distribution. The problem of sequentially updating the sequence of
filtering distributions (πt (xt |y1:t ))t≥1 is called the filtering problem.
To introduce the idea intuitively, consider the scenario of tracking a target. We denote the
states of the target with (xt )t≥0 which may include positions and velocities. We assume that
the target moves in space w.r.t. f , i.e., the transition model of the target is given by f (xt |xt−1 ).
Observations may consist of the locations of the target on R2 or power measurements with
associated sensors (which may result in high-dimensional observations). At each time t, we
receive a measurement vector yt conditional on the true state of the system xt . The likelihood of
each observation is assumed to follow g(yt |xt ).
We now provide a simple recursion to demonstrate one possible solution to the filtering
problem. Assume that we are given the distribution at time t − 1 (to define our sequential
recursion) and would like to incorporate a recent observation yt . One way to do so is to first
perform prediction
ξ_t(x_t|y_{1:t−1}) = ∫ f(x_t|x_{t−1}) π_{t−1}(x_{t−1}|y_{1:t−1}) dx_{t−1},   (6.2)

and obtain the predictive measure, and then perform the update

π_t(x_t|y_{1:t}) = ξ_t(x_t|y_{1:t−1}) g(y_t|x_t) / p(y_t|y_{1:t−1}),   (6.3)

where p(y_t|y_{1:t−1}) = ∫ ξ_t(x_t|y_{1:t−1}) g(y_t|x_t) dx_t is the incremental marginal likelihood.

Remark 6.1. We remark that the celebrated Kalman filter (Kalman, 1960) exactly implements
recursions (6.2)–(6.3) in the case of

µ(x0 ) = N (x0 ; µ0 , Σ0 ),
f (xt |xt−1 ) = N (xt ; Axt−1 , Q),
g(yt |xt ) = N (yt ; Cxt , R).

For this Gaussian system, computing the integral (6.2) and the update (6.3) is analytically tractable,
which results in Kalman filtering recursions of the mean and the covariance of the filtering
distribution πt (xt |y1:t ). We skip the update rules of the Kalman filter, as our main aim is to focus
on sequential Monte Carlo in this course.

Finally, we can move on to show how to update joint filtering distribution of the states x0:t .

To see this, note the recursion

π_t(x_{0:t}|y_{1:t}) = π̄_t(x_{0:t}, y_{1:t}) / p(y_{1:t})
                     = [π̄_{t−1}(x_{0:t−1}, y_{1:t−1}) / p(y_{1:t−1})] × [f(x_t|x_{t−1}) g(y_t|x_t) / p(y_t|y_{1:t−1})]
                     = π_{t−1}(x_{0:t−1}|y_{1:t−1}) f(x_t|x_{t−1}) g(y_t|x_t) / p(y_t|y_{1:t−1}).
This recursion will be behind the sequential Monte Carlo method we use for filtering in the next
sections.

6.3 sequential monte carlo for filtering


6.3.1 importance sampling: recap
Before we introduce the sequential Monte Carlo sampling for filtering, we recall the basic im-
portance sampling idea and its terminology accounting for the change of notation within this
chapter. Assume that we aim at estimating expectations of a given density π, i.e., we would like
to compute
E_π[ϕ(X)] = ∫ ϕ(x) π(x) dx.

We also assume that sampling from this density is not possible and we can only evaluate the
unnormalised density π̄(x). One way to estimate this expectation is to sample from a proposal
measure q and rewrite the integral as
E_π[ϕ(X)] = ∫ ϕ(x) π(x) dx
          = [∫ ϕ(x) (π̄(x)/q(x)) q(x) dx] / [∫ (π̄(x)/q(x)) q(x) dx]
          ≈ [(1/N) ∑_{i=1}^{N} ϕ(x^{(i)}) π̄(x^{(i)})/q(x^{(i)})] / [(1/N) ∑_{i=1}^{N} π̄(x^{(i)})/q(x^{(i)})],   x^{(i)} ∼ q,  i = 1, . . . , N.   (6.4)

Let us now introduce the unnormalised weight function¹

W(x) = π̄(x)/q(x).   (6.5)
With this, Eq. (6.4) becomes

ϕ̂_IS^N = [(1/N) ∑_{i=1}^{N} ϕ(x^{(i)}) W(x^{(i)})] / [(1/N) ∑_{i=1}^{N} W(x^{(i)})]
        = [∑_{i=1}^{N} ϕ(x^{(i)}) W^{(i)}] / [∑_{i=1}^{N} W^{(i)}],   x^{(i)} ∼ q,  i = 1, . . . , N,
¹ More technically, these weights are the evaluations of the Radon-Nikodym derivative W(x) = dγ/dq(x) (which, in this case, is just a ratio as we assume absolute continuity implicitly).

where W^{(i)} = W(x^{(i)}) are called the unnormalised weights. Finally, we can obtain the estimator in a more convenient form,

ϕ̂_IS^N = ∑_{i=1}^{N} w^{(i)} ϕ(x^{(i)}),

by introducing the normalised importance weights

w^{(i)} = W^{(i)} / ∑_{j=1}^{N} W^{(j)},   (6.6)

for i = 1, . . . , N. We note that the particle approximation of π in this case is given as

π^N(x) dx = ∑_{i=1}^{N} w^{(i)} δ_{x^{(i)}}(x) dx.   (6.7)

In the following section, we will derive the importance sampler aiming at building particle
approximations of πt (x0:t |y1:t ) for a state-space model.

6.3.2 importance sampling for state-space models: the emergence of


the general particle filter
In this section, we simply derive an importance sampler for the joint filtering distribution π_t(x_{0:t}|y_{1:t}). We will see in the process that the particle filter is a special case of this conceptually simple importance sampler (defined just in many variables instead of one) and that the famous bootstrap particle filter is a further simplified case.
Let us assume that, in order to build an estimator of πt (x0:t |y1:t ), we have a proposal distribu-
tion over the entire path space x0:t denoted q(x0:t ). Note that, we also denote the unnormalised
distribution of x0:t as π̄(x0:t , y1:t ) which is given as
π̄(x_{0:t}, y_{1:t}) = µ(x_0) ∏_{k=1}^{t} f(x_k|x_{k−1}) g(y_k|x_k).   (6.8)

This is simply the joint distribution of all variables (x_{0:t}, y_{1:t}). Just as in the regular importance sampling case in Eq. (6.5), we write

W_{0:t}(x_{0:t}) = π̄(x_{0:t}, y_{1:t}) / q(x_{0:t}).

Obviously, given samples from the proposal x_{0:t}^{(i)} ∼ q(x_{0:t}), one can easily build the same weighted measure as in (6.7) on the path space by evaluating the weights W_{0:t}^{(i)} = W_{0:t}(x_{0:t}^{(i)}) for i = 1, . . . , N and building a particle approximation

π^N(x_{0:t}) dx_{0:t} = ∑_{i=1}^{N} w_{0:t}^{(i)} δ_{x_{0:t}^{(i)}}(x_{0:t}) dx_{0:t},

where

w_{0:t}^{(i)} = W_{0:t}^{(i)} / ∑_{j=1}^{N} W_{0:t}^{(j)}.

However, this would be an undesirable scheme: we would need to store all variables in memory, which is infeasible as t grows. Furthermore, with the arrival of a new observation y_{t+1}, this would have to be re-done, as this importance sampling procedure does not take into account the dynamic properties of the SSM. Therefore, implementing this sampler to build estimators sequentially is out of the question.
Fortunately, we can design our proposal in certain ways so that this process can be done sequentially, starting from 0 to t. Furthermore, this would allow us to run the filter online and incorporate new observations. Clever choices of the proposal here lead to a variety of different particle filters, as we shall see next. Let us consider a decomposition of the proposal

q(x_{0:t}) = q(x_0) ∏_{k=1}^{t} q(x_k|x_{0:k−1}).

Note that, based on this, we can build a recursion for the function W (x0:t ) by writing

W_{0:t}(x_{0:t}) = π̄(x_{0:t}, y_{1:t}) / q(x_{0:t})
                = [π̄(x_{0:t−1}, y_{1:t−1}) / q(x_{0:t−1})] × [f(x_t|x_{t−1}) g(y_t|x_t) / q(x_t|x_{0:t−1})]
                = W_{0:t−1}(x_{0:t−1}) f(x_t|x_{t−1}) g(y_t|x_t) / q(x_t|x_{0:t−1})
                = W_{0:t−1}(x_{0:t−1}) W_t(x_{0:t}).   (6.9)

That is, under this scenario, the weights can be computed recursively: given the weights at time t − 1, one can evaluate W_{0:t}(x_{0:t}) and update the weights. However, this would not solve the infeasibility problem mentioned earlier, as the cost of evaluating the weights using the whole path of samples is still prohibitive. Finally, to remedy this, we can further simplify our proposal to

q(x_{0:t}) = q(x_0) ∏_{k=1}^{t} q(x_k|x_{k−1})

by removing the dependence on the past, essentially choosing a Markov process as a proposal. This allows us to obtain a purely recursive weight computation

W_{0:t}(x_{0:t}) = π̄(x_{0:t}, y_{1:t}) / q(x_{0:t})   (6.10)
                = [π̄(x_{0:t−1}, y_{1:t−1}) / q(x_{0:t−1})] × [f(x_t|x_{t−1}) g(y_t|x_t) / q(x_t|x_{t−1})]   (6.11)
                = W_{0:t−1}(x_{0:t−1}) f(x_t|x_{t−1}) g(y_t|x_t) / q(x_t|x_{t−1})   (6.12)
                = W_{0:t−1}(x_{0:t−1}) W_t(x_t, x_{t−1}),   (6.13)

using only the samples from time t − 1 and time t. The advantage of this scheme is explicit in the
notation: Note that the final weight function Wt only depends on (xt , xt−1 ), but not the whole
past as in (6.9). The function Wt (xt , xt−1 ) is called the incremental weight function.

6.3.3 sequential importance sampling
We can now see how the one-step update of this sampler works given a new observation. Assume that we have computed the unnormalised weights W_{1:t−1}^{(i)} = W(x_{0:t−1}^{(i)}) recursively and obtained samples x_{0:t−1}^{(i)}. As we mentioned earlier, we only need the last sample x_{t−1}^{(i)} to obtain the weight update given in (6.13). And also note that W_{1:t−1}^{(i)} for i = 1, . . . , N are just numbers; they do not need the storage of previous samples. Given this, we can now sample from the Markov proposal x_t^{(i)} ∼ q(x_t|x_{t−1}^{(i)}) and compute the weights of the path sampler at time t as

W_{1:t}^{(i)} = W_{1:t−1}^{(i)} × W_t^{(i)},

where

W_t^{(i)} = f(x_t^{(i)}|x_{t−1}^{(i)}) g(y_t|x_t^{(i)}) / q(x_t^{(i)}|x_{t−1}^{(i)}).
In other words, given the samples x_{t−1}^{(i)}, we first perform the sampling step

x_t^{(i)} ∼ q(x_t|x_{t−1}^{(i)}),

then compute

W_t^{(i)} = f(x_t^{(i)}|x_{t−1}^{(i)}) g(y_t|x_t^{(i)}) / q(x_t^{(i)}|x_{t−1}^{(i)}),

and update

W_{1:t}^{(i)} = W_{1:t−1}^{(i)} × W_t^{(i)}.

These are unnormalised weights and we normalise them to obtain

w_{1:t}^{(i)} = W_{1:t}^{(i)} / ∑_{j=1}^{N} W_{1:t}^{(j)},

which finally leads to the empirical measure

π^N(x_{0:t}) dx_{0:t} = ∑_{i=1}^{N} w_{1:t}^{(i)} δ_{x_{0:t}^{(i)}}(x_{0:t}) dx_{0:t}.

The full scheme is given in Algorithm 14. This method is called sequential importance sampling
(SIS). This is not very popular in the literature due to the well known weight degeneracy problem.
We next introduce a resampling step to this method and will obtain the first particle filter in this
lecture.

6.3.4 sequential importance sampling with resampling: the general


particle filter
We finally describe the general particle filter by extending the above method with a resampling
step employed after the weighting step. We will show in a practical session that the SIS method

Algorithm 14 Sequential Importance Sampling (SIS)
1: Sample x_0^{(i)} ∼ q(x_0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:    Sample: x_t^{(i)} ∼ q(x_t|x_{t−1}^{(i)}).
4:    Compute weights:

         W_t^{(i)} = f(x_t^{(i)}|x_{t−1}^{(i)}) g(y_t|x_t^{(i)}) / q(x_t^{(i)}|x_{t−1}^{(i)}),

      and update

         W_{1:t}^{(i)} = W_{1:t−1}^{(i)} × W_t^{(i)}.

      Normalise weights:

         w_{1:t}^{(i)} = W_{1:t}^{(i)} / ∑_{j=1}^{N} W_{1:t}^{(j)}.

5:    Report

         π_t^N(x_{0:t}) dx_{0:t} = ∑_{i=1}^{N} w_{1:t}^{(i)} δ_{x_{0:t}^{(i)}}(x_{0:t}) dx_{0:t}.

6: end for

without resampling easily degenerates, i.e., after some time, a single weight approaches 1 and the others approach 0, rendering the method a point estimate. To keep the particle diversity, a resampling step is introduced between the weighting and sampling steps. This step does not introduce a systematic bias, although it adds additional terms to the overall L_p error.
With the additional resampling step, the sequential importance sampling with resampling (SISR) method takes the form given in Algorithm 15. We note that, effectively, the resampling step sets W_{1:t−1}^{(i)} = 1/N for i = 1, . . . , N. Therefore, we only need to compute the last incremental weight and weight our particles with the current weight. Also, note that the resampling step does introduce extra error but does not induce bias, since the moments of π_t^N do not change.

6.3.5 the bootstrap particle filter


In the general particle filter, the proposal q(xt |xt−1 ) is a design choice to be made and this
depends on our specific knowledge of a good proposal for a given system. For example, one can
incorporate future observations into this proposal in an ad hoc way, or use proposal choices like those in the auxiliary particle filter (APF).
A generic choice exists, however, that is simply setting q(xt |xt−1 ) = f (xt |xt−1 ), i.e., using
the transition density of the SSM under consideration as a proposal. The algorithm simplifies
considerably in this case and the resulting method is called the bootstrap particle filter (BPF)
which is given in Alg. 16. This algorithm has multiple appealing intuitive explanations beyond

Algorithm 15 Sequential Importance Sampling with Resampling (SISR)
1: Sample x_0^{(i)} ∼ q(x_0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:    Sample: x̃_t^{(i)} ∼ q(x_t|x_{t−1}^{(i)}).
4:    Compute weights:

         W_t^{(i)} = f(x̃_t^{(i)}|x_{t−1}^{(i)}) g(y_t|x̃_t^{(i)}) / q(x̃_t^{(i)}|x_{t−1}^{(i)}).

      Normalise weights:

         w_t^{(i)} = W_t^{(i)} / ∑_{j=1}^{N} W_t^{(j)}.

5:    Report

         π_t^N(x_t) dx_t = ∑_{i=1}^{N} w_t^{(i)} δ_{x̃_t^{(i)}}(x_t) dx_t.

6:    Resample:

         x_t^{(i)} ∼ ∑_{j=1}^{N} w_t^{(j)} δ_{x̃_t^{(j)}}(x_t) dx_t.

7: end for

the derivation we provided here based on importance sampling. It can most generally be thought of as an evolutionary method. To uncover some of this intuition, see Fig. 6.2.
To elaborate on the interpretation, consider a set of particles x_{t−1}^{(i)} representing the state of the system at time t − 1. If our state-space transition model f(x_t|x_{t−1}) is well-specified (that is, if the underlying system we aim at tracking does indeed move according to f), then the first intuitive step we can take to predict where the state would be at time t is to move the particles according to f, that is, sampling x̃_t^{(i)} ∼ f(x_t|x_{t−1}^{(i)}), which is the first step of the BPF. This gives us a predictive distribution which consists of x̃_t^{(i)} for i = 1, . . . , N. The prediction step (naturally)
does not require observing the data point y_t. Once we observe the data point y_t, we can then use it to evaluate a fitness measure for our particles. In other words, if a predictive particle x̃_t^{(i)} is a good fit to the observation, we would expect its likelihood g(y_t|x̃_t^{(i)}) to be high. Otherwise, this likelihood would be low. Thus, it intuitively makes sense to use our likelihood evaluations as “weights”, that is, to compute a measure of fitness for each particle. That is exactly what the BPF does in the second step by computing weights using the likelihood evaluations. The final step is then to use these relative weights to resample, a step that is used to refine the cloud of particles we have. Simply, the resampling step removes some of the particles with low weights (that are bad fits to the observation) and regenerates the particles with high weights.
The connection to evolutionary methods is clearer within this interpretation. The sampling step in the BPF can be seen as a “mutation” that introduces changes to an individual particle according

Figure 6.2: Intuitive model of BPF (Figure courtesy Victor Elvira).

to some mutation mechanism (in our case, the dynamics). Then, weighting and resampling
correspond to “selection” step, where individual particles are evaluated w.r.t. a fitness measure
coming from the environment (defined by an observation) and individuals are reproduced in a
random manner w.r.t. their fitness.

6.3.6 practical implementation of the bpf


Of course, the BPF can become numerically unstable if the weights are too small or too large. This is in line with the theme we have seen about computing small or large numbers (especially involving normalisation) throughout this course. To avoid a problem here, too, we need to perform the computations in the log domain. For example, after sampling from the proposal x̃_t^{(i)} ∼ f(x_t|x_{t−1}^{(i)}), we can compute the log-weights as

log W_t^{(i)} = log g(y_t|x̃_t^{(i)}).
We can then compute the normalised weights w_t^{(i)} using the trick introduced in Sec. 4.6. This will ensure the numerically stable computation of the weights; a small sketch of this normalisation is given below.
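A minimal Python sketch of this log-domain normalisation follows; the example log-weights are placeholders for log g(y_t|x̃_t^{(i)}).

# Sketch: normalising particle log-weights with the log-sum-exp trick.
import numpy as np

def normalise_logweights(logw):
    # subtracting the maximum prevents overflow/underflow in the exponentials
    logw = np.asarray(logw)
    w = np.exp(logw - np.max(logw))
    return w / np.sum(w)

log_weights = np.array([-1000.2, -1001.5, -999.8])   # e.g. very small likelihoods
print(normalise_logweights(log_weights))              # still well-defined weights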

6.3.7 marginal likelihood computation with bpf


The BPF can naturally be used to compute p(y_{1:t}) sequentially. In order to see this, note that we have the decomposition

p(y_{1:t}) = p(y_{1:t−1}) p(y_t|y_{1:t−1}).

Algorithm 16 Bootstrap particle filter (BPF)
1: Sample x_0^{(i)} ∼ q(x_0) for i = 1, . . . , N.
2: for t ≥ 1 do
3:    Sample: x̃_t^{(i)} ∼ f(x_t|x_{t−1}^{(i)}).
4:    Compute weights:

         W_t^{(i)} = g(y_t|x̃_t^{(i)}).

      Normalise weights:

         w_t^{(i)} = W_t^{(i)} / ∑_{j=1}^{N} W_t^{(j)}.

5:    Report

         π_t^N(x_t) dx_t = ∑_{i=1}^{N} w_t^{(i)} δ_{x̃_t^{(i)}}(x_t) dx_t.

6:    Resample:

         x_t^{(i)} ∼ ∑_{j=1}^{N} w_t^{(j)} δ_{x̃_t^{(j)}}(x_t) dx_t.

7: end for

In order to recursively compute p(y1:t ), we need to estimate p(yt |y1:t−1 ) using the BPF. This is
possible via the use of the predictive density
p(y_t|y_{1:t−1}) = ∫ g(y_t|x_t) ξ_t(x_t|y_{1:t−1}) dx_t.   (6.14)

We note that the predictive density can be built using the particles after sampling (as explained
above). In short, we can build the predictive density
ξ^N(x_t|y_{1:t−1}) dx_t = (1/N) ∑_{i=1}^{N} δ_{x̃_t^{(i)}}(x_t) dx_t,

if the resampling is done at every iteration (otherwise, the weights from the previous iteration have to be used). Plugging this back into Eq. (6.14), we arrive at the empirical estimate of the predictive density

p^N(y_t|y_{1:t−1}) = (1/N) ∑_{i=1}^{N} g(y_t|x̃_t^{(i)}).

Finally, the full marginal likelihood can be computed as

p^N(y_{1:t}) = ∏_{k=1}^{t} p^N(y_k|y_{1:k−1}).   (6.15)

Remarkably, this estimate is unbiased (see Lemma 2 of Crisan et al. (2018) for a proof). This is incredibly useful for many things, including model selection.
Note that the computation of the marginal likelihood requires numerical care, as the product of likelihoods can easily underflow. To avoid this, we can use the log-sum-exp trick and compute the incremental log-marginal likelihood as

log p^N(y_t|y_{1:t−1}) = log ∑_{i=1}^{N} exp( log g(y_t|x̃_t^{(i)}) ) − log N
                       = log ∑_{i=1}^{N} exp( log W_t^{(i)} ) − log N,

and accumulate log p^N(y_{1:t}) = ∑_{k=1}^{t} log p^N(y_k|y_{1:k−1}).

6.4 particle markov chain monte carlo


The unbiasedness property mentioned above of the marginal likelihood estimator can also be
used to build MCMC algorithms for parameter inference in these models. We will look at one
such method here, called particle MCMC (Andrieu et al., 2010). The idea relies on the fact that
one can use “unbiased” estimates of various terms in the acceptance ratio and this would still
leave the original target invariant (Andrieu and Roberts, 2009).
To formalise, the estimator in Eq. (6.15) satisfies

p(y_{1:T}) = E[p^N(y_{1:T})].

Now let us see how we can use this for parameter inference in state-space models. Assume that
we have a parameter θ that we would like to infer in the SSM. We further have a prior distribution
p(θ). The existence of parameter means that the transition density fθ (xt |xt−1 ) and the likelihood
gθ (yt |xt ) are now parameterised by θ. Let us for now fix θ and write the marginal likelihood of
the SSM as given in (6.1) for fixed θ as
p(y_{1:T}|θ) = ∫ π̄_θ(x_{0:T}, y_{1:T}) dx_{0:T}.

It is obvious that for every θ, we can estimate this using the BPF and the estimator in (6.15). To
be completely explicit, for every θ, we can get estimators pN (y1:T |θ) of the marginal likelihood
such that
p(y_{1:T}|θ) = E[p^N(y_{1:T}|θ)].

Assume next for simplicity that we have the knowledge of p(y1:T |θ) and p(θ) and want to design
a Metropolis-Hastings sampler for the posterior p(θ|y1:T ). As we have seen before, we can do
this via choosing a proposal q(θ0 |θ) and writing the Metropolis-Hastings method as
• Sample θ0 ∼ q(θ0 |θn ).
• Compute the acceptance ratio

     α(θ_n, θ') = min{ 1, [p(y_{1:T}|θ') p(θ') q(θ_n|θ')] / [p(y_{1:T}|θ_n) p(θ_n) q(θ'|θ_n)] }.

• Accept θ0 with probability α(θn , θ0 ) and set θn+1 = θ0 .

• Otherwise, reject θ0 and set θn+1 = θn .

We have everything we want to sample from the parameter posterior p(θ|y1:T ) except the fact that
we do not have p(y1:T |θ) in closed form. However, we can use the unbiased estimator pN (y1:T |θ)
instead. This would still leave the target invariant (Andrieu and Roberts, 2009; Andrieu et al.,
2010). Assume now that we have run the particle filter and obtained pN (y1:T |θn ) for the current
parameter θn . Then the particle MH algorithm can be summarised as

• Sample θ0 ∼ q(θ0 |θn ).

• Run the particle filter with θ' and compute the estimator

     p^N(y_{1:T}|θ') = ∏_{k=1}^{T} p^N(y_k|y_{1:k−1}, θ').

• Compute the acceptance ratio

     α(θ_n, θ') = min{ 1, [p^N(y_{1:T}|θ') p(θ') q(θ_n|θ')] / [p^N(y_{1:T}|θ_n) p(θ_n) q(θ'|θ_n)] }.

• Accept θ0 with probability α(θn , θ0 ) and set θn+1 = θ0 .

• Otherwise, reject θ0 and set θn+1 = θn .

Remarkably, this works out of the box, in the sense that it functions as a valid MCMC method and will sample from the marginal p(θ|y_{1:T}) asymptotically!

6.5 examples
We will next consider some examples of the BPF in action.

Example 6.1 (Tracking a moving target in 2D). Let us assume that we would like to track a 2D moving target. In this experiment, we consider a tracking scenario where a target is observed through sensors collecting radio signal strength (RSS) measurements contaminated with additive heavy-tailed noise. The target dynamics are described by the model

xt = Axt−1 + ut ,
where x_t ∈ R⁴ denotes the target state, consisting of its position r_t ∈ R² and its velocity v_t ∈ R², hence x_t = [r_t, v_t]ᵀ ∈ R⁴. Each element in the sequence {u_t}_{t∈N} is a zero-mean Gaussian random vector with covariance matrix Q. The parameters A and Q are selected as

A = [ I₂   κI₂ ;  0   0.99 I₂ ],

and
Q = [ (κ³/3) I₂   (κ²/2) I₂ ;  (κ²/2) I₂   κ I₂ ],

where I2 is the 2 × 2 identity matrix and κ = 0.04. The observation model is given by

yt = Hxt + vt ,

where yt ∈ R2 is the measurement assumed to be noisy. Implement the particle filter for this
problem (see the code companion).
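A minimal Python sketch of the BPF (Algorithm 16) for this model is given below. The dynamics (A, Q, κ) follow the example; the observation matrix H = [I₂ 0] (observing the position), Gaussian observation noise with standard deviation σ_y, the time horizon and the number of particles are all illustrative assumptions, not specified in the example.

# Sketch of the bootstrap particle filter for the 2D tracking model of Example 6.1.
import numpy as np

rng = np.random.default_rng(9)
kappa = 0.04
I2 = np.eye(2)
A = np.block([[I2, kappa * I2], [np.zeros((2, 2)), 0.99 * I2]])
Q = np.block([[(kappa**3 / 3) * I2, (kappa**2 / 2) * I2],
              [(kappa**2 / 2) * I2, kappa * I2]])
H = np.hstack([I2, np.zeros((2, 2))])   # assumed: observe the position only
sigma_y = 0.5                           # assumed observation noise std
T, N = 100, 1000                        # assumed horizon and number of particles

# Simulate a trajectory and observations from the model (to test the filter)
x_true = np.zeros((T + 1, 4))
y = np.zeros((T, 2))
for t in range(1, T + 1):
    x_true[t] = A @ x_true[t - 1] + rng.multivariate_normal(np.zeros(4), Q)
    y[t - 1] = H @ x_true[t] + sigma_y * rng.normal(size=2)

# Bootstrap particle filter
particles = rng.multivariate_normal(np.zeros(4), np.eye(4), size=N)   # x_0^(i) ~ q(x_0)
means = np.zeros((T, 4))
for t in range(T):
    # Sample from the transition (proposal = model dynamics)
    particles = particles @ A.T + rng.multivariate_normal(np.zeros(4), Q, size=N)
    # Weight by the likelihood g(y_t | x_t), computed in the log domain
    resid = y[t] - particles @ H.T
    logw = -0.5 * np.sum(resid**2, axis=1) / sigma_y**2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    means[t] = w @ particles             # filtering mean estimate
    # Resample (multinomial)
    idx = rng.choice(N, size=N, p=w)
    particles = particles[idx]

print(np.mean(np.linalg.norm(means[:, :2] - x_true[1:, :2], axis=1)))  # mean position error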

Example 6.2 (Tracking Lorenz 63 system). In this example, we consider the tracking of a discre-
tised stochastic Lorenz 63 system given by

x1,t = x1,t−1 − γs(x1,t − x2,t ) + γξ1,t ,

x2,t = x2,t−1 + γ(rx1,t − x2,t − x1,t x3,t ) + γξ2,t ,

x3,t = x3,t−1 + γ(x1,t x2,t − bx3,t ) + γξ3,t ,

where γ = 0.01, r = 28, b = 8/3, s = 10, and ξ1,t , ξ2,t , ξ3,t ∼ N (0, 1) are independent Gaussian
random variables. The observation model is given by

yt = [1, 0, 0]xt + ηt ,

where ηt ∼ N (0, σy2 ) is a Gaussian random variable. Implement the particle filter for this problem
(see the code companion).
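A minimal sketch of the bootstrap particle filter for this model follows, simulating the dynamics explicitly from the previous state. The observation noise level σy, the number of particles, the time horizon, and the initialisation are illustrative assumptions; see the code companion for the full experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, r, b, s = 0.01, 28.0, 8.0 / 3.0, 10.0
sigma_y, T, N = 1.0, 2000, 1000

def lorenz_step(x, rng):
    """One step of the discretised stochastic Lorenz 63 dynamics for an (n, 3) array of states."""
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    xi = rng.standard_normal(x.shape)
    return np.column_stack([
        x1 - gamma * s * (x1 - x2) + gamma * xi[:, 0],
        x2 + gamma * (r * x1 - x2 - x1 * x3) + gamma * xi[:, 1],
        x3 + gamma * (x1 * x2 - b * x3) + gamma * xi[:, 2],
    ])

# simulate the hidden trajectory and observations of the first coordinate
x_true, y = np.zeros((T, 3)), np.zeros(T)
x = np.ones((1, 3))
for t in range(T):
    x = lorenz_step(x, rng)
    x_true[t] = x[0]
    y[t] = x[0, 0] + sigma_y * rng.standard_normal()

# bootstrap particle filter
particles = rng.standard_normal((N, 3))      # arbitrary initialisation (assumption)
est = np.zeros((T, 3))
for t in range(T):
    particles = lorenz_step(particles, rng)
    logw = -0.5 * (y[t] - particles[:, 0])**2 / sigma_y**2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est[t] = w @ particles                   # filtering mean estimate
    particles = particles[rng.choice(N, size=N, p=w)]
```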

BIBLIOGRAPHY

Agapiou, Sergios; Papaspiliopoulos, Omiros; Sanz-Alonso, Daniel; and Stuart, Andrew M. 2017.
Importance sampling: Intrinsic dimension and computational cost. In Statistical Science, pp.
405–431. Cited on p. 81.

Akyildiz, Omer Deniz. March 2019. Sequential and adaptive Bayesian computation for inference
and optimization. Ph.D. thesis, Universidad Carlos III de Madrid. Can be accessed from:
http://akyildiz.me/works/thesis.pdf. Cited on pp. 67, 69, and 77.

Akyildiz, Ömer Deniz and Míguez, Joaquín. 2021. Convergence rates for optimised adaptive
importance samplers. In Statistics and Computing, vol. 31, no. 2, pp. 1–17. Cited on pp. 80
and 81.

Andrieu, Christophe; Doucet, Arnaud; and Holenstein, Roman. 2010. Particle Markov chain
Monte Carlo methods. In Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), vol. 72, no. 3, pp. 269–342. Cited on pp. 137 and 138.

Andrieu, Christophe and Roberts, Gareth O. 2009. The pseudo-marginal approach for efficient
Monte Carlo computations. In The Annals of Statistics, vol. 37, no. 2, pp. 697–725. Cited on pp. 137 and 138.

Barber, David. 2012. Bayesian reasoning and machine learning. Cambridge University Press.
Cited on p. 57.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer. Cited on
p. 57.

Cemgil, A Taylan. 2014. A tutorial introduction to Monte Carlo methods, Markov Chain Monte
Carlo and particle filtering. In Academic Press Library in Signal Processing, vol. 1, pp. 1065–1114.
Cited on p. 105.

Crisan, Dan; Míguez, Joaquín; and Ríos-Muñoz, Gonzalo. 2018. On the performance of paral-
lelisation schemes for particle filtering. In EURASIP Journal on Advances in Signal Processing,
vol. 2018, pp. 1–18. Cited on p. 137.

Devroye, Luc. 1986. Non-Uniform Random Variate Generation. Springer-Verlag, New York. Cited on p. 12.

Douc, Randal; Moulines, Eric; Priouret, Pierre; and Soulier, Philippe. 2018. Markov chains.
Springer. Cited on pp. 98 and 102.

Douc, Randal; Moulines, Éric; and Stoffer, David. 2013. Nonlinear Time Series: Theory, Methods
and Applications with R Examples. Chapman & Hall. Cited on p. 98.

Doucet, Arnaud; Godsill, Simon; and Andrieu, Christophe. 2000. On sequential Monte Carlo
sampling methods for Bayesian filtering. In Statistics and computing, vol. 10, no. 3, pp. 197–208.
Cited on p. 127.

Elvira, Víctor; Martino, Luca; and Robert, Christian P. 2018. Rethinking the effective sample size.
In International Statistical Review. Cited on p. 90.

Hwang, Chii-Ruey. 1980. Laplace’s method revisited: weak convergence of probability measures.
In The Annals of Probability, pp. 1177–1182. Cited on p. 120.

Kalman, Rudolph Emil. 1960. A new approach to linear filtering and prediction problems. In
Journal of Fluids Engineering, vol. 82, no. 1, pp. 35–45. Cited on p. 128.

Lamberti, Roland; Petetin, Yohan; Septier, François; and Desbouvries, François. 2018. A double
proposal normalized importance sampling estimator. In 2018 IEEE Statistical Signal Processing
Workshop (SSP), pp. 238–242. IEEE. Cited on p. 80.

Martino, Luca; Luengo, David; and Míguez, Joaquín. 2018. Independent random sampling
methods. Springer. Cited on pp. 13, 22, 26, and 40.

Murphy, Kevin P. 2007. Conjugate Bayesian analysis of the Gaussian distribution. Technical
report, University of British Columbia. Cited on pp. 44 and 54.

———. 2022. Probabilistic machine learning: an introduction. MIT press. Cited on p. 57.

Owen, Art B. 2013. Monte Carlo theory, methods and examples. Cited on p. 90.

Robert, Christian P and Casella, George. 2004. Monte Carlo statistical methods. Springer. Cited
on pp. 67 and 88.

———. 2010. Introducing Monte Carlo methods with R, vol. 18. Springer. Cited on p. 77.

Yıldırım, Sinan. 2017. Sabanci University IE 58001 Lecture notes: Simulation Methods for
Statistical Inference. Cited on pp. 32, 97, and 106.

Zhang, Ying; Akyildiz, Ömer Deniz; Damoulas, Theodoros; and Sabanis, Sotirios. 2023.
Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in
nonconvex optimization. In Applied Mathematics & Optimization, vol. 87, no. 2, p. 25. Cited on
p. 123.
