

PHILIPP HENNIG
SCRIBES:
FREDERIK KÜNSTNER (2018/19)
ANN-KATHRIN SCHALKAMP (2019)
TIM REBIG (2020)
ZAFIR STOJANOVSKI (2021)

PROBABILISTIC
MACHINE
LEARNING

LECTURE NOTES
Questions about this document (typos, structure, etc.) should be directed to
Zafir Stojanovski at [email protected]

Questions regarding the content should be directed to


Philipp Hennig at [email protected]

Lecture notes prepared for the course


Probabilistic Machine Learning
given by Professor Philipp Hennig at the University of Tübingen.

Thanks to Felix Dangel, Ann-Kathrin Schalkamp and Tim Rebig for


their feedback on the document.

Last edited July 27, 2021.


Contributors: Philipp Hennig, Frederik Künstner, Ann-Kathrin
Schalkamp, Tim Rebig, Zafir Stojanovski
Introduction

Probabilistic inference is a foundation of scientific reasoning,


statistics, and machine learning. The goal of this course is to es-
tablish a formal framework for probabilistic reasoning, show how
to use it to build powerful inference mechanisms for real-world
problems and develop the technical tools necessary to implement
inference in practice.
The lecture begins with a general introduction to basic principles
of the rules of probability theory, then covers the probabilistic view
on standard settings such as supervised regression and classifica-
tion, unsupervised dimensionality reduction and clustering.
In a parallel thread through the lecture, we will also encounter a
number of popular algorithms for inference in probabilistic models,
including exact inference in Gaussian models, sampling, and free-
energy methods.
Some of the points we will cover are

• Connections between probabilistic inference and Boolean logic

• Learning functional relationships between variables.

• Establish a formal framework for probable reasoning

• A generalization from “shallow” and “deep” to “structured”


learning

• A general toolbox for encoding structured domain knowledge in


a learning agent, transferring it into a concrete algorithm

And much more to give a joint, connected, holistic view on reason-


ing, inference, learning and intelligence.
Contents

Reasoning under Uncertainty v 7

Probabilistic Reasoning v 15

Probabilities over Continuous Variables v 21

Monte Carlo Methods v 27

Markov Chain Monte Carlo v 33

Gaussian probability distributions v 39

Parametric Gaussian Regression v 45

Hierarchical Inference: learning the features v 55

Gaussian Processes v 61

Understanding Kernels v 73

Gauss-Markov Models v 83

Gaussian Process Classification v 91



Generalized Linear Models v 99

Exponential Family v 107

Graphical Models v 115

Factor graphs v 121

The Sum-Product Algorithm v 125

Extended Example: Topic Modeling v 133

Latent Dirichlet Allocation v 141

Efficient Inference and K-Means v 147

Mixture Models & EM v 151

Free Energy v 157

Variational Inference v 165

Customizing models and algorithms v 173

Making decisions v 181

Bibliography 189
Reasoning under Uncertainty v

An inference problem requires statements about the value of an unobserved (latent) variable x based on observations y which are related to x, but may not be sufficient to fully determine x. This requires a notion of uncertainty.

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon de Laplace
We hope in this chapter to give an intuition on reasoning under
incomplete information as well as present a rigorous mathematical
construction of the foundation of modern probability theory.

Examples

A Card Trick v
Three cards with colored faces are placed into a bag. One card is
red on both faces, one is white on both faces and the last is red on
one face and white on the other - see Fig. 1. After mixing the bag,
we pick a card and see that one face is white. What is the color of
its other face?
While we do not have direct information about the back of the
card, we can use what we know about the setup to make an educated
guess about the probability that it is also white. Make a guess;
is it 1/2? 2/3? Something else? We will revisit the problem at a later
stage.

Figure 1: A card trick

Deductive and Plausible Reasoning v


Classically, computers do not allow for handling uncertainties out
of the box and their reasoning is based on propositional logic.
Propositions are statements which can either be true or false. As an
example consider the scenario where we have two statements A =
"It rains" and B = "The street is wet". Given the rule A ⇒ B, which
translates to "If it rains the street gets wet", the framework only
licenses two inferences: A ⇒ B itself, and its contrapositive ¬ B ⇒ ¬ A,
corresponding to "If the street is dry it cannot have rained". For the
other two combinations of truth values of A and B the truth content
cannot be determined, although it might be more plausible that the
street is dry given that it has not rained. This limitation raises the
necessity of extending the formalism of binary truth values to a
spectrum between true and false.

Probability theory provides a framework which allows us to distribute
a finite amount of truth and allows to make much more subtle
statements about the relationship between variables/propositions.

Deductive Reasoning: A ⇒ B — if A is true, then B is true

A is true, thus B is true (modus ponens)
B is false, thus A is false (modus tollens)

Plausible Reasoning: P( B | A) > P( B) — if A is true, then B becomes more plausible

A is true, thus B becomes more plausible
B is false, thus A becomes less plausible
B is true, thus A becomes more plausible
A is false, thus B becomes less plausible

A helpful mental image to think about probabilities is to imagine
truth as a finite amount of "mass" which can be spread over
a space of mutually exclusive "elementary" events. More events
can then be constructed from unions and intersections of sets of
elementary events. An example of this construction is roulette –
see Fig. 2 – where the numbers 0 − 36 constitute the set of elementary
events and Red/Black, Odd/Even, Low/High and more elaborate
combinations are constructed from these elementary events.

Figure 2: Events for a Roulette

Formalization v

The goal of this section is to establish a formal framework for probable
reasoning. For this purpose, we will introduce Kolmogorov's
probability theory. Published in 1933, the approach by the Soviet
mathematician Andrey Kolmogorov¹ still lays the foundation of
modern probability theory and is based on an axiomatic system.

¹ Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. 1933

Kolmogorov's Axioms are a pure mathematical construction. We
first present a simplified form of the axioms;

Definition 1 (Kolmogorov's Axiom (Simplified)). Let Ω be a space
of possible "elementary" events, such as samples or propositions,
and let F be the set of all possible subsets of Ω. The probability p
of an event A ∈ F is a real map p : F → R that has the following
three properties:

1. Non-Negativity: For all A ∈ F , 0 ≤ p( A) ≤ 1.

2. Normalization: p(Ω) = 1.

3. Additivity: If A and B are mutually exclusive, then
p( A ∪ B ) = p( A ) + p( B ).

Kolmogorov defines measures on sets using A ∩ B, A ∪ B, Ā. The shortcut p( A, B) is often used for p( A ∧ B ).

Contemporary formal form v


Now that we have an intuition, we only need to introduce some
mathematical objects used in Kolmogorov’s definitions to give the
full and more abstract formulation of the axioms.

Definition 2 (σ-algebra, measurable sets & spaces). Let E be a space


of elementary events. Consider the power set 2E , and let F ⊂ 2E be
a set of subsets of E. Elements of F are called random events. If F
satisfies the following properties, it is called a σ-algebra.

1. E ∈ F

2. ( A, B ∈ F ) ⇒ ( A − B ∈ F )

3. ( A1 , A2 , · · · ∈ F ) ⇒ ( ⋃_{i=1}^∞ Ai ∈ F ) ∧ ( ⋂_{i=1}^∞ Ai ∈ F )

(this implies ∅ ∈ F . If E is countable, then 2E is a σ-algebra). If F


is a σ-algebra, its elements are called measurable sets, and ( E, F ) is
called a measurable space (or Borel space).

Sigma-algebras are very abstract objects and in most cases cannot
be written down explicitly. Nevertheless, their stability under the
stated operations on sets enables them to describe all possible events in
our space E. Moreover, the σ-additivity property guarantees that
truth is neither created out of thin air, nor destroyed.

Definition 3 (Measure & Probability Measure). Let ( E, F ) be a


measurable space (aka. Borel space). A nonnegative real function P :
F → R0,+ is called a measure if it satisfies the following properties:

1. P(∅) = 0

2. For any countable sequence { Ai ∈ F }_{i=1,...} of pairwise disjoint
sets (Ai ∩ Aj = ∅ if i ≠ j), P satisfies countable additivity (aka.
σ-additivity):

P( ⋃_{i=1}^∞ Ai ) = ∑_{i=1}^∞ P( Ai ).

The measure P is called a probability measure if P( E) = 1 (Note: for


probability measures, 1. is unnecessary). In this setting, ( E, F , P) is
called a probability space.

These two definitions constitute the contemporary formal form


of Kolmogorov’s axioms and give rise to the following theorem:

Theorem 4 (Sum Rule). From A + ¬ A = E we get

P( A) + P(¬ A) = P( E) = 1, thus P( A) = 1 − P(¬ A).

And from A = A ∩ ( B + ¬ B), using the notation P( A, B) = P( A ∩ B)


for the joint probability of A and B, we get the Sum Rule

P( A) = P( A, B) + P( A, ¬ B).

To be able to take into account events which have already oc-


curred, we need to define conditional probabilities.

Definition 5 (Conditional Probability). If P( A) > 0, the quotient


P( B | A) = P( A, B) / P( A)
is called the conditional probability of B given A. It immediately gives

P( A, B) = P( B | A) P( A) = P( A | B) P( B).

It is easy to show that P( B | A) ≥ 0 , P( E | A) = 1 , and for


B ∩ C = ∅, we have P( B + C | A) = P( B | A) + P(C | A). Thus, for
a fixed A, ( E, F , P(· | A)) is a probability space.

The equations P( A, B) = P( B | A) P( A) = P( A | B) P( B) are also
known as the Product Rule. For a thorough treatment, see Chapters
1 and 2 in Probability Theory - the Logic of Science². Using conditional
probabilities, we can now state the following extension of the Sum
Rule:

² Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003. URL bayes.wustl.edu/etj/prob/book.pdf

Theorem 6 (Law of Total Probability). Let A1 + A2 + · · · + An = E
and Ai ∩ Aj = ∅ if i ≠ j. Then, for any X ∈ F ,

P( X ) = ∑_{i=1}^n P( X | Ai ) P( Ai ).

Proof. Because X = E ∩ X = ⋃_{i=1}^n ( Ai ∩ X ), from σ-additivity we get
that

P( X ) = ∑_{i=1}^n P( Ai , X ) = ∑_{i=1}^n P( X | Ai ) P( Ai ),

where the last step uses the definition of conditional probability.

Bayes’ Theorem

Finally, we can state the mechanism that is at the heart of all proba-
bilistic reasoning.

Theorem 7 (Bayes' Theorem). Let A1 + A2 + · · · + An = E and
Ai ∩ Aj = ∅ if i ≠ j. Then, for any X ∈ F ,

P( Ai | X ) = P( Ai ) P( X | Ai ) / ∑_{j=1}^n P( Aj ) P( X | Aj ).

Proof. Apply the Sum Rule to the definition of the conditional


probability.

The language of inference, commonly used in Bayesian inference,
assigns meaning to the terms within Bayes' theorem. Assuming
that X is a hypothesis and D is an observation:

p( X | D ) = p( X ) × p( D | X ) / p( D )

(posterior = prior × likelihood / evidence)

• The posterior is the probability that the hypothesis X is true after


observing D.

• The prior is the probability that the hypothesis X is true before


any observation.

• The likelihood is the conditional probability of observing D given


that the hypothesis is true.

• The evidence is the probability of observing D, regardless of the


truth of hypothesis X.

Bayes’ theorem states how to update the plausibility of the hypoth-


esis X based on the observation data D.
A marginal distribution is the distribution of a subset of variables,
where some variables are averaged out. Given a joint distribution
p( A, B), the marginal p( A) is

p( A) = ∑_{b∈B} p( A | B = b ) p( B = b ),

where B is the space of possible values for B. Note that despite the
name, the prior is not necessarily what you know before seeing the
data, but the marginal distribution P( X ) = ∑_{d∈D} P( X, d) under all
possible data.

See wikipedia.org/wiki/Posterior_probability#Example and wikipedia.org/wiki/Marginal_distribution#Real-world_example for more examples.

Revisiting the card trick v


There are multiple ways to define events to approach the problem;
here is one possible solution: Let C be the card picked out of the
bag with possible values { RR, RW, WW } – for Red-Red, Red-White
and White-White – and let W be the event “the observed face is
white”. The other side of the card is also white iff we have picked
C = WW, so we are interested in the probability p(C = WW |W ).
Applying Bayes’ Theorem, we have that

p(C = WW | W ) = p(W | C = WW ) p(C = WW ) / p(W ).

Filling in numbers,

• the prior probability of picking C = WW is 1/3,

• the probability that the observed face is white given WW is 1,

• because half of the faces are white, p(W ) = 1/2,

leading to p(C = WW | W ) = (1/3)/(1/2) = 2/3. Try to apply the same
strategy to solve the famous Monty Hall problem³!

³ wikipedia.org/wiki/Monty_Hall_problem
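To make the result concrete, here is a small simulation (not part of the original notes; the card encoding, seed and sample size are our own choices) that draws a card and a visible face uniformly at random and estimates p(C = WW | W) by counting:

# Hypothetical check of the card trick by simulation.
import random

random.seed(0)
cards = [("R", "R"), ("R", "W"), ("W", "W")]

white_seen = 0
white_back = 0
for _ in range(100_000):
    card = random.choice(cards)        # pick a card uniformly from the bag
    face = random.randrange(2)         # pick which side faces up
    if card[face] == "W":
        white_seen += 1
        if card[1 - face] == "W":
            white_back += 1

print(white_back / white_seen)         # close to 2/3, not 1/2

The empirical frequency settles near 2/3 rather than 1/2, matching the calculation above.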

Revisiting Propositional Reasoning v

As deductive reasoning is a subset of plausible reasoning, using


Bayes’ theorem we can show the following:

Lemma 8. P( B | A) = 1 implies if A is true, then B is true

A⇒B is equivalent to p( B| A) = 1
¬B ⇒ ¬ A is equivalent to p(¬ A|¬ B) = 1
p( B|¬ A) ≤ p( B) A is false implies B becomes less plausible
p( A| B) ≥ p( A) B is true implies A becomes more plausible.

But plausible reasoning is more general, as one can see when


using conditional probabilities.

Lemma 9. P( B | A) ≥ P( B) implies if A is true, then B becomes


more plausible

p( B| A) ≥ p( B) A is true implies B becomes more plausible


p( B|¬ A) ≤ p( B) A is false implies B becomes less plausible
p( A| B) ≥ p( A) B is true implies A becomes more plausible
p(¬ A|¬ B) ≥ p(¬ A) B is false implies A becomes less plausible.

So far, inputs to the probability were propositional variables;


p( A) is the probability that A is true. In the remainder of this doc-
ument, p( A) is a function over the possible values that A can take,
with a slightly unusual notation. Given two binary variables A, B,
writing
p( A, B) = p( A) p( B)

means all of the following;

p( A = 0, B = 0) = p( A = 0) p( B = 0),

p( A = 0, B = 1) = p( A = 0) p( B = 1),

p( A = 1, B = 0) = p( A = 1) p( B = 0),

p( A = 1, B = 1) = p( A = 1) p( B = 1).

When Bayesian reasoning matches human reasoning v

Flaws of human reasoning can show up as well in the probabilistic


framework. We consider an example based on the poem The Raven⁴,
in which you hear an unexpected tapping t at your door and you
start to ponder who that might be. We strongly recommend the
enjoyable recitation of the poem given in the lecture. If you assume
all candidates v1 , .., vn who might visit you are equally likely, using
Bayes theorem to infer knowledge will be useless. The constant likelihood
terms in the evidence will cancel out, hence data that has a
uniform likelihood under all hypotheses does not provide information,
and therefore does not change the posterior probability.

⁴ https://en.wikipedia.org/wiki/The_Raven

P( v1 | t ) = P( t | v1 ) · P( v1 ) / P( t ) = P( t | v1 ) · P( v1 ) / ( P( t | v1 ) ∑_i P( vi ) ) = P( v1 ) / ∑_i P( vi )

You might also convince yourself before opening the door that one
of the possible hypotheses is more likely than the others, a specific
visitor l you very much like to see. Doing so, you can unknowingly

adjust the prior in such a way that this event dominates the poste-
rior. This is a concern of critics of probabilistic reasoning who often
argue that it is possible to obtain any desired explanation for the
data as long as the hypothesis in question has non-zero probability.

P(ℓ | t) = P(t | ℓ) · P(ℓ) / P(t) = P(t | ℓ) · P(ℓ) / ( P(t | ℓ) P(ℓ) + ∑_{i≠ℓ} P(t | vi ) P(vi ) )

A similar human flaw would be to not accept that one of the possi-
ble hypotheses is actually impossible (the person l might be dead)
and needs to be assigned a prior probability of 0.

P(ℓ | t) = P(t | ℓ) · P(ℓ) / P(t) = P(t | ℓ) · 0 / P(t) = 0

In the poem the person is greeted by darkness there and nothing more
after opening the door. Realizing that there is no visitor creates
an inconsistency in our theory, as all former hypotheses v1 , .., vn
now have probability 0, which contradicts the observed tapping.
The hypothesis space has to contain some explanation for the
observation t. The appropriate construction of the hypothesis space
can be one of the main challenges in practice.

P(v) = 0, P(ℓ) = 0, P(t) = ∑_i P(t | vi ) P(vi ) = 0?

Reasoning that the wind w might have caused the tapping at this
point and adding this new hypothesis reveals a frequent problem of
probabilistic inference. We have to know the correct variables before
we start reasoning and include them. Otherwise, no matter what
prior distribution we choose, the results will be flawed.

P(w | t) = P(t | w) · P(w) / P(t) = P(t | w) · P(w) / ( P(t, w) + ∑_i P(t, vi ) )

If the hypothesis space does not include the correct explanation r


to begin with - in the poem it is a raven which is responsible for
the tapping - probability theory will always assign 0 probability to
r. This problem is independent of the mechanism of probabilistic
reasoning but demands creative thinking when setting up the space
of hypotheses.

Condensed content

The Rules of Probability:

• An inference problem requires statements about the value of an


unobserved (latent) variable x based on observations y which are
related to x, but may not be sufficient to fully determine x. This
requires a notion of uncertainty.

• the Sum Rule:

P( A) = P( A, B) + P( A, ¬ B)

• the Product Rule:

P( A, B) = P( A | B) · P( B) = P( B | A) · P( A)

• Bayes’ Theorem:

P( A | B) = P( B | A) P( A) / P( B) = P( B | A) P( A) / ( P( B, A) + P( B, ¬ A) )

• Bayes’ Theorem provides the mechanism for inference:

P( X | D ) = P( D | X ) · P( X ) / P( D ),

where P( X | D ) is the posterior of X given D, P( D | X ) the likelihood of X under D, P( X ) the prior of X, and P( D ) the evidence for the model.

• The fundamentality of probabilities has been debated at length.


Probabilities are not the only inference system, but they are
uniquely general, expressive, and powerful.

• Machine learning and AI can be approached in various ways.


The probabilistic viewpoint is the closest we have to a theory of
everything for ML.
Probabilistic Reasoning v

Probability Theory can stumble onto computational difficulties,
as the number of parameters required to describe a system grows
exponentially with the number of variables considered. The joint
distribution of n = 26 binary variables A, B, . . . , Z has 2^n − 1 free
parameters, p1 , p2 , . . . , p_{2^n − 1} ,

p( A, B, . . . , Z ) = p1
p(¬ A, B, . . . , Z ) = p2
...
p(¬ A, ¬ B, . . . , Z ) = p_{67 108 863}
p(¬ A, ¬ B, . . . , ¬ Z ) = 1 − ∑_{i=1}^{2^n − 1} pi

Storing the parameters alone would already require ≈ 67Mb of


RAM as we have to keep track of every single hypothesis in a com-
binatorially large space. In addition to a large memory requirement,
computing marginal probabilities such as p( A) is also time con-
suming. Thankfully, under some assumption, we can express the
joint distribution in fewer numbers.

The earthquake and the burglar v

Consider the following scenario, which we will use as an example


for probabilistic reasoning. Assume that you have a home alarm
system that can detect burglars, but can also be triggered by earth-
quakes. Being away from home, you receive a text message from
the alarm system, and you want to assess the probability that your
home is currently being robbed. To get more information, you can
turn on the radio, which will reliably broadcast a message if an
earthquake happened.
Let’s define the following observable variables,

A: The alarm was triggered,


R: The radio announced an earthquake,

and the following latent variables, which we will need to infer,

E: There was an earthquake,


B: There is a burglar in your home.

The joint probability distribution over those four binary variables


would typically need 2^4 − 1 = 15 = 8 + 4 + 2 + 1 parameters to be

fully represented,

p( A, R, E, B) = p( A| R, E, B) p( R| E, B) p( E| B) p( B).

Figure 3: Graphical model for the earthquake burglar example (a directed graph over the nodes E, B, R, A).

However, we can use domain knowledge to remove irrelevant conditions;

• We can assume that the probability of an earthquake is independent
of being robbed, such that p( E| B) = p( E).

• Similarly, we can assume that the radio broadcast does not de-
pend on your house being robbed, such that p( R| E, B) = p( R| E).

• Lastly, we can assume that your home alarm is independent


of the radio broadcast, when conditioned on the occurrence of an
earthquake, that is p( A| R, E, B) = p( A| E, B).

Note that this last point does not imply that the alarm system is in-
dependent of the radio broadcast, p( A| R) ≠ p( A). If an earthquake
increases the probability of false alarms and the probability of radio
broadcast, knowing that there was a radio broadcast increases the
probability that the alarm will go off. Those simplifications lead to
a system with 8 = 4 + 2 + 1 + 1 parameters,

p( A, R, E, B) = p( A| E, B) p( R| E) p( E) p( B).

To start reasoning about the problem, we will need to plug in a


few numbers. We will start by assuming that both earthquakes and
burglars are rare, and that each day has a 1/1,000 chance of seeing
any of them occurring, translating to a frequency of roughly one
earthquake/robbery every three years,

p( E) = 10−3 , p( B) = 10−3 .

We will assume that the radio is perfectly reliable, such that

p( R = 1| E = 1) = 1, p( R = 1| E = 0) = 0.

For the alarm, we will assume that it can send false alarms, with a
rate f = 1/1,000, that a burglar has a α B = 99/100 chance of triggering
it while an earthquake only has a α E = 1/100 chance of triggering it.
This yields the following table of probabilities,

p( A = 0| B = 0, E = 0) = (1 − f ) = 0.999,
p( A = 0| B = 0, E = 1) = (1 − f )(1 − α E ) = 0.98901,
p( A = 0| B = 1, E = 0) = (1 − f )(1 − α B ) = 0.00999,
p( A = 0| B = 1, E = 1) = (1 − f )(1 − α B )(1 − α E ) = 0.0098901,

p( A = 1| B = 0, E = 0) = f = 0.001,
p( A = 1| B = 0, E = 1) = 1 − (1 − f )(1 − α E ) = 0.01099,
p( A = 1| B = 1, E = 0) = 1 − (1 − f )(1 − α B ) = 0.99001,
p( A = 1| B = 1, E = 1) = 1 − (1 − f )(1 − α B )(1 − α E ) = 0.9901099.

Using Bayes’ Theorem, we can now reason about various scenar-


ios. Given the information that our alarm went off, but without
knowledge of the radio broadcast, we can compute the probabil-
ity that there was a break-in. Plugging the numbers above in the
conditional probability formula yields

p( B = 1| A = 1) = p( A = 1, B = 1) / p( A = 1)
               = ∑_{R,E} p( A = 1, B = 1, R, E) / ∑_{B,R,E} p( A = 1, B, R, E) = 0.495.

The somewhat lengthy calculation can be seen here v . If we also
know that the radio broadcasts an announcement about an earthquake v ,

p( B = 1| A = 1, R = 1) = p( A = 1, B = 1, R = 1) / p( A = 1, R = 1)
                        = ∑_E p( A = 1, B = 1, R = 1, E) / ∑_{B,E} p( A = 1, B, R = 1, E) = 0.08.
The phenomenon of reducing the probability of an event by adding
more observation is often referred to as explaining away. In this
example the information about the radio announcement explains
away the break-in as a reason for the alarm.
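These two numbers are easy to verify by brute force. The following sketch (hypothetical code, not from the notes; the helper names are ours, the parameter values follow the text) enumerates the joint p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B) over all sixteen assignments and evaluates both conditionals:

# Hypothetical enumeration of the earthquake/burglar joint distribution.
from itertools import product

pE, pB, f, aB, aE = 1e-3, 1e-3, 1e-3, 0.99, 0.01

def p_alarm(a, b, e):
    # p(A=1 | B=b, E=e) = 1 - (1-f)(1-aB)^b (1-aE)^e, as in the table above
    p1 = 1.0 - (1.0 - f) * (1.0 - aB) ** b * (1.0 - aE) ** e
    return p1 if a == 1 else 1.0 - p1

def p_radio(r, e):
    return float(r == e)          # the radio reports an earthquake iff one happened

def joint(a, r, e, b):
    return (p_alarm(a, b, e) * p_radio(r, e)
            * (pE if e else 1 - pE) * (pB if b else 1 - pB))

# p(B=1 | A=1) -> 0.495...
num = sum(joint(1, r, e, 1) for r, e in product([0, 1], repeat=2))
den = sum(joint(1, r, e, b) for r, e, b in product([0, 1], repeat=3))
print(num / den)

# p(B=1 | A=1, R=1) -> 0.08...
num = sum(joint(1, 1, e, 1) for e in [0, 1])
den = sum(joint(1, 1, e, b) for e, b in product([0, 1], repeat=2))
print(num / den)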

The general recipe for probabilistic reasoning can be summa-


rized as

• Identifying all relevant variables: A, R, E, B

• Defining the joint probability (aka. the generative model):


P( A, R, E, B)

• Fixing certain variables through observing: A = 1

• Performing inference through Bayes’ Theorem

Graphical representation of (in)dependence v


A visual summary of our probabilistic model, shown in Fig. 3,
displays the relationship between variables as a directed graph. Ob-
servable variables are shown in dark nodes, while latent variables
are shown in white nodes.

Definition 10 (Bayesian Network). A Directed Graphical Model


(DGM), or Bayesian Network, is a probability distribution over
variables X1 , . . . , XD with the following structure,
p( X1 , . . . , XD ) = ∏_{i=1}^D p( Xi | pa( Xi )),

where pa( Xi ) are the parental variables of Xi , that is, Xi ∉ pa( Xj ) ∀ Xj ∈


pa( Xi ). A DGM is represented by a Directed Acyclic Graph (DAG)
with the propositional variables as nodes, and arrows from parents
to children.

By the Product rule, every joint probability distribution can be


factorized into a dense DAG. The following factorization,

p( A, E, B, R) = p( A| E, B, R) p( R| E, B) p( E| B) p( B),

leads to one such DAG, but this other factorization leads to an-
other graphical representation where the direction of each edge is
reversed,

p( A, E, B, R) = p( B| A, E, R) p( E| A, R) p( R| A) p( A).

The direction of the arrows is not a causal statement. Representing


the probabilistic model as a DAG is not always useful, unless it
reveals independence, as in Fig. 3, where the factorization leading to
the graphical model is

p( A, E, B, R) = p( A| E, B) p( R| E) p( E) p( B).

Definition 11 (Independence). Two variables A and B are indepen-


dent iff their joint distribution factorizes into marginal distributions,

p( A, B) = p( A) p( B).

In that case p( A| B) = p( A) and we use the notation A ⊥⊥ B.
Information about B does not give information about A and vice
versa.

Note that p( A| B) = p( A) is equivalent to the statement p( A, B) =


p( A) p( B) due to the definition of conditional probability as p( A, B) =
p ( A | B ) p ( B ).

Definition 12 (Conditional independence). Two variables A and B


are conditionally independent given variable C iff their conditional
distribution factorizes,

p( A, B | C ) = p( A|C ) p( B|C ).

In that case we have p( A | B, C ) = p( A|C ), i.e., in light of infor-


mation C, B provides no further information about A. We use the
notation A ⊥⊥ B | C.

v Independence and conditional independence are related but


do not imply each other. Consider the following example; given two
coins, let A be the event that the first coin shows head, let B be the
event that the second coin shows head and let C be the event that
both coins show the same result. A ⊥⊥ B should be intuitive, as
the result of a coin toss does not give information about another
coin toss, but we also have that A ⊥⊥ C and B ⊥⊥ C. To see why
this is, fix the value of a coin, say A is true. Then, we have that
p(C | A = 1) = p( B) = 1/2, which is equal to p(C ). However, we
have that ¬( A ⊥⊥ B | C ), as knowing the output of the second coin and
whether both coins show the same face gives full information on
the result of the first coin.
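Since the example only involves four equally likely outcomes for (A, B), the claims can be checked by direct enumeration; the following sketch (our own illustration, not from the notes) does exactly that:

# Hypothetical check of the two-coin example: C = 1 iff both coins agree.
from itertools import product

outcomes = [(a, b, int(a == b)) for a, b in product([0, 1], repeat=2)]  # each has prob 1/4

def prob(pred):
    return sum(0.25 for (a, b, c) in outcomes if pred(a, b, c))

pA = prob(lambda a, b, c: a == 1)
pC = prob(lambda a, b, c: c == 1)
pAC = prob(lambda a, b, c: a == 1 and c == 1)
print(pAC == pA * pC)                        # True: A and C are independent

# ... but A and B are not conditionally independent given C:
pA_given_BC = prob(lambda a, b, c: a == 1 and b == 1 and c == 1) / prob(lambda a, b, c: b == 1 and c == 1)
pA_given_C = pAC / pC
print(pA_given_BC, pA_given_C)               # 1.0 vs 0.5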

Reading independence from DAGs is easier if we consider subsets
of variables. For subsets of one and two variables, the independence
structure is obvious. Starting with tri-variate structures it gets
more interesting:

p( A, B, C )                  DAG          Independence    But!
p(C | B) p( B| A) p( A)       A → B → C    A ⊥⊥ C | B      ¬( A ⊥⊥ C )
p( A| B) p(C | B) p( B)       A ← B → C    A ⊥⊥ C | B      ¬( A ⊥⊥ C )
p( B| A, C ) p( A) p(C )      A → B ← C    A ⊥⊥ C          ¬( A ⊥⊥ C | B )

Figure 4: Independence structure for tri-variate subgraphs. The graph structures from top to bottom are also called chain graph, fan-out and collider graph.

Note however, that it is not possible to deduce more complex relations
by looking at those simple subgraphs - it is possible, for example,
that A ⊥⊥ C | B but, bringing in a new variable D, we could have
that ¬( A ⊥⊥ C | B, D ). Also, a single DAG does not necessarily reveal
all the independence properties of a probabilistic model and DAGs
are therefore in that sense an incomplete
language.

The DAG for the two coins example is not unique. For exam-
ple, computing the probabilities

p( A = 1) = 1/2,    p( B = 1) = 1/2,
p(C = 1| A = 1, B = 1) = 1,    p(C = 1| A = 0, B = 1) = 0,
p(C = 1| A = 1, B = 0) = 0,    p(C = 1| A = 0, B = 0) = 1,

we have that the conditional probabilities imply that

p ( A | B ) = p ( A ), p ( B | C ) = p ( B ), p ( C | A ) = p ( C ), p ( C | B ) = p ( C ),

leading to the following three possible factorizations,

p( A, B, C ) = p(C | A, B) p( A) p( B),

p( A, B, C ) = p( A| B, C ) p( B) p(C ),
p( A, B, C ) = p( B| A, C ) p( A) p(C ),

each matching a DAG in Fig. 5.

Figure 5: Three possible DAG for the two coin example.

Condensed content

Computing with Probabilities

• Probabilistic reasoning extends propositional logic

• instead of tracking a single truth value, we have to assign probabilities to combinatorially many hypotheses

• Two variables A and B are conditionally independent given


variable C, if and only if their conditional distribution factorizes,

P( A, B|C ) = P( A|C ) P( B|C )

Graphical Models and Conditional Independence

• Multivariate distributions can have exponentially many degrees


of freedom.

• (Conditional) independence helps reduce this complexity to


make things tractable in multi-variate problems.

• Directed graphical models provide a notation from which condi-


tional independence can be read off using simple rules.

• Every probability distribution can be factorized into a DAG, but not every independence structure of a distribution is captured by a DAG of it.
Probabilities over Continuous Variables v

Probability theory extends propositional logic with propo-


sitional variables A, . . . , Z ∈ {0, 1} ranging over the space of all
possible boolean assignments Ω, with a normalized probability
measure p : Ω → [0, 1], such that ∑w∈Ω p(w) = 1. Discrete probabil-
ity theory can also handle variables in a discrete set Ω = {0, 1, . . .}
using a similar probability measure, while continuous probability
theory uses the probability density function p : Ω → R+ to handle
continuous sample spaces, such as Ω = R, with the property that
∫_{w∈Ω} p( w ) dw = 1.
We will later see the precise definitions of the mathematical
objects mentioned in this section. For the moment the notation
should suffice to give a first intuition.
Let X be a random variable taking real values, X ∈ R, and define
the following events: A = ( X ≤ a), B = ( X ≤ b), W = ( a < X ≤ b).
As A and W are mutually exclusive, by the sum rule we have that

p ( B ) = p ( A ) + p (W ) , p (W ) = p ( B ) − p ( A ) .

Thinking of the events as functions of the limits they are checking,
PX ( x ) = P( X ≤ x ), we can use their derivative p( x ) = ∂PX ( x )/∂x to
express this problem using integrands,

p( a < X ≤ b) = PX (b) − PX ( a) = ∫_a^b p( x ) dx.

PX is called the cumulative distribution function (CDF) and p is the
probability density function (PDF).

The Product and the Sum rules apply to the probability density
function, and taken together imply Bayes' rule.

p( x1 | x2 ) = p( x1 , x2 ) / p( x2 )                                    Product rule,

p_{X1}( x1 ) = ∫ p_X ( x1 , x2 ) dx2                                     Sum rule,

p( x1 | x2 ) = p( x1 ) · p( x2 | x1 ) / ∫ p( x1 ) · p( x2 | x1 ) dx1     Bayes' Theorem.

Those rules, however, do not apply to the cumulative distribution
function PX ( x ). Fig. 6 illustrates the joint, marginal and conditional
densities on a two-dimensional example.

Figure 6: Joint probability density function for two variables, highlighting the marginal p(y) (rear panel) and conditional probability density p( x | y = 0) (cutting through the joint density).

The base measure

Probability density functions are only defined relative to a base


measure, and changes of variables need additional care v .

Theorem 13 (Change of Variable for Probability Density Functions).
Let X be a continuous random variable with PDF pX ( x ) over c1 <
x < c2 . And, let Y = u( X ) be a monotonic differentiable function
with inverse X = v(Y ). Then the PDF of Y is

pY (y) = pX (v(y)) · | dv(y)/dy | = pX (v(y)) · | du( x )/dx |^{−1} .

To understand the last factor and its inversion we recommend


to have a look at v . Assume that u is monotonically increasing,
u′( X ) > 0, and let d1 , d2 = u(c1 ), u(c2 ). pY is defined on d1 < y < d2
and we have that the CDF PY is defined w.r.t. the PDF pX as

PY (y) = P(Y ≤ y) = P(u( X ) ≤ y) = P( X ≤ v(y)) = ∫_{c1}^{v(y)} p( x ) dx,

and the PDF pY follows from the CDF PY as

pY ( y ) = ∂PY (y)/∂y = pX (v(y)) ∂v(y)/∂y.

To obtain the absolute value, repeat the previous steps with a
monotonically decreasing change of variable such that u′( x ) < 0.
We can now state the generalization of the former theorem for
multivariate functions.

Theorem 14 (Transformation Law, general). Let X = ( X1 , . . . , Xd )


have a joint density p X . Let g : Rd → Rd be continuously differen-
tiable and injective, with non-vanishing Jacobian Jg . Then Y = g( X )
has density

pY ( y ) = pX ( g^{−1}(y)) · | J_{g^{−1}} (y)|  if y is in the range of g,  and 0 otherwise.

Formal definitions
We now give the promised formal definitions which lead the way to
a rigorous formulation of densities and probabilities on continuous
spaces v . (This is mostly useful to understand other reference material
on the subject; do not worry if the definitions sound too convoluted
at first.) The first challenge arrives when deriving a σ-algebra F
for continuous spaces. We will not use the canonical way by taking
the power set of the elements of our continuous space Ω as our
σ-algebra. This is because the power sets can contain sets which
are not measurable with respect to the Lebesgue measure⁵. This
measure allows the integration of a wider function class than the
Riemann integral and is usually sufficient for the integration of real
valued functions, apart from corner cases.

⁵ https://en.wikipedia.org/wiki/Lebesgue_measure
To approach σ-algebras for continuous spaces, we need to use
open sets. Hence, we start with the definition of topological spaces.

Definition 15 (Topology). Let Ω be a space and τ be a collection of


sets. We say τ is a topology on Ω if

• Ω ∈ τ, and ∅ ∈ τ

• any union of elements of τ is in τ

• any intersection of finitely many elements of τ is in τ.

The elements of the topology τ are called open sets. In the Euclidean
vector space Rd , the canonical topology is that of all sets U that
satisfy x ∈ U :⇒ ∃ε > 0 : ((‖y − x‖ < ε) ⇒ (y ∈ U )).

Note that R is a topological space.

Definition 16 (Borel algebra). Let (Ω, τ ) be a topological space. The


Borel σ-algebra is the σ-algebra generated by τ. That is by taking τ
and completing it to include infinite intersections of elements from
τ, all complements to elements of τ, and restricting all unions of
elements from τ to countably many.

Now that we can define (Borel) σ-algebras on continuous spaces,


we have the tools to define distribution measures.

Definition 17 (Measurable Functions, Random Variables). Let


(Ω, F ) and (Γ, G) be two measurable spaces (i.e. spaces with σ-
algebras). A function X : Ω → Γ is called measurable if X −1 ( G ) ∈ F
for all G ∈ G. If there is, additionally, a probability measure P (see
definition in chapter 1) on (Ω, F ), then X is called a random variable.

Consider (Ω, F ) and (Γ, G). If both F and G are Borel σ-algebras,
then any continuous function X is measurable (and can thus be
used to define a random variable). This is because, for continu-
ous functions, pre-images of open sets are open sets and Borel
σ-algebras are the smallest σ-algebras to contain all those sets.

Definition 18 (Distribution Measure). Let X : Ω → Γ be a random


variable. Then the distribution measure (or law) PX of X is defined for
any G ⊂ Γ as

PX ( G ) = P( X −1 ( G )) = P({ω | X (ω ) ∈ G }).

Definition 19 (Probability Density Functions (pdf’s)). Let B be the


Borel σ-algebra in Rd . A probability measure P on (Rd , B) has a
density p if p is a non-negative (Borel) measurable function on Rd
satisfying, for all B ∈ B
P( B) = ∫_B p( x ) dx =: ∫_B p( x1 , . . . , xd ) dx1 . . . dxd

Note, not all measures have densities (ex. measures with point
masses).

Definition 20 (Cumulative Distribution Function (CDF)). For prob-


ability measures P on (Rd , B), the cumulative distribution function is
the function

F ( x ) = P( ∏_{i=1}^d ( Xi < xi ) ).

(In particular for the univariate case d = 1, we have F ( x ) = P( (−∞, x ] ).)

If F is sufficiently differentiable, then P has a density, given by

p( x ) = ∂^d F / ( ∂x1 · · · ∂xd ),

and, for d = 1,

P( a ≤ X < b) = F (b) − F ( a) = ∫_a^b p( x ) dx.

Example: inference of probabilities v

What is the probability - π - for a person to be wearing


glasses? As we do not know this probability, we can model our
uncertainty about it with a random variable π ranging in [ 0, 1 ]
and thus we learn the probabilities of probabilities. To answer the
question, we can collect some observations X and use inference;

p(π | X ) = p( X | π ) p(π ) / p( X ) = p( X | π ) p(π ) / ∫ p( X | π ) p( π ) dπ.

To define the prior distribution, we can start with a uniform distri-


bution, p ( π ) = 1 if π ∈ [ 0, 1 ] , 0 elsewhere. Assuming we sample
observations independently, the likelihood of a positive or negative
sample, given knowledge of π, is

p ( X = 1 | π ) = π, p ( X = 0 | π ) = 1 − π.

For multiple observations, this process gives rise to a derived variable
which is binomially distributed and depends on π, illustrated in
Fig. 7. In terms of the traditional coin flipping example, the binomial
gives the distribution over the number of times a coin will
show head over N tosses, given a probability of landing head of
π. The probability of sampling n positive and m negative observations,
if the probability of an independent positive observation is
given by π, is

p( n, m | π ) = (n+m choose n) π^n (1 − π )^m .

Figure 7: Probability distribution for the number of heads in N = 10 coin flips with a probability of landing head of f = 1/3.

Plugging this into the computation of the posterior yields

p(π | n, m) = (n+m choose n) π^n (1 − π )^m · 1 / ∫ (n+m choose n) π^n (1 − π )^m · 1 dπ = π^n (1 − π )^m · 1 / ∫ π^n (1 − π )^m · 1 dπ.

A nice choice for the prior p(π ), to make the computation easy,
is the Beta distribution with parameters a, b > 0,

p(π ) = π^{a−1} (1 − π )^{b−1} / Z ,

where Z is a normalization constant to ensure that ∫_0^1 p(π ) dπ = 1
and is given by the Beta function,

Z = B( a, b) = ∫_0^1 π^{a−1} (1 − π )^{b−1} dπ.

The uniform distribution can be represented as a Beta distribution⁶
with parameters a = b = 1. Given a Beta( a, b) prior, n positive and
m negative additional observations, the posterior is then

p(π | n, m) = π^{n+a−1} (1 − π )^{m+b−1} / B( a + n, b + m).

⁶ wikipedia.org/wiki/Beta_distribution
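As a small illustration of this conjugate update (a sketch of ours, assuming scipy is available; the observation counts are made up), the posterior after n positive and m negative samples under a uniform Beta(1, 1) prior is simply Beta(a + n, b + m):

# Hypothetical illustration of the Beta-Binomial conjugate update described above.
from scipy.stats import beta

a, b = 1.0, 1.0          # Beta(1, 1) prior = uniform distribution on [0, 1]
n, m = 7, 3              # observed positive / negative samples (made-up numbers)

post = beta(a + n, b + m)            # posterior is Beta(a + n, b + m), as derived in the text

print("posterior mean:", post.mean())        # (a + n) / (a + n + b + m)
print("density at pi = 0.5:", post.pdf(0.5))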

Condensed content

• Random Variables allow us to define derived quantities from


atomic events

• Borel σ-algebras can be defined on all topological spaces, allow-


ing us to define probabilities if the elementary space is continu-
ous.

• Probability Density Functions (pdf’s) distribute probability


across continuous domains.

– they satisfy "the rules of probability":

∫_{R^d} p( x ) dx = 1

p_{X1}( x1 ) = ∫_R p_X ( x1 , x2 ) dx2                                   Sum rule

p( x1 | x2 ) = p( x1 , x2 ) / p( x2 )                                    Product rule

p( x1 | x2 ) = p( x1 ) · p( x2 | x1 ) / ∫ p( x1 ) · p( x2 | x1 ) dx1     Bayes' Theorem.

– Not every measure has a density, but all pdfs define measures
– Densities transform under continuously differentiable, injective
functions g : x ↦ y with non-vanishing Jacobian as

pY ( y ) = pX ( g^{−1}(y)) · | J_{g^{−1}} (y)|  if y is in the range of g,  and 0 otherwise.

• Probabilistic inference can even be used to infer probabilities.


Monte Carlo Methods v

As the next tool to add to our toolbox, we will look at Monte Carlo
methods.
In many probabilistic inference problems, the main computa-
tional issue is the computation of expectations and marginal proba-
bilities,
E_{p( x)} [ x ] = ∫ x p( x ) dx,    p(y) = E_{p( x)} [ p(y| x ) ] = ∫ p(y| x ) p( x ) dx,

which requires integrating over probability distributions. A simple


solution to approximate those integrals is to replace the integral by
a sum over samples,
∫ x p( x ) dx ≈ (1/n) ∑_{i=1}^n xi ,    ∫ p(y| x ) p( x ) dx ≈ (1/n) ∑_{i=1}^n p( y | xi ),

if the samples xi are sampled independently from p( x ). As a gen-


eral formulation, we want to estimate
φ := ∫ f ( x ) p( x ) dx = E_{p( x)} [ f ( x ) ].

Given independent samples x1 , . . . , xn from p( x ), the estimator


φ̂ = (1/n) ∑_{i=1}^n f ( xi )

is an unbiased estimator of φ, meaning

E_{x1 ,...,xn}[ φ̂ ] = ∫ (1/n) ∑_{s=1}^n f ( xs ) p( xs ) dxs = (1/n) ∑_{s=1}^n ∫ f ( xs ) p( xs ) dxs = (1/n) ∑_{s=1}^n E( f ( xs )) = φ,

and its variance decreases at a rate of O(1/n), so its standard deviation decreases as O(1/√n):

E(φ̂ − E(φ̂))² = E[ ( (1/n) ∑_{s=1}^n f ( xs ) − φ )² ]
             = (1/n²) ∑_{s=1}^n ∑_{r=1}^n [ E( f ( xs ) f ( xr )) − φ E( f ( xs )) − E( f ( xr )) φ + φ² ]
             = (1/n²) ∑_{s=1}^n [ ∑_{r≠s} ( φ² − 2φ² + φ² ) + ( E( f² ) − φ² ) ]
             = (1/n) var( f ) = O(n^{−1}),

where the inner sum over r ≠ s vanishes and E( f² ) − φ² = var( f ).

Definition 21 (Monte Carlo method). Algorithms that compute


expectations in the above way, using samples xi ∼ p( x ), are called
Monte Carlo methods (Stanisław Ulam, John von Neumann).

Examples: Sampling is a rough guess v

We can use sampling to estimate π; as the ratio of the quarter-unit-circle
to the unit square [0, 1] × [0, 1] is π/4, we can write π as an
integration over samples uniformly distributed over the unit square

π = 4 ∫_{[0,1]×[0,1]} 1{ xᵀx < 1 } U ( x ) dx,

where U ( x ) is the PDF of the uniform distribution. This leads to a
simple algorithm for approximating π; draw samples from U ( x )
and count the number of samples that fall within the unit circle
(xᵀx < 1).

Figure 8: Estimating π by sampling

While this procedure only needs ≈ 9 samples to get the first
digit right, it is not great when high precision is required; to get to
single-float precision (≈ 10^−7), it needs about 10^14 samples. Fig. 9
shows the error w.r.t. the number of samples
Figure 9: Error of the Monte-Carlo estimate of π (the estimate φ̂ and the MC error compared to √(var( f )/s), plotted against the number of samples).

Samples from a probability distribution can be used to roughly


estimate expectations, without having to design an elaborate inte-
gration algorithm.
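A minimal sketch of the π example above (not from the notes; the seed and sample size are arbitrary choices):

# Hypothetical Monte Carlo estimate of pi by sampling the unit square.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0.0, 1.0, size=(n, 2))       # uniform samples on the unit square
inside = (x ** 2).sum(axis=1) < 1.0          # indicator 1{x^T x < 1}
pi_hat = 4.0 * inside.mean()                 # Monte Carlo estimate of pi

print(pi_hat)      # fluctuates around 3.14159..., with O(1/sqrt(n)) error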

Sampling

To use a Monte-Carlo method, we need to be able to sample from


p( x ). This is not an easy task in general, but there are algorithms to
turn samples from a uniform distribution into samples from p( x ).

Inverse Transform Sampling v


In some cases, if we know the distribution functions, we can use
the change of variable theorem to find the precise transformation of
uniform variables to variables of the desired distribution.

Suppose we have access to the uniform random variable u ∼


U [0, 1] (i.e. u ∈ [0, 1], and p(u) = 1). Furthermore, suppose that we
want to sample from B( x; α, 1), where

B( x; α, 1) = (1/B(α, 1)) x^{α−1}

with B(α, 1) = Γ(α) Γ(1) / Γ(α + 1) = ∫_0^1 x^{α−1} dx = 1/α,

which gives us

B( x; α, 1) = α x^{α−1}.

Now notice that by setting x = u^{1/α} we get exactly a Beta-distributed
random variable:

p_x( x ) = p_u (u( x )) · | ∂u( x ) / ∂x | = α · x^{α−1} = B( x; α, 1).

This is the easiest sampling method, as the solution is in closed


form. However, it only works for simple distributions where the
CDF and its inverse are known.

Figure 10: Graphical representation of Inverse Transform Sampling for the exponential distribution

In another example, for the exponential distribution with

p( x ) = (1/λ) e^{−x/λ},    P( x ) = ∫_{−∞}^x p( x̃ ) dx̃ = 1 − e^{−x/λ},

by setting u = 1 − e− x/λ we get x (u) = −λ log u, where u ∼ U [0, 1].


Again, by using the method above, we can see that indeed x (u) is
distributed according to the exponential distribution. Note, 1 − u ∼
U [0, 1] and u ∼ U [0, 1] are identically distributed.
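A minimal sketch of this inverse transform for the exponential distribution (our illustration; λ, the seed and the sample size are arbitrary choices):

# Hypothetical inverse transform sampling: x(u) = -lambda * log(u), u ~ U[0, 1].
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                   # scale parameter lambda
u = rng.uniform(size=100_000)               # uniform samples
x = -lam * np.log(u)                        # exponential samples via the inverse CDF

print(x.mean())   # close to lam, the mean of the exponential distribution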

For multivariate distributions that do not factorize, computing


the inverse can be tricky, and the Inverse Transform is not guaranteed
to be the fastest algorithm. For Gaussian distributions, the
Box-Muller transform is an example of an Inverse Transform sam-
pling method, but other algorithms such as Marsaglia’s polar form
or Ziggurat algorithm, based on rejection sampling, are faster.

Rejection Sampling v
One issue with Inverse Transform sampling is that the normaliza-
tion constant needs to be known; we cannot use an unnormalized

Figure 11: Graphical representation of rejection sampling. The grey samples will be rejected.

distribution. Rejection sampling can work around this if the shape of


the PDF p̃( x ) is known, where

p( x ) = p̃( x ) / Z.

If you can find a distribution q( x ) and a constant c such that cq( x )


is an upper bound for p̃( x ),

cq( x ) ≥ p̃( x ),

then it is possible to use samples from q and the uniform distribu-


tion to generate samples from p; draw s ∼ q( x ), u ∼ U [0, cq(s)], and
save s if u < p̃(s) (throw s away otherwise).
Even though it might seem wasteful to throw away samples, this
is needed in order to convert samples from q to samples from p
without knowing the relationship between these two.
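The procedure can be written in a few lines. The following sketch (hypothetical; the target, proxy and constant c are our own choices, with c q(x) ≥ p̃(x) holding by construction) samples from an unnormalized Gaussian-shaped p̃ using a wider Gaussian proxy:

# Hypothetical rejection sampling: draw s ~ q, u ~ U[0, c*q(s)], keep s if u < p_tilde(s).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: standard-normal shape, normalizer Z unknown to the sampler
    return np.exp(-0.5 * x ** 2)

def q_pdf(x, sigma_q=2.0):
    return np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

c = 2.0 * np.sqrt(2 * np.pi)   # chosen so that c * q(x) >= p_tilde(x) for all x

samples = []
while len(samples) < 10_000:
    s = rng.normal(0.0, 2.0)               # proposal s ~ q
    u = rng.uniform(0.0, c * q_pdf(s))     # u ~ U[0, c*q(s)]
    if u < p_tilde(s):                     # accept with probability p_tilde(s) / (c*q(s))
        samples.append(s)

print(np.std(samples))   # close to 1, the standard deviation of the target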
Rejection sampling is simple, but its performance is bad when
the dimensionality of the distribution increases, as it rejects a large
proportion of samples. Consider the simple case of using rejection
sampling from a d-dimensional normal distribution, p( x ) =
N( x; 0, σp² I ), using a proxy distribution q( x ) = N( x; 0, σq² I ),
where σq > σp (in order to satisfy the upper bound property). The
optimal c that makes c q( x ) close to p( x ), while still remaining an
upper bound, is given by

c = ( 2πσq² / 2πσp² )^{d/2} = ( σq / σp )^d = exp( d log( σq / σp ) ).

As the acceptance rate is proportional to the ratio of volume, 1/c,


and c scales exponentially in d, the rejection rate scales exponentially
in d as well. A ratio σq /σp = 1.1, which should be a reasonable approximation,
leads to an acceptance rate of less than 1/10^4 for d = 100;
≈ 10^4 samples from q are needed to generate a sample from p.

Importance Sampling v
Importance sampling is a slightly less simple method; if it is not
possible to compute the inverse transform to sample from p( x ), but
the PDF can still be evaluated, we can use samples from a proxy

distribution q( x ) by transforming the problem into

φ = ∫ f ( x ) p( x ) dx = ∫ f ( x ) ( p( x ) / q( x ) ) q( x ) dx.

To be well defined, we need q( x ) > 0 if p( x ) > 0.

This is just the computation of the expectation over the proxy


distribution q of a new function, g( x ) = f ( x ) p( x )/q( x ), so we can
get an unbiased estimate using

φ̃ = (1/n) ∑_i f ( xi ) p( xi ) / q( xi ) = (1/n) ∑_i f ( xi ) wi ,

where wi = p( xi )/q( xi ) is known as the importance, or weight,


of the ith sample. If the normalization constant Z is unknown, it is
possible to estimate it "on the fly", using p̃( x )/Z = p( x ) and

∫ f ( x ) p( x ) dx = (1/(Z S)) ∑_s f ( xs ) p̃( xs ) / q( xs )
                   = (1/S) ∑_s f ( xs ) ( p̃( xs )/q( xs ) ) / ( (1/S) ∑_{s′} p̃( x_{s′} )/q( x_{s′} ) )  =:  ∑_s f ( xs ) w̃s

This estimator is no longer unbiased, but it is still consistent; it will


eventually converge to the solution.
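A minimal sketch of the self-normalized estimator above (our illustration; the target, proxy and f are made-up choices):

# Hypothetical self-normalized importance sampling: estimate E_p[f(x)] using samples
# from a proxy q and weights p_tilde(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)
S = 100_000
sigma_q = 5.0

def p_tilde(x):
    # unnormalized target density: Gaussian bump centered at 3 (normalizer unknown)
    return np.exp(-0.5 * (x - 3.0) ** 2)

def q_pdf(x):
    return np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))

xs = rng.normal(0.0, sigma_q, size=S)   # samples from the proxy q
w = p_tilde(xs) / q_pdf(xs)             # importance weights p_tilde(x)/q(x)
w_tilde = w / w.sum()                   # self-normalized weights (Z estimated on the fly)

f = xs                                  # f(x) = x, i.e. estimate the mean under p
print(np.sum(f * w_tilde))              # approximately 3.0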

Figure 12: Graphical representation of importance sampling. Showing the true distribution p( x ), the proxy q( x ), weight function w( x ), f ( x ) = x being a linear function and black dots from the true distribution on the left. On the right we can see black samples drawn from p( x ) · x and red importance samples showing higher variance.

Importance sampling also has issues, especially in high dimen-


sions. The variance of a Monte-Carlo estimator is var[ f ( x )]/S,
and importance sampling replaces var[ f ( x )] with var[ g( x )] =
var[ f ( x ) p( x )/q( x )]. The variance of g can be large if q ≪ p some-
where. If p has “undiscovered highlands”, regions of high probabil-
ity to which q assigns very-low to no probability, some regions can
have the ratio p( x )/q( x ) grow to infinity as q( x ) goes to 0.

Condensed content

Sampling is a way of performing rough probabilistic computations


without having to design an elaborate integration algorithm, in
particular for expectations (including marginalization).

• ‘Random numbers’ generated by a computer don’t really need


to be unpredictable, as long as they have as little structure as
possible

• Uniformly distributed random numbers can be transformed


into other distributions. This can be done numerically efficiently
in some cases, and it is worth thinking about doing so

• Sampling is harder than global optimization. To produce exact


samples one needs a global description of the entire function
including knowledge of regions with high density (not just local
maxima!) as well as the cumulative density everywhere else.

• Practical Monte Carlo Methods aim to construct samples from

p( x ) = p̃( x ) / Z

assuming that it is possible to evaluate the unnormalized density p̃
(but not p) at arbitrary points.
Typical example: Compute moments of a posterior

p( x | D ) = p( D | x ) p( x ) / ∫ p( D, x ) dx    as    E_{p( x| D)} ( x^n ) ≈ (1/S) ∑_s x_s^n with xs ∼ p( x | D )

• Rejection sampling is a primitive but exact method that works


with intractable models

• Importance sampling makes more efficient use of samples, but


can have high variance (and this may not be obvious)

• Producing exact samples is just as hard as high-dimensional inte-


gration. Thus, practical MC methods sample from an unnormal-
ized density p̃( x ) = Z · p( x )

• Even this, however, is hard, because it is hard to build a globally


useful approximation to the integrand
Markov Chain Monte Carlo v

The main issue of importance sampling is that the proxy distribu-


tion q( x ) needs to be a good approximation to p( x ) on its whole
domain, or in other words - everywhere. The idea behind Markov
Chain Monte Carlo methods is to generate samples by iteratively
building approximations of p that only need to be good locally.

Definition 22 (Markov Chains ). A joint distribution p( X ) over a


sequence of random variables X := [ x1 , . . . , x N ] is said to have the
Markov property if

p ( x i | x 1 , x 2 , . . . , x i −1 ) = p ( x i | x i −1 ).

The sequence is then called a Markov chain.

The Markov property can be interpreted as forgetfulness, since


all but the most recent ’chain links’ are discarded. To illustrate the
intuition behind Markov Chain Monte Carlo, consider the following
iterative algorithm to find the maximum of p( x );

• Draw a proposal x′ ∼ q( x′ | xt ) from a proposal distribution q.

• Compute a = p( x′ )/p( xt ).

• If a > 1, accept xt+1 = x′, else reject x′ and set xt+1 = xt .

This procedure only returns the maximum of p( x ) eventually, but


can be adapted to return a sample instead by tweaking the rules
of transition from xt to xt+1 , leading to the Metropolis-Hasting
method v ;

• Draw a proposal x′ ∼ q( x′ | xt ) from a proposal distribution q, for
example q( x′ | xt ) = N( x′ ; xt , σ² ).

• Compute a = ( p( x′ ) q( xt | x′ ) ) / ( p( xt ) q( x′ | xt ) ).

• If a > 1, accept xt+1 = x′.

• Otherwise, accept with probability a, and reject with probability
1 − a. The outcome can be decided by drawing uniformly from
[0, 1] and comparing it to a.

The Markov chain stays at the same place for one time period when
rejecting and the corresponding point will later show up at least
2 times. Usually, the proposal distribution is symmetric, such that
q( xt | x′ ) = q( x′ | xt ).
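A minimal sketch of this Metropolis-Hastings loop with a symmetric Gaussian proposal, so that the q-ratio cancels (our illustration; the unnormalized target, step size and chain length are arbitrary choices):

# Hypothetical Metropolis-Hastings sampler with a symmetric random-walk proposal.
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target density: two Gaussian bumps
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

sigma = 1.0                      # proposal width
x = 0.0                          # arbitrary starting point x_0
chain = []
for t in range(50_000):
    x_new = rng.normal(x, sigma)                   # proposal x' ~ q(x'|x_t)
    a = p_tilde(x_new) / p_tilde(x)                # acceptance ratio (symmetric q)
    if a > 1 or rng.uniform() < a:                 # accept with probability min(1, a)
        x = x_new                                  # otherwise the chain stays at x_t
    chain.append(x)

print(np.mean(chain))    # sample mean under the (unnormalized) target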

Using this method, the samples will spend more “time” in re-
gions where p( x ) is high (lower probability of sampling a better
proposition) and less “time” in regions where p( x ) is low (any
proposition would be good), but the algorithm can still visit regions
of low probability (see Fig. 13 for an example).

See chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH for a visualization of the Metropolis-Hastings algorithm in 2D created by Chi Feng.

Metropolis-Hasting draws samples from p( x ) in the limit
of infinite sampling steps. The proof sketch involves the existence
of a stationary distribution, which is a distribution that does not
change over time (anymore). For Markov Chains, its existence can
be shown through the detailed balance equation:

p( x ) T ( x → x′ ) = p( x′ ) T ( x′ → x ),

where T ( x → x′ ) is the probability of transitioning from x to x′;

p( x ) T ( x → x′ ) = p( x ) q( x′ | x ) min( 1, p( x′ ) q( x | x′ ) / ( p( x ) q( x′ | x ) ) )
                   = min( p( x ) q( x′ | x ), p( x′ ) q( x | x′ ) )
                   = p( x′ ) q( x | x′ ) min( p( x ) q( x′ | x ) / ( p( x′ ) q( x | x′ ) ), 1 )
                   = p( x′ ) T ( x′ → x ).

Markov Chains satisfying the detailed balance equation have at


least one stationary distribution:
∫ p( x ) T ( x → x′ ) dx = ∫ p( x′ ) T ( x′ → x ) dx = p( x′ ) ∫ T ( x′ → x ) dx = p( x′ ).

Uniqueness of the stationary distribution comes from ergodicity


of the sequence { xt }t∈N . The sequence created by the Metropolis-
Hasting algorithm fulfills this criterion by definition.

Definition 23 (Ergodicity). A sequence { xt }t∈N is called ergodic if


it

1. is aperiodic (contains no recurring sequence)

2. has positive recurrence: xt = x∗ implies there is a t′ > t such that
p( x_{t′} = x∗ ) > 0

Theorem 24 (Convergence of Metropolis-Hasting, simplified). If
q( x′ | xt ) > 0 ∀( x′ , xt ), then for any x0 , the density of { xt }t∈N approaches
p( x ) as t → ∞.

However, this is not a statement about the convergence rate. To get


an idea of the convergence rate, consider the case of sampling from
a d-dimensional Gaussian, where the largest and smallest eigenvalues
of the covariance matrix are L and ε, as shown in Fig. 14 for
two dimensions.

Figure 13: v Example of a MCMC execution with a Gaussian proposal distribution. Steps 1–5 and step 300. The sample distribution still appears not to be uniform and the Markov chain has not yet mixed 'perfectly'.

We have to set the width of q to be approximately ε, otherwise
the acceptance rate r will be too low. The Metropolis-Hastings will
do a random walk in two dimensions, and after t steps will have
moved a distance of approximately

√( E[ ‖ xt − x0 ‖² ] ) ≈ ε √(rt).

Therefore, to create one independent draw at a distance L of the
starting position, the Markov Chain has to run for at least

t ≈ (1/r) ( L / ε )²

steps. In practice (for example if the distribution has isolated islands,
such as a mixture of Gaussians) the situation can be much worse.

Figure 14: Metropolis-Hastings on a two-dimensional Gaussian

Gibbs Sampling v
This is a special case of Metropolis-Hastings. It employs the idea that sampling from a high-dimensional joint distribution is often difficult, while sampling from a one-dimensional conditional distribution is easier. So instead of directly sampling from the joint distribution p(x), the Gibbs sampler alternates between drawing from the respective conditional distributions $p(x_i \mid x_{j\neq i})$.
The page chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana provides a visualization of Gibbs Sampling created by Chi Feng.

procedure Gibbs(p(x))
    x_i ← rand()  ∀i        (initialize randomly)
    for t = 1, …, T do
        x_1^{(t+1)} ∼ p(x_1 | x_2^{(t)}, x_3^{(t)}, …, x_m^{(t)})
        x_2^{(t+1)} ∼ p(x_2 | x_1^{(t+1)}, x_3^{(t)}, …, x_m^{(t)})
        ⋮
        x_m^{(t+1)} ∼ p(x_m | x_1^{(t+1)}, x_2^{(t+1)}, …, x_{m−1}^{(t+1)})
    end for
end procedure

It generates an instance from the distribution of each variable in


turn, conditioning on the current values of the other variables.
Thus, Gibbs is useful when drawing from the joint is hard or infeasible, while drawing from the conditionals is more tractable. Although there are theoretical guarantees for the convergence of Gibbs Sampling, as it is a special form of Metropolis-Hastings, it is unknown how many iterations are needed to reach the stationary distribution. In fact, just as for Metropolis-Hastings, the state space is explored by a slow random walk, therefore needing many iterations in high-dimensional spaces.
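As a concrete illustration of the scheme above, the following sketch applies Gibbs sampling to a correlated bivariate Gaussian, where both conditionals are univariate Gaussians with a known closed form. The correlation value and chain length are arbitrary choices for illustration.

import numpy as np

def gibbs_bivariate_gaussian(rho=0.9, n_steps=5000, rng=None):
    # Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho.
    # For this target, x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2.
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_steps, 2))
    s = np.sqrt(1.0 - rho**2)   # conditional standard deviation
    for t in range(n_steps):
        x1 = rho * x2 + s * rng.standard_normal()   # draw from p(x1 | x2)
        x2 = rho * x1 + s * rng.standard_normal()   # draw from p(x2 | x1)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples.T))   # should be close to [[1, 0.9], [0.9, 1]]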

(Chi Feng also offers nice depictions of Hamiltonian Monte Carlo at chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,banana and of Hamiltonian Monte Carlo with NUTS at chi-feng.github.io/mcmc-demo/app.html#NaiveNUTS,banana.)

The study of MCMC methods is a field of its own, and more elaborate methods exist. Hamiltonian v, or Hybrid, Monte Carlo methods introduce momentum variables to reduce the diffusion, and require gradients of p. Several variations adapt

to the local shape of p (Riemannian MCMC), among which the No U-Turn Sampler (NUTS) v (https://arxiv.org/abs/1111.4246) is currently the gold standard for models allowing automatic differentiation. Another method, known as Slice Sampling, gives efficient (exponentially fast) exploration in one dimension and has almost no free parameters. However, this method suffers in high-dimensional settings.
In nontrivial situations, no Markov-Chain sampling method
(except exact sampling) gives exact finite-time bounds. Diagnostic
tricks exist, but are not flawless. However, MCMC approaches exact
answers (after an unknown time), whereas the other tools in our
toolboxes can only produce approximations.

Condensed content

• Markov Chain Monte Carlo circumvents building a globally


useful approximation and breaks down sampling into local
dynamics

• Moreover, it samples correctly in the asymptotic limit

• However, avoiding random walk behaviour requires careful


design, because the method will only converge well on the scale
in which the local models cover the global problem

• Hamiltonian MCMC methods (like NUTS) are currently among


the state of the art (sequential MC being an alternative):

– they require the solution of an ordinary differential equation


(the Hamiltonian dynamics)
– their hyperparameters are tuned using elaborate subroutines
– this is typical of all good numerical methods!

• These methods are available in software packages

Reminder: Monte Carlo methods converge stochastically. This


stochastic rate is an optimistic bound for MCMC, because it has
to be scaled by the mixing time (the time until the Markov chain is
"close" to its steady state distribution). Even though Monte Carlo
methods are a powerful and well-developed tool, they are most
likely not the final solution to integration.
Gaussian probability distributions v

Given that we have already introduced the basics of continuous


probabilities, let us now delve into arguably one of the most impor-
tant distributions defined over continuous spaces - the Gaussian
distribution. Due to its significance within probability theory, the
whole chapter should be considered as condensed content.

Definition 25 (Univariate Gaussian distribution). Parametrized by


two scalars, the mean µ and variance σ2 , the density function of the
univariate Gaussian distribution is
$$\mathcal{N}\big(x; \mu, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Figure 15: Univariate Gaussian probability distribution, here shown with µ = 3, σ² = 1.

More interesting, but more difficult to visualize on a 2D piece of paper, is the multivariate extension:

Definition 26 (Multivariate Gaussian distribution). Parametrized by a mean vector µ ∈ R^n and a positive definite covariance matrix Σ ∈ R^{n×n}, the density of the n-dimensional Gaussian is given by
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right).$$
(A symmetric matrix A ∈ R^{n×n} is positive definite if, for all vectors v ∈ R^n with v ≠ 0, v⊤Av > 0, and positive semi-definite if ≥ 0. Equivalently: all eigenvalues of A are positive, or non-negative for semi-definiteness.)
The expression |Σ| refers to the determinant of the covariance matrix Σ. As ∫ N(x; µ, σ²) dx = 1 and N(x; µ, σ²) > 0 ∀x, N is a well-defined probability measure.
Gaussians have useful properties v such as being symmetric in x and µ: N(x; µ, Σ) = N(µ; x, Σ), as well as being exponentiations

of a quadratic polynomial:
$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \frac{1}{2}x^\top\Lambda x\right) = \exp\left(a + \eta^\top x - \frac{1}{2}\operatorname{tr}(xx^\top\Lambda)\right)$$
with the natural parameters Λ = Σ⁻¹ (the precision matrix), η = Λµ, and the sufficient statistics x, xx⊤. The scaling of the normal distribution is incorporated in the constant a.
Furthermore, the equi-probability lines of Gaussians are ellipsoids (see Fig. 16). Those properties make it convenient to perform inference on Gaussian random variables, using the simple tools of linear algebra.
Figure 16: Two-dimensional Gaussian distribution.
The Gaussian is its own conjugate prior, meaning that given a Gaussian prior p(x) and a Gaussian likelihood p(y|x), the
posterior p(x|y) is also a Gaussian (see Fig. 17). For
$$p(x) = \mathcal{N}(x; \mu, \sigma^2) \quad\text{and}\quad p(y \mid x) = \mathcal{N}(y; x, \nu^2),$$
the posterior is given by
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{\int p(y \mid x)\,p(x)\,dx} = \mathcal{N}(x; m, s^2), \quad\text{where } m = \frac{\sigma^{-2}\mu + \nu^{-2}y}{\sigma^{-2} + \nu^{-2}} \text{ and } s^2 = \frac{1}{\sigma^{-2} + \nu^{-2}}.$$
The derivation of these expressions can be seen here v.
Figure 17: Gaussian Prior, Likelihood and Posterior.

Gaussians are closed under multiplication.
$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\cdot Z,$$
where C = (A⁻¹ + B⁻¹)⁻¹, c = C(A⁻¹a + B⁻¹b), and Z = N(a; b, A + B).

Gaussians are closed under linear projections. If x is a random variable distributed according to N(x; µ, Σ) and A is an arbitrary matrix (of according size), then Ax is distributed according to (see Fig. 18)
$$\mathcal{N}\big(Ax;\; A\mu,\; A\Sigma A^\top\big).$$
Figure 18: Linear projection of a Gaussian distributed random variable.

Gaussians are closed under marginalization. Marginalization is a special case of a linear projection: it is a projection with a matrix which has the entry 1 on its diagonal at the indices of the variables for which we construct the marginal, and 0s elsewhere.
Assuming that x, y, z are distributed according to
$$\mathcal{N}\left(\begin{pmatrix} x \\ y \\ z \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \\ \mu_z \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} & \Sigma_{xz} \\ \Sigma_{yx} & \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zx} & \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}\right),$$
then the projection given by
$$A_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
results in the marginal distribution for x, and e.g.
$$A_{yz} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
in the marginal distribution (integrating over x) in y and z. This property corresponds to the Sum rule,
$$\int p(x, y)\,dy = \int p(y\mid x)\,p(x)\,dy = p(x)\int p(y\mid x)\,dy = p(x).$$

Gaussians are closed under conditioning on scaled vectors, i.e.,
$$p(x \mid Ax = y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\Big(x;\; \mu + \Sigma A^\top(A\Sigma A^\top)^{-1}(y - A\mu),\; \Sigma - \Sigma A^\top(A\Sigma A^\top)^{-1}A\Sigma\Big).$$

Bayes' Theorem also leads to Gaussian variables, thanks to conditioning and marginalization.

Theorem 27 (Bayes' Theorem with Gaussians). If $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda)$, then
$$p(y) = \mathcal{N}\big(y;\; A\mu + b,\; \Lambda + A\Sigma A^\top\big)$$
$$p(x \mid y) = \mathcal{N}\Big(x;\; \mu + \underbrace{\Sigma A^\top(A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\underbrace{\big(y - (A\mu + b)\big)}_{\text{residual}},\; \Sigma - \Sigma A^\top(\underbrace{A\Sigma A^\top + \Lambda}_{\text{Gram matrix}})^{-1}A\Sigma\Big).$$
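In code, the update of Theorem 27 is a handful of numpy lines. The following sketch is illustrative; the toy dimensions and values in the usage example are arbitrary.

import numpy as np

def gaussian_bayes_update(mu, Sigma, A, b, Lam, y):
    # Posterior N(x; m, S) for prior N(x; mu, Sigma) and likelihood N(y; A x + b, Lam).
    Gram = A @ Sigma @ A.T + Lam                 # A Sigma A^T + Lambda
    gain = np.linalg.solve(Gram, A @ Sigma).T    # Sigma A^T Gram^{-1}, computed via a solve
    m = mu + gain @ (y - (A @ mu + b))           # mean: prior mean + gain * residual
    S = Sigma - gain @ A @ Sigma                 # covariance shrinks by the explained part
    return m, S

# usage on a toy problem: infer a 2D x from one noisy linear observation
mu, Sigma = np.zeros(2), np.eye(2)
A, b, Lam = np.array([[1.0, 2.0]]), np.zeros(1), np.array([[0.1]])
m, S = gaussian_bayes_update(mu, Sigma, A, b, Lam, np.array([3.0]))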

The Matrix Inversion Lemma


The computations outlined above require the inversion of an [N × N] matrix. If F ≪ N, it can be beneficial to perform the inversion in the [F × F] space, which is possible using the Matrix Inversion Lemma, also known as the Woodbury Matrix Identity (wikipedia.org/wiki/Woodbury_matrix_identity). In its general form, it states that
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1},$$


where A, U, C, V are matrices of size [N × N], [N × F], [F × F] and [F × N]. Applying those identities to the computation of the posterior, we get that
$$p(x \mid y) = \mathcal{N}\Big(x;\; \mu + \underbrace{\Sigma A^\top(A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\underbrace{\big(y - (A\mu + b)\big)}_{\text{residual}},\; \Sigma - \Sigma A^\top(\underbrace{A\Sigma A^\top + \Lambda}_{\text{Gram matrix}})^{-1}A\Sigma\Big)$$
$$= \mathcal{N}\Big(x;\; \big(\underbrace{\Sigma^{-1} + A^\top\Lambda^{-1}A}_{\text{precision matrix}}\big)^{-1}\big(A^\top\Lambda^{-1}(y - b) + \Sigma^{-1}\mu\big),\; \big(\underbrace{\Sigma^{-1} + A^\top\Lambda^{-1}A}_{\text{precision matrix}}\big)^{-1}\Big).$$
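A quick numerical sanity check that the two parametrizations agree; the random shapes and noise level below are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(0)
N, F = 50, 3                                   # many observations, few latent dimensions
A = rng.standard_normal((N, F))
Sigma, Lam = np.eye(F), 0.1 * np.eye(N)        # prior and noise covariances
mu, b = np.zeros(F), np.zeros(N)
y = rng.standard_normal(N)

# N x N form (gain / residual, as in Theorem 27)
Gram = A @ Sigma @ A.T + Lam
gain = np.linalg.solve(Gram, A @ Sigma).T
m1 = mu + gain @ (y - (A @ mu + b))
S1 = Sigma - gain @ A @ Sigma

# F x F "information" form obtained via the matrix inversion lemma
P = np.linalg.inv(Sigma) + A.T @ np.linalg.solve(Lam, A)   # posterior precision
S2 = np.linalg.inv(P)
m2 = S2 @ (A.T @ np.linalg.solve(Lam, y - b) + np.linalg.solve(Sigma, mu))

assert np.allclose(m1, m2) and np.allclose(S1, S2)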

Numerical Stability

Inverting matrices is very often subject to numerical instability, which happens when the matrices are close to singular (wikipedia.org/wiki/Invertible_matrix). If you want to compute x = A⁻¹b, Numpy will happily invert your matrix if it technically is non-singular, using

x = numpy.linalg.inv(A) @ b,

but the results might be nonsensical if it is close to singular. A better option is to ask Numpy to solve the system Ax = b with

x = numpy.linalg.solve(A, b),

which is more stable. If you need to compute the multiplication of A⁻¹ with multiple vectors or matrices, and the matrix A is positive definite (wikipedia.org/wiki/Positive-definite_matrix), you can pre-compute the Cholesky decomposition (wikipedia.org/wiki/Cholesky_decomposition) of A with

L = numpy.linalg.cholesky(A)

and use Scipy's routine to solve Ax = b systems using the Cholesky factor as a starting point,

x = scipy.linalg.cho_solve((L, True), b).

For added stability, if your matrix A is positive definite but close to positive semi-definite, you can try to compute the Cholesky of the slightly modified matrix A' = A + εI, where ε is a small constant, say 10⁻⁶, to ensure that the eigenvalues of A' stay above 0.
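Putting these calls together, a typical pattern looks like the following small sketch (the jitter value 1e-6 is the example constant from the text):

import numpy as np
from scipy.linalg import cho_solve

def solve_psd(A, B, jitter=1e-6):
    # Solve A X = B for symmetric positive (semi-)definite A via a Cholesky factor.
    A_stable = A + jitter * np.eye(A.shape[0])   # nudge eigenvalues away from zero
    L = np.linalg.cholesky(A_stable)             # A_stable = L @ L.T
    return cho_solve((L, True), B)               # True: L is lower-triangular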

Inferring (in)dependence

A neat property that Gaussian distributions have is that they also allow us to infer the (in)dependence between the variables. A zero off-diagonal element in the covariance matrix implies marginal independence:
$$[\Sigma]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j) = \mathcal{N}\big(x_i; [\mu]_i, [\Sigma]_{ii}\big)\cdot\mathcal{N}\big(x_j; [\mu]_j, [\Sigma]_{jj}\big).$$
An example for this property can be seen under v, applied to a model with fan-out structure. A zero off-diagonal element in the precision matrix implies independence conditioned on all other variables v:
$$[\Sigma^{-1}]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j \mid x_{\neq i,j}) = \mathcal{N}\big(x_i \mid x_{\neq i,j}\big)\cdot\mathcal{N}\big(x_j \mid x_{\neq i,j}\big).$$
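A small numerical illustration of the second statement, using a three-variable chain x₁ → x₂ → x₃ with unit innovations (the coefficients are arbitrary): the covariance matrix is dense, yet the precision matrix has a zero at the (1,3) entry, reflecting that x₁ and x₃ are independent given x₂.

import numpy as np

# Chain: x1 = e1, x2 = x1 + e2, x3 = x2 + e3, with standard normal e
B = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])        # x = B @ e
Sigma = B @ B.T                        # covariance: [Sigma]_13 != 0
Lambda = np.linalg.inv(Sigma)          # precision: [Lambda]_13 == 0 (up to rounding)
print(np.round(Sigma, 3))
print(np.round(Lambda, 3))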

Condensed content

• Gaussian distributions provide the linear algebra of inference.

– products of Gaussians are Gaussians


– linear maps of Gaussian variables are Gaussian variables
– marginals of Gaussians are Gaussians
– linear conditionals of Gaussians are Gaussians

• If all variables in a generative model are linearly related, and


the distributions of the parent variables are Gaussian, then all
conditionals, joints and marginals are Gaussian, with means and
covariances computable by linear algebra operations.

• A zero off-diagonal element in the covariance matrix implies


independence if all other variables are integrated out

• A zero off-diagonal element in the precision matrix implies


independence conditional on all other variables
Parametric Gaussian Regression v

In this chapter we look at how to use Gaussian distributions to


learn functions in the case of supervised learning.
Assume we have observation data X, y, representing a set of
input-output pairs, and we would like to learn the relation between
them: f ( x ) ≈ y. To learn f , we can assume that the outputs are
distributed according to
 
p(y| f ( x )) = N y; f ( x ), σ2 I .

Let us also assume y ∈ R^N, indicating that we have N independent samples. On the other hand, let X ∈ R^{N×D}, indicating a
D-dimensional observation for each of the N samples. For a starting
example, assume that the inputs are 1-dimensional, i.e. D = 1, as
illustrated in the example in Fig. 19.

Figure 19: Small dataset example.

For now, assume that f is a linear function with weights w1 , w2 :

f ( x ) = w1 + w2 x.

We introduce the abstraction feature mapping φ, giving the feature


vector φ_x:
$$\phi(x) = \phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix},$$

such that we can rewrite the function as f ( x ) = φx> w. The useful-


ness of this particular notation will become clearer soon.

Assuming the following prior on the weights, p(w) = N(w; µ, Σ),
will make our life easier, as the whole inference process becomes
computable using linear algebra. Fig. 20 shows an example of such a

prior on w, along with the resulting prior on the function f (stemming from the projection rule on Gaussians):
$$p(f) = \mathcal{N}\big(f_x;\; \phi_x^\top\mu,\; \phi_x^\top\Sigma\phi_x\big).$$
Figure 20: Gaussian prior on the weights v, along with the matching Gaussian prior on the function space.
To recap the notation, we have a dataset X = [ x1 , .., xn ], X ∈
X N , and an output dataset y ∈ R N , a function f ( x ) ∈ R, and a
feature vector for each data point (which we choose as φx = [1, x ]> ).
To make use of vectorization, we build the following feature matrix
containing the feature vectors
$$\phi_X = \begin{pmatrix} \phi_{x_1} & \dots & \phi_{x_N} \end{pmatrix} = \begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_N \end{pmatrix}.$$

Furthermore, we can think of the function f applied to the data X


as a vector:
$$f_X = f(X) = \phi_X^\top w = \begin{pmatrix} \phi_{x_1}^\top w \\ \vdots \\ \phi_{x_N}^\top w \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix}.$$

Using this notation, our aforementioned assumption on the distri-


bution of the outputs y yields the following likelihood:
   
$$p(y \mid w, \phi_X) = \mathcal{N}\big(y;\; f_X,\; \sigma^2 I\big) = \mathcal{N}\big(y;\; \phi_X^\top w,\; \sigma^2 I\big).$$

Finally, by applying Bayes’ rule on the prior of the weights p(w)


and the likelihood p(y|w, φX ), we can compute the posterior on the
weights using pure linear algebra:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \underbrace{\mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu)}_{\text{posterior mean}},\; \underbrace{\Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma}_{\text{posterior covariance}}\Big).$$

Similarly as before, the posterior on the weights p(w | y, φ_X) gives rise to the posterior on the function (Fig. 21):
$$p(f_x \mid y, \phi_X) = \mathcal{N}\Big(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_x\Big).$$
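The whole chain above fits in a few lines of numpy. The sketch below uses the linear features φ_x = [1, x]⊤ and arbitrary prior and noise values for illustration.

import numpy as np

def phi(x):
    # Feature map phi_x = [1, x]^T for a batch of scalar inputs; shape [F=2, N].
    x = np.atleast_1d(x)
    return np.vstack([np.ones_like(x), x])

def posterior(X, y, mu, Sigma, sigma2, x_test):
    # Gaussian posterior on f(x_test) for the general linear model f(x) = phi(x)^T w.
    PhiX, Phis = phi(X), phi(x_test)
    G = PhiX.T @ Sigma @ PhiX + sigma2 * np.eye(len(X))   # phi_X^T Sigma phi_X + sigma^2 I
    kXs = PhiX.T @ Sigma @ Phis                           # cross-covariances
    mean = Phis.T @ mu + kXs.T @ np.linalg.solve(G, y - PhiX.T @ mu)
    cov = Phis.T @ Sigma @ Phis - kXs.T @ np.linalg.solve(G, kXs)
    return mean, cov

X = np.array([-4.0, -1.0, 0.5, 3.0]); y = np.array([-2.1, -0.4, 0.7, 1.8])
mean, cov = posterior(X, y, mu=np.zeros(2), Sigma=np.eye(2), sigma2=0.1,
                      x_test=np.linspace(-8, 8, 100))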

Figure 21: Posterior on the weights, and the matching posterior on the function, after seeing several datapoints. The result of applying more datapoints can be seen here v.

Those equations can be difficult to digest at first glance, but do not be intimidated. We will first see some results those equations produce, but will return to them at the end of the chapter for more details.

Feature functions

So far, we have shown how to perform linear regression using a


feature vector φx = [1 x ]> . However, notice that the learning
process is not limited to just linear features of X. The process we
described is linear in the weights w, but not the features φx .
Instead of limiting ourselves to linear relationships, we can de-
fine a new feature function to capture second and third order poly-
nomial structure,
$$\phi(x) = \phi_x = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix}^\top,$$
or even up to n-th order polynomials of x,
$$\phi(x) = \phi_x = \begin{bmatrix} 1 & x & x^2 & x^3 & \dots & x^n \end{bmatrix}^\top.$$
Furthermore, one can use sines and cosines to get a Fourier regression,
$$\phi(x) = \phi_x = \begin{bmatrix} \cos(x) & \cos(2x) & \cos(3x) & \sin(x) & \sin(2x) & \sin(3x) \end{bmatrix}^\top.$$

Again, we are not limited to any particular set of features - any combination of step functions, Legendre or Laguerre polynomials, bell curves, or really anything you can think of, is possible.
Each of those feature functions gives rise to a different prior, along with a posterior on the function f. The characteristics of the different posteriors will differ, but the inference framework we described above remains unchanged. The choice of features is essentially unconstrained. This poses the question of how to choose good feature functions. In the next chapter, we will see that we can define a parametrized family of features, where the parameters controlling the features (called hyperparameters) can be optimized themselves. Even better, in the subsequent chapter, we will discover that, under certain circumstances, it is possible to use infinitely many features.
(The Jupyter notebook "Gaussian_Linear_Regression" allows you to try out different combinations of kernels and may help to understand the introduced abstraction of a feature mapping.)
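Since only φ changes, swapping feature families is a one-line change in code. For instance, a few of the families shown below can be written as (assuming the posterior routine from the sketch above; centers and frequencies are arbitrary):

import numpy as np

def phi_poly(x, degree=3):
    return np.vstack([np.atleast_1d(x)**k for k in range(degree + 1)])

def phi_fourier(x, freqs=(1, 2, 3)):
    x = np.atleast_1d(x)
    return np.vstack([np.cos(a * x) for a in freqs] + [np.sin(a * x) for a in freqs])

def phi_gauss_bumps(x, centers=np.linspace(-8, 8, 17)):
    x = np.atleast_1d(x)
    return np.exp(-(x[None, :] - np.asarray(centers)[:, None])**2)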

The following figures illustrate the effect that different feature


functions have on the resulting priors and posteriors of f . We
highly recommend to watch the amazing animations at v .

Figure 22: Non-linear, one-dimensional dataset.

The following feature choices are illustrated, each with the resulting prior and posterior on f:

• Cubic regression, using polynomials up to the third order, φ_x = [1 x x² x³]⊤.
• Fourier regression, using sines and cosines with different frequencies, e.g. φ_α(x) = [cos(αx) sin(αx)].
• Pixel regression, using functions of the form φ_α(x) = 1 if x ∈ [α, α + 1], −1 otherwise.
• V regression, using differently shifted absolute values of the form φ_α(x) = |x − α| − α.
• Eiffel Tower regression, using Laplace distributions with different locations, φ_α(x) = e^{−|x−α|}.
• Bell curve regression, using Gaussian distributions with different locations, φ_α(x) = e^{−(x−α)²}.

Condensed content

• Gaussian distributions can be used to learn functions

• Analytical inference is possible using general linear models

f(x) = φ(x)⊤w = φ_x⊤w

• Then the posterior on both w and f is Gaussian

• The choice of features φ : X → R is essentially unconstrained



More details on the posterior equation

In the notation we introduced, the posterior can be very unintuitive,


especially the posterior on function values. We give more details
here, in the hope of making the structure more visible.

Dimensionality

It is useful to look at the dimensionality of each variable to get a sense of the computation complexity. Assume we have a dataset with N data points and choose a set of F features to fit.

• The prior mean µ and prior covariance matrix Σ of the weights w yield the prior on the weights p(w) = N(w; µ, Σ), where µ is an F-dimensional vector and Σ an [F × F] matrix.
• The function vector f_X and noise matrix σ²I give the likelihood p(y) = N(y; f_X, σ²I), where f_X is an N-dimensional vector and σ²I an [N × N] (diagonal) matrix for N data points.
• The feature function generates an [F × N] matrix φ_X that links the [N × N] data space and the [F × F] feature space.

Fig. 23 and 24 illustrate the dimensionality of the computations required for the posterior of the weights and function values. Please keep in mind that the dimensionality of the posterior on the function is "flexible" in the sense that we can evaluate it for an arbitrary number of datapoints.

Figure 23: Weight posterior, with dimensional representation:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\Big).$$

Figure 24: Function posterior, with dimensional representation:
$$p(f_X \mid y, \phi_X) = \mathcal{N}\Big(f_X;\; \phi_X^\top\mu + \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_X^\top\Sigma\phi_X - \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_X\Big).$$

The computations outlined above require the inversion of [N × N] matrices. If F ≪ N, it can be beneficial to perform those inver-
sions in the [ F × F ] space, as we have seen in the previous chapter,
using the Matrix Inversion Lemma.
This identity gives the following two ways to compute the posteriors:

Figure 25: Matrix Inversion Lemma for the posterior on the weights:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\Big)$$
$$= \mathcal{N}\Big(w;\; \big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\big(\Sigma^{-1}\mu + \sigma^{-2}\phi_X y\big),\; \big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\Big).$$

Figure 26: Matrix Inversion Lemma for the posterior on the function values:
$$p(f_X \mid y, \phi_X) = \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top\mu + \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_X^\top\Sigma\phi_X - \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_X\Big)$$
$$= \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top\big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\big(\Sigma^{-1}\mu + \sigma^{-2}\phi_X y\big),\; \phi_X^\top\big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\phi_X\Big).$$

To reiterate, both approaches give the same solution, but come at


different computational cost.
Hierarchical Inference: learning the features v

We previously saw how to apply Bayesian inference to learn the pa-


rameters of a linear function. We also saw that we could adjust the
function being inferred by altering the choice of features. However,
the choice of features can be a daunting task, since any transfor-
mation could be picked. We will now see that it is possible to also
learn which features to use.

Hierarchical Bayesian Inference v

Searching an infinite-dimensional space of feature functions is difficult. Luckily, we can temporarily restrict ourselves to a finite-dimensional sub-space of a feature family characterized by some parameters, and search this subspace instead. Consider the family of feature functions parametrized by θ,
$$\phi(x; \theta) = \frac{1}{1 + \exp\left(-\frac{x - \theta_1}{\theta_2}\right)},$$
illustrated in Fig. 27. The number of feature functions is still infinite, as there is an infinite array of choices for (θ₁, θ₂), but the dimensionality of the parametrization is fixed.

Figure 27: Parametrized family of functions, φ(x; θ) = (1 + exp(−(x − θ₁)/θ₂))⁻¹. The parameter θ₁ controls the intercept (where on the x-axis the function value crosses the 1/2 point) and the parameter θ₂ controls the slope.

The parameters θ₁ and θ₂ can be treated as unknown parameters, just as the weights w. However, it is more difficult to infer them, since the likelihood function
$$p(y \mid w, \theta) = \mathcal{N}\big(y;\; \phi(x; \theta)^\top w,\; \sigma^2 I\big)$$
contains a non-linear mapping of θ. Due to this non-linearity, we cannot use the full Bayesian treatment through linear algebra operations for the Normal distribution. It is still technically possible to perform inference over θ, but the computational cost is very prohibitive. However, if θ is known, the distributions related to the weights are still linear combinations of Gaussians, and we can still use linear algebra to infer the weights. It would be nice to have an approximative solution for θ, but still do full inference on the weights w. This is where Maximum Likelihood (ML) and Maximum A-Posteriori (MAP) estimation come into play. Instead of integrating over θ, we can fit it by selecting the most likely values using
$$\theta^\star = \underbrace{\arg\max_\theta\, p(D \mid \theta)}_{\text{Maximum Likelihood}}, \qquad \theta^\star = \underbrace{\arg\max_\theta\, p(\theta \mid D)}_{\text{Maximum A-Posteriori}}.$$
Maximum A-Posteriori weighs the likelihood term by a prior on θ,
$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)} \propto p(D \mid \theta)\,p(\theta),$$
and maximizes the posterior w.r.t. θ (hence, "A-Posteriori").

The parameters of the linear model w will still get a full proba-
bilistic treatment and will be integrated out in the inference of the
posterior, but the parameters that select the features (also known
as hyper-parameters) θ, are too costly to properly infer and will get
fitted.
To get a better understanding where these expressions come from, notice that the evidence in our posterior for f,
$$p(f \mid y, x, \theta) = \frac{p(y \mid f, x, \theta)\,p(f \mid \theta)}{\int p(y \mid f, x, \theta)\,p(f \mid \theta)\,df} = \frac{p(y \mid f, x, \theta)\,p(f \mid \theta)}{\underbrace{p(y \mid x, \theta)}_{\text{the evidence}}},$$
becomes the likelihood for our posterior over θ,
$$p(\theta \mid y) = \frac{\overbrace{p(y \mid \theta)}^{\text{now a likelihood}}\,p(\theta)}{\int p(y \mid \theta')\,p(\theta')\,d\theta'},$$
which we want to maximize in both cases.

Maximum Likelihood in practice v

(Where we previously used p(y | f_X) = N(y; f_X, σ²I) for the likelihood, we now use Λ for the covariance matrix, as in p(y | f_X) = N(y; f_X, Λ), for the more general case.)

To find the "best fit" θ⋆, we have to solve arg max_θ p(y | w, X, θ). In order to make the problem easier, we can use a few tricks to transform p(y | w, X, θ). Based on the observation that a transformation g(p(y | w, X, θ)) does not change the maximization problem as long as g(x) > g(y) ⇔ x > y, we can see that:
1. Solving arg maxθ f ( x ) is equivalent to solving arg maxθ log f ( x ).

2. Solving arg maxθ f ( x ) is equivalent to solving arg minθ − f ( x ).

3. Solving arg maxθ f ( x ) + c for some c independent of θ is equiva-


lent to solving arg maxθ f ( x ).

This gives
$$\begin{aligned}
\theta^\star &= \arg\max_\theta\, p(y \mid w, X, \theta) \\
&\overset{(1)}{=} \arg\max_\theta\, \log p(y \mid w, X, \theta) \\
&\overset{(2)}{=} \arg\min_\theta\, -\log p(y \mid w, X, \theta) \\
&= \arg\min_\theta\, \frac{1}{2}(y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu) + \frac{1}{2}\log\det\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big) + \frac{N}{2}\log(2\pi) \\
&\overset{(3)}{=} \arg\min_\theta\, \frac{1}{2}\Big(\underbrace{(y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu)}_{\text{Square Error}} + \underbrace{\log\det\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)}_{\text{Model complexity}}\Big).
\end{aligned}$$

We can drop the term N/2 log(2π ) because it does not affect the
minimization.
To better see that the first term is a square error, it might be useful to rewrite it as
$$\left\|\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1/2}\big(y - \phi_X^{\theta\top}\mu\big)\right\|^2,$$
which is the squared error of the distance between φ_X^{θ⊤}µ and y, scaled by (the square-root of) the precision matrix. The Model Complexity term, log det(φ_X^{θ⊤}Σφ_X^θ + Λ), measures the "volume" of hypotheses covered by the joint Gaussian distribution.
The Model Complexity term, also called Occam's factor, adds a penalty for features that lead to a large hypothesis space. This is based on the principle that, everything kept equal, simpler explanations should be favored over more complex ones.
"Numquam ponenda est pluralitas sine necessitate." (Plurality must never be posited without necessity.) – William of Occam
The aforementioned minimization procedure tries to both:

• explain the observed data well – by making the resulting φ_X^{θ⊤}µ close to y

• keep the model complexity low

To get a visual intuition for how this abstract expression relates to


complexity, have a look at v .
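In code, the two terms of this loss read directly off the equation. The following is a minimal numpy sketch, assuming a feature matrix PhiX of shape [F, N] for the current θ, prior mean mu, prior covariance Sigma, and noise covariance Lam are available:

import numpy as np

def negative_log_evidence(PhiX, mu, Sigma, Lam, y):
    # -log p(y | theta), up to the constant N/2 log(2 pi): square error + Occam factor.
    G = PhiX.T @ Sigma @ PhiX + Lam          # phi_X^T Sigma phi_X + Lambda
    r = y - PhiX.T @ mu                      # residual
    sign, logdet = np.linalg.slogdet(G)      # numerically stable log-determinant
    square_error = r @ np.linalg.solve(G, r)
    return 0.5 * (square_error + logdet)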

It is important to note that by using Maximum Likelihood (or Maximum A-Posteriori) solutions, we do not capture the uncertainty on the hyper-parameters. However, they make it possible to get some solution about which features to use in a reasonable time, which would be intractable otherwise.
If you are worried about fitting or hand-picking features for Bayesian regression, remember that this also applies for deep learning, where we have to reason about the choice of activation functions. By highlighting assumptions and priors, the probabilistic view forces us to address this problem directly, rather than obscuring them with notation and intuitions.
(The usual way to train such networks, however, does not include the Occam factor. The method used here is often referred to as Type-II Maximum Likelihood, whereas neural networks typically use Type-I. The following reference contains more details on the application of those ideas to neural networks: MacKay. The Evidence Framework Applied to Classification Networks. Neural Computation, 1992.)
Connection to deep learning v

Up until this point, we haven't really talked about how to solve the minimization problem stemming from the Maximum Likelihood over the model's hyperparameters. Since the optimization problem doesn't have an analytical solution, we can turn to a very prominent tool commonly used in deep learning – Automatic Differentiation.
Figure 28: The computation graph for L(θ).
In general, Automatic Differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program (i.e. mathematical expression), no matter how complicated, executes a
sequence of elementary arithmetic operations (addition, subtrac-
tion, multiplication, division, etc.) and elementary functions (exp,
log, sin, cos, etc.). By applying the chain rule repeatedly to these
operations, derivatives of arbitrary order can be computed auto-
matically, accurately, and using at most a small constant factor more
arithmetic operations than the original program.
Looking back at our derived loss L(θ ), we split the computations

as follows:
$$L(\theta) = \frac{1}{2}\Big((y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu) + \log\big|\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big|\Big),$$
with intermediate quantities (cf. Fig. 28) Δ := y − φ_X^{θ⊤}µ, K := φ_X^{θ⊤}Σφ_X^θ + Λ, the solve G := K⁻¹Δ, the square error e := Δ⊤G and the complexity c := log|K|, so that L(θ) = ½(e + c),
with the appropriate computation graph (computer program) visualized in Fig. 28.
Given the computation graph, there are two common modes through which we could obtain the partial derivative ∂L/∂θ: forward and backward mode. The specifics of the two modes are covered in detail in the following article: wikipedia.org/wiki/Automatic_differentiation.

An important aspect to note is the computational complexity of


the two modes – in settings when the output (in our case – the loss
L) is lower-dimensional than the input (in our case – the hyperpa-
rameters θ), the backward mode is more efficient due to the order
in which we multiply the Jacobians.
Once we have established the tool to compute the gradient, we
could use any well known optimization routine, such as Gradient
Descent, in order to minimize the loss L(θ ).
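As a hedged sketch of what this looks like in practice, the following uses PyTorch as one possible autodiff framework, with a zero prior mean, an isotropic prior covariance exp(θ₀)·I, and Gaussian-bump features whose lengthscale is exp(θ₁); all of these modelling choices and constants are illustrative, not prescribed by the lecture.

import torch

def neg_log_evidence(theta, X, y, sigma=0.1):
    # theta[0]: log prior scale, theta[1]: log lengthscale of the feature bumps
    centers = torch.linspace(-8.0, 8.0, 20)
    Phi = torch.exp(-0.5 * (X[:, None] - centers[None, :])**2 / torch.exp(theta[1])**2)
    G = torch.exp(theta[0]) * Phi @ Phi.T + sigma**2 * torch.eye(len(X))
    r = y   # zero prior mean assumed, so the residual is just y
    return 0.5 * (r @ torch.linalg.solve(G, r) + torch.logdet(G))

X = torch.linspace(-8.0, 8.0, 30)
y = torch.sin(X) + 0.1 * torch.randn(30)
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = neg_log_evidence(theta, X, y)
    loss.backward()      # reverse-mode AD computes dL/dtheta
    opt.step()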

Gaussian Regression as a neural network


A linear Gaussian regression is a single hidden layer neural net-
work, with quadratic output loss and fixed input layer, where
hyper-parameter fitting corresponds to training the input layer –
see Fig. 29 for an illustration.

Figure 29: Graphical representation of Hierarchical Bayesian Linear Regression as a network: input x, parameters θ₁, …, θ₉, features [φ_x]₁, …, [φ_x]₉, weights w₁, …, w₉, and output y. The parameters controlling the features, θ, are learned from the data using Maximum Likelihood, in a similar fashion as a Neural Network, and full inference is carried over the weights of those features, w.

If we consider doing Bayesian inference on the weights of a deep


neural network, we might as well not integrate out the final layer’s
weights, as they might not be of particular importance, given the
complexity of the entire model.
Bayesian inference on the weights of this network results in a posterior p(w, θ | y), which is proportional (up to normalization) to p(y | w, φ^θ) p(w, θ). Under the assumptions that the observations y are independent given the model (w, θ), and that the observation noise is Gaussian distributed, we can derive the following:
$$p(y \mid w, \phi^\theta)\,p(w, \theta) = p(w, \theta)\cdot\prod_{i=1}^n p(y_i \mid w, \phi_i^\theta) = p(w, \theta)\cdot\prod_{i=1}^n \mathcal{N}\big(y_i;\; \phi_i^{\theta\top}w,\; \sigma^2\big)$$

As the hierarchical structure of the model makes computing the posterior increasingly difficult, we may settle for calculating a best 'guess' for our parameters w and θ:
$$\arg\max_{w,\theta}\, p(w, \theta \mid y) = \arg\min_{w,\theta}\, -\log p(w, \theta \mid y) = \arg\min_{w,\theta}\, -\log p(w, \theta) + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\top}w\big\|^2
$$

Further assuming that our prior for w and θ is also Gaussian and centered with unit covariance matrix, we get
$$\arg\max_{w,\theta}\, p(w, \theta \mid y) = \arg\min_{w,\theta}\, \sum_i w_i^2 + \sum_j \theta_j^2 + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\top}w\big\|^2$$
$$\approx \arg\min_{w,\theta}\, \underbrace{\sum_i w_i^2 + \sum_j \theta_j^2}_{r(\theta,w)} + \underbrace{\frac{n}{2\sigma^2 b}\sum_{\beta=1}^b \big\|y_\beta - \phi_\beta^{\theta\top}w\big\|^2}_{L(\theta,w)} \;\sim\; \mathcal{N}\big(r + L(\theta, w),\; O(b^{-1})\big).$$

This is an empirical risk minimization problem with quadratic em-


pirical risk L(θ, w) and a regularizer. Therefore, training this deep
neural network for a regression task using batches b for subsam-
pling, is a generalized least squares problem.
Automatic Differentiation (AD), Gradient Descent and data sub-
sampling (Stochastic Gradient Descent) are algorithmic tools that
are just as helpful for Bayesian inference as they are for deep learn-
ing. The two domains, deep learning and Bayesian inference, are
not separate – they are just different perspectives. It is possible to
construct a point estimate for a Bayesian model, as well as to con-
struct full posteriors for deep networks. The different viewpoints
(probabilistic and statistical/empirical) on Machine Learning of-
ten overlap and inform each other. Understanding Bayesian linear
(Gaussian) regression can help us build a better intuition for deep
learning as well.

Condensed content

• Parameters θ that affect the model should ideally be part of the


inference process. The model evidence
$$p(y \mid \theta) = \int p(y \mid f, \theta)\,p(f \mid \theta)\,df,$$
the denominator in Bayes' theorem
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta')\,p(\theta')\,d\theta'},$$
is the ("type-II" or "marginal") likelihood for θ.

• If analytic inference on θ is intractable (which it usually is), θ


can be fitted by “type-II” maximum likelihood or maximum
a-posteriori inference which fits a point-estimate for feature
parameters.

• Bayesian inference still has effects here because the marginal


likelihood gives rise to complexity penalties / Occam factors.

• The Occam factor


$$\log\big|\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big|$$

measures model complexity as the “volume” of hypotheses


covered by the joint Gaussian distribution.

• MAP inference is an optimization problem, and can be per-


formed in the same way as other optimization-based ML ap-
proaches, including deep learning. That is, using the same op-
timizers (e.g. stochastic gradient descent), the same automatic
differentiation frameworks (e.g. TensorFlow / pyTorch, etc.) and
the same data subsampling techniques.

• A linear Gaussian regressor corresponds to a single hidden layer


neural network, with quadratic output loss, and fixed input layer.
Hyperparameter-fitting corresponds to training the input layer.
The usual way to train such a network, however, does not include the Occam factor.

• It is possible to construct full posteriors for deep networks.


Gaussian Processes v

In the previous chapter, we have seen that it is possible to learn


which features to use from a parametric family. We will now delve
into Gaussian Processes which, instead of learning a fixed number
of features, can get away with using infinitely many features (in finite
time!). This does not mean that the model is infinitely complex.
Using N data points, the final posterior will still be described using
an N-dimensional vector for the mean and an [ N × N ] covariance
matrix. However, the number of features considered for the posterior
covariance will be infinite, and the number of features selected will
grow with the number of data points. This is an example of a non-
parametric model, where the complexity grows when more data is
added.

Mean function and Kernel v

Consider the posterior function value for a single data point x', f_{x'}, where the posterior is conditioned on a bigger dataset of inputs X and outputs y:
$$p(f_{x'} \mid y, \phi_X) = \mathcal{N}\Big(f_{x'};\; \phi_{x'}^\top\mu + \phi_{x'}^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_{x'}^\top\Sigma\phi_{x'} - \phi_{x'}^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_{x'}\Big).$$
We can use the following abstraction
$$\text{mean function: } m(x) = \phi_x^\top\mu, \quad m: \mathbb{X} \to \mathbb{R},$$
$$\text{covariance function (kernel): } K(a, b) = \phi_a^\top\Sigma\phi_b, \quad K: \mathbb{X}\times\mathbb{X} \to \mathbb{R},$$
to rewrite the posterior as
$$p(f_{x'} \mid y, \phi_X) = \mathcal{N}\big(f_{x'};\; m_{x'} + K_{x'X}(K_{XX} + \sigma^2 I)^{-1}(y - m_X),\; K_{x'x'} - K_{x'X}(K_{XX} + \sigma^2 I)^{-1}K_{Xx'}\big),$$
where m_a = φ_a⊤µ and K_{ab} = φ_a⊤Σφ_b. The feature vectors φ_X, φ_{x'} are hidden in the computation of the mean function and the kernel. We will see that for some models, it is not necessary to construct the (infinite) feature vectors to compute the posterior – the mean and kernel can be computed in closed form.
(To see how Gaussian Process inference can be implemented and how the introduced abstraction allows to hide the feature functions in the computation, take a look at the Jupyter notebook Gaussian_Process_Regression.ipynb.)
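A minimal GP regression sketch using this abstraction, with an RBF kernel and a zero mean function (the kernel parameters, noise level and test grid are illustrative choices):

import numpy as np

def rbf(a, b, scale=1.0, lengthscale=1.0):
    # Square-exponential kernel k(a, b), evaluated on all pairs of 1D inputs.
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def gp_posterior(X, y, x_test, sigma2=0.1, kernel=rbf):
    KXX = kernel(X, X) + sigma2 * np.eye(len(X))
    KsX = kernel(x_test, X)
    Kss = kernel(x_test, x_test)
    alpha = np.linalg.solve(KXX, y)              # (K_XX + sigma^2 I)^{-1} y
    mean = KsX @ alpha                           # zero prior mean assumed
    cov = Kss - KsX @ np.linalg.solve(KXX, KsX.T)
    return mean, cov

X = np.array([-4.0, -1.0, 0.5, 3.0]); y = np.array([-2.1, -0.4, 0.7, 1.8])
mean, cov = gp_posterior(X, y, np.linspace(-8, 8, 200))
std = np.sqrt(np.diag(cov))                      # pointwise marginal uncertainty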

More features might mean cheaper computation v

For simplicity, fix the prior covariance to be independent with covariance matrix Σ = σ²(c_max − c_min)/F · I, where F is the number of features and c_max > c_min are constants. Now, assume that we are trying to learn features of the form
$$\phi(x, c_\ell) = \exp\left(-\frac{(x - c_\ell)^2}{2\lambda^2}\right),$$
with parameters c₁ < … < c_F in [c_min, c_max]. The kernel can then be written as
$$\phi_a^\top\Sigma\phi_b = \sigma^2\,\frac{c_{\max} - c_{\min}}{F}\sum_{\ell=1}^F \exp\left(-\frac{(a - c_\ell)^2}{2\lambda^2}\right)\exp\left(-\frac{(b - c_\ell)^2}{2\lambda^2}\right) = \sigma^2\,\frac{c_{\max} - c_{\min}}{F}\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right)\sum_{\ell=1}^F \exp\left(-\frac{\big(c_\ell - \tfrac{1}{2}(a + b)\big)^2}{\lambda^2}\right).$$
If we increase the number of features towards ∞, the number of features per unit dc approaches F/(c_max − c_min) dc and
$$\lim_{F\to\infty}\phi_a^\top\Sigma\phi_b = \sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right)\int_{c_{\min}}^{c_{\max}}\exp\left(-\frac{\big(c - \tfrac{1}{2}(a + b)\big)^2}{\lambda^2}\right)dc.$$
The part inside the integral is an unnormalized Gaussian probability distribution function. Further taking the limit as c_min → −∞ and c_max → ∞, the integral converges to the normalization factor √π λ:
$$K_{ab} := \lim_{\substack{F\to\infty,\; c_{\max}\to\infty,\; c_{\min}\to-\infty}} \phi_a^\top\Sigma\phi_b = \sqrt{\pi}\,\lambda\,\sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right).$$

This specific kernel is known as a Radial Basis Function, or Square(d)-


exponential Kernel.

Definition 28 (Positive Definite/Mercer Kernels v ). K : X × X →


R is a positive definite Kernel, or Mercer Kernel, if for any finite
collection X = [ x1 , . . . , xn ] the matrix KXX ∈ R N × N constructed
from
[KXX ]ij = K ( xi , x j )
is positive semi-definite (wikipedia.org/wiki/Positive-definite_matrix).

Definition 29 (Positive Definite). A matrix A ∈ R^{N×N} is positive (semi-)definite if, for any x ∈ R^N, x ≠ 0,
$$\underbrace{x^\top A x > 0}_{\text{positive definite}}, \qquad \underbrace{x^\top A x \geq 0}_{\text{positive semi-definite}}.$$

Equivalently, A is positive (semi-)definite if

• All its eigenvalues are positive (non-negative for semi-definite).

• It is a Gram matrix - the outer product of N vectors [φi ]i=1,...,N -


and has full rank (not necessary for semi-definite).

Visualizing kernels

The following figure shows the prior for different feature functions
and how increasing the number of features (by taking the limit
towards infinity) leads to a kernel. The final posteriors are inferred
from the dataset introduced in Fig. 22, Page 48

Visualizing: Radial Basis Functions

The Radial Basis Function v, or square(d)-exponential kernel, is generated from functions of the form
$$\phi_\ell(x) = \exp\left(-\frac{(x - c_\ell)^2}{2\lambda^2}\right),$$
and its limit is given by
$$K(a, b) = \sqrt{\pi}\,\lambda\,\sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right).$$

Visualizing: Wiener process

The Wiener process is defined from step functions that start at 0 and switch to 1 at some threshold c_ℓ,
$$\phi_\ell(x) = \begin{cases} 1 & \text{if } x \geq c_\ell, \\ 0 & \text{otherwise}, \end{cases}$$
and converges to
$$K(a, b) = \sigma^2\big(\min(a, b) - c_0\big).$$
The derivation can be found here v.

Cubic Splines

The cubic splines start similarly to the Wiener process as threshold functions, but they keep increasing linearly after being activated, like ReLU activation functions,
$$\phi_\ell(x) = \begin{cases} x - c_\ell & \text{if } x \geq c_\ell, \\ 0 & \text{otherwise}, \end{cases}$$
and converge to
$$K(a, b) = \sigma^2\left(\frac{1}{3}\min(a - c_0, b - c_0)^3 + \frac{1}{2}|a - b|\min(a - c_0, b - c_0)^2\right).$$
Its derivation can be seen here v.

Kernels can also be combined to form new kernels.

Theorem 30. If K1 , K2 are Mercer kernels from X × X → R and φ


is a mapping from Y → X, then the following functions are also
Mercer kernels (up to minor regularity assumption);

• Scaling: αK1 ( a, b) for α ∈ R+ , a, b ∈ X.

• Kernel Addition: K1 ( a, b) + K2 ( a, b), for a, b ∈ X.

• Change of representation: K1 (φ( a), φ(b)), for a, b ∈ Y.

• Kernel Multiplication: K1 ( a, b)K2 ( a, b), for a, b ∈ X.

The first two properties should be easy to prove from the properties of positive semi-definite matrices. The third property is the result of Mercer's Theorem (wikipedia.org/wiki/Mercer's_theorem), which we will cover in the next chapter, and the last property is the result of the Schur Product Theorem (wikipedia.org/wiki/Schur_product_theorem). Its proof is involved and is the result of the fact that the Hadamard product (wikipedia.org/wiki/Hadamard_product_(matrices)) of two positive semi-definite matrices is positive semi-definite. The following figures show some examples of transformed kernels. The animations can also be viewed here v.
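In code, these closure rules are literally function combinators. A small sketch, reusing the rbf helper from the GP sketch earlier (the composed example at the end is only an illustration):

def scale(alpha, k):
    return lambda a, b: alpha * k(a, b)

def add(k1, k2):
    return lambda a, b: k1(a, b) + k2(a, b)

def multiply(k1, k2):
    return lambda a, b: k1(a, b) * k2(a, b)

def warp(k, phi):
    return lambda a, b: k(phi(a), phi(b))

# e.g. a wigglier kernel on warped inputs plus a broad, scaled RBF trend component
k_custom = add(warp(rbf, lambda x: ((x + 8) / 5)**3), scale(20.0, rbf))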

• A simple RBF kernel, as a reference point: K(a, b) = exp(−(a − b)²/2).
• Different scaling for the RBF kernel, to affect the effect of noise: K(a, b) = 10² · exp(−(a − b)²/2).
• Transforming the inputs of an RBF kernel, to change the width of the bumps, here K(a, b) = 20 · exp(−(a − b)²/(2 · 5²)).
• Another transformation of the inputs of an RBF kernel, here the width of the bumps is reduced with K(a, b) = 20 · exp(−(a − b)²/(2 · 0.5²)).
• A non-linear transformation of the inputs of an RBF kernel, here using K(a, b) = 20 · exp(−(φ(a) − φ(b))²/2) with φ(x) = ((x + 8)/5)³.
• A sum of kernels, here using a simple RBF kernel and a quadratic feature function: K(a, b) = 20 · exp(−(a − b)²/2) + φ(a)⊤φ(b), with φ(x) = [1 x x²]⊤.

We can also learn kernels using similar techniques as we


have seen before, by parametrizing the mean and/or the kernels:
$$m_\theta(x) = \phi(x)^\top\theta, \qquad K_\theta(a, b) = \theta_1\exp\left(-\frac{(a - b)^2}{2\theta_2^2}\right).$$

We can use Maximum Likelihood, as we saw for learning features.


However, in this case, the number of parameters we need to fit to
learn the kernel is typically less than the number of parameters
we need for learning the features as described in the last chapter.
To learn the features, we needed to put some parameters for each
feature to be learned. The kernel already takes care of selecting
individual features, so we only need a parameter to specify the
family of features to pick from.

Gaussian Processes are defined as follows:

Definition 31 (Gaussian Process). Let µ : X → R be any function and K : X × X → R be a Mercer kernel. A Gaussian Process p(f) = GP(f; µ, K) is a probability distribution over the function f : X → R such that every finite restriction to function values f_X = [f_{x_1}, …, f_{x_n}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, K_XX).

A Gaussian Process is a prior over function values, to be used in Bayesian inference with a likelihood in order to get a posterior. Given a prior p(f) = GP(f; m, K) and a likelihood p(y | f) = N(y; f_X, Λ), the posterior is given by
$$p(f \mid y) = \mathcal{GP}\big(f_{x'};\; m_{x'} + K_{x'X}(K_{XX} + \Lambda)^{-1}(y - m_X),\; K_{x'x'} - K_{x'X}(K_{XX} + \Lambda)^{-1}K_{Xx'}\big).$$
The term Gaussian does not refer to the use of a specific kernel,
but to the probability distribution over the function values. There
are many kernels, and new ones can be constructed from combina-
tions of known ones. As we have seen, they can be learned just like
features using MAP or numerical integration.

Condensed content

• Prominent examples for kernels:
  – k(a, b) = exp(−(a − b)²) (Gaussian / Square Exponential / RBF kernel)
  – k(a, b) = min(a − t₀, b − t₀) (Wiener process)
  – k(a, b) = ⅓ min³(a − t₀, b − t₀) + ½ |a − b| · min²(a − t₀, b − t₀) (cubic spline kernel)
  – k(a, b) = (2/π) sin⁻¹( 2a⊤b / √((1 + 2a⊤a)(1 + 2b⊤b)) ) (Neural Network kernel, Williams 1998)

Gaussian Processes

• Sometimes it is possible to consider infinitely many features at


once, by extending from a sum to an integral. This requires some
regularity assumption about the features’ locations, shape, etc.

• The resulting nonparametric model is known as a Gaussian


process

• Inference in GPs is tractable (though at polynomial cost O( N 3 )


in the number N of datapoints)

• Gaussian processes are an extremely important basic model for


supervised regression. They have been used in many practical
applications, but also provide a theoretical foundation for more
complicated models, e.g. in classification, unsupervised learning,
deep learning.
Understanding Kernels v

Warning: The following are simplified expositions! Some regularity assumptions have been dropped for easier readability. For the full story of the relationship of GPs and kernel methods, check out: Kanagawa, Hennig, Sejdinovic, and Sriperumbudur. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. CoRR, 2018. URL arxiv.org/abs/1807.02582. For a deeper treatment, you can check the following: Schölkopf and Smola. Learning with Kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002; Rasmussen and Williams. Gaussian processes for machine learning. MIT Press, 2006.

Gaussian Processes and Kernel Methods are tightly related concepts. The goal of this chapter is to show the connections between these two methods as representatives of the two dominating schools of thought in Machine Learning – the probabilistic and the statistical view. In order to achieve this goal, we will cover the following three questions:

• Can kernels be thought of as "infinitely large matrices"?
• What is the connection between kernel machines and Gaussian Processes?
• If Gaussian Processes use infinitely many features, can they learn every function?

Kernels as infinitely large matrices v

Let us start things off with a couple of Linear Algebra refreshers.

Definition 32 (Eigenvalue). Let A ∈ Rn×n be a matrix. A scalar


λ ∈ C and vector v ∈ Cn are called eigenvalue and corresponding
eigenvector if
$$[Av]_i = \sum_{j=1}^{n}[A]_{ij}[v]_j = \lambda[v]_i.$$

Theorem 33 (Spectral theorem for symmetric positive-definite ma-


trices). The eigenvectors of symmetric matrices (A = A| ) are real,
and form the basis of the image of A. A symmetric positive def-
inite matrix A can be written as a Gramian (outer product) of the
eigenvectors:
$$[A]_{ij} = \sum_{a=1}^{n}\lambda_a[v_a]_i[v_a]_j \quad\text{and}\quad \lambda_a > 0\;\;\forall a = 1, \dots, n.$$

Now, the question is: can we somehow generalize these two


statements to our objects of interest – the kernels? Since the kernels
inherently embody a feature transformation of the inputs from a
finite to an infinitely-dimensional space, we turn to a generalization
from eigenvectors (which are generally finite), to eigenfunctions
(which can be thought of as infinitely long vectors).

Definition 34 (Eigenfunction). A function φ : X → R and a scalar


λ ∈ C that obey
$$\int k(x, \tilde{x})\,\phi(\tilde{x})\,d\nu(\tilde{x}) = \lambda\,\phi(x)$$

are called an eigenfunction and an eigenvalue of k with respect to ν.

Theorem 35 (Mercer, 1909). Let (X, ν) be a finite measure space and


k : X × X → R be a continuous (Mercer) kernel. Then, there exist
eigenvalues/functions (λi , φi )i∈ I w.r.t. ν such that I is countable,
all λi are real and non-negative, the eigenfunctions can be made
orthonormal, and the following series converges absolutely and
uniformly ν2 -almost-everywhere:
$$k(a, b) = \sum_{i\in I}\lambda_i\,\phi_i(a)\,\phi_i(b) \qquad \forall a, b \in X.$$

In short, kernels have eigenfunctions, just like matrices have


eigenvectors. Moreover, Mercer’s theorem states that the eigenfunc-
tions generate the kernels.
In the sense of Mercer’s theorem, one may vaguely think of a
kernel k : X × X → R evaluated at k( a, b) for a, b ∈ X as the
“element” of an “infinitely large” matrix k ab .
However, notice that this interpretation is only relative to the
measure ν : X → R – by changing the measure, we obtain different
eigenfunctions for the kernel.
Lastly, notice that Mercer’s Theorem is not a constructive state-
ment – it simply states that the kernel could be decomposed through
the eigenfunctions, but doesn’t say anything about how to explicitly
compute these objects.
In general, it is not straightforward to find the eigenfunctions.
However, in the special case of stationary kernels, Salomon Bochner
showed that it is in fact possible to determine the eigenfunctions.

Definition 36 (Stationary kernel). A kernel k ( a, b) is called station-


ary if it can be written as

k ( a, b) = k(τ ) with τ := a − b

Theorem 37 (Bochner’s theorem (simplified)). A complex-valued


function k on RD is the covariance function of a weakly stationary
mean square continuous complex-valued random process on RD if,
and only if, its Fourier transform is a probability (i.e. finite positive)
measure µ:
$$k(\tau) = \int_{\mathbb{R}^D} e^{2\pi i s^\top\tau}\,d\mu(s) = \int_{\mathbb{R}^D} e^{2\pi i s^\top a}\left(e^{2\pi i s^\top b}\right)^*\,d\mu(s).$$

By thinking of the product between $e^{2\pi i s^\top a}$ and $\left(e^{2\pi i s^\top b}\right)^*$ as an outer product of orthonormal basis functions, we can start to see
an outer product of orthonormal basis functions, we can start to see
the realization of Mercer’s theorem. If we ignore a few caveats (in
particular, Mercer’s theorem talks about countable sets, whereas
Bochner’s theorem is related to uncountable sets), we can inter-
pret these Fourier functions as the eigenfunctions of the stationary
kernel.

This crucial insight has been used to perform a linear-time approx-


imation to Gaussian Process Regression (Rahimi & Recht, NeurIPS
2008), which in the general case is cubic in the number of data
points N.
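A hedged sketch of that construction, using random Fourier features for the square-exponential kernel (the number of features and the lengthscale are arbitrary illustration choices):

import numpy as np

def random_fourier_features(x, n_features=200, lengthscale=1.0, rng=None):
    # Monte Carlo approximation of the RBF kernel: k(a, b) ~ z(a) @ z(b).
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.atleast_1d(x)
    omega = rng.standard_normal(n_features) / lengthscale    # samples from the spectral (Bochner) measure
    tau = rng.uniform(0.0, 2 * np.pi, n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(x[:, None] * omega[None, :] + tau[None, :])

a = np.linspace(-3, 3, 5)
Z = random_fourier_features(a)
K_approx = Z @ Z.T     # approximates exp(-(a_i - a_j)^2 / (2 lengthscale^2))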

Connection between Kernel Machines and Gaussian Processes v

Many different methods are equivalent or closely related to Gaus-


sian process regression, such as: Kriging (in geosciences), Kernel
ridge regression, Wiener–Kolmogorov prediction, or even Linear
Least-Squares. In this section, for a moment we want to step out of
the probabilistic framework, and observe how these methods relate
to Gaussian Processes.

The Gaussian Process Posterior Mean is a Kernel ridge estimate


In the previous chapters we saw that the posterior over function
values takes the following form:

$$p(f_x \mid y) = \frac{p(y \mid f_X)\,p(f)}{p(y)} = \frac{\mathcal{N}(y; f_X, \sigma^2 I)\,\mathcal{GP}(f_{x,X}; m, k)}{\mathcal{N}(y; m_X, k_{XX} + \sigma^2 I)} = \mathcal{GP}\big(f_x;\; m_x + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(y - m_X),\; k_{xx} - k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx}\big).$$
We can obtain the expectation (i.e. the mean in the Gaussian setting) of the functions at the explicit locations X by maximizing the posterior:
$$\mathbb{E}_{p(f_X\mid y)}(f_X) = \arg\max_{f_X\in\mathbb{R}^{|X|}} p(f_X \mid y) = \arg\min_{f_X} -p(f_X \mid y) = \arg\min_{f_X} -\log p(f_X \mid y) = \arg\min_{f_X} \frac{1}{2\sigma^2}\|y - f_X\|^2 + \frac{1}{2}\|f_X - m_X\|_k^2, \quad\text{where } \|f_X\|_k^2 := f_X^\top k_{XX}^{-1}f_X.$$

This leads us to the following realization: the posterior mean esti-


mator of Gaussian (process) regression is equal to the regularized
least-squares estimate with the regularizer ‖f‖²_k. This is also known
as the kernel ridge estimate.
Now, let’s dive even deeper into this relationship between the
statistical and probabilistic schools of thought, by first introducing
the definition on Reproducing Kernel Hilbert Spaces.

Definition 38 (Reproducing Kernel Hilbert Space (RKHS)). Let


H = (X, ⟨·, ·⟩) be a Hilbert space of functions f : X → R. Then H is
called a Reproducing Kernel Hilbert Space if there exists a kernel
k : X × X → R s.t.

1. ∀ x ∈ X : k(·, x ) ∈ H

2. ∀ f ∈ H : ⟨f(·), k(·, x)⟩_H = f(x) (k reproduces H)

Nevertheless, this is a slightly abstract definition. A more useful


viewpoint for our purpose is the following theorem:

Theorem 39 (Reproducing kernel map representation). Let X, ν, (φi , λi )i∈ I


be defined as before. Let ( xi )i∈ I ⊂ X be a countable collection of
points in X. Then the RKHS can also be written as the space of
linear combinations of kernel functions:
$$\mathcal{H}_k = \left\{ f(x) := \sum_{i\in I}\tilde{\alpha}_i\,k(x_i, x)\right\} \quad\text{with}\quad \langle f, g\rangle_{\mathcal{H}_k} := \sum_{i\in I}\frac{\tilde{\alpha}_i\tilde{\beta}_i}{k(x_i, x_i)}$$

Given the above theorem, consider the Gaussian process p( f ) =


GP (0, k) with likelihood p(y | f , X ) = N (y; f X , σ2 I ). Then, the
RKHS is the space of all possible posterior mean functions

$$\mu(x) = k_{xX}\underbrace{(k_{XX} + \sigma^2 I)^{-1}y}_{:=w} = \sum_{i=1}^n w_i\,k(x, x_i) \quad\text{for } n\in\mathbb{N}.$$

This means that we can think of the RKHS as the space that is
spanned by the posterior mean functions of GP regression.

Figure 30: Posterior mean of a GP with RBF kernel, represented as a sum of individual Gaussians, centered at the observed points, scaled by their posterior weights.

Given a particular dataset, we can use the reproducing kernel map representation to express the posterior mean as a sum of functions (see Fig. 30). Now, let's formally state our findings:

Theorem 40 (The Kernel Ridge Estimate). Consider the model


p( f ) = GP ( f ; 0, k), p(y | f ) = N (y; f X , σ2 I ). The posterior mean

m( x ) = k xX (k XX + σ2 I )−1 y

is the element of the RKHS H_k that minimizes the regularized ℓ₂ loss
$$L(f) = \frac{1}{\sigma^2}\sum_i\big(f(x_i) - y_i\big)^2 + \|f\|_{\mathcal{H}_k}^2.$$

GP’s expected square error is the RKHS’s worst case square error
At this point, one might say that the Bayesian and the frequentist
viewpoint are roughly the same. However, notice that when we talk
about the posterior distribution of a Gaussian process regression,
we are not just talking about the posterior mean (the mode), but we
also consider the width – thus encapsulating the entire probability
distribution. This ability to quantify uncertainty is often seen as the
main selling point of the probabilistic framework – by keeping track
of the remaining volume of hypotheses, we can be certain about
our estimate. A natural question that arises is if there is a statistical
interpretation of the posterior variance in the Gaussian process
framework.
For a moment, suppose that we have noise-free observations.
Given this assumption, let’s observe how far the posterior mean
could be from the truth in a given RKHS:
 2
2  −1 
sup m( x ) − f ( x ) = sup ∑ f ( xi ) [KXX k( X, x )]i − f ( x )
f ∈H,k f k≤1 f ∈H,k f k≤1 i | {z }
wi
* +2
reproducing property: = sup ∑ wi k(·, xi ) − k(·, x), f (·)
i H
2


Cauchy-Schwartz: (|h a, bi| ≤ k ak · kbk) = ∑ wi k(·, xi ) − k(·, x )
i
H
reproducing property: = ∑ wi w j k( xi , x j ) − 2 ∑ wi k( x, xi ) + k( x, x )
ij i
−1
= k xx − k xX KXX k Xx = E|y [( f x − µ x )2 ]

which is exactly the posterior variance of the Gaussian process. Let


us further formalize this:

Theorem 41. Assume p( f ) = GP ( f ; 0, k) and noise-free observa-


tions p(y | f ) = δ(y − f X ). The GP posterior variance (the expected
square error)

v(x) := E_{p(f|y)}[(f(x) − m(x))²] = k_xx − k_xX K_XX^{−1} k_Xx

is a worst-case bound on the divergence between m( x ) and an RKHS


element of bounded norm:

v( x ) = sup (m( x ) − f ( x ))2


f ∈Hk ,k f k≤1

To reiterate, the GP’s expected square error is the RKHS’s worst-


case square error for a bounded norm.

Samples from the posterior GP are not in the RKHS


In the third aspect of our comparison between the probabilistic and
statistical frameworks, we look at the samples from the Gaussian
process posterior. For that purpose, let us introduce yet a third
representation of the RKHS, in terms of eigenfunctions:

Theorem 42 (Mercer Representation). Let X be a compact metric


space, k be a continuous kernel on X, ν be a finite Borel measure
whose support is X. Let (φi , λi )i∈ I be the eigenfunctions and values
of k w.r.t. ν. Then the RKHS Hk is given by
H_k = { f(x) := ∑_{i∈I} α_i λ_i^{1/2} φ_i(x)  such that  ‖f‖²_{H_k} := ∑_{i∈I} α_i² < ∞ }   with   ⟨f, g⟩_{H_k} := ∑_{i∈I} α_i β_i

for f = ∑_{i∈I} α_i λ_i^{1/2} φ_i and g = ∑_{i∈I} β_i λ_i^{1/2} φ_i.
(A compact space, simplified, is a space that is both bounded (all points have finite distance from each other) and closed (it contains all limits). wikipedia.org/wiki/Compact_space)

Furthermore, let us introduce one more theorem that will allow us to sample from the Gaussian Process:
Theorem 43 (Karhunen-Loève Expansion). Let X be a compact metric space, k : X × X → R a continuous kernel, ν a finite Borel measure whose support is X, and (φ_i, λ_i)_{i∈I} as above. Let (z_i)_{i∈I} be a collection of iid. standard Gaussian random variables:

z_i ∼ N(0, 1) and E[z_i z_j] = δ_ij, for i, j ∈ I.

Then (simplified!):

f(x) = ∑_{i∈I} z_i λ_i^{1/2} φ_i(x) ∼ GP(0, k).

Using these two statements, we obtain the following crucial


result:

Corollary 44 (Wahba, 1990. Proper proof in Kanagawa et al., Thm. 4.9).


If I is infinite, f ∼ GP (0, k ) implies almost surely f 6∈ Hk .
To see this, note
E(‖f‖²_{H_k}) = E( ∑_{i∈I} z_i² ) = ∑_{i∈I} E[z_i²] = ∑_{i∈I} 1 ≮ ∞

To reiterate, even though we managed to (seemingly) write down


the samples from the Gaussian Process in the Mercer Representa-
tion, the corollary proves that they are in fact not in the RKHS.
At this point, a natural question that arises is if there is a way to
modify the RKHS space such that it will include the samples from
the Gaussian Process. The following theorem states the sufficient
conditions for this to apply:

Theorem 45 (Kanagawa, 2018. Restricted from Steinwart, 2017, itself


generalized from Driscoll, 1973). Let Hk be a RKHS and 0 < θ ≤ 1.
Consider the θ-power of Hk given by
H_k^θ = { f(x) := ∑_{i∈I} α_i λ_i^{θ/2} φ_i(x)  such that  ‖f‖²_{H_k^θ} := ∑_{i∈I} α_i² < ∞ }   with   ⟨f, g⟩_{H_k^θ} := ∑_{i∈I} α_i β_i.

Then,

∑_{i∈I} λ_i^{1−θ} < ∞   ⇒   f ∼ GP(0, k) ∈ H_k^θ with prob. 1

Can Gaussian Process regressors learn every function? v

Kernels for which the RKHS lies dense in the space of all continu-
ous functions are known as universal kernels. One such example is
the square-exponential (also known as Gaussian, or RBF kernel):
k(a, b) = exp( −(a − b)²/2 )
When using such kernels for GP/kernel-ridge regression, for any
continuous function f and ε > 0, there is an RKHS element f̂ ∈ H_k
such that ‖f − f̂‖ < ε (where ‖·‖ is the maximum norm on a
compact subset of X).

[Figure 31: Prior for the Gaussian Process regression which is tasked with learning the target function (colored in black).]

However, notice that the above statement doesn’t say anything


about the rate of convergence. Let’s illustrate this problem with an
example. Starting from our prior (see Fig. 31), let's see the shape
of the posterior as we gradually sample points from the true underlying
function and fit our model.

[Figure 32: Posterior for the Gaussian Process regression after observing 2 samples from the target function.]

In the beginning, after sampling 2-10 points from the target


function (see Fig. 32 and Fig. 33) we can confidently say that the GP
regressor adequately learns the shape of the target function given
the modest size of the dataset. However, after fitting the model to
20 points from the target function, we notice that things go horribly
wrong (Fig. 34). Unfortunately, this behavior is present even when
we increase our dataset to 500 evaluations (see Fig. 35).

[Figure 33: Posterior for the Gaussian Process regression after observing 10 samples from the target function.]

[Figure 34: Posterior for the Gaussian Process regression after observing 20 samples from the target function.]

[Figure 35: Posterior for the Gaussian Process regression after observing 500 samples from the target function.]

In fact, there are two main aspects that are wrong with the observed properties of our estimator. The first one is that, as we increase the number of evaluations, we start seeing the mean of the posterior deviating ever further from the true function, thus increasing the overall error. The second one is that the uncertainty contracts, meaning that the algorithm becomes more and more certain of the predictions it is making, even though the posterior looks nothing like the true function.

Note that the statements about universality made at the beginning of the section still apply. It is just that the convergence rate of the algorithm is terribly low. If f is "not well covered" by the RKHS, the number of data points required to achieve ε error can be exponential in ε. Outside of the observation range, there are no guarantees at all. The following technical theorem defines the notions of convergence precisely:

[Figure 36: Convergence rate of the GP regressor, ‖f − m‖² against the number of function evaluations. The golden lines indicate a rate of O(1/√n).]

Theorem 46 (v.d. Vaart & v. Zanten, 2011). Let f_0 be an element of the Sobolev space W_2^β([0,1]^d) with β > d/2. Let k_s be a kernel on [0,1]^d whose RKHS is norm-equivalent to the Sobolev space W_2^s([0,1]^d) of order s := α + d/2 with α > 0. If f_0 ∈ C^β([0,1]^d) ∩ W_2^β([0,1]^d) and min(α, β) > d/2, then we have

E_{D_n | f_0} [ ∫ ‖f − f_0‖²_{L_2(P_X)} dΠ_n(f | D_n) ] = O( n^{−2 min(α,β)/(2α+d)} )   (n → ∞),   (1)

where E_{D_n | f_0} denotes expectation with respect to D_n = (x_i, y_i)_{i=1}^n with the model x_i ∼ P_X and p(y | f_0) = N(y; f_0(X), σ²I), and Π_n(f | D_n) the posterior given by GP-regression with kernel k_s.

(The Sobolev space W_2^s(X) is the vector space of real-valued functions over X whose derivatives up to s-th order have bounded L_2 norm; L_2(P_X) is the Hilbert space of square-integrable functions with respect to P_X. wikipedia.org/wiki/Sobolev_space)

The important takeaway from the theorem is: If f 0 is from a


sufficiently smooth space, and Hk is “covering” that space well,
then the entire GP posterior (including the mean!) can contract
around the true function at a linear rate. Gaussian Processes are
“infinitely flexible” as they can learn infinite-dimensional functions
arbitrarily well.

Condensed content

• Gaussian process regression is closely related to kernel ridge


regression.

– the posterior mean is the kernel ridge / regularized kernel


least-squares estimate in the RKHS Hk .

m(x) = k_xX (k_XX + σ²I)^{−1} y = arg min_{f∈H_k} ‖y − f_X‖² + ‖f‖²_{H_k}

– the posterior variance (expected square error) is the worst-


case square error for bounded-norm RKHS elements.

v(x) = k_xx − k_xX (k_XX)^{−1} k_Xx = max_{f∈H_k, ‖f‖_{H_k}≤1} (f(x) − m(x))²

• Similar connections apply for most kernel methods.

• GPs are quite powerful: They can learn any function in the
RKHS (a large, generally infinite-dimensional space!)

• GPs are quite limited: If f 6∈ Hk , they may converge very (e.g. ex-
ponentially) slowly to the truth.

• But if we are willing to be cautious enough (e.g. with a rough


kernel whose RKHS is a Sobolev space of low order), then poly-
nomial rates are achievable. (Unfortunately, exponentially slow
in the dimensionality of the input space)

For a practical and hands-on example of using Gaussian Process


Regression, we recommend watching the following comprehensive
lecture v .
Gauss-Markov Models v

Graphical models and the (conditional) independence


structure they entail proved to be a crucial concept that
allows for tractable inference. In this lecture we revisit these ideas
and combine them with some of the pivotal properties of the Gaussian
distribution discussed in the preceding lectures.

Time Series v

In lecture 2 we became familiar with the following "chain" graphical model, along with the independence it entails:

p(A, B, C)                    DAG            Independence     But!
p(C | B) p(B | A) p(A)        A → B → C      A ⊥⊥ C | B        A ̸⊥⊥ C

We notice that its design is suggestive of an underlying tempo-


ral structure. We can think of these graphs as a representation of a
process that evolves over time. Due to its inherent conditional inde-
pendence structure, the process has finite memory – what happens
at the next time step only depends on what the current situation in
the world is, thus decoupling the prediction for future states from
the values of the past states.

Definition 47 (Time series). A time series is a sequence [y(t_i)]_{i∈N}
of observations y_i := y(t_i) ∈ Y, indexed by a scalar variable
t ∈ R. In many applications, the time points ti are equally spaced:
ti = t0 + i · δt . Models that account for all values t ∈ R are called
continuous time, while models that only consider [ti ]i∈N are called
discrete time.

Some examples include: climate and weather observations, sen-


sor reading in cars, EEG, ECG, stock prices, and many more.
Inference in time series often has to happen in real-time, and
scale to an unbounded set of data, typically on small-scale or em-
bedded systems. For these reasons, it has to be of (low) constant
time and memory complexity.
Now, let us introduce a slight change of notation that will prove
to be convenient:

• In the previous lectures, we had observations y ∈ RD at N


locations x ∈ X. We also assumed latent function f ∈ R M , such
that y ≈ H f ( x ).

• In our current setting, the notion of a local finite memory only


works in an ordered space of inputs. Thus, we impose X ⊂ R.

• This leads us to the following updates: We observe y1 , . . . , y N


with yi ∈ RD at times [t1 , . . . , t N ] with ti ∈ R. Furthermore, we
assume a latent state x_i ∈ R^M, such that y_i ≈ Hx(t_i). In the current
setting, the state x will constitute the local memory.
Given that we have introduced the necessary change in notation, let
us turn to a fundamental definition that we will build upon.
Definition 48 (Markov chain). A joint distribution p( X ) over a
sequence of random variables X := [ x0 , . . . , x N ] is said to have the
Markov property if

p ( x i | x 0 , x 1 , . . . , x i −1 ) = p ( x i | x i −1 ).

The sequence is then called a Markov chain.


We now want to explore how the conditional independence
structure that is encoded in this graph has an effect on inference
algorithms for time series problems. In particular, we will make the
following two assumptions:
1. The joint distribution over the latent variables x has the Markov
property: p(x_t | X_0:t−1) = p(x_t | x_t−1)

2. The observations y are local – each y_t only depends on the latent
x_t: p(y_t | X) = p(y_t | x_t)

[Figure 37: The graphical model under the two assumptions: a chain x_0 → x_1 → ... → x_t over the latent states, with each observation y_t attached to its state x_t.]

In the typical predictive setting of these problems, given observed
data Y_0:t−1 = (y_0, y_1, ..., y_t−1), we first want to infer the current
latent state x_t:
p(x_t | Y_0:t−1) = ∫ p(X) p(Y_0:t−1 | X) ∏_{j≠t} dx_j  /  ∫ p(X) p(Y_0:t−1 | X) dX

= ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] p(x_t | x_t−1) [∏_{j&gt;t} p(x_j | x_j−1)] ∏_{j≠t} dx_j
  /  ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] p(x_t | x_t−1) [∏_{j&gt;t} p(x_j | x_j−1)] dX

= ∫ p(x_t | x_t−1) p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] ∏_{j&lt;t} dx_j
  /  ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] ∏_{j≤t} dx_j

= ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1

A simpler way to arrive at this result is to note that the joint is:

p( xt , xt−1 |y1:t−1 ) = p( xt | xt−1 , y1:t−1 ) p( xt−1 |y1:t−1 ) = p( xt | xt−1 ) p( xt−1 |y1:t−1 )

Then, we can integrate over xt−1 to obtain the Chapman-Kolmogorov


equation, which embodies the predict step of the inference process:
Z
p( xt |y1:t−1 ) = p( xt | xt−1 ) p( xt−1 |y1:t−1 )dxt−1 .

After observing the new data point yt , we want to update our


belief for xt :
p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1)  /  ∫ p(y_t | x_t) p(x_t | Y_0:t−1) dx_t

Unsurprisingly, this is known as the update step.


The process of alternating between predicting and updating,
known as filtering, can be performed in linear time in the number of
data points (remember that Gaussian Process is cubic in the number
of data points). Interestingly, if we were only ever interested in
predicting (which is often the case), then we are practically done.
Nevertheless, there are situations in which we want to look back
through time and correct the predictions that we made, by intro-
ducing information from future observations (y0 , y1 , . . . , y T ):
p(x_t | Y) = ∫ p(x_t, x_t+1 | Y) dx_t+1 = ∫ p(x_t | x_t+1, Y) p(x_t+1 | Y) dx_t+1

Let us further explore the quantity p( xt | xt+1 , Y ):

p(x_t | x_t+1, Y) = p(Y_t+1:n | x_t+1, x_t, Y_0:t) p(x_t | x_t+1, Y_0:t)  /  ∫ p(Y_t+1:n | x_t+1, x_t, Y_0:t) p(x_t | x_t+1, Y_0:t) dx_t

                  = p(Y_t+1:n | x_t+1, Y_0:t) p(x_t | x_t+1, Y_0:t)  /  ∫ p(Y_t+1:n | x_t+1, Y_0:t) p(x_t | x_t+1, Y_0:t) dx_t  =  p(x_t | x_t+1, Y_0:t)

p(x_t | x_t+1, Y_0:t) = p(x_t, x_t+1 | Y_0:t) / p(x_t+1 | Y_0:t) = p(x_t+1 | x_t, Y_0:t) p(x_t | Y_0:t) / p(x_t+1 | Y_0:t) = p(x_t+1 | x_t) p(x_t | Y_0:t) / p(x_t+1 | Y_0:t)

Plugging back in, we obtain:


p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1

This process is formally known as smoothing, and its complexity is


also linear in the number of data points.
To summarize, Markov Chains formalize the notion of a stochas-
tic process with a local finite memory. Inference over Markov Chains
separates into three operations that can be performed in linear time:

Filtering: O(T)

predict:  p(x_t | Y_0:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1        (Chapman-Kolmogorov Eq.)
update:   p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1) / p(y_t)

Smoothing: O(T)

smooth:   p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1

To rewrite what we did, we constructed the following inference


algorithm:

1   procedure Inference(Y, p(x_0), p(x_t | x_t−1) ∀t, p(y_t | x_t) ∀t)
2       for t = 1, ..., n do                                                      Filtering
3           p(x_t | Y_0:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1         Predict
4           p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1) / p(y_t)               Update
5       end for
6       for t = n−1, ..., 0 do                                                    Smoothing
7           p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1
8       end for
9       return p(x_t | Y) ∀t = 0, ..., n                                          return all marginals
10  end procedure

Gauss-Markov Models v

The inference algorithm that we have just described is abstract, in


the sense that we never specified any of the probability distribu-
tions. In order to provide a full implementation, we make use of the
prominent Gaussian Distribution.
Given that all relationships between the variables are linear and
Gaussian, the assumptions about the model translate as follows:
1. p( x (ti+1 ) | X1:i ) = N ( xi+1 ; Axi , Q) , with p( x0 ) = N ( x0 ; m0 , P0 )

2. p(y_i | X) = N(y_i; Hx_i, R)
Under these assumptions, we obtain the following concrete imple-
mentations for the three steps of the inference algorithm:
predict:  p(x_t | Y_1:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_1:t−1) dx_t−1
                           = ∫ N(x_t; Ax_t−1, Q) N(x_t−1; m_t−1, P_t−1) dx_t−1
                           = N(x_t; Am_t−1, AP_t−1A^| + Q)
                           = N(x_t; m_t^−, P_t^−)

update:   p(x_t | Y_1:t) = p(y_t | x_t) p(x_t | Y_1:t−1) / p(y_t)
                         = N(y_t; Hx_t, R) N(x_t; m_t^−, P_t^−) / N(y_t; Hm_t^−, HP_t^−H^| + R)
                         = N(x_t; m_t^− + Kz, (I − KH)P_t^−)
                         = N(x_t; m_t, P_t),   where
                           K := P_t^−H^|(HP_t^−H^| + R)^{−1}   (gain)
                           z := y_t − Hm_t^−                    (residual)

smooth:   p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_1:t) dx_t+1
                     = N(x_t; m_t, P_t) ∫ N(x_t+1; Ax_t, Q) N(x_t+1; m^s_t+1, P^s_t+1) / N(x_t+1; m^−_t+1, P^−_t+1) dx_t+1
                     = N(x_t; m_t + G_t(m^s_t+1 − m^−_t+1), P_t + G_t(P^s_t+1 − P^−_t+1)G_t^|)
                     = N(x_t; m^s_t, P^s_t),   where
                       G_t := P_tA^|(P^−_t+1)^{−1}   (smoother gain)

In the Gauss-Markov setting, the filter and smoother are known as


the Kalman Filter and the Rauch-Tung-Striebel Smoother respec-
tively.

This framework has had such a great impact across various domains
that the variables have standard names throughout the literature:

(Kalman) Filter:

p(x_t) = N(x_t; m_t^−, P_t^−)                    predict step
m_t^− = Am_t−1                                   predictive mean
P_t^− = AP_t−1A^| + Q                            predictive covariance

p(x_t | y_t) = N(x_t; m_t, P_t)                  update step
z_t = y_t − Hm_t^−                               innovation residual
S_t = HP_t^−H^| + R                              innovation covariance
K_t = P_t^−H^|S_t^{−1}                           Kalman gain
m_t = m_t^− + K_t z_t                            estimation mean
P_t = (I − K_tH)P_t^−                            estimation covariance

(Rauch-Tung-Striebel) Smoother:

p(x_t | Y) = N(x_t; m_t^s, P_t^s)                smooth step
G_t = P_tA^|(P_{t+1}^−)^{−1}                     RTS gain
m_t^s = m_t + G_t(m_{t+1}^s − m_{t+1}^−)         smoothed mean
P_t^s = P_t + G_t(P_{t+1}^s − P_{t+1}^−)G_t^|    smoothed covariance
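As a concrete illustration (our own sketch, not part of the original notes), the following numpy code implements the filter and smoother equations above for a generic linear-Gaussian model; all function and variable names are ours.

import numpy as np

def kalman_filter(Y, A, Q, H, R, m0, P0):
    # forward pass over the observations: predict, then update, O(T)
    m, P = m0, P0
    ms, Ps, ms_pred, Ps_pred = [], [], [], []
    for y in Y:
        m_pred = A @ m                                # predictive mean
        P_pred = A @ P @ A.T + Q                      # predictive covariance
        z = y - H @ m_pred                            # innovation residual
        S = H @ P_pred @ H.T + R                      # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
        m = m_pred + K @ z                            # estimation mean
        P = (np.eye(len(m0)) - K @ H) @ P_pred        # estimation covariance
        ms.append(m); Ps.append(P)
        ms_pred.append(m_pred); Ps_pred.append(P_pred)
    return ms, Ps, ms_pred, Ps_pred

def rts_smoother(ms, Ps, ms_pred, Ps_pred, A):
    # backward pass, also O(T)
    ms_s, Ps_s = [ms[-1]], [Ps[-1]]
    for t in range(len(ms) - 2, -1, -1):
        G = Ps[t] @ A.T @ np.linalg.inv(Ps_pred[t + 1])      # RTS gain
        ms_s.insert(0, ms[t] + G @ (ms_s[0] - ms_pred[t + 1]))
        Ps_s.insert(0, Ps[t] + G @ (Ps_s[0] - Ps_pred[t + 1]) @ G.T)
    return ms_s, Ps_s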

Stochastic Differential Equations v

In this section we explore the question of how this family of models


is related to the family of Gaussian Process regression models.
Up until now, we have only discussed predictions for function values at particular points in time that are discretely spaced away from each other. Nevertheless, time is a continuous object – so to answer the posed question, we have to think about what happens in between the observations.

In order to turn our Markov model into a continuous time model and relate it to Gaussian Process regression, we will take a pedestrian approach. (Warning: the following are drastic simplifications! The full proof requires a deep dive into Stochastic Processes, which are not the main focus of this course.)
We sample a set of states x from a joint Gaussian distribution
over discrete time points, by actually starting at a random point
x0 , and successively drawing points xt using the conditional p( xt |
xt−1 ) (see Fig. 38, where we sample every δt = 1, with Qδt = 1 and
A = 1).
Next, we increase the rate of sampling by drawing every δt = 1/2
(see Fig. 39). Notice that we have to adjust the variance Qδt = 1/2 in
order to prevent the values from exploding.
Now, we are interested in the resulting object for the limiting
case of δt → 0 (see Fig. 40). In particular, we would like to encode
that Qδt/δt approaches some kind of finite object (like a derivative;
however sample paths from this resulting (Wiener) process are

[Figure 38: Samples of the latent state x_t with δt = 1, Q_δt = 1.]

[Figure 39: Samples of the latent state x_t with δt = 1/2, Q_δt = 1/2.]

almost surely not differentiable). For this reason, we introduce a


new object Qdt := dω, known as the Wiener measure. Note that this
is a non-standard construction: dω can be defined more elegantly,
but this goes beyond the scope of this course.

[Figure 40: Samples of the latent state x_t with δt → 0, Q_δt = ???]

The important takeaway is that we started with a Markov chain,


explored the limiting case of sampling over a continuous time do-
main, and ended up with a Gaussian Process. Now, let us formalize
the relationship between these two families of models.

Definition 49 (Stochastic Differential Equation). The linear, time-



invariant Stochastic Differential Equation (SDE)

dx (t) = Fx (t) dt + L dωt ,

together with x (t0 ) = x0 , describes the local behavior of the


(unique) Gaussian process with

E(x(t)) =: m(t) = e^{F(t−t_0)} x_0,
cov(x(t_a), x(t_b)) =: k(t_a, t_b) = ∫_{t_0}^{min(t_a, t_b)} e^{F(t_a−τ)} LL^| e^{F^|(t_b−τ)} dτ

This GP is known as the solution of the SDE. It gives rise to the discrete-time stochastic recurrence relation p(x_{t_{i+1}} | x_{t_i}) = N(x_{t_{i+1}}; A_{t_i} x_{t_i}, Q_{t_i}) with

A_{t_i} = e^{F(t_{i+1}−t_i)}   and   Q_{t_i} = ∫_0^{t_{i+1}−t_i} e^{Fτ} LL^| e^{F^|τ} dτ.
Let us now look at two examples in order to better understand the statements above. Firstly, notice that by setting F = 0, L = θ, the SDE yields the scaled Wiener process, with

m(t) = x_0,   k(t_a, t_b) = θ²(min(t_a, t_b) − t_0),

along with a Markov chain whose parameters are

A = I,   Q_{t_i} = θ²(t_{i+1} − t_i).

In another example, by setting F = −1/λ, L = √(2/λ) θ, the SDE yields the Ornstein-Uhlenbeck process:

m(t) = x_0 e^{−(t−t_0)/λ},   k(t_a, t_b) = θ² ( e^{−|t_a−t_b|/λ} − e^{(2t_0−t_a−t_b)/λ} ),

along with a Markov chain whose parameters are

A = e^{−δt/λ},   Q_{t_i} = θ² ( 1 − e^{−2δt/λ} ).

(Side note: the exponential function also generalizes to matrices, e^X := ∑_{i=0}^∞ X^i / i!, with the following properties: e^0 = I, (e^X)^{−1} = e^{−X}, X = VDV^{−1} ⇒ e^X = V e^D V^{−1}, e^{diag_i d_i} = diag_i e^{d_i}, det e^X = e^{tr X}.)
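As a quick sanity check (our own sketch, not from the notes), the discrete-time recurrence for the Ornstein-Uhlenbeck example can be simulated directly; λ, θ and the step size below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
lam, theta, dt = 2.0, 1.0, 0.01                # OU time scale, output scale, step size
A = np.exp(-dt / lam)                          # transition A = exp(-dt / lambda)
Q = theta ** 2 * (1 - np.exp(-2 * dt / lam))   # process noise variance

x = [0.0]                                      # start at x(t0) = 0
for _ in range(1000):
    # draw x_{t+1} ~ N(A x_t, Q)
    x.append(A * x[-1] + np.sqrt(Q) * rng.standard_normal())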

Condensed content

(For more on Gaussian and approximately Gaussian filters see, e.g., Simo Särkkä, Bayesian Filtering and Smoothing, Cambridge University Press, 2013; https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf)

• Markov Chains capture finite memory of a time series through conditional independence

• Gauss-Markov models map this structure to linear algebra
• Kalman filter is the name for the corresponding algorithm

• SDEs (Stochastic Differential Equations) are the continuous-


time limit of discrete-time stochastic recurrence relations (in
particular, linear SDEs are the continuous-time generalization of
discrete-time linear Gaussian systems)

• Complexity of all necessary operations is linear, O( N ) in the


number of data points (as opposed to O( N 3 ) for general GPs).
(Although not shown, this includes hyperparameter inference!)
Gaussian Process Classification v

So far, we have seen how to model continuous outputs.


This consisted of learning the functional relation of an output y
given some features x as y ∼ p(y | f ( x )). In this chapter, we
see how to adapt this framework to do classification: given data
that can belong to two classes, say A and B, we want to learn the
probability that a sample with features x belongs to class A.

[Figure 41: Simple classification problem. It is possible to draw a line to perfectly separate the samples.]

[Figure 42: Imperfect linear classification problem. It is not possible to perfectly separate the two classes with a simple line and the boundary needs to be fuzzy.]

[Figure 43: Nonlinear classification problem. While it is not possible to find a single line to separate the two classes, a more complex boundary can be found.]

[Figure 44: Non-separable classification problem. The overlap between the two classes is too strong to be separable, but some structure does exist.]

To clarify the difference between regression and classification,


consider the two definitions;

Definition 50 (Regression). Given input-output pairs ( xi , yi )i=1,...,n


with xi ∈ X and yi ∈ Rd , we want to find a function f : X → Rd
such that f models yi ≈ f ( xi )

Definition 51 (Classification). Given input-output pairs ( xi , ci )i=1,...,n


with x_i ∈ X and c_i ∈ {0, 1, ..., d − 1}, we want to find a probability
π : X → U_d, where U_d = {p ∈ [0, 1]^d : ∑_{i=1}^d p_i = 1}, such that π
models c_i ≈ π(x_i).

For a first approach, we will only consider binary classification,


with y ∈ {−1, 1}:
p(y | x) = π(x)       if y = 1,
           1 − π(x)   if y = −1.

Note that this is very similar to regression, where we were tasked with learning p(y | x) = N(y; f_x, σ²). The main difference between the two is the domain: in classification y ∈ {−1, 1}, instead of y ∈ R. To account for this, we will introduce a link function. This simply is a mapping from a real value f ∈ R to a probability π ∈ [0, 1], using the sigmoid or logistic link function;

π_f = σ(f) = 1 / (1 + e^{−f}).

(The sigmoid has some very useful properties. In particular, it is symmetric, σ(f) = 1 − σ(−f); its inverse is easy to compute, f(π) = log π_f − log(1 − π_f); as is its derivative, ∂π_f/∂f = π_f(1 − π_f).)

We can use the logistic function on top of a Gaussian Process regression model to adapt it for classification, thus creating logistic regression. In particular, take a Gaussian Process prior over f, p(f) = GP(f; m, k), and use the likelihood

p(y | f_x) = σ(y f_x) = σ(f)       if y = 1,
                        1 − σ(f)   if y = −1.

[Figure 45: The sigmoid. Notice the transformation of the Gaussian probability density functions when passed through the sigmoid.]

A slight issue with this formulation is that the posterior is no


longer Gaussian. For a matrix X and a vector Y, we have

p(f_X | Y) = p(Y | f_X) p(f_X) / p(Y) = σ(Y f_X) N(f_X; m, k) / ∫ σ(Y f_X) N(f_X; m, k) df_X,

log p(f_X | Y) = −(1/2) f_X^⊤ K_XX^{−1} f_X + ∑_{i=1}^n log σ(y_i f_{x_i}) + const.

This makes the computation of the posterior more complex, and in


most cases not even tractable. The following figures show the prior,
likelihood and posterior of such a classification model.

[Figure 46: GP Prior on f. Note the Gaussian shape of the distribution.]

[Figure 47: The likelihood expressed through the sigmoid link function, which is non-Gaussian.]

[Figure 48: The resulting posterior on f. Note the non-Gaussian shape of the distribution.]

However we do not always need to compute the full posterior –


an approximation providing the “key aspects” of the posterior can
be sufficient. Sometimes we are only interested in the moments of
the joint p( f , y) = p(y| f ) p( f ), rather than the entire distribution:

The evidence:   E_{p(y,f)}[1] = ∫ 1 · p(y, f) df = ∫ p(y, f) df = Z,
The mean:       E_{p(f|y)}[f] = ∫ f · p(f | y) df = (1/Z) ∫ f · p(f, y) df = f̄,
The variance:   E_{p(f|y)}[f²] − f̄² = ∫ f² · p(f | y) df − f̄² = (1/Z) ∫ f² · p(f, y) df − f̄² = var(f).
Recall that Z can be useful for parameter tuning, f¯ provides a
useful point estimate, and var( f ) is a good estimate of the error
around f¯.

The Laplace Approximation

The Laplace approximation is a rough and local approximation to


the posterior. Even though it can be arbitrarily wrong, it is com-
putationally very efficient. Furthermore, it works well for logistic
regression because the log posterior is concave (see Fig. 49).
The Laplace approximation
 builds a Gaussian approximation
q(θ ) = N θ; µ̂, Σ̂ to a non-Gaussian probability distribution
p(θ ), which in our context can be a likelihood or a posterior.

[Figure 49: Laplace approximation (in black) to the posterior (in red).]

First, we find a (local) maximum to log p(θ ) (or, equivalently,


p(θ )) in order to determine the mean of q(θ ):
µ̂ = θ̂ = arg max_θ log p(θ),

Next, we determine the covariance of q(θ) as the negative inverse Hessian of log p(θ) at θ̂. For this purpose, we make use of a second-order Taylor expansion of log p(θ) around θ̂, i.e. θ = θ̂ + δ, in log space:

log p(θ) = log p(θ̂) + (1/2) δ^| [∇∇^| log p(θ̂)] δ + O(δ³),   with Ψ := ∇∇^| log p(θ̂).

(The first-order term vanishes because θ̂ is a maximum of log p.) Now, if we exponentiate both sides in the equation above, we (roughly) obtain:

p(θ) ≈ exp( (1/2) δ^| Ψ δ + const ) = exp( (1/2) (θ − θ̂)^| Ψ (θ − θ̂) + const ),

leading us to the following approximation:

q(θ) = N(θ; θ̂, −Ψ^{−1}) ≈ p(θ).
If p(θ ) is Gaussian, the approximation is exact and q(θ ) = p(θ ).
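To make the two steps concrete, here is a small sketch (ours, not from the notes) that Laplace-approximates a one-dimensional non-Gaussian density; the Gamma-shaped target and all names are arbitrary, and the Hessian is obtained by finite differences for simplicity.

import numpy as np
from scipy.optimize import minimize

def log_p(theta):
    # an (unnormalized) non-Gaussian log-density, here with a Gamma(3, 1) shape
    return 2.0 * np.log(theta) - theta

# step 1: find the mode theta_hat = argmax log p(theta)
res = minimize(lambda t: -log_p(t[0]), x0=[1.0], bounds=[(1e-6, None)])
theta_hat = res.x[0]

# step 2: curvature Psi at the mode, here via a central finite difference
eps = 1e-4
Psi = (log_p(theta_hat + eps) - 2 * log_p(theta_hat) + log_p(theta_hat - eps)) / eps ** 2

mu_hat = theta_hat          # Laplace mean
var_hat = -1.0 / Psi        # Laplace variance, -Psi^{-1}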

Application to Gaussian Process Logistic Regression


First, we find the maximum-a-posteriori for the latent f on the
training set:
f̂ = arg max_{f_X} log p(f_X | y)

and then assign a Gaussian posterior at the training points:

q(f_X) = N( f_X; f̂, −[ ∇²_{f_X} log p(f_X | y) |_{f_X = f̂} ]^{−1} ) =: N(f_X; f̂, Σ̂)

Now, we can approximate the posterior predictions of test points at


fx:

q(f_x | y) = ∫ p(f_x | f_X) q(f_X) df_X,
           = ∫ N( f_x; m_x + K_xX K_XX^{−1}(f_X − m_X), K_xx − K_xX K_XX^{−1} K_Xx ) q(f_X) df_X,
           = N( f_x; m_x + K_xX K_XX^{−1}(f̂ − m_X), K_xx − K_xX K_XX^{−1} K_Xx + K_xX K_XX^{−1} Σ̂ K_XX^{−1} K_Xx ).

Prediction probabilities can then be computed as:

π̂_x = E_{p(f_x|y)}[π_x] ≈ E_{q(f_x|y)}[π_x] = ∫ σ(f_x) q(f_x | y) df_x,

or alternatively:

π̂_x ≈ σ( E_q[f_x] ).

(Recall that if p(x) = N(x; m, V) and p(z | x) = N(z; Ax, B), then p(z) = ∫ p(z | x) p(x) dx = N(z; Am, AVA^⊤ + B).)

Caution: the two ways of computing the prediction probabilities are


not equivalent!

Implementing the Laplace Approximation for GP classification


First, let us write down our assumptions explicitly:
p(f) = GP(f; m, k),   p(y | f_X) = ∏_{i=1}^n σ(y_i f_{x_i}),   σ(z) = 1 / (1 + e^{−z})

Now, let us compute the log-posterior:

log p(f_X | y) = log p(y | f_X) + log p(f_X) − log p(y)
               = ∑_{i=1}^n log σ(y_i f_{x_i}) − (1/2)(f_X − m_X)^| K_XX^{−1} (f_X − m_X) + const

Next, we need to compute the gradient of the log-posterior:

∇ log p(f_X | y) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) − K_XX^{−1}(f_X − m_X),
  with   ∂ log σ(y_i f_{x_i}) / ∂f_{x_j} = δ_ij ( (y_i + 1)/2 − σ(f_{x_i}) )

From there, we need to compute the Hessian:


∇∇^| log p(f_X | y) = ∑_{i=1}^n ∇∇^| log σ(y_i f_{x_i}) − K_XX^{−1},
  with   ∂² log σ(y_i f_{x_i}) / ∂f_{x_a} ∂f_{x_b} = −δ_ia δ_ib σ(f_{x_i})(1 − σ(f_{x_i})) =: −δ_ia δ_ib w_i,   0 < w_i < 1,
=: −diag(w) − K_XX^{−1} = −(W + K_XX^{−1})     (concave maximization / convex minimization!)

Finally, all we need is an optimizer which will find the local max-
imum of the log-posterior. Practically, any optimizer that follows
the gradient will do the job. For the sake of illustration, we will
consider the second order Newton Optimization method, since it is
very efficient for convex optimization problems such as ours.
1   procedure GP-Logistic-Train(K_XX, m_X, y)
2       f ← m_X                                                       initialize
3       while not converged do
4           r ← (y + 1)/2 − σ(f) = ∇ log p(y | f_X)                    gradient of log likelihood
5           W ← diag(σ(f)(1 − σ(f))) = −∇∇ log p(y | f_X)              negative Hessian of log likelihood
6           g ← r − K_XX^{−1}(f − m_X)                                 compute gradient
7           H ← −(W + K_XX^{−1})^{−1}                                  compute inverse Hessian
8           ∆ ← Hg                                                     Newton step
9           f ← f − ∆                                                  perform step
10          converged ← ‖∆‖ < ε                                        check for convergence
11      end while
12      return f
13  end procedure

Note that this particular implementation can be numerically un-


stable as it (repeatedly) requires (W + K −1 )−1 . For a numerically
stable alternative, use B := I + W 1/2 KXX W 1/2 (cf. Rasmussen &
Williams).
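The following numpy sketch (our own, loosely following Rasmussen & Williams' Algorithm 3.1 and assuming a zero prior mean) shows how the B-matrix trick avoids forming (W + K^{-1})^{-1} explicitly; all names are ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_logistic_train(K, y, tol=1e-8, max_iter=100):
    # Newton iteration for the posterior mode; zero prior mean, labels y in {-1, +1}
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = sigmoid(f)
        r = (y + 1) / 2 - pi                            # gradient of the log likelihood
        W = pi * (1 - pi)                               # negative Hessian (diagonal entries)
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]   # B = I + W^{1/2} K W^{1/2}
        L = cholesky(B, lower=True)
        b = W * f + r
        # a = b - W^{1/2} B^{-1} W^{1/2} K b, via two triangular solves
        tmp = solve_triangular(L, sW * (K @ b), lower=True)
        a = b - sW * solve_triangular(L.T, tmp, lower=False)
        f_new = K @ a                                   # Newton update f = K a
        if np.max(np.abs(f_new - f)) < tol:
            return f_new, sW, L, r
        f = f_new
    return f, sW, L, r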

Lastly, we need to specify the procedure for making predictions:


1   procedure GP-Logistic-Predict(f̂, W, R, r, k, x)             f̂, W, R = Cholesky(B), r handed over from training
2       for i = 1, ..., length(x) do
3           f̄_i ← k_{x_i X} r                                    mean prediction (note: at the mode, 0 = ∇ log p(f_X | y) = r − K_XX^{−1}(f_X − m_X))
4           s ← R^{−1}(W^{1/2} k_{Xx_i})                          pre-computation allows this step in O(n²)
5           v ← k_{x_i x_i} − s^| s                               v = cov(f_x)
6           π̄_i ← ∫ σ(f_i) N(f_i; f̄_i, v) df_i                   predictive probability for class 1, p(y | f̄) = ∫ p(y_x | f_x) p(f_x | f̄) df_x
7       end for                                                   entire loop is O(n²m) for m test cases
8       return π̄_x
9   end procedure

[Figure 50: Pictorial view of GP logistic regression: the latent function f (top panel) and the resulting class probability π (bottom panel).]

Condensed content

• Gaussian Process Classification is a Supervised method phrased


in a discriminative model with probabilistic interpretation

• It models binary outputs as a transformation of a latent func-


tion with a Gaussian process prior

• due to non-Gaussian likelihood, the posterior is non-Gaussian;


exact inference intractable

• Laplace approximation: Find MAP estimator, second order ex-


pansion for Gaussian approximation

• tune code for numerical stability, efficient computations

• Laplace approximation provides Gaussian posterior on training


points, hence evidence, predictions
Generalized Linear Models v

Gaussian Processes are super adaptive inference ma-


chines. To understand this statement, we will explore three main
topics: first, we see if Support Vectors Machine have a probabilis-
tic interpretation by relating them to GP classification; next, we
move towards Generalized Linear Models, which can be structured
in a way to learn arbitrary functions; finally, we make yet another
connection to deep learning.

Connection to Support Vector Machines v

Let’s quickly recap what we did in the previous lecture. For the
purpose of classification, we made extensive use of the sigmoid link
function:
σ(x) = 1 / (1 + e^{−x})   with   dσ(x)/dx = σ(x)(1 − σ(x))
We were interested in extending Gaussian Process regression to the
classification setting, by constructing a Gaussian approximation
q( f X | y) for the (usually non-Gaussian) posterior distribution at
the training points. In order to find the mode of the posterior, we
had to compute the gradient of the log-posterior and set it to zero:
∇ log p(f_X | y) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) − K_XX^{−1}(f_X − m_X) = 0
⇒ K_XX^{−1}(f_X − m_X) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) = ∇ log p(y | f_X) =: r

Then at test time, recall that the mode of the approximation is given by:

E_q(f_x) = m_x + k_xX K_XX^{−1}(f̂_X − m_X) = m_x + k_xX r

Notice that in the last expression, the expected function value for the test point explicitly depends on the gradients of the log-likelihood of the training points.

In particular, observe the dashed blue line in Fig. 51 – the two training points in the middle have the largest value for the gradient. In turn, these points will have the highest contribution in determining the expected function value for a new test point.

[Figure 51: Towards Support Vector Machines. Notice how the two points around zero would already provide a strong support for a good classifier.]

So, x_i with |f_i| ≫ 1, where ∇ log p(y_i | f_i) ≈ 0, contribute almost nothing to E_q(f_x). On the other hand, the x_i with |f_i| < 1 can be considered as "support points".

This realization leads us to the idea to try and make the connection between GP classification and Support Vector Machines.

In the statistical framework, we are often interested in maximizing the posterior, which is equivalent to minimizing the negative log posterior:

−log p(f_X | y) = ∑_i −log σ(y_i f_i) + ‖f_X‖²_{K_XX}
                = ∑_i ℓ(y_i; f_i) + ‖f_X‖²_{K_XX}

[Figure 52: Hinge loss alongside the log likelihood for GP classification.]

By replacing the loss term with the Hinge Loss ℓ(y_i; f_i) = [1 − y_i f_i]_+, we obtain the Support Vector Machine learning algorithm.

At this point, one could postulate that the Hinge Loss is a limiting object for the negative log likelihood (see Fig. 52), for which the gradient for f > 1 is zero.
This would explain the phenomenon of support points in the
previous example of GP classification. In turn, this would mean
that we would have constructed a probabilistic interpretation of
Support Vector Machines.
Unfortunately, that is not the case, as the Hinge loss is not a (negative) log-likelihood:

exp(−ℓ(y_i; f_i)) + exp(−ℓ(−y_i; f_i)) = exp(−ℓ(f_i)) + exp(−ℓ(−f_i)) ≠ const

[Figure 53: The Hinge loss is not a log likelihood. Notice that while for every f we get σ(f) + σ(−f) = 1 (dotted red line), that is not the case for exp(−[1 − f]_+) + exp(−[1 + f]_+) ≠ const (dotted black line). This is a necessary requirement, since the likelihood is a probability distribution in the observed data (but not necessarily the latent parameters).]
We can conclude that SVMs are an example of a machine learn-


ing algorithm without a proper probabilistic interpretation. Never-
theless, the probabilistic view can help with intuition for the statisti-
cal interpretation.

Generalized Linear Models v

In this section, we look into various ways of extending the Laplace


approximation.
We first start off by extending GP classification to the multi-class
setting. The first necessary change that we need to introduce is to
adapt the latent function f to produce C outputs for each point. So,
at the n locations, the latent variables are:
f_X = [f_1^{(1)}, ..., f_n^{(1)}, f_1^{(2)}, ..., f_n^{(2)}, ..., f_1^{(C)}, ..., f_n^{(C)}]

At location xi , generate probabilities for each class by taking the


softmax:
p(y_i^{(c)} | f_i^{(c)}) = π_i^{(c)} = exp(f_i^{(c)}) / ∑_{c̃=1}^C exp(f_i^{(c̃)})
The remaining derivations are analogous to the binary case.
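As a small illustration (ours), the softmax link maps the C latent values at each location to a probability vector; the max-subtraction is only a standard numerical-stability trick.

import numpy as np

def softmax_link(F):
    # F has shape (n, C): latent function values f_i^(c) at n locations
    F = F - F.max(axis=1, keepdims=True)       # subtract max for numerical stability
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)    # rows are the class probabilities pi_i^(c)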

This example illustrates that, by changing the link function,


and making the appropriate adaptations, we can tune Gaussian
Processes to learn a very large class of probabilistic models.

In fact, the choice of link functions is practically limitless, as


they are only required to be continuous (and of course the Laplace
approximation should be meaningful for the given posterior). To
illustrate this point, see Fig. 54, where we make use of various link
functions.

[Figure 54: Towards Generalized Linear Models. Samples from the posterior latent function f are pushed through three different link functions: σ(x) = 1/(1 + e^{−x}), σ(x) = e^x, and σ(x) = (x/3)^8 exp(−2(x/3)).]

Definition 52 (Generalized Linear Model). (For our purposes,) a generalized linear model (GLM) is a probabilistic regression model for a function f with a Gaussian process prior p(f) and a non-Gaussian likelihood p(y | f_x). (Note the distinction to a general linear model: GP prior and Gaussian likelihood, with a non-linear kernel k.)
Until now, we have mainly focused on approximating the pos-
terior distribution. Let us show with an example that this is not
mandatory, as we can use the Laplace approximation to model the
likelihood.
Say we are modeling the number of new COVID cases as a func-
tion of the number of days since the outbreak (see Fig. 55). In stan-
dard GP regression, we assumed the likelihood and prior to be of

[Figure 55: Number of new COVID cases per day. Data: Robert Koch Institut, 22 May 2020.]

the form:

p(y | f_T) = N(y; f_T, σ²I),   p(f) = GP(f; 0, k)

One flaw with the current formulation is that GP inherently as-


sumes that the function values are in the range (−∞, ∞), whereas
our count data is strictly non-negative.
One way to work around this is to log-transform the data (see
Fig. 56), perform GP regression, and then exponentiate to go back
to the original scale. This will ensure that the predictions are in the correct
range.

[Figure 56: Transformed data using the log-transform. Other transformations are also possible.]

One crucial aspect that we notice from the data is that the count
values in the first 40 days since the outbreak differ significantly
from the rest of the days (observe the values of the solid blue line,
ignore the noise bands for now).
In the standard GP regression, our likelihood assumed that
the noise had the same scale σ across all observations. Ideally, we
would like to represent the values in the first 40 days with higher
uncertainty.
In order to account for this, we perform Laplace approximation
on the likelihood:

p(y | f_T) = N(y; exp(f_T), σ²I)

∂ log p(y | f_T) / ∂f_T |_{f_T = f̂_T} = 0   ⇒   f̂_T = log y

∂² log p(y | f_T) / ∂f_T² |_{f_T = f̂_T} = −y²/σ²

⇒ q(y | f_T) = N(f_T; log y, σ² diag(1/y²)) ≈ p(y | f_T)

Now, with this new approximation for the likelihood, we can ac-
count for different uncertainties at different points in time.
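Concretely (our own sketch, with made-up numbers standing in for the case counts), the approximation turns the problem into GP regression on log-counts with per-point noise variances:

import numpy as np

y = np.array([3.0, 12.0, 80.0, 950.0, 4200.0, 3100.0, 800.0])   # toy daily counts
sigma = 50.0                     # noise scale in count space (arbitrary)

f_hat = np.log(y)                # Laplace mean of the likelihood: f_hat = log y
noise_var = sigma ** 2 / y ** 2  # per-point variance, sigma^2 * diag(1 / y^2)

# GP regression can now use the Gaussian "pseudo-likelihood"
# N(f_T; f_hat, diag(noise_var)) in place of the homoscedastic sigma^2 * I;
# small counts automatically receive larger uncertainty in log space.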

Bayesian Deep Learning v

Having introduced Generalized Linear Models, in this section we


explore whether this view still has a connection to deep learning. In
short – the answer is yes, and we will demonstrate this with two
use cases.

Hyperparameter optimization
First, we show that computing evidences (marginal likelihoods)
is possible, which could then be used to perform hyperparameter
optimization. Let’s start by writing down the evidence term:
p(y | X) = ∫ p(y, f | X) df = ∫ exp( log[ p(y | f) p(f | X) ] ) df

In the standard GP setting, both the likelihood and the prior were Gaussian, so computing this integral only involved linear algebra. However, in the case of Generalized Linear Models, this is not the case, as the likelihood is typically non-Gaussian. For this reason, we construct a Laplace approximation:

log[ p(y | f) p(f | X) ] ≈ log[ p(y | f̂) p(f̂ | X) ] − (1/2)(f − f̂)^|(K^{−1} + W)(f − f̂) = log q(y, f | X)

From there, we have:

p(y | X) ≈ q(y | X) = exp( log[ p(y | f̂) p(f̂ | X) ] ) ∫ exp( −(1/2)(f − f̂)^|(K^{−1} + W)(f − f̂) ) df
                    = exp( log p(y | f̂) ) N(f̂; m_X, k_XX) (2π)^{n/2} |(K^{−1} + W)^{−1}|^{1/2}

Recall from earlier chapters that for type-II maximum likelihood


estimation, it is typically more convenient to compute the log of the
evidence term, since it is numerically more stable:

log q(y | X) = log p(y | f̂) − (1/2)(f̂ − m_X)^| K_XX^{−1}(f̂ − m_X) − (1/2) log( |K| · |K^{−1} + W| )
             = ∑_{i=1}^n log σ(y_i f̂_{x_i}) − (1/2)(f̂ − m_X)^| K_XX^{−1}(f̂ − m_X) − (1/2) log |B|

From there, we can use an optimization algorithm that follows the


gradient of log q(y | X ) in order to find the best set of hyperparam-
eters.

Last layer Laplace approximation


Now, we show that it is possible to construct Laplace approxima-
tion for the last layer. This brings us a step closer towards Bayesian
Deep Learning.

Consider the following assumption for the distribution of the


parameters of the last layer v and the corresponding likelihood:
p(v) = N(v; µ, Σ),   p(y | f_X) = ∏_{i=1}^n σ(y_i f_{x_i})   with   v ∈ R^F, φ_X ∈ R^{n×F}

Then, we construct the log-posterior:

log p(v | y) = log p(y | v) + log p(v) − log p(y)
             = ∑_{i=1}^n log σ(y_i φ_{x_i}^| v) − (1/2)(v − µ)^| Σ^{−1} (v − µ) + const

From there, we only need to compute the gradient and the Hessian:

∇ log p(v | y) = ∑_{i=1}^n ∇ log σ(y_i φ_{x_i}^| v) − Σ^{−1}(v − µ),
  with   ∂ log σ(y_i φ_{x_i}^| v) / ∂v_j = [φ_{x_i}]_j ( (y_i + 1)/2 − σ(φ_{x_i}^| v) )

∇∇^| log p(v | y) = ∑_{i=1}^n ∇∇^| log σ(y_i φ_{x_i}^| v) − Σ^{−1},
  with   ∂² log σ(y_i φ_{x_i}^| v) / ∂v_a ∂v_b = −[φ_{x_i}]_a [φ_{x_i}]_b σ(φ_{x_i}^| v)(1 − σ(φ_{x_i}^| v)) =: −[φ_{x_i}]_a [φ_{x_i}]_b w_i
=: −(W + Σ^{−1}) = −( φ_X^| diag(w) φ_X + Σ^{−1} ) ∈ R^{F×F}

. . . and we have all the ingredients for Laplace approximation. Note


that this is still a convex optimization problem since we only used
the last layer.
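A minimal numpy sketch (ours) of this last-layer Laplace step: given fixed features φ_X from a trained network, labels y ∈ {−1, +1} and a Gaussian prior on v, a few Newton steps on the concave log-posterior yield the MAP weights and the Laplace covariance. All names are ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def last_layer_laplace(Phi, y, Sigma, mu, max_iter=100):
    # Phi: (n, F) last-layer features, y in {-1, +1}, prior N(v; mu, Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    v = mu.copy()
    for _ in range(max_iter):
        s = sigmoid(Phi @ v)
        grad = Phi.T @ ((y + 1) / 2 - s) - Sigma_inv @ (v - mu)   # gradient of log posterior
        W = s * (1 - s)
        H = -(Phi.T * W) @ Phi - Sigma_inv                        # Hessian of log posterior
        step = np.linalg.solve(H, grad)
        v = v - step                                              # Newton step (uphill, H is negative definite)
        if np.max(np.abs(step)) < 1e-10:
            break
    Psi = np.linalg.inv(-H)                                       # Laplace covariance (W + Sigma^{-1})^{-1}
    return v, Psi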

Deep Learning
Consider a deep feedforward neural network:
p(y | W) = ∏_{i=1}^n σ(f_W(x_i)),
f_W(x) = w_L^| φ(w_{L−1} φ(... (w_1 x) ...))

We know that standard deep learning amounts to (“type-I”) maxi-


mum a-posteriori estimation:
W^* = arg max_W p(W | y) = arg min_W −∑_{i=1}^n log σ(f_W(x_i)) − log p(W)
    = arg min_W −∑_{i=1}^n log σ(f_W(x_i)) + (β/2)‖W‖² =: arg min_W J(W)

One crucial problem of deep learning is overconfidence – the network outputs predictions with high confidence even in domains that are far away from the data (see Fig. 57). This has been shown in the following theorem:

[Figure 57: The problem of overconfidence.]

Theorem 53 (Hein et al. 2019). Let R^d = ∪_{r=1}^R Q_r and f|_{Q_r}(x) = U_r x + c_r be the piecewise affine representation of the output of a ReLU network on Q_r. Suppose that U_r does not contain identical rows for all r = 1, ..., R, then for almost any x ∈ R^d and any ε > 0, there exists a δ > 0 and a class i ∈ {1, ..., k} such that it holds softmax(f(δx), i) ≥ 1 − ε. Moreover, lim_{δ→∞} softmax(f(δx), i) = 1.

This provides the motivation to progress towards Bayesian Deep


Learning. We start by replacing the point estimate prediction p(y =
1 | x ) = σ( f W ( x )) with the marginal:
Z
p(y = 1 | x ) = σ ( f W ( x )) p(W | y) dW

In order to compute this quantity, we need to approximate the


posterior on W by Laplace:

p(W | y) ≈ N(W; W^*, (∇∇^| J(W^*))^{−1}) =: N(W; W^*, Ψ)

We know that in a general deep learning setting, f is not linear in


the weights W. For this reason, we construct a linear approxima-
tion:
f_W(x) ≈ f_{W^*}(x) + G(x)(W − W^*)   where   G(x) = d f_W(x)/dW |_{W=W^*}
From there, we can approximate the probability distribution over
functions using linear algebra:
p(f_W(x)) = ∫ p(f | W) p(W | y) dW ≈ N(f(x); f_{W^*}(x), G(x)ΨG(x)^|) =: N(f(x); m(x), v(x))

and approximate the marginal (MacKay, 1992) as:


p(y = 1 | x) ≈ σ( m(x) / √(1 + π v(x)/8) )
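In code (our sketch), the approximate predictive probability of the linearized model is a one-liner given m(x) and v(x):

import numpy as np

def predictive_prob(m, v):
    # MacKay's approximation: p(y=1|x) ≈ sigmoid( m / sqrt(1 + pi * v / 8) )
    return 1.0 / (1.0 + np.exp(-m / np.sqrt(1.0 + np.pi * v / 8.0)))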

Using these findings, it is possible to bound the confidence of the predictions of the network. The following theorem formalizes the statement:

[Figure 58: Uncertainty of the predictions. The shade indicates the level of confidence.]

Theorem 54 (Kristiadi et al., 2020). Let f_W : R^n → R be a binary ReLU classification network parametrized by W ∈ R^p with p ≥ n, and let N(W | W^*, Ψ) be the approximate posterior. Then for any input x ∈ R^n, there exists an α > 0 such that for any δ ≥ α, the confidence σ(|z(δx)|) is bounded from above by the limit lim_{δ→∞} σ(|z(δx)|). Furthermore,

lim_{δ→∞} σ(|z(δx)|) ≤ σ( |u| / ( s_min(J) √(π/8 λ_min(Ψ)) ) ),

where u ∈ R^n is a vector depending only on W and the n × p matrix J := ∂u/∂W |_{W^*} is the Jacobian of u w.r.t. W at W^*.
The good news is that Bayesian Deep Learning is not necessarily costly, if one is willing to use approximations, such as:

• a low-rank approximation of the Hessian

• a block-diagonal approximation of the Hessian

• the Hessian of the last layer

• or even just the diagonal of the Hessian

(BackPACK for pytorch is a collection of lightweight extensions for second-order quantities (curvature and variance), available at backpack.pt. F. Dangel, F. Künstner, P. Hennig. BackPACK: Packing more into Backprop. ICLR 2020.)

Condensed content

Support Vector Machines

• arise if the empirical risk has zero gradient for large values of f
(→ hinge-loss)

• unfortunately, this does not amount to a log likelihood, so there


is no natural probabilistic interpretation (and thus uncertainty)
for the SVM

Generalized Linear Models

• extend Gaussian (process) regression to non-Gaussian likeli-


hoods

• the Laplace approximation yields a computationally lightweight


approximate posterior for such models. It is better than a point-
estimate, but one has to take care to ensure it is working, espe-
cially if the likelihood is not log-concave

Bayesian Deep Learning

• deep neural networks can have badly calibrated uncertainty


when used as (MAP) point estimates

• Laplace approximations can fix this issue

• Laplace approximations are not for free, but feasible for many
deep models, and easy to implement
Exponential Family v

The biggest obstacles in probabilistic inference are of


computational nature. Nevertheless, this is yet another chapter
where we introduce a new item in our toolbox that will help us
keep things tractable. The hardest part of computing the posterior
in Bayes’ rule,
p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx,

comes from the integral for the evidence Z = ∫ p(y | x) p(x) dx.
Naturally, computing expectations with respect to the posterior
E_{p(x|y)}[f(x)] = ∫ f(x) p(x | y) dx

as well as solving the optimization problem to obtain the (Type-2)


Maximum-A-Posteriori solution

x^* = arg max_x p(x | y)

also inherits from this difficulty. Using Gaussian distributions, we


have seen that we can sidestep the problem: if we have a Gaussian
prior and a Gaussian likelihood, the posterior is also Gaussian.
This resulted in the ability to compute (some of) the quantities
analytically.

Conjugate Priors v

We will extend this idea with the concept of conjugate priors. A prior
is said to be conjugate to a likelihood if the posterior arising from
the combination of the likelihood and the conjugate prior has the
same form as the prior. The Gaussian prior is the conjugate prior to
the Gaussian likelihood.

For binary distributions, such as the probability π that a coin


flip shows head, we have that the likelihood of seeing a heads and b
tails in a sequence x1 , . . . , x a+b is

p ( x | π ) = π a (1 − π ) b .

In an earlier chapter, we have seen that the Beta distribution (wikipedia.org/wiki/Beta_distribution) was

a sensible choice of prior. With hyper-parameters α, β, it is defined


as
p(π) = B(π; α, β) = π^{α−1}(1 − π)^{β−1} / B(α, β),

where the beta function, B(α, β), is the normalization constant. This choice of prior leads to a posterior that is also a Beta distribution,

p(π | x) = B(π; α + a, β + b) = π^{α+a−1}(1 − π)^{β+b−1} / B(α + a, β + b).
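As a tiny worked example (ours, with made-up numbers), the conjugate update is just a pair of additions:

import numpy as np

alpha, beta = 2.0, 2.0                       # Beta prior hyperparameters (arbitrary)
flips = np.array([1, 0, 1, 1, 0, 1])         # toy coin-flip data, 1 = heads
a, b = int(flips.sum()), int(len(flips) - flips.sum())

alpha_post, beta_post = alpha + a, beta + b  # posterior is Beta(alpha + a, beta + b)
posterior_mean = alpha_post / (alpha_post + beta_post)   # E[pi | x]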

For Categorical distributions, where an observation x can be one of K classes, if we denote n_k as the number of observations of class k, and π_k as the probability of seeing an observation of class k, then the likelihood is given by

p(x | π) = ∏_k π_k^{n_k}.

Taking a Dirichlet distribution (wikipedia.org/wiki/Dirichlet_distribution) with hyperparameters α_1, ..., α_K as the prior,

p(π) = D(α_1, ..., α_K) = (1 / B(α_1, ..., α_K)) ∏_k π_k^{α_k−1},

leads to a Dirichlet posterior, p(π_1, ..., π_K | x) = D(α_1 + n_1, ..., α_K + n_K).


For an additional example involving inferring the mean and
covariance of the Gaussian distribution, see v .

The formal definition of a conjugate prior is as follows; for more details, see E. Pitman, Sufficient statistics and intrinsic accuracy, Mathematical Proceedings of the Cambridge Philosophical Society, 1936, and P. Diaconis and D. Ylvisaker, Conjugate priors for exponential families, Annals of Statistics, 1979.

Definition 55 (Conjugate prior). Let D and x be a data-set and a variable to be inferred, connected by the likelihood p(D | x) = ℓ(D; x). A conjugate prior to ℓ for x is a probability measure with PDF p(x) = π(x, θ) of functional form π, such that

p(x | D) = ℓ(D; x) π(x; θ) / ∫ ℓ(D; x) π(x; θ) dx = π(x; θ′).
That is, the posterior arising from ` is of the same functional form
as the prior, with updated parameters.
As we will see in the next chapter, conjugate priors allow for ana-
lytic Bayesian inference.

Exponential Family v

Definition 56 (Exponential Family). A probability distribution over


a variable x ∈ X ⊂ Rn with the functional form
p_w(x) = h(x) exp( φ(x)^⊤ w − log Z(w) ) = (h(x) / Z(w)) exp( φ(x)^⊤ w ),
is called an exponential family of probability measures. The func-
tion φ : X → Rd is called the sufficient statistics, the parameters
w ∈ Rd are called the natural parameters of pw , the normalization
constant Z : Rd → R is the partition function, and the function
h( x ) : X → R+ is the base measure.

Some examples of Exponential Family distributions, with sufficient


statistics and domain, are shown in Table 1.

Table 1: (Incomplete) list of Exponential Family distributions. (See wikipedia.org/wiki/Exponential_family#Table_of_distributions for a more exhaustive list.)

Distribution    φ(x)                    X
Bernoulli       [x]                     {0, 1}
Poisson         [x]                     R+
Laplace         [1, x]                  R
χ²              [x, −log x]             R
Dirichlet       [log x]                 R+
Euler (Γ)       [x, log x]              R+
Wishart         [X, log det X]          {X ∈ R^{N×N} : v^⊤Xv ≥ 0 ∀v ∈ R^N}
Gaussian        [x, xx^⊤]               R^N
Boltzmann       [X, triag(XX^⊤)]        {0, 1}^N

Exponential Families have Conjugate priors. Taking the


Exponential Family likelihood
 
p_w(x | w) = exp( φ(x)^⊤ w − log Z(w) ),

the prior parametrized with α and ν,

p_α(w | α, ν) = exp( [w; −log Z(w)]^⊤ [α; ν] − log F(α, ν) ) = exp( α^⊤ w − ν log Z(w) − log F(α, ν) ),

where the normalization constant F(α, ν) is given by

F(α, ν) = ∫ exp( α^⊤ w − ν log Z(w) ) dw,

gives rise to the posterior

p_α(w | α, ν) ∏_{i=1}^n p_w(x_i | w) ∝ p_α( w | α + ∑_i φ(x_i), ν + n ).

The predictive distribution p(x) can similarly be computed,

p(x) = ∫ p_w(x | w) p_α(w | α, ν) dw
     = ∫ exp( (φ(x) + α)^⊤ w − (ν + 1) log Z(w) − log F(α, ν) ) dw
     = F(φ(x) + α, ν + 1) / F(α, ν).

Computing F (α, ν) can be tricky, and in general is the main chal-


lenge when constructing an Exponential Family distribution.

Sufficient statistics “suffice” for maximum likelihood


estimation. To see why, take the likelihood of n i.i.d. samples for
an exponential family,
p(x_1, ..., x_n | w) = ∏_{i=1}^n p(x_i | w) = exp( ∑_{i=1}^n φ(x_i)^⊤ w − n log Z(w) ).

The maximum likelihood estimate for w is found at

∇_w log p(x_1, ..., x_n | w) = 0   ⇒   ∇_w log Z(w) = (1/n) ∑_{i=1}^n φ(x_i).

Hence, it suffices to collect the statistics φ(x_i), compute ∇_w log Z(w), and solve for w^*.
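For instance (our own sketch), for the Gaussian with sufficient statistics φ(x) = [x, x²], matching ∇_w log Z(w) = E_{p_w}[φ(x)] to the empirical average of the sufficient statistics recovers the familiar maximum likelihood estimates:

import numpy as np

x = 1.0 + 2.0 * np.random.default_rng(2).standard_normal(1000)   # toy samples

phi_bar = np.array([x.mean(), (x ** 2).mean()])   # (1/n) sum_i phi(x_i), phi(x) = [x, x^2]

mu_ml = phi_bar[0]                      # E[x]
var_ml = phi_bar[1] - phi_bar[0] ** 2   # E[x^2] - E[x]^2
# (mu_ml, var_ml) solve grad_w log Z(w) = phi_bar for the Gaussian family.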

Exponential families allow for analytical computation


of integrals. In particular, we can analytically compute the ex-
pectation of the sufficient statistic with respect to the exponential
family. Because it holds that ∫_X dp_w(x) = 1, we have:

∇_w ∫ p_w(x | w) dx = ∫ ∇_w p_w(x | w) dx
                    = ∫ φ(x) dp_w(x | w) − ∫ ∇_w log Z(w) dp_w(x | w) = 0

⟹ E_{p_w}(φ(X)) = ∇_w log Z(w)

Note that the first expression is 0, because we are essentially taking


the derivative of a constant.

It is important to clarify that both of these nice properties (maxi-


mum likelihood estimation, and being able to analytically compute
integrals) hinge on the fact that log Z (w) is analytically known.

Conjugate prior inference v

In this section we explore how we can use exponential families


to learn distributions. In fact, we will find out that we can even do
Bayesian distribution regression, also known as conjugate prior
inference. For this purpose, we introduce the following quantity:

Definition 57 (Kullback-Leibler divergence). Let P and Q be probability distributions over X with pdf's p(x) and q(x), respectively. The KL-divergence from Q to P is defined as:

D_KL(P ∥ Q) := ∫ log( p(x) / q(x) ) dp(x)

(Some important properties: D_KL(P ∥ Q) ≠ D_KL(Q ∥ P); D_KL(P ∥ Q) ≥ 0 for all P, Q (Gibbs' inequality); D_KL(P ∥ Q) = 0 ⟺ p ≡ q almost everywhere.)

Now back to our problem: assume we are given samples [ xi ]i=1,...,n


s.t. xi ∼ p( x ). Using these samples, we would like to approximate
the distribution p( x ) using an exponential family:

p(x) ≈ p̂(x | w) = exp( φ(x)^⊤ w − log Z(w) )

Maximum Likelihood Regression on distributions


First, we tackle this problem through maximum likelihood estima-
tion. In particular, to find ŵ, consider:

ŵ = arg min_{w∈R^d} D_KL( p(x) ∥ p̂(x | w) )
  = arg min_{w∈R^d} ∫ ( log p(x) − log p̂(x | w) ) dp(x)
  = arg min_{w∈R^d} ∫ log p(x) dp(x) − E_p(φ(x))^⊤ w + log Z(w) =: arg min_{w∈R^d} L_log(w),

where the first term is the negative entropy −H(p) and does not depend on w.

Now, we can find the minimum at ∇w Llog (w) = 0, where:


E_p(φ(x)) ≈ (1/n) ∑_{i=1}^n φ(x_i) = ∇_w log Z(w)

MAP Regression on distributions


Next, we tackle the problem through MAP estimation. In this case,
we consider the following conjugate prior:
 | 
p F (w | α, ν) = exp w α − ν log Z (w) − log F (α, ν)

Note that we do not need to know its normalizer F. Then, in order


to find ŵ, consider:

ŵ = arg min_{w∈R^d} D_KL( p(x) ∥ p̂(x | w) ) − (1/n)( α^⊤ w − ν log Z(w) )
  = arg min_{w∈R^d} ∫ log p(x) dp(x) − E_p(φ(x))^⊤ w + log Z(w) − (1/n)( α^⊤ w − ν log Z(w) ) =: arg min_{w∈R^d} L̃_log(w),

where, again, ∫ log p(x) dp(x) = −H(p) does not depend on w.

Now, we can find the minimum at ∇w L̃log (w) = 0, where:


E_p(φ(x)) ≈ (1/n) ∑_{i=1}^n φ(x_i) = ((n + ν)/n) ∇_w log Z(w) − α/n

Full Bayesian Regression on distributions


Finally, we can do a full Bayesian treatment to our problem. In
particular, we can compute the posterior on w, using the above
mentioned conjugate prior:
p(w | x, α, ν) = ∏_{i=1}^n p_w(x_i | w) p_F(w | α, ν) / ∫ p(x | w) p(w | α, ν) dw = p_F( w | α + ∑_i φ(x_i), ν + n )

Note that in this case, we do need to know the normalizer of the


prior. Let us see the interpretation of this approach from the statisti-
cal viewpoint. Note that the Hessian of the posterior at the mode is
given by the following expression:
|
∇∇ p F (w | α, ν)|w? =arg max p(w|α,ν) = −νp(w? |α, ν)∇w ∇w log Z (w? )

As we increase the number of data points n → ∞, the posterior concentrates at w⋆. In particular, the curvature at the mode increases because ν = νprior + n → ∞, with:

∇w log Z(w⋆) = α/n + (1/n) ∑_{i=1}^n φ(xi) → E_p(φ(x))

In turn, this results in the maximum likelihood solution:

w⋆ = arg min_w DKL( p(x) ‖ p_w(x | w) )

For an example of constructing an exponential family and fitting a distribution, see the sketch below.
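A minimal Python sketch of these ideas, using the Bernoulli distribution written as an exponential family (natural parameter w, sufficient statistic φ(x) = x, log-partition log Z(w) = log(1 + eʷ)); the data and the hyperparameter values are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)          # iid samples from the unknown p(x)
alpha, nu = 1.0, 2.0                        # conjugate-prior hyperparameters (illustrative)

# conjugate-prior update: only the sufficient statistics of the data are needed
alpha_post = alpha + x.sum()                # alpha + sum_i phi(x_i)
nu_post = nu + x.size                       # nu + n

# MAP estimate: solve grad_w log Z(w) = sigmoid(w) = alpha_post / nu_post for w
target = alpha_post / nu_post
w_map = np.log(target / (1.0 - target))     # inverse sigmoid
print(w_map, 1.0 / (1.0 + np.exp(-w_map)))  # natural parameter and implied mean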

Condensed Content

• Conjugate Priors allow analytic inference of nuisance parameters


in probabilistic models.

• Exponential Families (EF) guarantee the existence of conjugate


priors, although not always tractable ones.

• They also allow analytic MAP inference from only a finite set of
sufficient statistics.

• Conjugate prior inference with exponential families is a form


of Bayesian regression on distributions (Gaussian process in-
ference, in this sense, is inference on the unknown mean of a
Gaussian distribution).

– Given data x1 , . . . , x N drawn iid. from unknown p( x ), consider


approximating p( x ) ≈ pw ( x | w) with an EF.
– The maximum likelihood and MAP estimates for w can be
computed in O ( N ).
– If the conjugate prior to pw (which itself is an EF) is tractable,
it allows for full Bayesian inference.
– Asymptotically, the posterior concentrates around the maxi-
mum likelihood estimate, which is the minimizer of the KL-

divergence DKL pk pw within the exponential family.

• The hardest part is finding the normalization constant. In fact,


finding the normalization constant is the only hard part.
Graphical Models v

Keeping track of the entire hypothesis space is combi-


natorially hard. In the earlier chapters, we have seen that to
represent a full joint probability distribution over four variables
would require 2⁴ − 1 = 15 parameters,

p( a, b, c, d) = p( a|b, c, d) p(b|c, d) p(c|d) p(d),

but removing irrelevant conditions (based on domain knowledge)


can reduce the number of required parameters. If, for example,

p( a, b, c, d) = p( a|b, c) p(b|d) p(c) p(d),

then we only need 8 parameters to represent the distribution.


We will see that Graphical models provide a nice language to
convey this independence information.

Directed Graphical Models v

Recall the procedure of constructing directed graphical models (or Bayesian networks):

1. For each variable in the joint distribution, draw a circle.

2. For each term p(x1, . . . | y1, . . .) in the factorized joint distribution, draw an arrow from every parent (right side, yi) to every child (left side, xi).

3. Fill in all observed variables (variables we want to condition on).

Figure 59: Directed Graphical Model for the factorization p(A, E, B, R) = p(A | E, B) p(R | E) p(E) p(B).

leading to a graphical model such as the one shown in Fig. 59.

Repeated observations and hyperparameters can be expressed using some syntactic sugar to make it easier to draw complex graphical models. A box with sharp edges drawn around a set of nodes and labeled with a number n is called a plate and denotes n copies of the content of the box. A small filled circle denotes a (hyper-)parameter that is set or optimized, and which is not part of the generative model.

Figure 60: Plates and Hyperparameters

p(y, w) = ∏_{i=1}^n N(yi; φ(xi)ᵀ w, σ²) · N(w; µ, Σ)

Independence and Directed Graphs


By the product rule, every joint probability distribution can be
factorized, but not every factorization is useful. Directed graphs are
also an imperfect representation, as a joint probability distribution
can have multiple factorizations, each leading to a different graph
expressing some of the independencies, but not all. Remember the
atomic independence structure we surveyed in an earlier chapter:

p(A, B, C)                 DAG           Independence    But!
p(C | B) p(B | A) p(A)     A → B → C     A ⊥⊥ C | B      A 6⊥⊥ C
p(A | B) p(C | B) p(B)     A ← B → C     A ⊥⊥ C | B      A 6⊥⊥ C
p(B | A, C) p(A) p(C)      A → B ← C     A ⊥⊥ C          A 6⊥⊥ C | B

Figure 61: Independence structure for tri-variate subgraphs.
A more general statement about independence, called d-separation, comes from Pearl²⁶, and the following presentation is from Bishop²⁷.

²⁶ J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
²⁷ Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. URL https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

Theorem 58 (d-separation). Consider a general directed acyclic graph, in which A, B, C are non-intersecting sets of nodes whose union may be smaller than the complete graph. To ascertain whether A ⊥⊥ B | C, consider all possible paths, regardless of the direction, from any node in A to any node in B. Any such path is considered blocked if it includes a node such that either

• the arrows on the path meet either head-to-tail or tail-to-tail at


the node, and the node is in C, or

• the arrows meet head-to-head at the node, and neither the node,
nor any of its descendants is in C.

If all paths are blocked, then A is said to be d-separated from B by


C, and A ⊥ ⊥ B|C.

Thus, all further considerations about computations on the graph


can be made in a local fashion based on Markov Blankets.

Definition 59 (Markov Blanket – for directed graphs). The Markov Blanket of node xi is the set of all parents, children, and co-parents²⁸ of xi. Conditioned on the blanket, xi is independent of the rest of the graph.

²⁸ The co-parents of x are the (other) parents of the children of x.

Figure 62: Example of a Markov Blanket for a directed graph

The directed nature of connections in Bayesian belief networks reflects the fact that a conditional probability has a left- and right-hand side, p(x | a). This is convenient since it allows writing down the graph directly from the factorization. However, conditional independence statements (d-separation) are tricky: blocking a path requires notions of parents and co-parents, and different rules depending on whether arrows meet head-to-head or head-to-tail. Moreover, there are joint distributions whose set of conditional independences cannot be represented by a single directed graph.

Undirected Graphical Models v

Undirected Graphical Models, or Markov Random Fields (MRF), are another notation in which conditional independence can be stated as "two nodes are independent if all paths connecting them are blocked".

Figure 63: Markov Random Field with separating set

Definition 60 (Markov Random Field). An undirected graph G = (V, E) is a set V of nodes and edges E. G and a set of random variables mapping to the nodes X = {Xv}_{v∈V} form a Markov Random Field if, for any subsets A, B ⊂ V and a separating set S (a set such that every path from A to B passes through S), X_A ⊥⊥ X_B | X_S.

The above definition is known as the global Markov property. It implies the weaker pairwise Markov property: any two nodes u, v that do not share an edge are conditionally independent given all other variables: X_u ⊥⊥ X_v | X_{V\{u,v}}.

Markov Blankets are simpler for Markov Random Fields;

Definition 61 (Markov Blanket – for undirected graphs). For a


xi
Markov Random Field, the Markov Blanket of node xi is the set of
all direct neighbors of xi . Conditioned on the blanket, xi is indepen-
dent of the rest of the graph.

Essentially, MRFs allow for a more compact definition of condi- Figure 64: Markov Blanket for a
Markov Random Field
tional independence compared to directed graphs. Nevertheless, the
associated joint probability distribution cannot be easily read from
the graph.

By the pairwise Markov property, any two nodes xi , x j not con-


nected by an edge have to be conditionally independent given the
rest of the graph. Thus, the joint factorizes into

p( xi , x j | x\{i,j} ) = p( xi | x\{i,j} ) p( x j | x\{i,j} ).

Hence, for the factorization to hold, nodes that do not share an edge must not be in the same factor. This leads to the use of cliques to define the factorization.

Definition 62 (Clique). Given a graph G = (V, E), a clique is a subset c ⊂ V such that there is an edge between all pairs of nodes in c. A maximal clique is a clique such that it is impossible to include any other nodes from V without it ceasing to be a clique.

Figure 65: A clique (in gold) and maximal clique (in red)
Any distribution p( x ) that satisfies the conditional independence
structure of the graph G can be written as a factorization over all
cliques – but also just over all maximal cliques, since any clique is
part of at least one maximal clique. Using the set of all maximal
cliques C gives

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} ψc({xi ∈ c}).

In directed graphs, each factor p( xch | xpa ) had to be a probability


distribution of the children, and not of the parents. In MRFs, there
is no distinction between parents and children, so we only know
that each potential function ψc ({ xi ∈ c}) ≥ 0. The normalization
constant Z is the partition function

Z := ∫ ∏_{c∈C} ψc({xi ∈ c}) dx1 · · · dxn.

Because of the loss of structure from directed to undirected graphs,


we have to explicitly compute Z. This can be NP-hard, and is the
primary downside of MRFs; for n discrete variables with k states
each, computing Z may require summing kⁿ terms.

The Boltzmann distribution


Markov Random Fields with positive potentials (ψc ({ xi ∈ c}) > 0)
are Exponential Families, since we can write

ψc ({ xi ∈ c}) = exp(− Ec ({ xi ∈ c}))

for some function Ec, and introduce the scaling factors wc to get

p(x1, . . . , xn) = exp( − ∑_{c∈C} wc Ec({xi ∈ c}) − log Z ).

This gives rise to a Boltzmann distribution (or Gibbs measure);

Definition 63 (Boltzmann distribution). A probability distribution


with PDF of the form

p( x ) = exp(− E( x ))

is called a Boltzmann or Gibbs distribution, and E( x ) is known as


the energy function.

Any Gibbs measure (and any MRF) is an exponential family;


it may not necessarily be of the helpful kind as Z (wc ) can be in-
tractable.

The Gaussian case


For a set of variables x1, . . . , xn that are jointly Gaussian distributed,

p(x) = N(x; µ, Σ),

the MRF can be constructed directly from the inverse covariance. Recall that if the inverse covariance (precision) matrix contains a zero at element [Σ⁻¹]ij, then xi ⊥⊥ xj | x\{i,j}. This implies that an edge exists between nodes xi and xj exactly when [Σ⁻¹]ij ≠ 0.
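As a small illustration (a sketch with a made-up precision matrix), the edges of the Gaussian MRF can be read directly off the nonzero off-diagonal entries of Σ⁻¹:

import numpy as np

# tridiagonal precision encoding the chain x1 - x2 - x3 - x4
P = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if P[i, j] != 0]
print(edges)                      # [(0, 1), (1, 2), (2, 3)] -- the chain graph

Sigma = np.linalg.inv(P)          # the covariance itself is dense:
print(np.round(Sigma, 2))         # conditional independence is visible in P, not in Sigma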

Condensed content

Directed Graphical Models (Bayesian Networks)

• directly encode a factorization of the joint, as it can be read off by


parsing the graph from the children to the parents.

• However, reading off conditional independence structure is


tricky as it requires considering d-separation.

• Directed graphs are a direct mapping of a generative process. For


this reason, they tend to be useful in highly structured problems
with mixed data types, such as physical, biological, chemical or
social processes, where the causal structure is known.

• When you want to model a process for which you have a “sci-
entific” theory or some generative knowledge, writing down the
directed model is a good start.

Undirected Graphical Models (Markov Random Fields)

• directly encode the conditional independence structure.

• However, reading off the joint from the graph is tricky as it re-
quires calculating the normalization constant, which is usually
intractable.

• MRFs tend to be useful in particularly regular, but high-dimensional


problems with unclear generative model, such as those encoun-
tered in computer vision and statistical physics.

• When your model has millions of parameters, and you are more
worried about computational complexity than interpretability,
the conditional independence structure of MRFs can help keep things tractable.
Factor graphs v

So far, we utilized directed and undirected graphs as


tools to graphically represent and inspect properties of
joint probability distributions. Both are primarily a design
tool, each with its strengths and weaknesses. In this chapter, we will
introduce a third type of graphical models, along with a general-
purpose algorithm for automated inference and efficient Maximum-
A-Posteriori computation.

From Directed to Undirected Graphs

Before we introduce this new form of graphs, let us first observe


how we can transition from a directed to an undirected graph.
Given a directed graph, it is possible to find an equivalent undi-
rected graph. For some models, such as Markov Chains, this is
straightforward:

p(x) = p(x1) p(x2 | x1) · · · p(xn | xn−1)
     = (1/Z) ψ1,2(x1, x2) · · · ψn−1,n(xn−1, xn).

Figure 66: Directed and undirected graph for a Markov chain over x1, x2, . . . , xn.

In general, we need to ensure that each conditional term in


the directed graph is captured in at least one clique of the undi-
rected graph (see Fig. 67). For nodes with only one parent, we can
drop the arrow, thus obtaining p( xc | x p ) = ψc,p ( xc , x p ). However,
for nodes with several parents, we have to connect all their par-
ents. This process is known as moralization, and frequently leads to
densely connected graphs, losing all value of the structure.
Figure 67: Directed to undirected, after moralization.

Strengths and weaknesses


Directed and undirected graphs offer tools to graphically represent
and inspect properties of joint probability distributions. Both are
primarily a design tool, and each framework has its strengths and
weaknesses.
In Fig. 68, the conditional independence properties of the di-
rected graph on the left cannot be represented by any MRF over
the same three variables; and the conditional independence proper-
ties of the MRF on the right cannot be represented by any directed
graph on the same four variables.

Figure 68: Example for the limits of graphical models. Left (directed graph over A, B, C): A ⊥⊥ B | ∅ and A 6⊥⊥ B | C. Right (MRF over A, B, C, D): x 6⊥⊥ y | ∅ for all x, y, while C ⊥⊥ D | A ∪ B and A ⊥⊥ B | C ∪ D.

To formalize this, consider a distribution p( x) and its graphical


representation G = (Vx , E).
If every conditional independence statement satisfied by the dis-
tribution can be read off the graph, then G is called a D-map of p.
For example, the fully disconnected graph is a trivial D-map for
every p, since this graph models all possible conditional indepen-
dence statements for the given variables, thus including the ones
implied by p.
On the other hand, if every conditional independence statement
implied by G is also satisfied by p, then G is called an I-map of p.
For example, the fully connected graph is a trivial I-map for every
p, since it doesn’t model any conditional independence statements
for the given variables, hence p satisfies this trivial empty set of
conditional independence statements.
A graph G that is both an I-map and a D-map of p is called a
perfect map of p. The set of distributions p for which there exists
a directed graph that is a perfect map is distinct from the set of p
for which there exists a perfect MRF map (see Fig 68; on the other
hand, Markov Chains are an example where both the MRF and the
directed graph are perfect). There exist p for which neither a directed nor an undirected graph is a perfect map (e.g. two coins and a bell).

In order to alleviate some of the limitations of the directed and


undirected graphs, in the next section we introduce factor graphs.

Factor Graphs

Factor Graphs are an explicit representation of functional relationships;

Definition 64 (Factor graph). A factor graph is a bipartite graph G = (V, F, E) of variables v ∈ V, factors f ∈ F and edges E, such that each edge connects a factor to a variable.

Figure 69: Example of a factor graph, with variable nodes x1, x2, x3 and factors in boxes.

To construct a factor graph from a directed graph

p(x) = ∏_ch p(x_ch | x_pa(ch)),

draw a circle for each variable xi, a box for each conditional in the factorization and connect each xi to the factorizations it appears in.

Figure 70: Conversion of a directed graph for a parametric regression to a factor graph.

To construct a factor graph from an MRF

p(x) = (1/Z) ∏_{c∈C} ψc({xi ∈ c}),

draw a circle for each variable xi , a box for each factor (clique) ψc
and connect each ψc to the variables used in the factor.

Some properties of Factor Graphs

Factor Graphs can express structure not visible in MRFs: the same MRF clique over x1, x2, x3 can correspond to different factor graphs, for example one with a single factor f connecting all three variables and one with separate factors fa and fb.
Sometimes, they can mask conditional independence structures: the factorizations p(x1, x2, x3) = p(x3 | x1, x2) p(x1) p(x2) and p(x1, x2, x3) = p(x1, x2 | x3) p(x3) can lead to the same factor graph over x1, x2, x3.

But they can also reveal functional relationships: p(x) = p23(x2, x3 | x1) p(x1) yields a single factor p23 connecting x1, x2 and x3, whereas p(x) = p2(x2 | x1) p3(x3 | x1) p(x1) yields separate factors p2 and p3.

The graphical view itself does not always capture the entire
structure. Nevertheless, when factor graphs are encoded with an
explicit functional form, part of the structure can be automatically
deduced and used for inference. For this purpose, we introduce the
Sum-Product algorithm.
The Sum-Product Algorithm v

The Sum-Product, message passing, or Belief Propagation algorithm²⁹,³⁰,³¹ leverages the structure of factor graphs to perform inference. More precisely, it computes the marginal distribution

p(xi) = ∫ p(x1, . . . , xi, . . . , xn) dx_{j≠i},

given that the joint p(x1, . . . , xn) is represented by a factor graph.

²⁹ J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
³⁰ S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, 1988.
³¹ F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 2001.

Base case: Markov chains

Filtering and Smoothing are special cases of the sum-product algorithm on chains. For simplicity, consider the Markov Chain with discrete variables xi ∈ [1, . . . , k], such that

p(x0, . . . , xn) = (1/Z) ψ0,1(x0, x1) · · · ψn−1,n(xn−1, xn).

Figure 71: Factor Graph of a Markov Chain: x0 — ψ0,1 — x1 — · · · — ψn−1,n — xn

The marginal p(xi) is then given by

p(xi) = ∑_{x_{j≠i}} p(x0, . . . , xn)
      = (1/Z) ( ∑_{xi−1} ψi−1,i(xi−1, xi) · · · ( ∑_{x0} ψ0,1(x0, x1) ) ) · ( ∑_{xi+1} ψi,i+1(xi, xi+1) · · · ( ∑_{xn} ψn−1,n(xn−1, xn) ) )
      = (1/Z) µ→(xi) µ←(xi),

where the first bracket is defined as µ→(xi) and the second as µ←(xi),
with Z = ∑_{xi} µ→(xi) µ←(xi). The terms µ→(xi) and µ←(xi) are called messages, which can be computed recursively

µ→(xi) = ∑_{xi−1} ψi−1,i(xi−1, xi) µ→(xi−1),
µ←(xi) = ∑_{xi+1} ψi,i+1(xi, xi+1) µ←(xi+1).
By storing local messages, all marginals can be computed in O(nk²),
as in filtering and smoothing. Computing a message from the pre-
ceding one can be done by taking the sum of the product of the
local factors and incoming messages. The local marginal can be
computed by taking the sum of the product of incoming messages,
hence the name of the algorithm.
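The chain case is easy to implement. The following is a minimal Python sketch with k discrete states and made-up random potentials (all names are illustrative); the brute-force computation at the end checks one of the message-passing marginals:

import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 5                                    # k states, variables x_0, ..., x_n
psi = [rng.random((k, k)) for _ in range(n)]   # made-up positive pairwise potentials

mu_fwd = [np.ones(k) for _ in range(n + 1)]    # forward messages
mu_bwd = [np.ones(k) for _ in range(n + 1)]    # backward messages
for i in range(1, n + 1):
    mu_fwd[i] = psi[i - 1].T @ mu_fwd[i - 1]   # sum over x_{i-1}
for i in range(n - 1, -1, -1):
    mu_bwd[i] = psi[i] @ mu_bwd[i + 1]         # sum over x_{i+1}

marginals = [m_f * m_b for m_f, m_b in zip(mu_fwd, mu_bwd)]
marginals = [m / m.sum() for m in marginals]   # normalize by Z

# brute-force check of the marginal of x_2 against the full joint
joint = np.ones([k] * (n + 1))
for i in range(n):
    shp = [k if d in (i, i + 1) else 1 for d in range(n + 1)]
    joint = joint * psi[i].reshape(shp)
joint /= joint.sum()
print(np.allclose(marginals[2], joint.sum(axis=tuple(d for d in range(n + 1) if d != 2))))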

An interesting insight is that we can adapt the Sum-product


algorithm to compute the most probable state – the MAP estimate
of p(x0, . . . , xn):

max_{x0,...,xn} p(x0, . . . , xn) = (1/Z) max_{x0} · · · max_{xn} ψ0,1(x0, x1) · · · ψn−1,n(xn−1, xn)
                                 = (1/Z) max_{x0,x1} ( ψ0,1(x0, x1) · · · ( max_{xn} ψn−1,n(xn−1, xn) ) )

Alternatively, for the purpose of numerical stability, we could compute the log-max:

log max_{x0,...,xn} p(x0, . . . , xn) = max_{x0,x1} ( log ψ0,1(x0, x1) + · · · + max_{xn} log ψn−1,n(xn−1, xn) ) − log Z

Finally, we might be interested in the actual maximizer (x0, . . . , xn) that achieves the maximum. For this reason, we compute the argmax in a similar fashion:

arg max_{x0,...,xn} p(x0, . . . , xn) = arg max_{x0,x1} ( log ψ0,1(x0, x1) + · · · + arg max_{xn} log ψn−1,n(xn−1, xn) )

Translating these findings into the language of message passing


yields the Viterbi Algorithm.

Figure 72: The Viterbi Algorithm on a chain (trellis over the states of x0, . . . , x3 with factors f0,1, f1,2, f2,3). First, we initialize the message passed from node x0 to the factor f0,1. Next, we perform the message passing procedure recursively until the end of the graph. Lastly, in order to find the maximizers xi^max, we need to backtrack. For efficient backtracking, one can make use of the trellis structure.

µ_{x0→f0,1} = 0
µ_{f_{i−1,i}→xi}(xi) = max_{xi−1} ( log f_{i−1,i}(xi−1, xi) + µ_{xi−1→f_{i−1,i}}(xi−1) )
µ_{xi→f_{i,i+1}}(xi) = µ_{f_{i−1,i}→xi}(xi)
φ(xi) = arg max_{xi−1} ( log f_{i−1,i}(xi−1, xi) + µ_{xi−1→f_{i−1,i}}(xi−1) )
xi−1^max = φ(xi^max)

To summarize, the sum-product algorithm splits the inference


into local messages being sent forwards and backwards along the
factor graph, allowing for inferring both the local marginals, as well
as the most-probable state.

Sum-Product on Trees

The efficiency of the sum-product algorithm is preserved when,


instead of chains, the graph is a tree.
Definition 65 (Tree). An undirected graph is a tree if there is one,
and only one, path between any pair of nodes (such graphs have no
loops). A directed graph is a tree if there is only one node which
has no parent (the root); all other nodes only have one parent.
When such graphs are transformed into undirected graphs by mor-
alization, they remain a tree. A directed graph such that every pair
of nodes is connected by one and only one path is called a polytree.
When transformed into an undirected graph, such graphs generally
get loops, but the corresponding factor graph is still a tree.

Figure 73: Directed and undirected trees, along with their factor-graph correspondences.

Consider a tree-structured factor graph over x = [x1, . . . , xn] and pick any variable x ∈ x. Because the graph is a tree, we can write the joint p(x) as:

p(x) = ∏_{s∈ne(x)} Fs(x, xs)

where ne(x) are the neighbors of x, and Fs is the sub-graph of nodes xs other than x itself that are connected to neighbor s (which is itself a tree!).

Figure 74: Messages from factors to variables

Now, consider the marginal distribution p(x) = ∑_{x\x} p(x). By expanding the joint, we obtain:

p(x) = ∑_{x\x} ∏_{s∈ne(x)} Fs(x, xs)
     = ∏_{s∈ne(x)} ( ∑_{xs} Fs(x, xs) )
     = ∏_{s∈ne(x)} µ_{fs→x}(x)

where we define µ_{fs→x}(x) := ∑_{xs} Fs(x, xs).
The sub-graphs Fs(x, xs) themselves factorize further into tree-structured subgraphs:

Fs(x, xs) = fs(x, x1, . . . , xm) G1(x1, xs1) · · · Gm(xm, xsm)

where {x1, . . . , xm} are the nodes in xs and xsi are the neighbors of xi. Then, we obtain the factor-to-variable messages:

µ_{fs→x}(x) = ∑_{x1,...,xm} fs(x, x1, . . . , xm) ∏_{i∈ne(fs)\x} ( ∑_{xsi} Gi(xi, xsi) )
            = ∑_{x1,...,xm} fs(x, x1, . . . , xm) ∏_{i∈ne(fs)\x} µ_{xi→fs}(xi)

Figure 75: Further factorization of the subgraph Fs

So, in order to compute the factor-to-variable messages µ f s → x ( x ),


one needs to sum over the product of the factor and remaining sub-
graph-sums. The latter themselves are messages from the variables
connected to f s .

To complete this "inductive" formulation, we need to formalize the variable-to-factor messages. First, notice that the subgraphs Gi(xi, xsi) further factorize as follows:

Gi(xi, xsi) = ∏_{ℓ∈ne(xi)\fs} Fℓ(xi, xiℓ)

Figure 76: Messages from the variables to the factors

Then, for the variable-to-factor messages we have:

µ_{xi→fs}(xi) = ∑_{xsi} Gi(xi, xsi) = ∑_{xsi} ∏_{ℓ∈ne(xi)\fs} Fℓ(xi, xiℓ)
              = ∏_{ℓ∈ne(xi)\fs} ( ∑_{xiℓ} Fℓ(xi, xiℓ) )
              = ∏_{ℓ∈ne(xi)\fs} µ_{fℓ→xi}(xi)

So, in order to compute the variable-to-factor message µ_{xi→fs}(xi), take the product of all incoming factor-to-variable messages.

The sum-product algorithm then repeats those steps until reaching a leaf node. To initiate the messages at the leaves of the graph, which have no neighbors left, we define them to be unit for variable leaves and identities for factor leaves;

µ_{x→f}(x) := 1          (variable leaf)
µ_{f→x}(x) := f(x)       (factor leaf)

Figure 77: Messages from leaf nodes in the sum-product algorithm

To compute the marginal p( x ), we treat x as the root of the tree,


and perform the following operations:

1. initialize the leaf nodes:

• if leaf is a factor f ( x ), initialize µ f → x ( x ) := f ( x )


• if leaf is a variable x, initialize µ x→ f ( x ) := 1

2. pass messages from the leaves towards the root x:

   µ_{fℓ→xj}(xj) = ∑_{xℓj} fℓ(xj, xℓj) ∏_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi)

   µ_{xj→fℓ}(xj) = ∏_{i∈ne(xj)\fℓ} µ_{fi→xj}(xj)

3. at the root x, take the product of all incoming messages (and


normalize).

To get the marginal of each node, once the root has received all the
messages, pass messages from the root back to the leaves. Once
every node has received the messages from all their neighbors,
take the product of all incoming messages at each variable (and
normalize).
This implies that inference on the marginal of all variables in a
tree-structure factor-graph is linear in graph size.

Incorporating observations
If one or more nodes xo in the graph are observed (xo = x̂o ), we
introduce factors f ( xio ) = δ( xio − x̂io ) into the graph. This amounts
to “clamping” the variables to their observed value.
Say x := [ xo , xh ]. Because p( xo , xh ) ∝ p( xh | xo ), the sum-
product algorithm can thus be used to compute posterior marginal
distributions over the hidden variables xh .

Generalization to any graph


There is a generalization from trees to general graphs, known as the
junction tree algorithm. The principal idea is to join sets of vari-
ables in the graph into larger maximal cliques until the resulting
graph is a tree. The exact process, however, requires care to ensure
that every clique that is a sub-set of another clique ends up in that
clique.

The computational cost of probabilistic inference on the marginal


of a variable in a joint distribution is exponential in the dimension-
ality of the maximal clique of the junction tree, and linear in the
size of the junction tree. The junction tree algorithm is exact for any
graph (i.e. it produces correct marginals), and efficient in the sense
that, given a graph, in general there does not exist a more efficient
algorithm (without using properties of the functions instead of the
graph).

The Max-Product/Max-Sum Algorithm


In this section, we again tune our sum-product algorithm in order
to find the jointly most probable state xmax = arg max_x p(x). First, let us note that arg max_x p(x) is in general not the vector of individual marginal maximizers [arg max_{x1} p(x1), . . . , arg max_{xn} p(xn)]. For example, consider the joint distribution (marginals in the margins):

            x2 = 0   x2 = 1
  x1 = 0     0.3      0.4    | 0.7
  x1 = 1     0.3      0.0    | 0.3
  ---------------------------
             0.6      0.4

Here arg max_{x1} p(x1) = 0 and arg max_{x2} p(x2) = 0, but the jointly most probable state is (x1, x2) = (0, 1).

Nevertheless, the max operation satisfies the following properties:

• max(ab, ac) = a max(b, c) (for a ≥ 0)

• max( a + b, a + c) = a + max(b, c)

• log maxx p( x) = maxx log p( x)

Thus, we can compute the most probable state xmax by taking the
sum-product algorithm and replacing all summations with maxi-
mizations (the max-product algorithm). For numerical stability, we
can further replace all products of p with sums of log p (the max-
sum algorithm). The only complication is that, if we also want to
know the arg max, we have to track it separately using an addi-
tional data structure.

To compute the most probable state xmax, we choose any xi as the root of the tree, and perform the following operations:

(For brevity, we only look at the max-sum version of the algorithm. The max-product version is the same, up to the terms highlighted in red; see the slides for the differences.)

1. initialize the leaf nodes:

   • if leaf is a factor f(x), initialize µ_{f→x}(x) := log f(x)

   • if leaf is a variable x, initialize µ_{x→f}(x) := 0

2. pass messages from leaves towards root:

   µ_{fℓ→xj}(xj) = max_{xℓj} ( log fℓ(xj, xℓj) + ∑_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi) )

   µ_{xj→fℓ}(xj) = ∑_{i∈ne(xj)\fℓ} µ_{fi→xj}(xj)

3. additionally track an indicator for the identity of the maximum (note: this is a function of xj)

   φ(xj) = arg max_{xℓj} ( log fℓ(xj, xℓj) + ∑_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi) )

4. once the root has messages from all its neighbors, pass messages from the root towards the leaves. At each factor node, set xℓj^max = φ(xj^max) (this is known as backtracking).
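For concreteness, here is a minimal max-sum (Viterbi) sketch on a discrete chain with made-up log-potentials; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 5
log_psi = [np.log(rng.random((k, k))) for _ in range(n)]   # log f_{i,i+1}, made up

mu = np.zeros(k)                 # message into x_0 (leaf): 0
backptr = []                     # phi(x_{i+1}): best predecessor state
for i in range(n):
    scores = log_psi[i] + mu[:, None]     # indexed [x_i, x_{i+1}]
    backptr.append(scores.argmax(axis=0))
    mu = scores.max(axis=0)               # message into x_{i+1}

x_max = [int(mu.argmax())]       # best final state, then backtrack
for phi in reversed(backptr):
    x_max.append(int(phi[x_max[-1]]))
x_max.reverse()
print(x_max)                     # jointly most probable state [x_0, ..., x_n]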

Condensed content

• Factor graphs provide graphical representation of joint proba-


bility distributions that is particularly conducive to automated
inference

• In factor graphs that are trees, all marginals can be computed


in time linear in the graph size by passing messages along the
edges of the graph using the sum-product algorithm.

• Computation of each local marginal is exponential in the dimen-


sionality of the node. Thus, in general, the cost of inference is
exponential in clique-size, but linear in clique-number.

• An analogous algorithm, the max-sum algorithm, can be used to


find the joint most probable state (also in linear time).

• Both algorithms fundamentally rest on the distributive properties

a(b + c) = ab + ac max( ab, ac) = a · max(b, c)

Message passing provides the general framework for managing


computational complexity in probabilistic generative models as far
as it is caused by conditional independence. It does not, however,
address complexity arising from the algebraic form of continuous
probability distributions. We already saw that exponential families
address this latter issue. However, not every distribution is an ex-
ponential family. A main theme for the remainder will be how to
project complicated joint distributions onto factor graphs of expo-
nential families.
Extended Example: Topic Modeling v

Summarizing the history of modern civilization is hard. This chapter starts off a series of lectures revolving around the idea of building a model of history. For this purpose, we will make use of a running example concerned with the State of the Union addresses.³¹ Traditionally, these are annual speeches that have been delivered by the presidents of the United States since 1790. The purpose of these addresses is to summarize the affairs of the US federal government, which usually cover several important topics, such as the nation's budget, news, healthcare, social policy, and many more. While the State of the Union (SotU) addresses are not a perfect reflection of the US history, these speeches are very suitable to work with. In particular, they have been historically regular, are entirely available in text format, and most importantly, are inherently topical. Our task in this series of lectures is to discover the topics of US history over time.

³¹ wikipedia.org/wiki/State_of_the_Union

Disclaimer: this is not a course in natural language processing! There is an entire toolbox of models for text analysis that will not be discussed here. The point of this exercise is to build craftware: customized, effective and efficient solutions to the learning task. Nevertheless, the model ultimately developed here is unusually expressive in its structure, and more flexible than standard tools.

A first look at the data v

In total, we will make use of D = 231 documents corresponding to speeches that have been delivered in the period 1790 - 2019 (with 2 speeches having been delivered in 1961 by Dwight D. Eisenhower and John F. Kennedy). The individual documents are roughly of length Id ∼ 10³ words. Even though presidents are known to be eloquent speakers, we approximate the usage of around V ∼ 10000 words from the vocabulary.
Since we are looking to reduce complexity, we have to throw out a
bit of structure. For the purpose of our analysis, we make two great
simplifications:
1. We remove redundant stop words required for human under-
standing, but carrying only negligible information (for example:
and, or, to).

2. We disregard the position of the words in the speeches, hence


modeling the texts as Bags of Words.
That being said, let us look at the first key quantity of interest – the
word frequency matrix X (see Fig. 78). The rows of the matrix rep-
resent each of the documents, i.e. speeches, whereas the columns
represent the identity of the word in the vocabulary.
Figure 78: The word frequency matrix (D documents × V words). Due to the large vocabulary size, a truncated representation is shown here.

Each item in the matrix represents the frequency of occurrence


of a given word, in a given document. By visual observation, one
can notice that certain words occur more often than others, and
inversely, one can notice that speeches usually focus on different
subsets of the words.

Figure 79: Low rank decomposition of the word frequency matrix: X (D documents × V words) ∼ Q (D documents × K topics) × Uᵀ (K topics × V words).

At a deeper level, we would be interested in the discovery of topics that the speeches revolve around. For this purpose, we begin our journey by looking into low rank matrix decomposition. Ultimately, we would like to decompose the matrix X in such a way that the discovery of the topics arises naturally (see Fig. 79). For this purpose, we start with dimensionality reduction.

Warning: the algorithms do not convey any more structure than they are designed to do. Assigning any personal interpretation to the results is inevitably subject to error.

Dimensionality reduction v

Consider a dataset X ∈ RD×V . Dimensionality Reduction aims to


find an encoding φ : RV → RK and a decoding ψ : RK → RV with
K ≪ V such that the encoded representation

Z := φ( X ) ∈ RD ×K

is a good approximation of X in the sense that some reconstruction


loss of X̃ = ψ( Z ),

L( X, ψ( Z )) = L( X, ψ ◦ φ( X ))

is minimized or small. This may be done to:

• save memory

• construct a low-dimensional visualization

• “find structure”
Linear PCA

Let us derive the famous Principal Component Analysis (PCA) algorithm. Again, consider a dataset X ∈ R^{D×V}. Furthermore, consider an orthonormal basis {ui}_{i=1,...,V}, uiᵀ uj = δij. Then, we can represent any point xd as a linear combination of the projections onto the orthonormal basis:

xd = ∑_{i=1}^V (xdᵀ ui) ui =: ∑_{i=1}^V αdi ui,   or simply vectorized:   X = (XU) Uᵀ

An approximation of the point xd in K < V degrees of freedom is given by any set (A, b, U) as follows:

x̃d := ∑_{k=1}^K adk uk + ∑_{ℓ=K+1}^V bℓ uℓ

In order to obtain the best approximation, find (A, b, U) such that the square empirical risk is minimized:

J = (1/D) ∑_{d=1}^D ‖xd − x̃d‖² = (1/D) ∑_{d=1}^D ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )²_v

First, let’s find adk and b j . Since the vectors u are orthonormal, recall
that ∑ j uij ukj = δik . Then, we simply differentiate with respect to the
parameters that we wish to optimize, and set the derivatives to zero
in order to obtain their optimal values:

∂J/∂adℓ = (2/D) ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )_v (−uℓv) = −(2/D) xdᵀ uℓ + (2/D) adℓ = 0
  =⇒ adk = xdᵀ uk

∂J/∂bℓ = (2/D) ∑_{d=1}^D ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )_v (−uℓv) = (2/D) ∑_{d=1}^D (−xdᵀ uℓ) + 2bℓ = 0
  =⇒ bj = x̄ᵀ uj,   where x̄ := (1/D) ∑_d xd

From here, the residual simplifies:

xd − x̃d = xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj
        = ∑_{ℓ=1}^V (xdᵀ uℓ) uℓ − ∑_{k=1}^K (xdᵀ uk) uk − ∑_{j=K+1}^V (x̄ᵀ uj) uj
        = ∑_{j=K+1}^V ( (xd − x̄)ᵀ uj ) uj
Using this result, along with the following notation for the sample covariance matrix S := (1/D) ∑_{d=1}^D (xd − x̄)(xd − x̄)ᵀ, we obtain:

J = (1/D) ∑_{d=1}^D ‖xd − x̃d‖² = (1/D) ∑_{d=1}^D ∑_{j=K+1}^V ( (xd − x̄)ᵀ uj )²
  = ∑_{j=K+1}^V ujᵀ ( (1/D) ∑_{d=1}^D (xd − x̄)(xd − x̄)ᵀ ) uj
  = ∑_{j=K+1}^V ujᵀ S uj

Basically, we are almost done. In order to find a set of orthonormal


vectors ui that minimize the square reconstruction error J, choose U
as the eigenvectors of the sample covariance S. From there, we can
get the best rank K reconstruction x̃d by setting

x̃d := ∑_{k=1}^K adk uk + ∑_{j=K+1}^V bj uj = ∑_{i=1}^K (xdᵀ ui) ui + ∑_{i=K+1}^V (x̄ᵀ ui) ui

Accordingly, this yields the following loss:

J = ∑_{j=K+1}^V λj

where λ j are the eigenvalues of S, sorted in a descending order.


Equivalently, if we first center the data

X̂ = X − 1 x̄ᵀ   (resulting in b = 0)

then the orthonormal vectors U are the (right) singular vectors of

X̂ = Q Σ Uᵀ.
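As a small illustration, the following sketch (on made-up toy data) computes the rank-K PCA reconstruction via the SVD of the centered data matrix, as derived above:

import numpy as np

rng = np.random.default_rng(0)
D, V, K = 200, 10, 3
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, V)) + rng.normal(size=V)   # toy low-rank data

x_bar = X.mean(axis=0)
X_hat = X - x_bar                       # centering makes b = 0
Q, S, Ut = np.linalg.svd(X_hat, full_matrices=False)

U_K = Ut[:K].T                          # top-K principal directions (V x K)
A = X_hat @ U_K                         # low-dimensional codes a_d
X_tilde = A @ U_K.T + x_bar             # rank-K reconstruction
print(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))   # = sum of the discarded eigenvalues of S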

Probabilistic PCA v
We have seen several times that various statistical algorithms have
probabilistic interpretation. In this section, we explore the prob-
abilistic aspects of PCA, in order to better understand its implicit
assumptions.

We start by treating the loss, up to scaling, as a non-normalized


negative log likelihood:

J = −c · log p(X | X̃) + log Z = (1/D) ∑_{d=1}^D ‖xd − x̃d‖²

From here, it becomes obvious that the corresponding likelihood


should factorize as a product of D independent Gaussians:

p(X | X̃) = ∏_{d=1}^D N(xd; x̃d, σ² I)

We also need to encode that we want a low-dimensional, linear embedding. Furthermore, we want this embedding to be expressed in terms of independent (orthogonal) dimensions. Thus, consider the following representation of xd:

xd = V ad + µ + ε,   where p(ad) = N(0; I_K), V ∈ R^{V×K}, p(ε) = N(0; σ²)

In particular, ad is the low-dimensional latent representation, V is the linear mapping from lower to higher-dimensional space, µ is the global dataset shift, and ε is the noise term. The corresponding graphical model can be seen on Fig. 80.

Figure 80: A graphical model of probabilistic PCA.

That being said, the marginal likelihood can be formulated as:

p(X) = ∏_{d=1}^D ∫ p(xd | ad) p(ad) dad = ∏_d N(xd; µ, C)

where C := VVᵀ + σ² I. Now, the corresponding log marginal likelihood is:

log p(X) = −(DV/2) log(2π) − (D/2) log |C| − (1/2) ∑_{d=1}^D (xd − µ)ᵀ C⁻¹ (xd − µ)

Maximizing this expression with respect to µ yields:

x̄ = arg max_µ log p(X)

Thus, by plugging the maximizer back in, the maximum (log) likelihood can be written as:

log p(X) = −(D/2) ( V log(2π) + log |C| + tr(C⁻¹ S) )
where S is again the sample covariance matrix. Furthermore, it can
be shown that the maximum likelihood estimates for V and σ² are:

V_ML = U_{1:K} (Λ_K − σ² I)^{1/2} R

σ²_ML = (1/(V − K)) ∑_{j=K+1}^V λj

where R is a rotation matrix (RR| = IK ) and S = UΛU | . Notice


that, in the (probabilistic) maximum likelihood setting of PCA,
we obtained that the optimal projection V also relies on the first
K eigenvectors of the sample covariance matrix S, up to scaling
(ΛK − σ2 I )1/2 and rotation R. Furthermore, we obtained that the
Gaussian noise occurring during reconstruction corresponds to the
average of the smallest (V − K ) eigenvalues of S.

By setting σ2 , µ, U with their maximum likelihood estimates,


along with the rotation R = I, one obtains the posterior over the
latent variable:

p(ad | xd) = N( ad; (VᵀV + σ² I)⁻¹ Vᵀ (xd − x̄), σ² (VᵀV + σ² I)⁻¹ )
           = N( ad; Λ_K⁻¹ (Λ_K − σ² I_K)^{1/2} U_{1:K}ᵀ (xd − x̄), σ² Λ_K⁻¹ )

Primary results v
Now that we have obtained the first tool to analyze our dataset
with, let us see the primary results. If we ignore the preprocessing
steps, the implementation is a one-line solution in Python:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# bag-of-words counts (D x V); `preprocessed` holds the cleaned speeches
count_vect_lsa = CountVectorizer(max_features=VOCAB_SIZE, stop_words=['000'])
X_count = count_vect_lsa.fit_transform(preprocessed).toarray()

# low-rank factorization of the (uncentered) count matrix
U_, S_, V_T_ = np.linalg.svd(X_count, full_matrices=False)

Notice that we use the SVD approach for factorization, yet we didn't subtract the mean from the dataset Xcount. Strictly speaking, this is not exactly PCA, but a rather popular algorithm in the field of Natural Language Processing known as Latent Semantic Indexing. Nevertheless, the singular value decomposition (SVD) minimizes ‖X − QΣUᵀ‖²_F, for orthonormal matrices Q ∈ R^{D×K} and U ∈ R^{V×K}, and a diagonal Σ ∈ R^{K×K} with positive diagonal entries (the singular values). We might naïvely think of Q as a mapping from documents to topics, Uᵀ from topics to words, and Σ as the relative strength of topics.

Figure 81: SVD factorization into orthonormal matrices Q ∈ R^{D×K} and U ∈ R^{V×K}, and a diagonal Σ ∈ R^{K×K}.

In fact, if we look at the first 5 topics (rows of Uᵀ), sort the columns (words in vocabulary) by their intensity in decreasing order, and select the words for each topic with the largest values, we obtain the following results:

1. tonight fight taxis faith century today enemy fellow

2. year program world new work need help america

3. dollar war program fiscal year expenditure million united



4. man law dollar business national corporation legislation labor

5. administration policy energy program continue development

For each of the samples, one could postulate about the potential
topics from which the words were generated. However, there are
several problems with our approach:

• the matrices Q and U are in general dense: Every document con-


tains contributions from every topic, and every topic involves all
words.

• the entries in Q, U, and Σ are hard to interpret: They do not


correspond to probabilities

• the entries of Q, U can be negative (what does it mean to have a


negative topic?)

In the next chapter we look into how one could resolve these issues.
Latent Dirichlet Allocation v

Designing your own craftware is critically important


for good performance. After having built most of our toolbox,
we turn to designing a probabilistic machine learning model for
topic modeling that is adequate for our dataset on the State of the
Union addresses. Here is a general overview of the main steps
when designing your own model:

1. get the data

• try to collect as much meta-data as possible


• take a close look at the data

2. build the model

• identify quantities and data structures; assign names


• design a generative process (graphical model)
• assign (conditional) distributions to factors/arrows (use expo-
nential families!)

3. design the algorithm

• consider conditional independence


• try standard methods for early experiments
• run unit-tests and sanity-checks
• identify bottlenecks, find customized approximations and
refinements

4. Test the Setup

5. Revisit the Model and try to improve it, using creativity

As we discussed the issues of PCA in the last chapter, our goal now
is to create a model with the following properties:

• document sparsity: each document d should only contain a


small number of topics

• word sparsity: each topic k should only contain a small number


of the words v in the vocabulary

• non-negativity: a topic can only contribute positively to a docu-


ment.
Figure 82: Our wanted model structure: W (D documents × V words) ∼ Π (D documents × K topics) × Θ (K topics × V words)

Since the Dirichlet distribution encodes sparsity, and its values are nonnegative, it thereby fulfills our required properties for an adequate model. For this reason, we will build a Latent Dirichlet Allocation model. We can think of LDA as a sparsity-inducing, non-negative dimensionality reduction technique. It can be applied to any kind of grouped discrete data, but we will focus on its application for natural language processing. In that sense, it is a generative probabilistic model that assumes a topic being a mixture over a set of words and a document being a mixture over a set of topic probabilities. The words are our only observed quantities, whereas everything else is a latent variable.

The Dirichlet distribution is defined as

p(π | α) = D(π; α) = ( Γ(∑k αk) / ∏k Γ(αk) ) ∏_{k=1}^K πk^{αk−1} = (1/B(α)) ∏_{k=1}^K πk^{αk−1}

Figure 83: Graphical model for Latent Dirichlet Allocation (αd → πd → cdi → wdi ← θk ← βk, with plates over i = [1, . . . , Id], d = [1, . . . , D], and k = [1, . . . , K]).

The generative model is as follows (for graphical representation, see Fig. 83); to draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:

• Draw K topic distributions θk over V words from
  p(Θ | β) = ∏_{k=1}^K D(θk; βk)

• Draw D document distributions πd over K topics from
  p(Π | α) = ∏_{d=1}^D D(πd; αd)

• Draw topic assignments cdik of word wdi from
  p(C | Π) = ∏_{i,d,k} πdk^{cdik}

• Draw word wdi from
  p(wdi = v | cdi, Θ) = ∏_k θkv^{cdik}

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1, . . . , ndkV] and ndk· = ∑v ndkv, etc.

The generative model is based on the assumption that each document d is generated by first choosing a distribution πd over topics, and then generating each word wdi at random from the word distribution θk of an assigned topic cdi chosen from πd. The two distributions, Π and Θ, are Dirichlet distributions with respective parameters α and β. Notice that α determines the document-topic density – a larger value
leads to more topics per document. Furthermore, β determines the
topic-word density – a larger value leads to more words per topic.
To simplify matters, we will fix these quantities to constant values.
Lastly, note that C is a sparse matrix containing information about
which word belongs to which topic.
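To make the generative process concrete, here is a minimal ancestral-sampling sketch; the sizes and the symmetric hyperparameters α, β are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
D, K, V, I_d = 4, 3, 20, 50          # documents, topics, vocabulary size, words per doc
alpha, beta = 0.5, 0.1               # smaller values encourage sparser mixtures

Theta = rng.dirichlet(beta * np.ones(V), size=K)   # topic-word distributions (K x V)
Pi = rng.dirichlet(alpha * np.ones(K), size=D)     # document-topic distributions (D x K)

W, C = [], []
for d in range(D):
    c_d = rng.choice(K, size=I_d, p=Pi[d])                      # topic assignment per word
    w_d = np.array([rng.choice(V, p=Theta[k]) for k in c_d])    # word drawn from its topic
    C.append(c_d); W.append(w_d)
print(W[0][:10], C[0][:10])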

Designing an algorithm v

Since this is a learning problem, we are interested in inferring the


latent variables Π, Θ and C given the observations W. This amounts
to solving

p(C, Π, Θ | W) = p(C, Π, Θ, W) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ
               = p(W | C, Π, Θ) · p(C, Π, Θ) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ
Let us take a deeper look at the joint p(C, Π, Θ, W ). Using the prop-
erties of Directed Graphical models, we can factorize the joint:
p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(W | C, Θ) · p(Θ | β)
             = ( ∏_{d=1}^D p(πd | αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} p(cdi | πd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} p(wdi | cdi, Θ) ) · ( ∏_{k=1}^K p(θk | βk) )
             = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K θ_{k,wdi}^{cdik} ) · ( ∏_{k=1}^K D(θk; βk) )

For now, let us focus on the first two factors, p(Π | α) · p(C | Π). By further expanding the Dirichlet and utilizing the notation for ndkv mentioned above, we obtain:

p(Π | α) · p(C | Π) = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} )
                    = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} )
                    = ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·}
One could perform similar steps for the remaining two factors, p(Θ | β) · p(W | C, Θ). Then, we can formulate the joint as follows:

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)
             = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )

For a moment, let us briefly go back to the original factorization of


the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K θ_{k,wdi}^{cdik} ) · ( ∏_{k=1}^K D(θk; βk) )
Now, if we had Π, Θ (which we don't), then the posterior p(C | Θ, Π, W) would be easy to compute from the expression above:

p(C | Θ, Π, W) = p(W, C, Θ, Π) / ∑_C p(W, C, Θ, Π) = ∏_{d=1}^D ∏_{i=1}^{Id} [ ∏_{k=1}^K (πdk θ_{k,wdi})^{cdik} / ∑_{k′} πdk′ θ_{k′,wdi} ]

Note that this conditional independence can easily be read off from
the graph (see Fig. 83). Recall the definition for a Markov blanket in
directed graphical models: when for a given variable (in our case
C) we condition on its parents (Π), children (W) and co-parents (Θ),
the terms in the variable become independent i.e. factorize.

Now, let us go back to the derived factorization of the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )

If we had C (which we don't), given the above expression, the posterior p(Θ, Π | C, W) simplifies:

p(Θ, Π | C, W) = p(C, W, Π, Θ) / ∫ p(Θ, Π, C, W) dΘ dΠ
              = ( ∏_d D(πd; αd) ∏_k πdk^{ndk·} ) ( ∏_k D(θk; βk) ∏_v θkv^{n·kv} ) / p(C, W)
              = ( ∏_d D(πd; αd: + nd:·) ) ( ∏_k D(θk; βk: + n·k:) )

Note that this conditional independence can not be easily read off
from the above graph!

At this point, it is a good idea to open up our toolbox and rea-


son about which tool can help us overcome the computational
complexities of our model. An idea arises – why not use a Markov
Chain Monte Carlo method to construct an approximate posterior
distribution? Since we have derived analytical formulations for the
conditionals P(C | Θ, Π, W ), P(Θ | C, W ), and P(Π | C, W ), it seems
like a good idea to use Gibbs Sampling. That being said, for the
algorithm to work, one iterates between:

Θ ∼ p(Θ | C, W) = ∏_k D(θk; βk: + n·k:)

Π ∼ p(Π | C, W) = ∏_d D(πd; αd: + nd:·)

C ∼ p(C | Θ, Π, W) = ∏_{d=1}^D ∏_{i=1}^{Id} [ ∏_{k=1}^K (πdk θ_{k,wdi})^{cdik} / ∑_{k′} πdk′ θ_{k′,wdi} ]

by sampling from a Dirichlet distribution for Θ and Π, as well as


from a categorical distribution for C. Note that sampling from these
distributions is comparably easy and there are various libraries that
provide such methods. For this algorithm to work, all we have to
keep around are the counts n (which are sparse) and Θ, Π which

are comparably small. Thanks to factorization, much can be done in


parallel.

Unfortunately, this sampling scheme is relatively slow to move


out of initialization, because C strongly depends on Θ and Π, and
vice versa. For this reason, in order to obtain results in reasonable
time, properly vectorizing the code is of high importance.
Efficient Inference and K-Means v

The biggest adversary of the probabilistic approach is


computational complexity. In this chapter, we seek to fur-
ther optimize the sampling strategy in Latent Dirichlet Allocation,
which alternated between sampling C, and then Π and Θ. Later in
the chapter, we turn to a particularly popular algorithm for cluster-
ing data, known as K-Means. Naturally, we will seek to provide a
probabilistic interpretation.

Collapsed Gibbs Sampling v

As we have already discussed, the previous implementation of La-


tent Dirichlet Allocation alternates between sampling from C, and
then Π and Θ. A collapsed sampling method can converge much
faster by eliminating the latent variables that mediate between indi-
vidual data. For this reason, we develop an implementation which
does not need to sample Π and Θ, rather only samples C. Let’s
revisit the last formulation that we derived for the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )
             = ( ∏_{d=1}^D (B(αd + nd:·) / B(αd)) D(πd; αd + nd:·) ) · ( ∏_{k=1}^K (B(βk + n·k:) / B(βk)) D(θk; βk + n·k:) )

If we marginalize out Θ and Π, we obtain:

p(C, W) = ( ∏_{d=1}^D B(αd + nd:·) / B(αd) ) · ( ∏_{k=1}^K B(βk + n·k:) / B(βk) )
        = ∏_d [ ( Γ(∑_{k′} αdk′) / Γ(∑_{k′} (αdk′ + ndk′·)) ) ∏_k Γ(αdk + ndk·) / Γ(αdk) ] · ∏_k [ ( Γ(∑_v βkv) / Γ(∑_v (βkv + n·kv)) ) ∏_v Γ(βkv + n·kv) / Γ(βkv) ]

Now, we compute the probability of assigning topic k to word wdi, given all other topic assignments C\di and words W (recall Γ(x + 1) = x · Γ(x) for all x ∈ R+):

p(cdik = 1 | C\di, W) = (αdk + ndk·\di)(βk,wdi + n·k,wdi\di)(∑v βkv + n·kv\di)⁻¹ / ∑_{k′} (αdk′ + ndk′·\di)(βk′,wdi + n·k′,wdi\di)(∑v βk′v + n·k′v\di)⁻¹

Essentially, we are done with our improvement of the algorithm.


The algorithm we end up with is quite simple – we only need to
keep track of some counter variables, and loop over the sampling

of word-topic assignments for the desired number of iterations.


Note that the distributions Π and Θ can be inferred later using the
counts we obtain after convergence. Following is a pseudocode
implementation for the algorithm:
procedure LDA(W, α, β)
    ndkv ← 0  ∀ d, k, v                                  ▷ initialize counts
    while true do
        for d = 1, . . . , D;  i = 1, . . . , Id do       ▷ can be parallelized
            cdi ∼ p(cdik = 1 | C\di, W) ∝ (αdk + ndk·\di)(βk,wdi + n·k,wdi\di)(∑v βkv + n·kv\di)⁻¹   ▷ sample assignment
            n ← UpdateCounts(cdi)                         ▷ update counts (check whether first pass or repeat)
        end for
    end while
end procedure
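A minimal Python sketch of this collapsed Gibbs sampler (on a made-up toy corpus with symmetric hyperparameters) could look as follows; it only maintains the count arrays and resamples one assignment at a time:

import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 20
W = [rng.integers(V, size=50) for _ in range(4)]     # made-up corpus: word indices per document
alpha, beta = 0.5, 0.1
D = len(W)

n_dk = np.zeros((D, K))          # words in document d assigned to topic k
n_kv = np.zeros((K, V))          # times word v is assigned to topic k
n_k = np.zeros(K)                # total words assigned to topic k
C = [rng.integers(K, size=len(w)) for w in W]        # random initial assignments
for d, w in enumerate(W):
    for i, v in enumerate(w):
        k = C[d][i]; n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

for _ in range(200):             # Gibbs sweeps
    for d, w in enumerate(W):
        for i, v in enumerate(w):
            k = C[d][i]          # remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
            p = (alpha + n_dk[d]) * (beta + n_kv[:, v]) / (V * beta + n_k)
            k = rng.choice(K, p=p / p.sum())          # sample the new assignment
            C[d][i] = k
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

Theta = (beta + n_kv) / (V * beta + n_k)[:, None]     # point estimates from the final counts
Pi = (alpha + n_dk) / (K * alpha + np.array([len(w) for w in W]))[:, None]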

K-Means v

Until now, we have mostly looked into supervised problems – given


input-output pairs ( xi , yi )i=1,...,N , we recovered a function mapping
x to y. This problem structure includes regression, classification,
predicting time series, and structured output prediction. On the
other side of the spectrum are unsupervised problems, where we are
only given data ( xi )i=1,...,n , but no labels y. In this type of setting,
one could perform:

• Generative Modeling: assume the samples are generated indepen-


dently from some distribution p. Can we generate more samples
from p?

• Clustering: can we split the samples into [1, . . . , C] classes?

For an example of a clustering problem, consider the recording of eruptions from the Old Faithful geyser³². The data captures the relationship between the waiting time (the time since the last eruption) and the duration of the eruptions.

³² A. Azzalini and A. W. Bowman. A look at some data on the Old Faithful geyser. Applied Statistics 39, pages 357–365, 1990. URL https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat

Figure 84: Possible clustering of the Old Faithful dataset (waiting time [mins] against duration [mins]).

This dataset contains some structure, with two “blobs” of data-


points. An example of a possible clustering of this dataset is shown
in Fig. 84. In this chapter we look into how one can perform cluster-
ing, whereas in the next chapter we illustrate that clustering can be
seen as a subtype of Generative Modeling.

The K-means algorithm is one of the oldest clustering methods,


with a relatively simple implementation:

1. Initialize by creating K means {mk }k=1,...,K at random values

2. Assign each datapoint xi to the nearest mean by solving

   ki = arg min_k ‖mk − xi‖²

   and define binary responsibilities

   rki = 1 if ki = k, and 0 otherwise.

3. Update the means to be the means of each cluster

   mk ← (1/Rk) ∑_i rki xi,   where Rk = ∑_i rki.

4. Repeat the Assign and Update steps until the assignments do


not change.
Figure 85: Example run of the 2-means
algorithm on the Old Faithful dataset.
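A minimal numpy sketch of this loop (on made-up toy data) could look as follows:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]               # initialize means at random points
    for _ in range(iters):
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)   # squared distances (n x K)
        k = d2.argmin(axis=1)                                 # assign step
        new_m = np.array([X[k == c].mean(axis=0) if (k == c).any() else m[c] for c in range(K)])
        if np.allclose(new_m, m):                             # means stable -> assignments stable
            break
        m = new_m                                             # update step
    return m, k

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3])])
means, labels = kmeans(X, K=2)
print(means)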

The K-means algorithm always converges, as there is a function de-


scribing the “amount of error” of the assignment which decreases
at each step.

Definition 66 (Lyapunov Function). In the context of iterative al-


gorithms, a Lyapunov Function J is a positive function of the algo-
rithm’s state variables that decreases in each step of the algorithm.

The existence of a Lyapunov function J means that one can think


of the algorithm as an optimization routine which guarantees con-
vergence at a local minimum.

For K-means, the function in question is

J(r, m) = ∑_{i=1}^n ∑_{k=1}^K rik ‖xi − mk‖².
The Assign step decreases J by definition. The Update step also decreases J because it is convex³³ in mk, and finds a local minimum in mk:

∂J(r, m)/∂mk = −2 ∑_i rik (xi − mk) = 0   ⇒   mk = ∑_i rik xi / ∑_i rik.

³³ A function is convex if its Hessian is positive (semi-)definite everywhere. This holds if the second partial derivatives are positive everywhere, which is true in our case: ∂²J(r, m)/∂mk² = 2 ∑_i rik ≥ 0.

However, the Lyapunov function is not convex in both r and m. For

this reason it can have multiple local minima, resulting in different


outcomes based on the initialization. Even though the K-means
algorithm can work well, it has some issues:

• It has no way to set K.


Coupled with the random initialization issue, this can lead to
hard-to-interpret results, as shown in Fig. 86.

Figure 86: Two runs of 4-means on a toy dataset. Different random initializations lead to different clusters.

• It cannot set the shape of the clusters.

  Due to the choice of distance function ‖xi − mk‖², the clusters


are assumed to be spherical. This can lead to issues, as shown in
Fig. 87.
Figure 87: Two examples of shape
issues with K-means.

K-means is a simple algorithm that always finds a stable clustering.


However, the resulting clusters can seem unintuitive, as they may
not capture the structure that we would expect. A probabilistic
interpretation of K-Means will yield clarity, and allow for fitting all
the necessary parameters.
Mixture Models & EM v

Probabilistic treatment allows for deeper understand-


ing. Last chapter we introduced K-Means, which is a simple algo-
rithm that always finds a stable clustering. However, we discovered
that the resulting clusterings can be unintuitive, as they do not
capture the shape of clusters or their number, and are subject to
random fluctuations. In this chapter, we discover that a probabilistic
interpretation of K-Means yields clarity, and allows fitting all the
parameters. As a neat side effect, it will lead us to the final entry in
our toolbox.

Gaussian Mixture Models v

Soft K-Means

In the last chapter we have discussed the main challenges of K-


Means. One idea to address some of these issues is to relax the hard
assignments rik = 1{k = arg minc kmc − xi k2 }. Appropriately,
this yields the Soft K-Means algorithm, which uses the softmax
approach:
exp(− βkmk − xi k2 )
rik = 2
.
∑c exp(− βkmc − xi k )

Soft K-means allows points to be partly assigned to several clusters


at the same time. How “soft” this assignment is, depends on the
stiffness parameter β. As β → 0, the assignments get more uniform
on all clusters; and as β → ∞ we get back to K-means. Even though
this does not resolve all of the issues (in particular, the assump-
tion that the shape of the clusters is spherical), this leads the way
towards a probabilistic interpretation of K-means.

Refined Soft K-Means

We continue our refinement of the K-Means algorithm by taking


a closer look at the loss. In many (but not all) cases, we know that
minimizing the empirical risk is frequently identified with maxi-
mizing the likelihood. Given the Lyapunov function minimized by
K-Means, let us try to identify the corresponding likelihood that is
152 probabilistic machine learning

being maximized:

n K
(r, m) = arg min ∑ ∑ rik k xi − mk k2
r,m i k
n K
= arg max ∑ ∑ rik (−1/2σ−2 k xi − mk k2 ) + const
i k
n  K 
= arg max ∏ ∑ rik exp −1/2σ−2 k xi − mk k2 /Z
i k
n K
= arg max ∏ ∑ rik N ( xi ; mi , σ2 I )
i k
= arg max p( x | m, r )

Given the likelihood, we realize that one probabilistic interpreta-


tion of the clustering problem is through the lens of a (generative)
Gaussian Mixture model. Naturally, the model makes the assump-
tion that the data points are generated from Gaussian distributions.
However, there are multiple underlying Gaussians, and we do not
know which one generated each datapoint.

For K clusters with means and variances (µk , Σk )k=1,...,K , the genera-
tive process first chooses which cluster to draw from with probabil-

ity πk and then samples from N x; µk , Σk .
The likelihood model can be written as
K  K
p( x |π, µ, Σ) = ∑ πk N x; µk , Σk with πk ∈ [0, 1], ∑ πk = 1.
k =1 k =1

Figure 88: Gaussian Mixture


(a) Data generated by 3 Gaussians,
(b) Same data – but we do not know
the cluster assignment,
(c) Possible clustering after identifica-
tion of the cluster.

0.2
Given a dataset x1 , . . . , xn , we want to learn the generative model 0.3
0.5
(π, µ, Σ) (see Fig. 90 for a graphical representation), using the likeli-
hood
n K  Figure 89: Generative model matching
p( x |π, µ, Σ) = ∏ ∑ πk N xi ; µk , Σk . Fig. 92.
i =1 k =1

Ideally, we would like to do Bayesian inference

p( x |π, µ, Σ) p(π, µ, Σ)
p(π, µ, Σ| x ) = ,
p( x )

but since the likelihood is not an exponential family, there is no ob-


vious conjugate prior. Furthermore, the posterior does not factorize
over µ, π, Σ, since for example µ 6⊥⊥ π | x.
mixture models & em v 153

Therefore, we try to maximize the (log) likelihood for π, µ, Σ:


π
 
n k
log p( x | π, µ, Σ) = ∑ log ∑ π j N (xi ; µ j , Σ j ) , where
i j
1 | −1
µk Σk
e− 2 ( x −µ) Σ ( x −µ) )
N ( x; µ, Σ) =
(2π )d/2 |Σ|1/2 k

To maximize w.r.t. µ, set the gradient of the log likelihood to 0:


n π j N ( xi ; µ j , Σ j ) x
∇µ j log p( x | π, µ, Σ) = − ∑ Σ −1 ( x i − µ j )
i ∑ j0 π j N ( xi ; µ j , Σ j ) j
| {z }
=:r ji n
n
1
∇µ j log p = 0 ⇒ µj =
Rj ∑ r ji xi R j := ∑ r ji Figure 90: Graphical Model for Gaus-
sian Mixture Model
i i

To maximize w.r.t. Σ set gradient of log likelihood to 0:

1 n π j N ( x i ; µ j , Σ j )  −1 
∇Σ j log p( x | π, µ, Σ) = − ∑ Σ ( xi − µ j )( xi − µ j )| Σ−1 − Σ−
j
1
2 i ∑ j0 π j N ( xi ; µ j , Σ j )
| {z }
=:r ji
1
n ∂|Σ|−1/2 /∂Σ = − |Σ|−3/2 |Σ|Σ−1
1 | 2
∇Σ j log p = 0 ⇒ Σj =
Rj ∑ r ji (xi − µ j )(xi − µ j ) R j := ∑ r ji ∂(v| Σ−1 v)/∂Σ = −Σ−1 vv| Σ−1
i i

To maximize w.r.t. π, enforce ∑ j π j = 1 by introducing a Lagrange


multiplier λ and optimize
 
n N ( xi ; µ j , Σ j )
∇π j log p( x | π, µ, Σ) + λ ∑ π j − 1 = ∑ +λ
0 π j N ( xi ; µ j , Σ j )
j i ∑j

Set the derivative to 0 and multiply both sides by π j :

n N ( xi ; µ j , Σ j ) n
0= ∑ πj ∑ j0 π j N ( xi ; µ j , Σ j )
+ λπ j = ∑ rij + λπ j
i i

Now, if we sum the above expression over all j, and use the con-
straint ∑ j π j = 1, we obtain the optimal value for λ:

λ = −n

Therefore, for the optimal value of π j we obtain:

Rj
πj =
n
If we know the responsibilities rij , we can optimize µ, Σ, π ana-
lytically. And if we know µ, π, we can set rij ! This leads us to the
following algorithm:

1. Initialize µ, π (e.g., random µ, uniform π, identity Σ).

2. Set
π j N ( xi ; µ j , Σ j )
rij =
∑kj0 π j0 N ( xi ; µ j0 , Σ j0 )
154 probabilistic machine learning

3. Update
n
1
Rj = ∑ r ji µj =
Rj ∑ rij xi
i i
n Rj
1
Σj =
Rj ∑ rij (xi − µ j )(xi − µ j )| πj =
n
i

4. Go back to 2.
This algorithm might seem arbitrary at first, but it is a case of the
Expectation-Maximization algorithm, which fits a probabilistic model
by alternating between (1) computing the expectation of some latent
variables – the responsibilities; and (2) maximizing the likelihood of
the parameters – the cluster parameters.

To make the connection to (soft) K-Means more apparent, consider


a diagonal covariance matrix Σ j = β−1 I for all j = 1, . . . , k. Then, for
the responsibilities we obtain:

π j N ( xi ; µ j , Σ j ) R j exp(− βk xi − m j k2 )
rij = =
∑kj0 π j0 N ( xi ; µ j0 , Σ j0 ) ∑ j0 R j0 exp(− βk xi − m j0 k2 )

So, we can conclude that the EM algorithm is indeed a refinement


of soft K-means. Notice that again, for β → ∞, we get back the
classic K-means.

Interestingly, some of the implicit assumptions (or rather patholo-


gies) of K-means become apparent once we have this probabilis-
tic perspective. In particular, we can deduce that K-means is the
maximum-likelihood estimate of a hard-assignment Gaussian Mix-
ture Model, with cluster variances σ2 → 0 (since β → ∞), resulting
π
in point-mass Gaussian components.

Expectation Maximization v
zi:
Let us note that, even though we have spent a lot of time and en-
ergy deriving the probabilistic interpretation of K-Means, this was
µk Σk
in fact not our ultimate goal.
k
We wanted to find a particular algorithmic structure that can be
used for probabilistic generative models where it is not straightfor- xi
ward to find a maximum likelihood expression in closed form. This
is exactly what the EM algorithm achieves. n
Figure 91: Graphical model for the
Let us first revisit the Gaussian Mixture Model, and introduce the Gaussian mixture with latent variables
z
latent variable z so that things simplify. Consider the binary ran-
dom variable zij ∈ {0; 1} s.t. ∑ j zij = 1. We define:

p(zij = 1) = π j p ( x i | z j = 1) = N ( x i ; µ j , Σ j )

Then, for the marginal we obtain:


k
p ( xi ) = ∑ p(z = j) p(xi | z = j) = ∑ π j N (x; µ j , Σ j )
j j
mixture models & em v 155

From there, we can easily compute the posterior for zij

p(zij = 1) p( xi | zij = 1, µ j , Σ j )
p(zij = 1 | xi , µ, Σ) =
∑kj0 p(zij0 = 1) p( xi | zij0 = 1, µ j , Σ j )
π j N ( xi ; µ j , Σ j )
=
∑ j0 π j0 N ( xi ; µ j0 , Σ j0 )
= rij

So it turns out that the responsibilities rij are the marginal posterior
probability ([E]xpectation) for zij = 1! In the previous chapter, we
have seen that if we knew the cluster responsibilities rij , we could
optimize µ, Σ and π analytically, and vice-versa. We did not know
z, so we replaced it with its expectation, leading to the Expectation-
Maximization algorithm, which repeats the two following steps:

(E) Compute the expectation of the latent variables

(M) Maximize the likelihood w.r.t. the parameters

Generic EM
The EM algorithm attempts to find maximum likelihood estimates
for models with latent variables. In this section, we describe a more
abstract view of EM which can be extended to other latent variable
models. Let x be the entire set of observed variables and z the entire
set of latent variables. We are interested in finding the maximum
(log) likelihood estimate for the model:
!
θ? = arg max log( P( x | θ )) = arg max log ∑ p(x, z | θ )
θ θ z

As we noted above, the existence of the sum inside the logarithm


prevents us from applying the log to the densities which results
in a complicated expression for the MLE. Now suppose that we
observed both x and z. We call { x, z} the complete data set, and we
say x is incomplete.

Again, if we knew z (which we don’t), the maximization would


be easy, since there would be only one term in the sum. Notice
however, that the information we do have about z is contained in
the posterior of the latent variables P(z | x, θ ). Since we don’t know
the complete log-likelihood p( x, z | θ ), we consider its expectation
under this posterior. This corresponds to the E-step. In the M-step,
we maximize this expectation in order to find a new estimate for
the parameters.
156 probabilistic machine learning

Basically, we are ready to formally write down our algorithm.


Once we initialize the parameters θ0 , we iterate between:

1. Compute p(z | x, θold )

2. Set θnew to the Maximum of the Expectation of the complete-data


log likelihood:
 
θnew = arg max ∑ p(z | x, θold ) log p( x, z | θ ) = arg max E p(z| x,θold ) log p( x, z | θ )
θ z θ

3. Check for convergence of either the log likelihood, or θ.

EM for Gaussian Mixtures


Let us return to our example once more and re-write the EM algo-
rithm in its generic form. Using the notation introduced earlier in
the section, we can write the likelihood as:
n k
p( x | π, µ, Σ) = ∏ ∑ π j N ( xi ; µ j , Σ j )
i j

Ideally, we would like to directly


 maximize  the log likelihood with
respect to the parameters θ := π j , µ j , Σ j :
j=1,...,k
 
 n k
log p( x | π, µ, Σ = log ∏ ∑ π j N ( xi ; µ j , Σ j )
i j
 
n k
= ∑ log ∑ π j N ( xi ; µ j , Σ j )
i j

Instead, maximizing the complete log-likelihood is easier:


 
 n k
zij
log p( x, z | π, µ, Σ) = log ∏ ∏ π j N ( xi ; µ j , Σ j )zij 
i j
 
= ∑ ∑ zij log π j + log N ( xi ; µ j , Σ j ) )
i j | {z }
easy to optimize (exponential families!)

Putting everything together, we obtain:

1. Compute p(z | x, θ ):

p(zij = 1) p( xi | zij = 1) π j N ( xi ; µ j , Σ j )
p(zij = 1 | xi , µ, Σ) = = =: rij
∑kj0 p(zij0 = 1) p( xi | zij0 = 1) ∑ j0 π j0 N ( xi ; µ j0 , Σ j0 )

2. Maximize
  
E p(z| x,θ ) log p( x, z | θ ) = ∑ ∑ rij log π j + log N ( xi ; µ j , Σ j )
i j

(notice that E p(z| x,θ ) [zij ] = rij )


Free Energy v

Machine Learning is the application of scientific mod-


eling to everything. In the first part of this chapter, we take
a deeper look at Expectation-Maximization in order to provide
an intuition for its convergence. Then we introduce an extremely
powerful and flexible approximation method known as Variational
Inference.

Convergence of EM v

Let us start this section by introducing a very convenient theorem


that we will make use of shortly after.
Theorem 67 (Jensen’s inequality (Jensen,1906)). Let (Ω, A, µ) be a
probability space, g be a real-valued, µ-integrable function and φ be
a convex function on the real line. Then
Z  Z
φ g dµ ≤ φ ◦ g dµ.
Ω Ω

Recall that in EM we constructed an approximate distribution


q(z) = p(z | x, θ ) for our latent quantity z. Using the Jensen’s
inequality, for any approximation q(z) s.t. q(z) > 0 wherever p( x, z |
Figure 92: Intuition for Jensen’s
θ ) > 0, it holds: inequality
Z
log p( x | θ ) = log p( x, z | θ ) dz
ln p(X|θ)
Z
p( x, z | θ )
= log q(z) dz
q(z)
Z
p( x, z | θ )
≥ q(z) log dz =: L(q)
q(z)
L (q, θ)

Thus, by maximizing the RHS in θ in the M-step, we increase a θ old θ new

lower bound on the LHS, which is the target quantity we want to


maximize. Figure 93: Optimizing the EM-
To demonstrate that it indeed maximizes the LHS, we will show likelihood leads to improvements
on the original likelihood. The EM
that the E-step makes the bound tight at the local θ. Lets us further algorithm is a special case of a more
expand the lower bound that we have derived: general type of algorithms known
Z Z as minorization-maximization (or its
p( x, z | θ ) p(z | x, θ ) · p( x | θ ) converse, majorization-minimization)
L(q) = q(z) log dz = q(z) log dz algorithm. The EM-likelihood minorizes
q(z) q(z)
Z Z the original likelihood, meaning that
p(z | x, θ ) it is always below it, and is equal at
= q(z) log dz + log p( x | θ ) q(z) dz
q(z) | {z } the current estimate of the parameters.
Thus maximizing it leads to improve-
=1
ments on the original likelihood.
158 probabilistic machine learning

Thus, by rearranging the terms we obtain: Recall:


Z DKL (qk p) ≥ 0
p(z | x, θ )
log p( x | θ ) = L(q) − q(z) log DKL (qk p) = 0 ⇔q≡p
q(z)
= L(q) + DKL (qk p(z | x, θ ))

The KL-divergence is non-negative, and is 0 iff p = q. On the other


hand, the function L(q) is a lower bound for log p( x ), and is known
as the Expectation Lower Bound (ELBO) or Variational Free Energy
in physics. Conveniently, the EM algorithm fits in this framework,
because its steps can be written as:

(E) Set q(z) = p(z| x, θold ), so that DKL qk p(z| x, θold ) = 0.

(M) Maximize the ELBO;


Z
θnew = arg max q(z) log p( x, z | θ ) dz
θ
Z
p( x, z | θ )q(z)
= arg max q(z) log dz
θ q(z)
Z
= arg max L(q, θ ) + q(z) log q(z) dz
θ
= arg max L(q, θ )
θ

DKL (qk p(z | x, θnew ))

DKL (qk p(z | x, θ ))


log p( x | θnew )
L(q, θnew )
log p( x | θ ) L(q, θ ) = log p( x | θ )
L(q, θ )

Figure 94: Expectation-Maximization


If p( x, z | θ ) is an exponential family with θ as the natural as a maximization of the Expectation
Lower-Bound (1) Initialization; (2)
parameters (ex: Gaussian Mixture Models), then optimization may E-step – updating q; (3) M-step –
be analytic: maximizing the lower bound.

p( x, z) = exp(φ( x, z)| θ − log Z (θ ))


L(q(z), θ ) = Eq(z) (φ( x, z)| θ − log Z (θ ))
= Eq(z) [φ( x, z)]| θ − log Z (θ )
∇θ L(q(z), θ ) = 0 ⇒ ∇θ log Z (θ ) = E p(x,z) [φ( x, z)] = Eq(z) [φ( x, z)]
free energy v 159

It is also possible to use numerical optimization procedures to


optimize L. When we set q(z) = p(z | x, θold ), we set DKL to its
minimum DKL (qk p(z | x, θ ) = 0, thus

∇θ log p( x | θold ) = ∇θ L(q, θold ) + ∇θ DKL (qk p(z | x, θold ))


= ∇θ L(q, θold )

From here, we could use an optimizer based on this gradient to


numerically optimize L. This is known as generalized EM.

It is straightforward to extend EM to maximize a posterior instead


of a likelihood, by just adding a log prior for θ. For this version of
the algorithm, first initialize θ0 , and then iterate between:

1. Compute q(z) = p(z | x, θold ), thereby setting


DKL (qk p(z | x, θ )) = 0

2. Compute θnew by maximizing the Evidence Lower Bound


Z
!
p( x, z | θ ) p(θ )
θnew = arg max q(z) log dz = arg max L(q, θ ) + log p(θ )
θ q(z) θ

3. Check for convergence of either the log likelihood, or θ.

It is relatively easy to see that we maximize the (log) posterior:

log p(θ | x ) , log p( x | θ ) + log p(θ ) ≥ L(q, θ ) + log p(θ )

Variational Approximation v

In Expectation-Maximization, we were maximizing the lower bound


L(q, θ )

log p( x |θ ) = L(q, θ ) + DKL qk p(z| x, θ )
!
p( x, z|θ )
L(q, θ ) = ∑ q(z) log ,
z q(z)
!
 p(z| x, θ )
DKL qk p(z| x, θ ) = − ∑ q(z) log ,
z q(z)

by successively setting q(z) = p(z| x, θ ) during the E-step, and


optimizing θ in the M-step. Another way to look at the problem
would be to maximize L w.r.t. q, where q is a distribution over the
variables z and θ. In the following formulation, we call the union
of all the parameters z (i.e., z ← z ∪ θ), and q(z) is a probability
distribution over z

log p( x ) = L(q) + DKL qk p(z| x )
Z
!
p( x, z|θ )
L(q) = q(z) log ,
z q(z)
Z
!
 p(z| x )
DKL qk p(z| x ) = − q(z) log .
z q(z)
160 probabilistic machine learning

Then, instead of iterating between z and θ, we could just maximize


L(q(z)) wrt. q (not z!). Since log p( x ) is constant, this amounts to

implicitly minimizing DKL qk p(z| x ) . It is important to note that
this is an optimization in the space of distribution q, instead of the
space of parameters z, θ.
In general, this will be intractable, because the optimal choice
for q is p(z | x ). Instead, we will look for an approximate solution
by restricting q to some family of distributions34 Q that are easier 34
One example would be to assume
to handle. This optimization problem can be viewed as finding the that the distribution over z is Gaussian,
but sometimes we can get away with
probability distribution q? ∈ Q that most closely approximates the just imposing restrictions on the
true likelihood p( x |z) in KL-divergence (or the posterior p(z| x ) ∝ factorization of q, not its analytic form.
p( x |z) p(z) with the addition of a prior over z).

Lemma 68. Consider the probability distribution p( x, z) and an


arbitrary probability distribution q(z) such that q(z) > 0 whenever
p(z) = ∑ x p( x, z) > 0. Then the following equality holds:

log p( x ) = L(q(z)) + DKL (q(z)k p(z | x ))


Z
!
p( x, z)
where L(q) := q(z) log dz
q(z)
Z
!
p(z | x )
DKL (qk p) := − q(z) log dz.
q(z)

Variational inference is a general framework to construct approx-


imating probability distributions q(z) to non-analytic posterior
distributions p(z | x ) by minimizing the functional

q∗ = arg min DKL (q(z)k p(z | x )) = arg max L(q)


q∈Q q∈Q

Mean Field Theory


In general, maximizing L(q) w.r.t. q(z) is hard because the ex-
tremum is exactly at q(z) = p(z| x ), which we assume to be non-
analytic. However, if one assumes that q(z) factorizes
n n
q(z) = ∏ qi ( zi ) = ∏ qi
i =1 i =1

then the bound simplifies. Let’s focus on a particular variable z j :

Z
!
n
L(q) = ∏ qi ( zi ) log p( x, z) − ∑ log qi (zi ) dz
i i
 
Z Z Z
= q j (z j )  log p( x, z) ∏ qi (zi ) dzi  dz j − q j (z j ) log q j (z j ) dz j + const
i6= j
Z Z
= q j (z j ) log p̃( x, z j ) dz j − q j (z j ) log q j (z j ) dz j + const
 
where log p̃( x, z j ) = Eq,i6= j log p( x, z) + const.
free energy v 161

Using this as a building block to find a “good” but tractable


approximation, we can initialize qi (zi ) to some initial distribution,
and then iteratively compute
Z Z
L(q) = q j log p̃( x, z j ) dz j − q j (z j ) log q j (z j ) + const,
 
= − DKL q j (z)k p̃( x, z j ) + const,

which we maximize w.r.t. q j . In turn, this minimizes DKL (q(z j )k p̃( x, z j )),
thus obtaining the minimum q∗j with

log q∗j (z j ) = log p̃( x, z j ) = Eq,i6= j (log p( x, z)) + const (?)

This expression identifies a function q j instead of a parametric form.


The optimization converges, because −L(q) can be shown to be
convex w.r.t. q.
In physics, this trick is known as mean field theory, because an
n-body problem is separated into n separate problems of individual
particles who are affected by the “mean field” p̃ summarizing the
effect of all other particles.

The Kullback-Leibler Divergence


This section provides additional intuition for the KL divergence. Let
us revisit the definition once again:

Definition 69 (Kullback-Leibler divergence). Let P and Q be


probability distribution over X with PDF p( x ) and q( x ). The KL-
divergence from Q to P is defined as
Z
!
 p( x )
DKL Pk Q := log p( x ) dx
q( x )

As we have discussed before, the KL-divergence is non-negative


 
and, in general, not symmetric; DKL Pk Q 6= DKL Qk P . The
direction of the KL-divergence is important;
Z
!
 q(z)
DKL pkq = − p(z) log dz is large if q(z) ≈ 0 where p(z)  0,
p(z)
Z
!
 p(z)
DKL qk p = − q(z) log dz is large if q(z)  0 where p(z) ≈ 0.
q(z)

Say p is a distribution we want to approximate by finding a dis-


tribution q ∈ Q that is close to p in KL-divergence. Minimizing

DKL pkq is often referred to as “nonzero-enforcing”, or also

“support-covering”. On the other hand, minimizing DKL qk p is
said to be “zero-enforcing” or “mode-seeking”, which refers to the
search of the optimal solution, as shown in Fig. 95 and 96.
162 probabilistic machine learning

Figure 95: Optimal approximation


q (green) to p (red), where p is a
Gaussian and q is restricted to a
Gaussian with diagonal covariance
(i.e., the factorization assumption).

(left) shows
 the optimal solution to
DKL qk p (the “zero-enforcing” or
“mode-seeking” direction) and (right)

the optimal solution to DKL pkq
(the “nonzero-enforcing” or “support-
covering”).

Figure 96: Optimal approximation q


(red) to p (blue), where p is a mixture
of Gaussians and q is restricted to a
Gaussian.

(a) and (b)


 show the two local optima of
DKL qk p (“zero-enforcing” or “mode-
seeking”) and (c) the optimal solution
to DKL pkq (“nonzero-enforcing” or
“support-covering”).
(a) (b) (c)
free energy v 163

Condensed content

EM

• to find maximum likelihood (or MAP) estimate for a model involv-


ing a latent variable
 !
 
θ∗ = arg max log p( x | θ ) = arg max log ∑ p( x, z | θ ) 
θ θ z

• Initialize θ0 , then iterate between

E Compute p(z | x, θold ), thereby setting DKL (qk p(z | x, θ ) = 0

M Set θnew to the maximize the Expectation Lower Bound (or


equivalently, minimize the Variational Free Energy
!
p( x, z | θ )
θnew = arg max L(q, θ ) = arg max ∑ q(z) log
θ θ z q(z)

• Check for convergence of either the log likelihood, or θ.

Variational Inference

• is a general framework to construct approximating probability


distributions q(z) to non-analytic posterior distributions p(z | x )
by minimizing the functional

q∗ = arg min DKL (q(z)k p(z | x )) = arg max L(q)


q∈Q q∈Q

• the beauty is that we get to choose q, so one can nearly always


find a tractable approximation.

• If we impose the mean field approximation q(z) = ∏i q(zi ), get

log q∗j (z j ) = Eq,i6= j (log p( x, z)) + const.

• for Exponential Family p things are particularly simple: we only


need the expectation under q of the sufficient statistics.

• Variational Inference is an extremely flexible and powerful ap-


proximation method. Its downside is that constructing the bound
and update equations can be tedious. For a quick test, variational
inference is often not a good idea. But for a deployed product, it
can be the most powerful tool in the box.
Variational Inference v

Derive your variational bound in the time it takes for


your Monte Carlo sampler to converge. Variational In-
ference is a powerful mathematical tool to construct efficient ap-
proximations to intractable probability distributions (not just point
estimates, but entire distributions). In this chapter, we dive into
a concrete implementation through the lens of Gaussian Mixture
Models. At the end, we finally come back to our topic modelling
example.

Application of Variational Inference to GMM v

To do Bayesian Inference using Variational Inference on the Mix-


ture of Gaussians problem, we will first define the priors over the
parameters.

For the covariances we will use the Wishart distribution – the


conjugate prior to the Gaussian with unknown precision P ∈ Rd×d
(which is symmetric and positive definite). It is the multivariate
version of the Gamma distribution:
 
det( P)(ν−d−1)/2 exp −tr(W −1 P)/2 Γd ( x ) = π d(d−1)/4 ∏dj=1 Γ( x + (1 − j)/2),
W ( P; W, ν) = where Γ( x ) is the Gamma function.
2νd/2 det(W )ν/2 Γd (ν/2)
Leading to the posterior
 ! −1 
n   
∏N xi , µ, Σ W Σ−1 ; W, ν ∝ W Σ−1 ; W −1 + ∑( xi − µ)( xi − µ)> , ν + n
i i

For the means and covariances we therefore end up with a


Normal-inverse-Wishart – the conjugate prior to the Gaussian with
unknown mean and precision;
n   −1
 
−1
 
−1
 
−1

∏ N x i ; µ, Σ N µ, µ 0 , γ0 Σ W Σ ; W, ν ∝ N µ, µ n , γ n Σ W Σ ; Wn , νn .
i

For the cluster probabilities π we will take a Dirichlet prior35 35


wikipedia.org/wiki/Dirichlet_distribution

– a prior over categorical probability distributions,


Γ(∑kK=1 αk ) α −1
p(π ) = D (π; α) =
∏k Γ(αk )
∏ πk k
k
166 probabilistic machine learning

α
π

The final distribution is then given by the following equations


and represented graphically in Fig. 97,
m, β W, ν
p( x, z, π, µ, Σ) = p( x |z, µ, Σ) p(π ) p(µ|Σ) p(Σ)
zn
  
p(µ|Σ) p(Σ) = ∏ N µk ; m, Σ/β W Σ−1 ; W, ν
k µ Σ
p(π ) = D (π; α) .
xn
We know that the full posterior p(z, π, µ, Σ| x ) is intractable (check
the graph), but we can consider an approximation with the factor-
N
ization
Figure 97: Graphical representation
q(z, π, µ, Σ) = q(z)q(π, µ, Σ). of the Gaussian Mixture Model with
priors

Computing the updates


Now, we take a look at how one can update those distributions
• Given q(z), what is the optimal q? (π, µ, Σ)?

• Given q(π, µ, Σ), what is the optimal q? (z)?

The update for q ( z ) is then given by

 
log q ? ( z ) = E q ( π,µ,Σ ) log p ( x, z, π, µ, Σ ) + const ,
   
= E q ( π ) log p ( z | π ) + E q ( µ,Σ ) log p ( x | z, µ, Σ ) + const,
 h i
  1 −1 > −1
= ∑ ∑ z nk E q ( π ) log π k + E q ( µ,Σ ) log det ( Σ ) − ( x n − µ k ) Σ k ( x − µ k ) + const,
n k 2
| {z }
: = log ρ nk

which requires the computation of

z ρnk z
q? (z) ∝ ∏ ∏ ρnknk , or, writing rnk =
∑ j ρnj
, q? (z) = ∏ ∏ rnknk , with rnk = Eq(z) [z] .
n k n k

Note that q? (z) factorizes over n, even though we did not impose
this restriction; we only imposed a factorization between z and
π, µ, Σ, which leads to conditional independence.
Computing those expectation for log ρnk can be a bit difficult to
do manually, but can be done given a table of values for

ψ( x ) = log Γ( x ).
∂x
where ψ( x ) is Digamma function36 . We need to compute 36
wikipedia.org/wiki/Digamma_function

 
ED(π;αk ) log πk = ψ(αk ) − ψ(∑ αk )
k
   D  
1 νk + 1 − d
E
W Σ− 1

k ;Wk ,νk
 log det Σ−
k = ∑ ψ
2
+ D log 2 + log det(Wk ),
d =1
h i
E 
1
 ( xn − µk )> Σ−1 ( xn − µk ) = D/β k + νk ( xn − mk )> Wk ( xn − mk ).
N (µk ;mk ,Σk /β k )W Σ−
k ;Wk ,νk
variational inference v 167

To compute the update for q ( π, µ, Σ ) , let us first define a


convenient notation;
1 1
Nk : = ∑ r nk , x̄ k =
Nk ∑ r nk x n , Sk =
Nk ∑ ( x n − x̄ k )( x n − x̄ k ) > .
n n n

We can then write the optimal q as

 
log q ? ( π, µ, Σ ) = E q ( z ) log p ( x, z, π, µ, Σ ) + const,
" #
= E q ( z ) log p ( π ) + ∑ log p ( µ k , Σ k ) + log p ( z | π ) + ∑ log p ( x n | z, µ, Σ )
k n
  
= log p ( π ) + ∑ log p ( µ k , Σ k ) + E q ( z ) log p ( z | π ) + ∑ ∑ E q ( z ) [ z nk ] log N x n ; µ k , Σ k + const.
k n k

This bound exposes another induced factorization, as π is now


independent from µ, Σ, and each ( µ k , Σ k ) are independent of each
other,
q ( π, µ, Σ ) = q ( π ) ∏ q ( µ k , Σ k ) .
k
We can compute the optimal distributions independently.

For q ( π ) , this leads to


 
log q ? ( π ) = log p ( π ) + E q ( z ) log p ( z | π ) + const,
= ( α − 1 ) ∑ log π k + ∑ ∑ r nk log π k + const,
k k n
?
q ( π ) = D ( π; α k : = α + Nk ) .

For q ( µ k , Σ k ) , (skipping the details of the derivation of the update


for Gaussians with conjugate priors) this leads to
  
1
q ? ( µ k , Σ k ) = N µ k , m k , Σ k /β k W Σ −
k ; W k k .
, ν

written with the shortcuts


1
β k = β + Nk , mk = ( βm + Nk x̄ k ) ,
βk
β Nk
νk = ν + Nk , Wk− 1 = W − 1 + Nk S k + ( x̄ − m )( x̄ k − m ) > .
β + Nk k

Comparison with the “standard” EM update


The update equation for z yields
!
  1/2 D ν
E q [ z nk ] = r nk ∝ π̃ k det Σ̃ −1
exp − − k ( x n − m k ) > Wk ( x n − m k ) ,
2β k 2

    
 
with log π̃ k = E D( π;α k ) log π k , log det Σ̃ − 1 =E 
1
 log det Σ −
k
1
.
k W Σ−
k ;Wk ,νk

This is very similar to the EM update,


  1/2  
1
r nk ∝ π k det Σ − 1 exp − ( x n − µ k ) > Σ −
k
1
( x n − µ k ) .
2
168 probabilistic machine learning

Variational inference is the Bayesian version of EM;


instead of maximizing the likelihood for θ = ( µ, Σ, π ) , we have
priors that maximize a variational bound. One advantage of this
approach is that the posterior can “decide” to ignore components,
because the Dirichlet prior can be chosen to favor sparse π. For
maximum likelihood, it is always favorable to maximize the num-
ber of components, as that allows putting a lot of mass on a small
number of (or even a single) datapoints. As an example, Fig. 99
shows the state of the approximation on the Old Faithful dataset
during the optimization process.

Figure 98: Variational Inference on a


Gaussian Mixture Model on the Old
Faithful dataset after 0, 15, 60 and 120
iterations. With a sparsity inducing
Dirichlet prior on π, the posterior
selects fewer clusters.

More generally, Variational Inference is a framework to con-


struct approximating probability distribution q(z) to a posterior
distribution p(z| x ) that lack an analytic solution by minimizing the
functional

q? = arg min DKL q(z)k p(z| x ) = arg max L(q).
q∈Q q∈Q

As we get to choose q, we can always find a tractable approxima-


tion, although the quality of the approximation will suffer if we put
in too many restriction on the family Q. If we impose the mean field
approximation q(z) = ∏i qi (zi ), we get that
 
log q?j (z j ) = Eq,i6= j log p( x, z) + const.

For exponential family p, things are particularly simple as we only


need the expectation of q under the sufficient statistics.

Variational Inference is a flexible and powerful approximation


method. Its downside is that constructing the bound and update
equations can be tedious, as we’ve seen here. For a quick test,
variational inference is often not a good idea. But for a deployed
product, it can be one of the most powerful tools in the toolbox.

Figure 99: The graphical model for


Latent Dirichlet Allocation
αd βk
πd cdi wdi θk
i = [1, . . . , Id ] k = [1, . . . , K ]
d = [1, . . . , D ]
variational inference v 169

Variational Inference for topic modeling v

Let us return to our topic modeling example. Recall that the pos-
terior p(Π, Θ, C | W ) is intractable. Luckily, Variational Inference
provides a method to construct efficient approximations for in-
tractable distributions. So, we desire to find an approximation q
that factorizes:
q(Π, Θ, C ) = q(C ) · q(Π, Θ)

The best approximation will minimize the Kullback-Leibler di-


vergence DKL (qk p(Π, Θ, C | W )). We know that this is the same as
maximizing the ELBO
Z
!
p(C, Π, Θ, W )
L(q) = q(C, Θ, Π) log dC dΘ dΠ
q(C, Θ, Π)

To maximize the ELBO of a factorized approximation, recall that


the mean field is computed by:

log q∗ (zi ) = Ez j ,j6=i (log p( x, z)) + const.

We can compute the optimal factors of q similarly as before.


Starting with q? (C )
 
K  
log q∗ (C ) = Eq(Π,Θ)  ∑ cdik log(πdk θkwdi ) + const =∑ ∑ dik q(Π,Θ)
c E ( log π θ
dk dwdi ) +const
d,i,k d,i k =1 | {z }
=:log γdik

Thus, we obtain:

q(C ) = ∏ q(cdi )
d,i
c
with q(cdi ) = ∏ γ̃dikdik where γ̃dik = γdik / ∑ γdik
k k

Note that the last formulation implies: Eq (cdik ) = γ̃dik .

On the other hand, for q? (Π, Θ) we have:


 

log q∗ (Π, Θ) = E∏d,i q(cdi: )) ∑(αdk − 1 + ndk· ) log πdk + ∑( β kv − 1 + n·kv ) log θkv  + const
d,k k,v
D K K V
= ∑ ∑ (αdk − 1 + Eq(C) (ndk· )) log πdk + ∑ ∑ ( βkv − 1 + Eq(C) (n·kv )) log θkv + const
d =1 k =1 k =1 v =1
 
D  K D Id
q∗ (Π, Θ) = ∏D πd ; α̃d: := [αd: + γ̃d·: ] · ∏ D θk ; β̃ kv := [ β kv + ∑ ∑ γ̃di: I(wdi = v)]v=1,...,V  .
d =1 k =1 d i =1

Note that once again we have obtained an induced factorization for


Π and Θ, which we did not encode explicitly.
170 probabilistic machine learning

Finally, to close the loop, we have to take care of computing γdik


by utilizing the Digamma function. The final result is as follows:
   
d I
 
q(πd ) = D πd ; α̃dk := αdk + ∑ γ̃dik   ∀d = 1, . . . , D
i =1
k =1,...,K
   
D Id
 
q(θk ) = D θk ; β̃ kv :=  β kv + ∑ ∑ γ̃dik I(wdi = v)  ∀k = 1, . . . , K
d i =1
v=1,...,V
cdik
q(cdi ) = ∏ γ̃dik , ∀d i = 1, . . . , Id
k

where γ̃dik = γdik / ∑k γdik and (note that ∑k α̃dk = const)


 
γdik = exp Eq(πdk ) (log πdk ) + Eq(θdi ) (log θkwdi )
 !
= exp z(α̃ jk ) + z( β̃ kwdi ) − z ∑ β̃kv 
v

Last but not least, we could explicitly compute the ELBO. No-
tice from above, that in practice calculating the ELBO isn’t strictly
necessary. However, it could be a useful tool for monitoring the
progress and debugging the algorithm. To compute the ELBO we
need:

L(q, W ) = Eq (log p(W, C, Θ, Π)) + H(q)


Z Z
= q(C, Θ, Π) log p(W, C, Θ, Π) dC dΘ dΠ − q(C, Θ, Π) log q(C, Θ, Π) dC dΘ dΠ
Z
= q(C, Θ, Π) log p(W, C, Θ, Π) dC dΘ dΠ + ∑ H(D(θk β̃ k )) + ∑ H(D(πd α̃d )) + ∑ H(γ̃di )
k d di

The entropies can be computed from the tabulated values. For the
expectation, we use Eq(C) (ndkv ) = ∑i γdik I(wdi = v) and use
ED(πd ;α̃) (log πd ) = z(α˜d ) − z(α̃ˆ ) from above.

Variational Inference is a powerful mathematical tool to construct


efficient approximations to intractable probability distributions (not
just point estimates, but entire distributions). Often, just imposing
factorization is enough to make things tractable. The downside of
Variational Inference is that constructing the bound can take signif-
icant ELBOw grease. However, the resulting algorithms are often
highly more efficient compared to tools that require less derivation
work, like Monte Carlo.
variational inference v 171

Lastly, following is pseudocode for LDA using Variational Infer-


ence:

1 procedure LDA(W, α, β)
2 γ̃dik ^ Dirichlet_rand (α) initialize
3 L ^ −∞
4 while L not converged do
5 for d = 1, . . . , D; k = 1, . . . , K do
6 α̃dk ^ αdk + ∑i γ̃dik update document-topics distributions
7 end for
8 for k = 1, . . . , K; v = 1, . . . , V do
9 β̃ kv ^ β kv + ∑d,i γ̃dik I(wdi = v) update topic-word distributions
10 end for
11 for d = 1, . . . , D; k = 1, . . . , K; i = 1, . . . , Id do
12 γ̃dik ^ exp(z(α̃dk ) + z( β̃ kwdi ) − z(∑v β̃ kv )) update word-topic assignments
13 γ̃dik ^ γ̃dik /γ̃di·
14 end for
15 L ^ Bound (γ̃, w, α̃, β̃) update bound
16 end while
17 end procedure
Customizing models and algorithms v

Building a tailormade solution requires creativity and


mathematical stamina. In the previous chapter we saw that
Variational Inference is a powerful mathematical tool to construct
efficient approximations to intractable probability distributions. Often,
just imposing factorization is enough to make things tractable. In
the example of topic modeling, the only factorization we imposed
was between C, and Π, Θ, i.e. q(C, Π, Θ) = q(C ) · q(Π, Θ). However,
when deriving the variational bound, we obtained even further
(induced) factorization between the terms πd , θk , cdi :
   
Id
 
q(πd ) = D πd ; α̃dk := αdk + ∑ γ̃dik   ∀d = 1, . . . , D
i =1
k =1,...,K
   
D Id
 
q(θk ) = D θk ; β̃ kv :=  β kv + ∑ ∑ γ̃dik I(wdi = v)  ∀k = 1, . . . , K
d i =1
v=1,...,V
cdik
q(cdi ) = ∏ γ̃dik , ∀d i = 1, . . . , Id
k

Was the outcome of obtaining induced factorization and closed form


approximation accidental? To answer this question, consider an
exponential family joint distribution:

N 
p( x, z | η ) = ∏ exp η | φ( xn , zn ) − log Z (η )
n =1

with conjugate prior


p(η | ν, v) = exp η | v − ν log Z (η ) − log F (ν, v)

Furthermore, assume q(z, η ) = q(z) · q(η ). Then q is in the same


exponential family:

N
log q∗ (z) = Eq(η ) (log p( x, z, η )) + const = Eq(η ) (log p( x, z | η )) + const = ∑ Eq ( η ) ( η )| φ ( x n , z n )
n =1
|


q (z) = ∏ exp E(η ) φ( xn , zn ) − log Z (E(η ))
n =1
174 probabilistic machine learning

Once again, note that we obtain induced factorization in the same


exponential family. We obtain similar results for q(η ) as well:

log q∗ (η ) = Eq(z) (log p( x, z, η )) + const = log p(η | ν, v) + Eq(z) (log p( x, z | η )) + const


N
= η | v − ν log Z (η ) + ∑ − log Z(η ) + η | Eq(z) (φ(xn , zn )) + const
n =1
 ! 
N
q∗ (η ) = exp η | v+ ∑ Eq(z) (φ(xn , zn )) − (ν + N ) log Z (η ) − const
n =1

We come to the conclusion that if one considers variational approx-


imations, using conjugate exponential family priors can make life
much easier.

Collapsed Variational Bound v

In the previous chapters we saw that Monte Carlo methods tend to


converge relatively slowly, since in theory, they are only correct in
the infinite limit. On the other hand, variational approximation is
an optimization method that constructs a probabilistic approxima-
tion in finite time. However, recall that Collapsed Gibbs Sampling in-
troduced significant speedup for the Monte Carlo approach: instead
of iterating back and forth between sampling from the conditionals
for C and Π, Θ, it marginalizes over the latent quantities Θ and Π
in order to obtain a marginal distribution just for C:
\di \di \di
(αdk + ndk· )( β kwdi + n·kw )(∑v β kv + n·kv )−1
p(cdik = 1 | C \di , W ) = \di
di
\di \di
∑k0 (αdk0 + ndk0 · ) · ∑w0 ( β kw0 + n·kw0 ) · ∑v0 ( β kv0 + n·kv0 )−1

In this section we explore a similar collapsed inference technique


for our variational approximation procedure. Deriving our varia-
tional bound, we previously imposed the following factorization:

q(Π, Θ, C ) = q(Π, Θ) · ∏ q(cdi )


di

Can we impose a weaker factorization, such as:

q(Π, Θ, C ) = q(Θ, Π | C ) · ∏ q(cdi )


di

and get away with less? First, note that p(C, Θ, Π | W ) = p(Θ, Π |
C, W ) p(C | W ). Now, we minimize

Z
!
q(Π, Θ|C )q(C )
DKL (q(Π, Θ, C )k p(Π, Θ, C | W )) = q(Π, Θ | C )q(C ) log dC dΠ dΘ
p(Π, Θ | C, W ) p(C | W )
 ! !
Z
q ( Π, Θ | C ) q ( C )
= q(Π, Θ | C )q(C ) log + log  dC dΠ dΘ
p(Π, Θ | C, W ) p(C | W )

= DKL (q(Π, Θ | C )k p(Π, Θ | C, W )) + DKL (q(C )k p(C | W ))

Notice that we obtain two KL Divergence terms.


customizing models and algorithms v 175

Notice that we can immediately optimize the first term by explic-


itly setting q(Θ, Π) = p(Θ, Π | C, W ) (recall: we have previously
derived a closed form expression for p(Θ, Π | C, W )), thus making
the bound tight in Π, Θ. Continuing the analogy to the Collapsed
Gibbs Sampling method, we therefore collapse the parameters Π and
Θ. From there, all we need to do is optimize a variational bound
based on the KL-divergence between q(C ) and p(C | W ).

The remaining collapsed variational bound (ELBO) becomes


Z
L(q) = q(C ) log p(C, W ) dC + H(q(C ))

Since we make strictly less assumptions about q than before, we


will get a strictly better approximation to the true posterior. The
above bound is maximized for cdi if

log q(cdi ) = Eq(C\di ) (log p(C, W )) + const

where:
! !
Γ(∑k αdk ) Γ(αdk +ndk· ) Γ(∑v β kv ) Γ( β kv +n·kv )
p(C, W ) = ∏ ∏
Γ(∑k αdk + ndk· ) k Γ(αdk ) ∏ ∏
Γ(∑v β kv + n·kv ) v Γ( β kv )
d k

Before we continue, let us make a couple of observations. Note that


cdi ∈ {0; 1}K and ∑k cdik = 1. So, it holds that q(cdi ) = ∏k γdik with
n −1
∑k γdik = 1. Furthermore, it holds that Γ(α + n) = ∏`= 0 ( α + `), thus
n −1
log Γ(α + n) = ∑`=0 log(α + `). Now: All terms in p(C, W ) that don’t involve
cdik , as well as all sums over k, can be
moved into the constant. Furthermore,
log q(cdi ) = Eq(C\di ) (log p(C, W )) + const we can also add terms to const., such
n\di −1
as ∑`= 0 log(α + `), thus effectively
log γdik = log q(cdik = 1) cancelling terms in log Γ.
 !
= Eq(C\di ) log Γ(αdk + ndk· ) + log Γ( β kwdi + n·kwdi ) − log Γ ∑ β kv + n·kv  + const
v
 !
\di \di \di
= Eq(C\di ) log(αdk + ndk· ) + log( β kwdi + n·kw ) − log
di
∑ β kv + n·kv  + const
v

Therefore:
  
!
 \di \di \di 
γdik ∝ exp Eq(C\di ) log(αdk + ndk· ) + log( β kwdi + n·kw ) − log ∑ β kv + n·kv 
di
v

Under our assumption for the factorization q(C ) = ∏di cdi , the
counts ndk· are sums of independent Bernoulli variables (i.e. they
have a multinomial distribution). Computing their expected log-
arithm is tricky and of complexity O(n2d·· ) i.e. quadratic in the
counts that we are dealing with. That is likely why the original
paper didn’t do this. In fact, it took three years for a solution to be
provided by Yee Whye Teh and Max Welling.
176 probabilistic machine learning

Recall that the probability measure of R = ∑iN xi with discrete xi


of probability f is

N!
P( R = r | f , N ) = · f r · (1 − f ) N −r
( N − r )! · r!
!
N
= · f r · (1 − f ) N −r
r
≈ N (r; Nr, Nr (1 − r ))

Due to the Central Limit Theorem37 , a Gaussian approximation 37


wikipedia.org/wiki/Central_limit_theorem
should be good:
\di \di \di \di \di \di
p(ndk· ) ≈ N (ndk· ; Eq [ndk· ], varq [ndk· ]) with Eq [ndk· ] = ∑ γdkj , varq [ndk· ] = ∑ γdkj (1 − γdkj )
j 6 =i j 6 =i

From there, we can construct a Taylor approximation:

1 1 1
log(α + n) ≈ log(α + E(n)) + (n − E(n)) · − (n − E(n))2 ·
α + E( n ) 2 (α + E(n))2

and (approximately) obtain the quantity of interest:


\di
\di \di varq [ndk· ]
Eq [log(αdk + ndk· )] ≈ log(αdk + Eq [ndk· ]) − \di
2(αdk + Eq [ndk· ])2

Putting everything together:


! −1
\di \di \di
γdik ∝ (αdk + E[ndk· ])( β kwdi + E[n·kw ])
di
∑ β kv + E[n·kv ]
v
 
\di \di \di
 varq [ndk· ] varq [n·kw ] varq [n·k· ] 
di
· exp − \di
− \di
+ \di 
2(αdk + Eq [ndk· ])2 2( β kwdi + Eq [n·kw ])2 2(∑v β kv + Eq [n·kv ])2
di

Essentially, this is the Collapsed Variational Inference algorithm.


At each iteration of the loop, we don’t do anything about Π and Θ
– we just update the topic assignments γdik . Note that γdik doesn’t
depend on i ∈ 1, . . . , Id , as it’s the same for all wdi in d with wdi =
v! This means that we can get away with complexity O( DKV ),
instead of O( DKId ), which is highly suitable for long documents.
Following is a detailed overview of the algorithm complexity:

• memory requirement: O( DKV ), since we have to store γdik for each


value of i ∈ 1, . . . , V and

– E[ndk· ], var[ndk· ] ∈ RD×K


– E[n·kv ], var[n·kv ] ∈ RK ×V
– E[n·k· ], var[n·k· ] ∈ RK

• computational complexity: O( DKV ) We can loop over V rather


than Id (good for long documents!) Often, a document will be
sparse in V, so iteration cost can be much lower.
customizing models and algorithms v 177

1.0 Figure 100: Topic distribution of the


documents over the years
0.8

0.6
πd

0.4

0.2

0.0
1,800

1,850

1,900

1,950

2,000
year

Back to our running example v


In the past few weeks we have been continuously upgrading our
toolbox in order to provide a valid solution for the topic modeling
problem of the State of the Union addresses. One possible outcome
using the tools that we have developed in the previous chapters
is plotted above. While it is a relatively good first result, it doesn’t
quite fulfill our expectations – one would expect a rather smooth
latent structure throughout history. When issues like this occur, one
revisits the model, and ponders where further improvements can
be made. We come to the conclusion that we would like to inject a
priori knowledge about the smoothness of our model. In particular,
the changes will revolve around the document-topic distribution Π
and its prior α, which so far we’ve set to a constant.

One idea is to tune the hyperparameter α by maximizing the log-


posterior of the parameters with EM. First, let us again write down
the log-likelihood:

log p(W | α, β) = L(q, α, β) + DKL (qk p(C | W, α, β))


Z
!
p(W, Π, Θ, C | α, β)
where L(q, α, β) = q(C, Θ, Π) log
q(C, Θ, Π)

We have also shown that the log-posterior of the parameters is


bounded from below as follows:

log p(α, β | W ) ≥ L(q, α, β) + log p(α, β)

Given that we have previously derived an explicit expression for


the ELBO L, all we have to reason about is how to construct the
(log) prior p(α, β). Recall that each document is identified by the
president and the year the document was written (for example, Lin-
coln_1862.txt). This is a very powerful structure that we could make
use of, by incorporating it as a metadata in our model, thus allow-
ing each document to be placed in a latent space described by these
key characteristics. We do this by extending our probabilistic model
that we have seen many times over (see Fig. 101). Before we con-
tinue with our model development, it is crucial to note that while
178 probabilistic machine learning

Figure 101: A model incorporating


document metadata
latent function transf. doc topic dist. doc topic dist. word topic word topic word dist.
βk
f αd πd cdi wdi θk
topic prior
i = [1, . . . , Id ] k = [1, . . . , K ]

h φd
kernel
metadata
d = [1, . . . , D ]

toolboxes are extremely valuable for quick early development, their


interface often enforces and restricts model design decisions. In or-
der to really solve a probabilistic modelling task, one should build
customized craftware. For example, it is extremely hard to make the
above-mentioned model extension when one uses the pre-packaged
Scikit implementation of LDA38 . 38
github.com/scikit-learn/lda

Now, let us return to our problem at hand and see how exactly we
can incorporate the meta-data. One can pose the problem of con-
structing a smooth latent structure as a regression problem, which
motivates the use of Gaussian Processes. We will use GP regression
for the latent function f , which together with the metadata Φ in-
forms the prior α for the document topic distribution. The updated
model to generate the words W of documents d = 1, . . . , D with
features φd ∈ F is as follows:

• draw function f : F → RK from p( f | h) = GP ( f ; 0, h)

• draw document topic distribution πd from D(αd = exp( f (φd )))

• draw topic-word distributions p(Θ | β) = ∏kK=1 D(θk , β k )


I c
• draw each word’s topic p(Cd:: | Π) = ∏dD=1 ∏i=
d dik
1 ∏k πdk
c
• draw the word wdi with probability θkw
dik
.
di

Note that in order to enforce non-negativity for the parameters α,


the GP regression models log α, which is then transformed back
through the exp link function. The natural prior that arises from the
Gaussian Process is:
1 1
log p( f = log α) = − k f d k2k = − f d| k− 1
DD f d
2 2
One possible kernel function that encodes the desired properties is
the product of the following two functions:
!−α 
( x a − xb ) 2 1.00 if president( x ) = president ( x )
a b
k ( x a , xb ) = θ 2 1 + ·
2α`2 γ otherwise

The first part is the well known rational-quadratic kernel which en-
codes the smoothness properties. The second part is an indication
of change in presidency – notice that we allow for slight shift in
customizing models and algorithms v 179

topics given that there was a change in presidency. A good set of


parameters for the kernel are:

θ=5 ` = 10years α = 0.5 γ = 0.9

Figure 102: Prior document-topic


distribution arising from the Gaussian
1 Process

0.8

0.6
p(topic)

0.4

0.2

0
1,800 1,820 1,840 1,860 1,880 1,900 1,920 1,940 1,960 1,980 2,000
Year

Figure 103: Resulting Kernel Topic


Model
1
war, spain
war, national, economic labor energy, cut, oil

0.8 war, constitution, union men,

law, war,

0.6
hπk | φi

America,
good,
American, people

0.4 work

world, peace, free

0.2 public, commerce

made, business

0
1800 1820 1840 1860 1880 1900 1920 1940 1960 1980 2000
year
Making decisions v

Making a decision boils down to conditioning on a vari-


able you control. In the previous chapters we explored various
modeling and computational techniques to learn from data. In this
chapter we discuss what one can do once the inference/learning
process is finished.

Probabilistic models can provide predictions p( x | a) for a vari-


able x conditioned on an action a. An important question is: given
the choice, which value of a do you prefer? For this purpose, one
could assign a loss or utility `( x ), and then choose a such that it
minimizes the expected loss:
Z
a∗ = arg min `( x ) p( x | a) dx
a

In this lecture, we only consider a setting where an action at one


timestep does not influence the state of the system in the next step,
or in other words, we consider independent draws xi with xi ∼ p( x |
ai ). Then, one can choose all ai = a∗ to minimize the accumulated
loss: " #
L(n) = E p ∑ xi
i

But for this setting, we have assumed that we know p. Let’s illus-
trate an example for the (more common) case when we don’t know
p.

1 Figure 104: A simulated example


of Bernoulli experiments with three
0.8 different treatments

0.6
payout

0.4

0.2

0
100 101 102 103
N

Suppose we are performing Bernoulli trials for three different


“treatments”, whose expected payout is represented as horizontal
182 probabilistic machine learning

lines in the plot above. Furthermore, suppose we record the run-


ning empirical average as we increase the number of trials N. When
comparing the empirical averages for a lower number of trials (say
10), we notice that the rankings of the efficacy of each treatment is
not on par with the ranking of the expectations. Only after a sig-
nificant number of trials (say > 1000) do these estimators properly
start separating. Perhaps we shouldn’t rule out an option yet if
the posteriors over their expected return overlaps with that of our
current guess for the best option.

To formalize this, assume we have K choices. Taking choice k ∈


[1, . . . , K ] at time i yields binary (Bernoulli) reward/loss xi i.i.d.
with probability πk ∈ [0, 1]. Since we don’t know the probabilities
πk , we can infer them in a Bayesian manner, i.e. given conjugate
priors
p(πk ) = B(π, a, b) = B( a, b)−1 π a−1 (1 − π )b−1
the posteriors from nk tries of choice k with mk successes is

p(πk | nk , mk ) = B(πk ; a + mk , b + (nk − mk ))

For a, b → 0, the posterior has the following mean and variance

mk mk (nk − mk )
π̄k := E p(πk |nk ,mk ) [π ] = σk2 := var p(πk |nk ,mk ) [π ] = = O(n− 1
k )
nk n2k (nk + 1)

In fact, the smooth curved lines in the plot above are exactly the
standard deviations at each time point i, which we know is the
expected distance of the estimated quantity to the true value.

The probabilistic interpretation gives us a hint how one could


choose the treatment. We would like a policy that resolves the two
problems:

• If the true probabilities of positive outcome for two treatments


are very different from each other, then we should make a com-
mitment to focus on the better of the two alternatives.

• On the other hand, if the probabilities are extremely similar, then


we should not commit to either of them, and instead keep exper-
imenting with both alternatives. Appropriately, this exploration
phase should decrease as we increase the number of trials.

One
q idea is, at time i, to choose the option k that maximizes π̄k +
c σk2 . Now, one could pose the question: which is the best value
for c? A large c ensures uncertain options are preferred, thus lead-
ing to exploration. On the other hand, a small c ignores uncertainty,
thus leading to exploitation.
One possibility is to let c grow at rate less than O(n1/2
k ). Then,
the variance of the chosen options will drop faster than c grows, so
their exploration will stop unless their mean is good. However, as c
grows, unexplored choices will eventually become dominant, thus
always explored eventually.
making decisions v 183

Multi-Armed Bandit Algorithms

Now, we move to a more generalized case where we could reason


beyond Bernoulli variables. Consider the following theorem, which
provides us with general bounds on the deviation of the empirical
estimate for the mean from the actual mean of the distribution.

Theorem 70 (Chernoff-Hoeffding). Let X1 , . . . , Xn be random vari-


ables with common range [0, 1] and such that E[ Xt | X1 , . . . , Xt−1 ] =
µ. Let Sn = X1 + · · · + Xn . Then for all a ≥ 0,
2 /n 2 /n
p(Sn − nµ ≤ − a) ≤ e−2a and p(Sn − nµ ≥ a) ≤ e−2a

The fact that these general bounds exist is suggestive of a general


class of algorithms which are relatively independent of the specific
prior over the output of the individual choices. These algorithms
are widely known under the name Multi-Armed Bandit algorithms.

A K-armed bandit is a collection Xkn of random variables, 1 ≤


k ≤ K, n ≥ 1 where k is the arm of the bandit. Successive plays of k
yield rewards Xk1 , Xk2 , . . . which are independent and identically
distributed according to an unknown p with E p ( Xki ) = µi .

A policy A chooses the next machine to play at time n, based on


past plays and rewards.

Let Tk (n) be number of times machine k was played by the pol-


icy during the first n plays. Then, the regret of the policy A is

R A (n) = µ∗ · n − ∑ µ j · E p [ Tj (n)] with µ∗ := max µk


j 1≤ k ≤ K

Let x̄ j be the empirical average of rewards from j and n j be the


number of plays at j in n plays. Then the pseudocode implementa-
tion of the Upper Confidence Bound procedure is as follows:
1 procedure UCB(K) Upper Confidence Bound
2 play each machine once
3 while true do r !
2 log n
4 play j = arg max x̄ j + nj
5 end while
6 end procedure
Interestingly, it is possible to upper bound the expected regret of
UCB:

Theorem 71 (Auer, Cesa-Bianchi, Fischer). Consider K machines


(K > 1) having arbitrary reward distributions P1 , . . . , PK with
support in [0, 1] and expected values µi = EP ( Xi ). Let ∆i := µ∗ − µi .
Then, the expected regret of UCB after any number n of plays is at
most The sums are over K, not n. So the
  regret is O(K log n). UCB plays a sub-
  ! 
optimal arm at most logarithmically
log n π 2
EP [ R A (n)] ≤ 8 ∑ + 1+ ∑ ∆ j  often.
i:µ ≤µ∗
∆i 3 j
i
184 probabilistic machine learning

regret bound
2,000 103 expected regret
sampled regret
p = 50%

regret
∑t nit

p = 55% 101
1,000 p = 45%

10−1

0
500 1,000 1,500 2,000 2,500 3,000 100 101 102 103 104
N N

In the figure above we plot the behavior of UCB on our simu-


lated example from before. On the left you can see the payoffs of
each treatment as the number of trials grows. On the top of the left
plot you can see which treatment was chosen by the policy at each
particular time. Notice that in the initial phase the algorithm ex-
plores among all three treatments, and only slowly stabilizes after
> 2000 trials. On the right we plot the regret of the treatment with
p = 50%.

In conclusion, Multi-Armed Bandit algorithms apply to inde-


pendent, discrete choice problems with stochastic pay-off. Algo-
rithms based on upper confidence bounds incur regret bounded by
O(log n). Unfortunately, no problem is ever discrete, finite and in-
dependent. That being said, in a continuous problem, no “arm” can
and should ever be played twice. Furthermore, one should make
use of the fact that in a typical prototyping setting early exploration
is free.

2 Figure 105: Example application of


parameter optimization, with tunable
parameter x, and outcome/loss f

0
f

−2

−5 −4 −3 −2 −1 0 1 2 3 4 5
x

Continuous-Armed Bandits

In this section we will make use of a running example for parame-


ter optimization (see Fig. 105). Suppose the observations of the loss
are distributed according to:

p(y | x ) = N (y; f x , σ2 )
making decisions v 185

Given the observations of loss y for different parameters x, we


would like to determine which parameter x? yields the minimum
loss:
x∗ = arg min f ( x ) = ?
x ∈D

We define the regret as follows:

T
R( T ) := ∑ f ( xt ) − f ( x∗ )
t =1

Without an informative prior, we would have to test every single


hypothesis in order to find the minimum – which is infeasible in
the continuous domain! For this reason, we make use of a Gaussian
Process prior for where the minimum might be:

p( f ) = GP ( f ; µ, k )

Depending on the community, this problem is known under the


names of Continuous-Armed Bandit and Bayesian Optimization.

2 Figure 106: The GP posterior arising


from the prior and the three obser-
vations. In red, we see the empirical
distribution of the minimum for sam-
0 ples of the GP
f

−2

−5 −4 −3 −2 −1 0 1 2 3 4 5
x
2
For this reason, we build GP regression on the observations (see
Fig. 106). One pedestrian way to find where it is most likely that 0
the minimum lies is to iteratively draw samples from the posterior
GP and record where the minimum of each sample is. Based on this −2
f

information, one could construct an empirical distribution, which


can be seen colored in red in the plot above. −4

−6
GP Upper Confidence Bound −4 −2 0 2 4
x
A more structured and theoretically motivated algorithm is GP
Upper Confidence Bound (GP-UCB)39 . Under the posterior p( f | Figure 107: GP UCB. Top: GP poste-
rior. Bottom: Utility u( x )
y) = GP ( f ; µt−1 , σt2−1 ), we define the pointwise utility as: 39
Srinivas, Krause, Kakade, Seeger,
p ICML 2009
u i ( x ) = µ i −1 ( x ) − β t σt−1 ( x )

Then, in each iteration we choose xt as xt = arg minx∈D u( x ) (see


Fig. 107). Interestingly, the algorithm provides a theoretical guaran-
tee, which is summarized below.

Theorem 72 (Srinivas et al., 2009). Assume that f ∈ H_k with ‖f‖²_k ≤
B, and the noise is zero-mean and σ-bounded almost surely. Let
δ ∈ (0, 1) and β_t = 2B + 300 γ_t log³(t/δ). Running GP-UCB with β_t
and p( f ) = GP( f ; 0, k),

    p( R_T ≤ √( 8 T β_T γ_T / log(1 + σ²) )  ∀ T ≥ 1 ) ≥ 1 − δ,

thus lim_{T→∞} R_T / T = 0 (“no regret”).
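A minimal sketch of the resulting GP-UCB loop on a grid is given below; the kernel, the β_t schedule, and the toy objective f are illustrative assumptions and not the constants appearing in the theorem.

```python
import numpy as np

def k(a, b, ell=1.0):
    """Squared-exponential kernel on 1-d inputs (illustrative choice)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def f(x):                                   # unknown objective (illustrative)
    return np.sin(3 * x) + 0.5 * x

sigma, xs = 0.1, np.linspace(-5, 5, 300)    # noise level, grid over the domain D
X, y = [], []

for t in range(1, 21):
    if X:
        Xa, ya = np.array(X), np.array(y)
        G = k(Xa, Xa) + sigma ** 2 * np.eye(len(Xa))
        A = np.linalg.solve(G, k(Xa, xs))                       # G^{-1} k(Xa, xs)
        mu = A.T @ ya                                           # posterior mean
        var = np.clip(1.0 - np.sum(k(xs, Xa) * A.T, axis=1),    # posterior variance
                      1e-12, None)
    else:
        mu, var = np.zeros_like(xs), np.ones_like(xs)           # prior before data
    beta_t = 2.0 * np.log(t ** 2 * np.pi ** 2 / 0.6)            # simple increasing schedule (assumption)
    u = mu - np.sqrt(beta_t) * np.sqrt(var)                     # lower confidence bound
    x_t = xs[np.argmin(u)]                                      # next evaluation point
    X.append(x_t)
    y.append(f(x_t) + sigma * np.random.randn())
```

In practice one would replace the grid search over the utility by a numerical optimizer, and estimate the kernel hyperparameters from the collected data.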
Entropy Search
The limitation of GP-UCB-type algorithms is that they solely focus
on minimizing regret. It might not be true that you always want to
collect the minimum function values. Ideally we would like, in a
guided fashion, to efficiently learn where the minimum is. This is
the driving idea behind Entropy Search40. In particular, instead
of evaluating where you think the minimum lies, evaluate where
you expect to learn most about the minimum.

[Figure 108: Entropy Search. Top: GP posterior hypotheses. Bottom: Utility u(x).]
40 Villemonteix et al., 2009; Hennig & Schuler, 2012

For this, we need to
make use of the entropy:
    H(p) := − ∫ p(x) log( p(x) / b(x) ) dx

of various hypotheses with base measure b. Then, we define the


utility as follows:

    u(x) = H_t(p_min) − E_{y_{t+1}}[ H_{t+1}(p_min) ]
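Both p_min and the expectation over y_{t+1} are intractable and must be approximated. Below is a crude Monte-Carlo sketch of the idea on a discretized domain with a uniform base measure; the fantasizing scheme and the sample sizes are illustrative assumptions, not the approximations used by the cited papers.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete p_min relative to a uniform base measure b."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p * len(p)))   # H(p) = -sum p log(p / b), b = 1/len(p)

def p_min_from_samples(S):
    """Empirical distribution of the minimizer from joint posterior samples S."""
    return np.bincount(S.argmin(axis=1), minlength=S.shape[1]) / len(S)

def utility(x_idx, mu, cov, sigma, n_fantasies=20, n_samples=500):
    """u(x) = H_t(p_min) - E_{y_{t+1}}[H_{t+1}(p_min)], estimated by Monte Carlo."""
    jitter = 1e-9 * np.eye(len(mu))
    samples = np.random.multivariate_normal(mu, cov + jitter, size=n_samples)
    H_now = entropy(p_min_from_samples(samples))
    H_next = 0.0
    for _ in range(n_fantasies):
        # fantasize an observation at grid point x_idx from the predictive marginal
        y_f = mu[x_idx] + np.sqrt(cov[x_idx, x_idx] + sigma ** 2) * np.random.randn()
        # rank-one Gaussian conditioning of the grid posterior on that observation
        gain = cov[:, x_idx] / (cov[x_idx, x_idx] + sigma ** 2)
        mu_f = mu + gain * (y_f - mu[x_idx])
        cov_f = cov - np.outer(gain, cov[x_idx, :])
        samples_f = np.random.multivariate_normal(mu_f, cov_f + jitter, size=n_samples)
        H_next += entropy(p_min_from_samples(samples_f))
    return H_now - H_next / n_fantasies
```

One would then evaluate next at the candidate x with the largest estimated u(x), i.e. where the expected reduction in the entropy of p_min is greatest.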

Following are several settings in which information-based search is


preferable:

• “prototyping-phase” followed by “product release”

• structured uncertainty with variable signal-to-noise ratio

• “multi-fidelity”: Several experimental channels of different cost


and quality, e.g.

– simulations vs. physical experiments


– training a learning model for a variable time
– using variable-size datasets

Regret-based optimization is easy to implement and works well


on standard problems. But it is a strong simplification of reality, in
which many practical complications cannot be phrased.

Condensed content

• the bandit setting formalizes iid. sequential decision making


under uncertainty

• bandit algorithms can achieve “no regret” performance, even


without explicit probabilistic priors

• Bayesian optimization extends to continuous domain

• it lies right at the intersection of computational and physical


learning

• requires significant computational resources to run a numerical


optimizer inside the loop

• allows rich formulation of global, stochastic, continuous, struc-


tured, multi-channel design problems

• is currently the state of the art in the solution of challenging


optimization problems
