

PHILIPP HENNIG
SCRIBES:
FREDERIK KÜNSTNER (2018/19)
ANN-KATHRIN SCHALKAMP (2019)
TIM REBIG (2020)
ZAFIR STOJANOVSKI (2021)

PROBABILISTIC
MACHINE
LEARNING

LECTURE NOTES
Questions about this document (typos, structure, etc.) should be directed to
Zafir Stojanovski at [email protected]

Questions regarding the content should be directed to


Philipp Hennig at [email protected]

Lecture notes prepared for the course


Probabilistic Machine Learning
given by Professor Philipp Hennig at the University of Tübingen.

Thanks to Felix Dangel, Ann-Kathrin Schalkamp and Tim Rebig for


their feedback on the document.

Last edited July 27, 2021.


Contributors: Philipp Hennig, Frederik Künstner, Ann-Kathrin
Schalkamp, Tim Rebig, Zafir Stojanovski
Introduction

Probabilistic inference is a foundation of scientific reasoning,


statistics, and machine learning. The goal of this course is to es-
tablish a formal framework for probabilistic reasoning, show how
to use it to build powerful inference mechanisms for real-world
problems and develop the technical tools necessary to implement
inference in practice.
The lecture begins with a general introduction to basic principles
of the rules of probability theory, then covers the probabilistic view
on standard settings such as supervised regression and classifica-
tion, unsupervised dimensionality reduction and clustering.
In a parallel thread through the lecture, we will also encounter a
number of popular algorithms for inference in probabilistic models,
including exact inference in Gaussian models, sampling, and free-
energy methods.
Some of the points we will cover are

• Connections between probabilistic inference and Boolean logic

• Learning functional relationships between variables.

• Establish a formal framework for probable reasoning

• A generalization from “shallow” and “deep” to “structured”


learning

• A general toolbox for encoding structured domain knowledge in


a learning agent, transferring it into a concrete algorithm

And much more to give a joint, connected, holistic view on reason-


ing, inference, learning and intelligence.
Contents

Reasoning under Uncertainty v 7

Probabilistic Reasoning v 15

Probabilities over Continuous Variables v 21

Monte Carlo Methods v 27

Markov Chain Monte Carlo v 33

Gaussian probability distributions v 39

Parametric Gaussian Regression v 45

Hierarchical Inference: learning the features v 55

Gaussian Processes v 61

Understanding Kernels v 73

Gauss-Markov Models v 83

Gaussian Process Classification v 91



Generalized Linear Models v 99

Exponential Family v 107

Graphical Models v 115

Factor graphs v 121

The Sum-Product Algorithm v 125

Extended Example: Topic Modeling v 133

Latent Dirichlet Allocation v 141

Efficient Inference and K-Means v 147

Mixture Models & EM v 151

Free Energy v 157

Variational Inference v 165

Customizing models and algorithms v 173

Making decisions v 181

Bibliography 189
Reasoning under Uncertainty v

An inference problem requires statements about the value of an unobserved (latent) variable x based on observations y which are related to x, but may not be sufficient to fully determine x. This requires a notion of uncertainty.

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon de Laplace
We hope in this chapter to give an intuition on reasoning under
incomplete information as well as present a rigorous mathematical
construction of the foundation of modern probability theory.

Examples

A Card Trick v
Three cards with colored faces are placed into a bag. One card is
red on both faces, one is white on both faces and the last is red on
one face and white on the other - see Fig. 1. After mixing the bag,
we pick a card and see that one face is white. What is the color of
its other face?
While we do not have direct information about the back of the
card, we can use what we know about the setup to make an educated
guess about the probability that it is also white. Make a guess;
is it 1/2? 2/3? Something else? We will revisit the problem at a later
stage.

Figure 1: A card trick

Deductive and Plausible Reasoning v


Classically, computers do not allow for handling uncertainties out
of the box and their reasoning is based on propositional logic.
Propositions are statements which can either be true or false. As an
example consider the scenario where we have two statements A =
"It rains" and B = "The street is wet". Given the rule A ⇒ B, which
translates to "If it rains the street gets wet", the framework only
licenses two inferences: A ⇒ B itself, and its contrapositive ¬ B ⇒ ¬ A,
corresponding to "If the street is dry it cannot have rained". For the
other two combinations of truth values of A and B the truth content
cannot be determined, although it might be more plausible that the
street is dry given that it has not rained. This limitation raises the
necessity of extending the formalism of binary truth values to a
spectrum between true and false.

Probability theory provides a framework which allows us to distribute
a finite amount of truth and allows to make much more subtle
statements about the relationship between variables/propositions.

Deductive Reasoning: A ⇒ B — if A is true, then B is true

A is true, thus B is true (modus ponens)
B is false, thus A is false (modus tollens)

Plausible Reasoning: P( B | A) > P( B) — if A is true, then B becomes more plausible

A is true, thus B becomes more plausible
B is false, thus A becomes less plausible
B is true, thus A becomes more plausible
A is false, thus B becomes less plausible

A helpful mental image to think about probabilities is to imagine
truth as a finite amount of "mass" which can be spread over
a space of mutually exclusive "elementary" events. More events
can then be constructed from unions and intersections of sets of
elementary events. An example of this construction is roulette –
see Fig. 2 – where the numbers 0 − 36 constitute the set of elementary
events and Red/Black, Odd/Even, Low/High and more elaborate
combinations are constructed from these elementary events.

Figure 2: Events for a Roulette

Formalization v

The goal of this section is to establish a formal framework for probable
reasoning. For this purpose, we will introduce Kolmogorov's
probability theory. Published in 1933, the approach by the Soviet
mathematician Andrey Kolmogorov¹ still lays the foundation of
modern probability theory and is based on an axiomatic system.

¹ Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. 1933

Kolmogorov's Axioms are a pure mathematical construction. We
first present a simplified form of the axioms;

Definition 1 (Kolmogorov's Axiom (Simplified)). Let Ω be a space
of possible "elementary" events, such as samples or propositions,
and let F be the set of all possible subsets of Ω. The probability p
of an event A ∈ F is a real map p : F → R that has the following
three properties:

1. Non-Negativity: For all A ∈ F , 0 ≤ p( A) ≤ 1.

2. Normalization: p(Ω) = 1.

3. Additivity: If A and B are mutually exclusive, then
p( A ∪ B ) = p( A ) + p( B ).

Kolmogorov defines measures on sets using A ∩ B, A ∪ B, Ā. The shortcut p( A, B) is often used for p( A ∧ B ).

Contemporary formal form v


Now that we have an intuition, we only need to introduce some
mathematical objects used in Kolmogorov’s definitions to give the
full and more abstract formulation of the axioms.

Definition 2 (σ-algebra, measurable sets & spaces). Let E be a space


of elementary events. Consider the power set 2E , and let F ⊂ 2E be
a set of subsets of E. Elements of F are called random events. If F
satisfies the following properties, it is called a σ-algebra.

1. E ∈ F

2. ( A, B ∈ F ) ⇒ ( A − B ∈ F )

3. ( A1 , A2 , · · · ∈ F ) ⇒ ( ⋃_{i=1}^∞ Ai ∈ F ) ∧ ( ⋂_{i=1}^∞ Ai ∈ F )

(this implies ∅ ∈ F . If E is countable, then 2E is a σ-algebra). If F


is a σ-algebra, its elements are called measurable sets, and ( E, F ) is
called a measurable space (or Borel space).

Sigma-algebras are very abstract objects and in most cases cannot
be written down explicitly. Nevertheless, their stability under the
stated operations on sets enables them to describe all possible events in
our space E. Moreover, the σ-additivity property guarantees that
truth is neither created out of thin air, nor destroyed.

Definition 3 (Measure & Probability Measure). Let ( E, F ) be a


measurable space (aka. Borel space). A nonnegative real function P :
F → R0,+ is called a measure if it satisfies the following properties:

1. P(∅) = 0

2. For any countable sequence { Ai ∈ F }_{i=1,...} of pairwise disjoint
sets (Ai ∩ Aj = ∅ if i ≠ j), P satisfies countable additivity (aka.
σ-additivity):

P( ⋃_{i=1}^∞ Ai ) = ∑_{i=1}^∞ P( Ai ).

The measure P is called a probability measure if P( E) = 1 (Note: for


probability measures, 1. is unnecessary). In this setting, ( E, F , P) is
called a probability space.

These two definitions constitute the contemporary formal form


of Kolmogorov’s axioms and give rise to the following theorem:

Theorem 4 (Sum Rule). From A + ¬ A = E we get

P( A) + P(¬ A) = P( E) = 1, thus P( A) = 1 − P(¬ A).

And from A = A ∩ ( B + ¬ B), using the notation P( A, B) = P( A ∩ B)


for the joint probability of A and B, we get the Sum Rule

P( A) = P( A, B) + P( A, ¬ B).

To be able to take into account events which have already oc-


curred, we need to define conditional probabilities.

Definition 5 (Conditional Probability). If P( A) > 0, the quotient


P( B | A) = P( A, B) / P( A)
is called the conditional probability of B given A. It immediately gives

P( A, B) = P( B | A) P( A) = P( A | B) P( B).

It is easy to show that P( B | A) ≥ 0 , P( E | A) = 1 , and for


B ∩ C = ∅, we have P( B + C | A) = P( B | A) + P(C | A). Thus, for
a fixed A, ( E, F , P(· | A)) is a probability space.

The equations P( A, B) = P( B | A) P( A) = P( A | B) P( B) are also
known as the Product Rule. For a thorough treatment, see Chapters
1 and 2 in Probability Theory - the Logic of Science². Using conditional
probabilities, we can now state the following extension of the Sum
Rule:

² Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003. URL bayes.wustl.edu/etj/prob/book.pdf

Theorem 6 (Law of Total Probability). Let A1 + A2 + · · · + An = E
and Ai ∩ Aj = ∅ if i ≠ j. Then, for any X ∈ F ,

P( X ) = ∑_{i=1}^n P( X | Ai ) P( Ai ).

Proof. Because X = E ∩ X = ⋃_{i=1}^n ( Ai ∩ X ), from σ-additivity we get
that

P( X ) = ∑_{i=1}^n P( Ai , X ) = ∑_{i=1}^n P( X | Ai ) P( Ai ),

where the last step uses the definition of conditional probability.

Bayes’ Theorem

Finally, we can state the mechanism that is at the heart of all proba-
bilistic reasoning.

Theorem 7 (Bayes' Theorem). Let A1 + A2 + · · · + An = E and
Ai ∩ Aj = ∅ if i ≠ j. Then, for any X ∈ F ,

P( Ai | X ) = P( Ai ) P( X | Ai ) / ∑_{j=1}^n P( Aj ) P( X | Aj ).

Proof. Apply the Sum Rule to the definition of the conditional


probability.

The language of inference, commonly used in Bayesian inference,
assigns meaning to the terms within Bayes' theorem. Assuming
that X is a hypothesis and D is an observation:

p( X | D ) = p( X ) × p( D | X ) / p( D )

(posterior = prior × likelihood / evidence)

• The posterior is the probability that the hypothesis X is true after


observing D.

• The prior is the probability that the hypothesis X is true before


any observation.

• The likelihood is the conditional probability of observing D given


that the hypothesis is true.

• The evidence is the probability of observing D, regardless of the


truth of hypothesis X.

Bayes’ theorem states how to update the plausibility of the hypoth-


esis X based on the observation data D.
A marginal distribution is the distribution of a subset of variables,
where some variables are averaged out. Given a joint distribution
p( A, B), the marginal p( A) is

p( A) = ∑_{b∈B} p( A | B = b ) p( B = b ),

where B is the space of possible values for B. Note that despite the
name, the prior is not necessarily what you know before seeing the
data, but the marginal distribution P( X ) = ∑_{d∈D} P( X, d) under all
possible data.

See wikipedia.org/wiki/Posterior_probability#Example and wikipedia.org/wiki/Marginal_distribution#Real-world_example for more examples.

Revisiting the card trick v


There are multiple ways to define events to approach the problem;
here is one possible solution: Let C be the card picked out of the
bag with possible values { RR, RW, WW } – for Red-Red, Red-White
and White-White – and let W be the event “the observed face is
white”. The other side of the card is also white iff we have picked
C = WW, so we are interested in the probability p(C = WW |W ).
Applying Bayes’ Theorem, we have that

p(C = WW | W ) = p(W | C = WW ) p(C = WW ) / p(W ).

Filling in numbers,

• the prior probability of picking C = WW is 1/3,

• the probability that the observed face is white given WW is 1,

• because half of the faces are white, p(W ) = 1/2,

leading to p(C = WW | W ) = (1/3)/(1/2) = 2/3. Try to apply the same
strategy to solve the famous Monty Hall problem³!

³ wikipedia.org/wiki/Monty_Hall_problem
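To make the result concrete, here is a small simulation (not part of the original notes; the card encoding, seed and sample size are our own choices) that draws a card and a visible face uniformly at random and estimates p(C = WW | W) by counting:

# Hypothetical check of the card trick by simulation.
import random

random.seed(0)
cards = [("R", "R"), ("R", "W"), ("W", "W")]

white_seen = 0
white_back = 0
for _ in range(100_000):
    card = random.choice(cards)        # pick a card uniformly from the bag
    face = random.randrange(2)         # pick which side faces up
    if card[face] == "W":
        white_seen += 1
        if card[1 - face] == "W":
            white_back += 1

print(white_back / white_seen)         # close to 2/3, not 1/2

The empirical frequency settles near 2/3 rather than 1/2, matching the calculation above.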

Revisiting Propositional Reasoning v

As deductive reasoning is a subset of plausible reasoning, using


Bayes’ theorem we can show the following:

Lemma 8. P( B | A) = 1 implies if A is true, then B is true

A⇒B is equivalent to p( B| A) = 1
¬B ⇒ ¬ A is equivalent to p(¬ A|¬ B) = 1
p( B|¬ A) ≤ p( B) A is false implies B becomes less plausible
p( A| B) ≥ p( A) B is true implies A becomes more plausible.

But plausible reasoning is more general, as one can see when


using conditional probabilities.

Lemma 9. P( B | A) ≥ P( B) implies if A is true, then B becomes


more plausible

p( B| A) ≥ p( B) A is true implies B becomes more plausible


p( B|¬ A) ≤ p( B) A is false implies B becomes less plausible
p( A| B) ≥ p( A) B is true implies A becomes more plausible
p(¬ A|¬ B) ≥ p(¬ A) B is false implies A becomes less plausible.

So far, inputs to the probability were propositional variables;


p( A) is the probability that A is true. In the remainder of this doc-
ument, p( A) is a function over the possible values that A can take,
with a slightly unusual notation. Given two binary variables A, B,
writing
p( A, B) = p( A) p( B)

means all of the following;

p( A = 0, B = 0) = p( A = 0) p( B = 0),

p( A = 0, B = 1) = p( A = 0) p( B = 1),

p( A = 1, B = 0) = p( A = 1) p( B = 0),

p( A = 1, B = 1) = p( A = 1) p( B = 1).

When Bayesian reasoning matches human reasoning v

Flaws of human reasoning can show up as well in the probabilistic


framework. We consider an example based on the poem The Raven⁴,
in which you hear an unexpected tapping t at your door and you
start to ponder who that might be. We strongly recommend the
enjoyable recitation of the poem given in the lecture. If you assume
all candidates v1 , .., vn who might visit you are equally likely, using
Bayes theorem to infer knowledge will be useless. The constant likelihood
terms in the evidence will cancel out, hence data that has a
uniform likelihood under all hypotheses does not provide information,
and therefore does not change the posterior probability.

⁴ https://en.wikipedia.org/wiki/The_Raven

P( v1 | t ) = P( t | v1 ) · P( v1 ) / P( t ) = P( t | v1 ) · P( v1 ) / ( P( t | v1 ) ∑_i P( vi ) ) = P( v1 ) / ∑_i P( vi )

You might also convince yourself before opening the door that one
of the possible hypotheses is more likely than the others, a specific
visitor l you very much like to see. Doing so, you can unknowingly

adjust the prior in such a way that this event dominates the poste-
rior. This is a concern of critics of probabilistic reasoning who often
argue that it is possible to obtain any desired explanation for the
data as long as the hypothesis in question has non-zero probability.

P(ℓ | t) = P(t | ℓ) · P(ℓ) / P(t) = P(t | ℓ) · P(ℓ) / ( P(t | ℓ) P(ℓ) + ∑_{i≠ℓ} P(t | vi ) P(vi ) )

A similar human flaw would be to not accept that one of the possi-
ble hypotheses is actually impossible (the person l might be dead)
and needs to be assigned a prior probability of 0.

P(ℓ | t) = P(t | ℓ) · P(ℓ) / P(t) = P(t | ℓ) · 0 / P(t) = 0

In the poem the person is greeted by darkness there and nothing more
after opening the door. Realizing that there is no visitor creates
an inconsistency in our theory, as all former hypotheses v1 , .., vn
now have probability 0, which contradicts the observed tapping.
The hypothesis space has to contain some explanation for the
observation t. The appropriate construction of the hypothesis space
can be one of the main challenges in practice.

P(v) = 0, P(ℓ) = 0, P(t) = ∑_i P(t | vi ) P(vi ) = 0?

Reasoning that the wind w might have caused the tapping at this
point and adding this new hypothesis reveals a frequent problem of
probabilistic inference. We have to know the correct variables before
we start reasoning and include them. Otherwise, no matter what
prior distribution we choose, the results will be flawed.

P(w | t) = P(t | w) · P(w) / P(t) = P(t | w) · P(w) / ( P(t, w) + ∑_i P(t, vi ) )

If the hypothesis space does not include the correct explanation r


to begin with - in the poem it is a raven which is responsible for
the tapping - probability theory will always assign 0 probability to
r. This problem is independent of the mechanism of probabilistic
reasoning but demands creative thinking when setting up the space
of hypotheses.

Condensed content

The Rules of Probability:

• An inference problem requires statements about the value of an


unobserved (latent) variable x based on observations y which are
related to x, but may not be sufficient to fully determine x. This
requires a notion of uncertainty.

• the Sum Rule:

P( A) = P( A, B) + P( A, ¬ B)

• the Product Rule:

P( A, B) = P( A | B) · P( B) = P( B | A) · P( A)

• Bayes’ Theorem:

P( A | B) = P( B | A) P( A) / P( B) = P( B | A) P( A) / ( P( B, A) + P( B, ¬ A) )

• Bayes’ Theorem provides the mechanism for inference:

P( X | D ) = P( D | X ) · P( X ) / P( D ),

where P( X | D ) is the posterior of X given D, P( D | X ) the likelihood of X under D, P( X ) the prior of X, and P( D ) the evidence for the model.

• The fundamentality of probabilities has been debated at length.


Probabilities are not the only inference system, but they are
uniquely general, expressive, and powerful.

• Machine learning and AI can be approached in various ways.


The probabilistic viewpoint is the closest we have to a theory of
everything for ML.
Probabilistic Reasoning v

Probability Theory can stumble onto computational difficulties,
as the number of parameters required to describe a system grows
exponentially with the number of variables considered. The joint
distribution of n = 26 binary variables A, B, . . . , Z has 2^n − 1 free
parameters, p1 , p2 , . . . , p_{2^n − 1} ,

p( A, B, . . . , Z ) = p1
p(¬ A, B, . . . , Z ) = p2
...
p(¬ A, ¬ B, . . . , Z ) = p_{67 108 863}
p(¬ A, ¬ B, . . . , ¬ Z ) = 1 − ∑_{i=1}^{2^n − 1} pi

Storing the parameters alone would already require ≈ 67Mb of


RAM as we have to keep track of every single hypothesis in a com-
binatorially large space. In addition to a large memory requirement,
computing marginal probabilities such as p( A) is also time con-
suming. Thankfully, under some assumption, we can express the
joint distribution in fewer numbers.

The earthquake and the burglar v

Consider the following scenario, which we will use as an example


for probabilistic reasoning. Assume that you have a home alarm
system that can detect burglars, but can also be triggered by earth-
quakes. Being away from home, you receive a text message from
the alarm system, and you want to assess the probability that your
home is currently being robbed. To get more information, you can
turn on the radio, which will reliably broadcast a message if an
earthquake happened.
Let’s define the following observable variables,

A: The alarm was triggered,


R: The radio announced an earthquake,

and the following latent variables, which we will need to infer,

E: There was an earthquake,


B: There is a burglar in your home.

The joint probability distribution over those four binary variables


would typically need 2^4 − 1 = 15 = 8 + 4 + 2 + 1 parameters to be

fully represented,

p( A, R, E, B) = p( A| R, E, B) p( R| E, B) p( E| B) p( B).

Figure 3: Graphical model for the earthquake burglar example (a directed graph over the nodes E, B, R, A).

However, we can use domain knowledge to remove irrelevant conditions;

• We can assume that the probability of an earthquake is independent
of being robbed, such that p( E| B) = p( E).

• Similarly, we can assume that the radio broadcast does not de-
pend on your house being robbed, such that p( R| E, B) = p( R| E).

• Lastly, we can assume that your home alarm is independent


of the radio broadcast, when conditioned on the occurrence of an
earthquake, that is p( A| R, E, B) = p( A| E, B).

Note that this last point does not imply that the alarm system is in-
dependent of the radio broadcast, p( A| R) ≠ p( A). If an earthquake
increases the probability of false alarms and the probability of radio
broadcast, knowing that there was a radio broadcast increases the
probability that the alarm will go off. Those simplifications lead to
a system with 8 = 4 + 2 + 1 + 1 parameters,

p( A, R, E, B) = p( A| E, B) p( R| E) p( E) p( B).

To start reasoning about the problem, we will need to plug in a


few numbers. We will start by assuming that both earthquakes and
burglars are rare, and that each day has a 1/1,000 chance of seeing
any of them occurring, translating to a frequency of roughly one
earthquake/robbery every three years,

p( E) = 10−3 , p( B) = 10−3 .

We will assume that the radio is perfectly reliable, such that

p( R = 1| E = 1) = 1, p( R = 1| E = 0) = 0.

For the alarm, we will assume that it can send false alarms, with a
rate f = 1/1,000, that a burglar has a α B = 99/100 chance of triggering
it while an earthquake only has a α E = 1/100 chance of triggering it.
This yields the following table of probabilities,

p( A = 0| B = 0, E = 0) = (1 − f ) = 0.999,
p( A = 0| B = 0, E = 1) = (1 − f )(1 − α E ) = 0.98901,
p( A = 0| B = 1, E = 0) = (1 − f )(1 − α B ) = 0.00999,
p( A = 0| B = 1, E = 1) = (1 − f )(1 − α B )(1 − α E ) = 0.0098901,

p( A = 1| B = 0, E = 0) = f = 0.001,
p( A = 1| B = 0, E = 1) = 1 − (1 − f )(1 − α E ) = 0.01099,
p( A = 1| B = 1, E = 0) = 1 − (1 − f )(1 − α B ) = 0.99001,
p( A = 1| B = 1, E = 1) = 1 − (1 − f )(1 − α B )(1 − α E ) = 0.9901099.

Using Bayes’ Theorem, we can now reason about various scenar-


ios. Given the information that our alarm went off, but without
knowledge of the radio broadcast, we can compute the probabil-
ity that there was a break-in. Plugging the numbers above in the
conditional probability formula yields

p( B = 1| A = 1) = p( A = 1, B = 1) / p( A = 1)
               = ∑_{R,E} p( A = 1, B = 1, R, E) / ∑_{B,R,E} p( A = 1, B, R, E) = 0.495.

The somewhat lengthy calculation can be seen here v . If we also
know that the radio broadcasts an announcement about an earthquake v ,

p( B = 1| A = 1, R = 1) = p( A = 1, B = 1, R = 1) / p( A = 1, R = 1)
                        = ∑_E p( A = 1, B = 1, R = 1, E) / ∑_{B,E} p( A = 1, B, R = 1, E) = 0.08.
The phenomenon of reducing the probability of an event by adding
more observation is often referred to as explaining away. In this
example the information about the radio announcement explains
away the break-in as a reason for the alarm.
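These two numbers are easy to verify by brute force. The following sketch (hypothetical code, not from the notes; the helper names are ours, the parameter values follow the text) enumerates the joint p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B) over all sixteen assignments and evaluates both conditionals:

# Hypothetical enumeration of the earthquake/burglar joint distribution.
from itertools import product

pE, pB, f, aB, aE = 1e-3, 1e-3, 1e-3, 0.99, 0.01

def p_alarm(a, b, e):
    # p(A=1 | B=b, E=e) = 1 - (1-f)(1-aB)^b (1-aE)^e, as in the table above
    p1 = 1.0 - (1.0 - f) * (1.0 - aB) ** b * (1.0 - aE) ** e
    return p1 if a == 1 else 1.0 - p1

def p_radio(r, e):
    return float(r == e)          # the radio reports an earthquake iff one happened

def joint(a, r, e, b):
    return (p_alarm(a, b, e) * p_radio(r, e)
            * (pE if e else 1 - pE) * (pB if b else 1 - pB))

# p(B=1 | A=1) -> 0.495...
num = sum(joint(1, r, e, 1) for r, e in product([0, 1], repeat=2))
den = sum(joint(1, r, e, b) for r, e, b in product([0, 1], repeat=3))
print(num / den)

# p(B=1 | A=1, R=1) -> 0.08...
num = sum(joint(1, 1, e, 1) for e in [0, 1])
den = sum(joint(1, 1, e, b) for e, b in product([0, 1], repeat=2))
print(num / den)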

The general recipe for probabilistic reasoning can be summa-


rized as

• Identifying all relevant variables: A, R, E, B

• Defining the joint probability (aka. the generative model):


P( A, R, E, B)

• Fixing certain variables through observing: A = 1

• Performing inference through Bayes’ Theorem

Graphical representation of (in)dependence v


A visual summary of our probabilistic model, shown in Fig. 3,
displays the relationship between variables as a directed graph. Ob-
servable variables are shown in dark nodes, while latent variables
are shown in white nodes.

Definition 10 (Bayesian Network). A Directed Graphical Model


(DGM), or Bayesian Network, is a probability distribution over
variables X1 , . . . , XD with the following structure,
p( X1 , . . . , XD ) = ∏_{i=1}^D p( Xi | pa( Xi )),

where pa( Xi ) are the parental variables of Xi , that is, Xi ∉ pa( Xj ) ∀ Xj ∈


pa( Xi ). A DGM is represented by a Directed Acyclic Graph (DAG)
with the propositional variables as nodes, and arrows from parents
to children.

By the Product rule, every joint probability distribution can be


factorized into a dense DAG. The following factorization,

p( A, E, B, R) = p( A| E, B, R) p( R| E, B) p( E| B) p( B),

leads to one such DAG, but this other factorization leads to an-
other graphical representation where the direction of each edge is
reversed,

p( A, E, B, R) = p( B| A, E, R) p( E| A, R) p( R| A) p( A).

The direction of the arrows is not a causal statement. Representing


the probabilistic model as a DAG is not always useful, unless it
reveals independence, as in Fig. 3, where the factorization leading to
the graphical model is

p( A, E, B, R) = p( A| E, B) p( R| E) p( E) p( B).

Definition 11 (Independence). Two variables A and B are indepen-


dent iff their joint distribution factorizes into marginal distributions,

p( A, B) = p( A) p( B).

In that case p( A| B) = p( A) and we use the notation A ⊥⊥ B.
Information about B does not give information about A and vice
versa.

Note that p( A| B) = p( A) is equivalent to the statement p( A, B) =


p( A) p( B) due to the definition of conditional probability as p( A, B) =
p ( A | B ) p ( B ).

Definition 12 (Conditional independence). Two variables A and B


are conditionally independent given variable C iff their conditional
distribution factorizes,

p( A, B | C ) = p( A|C ) p( B|C ).

In that case we have p( A | B, C ) = p( A|C ), i.e., in light of infor-


mation C, B provides no further information about A. We use the
notation A ⊥⊥ B | C.

v Independence and conditional independence are related but


do not imply each other. Consider the following example; given two
coins, let A be the event that the first coin shows head, let B be the
event that the second coin shows head and let C be the event that
both coins show the same result. A ⊥⊥ B should be intuitive, as
the result of a coin toss does not give information about another
coin toss, but we also have that A ⊥⊥ C and B ⊥⊥ C. To see why
this is, fix the value of a coin, say A is true. Then, we have that
p(C | A = 1) = p( B) = 1/2, which is equal to p(C ). However, we
have that ¬( A ⊥⊥ B | C ), as knowing the output of the second coin and
whether both coins show the same face gives full information on
the result of the first coin.
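Since the example only involves four equally likely outcomes for (A, B), the claims can be checked by direct enumeration; the following sketch (our own illustration, not from the notes) does exactly that:

# Hypothetical check of the two-coin example: C = 1 iff both coins agree.
from itertools import product

outcomes = [(a, b, int(a == b)) for a, b in product([0, 1], repeat=2)]  # each has prob 1/4

def prob(pred):
    return sum(0.25 for (a, b, c) in outcomes if pred(a, b, c))

pA = prob(lambda a, b, c: a == 1)
pC = prob(lambda a, b, c: c == 1)
pAC = prob(lambda a, b, c: a == 1 and c == 1)
print(pAC == pA * pC)                        # True: A and C are independent

# ... but A and B are not conditionally independent given C:
pA_given_BC = prob(lambda a, b, c: a == 1 and b == 1 and c == 1) / prob(lambda a, b, c: b == 1 and c == 1)
pA_given_C = pAC / pC
print(pA_given_BC, pA_given_C)               # 1.0 vs 0.5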

Reading independence from DAGs is easier if we consider subsets
of variables. For subsets of one and two variables, the independence
structure is obvious. Starting with tri-variate structures it gets
more interesting:

p( A, B, C )                  DAG          Independence    But!
p(C | B) p( B| A) p( A)       A → B → C    A ⊥⊥ C | B      ¬( A ⊥⊥ C )
p( A| B) p(C | B) p( B)       A ← B → C    A ⊥⊥ C | B      ¬( A ⊥⊥ C )
p( B| A, C ) p( A) p(C )      A → B ← C    A ⊥⊥ C          ¬( A ⊥⊥ C | B )

Figure 4: Independence structure for tri-variate subgraphs. The graph structures from top to bottom are also called chain graph, fan-out and collider graph.

Note however, that it is not possible to deduce more complex relations
by looking at those simple subgraphs - it is possible, for example,
that A ⊥⊥ C | B but, bringing in a new variable D, we could have
that ¬( A ⊥⊥ C | B, D ). Also, a single DAG does not necessarily reveal
all the independence properties of a probabilistic model and DAGs
are therefore in that sense an incomplete
language.

The DAG for the two coins example is not unique. For exam-
ple, computing the probabilities

p( A = 1) = 1/2,    p( B = 1) = 1/2,
p(C = 1| A = 1, B = 1) = 1,    p(C = 1| A = 0, B = 1) = 0,
p(C = 1| A = 1, B = 0) = 0,    p(C = 1| A = 0, B = 0) = 1,

we have that the conditional probabilities imply that

p ( A | B ) = p ( A ), p ( B | C ) = p ( B ), p ( C | A ) = p ( C ), p ( C | B ) = p ( C ),

leading to the following three possible factorizations,

p( A, B, C ) = p(C | A, B) p( A) p( B),

p( A, B, C ) = p( A| B, C ) p( B) p(C ),
p( A, B, C ) = p( B| A, C ) p( A) p(C ),

each matching a DAG in Fig. 5.

Figure 5: Three possible DAG for the two coin example.

Condensed content

Computing with Probabilities

• Probabilistic reasoning extends propositional logic

• instead of tracking a single truth value, we have to assign probabilities to combinatorially many hypotheses

• Two variables A and B are conditionally independent given


variable C, if and only if their conditional distribution factorizes,

P( A, B|C ) = P( A|C ) P( B|C )

Graphical Models and Conditional Independence

• Multivariate distributions can have exponentially many degrees


of freedom.

• (Conditional) independence helps reduce this complexity to


make things tractable in multi-variate problems.

• Directed graphical models provide a notation from which condi-


tional independence can be read off using simple rules.

• Every probability distribution can be factorized into a DAG, but not every independence structure of a distribution is captured by a DAG of it.
Probabilities over Continuous Variables v

Probability theory extends propositional logic with propo-


sitional variables A, . . . , Z ∈ {0, 1} ranging over the space of all
possible boolean assignments Ω, with a normalized probability
measure p : Ω → [0, 1], such that ∑w∈Ω p(w) = 1. Discrete probabil-
ity theory can also handle variables in a discrete set Ω = {0, 1, . . .}
using a similar probability measure, while continuous probability
theory uses the probability density function p : Ω → R+ to handle
continuous sample spaces, such as Ω = R, with the property that
∫_{w∈Ω} p( w ) dw = 1.
We will later see the precise definitions of the mathematical
objects mentioned in this section. For the moment the notation
should suffice to give a first intuition.
Let X be a random variable taking real values, X ∈ R, and define
the following events: A = ( X ≤ a), B = ( X ≤ b), W = ( a < X ≤ b).
As A and W are mutually exclusive, by the sum rule we have that

p ( B ) = p ( A ) + p (W ) , p (W ) = p ( B ) − p ( A ) .

Thinking of the events as functions of the limits they are checking,
PX ( x ) = P( X ≤ x ), we can use their derivative p( x ) = ∂PX ( x )/∂x to
express this problem using integrands,

p( a < X ≤ b) = PX (b) − PX ( a) = ∫_a^b p( x ) dx.

PX is called the cumulative distribution function (CDF) and p is the
probability density function (PDF).

The Product and the Sum rules apply to the probability density
function, and taken together imply Bayes' rule.

p( x1 | x2 ) = p( x1 , x2 ) / p( x2 )                                    Product rule,

p_{X1}( x1 ) = ∫ p_X ( x1 , x2 ) dx2                                     Sum rule,

p( x1 | x2 ) = p( x1 ) · p( x2 | x1 ) / ∫ p( x1 ) · p( x2 | x1 ) dx1     Bayes' Theorem.

Those rules, however, do not apply to the cumulative distribution
function PX ( x ). Fig. 6 illustrates the joint, marginal and conditional
densities on a two-dimensional example.

Figure 6: Joint probability density function for two variables, highlighting the marginal p(y) (rear panel) and conditional probability density p( x | y = 0) (cutting through the joint density).

The base measure

Probability density functions are only defined relative to a base


measure, and changes of variables need additional care v .

Theorem 13 (Change of Variable for Probability Density Functions).
Let X be a continuous random variable with PDF pX ( x ) over c1 <
x < c2 . And, let Y = u( X ) be a monotonic differentiable function
with inverse X = v(Y ). Then the PDF of Y is

pY (y) = pX (v(y)) · | dv(y)/dy | = pX (v(y)) · | du( x )/dx |^{−1} .

To understand the last factor and its inversion we recommend


to have a look at v . Assume that u is monotonically increasing,
u′( X ) > 0, and let d1 , d2 = u(c1 ), u(c2 ). pY is defined on d1 < y < d2
and we have that the CDF PY is defined w.r.t. the PDF pX as

PY (y) = P(Y ≤ y) = P(u( X ) ≤ y) = P( X ≤ v(y)) = ∫_{c1}^{v(y)} p( x ) dx,

and the PDF pY follows from the CDF PY as

pY ( y ) = ∂PY (y)/∂y = pX (v(y)) ∂v(y)/∂y.

To obtain the absolute value, repeat the previous steps with a
monotonically decreasing change of variable such that u′( x ) < 0.
We can now state the generalization of the former theorem for
multivariate functions.

Theorem 14 (Transformation Law, general). Let X = ( X1 , . . . , Xd )


have a joint density p X . Let g : Rd → Rd be continuously differen-
tiable and injective, with non-vanishing Jacobian Jg . Then Y = g( X )
has density

pY ( y ) = pX ( g^{−1}(y)) · | J_{g^{−1}} (y)|  if y is in the range of g,  and 0 otherwise.

Formal definitions
We now give the promised formal definitions which lead the way to
a rigorous formulation of densities and probabilities on continuous
spaces v . (This is mostly useful to understand other reference material
on the subject; do not worry if the definitions sound too convoluted
at first.) The first challenge arrives when deriving a σ-algebra F
for continuous spaces. We will not use the canonical way by taking
the power set of the elements of our continuous space Ω as our
σ-algebra. This is because the power sets can contain sets which
are not measurable with respect to the Lebesgue measure⁵. This
measure allows the integration of a wider function class than the
Riemann integral and is usually sufficient for the integration of real
valued functions, apart from corner cases.

⁵ https://en.wikipedia.org/wiki/Lebesgue_measure
To approach σ-algebras for continuous spaces, we need to use
open sets. Hence, we start with the definition of topological spaces.

Definition 15 (Topology). Let Ω be a space and τ be a collection of


sets. We say τ is a topology on Ω if

• Ω ∈ τ, and ∅ ∈ τ

• any union of elements of τ is in τ

• any intersection of finitely many elements of τ is in τ.

The elements of the topology τ are called open sets. In the Euclidean
vector space Rd , the canonical topology is that of all sets U that
satisfy x ∈ U :⇒ ∃ε > 0 : ((‖y − x‖ < ε) ⇒ (y ∈ U )).

Note that R is a topological space.

Definition 16 (Borel algebra). Let (Ω, τ ) be a topological space. The


Borel σ-algebra is the σ-algebra generated by τ. That is by taking τ
and completing it to include infinite intersections of elements from
τ, all complements to elements of τ, and restricting all unions of
elements from τ to countably many.

Now that we can define (Borel) σ-algebras on continuous spaces,


we have the tools to define distribution measures.

Definition 17 (Measurable Functions, Random Variables). Let


(Ω, F ) and (Γ, G) be two measurable spaces (i.e. spaces with σ-
algebras). A function X : Ω → Γ is called measurable if X −1 ( G ) ∈ F
for all G ∈ G. If there is, additionally, a probability measure P (see
definition in chapter 1) on (Ω, F ), then X is called a random variable.

Consider (Ω, F ) and (Γ, G). If both F and G are Borel σ-algebras,
then any continuous function X is measurable (and can thus be
used to define a random variable). This is because, for continu-
ous functions, pre-images of open sets are open sets and Borel
σ-algebras are the smallest σ-algebras to contain all those sets.

Definition 18 (Distribution Measure). Let X : Ω → Γ be a random


variable. Then the distribution measure (or law) PX of X is defined for
any G ⊂ Γ as

PX ( G ) = P( X −1 ( G )) = P({ω | X (ω ) ∈ G }).

Definition 19 (Probability Density Functions (pdf’s)). Let B be the


Borel σ-algebra in Rd . A probability measure P on (Rd , B) has a
density p if p is a non-negative (Borel) measurable function on Rd
satisfying, for all B ∈ B
P( B) = ∫_B p( x ) dx =: ∫_B p( x1 , . . . , xd ) dx1 . . . dxd

Note, not all measures have densities (ex. measures with point
masses).

Definition 20 (Cumulative Distribution Function (CDF)). For prob-


ability measures P on (Rd , B), the cumulative distribution function is
the function

F ( x ) = P( ∏_{i=1}^d ( Xi < xi ) ).

(In particular for the univariate case d = 1, we have F ( x ) = P( (−∞, x ] ).)

If F is sufficiently differentiable, then P has a density, given by

p( x ) = ∂^d F / ( ∂x1 · · · ∂xd ),

and, for d = 1,

P( a ≤ X < b) = F (b) − F ( a) = ∫_a^b p( x ) dx.

Example: inference of probabilities v

What is the probability - π - for a person to be wearing


glasses? As we do not know this probability, we can model our
uncertainty about it with a random variable π ranging in [ 0, 1 ]
and thus we learn the probabilities of probabilities. To answer the
question, we can collect some observations X and use inference;

p(π | X ) = p( X | π ) p(π ) / p( X ) = p( X | π ) p(π ) / ∫ p( X | π ) p( π ) dπ.

To define the prior distribution, we can start with a uniform distri-


bution, p ( π ) = 1 if π ∈ [ 0, 1 ] , 0 elsewhere. Assuming we sample
observations independently, the likelihood of a positive or negative
sample, given knowledge of π, is

p ( X = 1 | π ) = π, p ( X = 0 | π ) = 1 − π.

For multiple observations, this process gives rise to a derived variable
which is binomially distributed and depends on π, illustrated in
Fig. 7. In terms of the traditional coin flipping example, the binomial
gives the distribution over the number of times a coin will
show head over N tosses, given a probability of landing head of
π. The probability of sampling n positive and m negative observations,
if the probability of an independent positive observation is
given by π, is

p( n, m | π ) = (n+m choose n) π^n (1 − π )^m .

Figure 7: Probability distribution for the number of heads in N = 10 coin flips with a probability of landing head of f = 1/3.

Plugging this into the computation of the posterior yields

p(π | n, m) = (n+m choose n) π^n (1 − π )^m · 1 / ∫ (n+m choose n) π^n (1 − π )^m · 1 dπ = π^n (1 − π )^m · 1 / ∫ π^n (1 − π )^m · 1 dπ.

A nice choice for the prior p(π ), to make the computation easy,
is the Beta distribution with parameters a, b > 0,

p(π ) = π^{a−1} (1 − π )^{b−1} / Z ,

where Z is a normalization constant to ensure that ∫_0^1 p(π ) dπ = 1
and is given by the Beta function,

Z = B( a, b) = ∫_0^1 π^{a−1} (1 − π )^{b−1} dπ.

The uniform distribution can be represented as a Beta distribution⁶
with parameters a = b = 1. Given a Beta( a, b) prior, n positive and
m negative additional observations, the posterior is then

p(π | n, m) = π^{n+a−1} (1 − π )^{m+b−1} / B( a + n, b + m).

⁶ wikipedia.org/wiki/Beta_distribution
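As a small illustration of this conjugate update (a sketch of ours, assuming scipy is available; the observation counts are made up), the posterior after n positive and m negative samples under a uniform Beta(1, 1) prior is simply Beta(a + n, b + m):

# Hypothetical illustration of the Beta-Binomial conjugate update described above.
from scipy.stats import beta

a, b = 1.0, 1.0          # Beta(1, 1) prior = uniform distribution on [0, 1]
n, m = 7, 3              # observed positive / negative samples (made-up numbers)

post = beta(a + n, b + m)            # posterior is Beta(a + n, b + m), as derived in the text

print("posterior mean:", post.mean())        # (a + n) / (a + n + b + m)
print("density at pi = 0.5:", post.pdf(0.5))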

Condensed content

• Random Variables allow us to define derived quantities from


atomic events

• Borel σ-algebras can be defined on all topological spaces, allow-


ing us to define probabilities if the elementary space is continu-
ous.

• Probability Density Functions (pdf’s) distribute probability


across continuous domains.

– they satisfy "the rules of probability":

∫_{R^d} p( x ) dx = 1

p_{X1}( x1 ) = ∫_R p_X ( x1 , x2 ) dx2                                   Sum rule

p( x1 | x2 ) = p( x1 , x2 ) / p( x2 )                                    Product rule

p( x1 | x2 ) = p( x1 ) · p( x2 | x1 ) / ∫ p( x1 ) · p( x2 | x1 ) dx1     Bayes' Theorem.

– Not every measure has a density, but all pdfs define measures
– Densities transform under continuously differentiable, injective
functions g : x ↦ y with non-vanishing Jacobian as

pY ( y ) = pX ( g^{−1}(y)) · | J_{g^{−1}} (y)|  if y is in the range of g,  and 0 otherwise.

• Probabilistic inference can even be used to infer probabilities.


Monte Carlo Methods v

As the next tool to add to our toolbox, we will look at Monte Carlo
methods.
In many probabilistic inference problems, the main computa-
tional issue is the computation of expectations and marginal proba-
bilities,
E_{p( x)} [ x ] = ∫ x p( x ) dx,    p(y) = E_{p( x)} [ p(y| x ) ] = ∫ p(y| x ) p( x ) dx,

which requires integrating over probability distributions. A simple


solution to approximate those integrals is to replace the integral by
a sum over samples,
∫ x p( x ) dx ≈ (1/n) ∑_{i=1}^n xi ,    ∫ p(y| x ) p( x ) dx ≈ (1/n) ∑_{i=1}^n p( y | xi ),

if the samples xi are sampled independently from p( x ). As a gen-


eral formulation, we want to estimate
φ := ∫ f ( x ) p( x ) dx = E_{p( x)} [ f ( x ) ].

Given independent samples x1 , . . . , xn from p( x ), the estimator


φ̂ = (1/n) ∑_{i=1}^n f ( xi )

is an unbiased estimator of φ, meaning

E_{x1 ,...,xn}[ φ̂ ] = ∫ (1/n) ∑_{s=1}^n f ( xs ) p( xs ) dxs = (1/n) ∑_{s=1}^n ∫ f ( xs ) p( xs ) dxs = (1/n) ∑_{s=1}^n E( f ( xs )) = φ,

and its variance decreases at a rate of O(1/n), so its standard deviation decreases as O(1/√n):

E(φ̂ − E(φ̂))² = E[ ( (1/n) ∑_{s=1}^n f ( xs ) − φ )² ]
             = (1/n²) ∑_{s=1}^n ∑_{r=1}^n [ E( f ( xs ) f ( xr )) − φ E( f ( xs )) − E( f ( xr )) φ + φ² ]
             = (1/n²) ∑_{s=1}^n [ ∑_{r≠s} ( φ² − 2φ² + φ² ) + ( E( f² ) − φ² ) ]
             = (1/n) var( f ) = O(n^{−1}),

where the inner sum over r ≠ s vanishes and E( f² ) − φ² = var( f ).

Definition 21 (Monte Carlo method). Algorithms that compute


expectations in the above way, using samples xi ∼ p( x ), are called
Monte Carlo methods (Stanisław Ulam, John von Neumann).

Examples: Sampling is a rough guess v

We can use sampling to estimate π; as the ratio of the quarter-unit-circle
to the unit square [0, 1] × [0, 1] is π/4, we can write π as an
integration over samples uniformly distributed over the unit square

π = 4 ∫_{[0,1]×[0,1]} 1{ xᵀx < 1 } U ( x ) dx,

where U ( x ) is the PDF of the uniform distribution. This leads to a
simple algorithm for approximating π; draw samples from U ( x )
and count the number of samples that fall within the unit circle
(xᵀx < 1).

Figure 8: Estimating π by sampling

While this procedure only needs ≈ 9 samples to get the first
digit right, it is not great when high precision is required; to get to
single-float precision (≈ 10^−7), it needs about 10^14 samples. Fig. 9
shows the error w.r.t. the number of samples
Figure 9: Error of the Monte-Carlo estimate of π (the estimate φ̂ and the MC error compared to √(var( f )/s), plotted against the number of samples).

Samples from a probability distribution can be used to roughly


estimate expectations, without having to design an elaborate inte-
gration algorithm.
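A minimal sketch of the π example above (not from the notes; the seed and sample size are arbitrary choices):

# Hypothetical Monte Carlo estimate of pi by sampling the unit square.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0.0, 1.0, size=(n, 2))       # uniform samples on the unit square
inside = (x ** 2).sum(axis=1) < 1.0          # indicator 1{x^T x < 1}
pi_hat = 4.0 * inside.mean()                 # Monte Carlo estimate of pi

print(pi_hat)      # fluctuates around 3.14159..., with O(1/sqrt(n)) error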

Sampling

To use a Monte-Carlo method, we need to be able to sample from


p( x ). This is not an easy task in general, but there are algorithms to
turn samples from a uniform distribution into samples from p( x ).

Inverse Transform Sampling v


In some cases, if we know the distribution functions, we can use
the change of variable theorem to find the precise transformation of
uniform variables to variables of the desired distribution.

Suppose we have access to the uniform random variable u ∼


U [0, 1] (i.e. u ∈ [0, 1], and p(u) = 1). Furthermore, suppose that we
want to sample from B( x; α, 1), where

B( x; α, 1) = (1/B(α, 1)) x^{α−1}

with B(α, 1) = Γ(α) Γ(1) / Γ(α + 1) = ∫_0^1 x^{α−1} dx = 1/α,

which gives us

B( x; α, 1) = α x^{α−1}.

Now notice that by setting x = u^{1/α} we get exactly a Beta-distributed
random variable:

p_x( x ) = p_u (u( x )) · | ∂u( x ) / ∂x | = α · x^{α−1} = B( x; α, 1).

This is the easiest sampling method, as the solution is in closed


form. However, it only works for simple distributions where the
CDF and its inverse are known.

Figure 10: Graphical representation of Inverse Transform Sampling for the exponential distribution

In another example, for the exponential distribution with

p( x ) = (1/λ) e^{−x/λ},    P( x ) = ∫_{−∞}^x p( x̃ ) dx̃ = 1 − e^{−x/λ},

by setting u = 1 − e− x/λ we get x (u) = −λ log u, where u ∼ U [0, 1].


Again, by using the method above, we can see that indeed x (u) is
distributed according to the exponential distribution. Note, 1 − u ∼
U [0, 1] and u ∼ U [0, 1] are identically distributed.
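A minimal sketch of this inverse transform for the exponential distribution (our illustration; λ, the seed and the sample size are arbitrary choices):

# Hypothetical inverse transform sampling: x(u) = -lambda * log(u), u ~ U[0, 1].
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                   # scale parameter lambda
u = rng.uniform(size=100_000)               # uniform samples
x = -lam * np.log(u)                        # exponential samples via the inverse CDF

print(x.mean())   # close to lam, the mean of the exponential distribution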

For multivariate distributions that do not factorize, computing


the inverse can be tricky, and the Inverse Transform is not guaranteed
to be the fastest algorithm. For Gaussian distributions, the
Box-Muller transform is an example of an Inverse Transform sam-
pling method, but other algorithms such as Marsaglia’s polar form
or Ziggurat algorithm, based on rejection sampling, are faster.

Rejection Sampling v
One issue with Inverse Transform sampling is that the normaliza-
tion constant needs to be known; we cannot use an unnormalized

Figure 11: Graphical representation of rejection sampling. The grey samples will be rejected.

distribution. Rejection sampling can work around this if the shape of


the PDF p̃( x ) is known, where

p( x ) = p̃( x ) / Z.

If you can find a distribution q( x ) and a constant c such that cq( x )


is an upper bound for p̃( x ),

cq( x ) ≥ p̃( x ),

then it is possible to use samples from q and the uniform distribu-


tion to generate samples from p; draw s ∼ q( x ), u ∼ U [0, cq(s)], and
save s if u < p̃(s) (throw s away otherwise).
Even though it might seem wasteful to throw away samples, this
is needed in order to convert samples from q to samples from p
without knowing the relationship between these two.
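The procedure can be written in a few lines. The following sketch (hypothetical; the target, proxy and constant c are our own choices, with c q(x) ≥ p̃(x) holding by construction) samples from an unnormalized Gaussian-shaped p̃ using a wider Gaussian proxy:

# Hypothetical rejection sampling: draw s ~ q, u ~ U[0, c*q(s)], keep s if u < p_tilde(s).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: standard-normal shape, normalizer Z unknown to the sampler
    return np.exp(-0.5 * x ** 2)

def q_pdf(x, sigma_q=2.0):
    return np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

c = 2.0 * np.sqrt(2 * np.pi)   # chosen so that c * q(x) >= p_tilde(x) for all x

samples = []
while len(samples) < 10_000:
    s = rng.normal(0.0, 2.0)               # proposal s ~ q
    u = rng.uniform(0.0, c * q_pdf(s))     # u ~ U[0, c*q(s)]
    if u < p_tilde(s):                     # accept with probability p_tilde(s) / (c*q(s))
        samples.append(s)

print(np.std(samples))   # close to 1, the standard deviation of the target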
Rejection sampling is simple, but its performance is bad when
the dimensionality of the distribution increases, as it rejects a large
proportion of samples. Consider the simple case of using rejection
sampling from a d-dimensional normal distribution, p( x ) =
N( x; 0, σp² I ), using a proxy distribution q( x ) = N( x; 0, σq² I ),
where σq > σp (in order to satisfy the upper bound property). The
optimal c that makes c q( x ) close to p( x ), while still remaining an
upper bound, is given by

c = ( 2πσq² / 2πσp² )^{d/2} = ( σq / σp )^d = exp( d log( σq / σp ) ).

As the acceptance rate is proportional to the ratio of volume, 1/c,


and c scales exponentially in d, the rejection rate scales exponentially
in d as well. A ratio σq /σp = 1.1, which should be a reasonable approximation,
leads to an acceptance rate of less than 1/10^4 for d = 100;
≈ 10^4 samples from q are needed to generate a sample from p.

Importance Sampling v
Importance sampling is a slightly less simple method; if it is not
possible to compute the inverse transform to sample from p( x ), but
the PDF can still be evaluated, we can use samples from a proxy

distribution q( x ) by transforming the problem into

φ = ∫ f ( x ) p( x ) dx = ∫ f ( x ) ( p( x ) / q( x ) ) q( x ) dx.

To be well defined, we need q( x ) > 0 if p( x ) > 0.

This is just the computation of the expectation over the proxy


distribution q of a new function, g( x ) = f ( x ) p( x )/q( x ), so we can
get an unbiased estimate using

φ̃ = (1/n) ∑_i f ( xi ) p( xi ) / q( xi ) = (1/n) ∑_i f ( xi ) wi ,

where wi = p( xi )/q( xi ) is known as the importance, or weight,


of the ith sample. If the normalization constant Z is unknown, it is
possible to estimate it "on the fly", using p̃( x )/Z = p( x ) and

∫ f ( x ) p( x ) dx = (1/(Z S)) ∑_s f ( xs ) p̃( xs ) / q( xs )
                   = (1/S) ∑_s f ( xs ) ( p̃( xs )/q( xs ) ) / ( (1/S) ∑_{s′} p̃( x_{s′} )/q( x_{s′} ) )  =:  ∑_s f ( xs ) w̃s

This estimator is no longer unbiased, but it is still consistent; it will


eventually converge to the solution.
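A minimal sketch of the self-normalized estimator above (our illustration; the target, proxy and f are made-up choices):

# Hypothetical self-normalized importance sampling: estimate E_p[f(x)] using samples
# from a proxy q and weights p_tilde(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)
S = 100_000
sigma_q = 5.0

def p_tilde(x):
    # unnormalized target density: Gaussian bump centered at 3 (normalizer unknown)
    return np.exp(-0.5 * (x - 3.0) ** 2)

def q_pdf(x):
    return np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))

xs = rng.normal(0.0, sigma_q, size=S)   # samples from the proxy q
w = p_tilde(xs) / q_pdf(xs)             # importance weights p_tilde(x)/q(x)
w_tilde = w / w.sum()                   # self-normalized weights (Z estimated on the fly)

f = xs                                  # f(x) = x, i.e. estimate the mean under p
print(np.sum(f * w_tilde))              # approximately 3.0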

Figure 12: Graphical representation of importance sampling. Showing the true distribution p( x ), the proxy q( x ), weight function w( x ), f ( x ) = x being a linear function and black dots from the true distribution on the left. On the right we can see black samples drawn from p( x ) · x and red importance samples showing higher variance.

Importance sampling also has issues, especially in high dimen-


sions. The variance of a Monte-Carlo estimator is var[ f ( x )]/S,
and importance sampling replaces var[ f ( x )] with var[ g( x )] =
var[ f ( x ) p( x )/q( x )]. The variance of g can be large if q ≪ p some-
where. If p has “undiscovered highlands”, regions of high probabil-
ity to which q assigns very-low to no probability, some regions can
have the ratio p( x )/q( x ) grow to infinity as q( x ) goes to 0.

Condensed content

Sampling is a way of performing rough probabilistic computations


without having to design an elaborate integration algorithm, in
particular for expectations (including marginalization).

• ‘Random numbers’ generated by a computer don’t really need


to be unpredictable, as long as they have as little structure as
possible

• Uniformly distributed random numbers can be transformed


into other distributions. This can be done numerically efficiently
in some cases, and it is worth thinking about doing so

• Sampling is harder than global optimization. To produce exact


samples one needs a global description of the entire function
including knowledge of regions with high density (not just local
maxima!) as well as the cumulative density everywhere else.

• Practical Monte Carlo Methods aim to construct samples from

p( x ) = p̃( x ) / Z

assuming that it is possible to evaluate the unnormalized density p̃
(but not p) at arbitrary points.
Typical example: Compute moments of a posterior

p( x | D ) = p( D | x ) p( x ) / ∫ p( D, x ) dx    as    E_{p( x| D)} ( x^n ) ≈ (1/S) ∑_s x_s^n with xs ∼ p( x | D )

• Rejection sampling is a primitive but exact method that works


with intractable models

• Importance sampling makes more efficient use of samples, but


can have high variance (and this may not be obvious)

• Producing exact samples is just as hard as high-dimensional inte-


gration. Thus, practical MC methods sample from an unnormal-
ized density p̃( x ) = Z · p( x )

• Even this, however, is hard, because it is hard to build a globally


useful approximation to the integrand
Markov Chain Monte Carlo v

The main issue of importance sampling is that the proxy distribu-


tion q( x ) needs to be a good approximation to p( x ) on its whole
domain, or in other words - everywhere. The idea behind Markov
Chain Monte Carlo methods is to generate samples by iteratively
building approximations of p that only need to be good locally.

Definition 22 (Markov Chains ). A joint distribution p( X ) over a


sequence of random variables X := [ x1 , . . . , x N ] is said to have the
Markov property if

p ( x i | x 1 , x 2 , . . . , x i −1 ) = p ( x i | x i −1 ).

The sequence is then called a Markov chain.

The Markov property can be interpreted as forgetfulness, since


all but the most recent ’chain links’ are discarded. To illustrate the
intuition behind Markov Chain Monte Carlo, consider the following
iterative algorithm to find the maximum of p( x );

• Draw a proposal x′ ∼ q( x′ | xt ) from a proposal distribution q.

• Compute a = p( x′ )/p( xt ).

• If a > 1, accept xt+1 = x′, else reject x′ and set xt+1 = xt .

This procedure only returns the maximum of p( x ) eventually, but


can be adapted to return a sample instead by tweaking the rules
of transition from xt to xt+1 , leading to the Metropolis-Hasting
method v ;

• Draw a proposal x′ ∼ q( x′ | xt ) from a proposal distribution q, for
example q( x′ | xt ) = N( x′ ; xt , σ² ).

• Compute a = ( p( x′ ) q( xt | x′ ) ) / ( p( xt ) q( x′ | xt ) ).

• If a > 1, accept xt+1 = x′.

• Otherwise, accept with probability a, and reject with probability
1 − a. The outcome can be decided by drawing uniformly from
[0, 1] and comparing it to a.

The Markov chain stays at the same place for one time period when
rejecting and the corresponding point will later show up at least
2 times. Usually, the proposal distribution is symmetric, such that
q( xt | x′ ) = q( x′ | xt ).
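A minimal sketch of this Metropolis-Hastings loop with a symmetric Gaussian proposal, so that the q-ratio cancels (our illustration; the unnormalized target, step size and chain length are arbitrary choices):

# Hypothetical Metropolis-Hastings sampler with a symmetric random-walk proposal.
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target density: two Gaussian bumps
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

sigma = 1.0                      # proposal width
x = 0.0                          # arbitrary starting point x_0
chain = []
for t in range(50_000):
    x_new = rng.normal(x, sigma)                   # proposal x' ~ q(x'|x_t)
    a = p_tilde(x_new) / p_tilde(x)                # acceptance ratio (symmetric q)
    if a > 1 or rng.uniform() < a:                 # accept with probability min(1, a)
        x = x_new                                  # otherwise the chain stays at x_t
    chain.append(x)

print(np.mean(chain))    # sample mean under the (unnormalized) target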

Using this method, the samples will spend more “time” in re-
gions where p( x ) is high (lower probability of sampling a better
proposition) and less “time” in regions where p( x ) is low (any
proposition would be good), but the algorithm can still visit regions
of low probability (see Fig. 13 for an example).

See chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH for a visualization of the Metropolis-Hastings algorithm in 2D created by Chi Feng.

Metropolis-Hasting draws samples from p( x ) in the limit
of infinite sampling steps. The proof sketch involves the existence
of a stationary distribution, which is a distribution that does not
change over time (anymore). For Markov Chains, its existence can
be shown through the detailed balance equation:

p( x ) T ( x → x′ ) = p( x′ ) T ( x′ → x ),

where T ( x → x′ ) is the probability of transitioning from x to x′;

p( x ) T ( x → x′ ) = p( x ) q( x′ | x ) min( 1, p( x′ ) q( x | x′ ) / ( p( x ) q( x′ | x ) ) )
                   = min( p( x ) q( x′ | x ), p( x′ ) q( x | x′ ) )
                   = p( x′ ) q( x | x′ ) min( p( x ) q( x′ | x ) / ( p( x′ ) q( x | x′ ) ), 1 )
                   = p( x′ ) T ( x′ → x ).

Markov Chains satisfying the detailed balance equation have at


least one stationary distribution:
∫ p( x ) T ( x → x′ ) dx = ∫ p( x′ ) T ( x′ → x ) dx = p( x′ ) ∫ T ( x′ → x ) dx = p( x′ ).

Uniqueness of the stationary distribution comes from ergodicity


of the sequence { xt }t∈N . The sequence created by the Metropolis-
Hasting algorithm fulfills this criterion by definition.

Definition 23 (Ergodicity). A sequence { xt }t∈N is called ergodic if


it

1. is aperiodic (contains no recurring sequence)

2. has positive recurrence: xt = x∗ implies there is a t′ > t such that
p( x_{t′} = x∗ ) > 0

Theorem 24 (Convergence of Metropolis-Hasting, simplified). If
q( x′ | xt ) > 0 ∀( x′ , xt ), then for any x0 , the density of { xt }t∈N approaches
p( x ) as t → ∞.

However, this is not a statement about the convergence rate. To get


an idea of the convergence rate, consider the case of sampling from
a d-dimensional Gaussian, where the largest and smallest eigenvalues
of the covariance matrix are L and ε, as shown in Fig. 14 for
two dimensions.

Figure 13: v Example of a MCMC execution with a Gaussian proposal distribution. Steps 1–5 and step 300. The sample distribution still appears not to be uniform and the Markov chain has not yet mixed 'perfectly'.

We have to set the width of q to be approximately ε, otherwise
the acceptance rate r will be too low. The Metropolis-Hastings will
do a random walk in two dimensions, and after t steps will have
moved a distance of approximately

√( E[ ‖ xt − x0 ‖² ] ) ≈ ε √(rt).

Therefore, to create one independent draw at a distance L of the
starting position, the Markov Chain has to run for at least

t ≈ (1/r) ( L / ε )²

steps. In practice (for example if the distribution has isolated islands,
such as a mixture of Gaussians) the situation can be much worse.

Figure 14: Metropolis-Hastings on a two-dimensional Gaussian

Gibbs Sampling v
This is a special case of Metropolis-Hastings. It employs the idea that sampling from a high-dimensional joint distribution is often difficult, while sampling from a one-dimensional conditional distribution is easier. So instead of directly sampling from the joint distribution p(x), the Gibbs sampler alternates between drawing from the respective conditional distributions $p(x_i \mid x_{j\neq i})$.
The page chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana provides a visualization of Gibbs Sampling created by Chi Feng.

procedure Gibbs(p(x))
    x_i ← rand()  ∀i        (initialize randomly)
    for t = 1, …, T do
        x_1^{(t+1)} ∼ p(x_1 | x_2^{(t)}, x_3^{(t)}, …, x_m^{(t)})
        x_2^{(t+1)} ∼ p(x_2 | x_1^{(t+1)}, x_3^{(t)}, …, x_m^{(t)})
        ⋮
        x_m^{(t+1)} ∼ p(x_m | x_1^{(t+1)}, x_2^{(t+1)}, …, x_{m−1}^{(t+1)})
    end for
end procedure

It generates an instance from the distribution of each variable in


turn, conditioning on the current values of the other variables.
Thus, Gibbs is useful when drawing from the joint is hard or infeasible, while drawing from the conditionals is more tractable. Although there are theoretical guarantees for the convergence of Gibbs Sampling, as it is a special form of Metropolis-Hastings, it is unknown how many iterations are needed to reach the stationary distribution. In fact, just as for Metropolis-Hastings, the state space is explored by a slow random walk, therefore needing many iterations in high-dimensional spaces.
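As a concrete illustration of the scheme above, the following sketch applies Gibbs sampling to a correlated bivariate Gaussian, where both conditionals are univariate Gaussians with a known closed form. The correlation value and chain length are arbitrary choices for illustration.

import numpy as np

def gibbs_bivariate_gaussian(rho=0.9, n_steps=5000, rng=None):
    # Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho.
    # For this target, x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2.
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_steps, 2))
    s = np.sqrt(1.0 - rho**2)   # conditional standard deviation
    for t in range(n_steps):
        x1 = rho * x2 + s * rng.standard_normal()   # draw from p(x1 | x2)
        x2 = rho * x1 + s * rng.standard_normal()   # draw from p(x2 | x1)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples.T))   # should be close to [[1, 0.9], [0.9, 1]]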

(Chi Feng also offers nice depictions of Hamiltonian Monte Carlo at chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,banana and of Hamiltonian Monte Carlo with NUTS at chi-feng.github.io/mcmc-demo/app.html#NaiveNUTS,banana.)

The study of MCMC methods is a field of its own, and more elaborate methods exist. Hamiltonian v, or Hybrid, Monte Carlo methods introduce momentum variables to reduce the diffusion, and require gradients of p. Several variations adapt

to the local shape of p (Riemannian MCMC), among which the No U-Turn Sampler (NUTS) v (https://arxiv.org/abs/1111.4246) is currently the gold standard for models allowing automatic differentiation. Another method, known as Slice Sampling, gives efficient (exponentially fast) exploration in one dimension and has almost no free parameters. However, this method suffers in high-dimensional settings.
In nontrivial situations, no Markov-Chain sampling method
(except exact sampling) gives exact finite-time bounds. Diagnostic
tricks exist, but are not flawless. However, MCMC approaches exact
answers (after an unknown time), whereas the other tools in our
toolboxes can only produce approximations.

Condensed content

• Markov Chain Monte Carlo circumvents building a globally


useful approximation and breaks down sampling into local
dynamics

• Moreover, it samples correctly in the asymptotic limit

• However, avoiding random walk behaviour requires careful


design, because the method will only converge well on the scale
in which the local models cover the global problem

• Hamiltonian MCMC methods (like NUTS) are currently among


the state of the art (sequential MC being an alternative):

– they require the solution of an ordinary differential equation


(the Hamiltonian dynamics)
– their hyperparameters are tuned using elaborate subroutines
– this is typical of all good numerical methods!

• These methods are available in software packages

Reminder: Monte Carlo methods converge stochastically. This


stochastic rate is an optimistic bound for MCMC, because it has
to be scaled by the mixing time (the time until the Markov chain is
"close" to its steady state distribution). Even though Monte Carlo
methods are a powerful and well-developed tool, they are most
likely not the final solution to integration.
Gaussian probability distributions v

Given that we have already introduced the basics of continuous


probabilities, let us now delve into arguably one of the most impor-
tant distributions defined over continuous spaces - the Gaussian
distribution. Due to its significance within probability theory, the
whole chapter should be considered as condensed content.

Definition 25 (Univariate Gaussian distribution). Parametrized by


two scalars, the mean µ and variance σ2 , the density function of the
univariate Gaussian distribution is
$$\mathcal{N}\big(x; \mu, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Figure 15: Univariate Gaussian probability distribution, here shown with µ = 3, σ² = 1.

More interesting, but more difficult to visualize on a 2D piece of paper, is the multivariate extension:

Definition 26 (Multivariate Gaussian distribution). Parametrized by a mean vector µ ∈ R^n and a positive definite covariance matrix Σ ∈ R^{n×n}, the density of the n-dimensional Gaussian is given by
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right).$$
(A symmetric matrix A ∈ R^{n×n} is positive definite if, for all vectors v ∈ R^n with v ≠ 0, v⊤Av > 0, and positive semi-definite if ≥ 0. Equivalently: all eigenvalues of A are positive, or non-negative for semi-definiteness.)
The expression |Σ| refers to the determinant of the covariance matrix Σ. As ∫ N(x; µ, σ²) dx = 1 and N(x; µ, σ²) > 0 ∀x, N is a well-defined probability measure.
Gaussians have useful properties v such as being symmetric in x and µ: N(x; µ, Σ) = N(µ; x, Σ), as well as being exponentiations

of a quadratic polynomial:
$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \frac{1}{2}x^\top\Lambda x\right) = \exp\left(a + \eta^\top x - \frac{1}{2}\operatorname{tr}(xx^\top\Lambda)\right)$$
with the natural parameters Λ = Σ⁻¹ (the precision matrix), η = Λµ, and the sufficient statistics x, xx⊤. The scaling of the normal distribution is incorporated in the constant a.
Furthermore, the equi-probability lines of Gaussians are ellipsoids (see Fig. 16). Those properties make it convenient to perform inference on Gaussian random variables, using the simple tools of linear algebra.
Figure 16: Two-dimensional Gaussian distribution.
The Gaussian is its own conjugate prior, meaning that given a Gaussian prior p(x) and a Gaussian likelihood p(y|x), the
posterior p(x|y) is also a Gaussian (see Fig. 17). For
$$p(x) = \mathcal{N}(x; \mu, \sigma^2) \quad\text{and}\quad p(y \mid x) = \mathcal{N}(y; x, \nu^2),$$
the posterior is given by
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{\int p(y \mid x)\,p(x)\,dx} = \mathcal{N}(x; m, s^2), \quad\text{where } m = \frac{\sigma^{-2}\mu + \nu^{-2}y}{\sigma^{-2} + \nu^{-2}} \text{ and } s^2 = \frac{1}{\sigma^{-2} + \nu^{-2}}.$$
The derivation of these expressions can be seen here v.
Figure 17: Gaussian Prior, Likelihood and Posterior.

Gaussians are closed under multiplication.
$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\cdot Z,$$
where C = (A⁻¹ + B⁻¹)⁻¹, c = C(A⁻¹a + B⁻¹b), and Z = N(a; b, A + B).

Gaussians are closed under linear projections. If x is a random variable distributed according to N(x; µ, Σ) and A is an arbitrary matrix (of according size), then Ax is distributed according to (see Fig. 18)
$$\mathcal{N}\big(Ax;\; A\mu,\; A\Sigma A^\top\big).$$
Figure 18: Linear projection of a Gaussian distributed random variable.

Gaussians are closed under marginalization. Marginalization is a special case of a linear projection: it is a projection with a matrix which has the entry 1 on its diagonal at the indices of the variables for which we construct the marginal, and 0s elsewhere.
Assuming that x, y, z are distributed according to
$$\mathcal{N}\left(\begin{pmatrix} x \\ y \\ z \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \\ \mu_z \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} & \Sigma_{xz} \\ \Sigma_{yx} & \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zx} & \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}\right),$$
then the projection given by
$$A_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
results in the marginal distribution for x, and e.g.
$$A_{yz} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
in the marginal distribution (integrating over x) in y and z. This property corresponds to the Sum rule,
$$\int p(x, y)\,dy = \int p(y\mid x)\,p(x)\,dy = p(x)\int p(y\mid x)\,dy = p(x).$$

Gaussians are closed under conditioning on scaled vectors, i.e.,
$$p(x \mid Ax = y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\Big(x;\; \mu + \Sigma A^\top(A\Sigma A^\top)^{-1}(y - A\mu),\; \Sigma - \Sigma A^\top(A\Sigma A^\top)^{-1}A\Sigma\Big).$$

Bayes' Theorem also leads to Gaussian variables, thanks to conditioning and marginalization.

Theorem 27 (Bayes' Theorem with Gaussians). If $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda)$, then
$$p(y) = \mathcal{N}\big(y;\; A\mu + b,\; \Lambda + A\Sigma A^\top\big)$$
$$p(x \mid y) = \mathcal{N}\Big(x;\; \mu + \underbrace{\Sigma A^\top(A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\underbrace{\big(y - (A\mu + b)\big)}_{\text{residual}},\; \Sigma - \Sigma A^\top(\underbrace{A\Sigma A^\top + \Lambda}_{\text{Gram matrix}})^{-1}A\Sigma\Big).$$
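In code, the update of Theorem 27 is a handful of numpy lines. The following sketch is illustrative; the toy dimensions and values in the usage example are arbitrary.

import numpy as np

def gaussian_bayes_update(mu, Sigma, A, b, Lam, y):
    # Posterior N(x; m, S) for prior N(x; mu, Sigma) and likelihood N(y; A x + b, Lam).
    Gram = A @ Sigma @ A.T + Lam                 # A Sigma A^T + Lambda
    gain = np.linalg.solve(Gram, A @ Sigma).T    # Sigma A^T Gram^{-1}, computed via a solve
    m = mu + gain @ (y - (A @ mu + b))           # mean: prior mean + gain * residual
    S = Sigma - gain @ A @ Sigma                 # covariance shrinks by the explained part
    return m, S

# usage on a toy problem: infer a 2D x from one noisy linear observation
mu, Sigma = np.zeros(2), np.eye(2)
A, b, Lam = np.array([[1.0, 2.0]]), np.zeros(1), np.array([[0.1]])
m, S = gaussian_bayes_update(mu, Sigma, A, b, Lam, np.array([3.0]))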

The Matrix Inversion Lemma


The computations outlined above require the inversion of an [N × N] matrix. If F ≪ N, it can be beneficial to perform the inversion in the [F × F] space, which is possible using the Matrix Inversion Lemma, also known as the Woodbury Matrix Identity (wikipedia.org/wiki/Woodbury_matrix_identity). In its general form, it states that
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1},$$


where A, U, C, V are matrices of size [N × N], [N × F], [F × F] and [F × N]. Applying those identities to the computation of the posterior, we get that
$$p(x \mid y) = \mathcal{N}\Big(x;\; \mu + \underbrace{\Sigma A^\top(A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\underbrace{\big(y - (A\mu + b)\big)}_{\text{residual}},\; \Sigma - \Sigma A^\top(\underbrace{A\Sigma A^\top + \Lambda}_{\text{Gram matrix}})^{-1}A\Sigma\Big)$$
$$= \mathcal{N}\Big(x;\; \big(\underbrace{\Sigma^{-1} + A^\top\Lambda^{-1}A}_{\text{precision matrix}}\big)^{-1}\big(A^\top\Lambda^{-1}(y - b) + \Sigma^{-1}\mu\big),\; \big(\underbrace{\Sigma^{-1} + A^\top\Lambda^{-1}A}_{\text{precision matrix}}\big)^{-1}\Big).$$
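A quick numerical sanity check that the two parametrizations agree; the random shapes and noise level below are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(0)
N, F = 50, 3                                   # many observations, few latent dimensions
A = rng.standard_normal((N, F))
Sigma, Lam = np.eye(F), 0.1 * np.eye(N)        # prior and noise covariances
mu, b = np.zeros(F), np.zeros(N)
y = rng.standard_normal(N)

# N x N form (gain / residual, as in Theorem 27)
Gram = A @ Sigma @ A.T + Lam
gain = np.linalg.solve(Gram, A @ Sigma).T
m1 = mu + gain @ (y - (A @ mu + b))
S1 = Sigma - gain @ A @ Sigma

# F x F "information" form obtained via the matrix inversion lemma
P = np.linalg.inv(Sigma) + A.T @ np.linalg.solve(Lam, A)   # posterior precision
S2 = np.linalg.inv(P)
m2 = S2 @ (A.T @ np.linalg.solve(Lam, y - b) + np.linalg.solve(Sigma, mu))

assert np.allclose(m1, m2) and np.allclose(S1, S2)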

Numerical Stability

Inverting matrices is very often subject to numerical instability, which happens when the matrices are close to singular (wikipedia.org/wiki/Invertible_matrix). If you want to compute x = A⁻¹b, Numpy will happily invert your matrix if it technically is non-singular, using

x = numpy.linalg.inv(A) @ b,

but the results might be nonsensical if it is close to singular. A better option is to ask Numpy to solve the system Ax = b with

x = numpy.linalg.solve(A, b),

which is more stable. If you need to compute the multiplication of A⁻¹ with multiple vectors or matrices, and the matrix A is positive definite (wikipedia.org/wiki/Positive-definite_matrix), you can pre-compute the Cholesky decomposition (wikipedia.org/wiki/Cholesky_decomposition) of A with

L = numpy.linalg.cholesky(A)

and use Scipy's routine to solve Ax = b systems using the Cholesky factor as a starting point,

x = scipy.linalg.cho_solve((L, True), b).

For added stability, if your matrix A is positive definite but close to positive semi-definite, you can try to compute the Cholesky of the slightly modified matrix A' = A + εI, where ε is a small constant, say 10⁻⁶, to ensure that the eigenvalues of A' stay above 0.
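Putting these calls together, a typical pattern looks like the following small sketch (the jitter value 1e-6 is the example constant from the text):

import numpy as np
from scipy.linalg import cho_solve

def solve_psd(A, B, jitter=1e-6):
    # Solve A X = B for symmetric positive (semi-)definite A via a Cholesky factor.
    A_stable = A + jitter * np.eye(A.shape[0])   # nudge eigenvalues away from zero
    L = np.linalg.cholesky(A_stable)             # A_stable = L @ L.T
    return cho_solve((L, True), B)               # True: L is lower-triangular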

Inferring (in)dependence

A neat property that Gaussian distributions have is that they also allow us to infer the (in)dependence between the variables. A zero off-diagonal element in the covariance matrix implies marginal independence:
$$[\Sigma]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j) = \mathcal{N}\big(x_i; [\mu]_i, [\Sigma]_{ii}\big)\cdot\mathcal{N}\big(x_j; [\mu]_j, [\Sigma]_{jj}\big).$$
An example for this property can be seen under v, applied to a model with fan-out structure. A zero off-diagonal element in the precision matrix implies independence conditioned on all other variables v:
$$[\Sigma^{-1}]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j \mid x_{\neq i,j}) = \mathcal{N}\big(x_i \mid x_{\neq i,j}\big)\cdot\mathcal{N}\big(x_j \mid x_{\neq i,j}\big).$$
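A small numerical illustration of the second statement, using a three-variable chain x₁ → x₂ → x₃ with unit innovations (the coefficients are arbitrary): the covariance matrix is dense, yet the precision matrix has a zero at the (1,3) entry, reflecting that x₁ and x₃ are independent given x₂.

import numpy as np

# Chain: x1 = e1, x2 = x1 + e2, x3 = x2 + e3, with standard normal e
B = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])        # x = B @ e
Sigma = B @ B.T                        # covariance: [Sigma]_13 != 0
Lambda = np.linalg.inv(Sigma)          # precision: [Lambda]_13 == 0 (up to rounding)
print(np.round(Sigma, 3))
print(np.round(Lambda, 3))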

Condensed content

• Gaussian distributions provide the linear algebra of inference.

– products of Gaussians are Gaussians


– linear maps of Gaussian variables are Gaussian variables
– marginals of Gaussians are Gaussians
– linear conditionals of Gaussians are Gaussians

• If all variables in a generative model are linearly related, and


the distributions of the parent variables are Gaussian, then all
conditionals, joints and marginals are Gaussian, with means and
covariances computable by linear algebra operations.

• A zero off-diagonal element in the covariance matrix implies


independence if all other variables are integrated out

• A zero off-diagonal element in the precision matrix implies


independence conditional on all other variables
Parametric Gaussian Regression v

In this chapter we look at how to use Gaussian distributions to


learn functions in the case of supervised learning.
Assume we have observation data X, y, representing a set of
input-output pairs, and we would like to learn the relation between
them: f ( x ) ≈ y. To learn f , we can assume that the outputs are
distributed according to
 
p(y| f ( x )) = N y; f ( x ), σ2 I .

Let us also assume y ∈ R^N, indicating that we have N independent samples. On the other hand, let X ∈ R^{N×D}, indicating a
D-dimensional observation for each of the N samples. For a starting
example, assume that the inputs are 1-dimensional, i.e. D = 1, as
illustrated in the example in Fig. 19.

Figure 19: Small dataset example.

For now, assume that f is a linear function with weights w1 , w2 :

f ( x ) = w1 + w2 x.

We introduce the abstraction feature mapping φ, giving the feature


vector φ_x:
$$\phi(x) = \phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix},$$

such that we can rewrite the function as f ( x ) = φx> w. The useful-


ness of this particular notation will become clearer soon.

Assuming the following prior on the weights, p(w) = N(w; µ, Σ),
will make our life easier, as the whole inference process becomes
computable using linear algebra. Fig. 20 shows an example of such a

prior on w, along with the resulting prior on the function f (stemming from the projection rule on Gaussians):
$$p(f) = \mathcal{N}\big(f_x;\; \phi_x^\top\mu,\; \phi_x^\top\Sigma\phi_x\big).$$
Figure 20: Gaussian prior on the weights v, along with the matching Gaussian prior on the function space.
To recap the notation, we have a dataset X = [ x1 , .., xn ], X ∈
X N , and an output dataset y ∈ R N , a function f ( x ) ∈ R, and a
feature vector for each data point (which we choose as φx = [1, x ]> ).
To make use of vectorization, we build the following feature matrix
containing the feature vectors
$$\phi_X = \begin{pmatrix} \phi_{x_1} & \dots & \phi_{x_N} \end{pmatrix} = \begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_N \end{pmatrix}.$$

Furthermore, we can think of the function f applied to the data X


as a vector:
$$f_X = f(X) = \phi_X^\top w = \begin{pmatrix} \phi_{x_1}^\top w \\ \vdots \\ \phi_{x_N}^\top w \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix}.$$

Using this notation, our aforementioned assumption on the distri-


bution of the outputs y yields the following likelihood:
   
$$p(y \mid w, \phi_X) = \mathcal{N}\big(y;\; f_X,\; \sigma^2 I\big) = \mathcal{N}\big(y;\; \phi_X^\top w,\; \sigma^2 I\big).$$

Finally, by applying Bayes’ rule on the prior of the weights p(w)


and the likelihood p(y|w, φX ), we can compute the posterior on the
weights using pure linear algebra:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \underbrace{\mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu)}_{\text{posterior mean}},\; \underbrace{\Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma}_{\text{posterior covariance}}\Big).$$

Similarly as before, the posterior on the weights p(w | y, φ_X) gives rise to the posterior on the function (Fig. 21):
$$p(f_x \mid y, \phi_X) = \mathcal{N}\Big(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_x\Big).$$
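The whole chain above fits in a few lines of numpy. The sketch below uses the linear features φ_x = [1, x]⊤ and arbitrary prior and noise values for illustration.

import numpy as np

def phi(x):
    # Feature map phi_x = [1, x]^T for a batch of scalar inputs; shape [F=2, N].
    x = np.atleast_1d(x)
    return np.vstack([np.ones_like(x), x])

def posterior(X, y, mu, Sigma, sigma2, x_test):
    # Gaussian posterior on f(x_test) for the general linear model f(x) = phi(x)^T w.
    PhiX, Phis = phi(X), phi(x_test)
    G = PhiX.T @ Sigma @ PhiX + sigma2 * np.eye(len(X))   # phi_X^T Sigma phi_X + sigma^2 I
    kXs = PhiX.T @ Sigma @ Phis                           # cross-covariances
    mean = Phis.T @ mu + kXs.T @ np.linalg.solve(G, y - PhiX.T @ mu)
    cov = Phis.T @ Sigma @ Phis - kXs.T @ np.linalg.solve(G, kXs)
    return mean, cov

X = np.array([-4.0, -1.0, 0.5, 3.0]); y = np.array([-2.1, -0.4, 0.7, 1.8])
mean, cov = posterior(X, y, mu=np.zeros(2), Sigma=np.eye(2), sigma2=0.1,
                      x_test=np.linspace(-8, 8, 100))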

Figure 21: Posterior on the weights, and the matching posterior on the function, after seeing several datapoints. The result of applying more datapoints can be seen here v.

Those equations can be difficult to digest at first glance, but do not be intimidated. We will first see some results those equations produce, but will return to them at the end of the chapter for more details.

Feature functions

So far, we have shown how to perform linear regression using a


feature vector φx = [1 x ]> . However, notice that the learning
process is not limited to just linear features of X. The process we
described is linear in the weights w, but not the features φx .
Instead of limiting ourselves to linear relationships, we can de-
fine a new feature function to capture second and third order poly-
nomial structure,
$$\phi(x) = \phi_x = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix}^\top,$$
or even up to n-th order polynomials of x,
$$\phi(x) = \phi_x = \begin{bmatrix} 1 & x & x^2 & x^3 & \dots & x^n \end{bmatrix}^\top.$$
Furthermore, one can use sines and cosines to get a Fourier regression,
$$\phi(x) = \phi_x = \begin{bmatrix} \cos(x) & \cos(2x) & \cos(3x) & \sin(x) & \sin(2x) & \sin(3x) \end{bmatrix}^\top.$$

Again, we are not limited to any particular set of features - any combination of step functions, Legendre or Laguerre polynomials, bell curves, or really anything you can think of, is possible.
Each of those feature functions gives rise to a different prior, along with a posterior on the function f. The characteristics of the different posteriors will differ, but the inference framework we described above remains unchanged. The choice of features is essentially unconstrained. This poses the question of how to choose good feature functions. In the next chapter, we will see that we can define a parametrized family of features, where the parameters controlling the features (called hyperparameters) can be optimized themselves. Even better, in the subsequent chapter, we will discover that, under certain circumstances, it is possible to use infinitely many features.
(The Jupyter notebook "Gaussian_Linear_Regression" allows you to try out different combinations of kernels and may help to understand the introduced abstraction of a feature mapping.)
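Since only φ changes, swapping feature families is a one-line change in code. For instance, a few of the families shown below can be written as (assuming the posterior routine from the sketch above; centers and frequencies are arbitrary):

import numpy as np

def phi_poly(x, degree=3):
    return np.vstack([np.atleast_1d(x)**k for k in range(degree + 1)])

def phi_fourier(x, freqs=(1, 2, 3)):
    x = np.atleast_1d(x)
    return np.vstack([np.cos(a * x) for a in freqs] + [np.sin(a * x) for a in freqs])

def phi_gauss_bumps(x, centers=np.linspace(-8, 8, 17)):
    x = np.atleast_1d(x)
    return np.exp(-(x[None, :] - np.asarray(centers)[:, None])**2)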

The following figures illustrate the effect that different feature


functions have on the resulting priors and posteriors of f . We
highly recommend to watch the amazing animations at v .

Figure 22: Non-linear, one-dimensional dataset.

The following feature choices are illustrated, each with the resulting prior and posterior on f:

• Cubic regression, using polynomials up to the third order, φ_x = [1 x x² x³]⊤.
• Fourier regression, using sines and cosines with different frequencies, e.g. φ_α(x) = [cos(αx) sin(αx)].
• Pixel regression, using functions of the form φ_α(x) = 1 if x ∈ [α, α + 1], −1 otherwise.
• V regression, using differently shifted absolute values of the form φ_α(x) = |x − α| − α.
• Eiffel Tower regression, using Laplace distributions with different locations, φ_α(x) = e^{−|x−α|}.
• Bell curve regression, using Gaussian distributions with different locations, φ_α(x) = e^{−(x−α)²}.

Condensed content

• Gaussian distributions can be used to learn functions

• Analytical inference is possible using general linear models

f(x) = φ(x)⊤w = φ_x⊤w

• Then the posterior on both w and f is Gaussian

• The choice of features φ : X → R is essentially unconstrained



More details on the posterior equation

In the notation we introduced, the posterior can be very unintuitive,


especially the posterior on function values. We give more details
here, in the hope of making the structure more visible.

Dimensionality

It is useful to look at the dimensionality of each variable to get a sense of the computation complexity. Assume we have a dataset with N data points and choose a set of F features to fit.

• The prior mean µ and prior covariance matrix Σ of the weights w yield the prior on the weights p(w) = N(w; µ, Σ), where µ is an F-dimensional vector and Σ an [F × F] matrix.
• The function vector f_X and noise matrix σ²I give the likelihood p(y) = N(y; f_X, σ²I), where f_X is an N-dimensional vector and σ²I an [N × N] (diagonal) matrix for N data points.
• The feature function generates an [F × N] matrix φ_X that links the [N × N] data space and the [F × F] feature space.

Fig. 23 and 24 illustrate the dimensionality of the computations required for the posterior of the weights and function values. Please keep in mind that the dimensionality of the posterior on the function is "flexible" in the sense that we can evaluate it for an arbitrary number of datapoints.

Figure 23: Weight posterior, with dimensional representation:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\Big).$$

Figure 24: Function posterior, with dimensional representation:
$$p(f_X \mid y, \phi_X) = \mathcal{N}\Big(f_X;\; \phi_X^\top\mu + \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_X^\top\Sigma\phi_X - \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_X\Big).$$

The computations outlined above require the inversion of [N × N] matrices. If F ≪ N, it can be beneficial to perform those inver-
sions in the [ F × F ] space, as we have seen in the previous chapter,
using the Matrix Inversion Lemma.
This identity gives the following two ways to compute the posteriors:

Figure 25: Matrix Inversion Lemma for the posterior on the weights:
$$p(w \mid y, \phi_X) = \mathcal{N}\Big(w;\; \mu + \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \Sigma - \Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\Big)$$
$$= \mathcal{N}\Big(w;\; \big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\big(\Sigma^{-1}\mu + \sigma^{-2}\phi_X y\big),\; \big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\Big).$$

Figure 26: Matrix Inversion Lemma for the posterior on the function values:
$$p(f_X \mid y, \phi_X) = \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top\mu + \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_X^\top\Sigma\phi_X - \phi_X^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_X\Big)$$
$$= \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top\big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\big(\Sigma^{-1}\mu + \sigma^{-2}\phi_X y\big),\; \phi_X^\top\big(\Sigma^{-1} + \sigma^{-2}\phi_X\phi_X^\top\big)^{-1}\phi_X\Big).$$

To reiterate, both approaches give the same solution, but come at


different computational cost.
Hierarchical Inference: learning the features v

We previously saw how to apply Bayesian inference to learn the pa-


rameters of a linear function. We also saw that we could adjust the
function being inferred by altering the choice of features. However,
the choice of features can be a daunting task, since any transfor-
mation could be picked. We will now see that it is possible to also
learn which features to use.

Hierarchical Bayesian Inference v

Searching an infinite-dimensional space of feature functions is difficult. Luckily, we can temporarily restrict ourselves to a finite-dimensional sub-space of a feature family characterized by some parameters, and search this subspace instead. Consider the family of feature functions parametrized by θ,
$$\phi(x; \theta) = \frac{1}{1 + \exp\left(-\frac{x - \theta_1}{\theta_2}\right)},$$
illustrated in Fig. 27. The number of feature functions is still infinite, as there is an infinite array of choices for (θ₁, θ₂), but the dimensionality of the parametrization is fixed.

Figure 27: Parametrized family of functions, φ(x; θ) = (1 + exp(−(x − θ₁)/θ₂))⁻¹. The parameter θ₁ controls the intercept (where on the x-axis the function value crosses the 1/2 point) and the parameter θ₂ controls the slope.

The parameters θ₁ and θ₂ can be treated as unknown parameters, just as the weights w. However, it is more difficult to infer them, since the likelihood function
$$p(y \mid w, \theta) = \mathcal{N}\big(y;\; \phi(x; \theta)^\top w,\; \sigma^2 I\big)$$
contains a non-linear mapping of θ. Due to this non-linearity, we cannot use the full Bayesian treatment through linear algebra operations for the Normal distribution. It is still technically possible to perform inference over θ, but the computational cost is very prohibitive. However, if θ is known, the distributions related to the weights are still linear combinations of Gaussians, and we can still use linear algebra to infer the weights. It would be nice to have an approximative solution for θ, but still do full inference on the weights w. This is where Maximum Likelihood (ML) and Maximum A-Posteriori (MAP) estimation come into play. Instead of integrating over θ, we can fit it by selecting the most likely values using
$$\theta^\star = \underbrace{\arg\max_\theta\, p(D \mid \theta)}_{\text{Maximum Likelihood}}, \qquad \theta^\star = \underbrace{\arg\max_\theta\, p(\theta \mid D)}_{\text{Maximum A-Posteriori}}.$$
Maximum A-Posteriori weighs the likelihood term by a prior on θ,
$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)} \propto p(D \mid \theta)\,p(\theta),$$
and maximizes the posterior w.r.t. θ (hence, "A-Posteriori").

The parameters of the linear model w will still get a full proba-
bilistic treatment and will be integrated out in the inference of the
posterior, but the parameters that select the features (also known
as hyper-parameters) θ, are too costly to properly infer and will get
fitted.
To get a better understanding where these expressions come from, notice that the evidence in our posterior for f,
$$p(f \mid y, x, \theta) = \frac{p(y \mid f, x, \theta)\,p(f \mid \theta)}{\int p(y \mid f, x, \theta)\,p(f \mid \theta)\,df} = \frac{p(y \mid f, x, \theta)\,p(f \mid \theta)}{\underbrace{p(y \mid x, \theta)}_{\text{the evidence}}},$$
becomes the likelihood for our posterior over θ,
$$p(\theta \mid y) = \frac{\overbrace{p(y \mid \theta)}^{\text{now a likelihood}}\,p(\theta)}{\int p(y \mid \theta')\,p(\theta')\,d\theta'},$$
which we want to maximize in both cases.

Maximum Likelihood in practice v

(Where we previously used p(y | f_X) = N(y; f_X, σ²I) for the likelihood, we now use Λ for the covariance matrix, as in p(y | f_X) = N(y; f_X, Λ), for the more general case.)

To find the "best fit" θ⋆, we have to solve arg max_θ p(y | w, X, θ). In order to make the problem easier, we can use a few tricks to transform p(y | w, X, θ). Based on the observation that a transformation g(p(y | w, X, θ)) does not change the maximization problem as long as g(x) > g(y) ⇔ x > y, we can see that:
1. Solving arg maxθ f ( x ) is equivalent to solving arg maxθ log f ( x ).

2. Solving arg maxθ f ( x ) is equivalent to solving arg minθ − f ( x ).

3. Solving arg maxθ f ( x ) + c for some c independent of θ is equiva-


lent to solving arg maxθ f ( x ).

This gives
$$\begin{aligned}
\theta^\star &= \arg\max_\theta\, p(y \mid w, X, \theta) \\
&\overset{(1)}{=} \arg\max_\theta\, \log p(y \mid w, X, \theta) \\
&\overset{(2)}{=} \arg\min_\theta\, -\log p(y \mid w, X, \theta) \\
&= \arg\min_\theta\, \frac{1}{2}(y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu) + \frac{1}{2}\log\det\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big) + \frac{N}{2}\log(2\pi) \\
&\overset{(3)}{=} \arg\min_\theta\, \frac{1}{2}\Big(\underbrace{(y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu)}_{\text{Square Error}} + \underbrace{\log\det\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)}_{\text{Model complexity}}\Big).
\end{aligned}$$

We can drop the term N/2 log(2π ) because it does not affect the
minimization.
To better see that the first term is a square error, it might be useful to rewrite it as
$$\left\|\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1/2}\big(y - \phi_X^{\theta\top}\mu\big)\right\|^2,$$
which is the squared error of the distance between φ_X^{θ⊤}µ and y, scaled by (the square-root of) the precision matrix. The Model Complexity term, log det(φ_X^{θ⊤}Σφ_X^θ + Λ), measures the "volume" of hypotheses covered by the joint Gaussian distribution.
The Model Complexity term, also called Occam's factor, adds a penalty for features that lead to a large hypothesis space. This is based on the principle that, everything kept equal, simpler explanations should be favored over more complex ones.
"Numquam ponenda est pluralitas sine necessitate." (Plurality must never be posited without necessity.) – William of Occam
The aforementioned minimization procedure tries to both:

• explain the observed data well – by making the resulting φ_X^{θ⊤}µ close to y

• keep the model complexity low

To get a visual intuition for how this abstract expression relates to


complexity, have a look at v .
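In code, the two terms of this loss read directly off the equation. The following is a minimal numpy sketch, assuming a feature matrix PhiX of shape [F, N] for the current θ, prior mean mu, prior covariance Sigma, and noise covariance Lam are available:

import numpy as np

def negative_log_evidence(PhiX, mu, Sigma, Lam, y):
    # -log p(y | theta), up to the constant N/2 log(2 pi): square error + Occam factor.
    G = PhiX.T @ Sigma @ PhiX + Lam          # phi_X^T Sigma phi_X + Lambda
    r = y - PhiX.T @ mu                      # residual
    sign, logdet = np.linalg.slogdet(G)      # numerically stable log-determinant
    square_error = r @ np.linalg.solve(G, r)
    return 0.5 * (square_error + logdet)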

It is important to note that by using Maximum Likelihood (or Maximum A-Posteriori) solutions, we do not capture the uncertainty on the hyper-parameters. However, they make it possible to get some solution about which features to use in a reasonable time, which would be intractable otherwise.
If you are worried about fitting or hand-picking features for Bayesian regression, remember that this also applies for deep learning, where we have to reason about the choice of activation functions. By highlighting assumptions and priors, the probabilistic view forces us to address this problem directly, rather than obscuring them with notation and intuitions.
(The usual way to train such networks, however, does not include the Occam factor. The method used here is often referred to as Type-II Maximum Likelihood, whereas neural networks typically use Type-I. The following reference contains more details on the application of those ideas to neural networks: MacKay. The Evidence Framework Applied to Classification Networks. Neural Computation, 1992.)
Connection to deep learning v

Up until this point, we haven't really talked about how to solve the minimization problem stemming from the Maximum Likelihood over the model's hyperparameters. Since the optimization problem doesn't have an analytical solution, we can turn to a very prominent tool commonly used in deep learning – Automatic Differentiation.
Figure 28: The computation graph for L(θ).
In general, Automatic Differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program (i.e. mathematical expression), no matter how complicated, executes a
sequence of elementary arithmetic operations (addition, subtrac-
tion, multiplication, division, etc.) and elementary functions (exp,
log, sin, cos, etc.). By applying the chain rule repeatedly to these
operations, derivatives of arbitrary order can be computed auto-
matically, accurately, and using at most a small constant factor more
arithmetic operations than the original program.
Looking back at our derived loss L(θ ), we split the computations

as follows:
$$L(\theta) = \frac{1}{2}\Big((y - \phi_X^{\theta\top}\mu)^\top\big(\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big)^{-1}(y - \phi_X^{\theta\top}\mu) + \log\big|\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big|\Big),$$
with intermediate quantities (cf. Fig. 28) Δ := y − φ_X^{θ⊤}µ, K := φ_X^{θ⊤}Σφ_X^θ + Λ, the solve G := K⁻¹Δ, the square error e := Δ⊤G and the complexity c := log|K|, so that L(θ) = ½(e + c),
with the appropriate computation graph (computer program) visualized in Fig. 28.
Given the computation graph, there are two common modes through which we could obtain the partial derivative ∂L/∂θ: forward and backward mode. The specifics of the two modes are covered in detail in the following article: wikipedia.org/wiki/Automatic_differentiation.

An important aspect to note is the computational complexity of


the two modes – in settings when the output (in our case – the loss
L) is lower-dimensional than the input (in our case – the hyperpa-
rameters θ), the backward mode is more efficient due to the order
in which we multiply the Jacobians.
Once we have established the tool to compute the gradient, we
could use any well known optimization routine, such as Gradient
Descent, in order to minimize the loss L(θ ).
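As a hedged sketch of what this looks like in practice, the following uses PyTorch as one possible autodiff framework, with a zero prior mean, an isotropic prior covariance exp(θ₀)·I, and Gaussian-bump features whose lengthscale is exp(θ₁); all of these modelling choices and constants are illustrative, not prescribed by the lecture.

import torch

def neg_log_evidence(theta, X, y, sigma=0.1):
    # theta[0]: log prior scale, theta[1]: log lengthscale of the feature bumps
    centers = torch.linspace(-8.0, 8.0, 20)
    Phi = torch.exp(-0.5 * (X[:, None] - centers[None, :])**2 / torch.exp(theta[1])**2)
    G = torch.exp(theta[0]) * Phi @ Phi.T + sigma**2 * torch.eye(len(X))
    r = y   # zero prior mean assumed, so the residual is just y
    return 0.5 * (r @ torch.linalg.solve(G, r) + torch.logdet(G))

X = torch.linspace(-8.0, 8.0, 30)
y = torch.sin(X) + 0.1 * torch.randn(30)
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = neg_log_evidence(theta, X, y)
    loss.backward()      # reverse-mode AD computes dL/dtheta
    opt.step()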

Gaussian Regression as a neural network


A linear Gaussian regression is a single hidden layer neural net-
work, with quadratic output loss and fixed input layer, where
hyper-parameter fitting corresponds to training the input layer –
see Fig. 29 for an illustration.

Figure 29: Graphical representation of Hierarchical Bayesian Linear Regression as a network: input x, parameters θ₁, …, θ₉, features [φ_x]₁, …, [φ_x]₉, weights w₁, …, w₉, and output y. The parameters controlling the features, θ, are learned from the data using Maximum Likelihood, in a similar fashion as a Neural Network, and full inference is carried over the weights of those features, w.

If we consider doing Bayesian inference on the weights of a deep


neural network, we might as well not integrate out the final layer’s
weights, as they might not be of particular importance, given the
complexity of the entire model.
Bayesian inference on the weights of this network results in a posterior p(w, θ | y), which is proportional (up to normalization) to p(y | w, φ^θ) p(w, θ). Under the assumptions that the observations y are independent given the model (w, θ), and that the observation noise is Gaussian distributed, we can derive the following:
$$p(y \mid w, \phi^\theta)\,p(w, \theta) = p(w, \theta)\cdot\prod_{i=1}^n p(y_i \mid w, \phi_i^\theta) = p(w, \theta)\cdot\prod_{i=1}^n \mathcal{N}\big(y_i;\; \phi_i^{\theta\top}w,\; \sigma^2\big)$$

As the hierarchical structure of the model makes computing the posterior increasingly difficult, we may settle for calculating a best 'guess' for our parameters w and θ:
$$\arg\max_{w,\theta}\, p(w, \theta \mid y) = \arg\min_{w,\theta}\, -\log p(w, \theta \mid y) = \arg\min_{w,\theta}\, -\log p(w, \theta) + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\top}w\big\|^2
$$

Further assuming that our prior for w and θ is also Gaussian and centered with unit covariance matrix, we get
$$\arg\max_{w,\theta}\, p(w, \theta \mid y) = \arg\min_{w,\theta}\, \sum_i w_i^2 + \sum_j \theta_j^2 + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\top}w\big\|^2$$
$$\approx \arg\min_{w,\theta}\, \underbrace{\sum_i w_i^2 + \sum_j \theta_j^2}_{r(\theta,w)} + \underbrace{\frac{n}{2\sigma^2 b}\sum_{\beta=1}^b \big\|y_\beta - \phi_\beta^{\theta\top}w\big\|^2}_{L(\theta,w)} \;\sim\; \mathcal{N}\big(r + L(\theta, w),\; O(b^{-1})\big).$$

This is an empirical risk minimization problem with quadratic em-


pirical risk L(θ, w) and a regularizer. Therefore, training this deep
neural network for a regression task using batches b for subsam-
pling, is a generalized least squares problem.
Automatic Differentiation (AD), Gradient Descent and data sub-
sampling (Stochastic Gradient Descent) are algorithmic tools that
are just as helpful for Bayesian inference as they are for deep learn-
ing. The two domains, deep learning and Bayesian inference, are
not separate – they are just different perspectives. It is possible to
construct a point estimate for a Bayesian model, as well as to con-
struct full posteriors for deep networks. The different viewpoints
(probabilistic and statistical/empirical) on Machine Learning of-
ten overlap and inform each other. Understanding Bayesian linear
(Gaussian) regression can help us build a better intuition for deep
learning as well.

Condensed content

• Parameters θ that affect the model should ideally be part of the


inference process. The model evidence
$$p(y \mid \theta) = \int p(y \mid f, \theta)\,p(f \mid \theta)\,df,$$
the denominator in Bayes' theorem
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta')\,p(\theta')\,d\theta'},$$
is the ("type-II" or "marginal") likelihood for θ.

• If analytic inference on θ is intractable (which it usually is), θ


can be fitted by “type-II” maximum likelihood or maximum
a-posteriori inference which fits a point-estimate for feature
parameters.

• Bayesian inference still has effects here because the marginal


likelihood gives rise to complexity penalties / Occam factors.

• The Occam factor


$$\log\big|\phi_X^{\theta\top}\Sigma\phi_X^\theta + \Lambda\big|$$

measures model complexity as the “volume” of hypotheses


covered by the joint Gaussian distribution.

• MAP inference is an optimization problem, and can be per-


formed in the same way as other optimization-based ML ap-
proaches, including deep learning. That is, using the same op-
timizers (e.g. stochastic gradient descent), the same automatic
differentiation frameworks (e.g. TensorFlow / pyTorch, etc.) and
the same data subsampling techniques.

• A linear Gaussian regressor corresponds to a single hidden layer


neural network, with quadratic output loss, and fixed input layer.
Hyperparameter-fitting corresponds to training the input layer.
The usual way to train such a network, however, does not include the Occam factor.

• It is possible to construct full posteriors for deep networks.


Gaussian Processes v

In the previous chapter, we have seen that it is possible to learn


which features to use from a parametric family. We will now delve
into Gaussian Processes which, instead of learning a fixed number
of features, can get away with using infinitely many features (in finite
time!). This does not mean that the model is infinitely complex.
Using N data points, the final posterior will still be described using
an N-dimensional vector for the mean and an [ N × N ] covariance
matrix. However, the number of features considered for the posterior
covariance will be infinite, and the number of features selected will
grow with the number of data points. This is an example of a non-
parametric model, where the complexity grows when more data is
added.

Mean function and Kernel v

Consider the posterior function value for a single data point x', f_{x'}, where the posterior is conditioned on a bigger dataset of inputs X and outputs y:
$$p(f_{x'} \mid y, \phi_X) = \mathcal{N}\Big(f_{x'};\; \phi_{x'}^\top\mu + \phi_{x'}^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}(y - \phi_X^\top\mu),\; \phi_{x'}^\top\Sigma\phi_{x'} - \phi_{x'}^\top\Sigma\phi_X\big(\phi_X^\top\Sigma\phi_X + \sigma^2 I\big)^{-1}\phi_X^\top\Sigma\phi_{x'}\Big).$$
We can use the following abstraction
$$\text{mean function: } m(x) = \phi_x^\top\mu, \quad m: \mathbb{X} \to \mathbb{R},$$
$$\text{covariance function (kernel): } K(a, b) = \phi_a^\top\Sigma\phi_b, \quad K: \mathbb{X}\times\mathbb{X} \to \mathbb{R},$$
to rewrite the posterior as
$$p(f_{x'} \mid y, \phi_X) = \mathcal{N}\big(f_{x'};\; m_{x'} + K_{x'X}(K_{XX} + \sigma^2 I)^{-1}(y - m_X),\; K_{x'x'} - K_{x'X}(K_{XX} + \sigma^2 I)^{-1}K_{Xx'}\big),$$
where m_a = φ_a⊤µ and K_{ab} = φ_a⊤Σφ_b. The feature vectors φ_X, φ_{x'} are hidden in the computation of the mean function and the kernel. We will see that for some models, it is not necessary to construct the (infinite) feature vectors to compute the posterior – the mean and kernel can be computed in closed form.
(To see how Gaussian Process inference can be implemented and how the introduced abstraction allows to hide the feature functions in the computation, take a look at the Jupyter notebook Gaussian_Process_Regression.ipynb.)
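A minimal GP regression sketch using this abstraction, with an RBF kernel and a zero mean function (the kernel parameters, noise level and test grid are illustrative choices):

import numpy as np

def rbf(a, b, scale=1.0, lengthscale=1.0):
    # Square-exponential kernel k(a, b), evaluated on all pairs of 1D inputs.
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def gp_posterior(X, y, x_test, sigma2=0.1, kernel=rbf):
    KXX = kernel(X, X) + sigma2 * np.eye(len(X))
    KsX = kernel(x_test, X)
    Kss = kernel(x_test, x_test)
    alpha = np.linalg.solve(KXX, y)              # (K_XX + sigma^2 I)^{-1} y
    mean = KsX @ alpha                           # zero prior mean assumed
    cov = Kss - KsX @ np.linalg.solve(KXX, KsX.T)
    return mean, cov

X = np.array([-4.0, -1.0, 0.5, 3.0]); y = np.array([-2.1, -0.4, 0.7, 1.8])
mean, cov = gp_posterior(X, y, np.linspace(-8, 8, 200))
std = np.sqrt(np.diag(cov))                      # pointwise marginal uncertainty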

More features might mean cheaper computation v

For simplicity, fix the prior covariance to be independent with covariance matrix Σ = σ²(c_max − c_min)/F · I, where F is the number of features and c_max > c_min are constants. Now, assume that we are trying to learn features of the form
$$\phi(x, c_\ell) = \exp\left(-\frac{(x - c_\ell)^2}{2\lambda^2}\right),$$
with parameters c₁ < … < c_F in [c_min, c_max]. The kernel can then be written as
$$\phi_a^\top\Sigma\phi_b = \sigma^2\,\frac{c_{\max} - c_{\min}}{F}\sum_{\ell=1}^F \exp\left(-\frac{(a - c_\ell)^2}{2\lambda^2}\right)\exp\left(-\frac{(b - c_\ell)^2}{2\lambda^2}\right) = \sigma^2\,\frac{c_{\max} - c_{\min}}{F}\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right)\sum_{\ell=1}^F \exp\left(-\frac{\big(c_\ell - \tfrac{1}{2}(a + b)\big)^2}{\lambda^2}\right).$$
If we increase the number of features towards ∞, the number of features per unit dc approaches F/(c_max − c_min) dc and
$$\lim_{F\to\infty}\phi_a^\top\Sigma\phi_b = \sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right)\int_{c_{\min}}^{c_{\max}}\exp\left(-\frac{\big(c - \tfrac{1}{2}(a + b)\big)^2}{\lambda^2}\right)dc.$$
The part inside the integral is an unnormalized Gaussian probability distribution function. Further taking the limit as c_min → −∞ and c_max → ∞, the integral converges to the normalization factor √π λ:
$$K_{ab} := \lim_{\substack{F\to\infty,\; c_{\max}\to\infty,\; c_{\min}\to-\infty}} \phi_a^\top\Sigma\phi_b = \sqrt{\pi}\,\lambda\,\sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right).$$

This specific kernel is known as a Radial Basis Function, or Square(d)-


exponential Kernel.

Definition 28 (Positive Definite/Mercer Kernels v ). K : X × X →


R is a positive definite Kernel, or Mercer Kernel, if for any finite
collection X = [ x1 , . . . , xn ] the matrix KXX ∈ R N × N constructed
from
[KXX ]ij = K ( xi , x j )
is positive semi-definite (wikipedia.org/wiki/Positive-definite_matrix).

Definition 29 (Positive Definite). A matrix A ∈ R^{N×N} is positive (semi-)definite if, for any x ∈ R^N, x ≠ 0,
$$\underbrace{x^\top A x > 0}_{\text{positive definite}}, \qquad \underbrace{x^\top A x \geq 0}_{\text{positive semi-definite}}.$$

Equivalently, A is positive (semi-)definite if

• All its eigenvalues are positive (non-negative for semi-definite).

• It is a Gram matrix - the outer product of N vectors [φi ]i=1,...,N -


and has full rank (not necessary for semi-definite).

Visualizing kernels

The following figure shows the prior for different feature functions
and how increasing the number of features (by taking the limit
towards infinity) leads to a kernel. The final posteriors are inferred
from the dataset introduced in Fig. 22, Page 48

Visualizing: Radial Basis Functions

The Radial Basis Function v, or square(d)-exponential kernel, is generated from functions of the form
$$\phi_\ell(x) = \exp\left(-\frac{(x - c_\ell)^2}{2\lambda^2}\right),$$
and its limit is given by
$$K(a, b) = \sqrt{\pi}\,\lambda\,\sigma^2\exp\left(-\frac{(a - b)^2}{4\lambda^2}\right).$$

Visualizing: Wiener process

The Wiener process is defined from step functions that start at 0 and switch to 1 at some threshold c_ℓ,
$$\phi_\ell(x) = \begin{cases} 1 & \text{if } x \geq c_\ell, \\ 0 & \text{otherwise}, \end{cases}$$
and converges to
$$K(a, b) = \sigma^2\big(\min(a, b) - c_0\big).$$
The derivation can be found here v.

Cubic Splines

The cubic splines start similarly to the Wiener process as threshold functions, but they keep increasing linearly after being activated, like ReLU activation functions,
$$\phi_\ell(x) = \begin{cases} x - c_\ell & \text{if } x \geq c_\ell, \\ 0 & \text{otherwise}, \end{cases}$$
and converge to
$$K(a, b) = \sigma^2\left(\frac{1}{3}\min(a - c_0, b - c_0)^3 + \frac{1}{2}|a - b|\min(a - c_0, b - c_0)^2\right).$$
Its derivation can be seen here v.

Kernels can also be combined to form new kernels.

Theorem 30. If K1 , K2 are Mercer kernels from X × X → R and φ


is a mapping from Y → X, then the following functions are also
Mercer kernels (up to minor regularity assumption);

• Scaling: αK1 ( a, b) for α ∈ R+ , a, b ∈ X.

• Kernel Addition: K1 ( a, b) + K2 ( a, b), for a, b ∈ X.

• Change of representation: K1 (φ( a), φ(b)), for a, b ∈ Y.

• Kernel Multiplication: K1 ( a, b)K2 ( a, b), for a, b ∈ X.

The first two properties should be easy to prove from the properties of positive semi-definite matrices. The third property is the result of Mercer's Theorem (wikipedia.org/wiki/Mercer's_theorem), which we will cover in the next chapter, and the last property is the result of the Schur Product Theorem (wikipedia.org/wiki/Schur_product_theorem). Its proof is involved and is the result of the fact that the Hadamard product (wikipedia.org/wiki/Hadamard_product_(matrices)) of two positive semi-definite matrices is positive semi-definite. The following figures show some examples of transformed kernels. The animations can also be viewed here v.
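In code, these closure rules are literally function combinators. A small sketch, reusing the rbf helper from the GP sketch earlier (the composed example at the end is only an illustration):

def scale(alpha, k):
    return lambda a, b: alpha * k(a, b)

def add(k1, k2):
    return lambda a, b: k1(a, b) + k2(a, b)

def multiply(k1, k2):
    return lambda a, b: k1(a, b) * k2(a, b)

def warp(k, phi):
    return lambda a, b: k(phi(a), phi(b))

# e.g. a wigglier kernel on warped inputs plus a broad, scaled RBF trend component
k_custom = add(warp(rbf, lambda x: ((x + 8) / 5)**3), scale(20.0, rbf))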

• A simple RBF kernel, as a reference point: K(a, b) = exp(−(a − b)²/2).
• Different scaling for the RBF kernel, to affect the effect of noise: K(a, b) = 10² · exp(−(a − b)²/2).
• Transforming the inputs of an RBF kernel, to change the width of the bumps, here K(a, b) = 20 · exp(−(a − b)²/(2 · 5²)).
• Another transformation of the inputs of an RBF kernel, here the width of the bumps is reduced with K(a, b) = 20 · exp(−(a − b)²/(2 · 0.5²)).
• A non-linear transformation of the inputs of an RBF kernel, here using K(a, b) = 20 · exp(−(φ(a) − φ(b))²/2) with φ(x) = ((x + 8)/5)³.
• A sum of kernels, here using a simple RBF kernel and a quadratic feature function: K(a, b) = 20 · exp(−(a − b)²/2) + φ(a)⊤φ(b), with φ(x) = [1 x x²]⊤.

We can also learn kernels using similar techniques as we


have seen before, by parametrizing the mean and/or the kernels:
$$m_\theta(x) = \phi(x)^\top\theta, \qquad K_\theta(a, b) = \theta_1\exp\left(-\frac{(a - b)^2}{2\theta_2^2}\right).$$

We can use Maximum Likelihood, as we saw for learning features.


However, in this case, the number of parameters we need to fit to
learn the kernel is typically less than the number of parameters
we need for learning the features as described in the last chapter.
To learn the features, we needed to put some parameters for each
feature to be learned. The kernel already takes care of selecting
individual features, so we only need a parameter to specify the
family of features to pick from.

Gaussian Processes are defined as follows:

Definition 31 (Gaussian Process). Let µ : X → R be any function and K : X × X → R be a Mercer kernel. A Gaussian Process p(f) = GP(f; µ, K) is a probability distribution over the function f : X → R such that every finite restriction to function values f_X = [f_{x_1}, …, f_{x_n}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, K_XX).

A Gaussian Process is a prior over function values, to be used in Bayesian inference with a likelihood in order to get a posterior. Given a prior p(f) = GP(f; m, K) and a likelihood p(y | f) = N(y; f_X, Λ), the posterior is given by
$$p(f \mid y) = \mathcal{GP}\big(f_{x'};\; m_{x'} + K_{x'X}(K_{XX} + \Lambda)^{-1}(y - m_X),\; K_{x'x'} - K_{x'X}(K_{XX} + \Lambda)^{-1}K_{Xx'}\big).$$
The term Gaussian does not refer to the use of a specific kernel,
but to the probability distribution over the function values. There
are many kernels, and new ones can be constructed from combina-
tions of known ones. As we have seen, they can be learned just like
features using MAP or numerical integration.

Condensed content

• Prominent examples for kernels:
  – k(a, b) = exp(−(a − b)²) (Gaussian / Square Exponential / RBF kernel)
  – k(a, b) = min(a − t₀, b − t₀) (Wiener process)
  – k(a, b) = ⅓ min³(a − t₀, b − t₀) + ½ |a − b| · min²(a − t₀, b − t₀) (cubic spline kernel)
  – k(a, b) = (2/π) sin⁻¹( 2a⊤b / √((1 + 2a⊤a)(1 + 2b⊤b)) ) (Neural Network kernel, Williams 1998)

Gaussian Processes

• Sometimes it is possible to consider infinitely many features at


once, by extending from a sum to an integral. This requires some
regularity assumption about the features’ locations, shape, etc.

• The resulting nonparametric model is known as a Gaussian


process

• Inference in GPs is tractable (though at polynomial cost O( N 3 )


in the number N of datapoints)

• Gaussian processes are an extremely important basic model for


supervised regression. They have been used in many practical
applications, but also provide a theoretical foundation for more
complicated models, e.g. in classification, unsupervised learning,
deep learning.
Understanding Kernels v

Warning: The following are simplified expositions! Some regularity assumptions have been dropped for easier readability. For the full story of the relationship of GPs and kernel methods, check out: Kanagawa, Hennig, Sejdinovic, and Sriperumbudur. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. CoRR, 2018. URL arxiv.org/abs/1807.02582. For a deeper treatment, you can check the following: Schölkopf and Smola. Learning with Kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002; Rasmussen and Williams. Gaussian processes for machine learning. MIT Press, 2006.

Gaussian Processes and Kernel Methods are tightly related concepts. The goal of this chapter is to show the connections between these two methods as representatives of the two dominating schools of thought in Machine Learning – the probabilistic and the statistical view. In order to achieve this goal, we will cover the following three questions:

• Can kernels be thought of as "infinitely large matrices"?
• What is the connection between kernel machines and Gaussian Processes?
• If Gaussian Processes use infinitely many features, can they learn every function?

Kernels as infinitely large matrices v

Let us start things off with a couple of Linear Algebra refreshers.

Definition 32 (Eigenvalue). Let A ∈ Rn×n be a matrix. A scalar


λ ∈ C and vector v ∈ Cn are called eigenvalue and corresponding
eigenvector if
$$[Av]_i = \sum_{j=1}^{n}[A]_{ij}[v]_j = \lambda[v]_i.$$

Theorem 33 (Spectral theorem for symmetric positive-definite ma-


trices). The eigenvectors of symmetric matrices (A = A| ) are real,
and form the basis of the image of A. A symmetric positive def-
inite matrix A can be written as a Gramian (outer product) of the
eigenvectors:
$$[A]_{ij} = \sum_{a=1}^{n}\lambda_a[v_a]_i[v_a]_j \quad\text{and}\quad \lambda_a > 0\;\;\forall a = 1, \dots, n.$$

Now, the question is: can we somehow generalize these two


statements to our objects of interest – the kernels? Since the kernels
inherently embody a feature transformation of the inputs from a
finite to an infinitely-dimensional space, we turn to a generalization
from eigenvectors (which are generally finite), to eigenfunctions
(which can be thought of as infinitely long vectors).

Definition 34 (Eigenfunction). A function φ : X → R and a scalar


λ ∈ C that obey
$$\int k(x, \tilde{x})\,\phi(\tilde{x})\,d\nu(\tilde{x}) = \lambda\,\phi(x)$$

are called an eigenfunction and an eigenvalue of k with respect to ν.

Theorem 35 (Mercer, 1909). Let (X, ν) be a finite measure space and


k : X × X → R be a continuous (Mercer) kernel. Then, there exist
eigenvalues/functions (λi , φi )i∈ I w.r.t. ν such that I is countable,
all λi are real and non-negative, the eigenfunctions can be made
orthonormal, and the following series converges absolutely and
uniformly ν2 -almost-everywhere:
$$k(a, b) = \sum_{i\in I}\lambda_i\,\phi_i(a)\,\phi_i(b) \qquad \forall a, b \in X.$$

In short, kernels have eigenfunctions, just like matrices have


eigenvectors. Moreover, Mercer’s theorem states that the eigenfunc-
tions generate the kernels.
In the sense of Mercer’s theorem, one may vaguely think of a
kernel k : X × X → R evaluated at k( a, b) for a, b ∈ X as the
“element” of an “infinitely large” matrix k ab .
However, notice that this interpretation is only relative to the
measure ν : X → R – by changing the measure, we obtain different
eigenfunctions for the kernel.
Lastly, notice that Mercer’s Theorem is not a constructive state-
ment – it simply states that the kernel could be decomposed through
the eigenfunctions, but doesn’t say anything about how to explicitly
compute these objects.
In general, it is not straightforward to find the eigenfunctions.
However, in the special case of stationary kernels, Salomon Bochner
showed that it is in fact possible to determine the eigenfunctions.

Definition 36 (Stationary kernel). A kernel k ( a, b) is called station-


ary if it can be written as

k ( a, b) = k(τ ) with τ := a − b

Theorem 37 (Bochner’s theorem (simplified)). A complex-valued


function k on RD is the covariance function of a weakly stationary
mean square continuous complex-valued random process on RD if,
and only if, its Fourier transform is a probability (i.e. finite positive)
measure µ:
$$k(\tau) = \int_{\mathbb{R}^D} e^{2\pi i s^\top\tau}\,d\mu(s) = \int_{\mathbb{R}^D} e^{2\pi i s^\top a}\left(e^{2\pi i s^\top b}\right)^*\,d\mu(s).$$

By thinking of the product between $e^{2\pi i s^\top a}$ and $\left(e^{2\pi i s^\top b}\right)^*$ as an outer product of orthonormal basis functions, we can start to see
an outer product of orthonormal basis functions, we can start to see
the realization of Mercer’s theorem. If we ignore a few caveats (in
particular, Mercer’s theorem talks about countable sets, whereas
Bochner’s theorem is related to uncountable sets), we can inter-
pret these Fourier functions as the eigenfunctions of the stationary
kernel.

This crucial insight has been used to perform a linear-time approx-


imation to Gaussian Process Regression (Rahimi & Recht, NeurIPS
2008), which in the general case is cubic in the number of data
points N.
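A hedged sketch of that construction, using random Fourier features for the square-exponential kernel (the number of features and the lengthscale are arbitrary illustration choices):

import numpy as np

def random_fourier_features(x, n_features=200, lengthscale=1.0, rng=None):
    # Monte Carlo approximation of the RBF kernel: k(a, b) ~ z(a) @ z(b).
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.atleast_1d(x)
    omega = rng.standard_normal(n_features) / lengthscale    # samples from the spectral (Bochner) measure
    tau = rng.uniform(0.0, 2 * np.pi, n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(x[:, None] * omega[None, :] + tau[None, :])

a = np.linspace(-3, 3, 5)
Z = random_fourier_features(a)
K_approx = Z @ Z.T     # approximates exp(-(a_i - a_j)^2 / (2 lengthscale^2))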

Connection between Kernel Machines and Gaussian Processes v

Many different methods are equivalent or closely related to Gaus-


sian process regression, such as: Kriging (in geosciences), Kernel
ridge regression, Wiener–Kolmogorov prediction, or even Linear
Least-Squares. In this section, for a moment we want to step out of
the probabilistic framework, and observe how these methods relate
to Gaussian Processes.

The Gaussian Process Posterior Mean is a Kernel ridge estimate


In the previous chapters we saw that the posterior over function
values takes the following form:

$$p(f_x \mid y) = \frac{p(y \mid f_X)\,p(f)}{p(y)} = \frac{\mathcal{N}(y; f_X, \sigma^2 I)\,\mathcal{GP}(f_{x,X}; m, k)}{\mathcal{N}(y; m_X, k_{XX} + \sigma^2 I)} = \mathcal{GP}\big(f_x;\; m_x + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(y - m_X),\; k_{xx} - k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx}\big).$$
We can obtain the expectation (i.e. the mean in the Gaussian setting) of the functions at the explicit locations X by maximizing the posterior:
$$\mathbb{E}_{p(f_X\mid y)}(f_X) = \arg\max_{f_X\in\mathbb{R}^{|X|}} p(f_X \mid y) = \arg\min_{f_X} -p(f_X \mid y) = \arg\min_{f_X} -\log p(f_X \mid y) = \arg\min_{f_X} \frac{1}{2\sigma^2}\|y - f_X\|^2 + \frac{1}{2}\|f_X - m_X\|_k^2, \quad\text{where } \|f_X\|_k^2 := f_X^\top k_{XX}^{-1}f_X.$$

This leads us to the following realization: the posterior mean esti-


mator of Gaussian (process) regression is equal to the regularized
least-squares estimate with the regularizer ‖f‖²_k. This is also known
as the kernel ridge estimate.
Now, let’s dive even deeper into this relationship between the
statistical and probabilistic schools of thought, by first introducing
the definition on Reproducing Kernel Hilbert Spaces.

Definition 38 (Reproducing Kernel Hilbert Space (RKHS)). Let


H = (X, ⟨·, ·⟩) be a Hilbert space of functions f : X → R. Then H is
called a Reproducing Kernel Hilbert Space if there exists a kernel
k : X × X → R s.t.

1. ∀ x ∈ X : k(·, x ) ∈ H

2. ∀ f ∈ H : ⟨f(·), k(·, x)⟩_H = f(x) (k reproduces H)

Nevertheless, this is a slightly abstract definition. A more useful


viewpoint for our purpose is the following theorem:

Theorem 39 (Reproducing kernel map representation). Let X, ν, (φi , λi )i∈ I


be defined as before. Let ( xi )i∈ I ⊂ X be a countable collection of
points in X. Then the RKHS can also be written as the space of
linear combinations of kernel functions:
$$\mathcal{H}_k = \left\{ f(x) := \sum_{i\in I}\tilde{\alpha}_i\,k(x_i, x)\right\} \quad\text{with}\quad \langle f, g\rangle_{\mathcal{H}_k} := \sum_{i\in I}\frac{\tilde{\alpha}_i\tilde{\beta}_i}{k(x_i, x_i)}$$

Given the above theorem, consider the Gaussian process p( f ) =


GP (0, k) with likelihood p(y | f , X ) = N (y; f X , σ2 I ). Then, the
RKHS is the space of all possible posterior mean functions

$$\mu(x) = k_{xX}\underbrace{(k_{XX} + \sigma^2 I)^{-1}y}_{:=w} = \sum_{i=1}^n w_i\,k(x, x_i) \quad\text{for } n\in\mathbb{N}.$$

This means that we can think of the RKHS as the space that is
spanned by the posterior mean functions of GP regression.

Figure 30: Posterior mean of a GP with RBF kernel, represented as a sum of individual Gaussians, centered at the observed points, scaled by their posterior weights.

Given a particular dataset, we can use the reproducing kernel map representation to express the posterior mean as a sum of functions (see Fig. 30). Now, let's formally state our findings:

Theorem 40 (The Kernel Ridge Estimate). Consider the model


p( f ) = GP ( f ; 0, k), p(y | f ) = N (y; f X , σ2 I ). The posterior mean

m( x ) = k xX (k XX + σ2 I )−1 y

is the element of the RKHS H_k that minimizes the regularized ℓ₂ loss
$$L(f) = \frac{1}{\sigma^2}\sum_i\big(f(x_i) - y_i\big)^2 + \|f\|_{\mathcal{H}_k}^2.$$

GP’s expected square error is the RKHS’s worst case square error
At this point, one might say that the Bayesian and the frequentist
viewpoint are roughly the same. However, notice that when we talk
about the posterior distribution of a Gaussian process regression,
we are not just talking about the posterior mean (the mode), but we
also consider the width – thus encapsulating the entire probability
distribution. This ability to quantify uncertainty is often seen as the
main selling point of the probabilistic framework – by keeping track
of the remaining volume of hypotheses, we can be certain about
our estimate. A natural question that arises is if there is a statistical
interpretation of the posterior variance in the Gaussian process
framework.
For a moment, suppose that we have noise-free observations.
Given this assumption, let’s observe how far the posterior mean
could be from the truth in a given RKHS:
 2
2  −1 
sup m( x ) − f ( x ) = sup ∑ f ( xi ) [KXX k( X, x )]i − f ( x )
f ∈H,k f k≤1 f ∈H,k f k≤1 i | {z }
wi
* +2
reproducing property: = sup ∑ wi k(·, xi ) − k(·, x), f (·)
i H
2


Cauchy-Schwartz: (|h a, bi| ≤ k ak · kbk) = ∑ wi k(·, xi ) − k(·, x )
i
H
reproducing property: = ∑ wi w j k( xi , x j ) − 2 ∑ wi k( x, xi ) + k( x, x )
ij i
−1
= k xx − k xX KXX k Xx = E|y [( f x − µ x )2 ]

which is exactly the posterior variance of the Gaussian process. Let


us further formalize this:

Theorem 41. Assume p( f ) = GP ( f ; 0, k) and noise-free observa-


tions p(y | f ) = δ(y − f X ). The GP posterior variance (the expected
square error)

v(x) := E_{p(f|y)}[(f(x) − m(x))²] = k_xx − k_xX K_XX^{−1} k_Xx

is a worst-case bound on the divergence between m( x ) and an RKHS


element of bounded norm:

v( x ) = sup (m( x ) − f ( x ))2


f ∈Hk ,k f k≤1

To reiterate, the GP’s expected square error is the RKHS’s worst-


case square error for a bounded norm.

Samples from the posterior GP are not in the RKHS


In the third aspect of our comparison between the probabilistic and
statistical frameworks, we look at the samples from the Gaussian
process posterior. For that purpose, let us introduce yet a third
representation of the RKHS, in terms of eigenfunctions:

Theorem 42 (Mercer Representation). Let X be a compact metric


space, k be a continuous kernel on X, ν be a finite Borel measure
whose support is X. Let (φi , λi )i∈ I be the eigenfunctions and values
of k w.r.t. ν. Then the RKHS Hk is given by
H_k = { f(x) := ∑_{i∈I} α_i λ_i^{1/2} φ_i(x)  such that  ‖f‖²_{H_k} := ∑_{i∈I} α_i² < ∞ }   with   ⟨f, g⟩_{H_k} := ∑_{i∈I} α_i β_i

for f = ∑_{i∈I} α_i λ_i^{1/2} φ_i and g = ∑_{i∈I} β_i λ_i^{1/2} φ_i.
(A compact space, simplified, is a space that is both bounded (all points have finite distance from each other) and closed (it contains all limits). wikipedia.org/wiki/Compact_space)

Furthermore, let us introduce one more theorem that will allow us to sample from the Gaussian Process:
Theorem 43 (Karhunen-Loève Expansion). Let X be a compact metric space, k : X × X → R a continuous kernel, ν a finite Borel measure whose support is X, and (φ_i, λ_i)_{i∈I} as above. Let (z_i)_{i∈I} be a collection of iid. standard Gaussian random variables:

z_i ∼ N(0, 1) and E[z_i z_j] = δ_ij, for i, j ∈ I.

Then (simplified!):

f(x) = ∑_{i∈I} z_i λ_i^{1/2} φ_i(x) ∼ GP(0, k).

Using these two statements, we obtain the following crucial


result:

Corollary 44 (Wahba, 1990. Proper proof in Kanagawa et al., Thm. 4.9).


If I is infinite, f ∼ GP (0, k ) implies almost surely f 6∈ Hk .
To see this, note
E(‖f‖²_{H_k}) = E( ∑_{i∈I} z_i² ) = ∑_{i∈I} E[z_i²] = ∑_{i∈I} 1 ≮ ∞

To reiterate, even though we managed to (seemingly) write down


the samples from the Gaussian Process in the Mercer Representa-
tion, the corollary proves that they are in fact not in the RKHS.
At this point, a natural question that arises is if there is a way to
modify the RKHS space such that it will include the samples from
the Gaussian Process. The following theorem states the sufficient
conditions for this to apply:

Theorem 45 (Kanagawa, 2018. Restricted from Steinwart, 2017, itself


generalized from Driscoll, 1973). Let Hk be a RKHS and 0 < θ ≤ 1.
Consider the θ-power of Hk given by
H_k^θ = { f(x) := ∑_{i∈I} α_i λ_i^{θ/2} φ_i(x)  such that  ‖f‖²_{H_k^θ} := ∑_{i∈I} α_i² < ∞ }   with   ⟨f, g⟩_{H_k^θ} := ∑_{i∈I} α_i β_i.

Then,

∑_{i∈I} λ_i^{1−θ} < ∞   ⇒   f ∼ GP(0, k) ∈ H_k^θ with prob. 1

Can Gaussian Process regressors learn every function? v

Kernels for which the RKHS lies dense in the space of all continu-
ous functions are known as universal kernels. One such example is
the square-exponential (also known as Gaussian, or RBF kernel):
k(a, b) = exp( −(a − b)²/2 )
When using such kernels for GP/kernel-ridge regression, for any
continuous function f and ε > 0, there is an RKHS element f̂ ∈ H_k
such that ‖f − f̂‖ < ε (where ‖·‖ is the maximum norm on a
compact subset of X).

[Figure 31: Prior for the Gaussian Process regression which is tasked with learning the target function (colored in black).]

However, notice that the above statement doesn’t say anything


about the rate of convergence. Let’s illustrate this problem with an
example. Starting from our prior (see Fig. 31), let's see the shape
of the posterior as we gradually sample points from the true underlying
function and fit our model.

[Figure 32: Posterior for the Gaussian Process regression after observing 2 samples from the target function.]

In the beginning, after sampling 2-10 points from the target


function (see Fig. 32 and Fig. 33) we can confidently say that the GP
regressor adequately learns the shape of the target function given
the modest size of the dataset. However, after fitting the model to
20 points from the target function, we notice that things go horribly
wrong (Fig. 34). Unfortunately, this behavior is present even when
we increase our dataset to 500 evaluations (see Fig. 35).

[Figure 33: Posterior for the Gaussian Process regression after observing 10 samples from the target function.]

[Figure 34: Posterior for the Gaussian Process regression after observing 20 samples from the target function.]

[Figure 35: Posterior for the Gaussian Process regression after observing 500 samples from the target function.]

In fact, there are two main aspects that are wrong with the observed properties of our estimator. The first one is that, as we increase the number of evaluations, we start seeing the mean of the posterior deviating ever further from the true function, thus increasing the overall error. The second one is that the uncertainty contracts, meaning that the algorithm becomes more and more certain of the predictions it is making, even though the posterior looks nothing like the true function.

Note that the statements about universality made at the beginning of the section still apply. It is just that the convergence rate of the algorithm is terribly low. If f is "not well covered" by the RKHS, the number of data points required to achieve ε error can be exponential in ε. Outside of the observation range, there are no guarantees at all. The following technical theorem defines the notions of convergence precisely:

[Figure 36: Convergence rate of the GP regressor, ‖f − m‖² against the number of function evaluations. The golden lines indicate a rate of O(1/√n).]

Theorem 46 (v.d. Vaart & v. Zanten, 2011). Let f_0 be an element of the Sobolev space W_2^β([0,1]^d) with β > d/2. Let k_s be a kernel on [0,1]^d whose RKHS is norm-equivalent to the Sobolev space W_2^s([0,1]^d) of order s := α + d/2 with α > 0. If f_0 ∈ C^β([0,1]^d) ∩ W_2^β([0,1]^d) and min(α, β) > d/2, then we have

E_{D_n | f_0} [ ∫ ‖f − f_0‖²_{L_2(P_X)} dΠ_n(f | D_n) ] = O( n^{−2 min(α,β)/(2α+d)} )   (n → ∞),   (1)

where E_{D_n | f_0} denotes expectation with respect to D_n = (x_i, y_i)_{i=1}^n with the model x_i ∼ P_X and p(y | f_0) = N(y; f_0(X), σ²I), and Π_n(f | D_n) the posterior given by GP-regression with kernel k_s.

(The Sobolev space W_2^s(X) is the vector space of real-valued functions over X whose derivatives up to s-th order have bounded L_2 norm; L_2(P_X) is the Hilbert space of square-integrable functions with respect to P_X. wikipedia.org/wiki/Sobolev_space)

The important takeaway from the theorem is: If f 0 is from a


sufficiently smooth space, and Hk is “covering” that space well,
then the entire GP posterior (including the mean!) can contract
around the true function at a linear rate. Gaussian Processes are
“infinitely flexible” as they can learn infinite-dimensional functions
arbitrarily well.

Condensed content

• Gaussian process regression is closely related to kernel ridge


regression.

– the posterior mean is the kernel ridge / regularized kernel


least-squares estimate in the RKHS Hk .

m(x) = k_xX (k_XX + σ²I)^{−1} y = arg min_{f∈H_k} ‖y − f_X‖² + ‖f‖²_{H_k}

– the posterior variance (expected square error) is the worst-


case square error for bounded-norm RKHS elements.

v(x) = k_xx − k_xX (k_XX)^{−1} k_Xx = max_{f∈H_k, ‖f‖_{H_k}≤1} (f(x) − m(x))²

• Similar connections apply for most kernel methods.

• GPs are quite powerful: They can learn any function in the
RKHS (a large, generally infinite-dimensional space!)

• GPs are quite limited: If f 6∈ Hk , they may converge very (e.g. ex-
ponentially) slowly to the truth.

• But if we are willing to be cautious enough (e.g. with a rough


kernel whose RKHS is a Sobolev space of low order), then poly-
nomial rates are achievable. (Unfortunately, exponentially slow
in the dimensionality of the input space)

For a practical and hands-on example of using Gaussian Process


Regression, we recommend watching the following comprehensive
lecture v .
Gauss-Markov Models v

Graphical models and the (conditional) independence


structure they entail proved to be a crucial concept that
allows for tractable inference. In this lecture we revisit these ideas
and combine them with some of the pivotal properties of the Gaussian
distribution discussed in the preceding lectures.

Time Series v

In lecture 2 we became familiar with the following "chain" graphical model, along with the independence it entails:

p(A, B, C)                    DAG            Independence     But!
p(C | B) p(B | A) p(A)        A → B → C      A ⊥⊥ C | B        A ̸⊥⊥ C

We notice that its design is suggestive of an underlying tempo-


ral structure. We can think of these graphs as a representation of a
process that evolves over time. Due to its inherent conditional inde-
pendence structure, the process has finite memory – what happens
at the next time step only depends on what the current situation in
the world is, thus decoupling the prediction for future states from
the values of the past states.

Definition 47 (Time series). A time series is a sequence [y(t_i)]_{i∈N}
of observations y_i := y(t_i) ∈ Y, indexed by a scalar variable
t ∈ R. In many applications, the time points ti are equally spaced:
ti = t0 + i · δt . Models that account for all values t ∈ R are called
continuous time, while models that only consider [ti ]i∈N are called
discrete time.

Some examples include: climate and weather observations, sen-


sor reading in cars, EEG, ECG, stock prices, and many more.
Inference in time series often has to happen in real-time, and
scale to an unbounded set of data, typically on small-scale or em-
bedded systems. For these reasons, it has to be of (low) constant
time and memory complexity.
Now, let us introduce a slight change of notation that will prove
to be convenient:

• In the previous lectures, we had observations y ∈ RD at N


locations x ∈ X. We also assumed latent function f ∈ R M , such
that y ≈ H f ( x ).

• In our current setting, the notion of a local finite memory only


works in an ordered space of inputs. Thus, we impose X ⊂ R.

• This leads us to the following updates: We observe y1 , . . . , y N


with yi ∈ RD at times [t1 , . . . , t N ] with ti ∈ R. Furthermore, we
assume a latent state x_i ∈ R^M, such that y_i ≈ Hx(t_i). In the current
setting, the state x will constitute the local memory.
Given that we have introduced the necessary change in notation, let
us turn to a fundamental definition that we will build upon.
Definition 48 (Markov chain). A joint distribution p( X ) over a
sequence of random variables X := [ x0 , . . . , x N ] is said to have the
Markov property if

p ( x i | x 0 , x 1 , . . . , x i −1 ) = p ( x i | x i −1 ).

The sequence is then called a Markov chain.


We now want to explore how the conditional independence
structure that is encoded in this graph has an effect on inference
algorithms for time series problems. In particular, we will make the
following two assumptions:
1. The joint distribution over the latent variables x has the Markov
property: p(x_t | X_0:t−1) = p(x_t | x_t−1)

2. The observations y are local – each y_t only depends on the latent
x_t: p(y_t | X) = p(y_t | x_t)

[Figure 37: The graphical model under the two assumptions: a chain x_0 → x_1 → ... → x_t over the latent states, with each observation y_t attached to its state x_t.]

In the typical predictive setting of these problems, given observed
data Y_0:t−1 = (y_0, y_1, ..., y_t−1), we first want to infer the current
latent state x_t:
p(x_t | Y_0:t−1) = ∫ p(X) p(Y_0:t−1 | X) ∏_{j≠t} dx_j  /  ∫ p(X) p(Y_0:t−1 | X) dX

= ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] p(x_t | x_t−1) [∏_{j&gt;t} p(x_j | x_j−1)] ∏_{j≠t} dx_j
  /  ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] p(x_t | x_t−1) [∏_{j&gt;t} p(x_j | x_j−1)] dX

= ∫ p(x_t | x_t−1) p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] ∏_{j&lt;t} dx_j
  /  ∫ p(Y_0:t−1 | X_0:t−1) p(x_0) [∏_{0&lt;j&lt;t} p(x_j | x_j−1)] ∏_{j≤t} dx_j

= ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1

A simpler way to arrive at this result is to note that the joint is:

p( xt , xt−1 |y1:t−1 ) = p( xt | xt−1 , y1:t−1 ) p( xt−1 |y1:t−1 ) = p( xt | xt−1 ) p( xt−1 |y1:t−1 )

Then, we can integrate over xt−1 to obtain the Chapman-Kolmogorov


equation, which embodies the predict step of the inference process:
Z
p( xt |y1:t−1 ) = p( xt | xt−1 ) p( xt−1 |y1:t−1 )dxt−1 .

After observing the new data point yt , we want to update our


belief for xt :
p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1)  /  ∫ p(y_t | x_t) p(x_t | Y_0:t−1) dx_t

Unsurprisingly, this is known as the update step.


The process of alternating between predicting and updating,
known as filtering, can be performed in linear time in the number of
data points (remember that Gaussian Process is cubic in the number
of data points). Interestingly, if we were only ever interested in
predicting (which is often the case), then we are practically done.
Nevertheless, there are situations in which we want to look back
through time and correct the predictions that we made, by intro-
ducing information from future observations (y0 , y1 , . . . , y T ):
p(x_t | Y) = ∫ p(x_t, x_t+1 | Y) dx_t+1 = ∫ p(x_t | x_t+1, Y) p(x_t+1 | Y) dx_t+1

Let us further explore the quantity p( xt | xt+1 , Y ):

p(x_t | x_t+1, Y) = p(Y_t+1:n | x_t+1, x_t, Y_0:t) p(x_t | x_t+1, Y_0:t)  /  ∫ p(Y_t+1:n | x_t+1, x_t, Y_0:t) p(x_t | x_t+1, Y_0:t) dx_t

                  = p(Y_t+1:n | x_t+1, Y_0:t) p(x_t | x_t+1, Y_0:t)  /  ∫ p(Y_t+1:n | x_t+1, Y_0:t) p(x_t | x_t+1, Y_0:t) dx_t  =  p(x_t | x_t+1, Y_0:t)

p(x_t | x_t+1, Y_0:t) = p(x_t, x_t+1 | Y_0:t) / p(x_t+1 | Y_0:t) = p(x_t+1 | x_t, Y_0:t) p(x_t | Y_0:t) / p(x_t+1 | Y_0:t) = p(x_t+1 | x_t) p(x_t | Y_0:t) / p(x_t+1 | Y_0:t)

Plugging back in, we obtain:


p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1

This process is formally known as smoothing, and its complexity is


also linear in the number of data points.
To summarize, Markov Chains formalize the notion of a stochas-
tic process with a local finite memory. Inference over Markov Chains
separates into three operations that can be performed in linear time:

Filtering: O(T)

predict:  p(x_t | Y_0:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1        (Chapman-Kolmogorov Eq.)
update:   p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1) / p(y_t)

Smoothing: O(T)

smooth:   p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1

To rewrite what we did, we constructed the following inference


algorithm:

1   procedure Inference(Y, p(x_0), p(x_t | x_t−1) ∀t, p(y_t | x_t) ∀t)
2       for t = 1, ..., n do                                                      Filtering
3           p(x_t | Y_0:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_0:t−1) dx_t−1         Predict
4           p(x_t | Y_0:t) = p(y_t | x_t) p(x_t | Y_0:t−1) / p(y_t)               Update
5       end for
6       for t = n−1, ..., 0 do                                                    Smoothing
7           p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_0:t) dx_t+1
8       end for
9       return p(x_t | Y) ∀t = 0, ..., n                                          return all marginals
10  end procedure

Gauss-Markov Models v

The inference algorithm that we have just described is abstract, in


the sense that we never specified any of the probability distribu-
tions. In order to provide a full implementation, we make use of the
prominent Gaussian Distribution.
Given that all relationships between the variables are linear and
Gaussian, the assumptions about the model translate as follows:
1. p( x (ti+1 ) | X1:i ) = N ( xi+1 ; Axi , Q) , with p( x0 ) = N ( x0 ; m0 , P0 )

2. p(y_i | X) = N(y_i; Hx_i, R)
Under these assumptions, we obtain the following concrete imple-
mentations for the three steps of the inference algorithm:
predict:  p(x_t | Y_1:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | Y_1:t−1) dx_t−1
                           = ∫ N(x_t; Ax_t−1, Q) N(x_t−1; m_t−1, P_t−1) dx_t−1
                           = N(x_t; Am_t−1, AP_t−1A^| + Q)
                           = N(x_t; m_t^−, P_t^−)

update:   p(x_t | Y_1:t) = p(y_t | x_t) p(x_t | Y_1:t−1) / p(y_t)
                         = N(y_t; Hx_t, R) N(x_t; m_t^−, P_t^−) / N(y_t; Hm_t^−, HP_t^−H^| + R)
                         = N(x_t; m_t^− + Kz, (I − KH)P_t^−)
                         = N(x_t; m_t, P_t),   where
                           K := P_t^−H^|(HP_t^−H^| + R)^{−1}   (gain)
                           z := y_t − Hm_t^−                    (residual)

smooth:   p(x_t | Y) = p(x_t | Y_0:t) ∫ p(x_t+1 | x_t) p(x_t+1 | Y) / p(x_t+1 | Y_1:t) dx_t+1
                     = N(x_t; m_t, P_t) ∫ N(x_t+1; Ax_t, Q) N(x_t+1; m^s_t+1, P^s_t+1) / N(x_t+1; m^−_t+1, P^−_t+1) dx_t+1
                     = N(x_t; m_t + G_t(m^s_t+1 − m^−_t+1), P_t + G_t(P^s_t+1 − P^−_t+1)G_t^|)
                     = N(x_t; m^s_t, P^s_t),   where
                       G_t := P_tA^|(P^−_t+1)^{−1}   (smoother gain)

In the Gauss-Markov setting, the filter and smoother are known as


the Kalman Filter and the Rauch-Tung-Striebel Smoother respec-
tively.

This framework has had such a great impact across various domains
that the variables have standard names throughout the literature:

(Kalman) Filter:

p(x_t) = N(x_t; m_t^−, P_t^−)                    predict step
m_t^− = Am_t−1                                   predictive mean
P_t^− = AP_t−1A^| + Q                            predictive covariance

p(x_t | y_t) = N(x_t; m_t, P_t)                  update step
z_t = y_t − Hm_t^−                               innovation residual
S_t = HP_t^−H^| + R                              innovation covariance
K_t = P_t^−H^|S_t^{−1}                           Kalman gain
m_t = m_t^− + K_t z_t                            estimation mean
P_t = (I − K_tH)P_t^−                            estimation covariance

(Rauch-Tung-Striebel) Smoother:

p(x_t | Y) = N(x_t; m_t^s, P_t^s)                smooth step
G_t = P_tA^|(P_{t+1}^−)^{−1}                     RTS gain
m_t^s = m_t + G_t(m_{t+1}^s − m_{t+1}^−)         smoothed mean
P_t^s = P_t + G_t(P_{t+1}^s − P_{t+1}^−)G_t^|    smoothed covariance
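As a concrete illustration (our own sketch, not part of the original notes), the following numpy code implements the filter and smoother equations above for a generic linear-Gaussian model; all function and variable names are ours.

import numpy as np

def kalman_filter(Y, A, Q, H, R, m0, P0):
    # forward pass over the observations: predict, then update, O(T)
    m, P = m0, P0
    ms, Ps, ms_pred, Ps_pred = [], [], [], []
    for y in Y:
        m_pred = A @ m                                # predictive mean
        P_pred = A @ P @ A.T + Q                      # predictive covariance
        z = y - H @ m_pred                            # innovation residual
        S = H @ P_pred @ H.T + R                      # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
        m = m_pred + K @ z                            # estimation mean
        P = (np.eye(len(m0)) - K @ H) @ P_pred        # estimation covariance
        ms.append(m); Ps.append(P)
        ms_pred.append(m_pred); Ps_pred.append(P_pred)
    return ms, Ps, ms_pred, Ps_pred

def rts_smoother(ms, Ps, ms_pred, Ps_pred, A):
    # backward pass, also O(T)
    ms_s, Ps_s = [ms[-1]], [Ps[-1]]
    for t in range(len(ms) - 2, -1, -1):
        G = Ps[t] @ A.T @ np.linalg.inv(Ps_pred[t + 1])      # RTS gain
        ms_s.insert(0, ms[t] + G @ (ms_s[0] - ms_pred[t + 1]))
        Ps_s.insert(0, Ps[t] + G @ (Ps_s[0] - Ps_pred[t + 1]) @ G.T)
    return ms_s, Ps_s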

Stochastic Differential Equations v

In this section we explore the question of how this family of models


is related to the family of Gaussian Process regression models.
Up until now, we have only discussed predictions for function values at particular points in time that are discretely spaced away from each other. Nevertheless, time is a continuous object – so to answer the posed question, we have to think about what happens in between the observations.

In order to turn our Markov model into a continuous time model and relate it to Gaussian Process regression, we will take a pedestrian approach. (Warning: the following are drastic simplifications! The full proof requires a deep dive into Stochastic Processes, which are not the main focus of this course.)
We sample a set of states x from a joint Gaussian distribution
over discrete time points, by actually starting at a random point
x0 , and successively drawing points xt using the conditional p( xt |
xt−1 ) (see Fig. 38, where we sample every δt = 1, with Qδt = 1 and
A = 1).
Next, we increase the rate of sampling by drawing every δt = 1/2
(see Fig. 39). Notice that we have to adjust the variance Qδt = 1/2 in
order to prevent the values from exploding.
Now, we are interested in the resulting object for the limiting
case of δt → 0 (see Fig. 40). In particular, we would like to encode
that Qδt/δt approaches some kind of finite object (like a derivative;
however sample paths from this resulting (Wiener) process are

[Figure 38: Samples of the latent state x_t with δt = 1, Q_δt = 1.]

[Figure 39: Samples of the latent state x_t with δt = 1/2, Q_δt = 1/2.]

almost surely not differentiable). For this reason, we introduce a


new object Qdt := dω, known as the Wiener measure. Note that this
is a non-standard construction: dω can be defined more elegantly,
but this goes beyond the scope of this course.

[Figure 40: Samples of the latent state x_t with δt → 0, Q_δt = ???]

The important takeaway is that we started with a Markov chain,


explored the limiting case of sampling over a continuous time do-
main, and ended up with a Gaussian Process. Now, let us formalize
the relationship between these two families of models.

Definition 49 (Stochastic Differential Equation). The linear, time-



invariant Stochastic Differential Equation (SDE)

dx (t) = Fx (t) dt + L dωt ,

together with x (t0 ) = x0 , describes the local behavior of the


(unique) Gaussian process with

E(x(t)) =: m(t) = e^{F(t−t_0)} x_0,
cov(x(t_a), x(t_b)) =: k(t_a, t_b) = ∫_{t_0}^{min(t_a, t_b)} e^{F(t_a−τ)} LL^| e^{F^|(t_b−τ)} dτ

This GP is known as the solution of the SDE. It gives rise to the discrete-time stochastic recurrence relation p(x_{t_{i+1}} | x_{t_i}) = N(x_{t_{i+1}}; A_{t_i} x_{t_i}, Q_{t_i}) with

A_{t_i} = e^{F(t_{i+1}−t_i)}   and   Q_{t_i} = ∫_0^{t_{i+1}−t_i} e^{Fτ} LL^| e^{F^|τ} dτ.
Let us now look at two examples in order to better understand the statements above. Firstly, notice that by setting F = 0, L = θ, the SDE yields the scaled Wiener process, with

m(t) = x_0,   k(t_a, t_b) = θ²(min(t_a, t_b) − t_0),

along with a Markov chain whose parameters are

A = I,   Q_{t_i} = θ²(t_{i+1} − t_i).

In another example, by setting F = −1/λ, L = √(2/λ) θ, the SDE yields the Ornstein-Uhlenbeck process:

m(t) = x_0 e^{−(t−t_0)/λ},   k(t_a, t_b) = θ² ( e^{−|t_a−t_b|/λ} − e^{(2t_0−t_a−t_b)/λ} ),

along with a Markov chain whose parameters are

A = e^{−δt/λ},   Q_{t_i} = θ² ( 1 − e^{−2δt/λ} ).

(Side note: the exponential function also generalizes to matrices, e^X := ∑_{i=0}^∞ X^i / i!, with the following properties: e^0 = I, (e^X)^{−1} = e^{−X}, X = VDV^{−1} ⇒ e^X = V e^D V^{−1}, e^{diag_i d_i} = diag_i e^{d_i}, det e^X = e^{tr X}.)
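As a quick sanity check (our own sketch, not from the notes), the discrete-time recurrence for the Ornstein-Uhlenbeck example can be simulated directly; λ, θ and the step size below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
lam, theta, dt = 2.0, 1.0, 0.01                # OU time scale, output scale, step size
A = np.exp(-dt / lam)                          # transition A = exp(-dt / lambda)
Q = theta ** 2 * (1 - np.exp(-2 * dt / lam))   # process noise variance

x = [0.0]                                      # start at x(t0) = 0
for _ in range(1000):
    # draw x_{t+1} ~ N(A x_t, Q)
    x.append(A * x[-1] + np.sqrt(Q) * rng.standard_normal())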

Condensed content

(For more on Gaussian and approximately Gaussian filters see, e.g., Simo Särkkä, Bayesian Filtering and Smoothing, Cambridge University Press, 2013; https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf)

• Markov Chains capture finite memory of a time series through conditional independence

• Gauss-Markov models map this structure to linear algebra
• Kalman filter is the name for the corresponding algorithm

• SDEs (Stochastic Differential Equations) are the continuous-


time limit of discrete-time stochastic recurrence relations (in
particular, linear SDEs are the continuous-time generalization of
discrete-time linear Gaussian systems)

• Complexity of all necessary operations is linear, O( N ) in the


number of data points (as opposed to O( N 3 ) for general GPs).
(Although not shown, this includes hyperparameter inference!)
Gaussian Process Classification v

So far, we have seen how to model continuous outputs.


This consisted of learning the functional relation of an output y
given some features x as y ∼ p(y | f ( x )). In this chapter, we
see how to adapt this framework to do classification: given data
that can belong to two classes, say A and B, we want to learn the
probability that a sample with features x belongs to class A.

[Figure 41: Simple classification problem. It is possible to draw a line to perfectly separate the samples.]

[Figure 42: Imperfect linear classification problem. It is not possible to perfectly separate the two classes with a simple line and the boundary needs to be fuzzy.]

[Figure 43: Nonlinear classification problem. While it is not possible to find a single line to separate the two classes, a more complex boundary can be found.]

[Figure 44: Non-separable classification problem. The overlap between the two classes is too strong to be separable, but some structure does exist.]

To clarify the difference between regression and classification,


consider the two definitions;

Definition 50 (Regression). Given input-output pairs ( xi , yi )i=1,...,n


with xi ∈ X and yi ∈ Rd , we want to find a function f : X → Rd
such that f models yi ≈ f ( xi )

Definition 51 (Classification). Given input-output pairs ( xi , ci )i=1,...,n


with x_i ∈ X and c_i ∈ {0, 1, ..., d − 1}, we want to find a probability
π : X → U_d, where U_d = {p ∈ [0, 1]^d : ∑_{i=1}^d p_i = 1}, such that π
models c_i ≈ π(x_i).

For a first approach, we will only consider binary classification,


with y ∈ {−1, 1}:
p(y | x) = π(x)       if y = 1,
           1 − π(x)   if y = −1.

Note that this is very similar to regression, where we were tasked with learning p(y | x) = N(y; f_x, σ²). The main difference between the two is the domain: in classification y ∈ {−1, 1}, instead of y ∈ R. To account for this, we will introduce a link function. This simply is a mapping from a real value f ∈ R to a probability π ∈ [0, 1], using the sigmoid or logistic link function;

π_f = σ(f) = 1 / (1 + e^{−f}).

(The sigmoid has some very useful properties. In particular, it is symmetric, σ(f) = 1 − σ(−f); its inverse is easy to compute, f(π) = log π_f − log(1 − π_f); as is its derivative, ∂π_f/∂f = π_f(1 − π_f).)

We can use the logistic function on top of a Gaussian Process regression model to adapt it for classification, thus creating logistic regression. In particular, take a Gaussian Process prior over f, p(f) = GP(f; m, k), and use the likelihood

p(y | f_x) = σ(y f_x) = σ(f)       if y = 1,
                        1 − σ(f)   if y = −1.

[Figure 45: The sigmoid. Notice the transformation of the Gaussian probability density functions when passed through the sigmoid.]

A slight issue with this formulation is that the posterior is no


longer Gaussian. For a matrix X and a vector Y, we have

p(f_X | Y) = p(Y | f_X) p(f_X) / p(Y) = σ(Y f_X) N(f_X; m, k) / ∫ σ(Y f_X) N(f_X; m, k) df_X,

log p(f_X | Y) = −(1/2) f_X^⊤ K_XX^{−1} f_X + ∑_{i=1}^n log σ(y_i f_{x_i}) + const.

This makes the computation of the posterior more complex, and in


most cases not even tractable. The following figures show the prior,
likelihood and posterior of such a classification model.

[Figure 46: GP Prior on f. Note the Gaussian shape of the distribution.]

[Figure 47: The likelihood expressed through the sigmoid link function, which is non-Gaussian.]

[Figure 48: The resulting posterior on f. Note the non-Gaussian shape of the distribution.]

However we do not always need to compute the full posterior –


an approximation providing the “key aspects” of the posterior can
be sufficient. Sometimes we are only interested in the moments of
the joint p( f , y) = p(y| f ) p( f ), rather than the entire distribution:

The evidence:   E_{p(y,f)}[1] = ∫ 1 · p(y, f) df = ∫ p(y, f) df = Z,
The mean:       E_{p(f|y)}[f] = ∫ f · p(f | y) df = (1/Z) ∫ f · p(f, y) df = f̄,
The variance:   E_{p(f|y)}[f²] − f̄² = ∫ f² · p(f | y) df − f̄² = (1/Z) ∫ f² · p(f, y) df − f̄² = var(f).
Recall that Z can be useful for parameter tuning, f¯ provides a
useful point estimate, and var( f ) is a good estimate of the error
around f¯.

The Laplace Approximation

The Laplace approximation is a rough and local approximation to


the posterior. Even though it can be arbitrarily wrong, it is com-
putationally very efficient. Furthermore, it works well for logistic
regression because the log posterior is concave (see Fig. 49).
The Laplace approximation
 builds a Gaussian approximation
q(θ ) = N θ; µ̂, Σ̂ to a non-Gaussian probability distribution
p(θ ), which in our context can be a likelihood or a posterior.

[Figure 49: Laplace approximation (in black) to the posterior (in red).]

First, we find a (local) maximum to log p(θ ) (or, equivalently,


p(θ )) in order to determine the mean of q(θ ):
µ̂ = θ̂ = arg max_θ log p(θ),

Next, we determine the covariance of q(θ) as the negative inverse Hessian of log p(θ) at θ̂. For this purpose, we make use of a second-order Taylor expansion of log p(θ) around θ̂, i.e. θ = θ̂ + δ, in log space:

log p(θ) = log p(θ̂) + (1/2) δ^| [∇∇^| log p(θ̂)] δ + O(δ³),   with Ψ := ∇∇^| log p(θ̂).

(The first-order term vanishes because θ̂ is a maximum of log p.) Now, if we exponentiate both sides in the equation above, we (roughly) obtain:

p(θ) ≈ exp( (1/2) δ^| Ψ δ + const ) = exp( (1/2) (θ − θ̂)^| Ψ (θ − θ̂) + const ),

leading us to the following approximation:

q(θ) = N(θ; θ̂, −Ψ^{−1}) ≈ p(θ).
If p(θ ) is Gaussian, the approximation is exact and q(θ ) = p(θ ).
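To make the two steps concrete, here is a small sketch (ours, not from the notes) that Laplace-approximates a one-dimensional non-Gaussian density; the Gamma-shaped target and all names are arbitrary, and the Hessian is obtained by finite differences for simplicity.

import numpy as np
from scipy.optimize import minimize

def log_p(theta):
    # an (unnormalized) non-Gaussian log-density, here with a Gamma(3, 1) shape
    return 2.0 * np.log(theta) - theta

# step 1: find the mode theta_hat = argmax log p(theta)
res = minimize(lambda t: -log_p(t[0]), x0=[1.0], bounds=[(1e-6, None)])
theta_hat = res.x[0]

# step 2: curvature Psi at the mode, here via a central finite difference
eps = 1e-4
Psi = (log_p(theta_hat + eps) - 2 * log_p(theta_hat) + log_p(theta_hat - eps)) / eps ** 2

mu_hat = theta_hat          # Laplace mean
var_hat = -1.0 / Psi        # Laplace variance, -Psi^{-1}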

Application to Gaussian Process Logistic Regression


First, we find the maximum-a-posteriori for the latent f on the
training set:
f̂ = arg max_{f_X} log p(f_X | y)

and then assign a Gaussian posterior at the training points:

q(f_X) = N( f_X; f̂, −[ ∇²_{f_X} log p(f_X | y) |_{f_X = f̂} ]^{−1} ) =: N(f_X; f̂, Σ̂)

Now, we can approximate the posterior predictions of test points at


fx:

q(f_x | y) = ∫ p(f_x | f_X) q(f_X) df_X,
           = ∫ N( f_x; m_x + K_xX K_XX^{−1}(f_X − m_X), K_xx − K_xX K_XX^{−1} K_Xx ) q(f_X) df_X,
           = N( f_x; m_x + K_xX K_XX^{−1}(f̂ − m_X), K_xx − K_xX K_XX^{−1} K_Xx + K_xX K_XX^{−1} Σ̂ K_XX^{−1} K_Xx ).

Prediction probabilities can then be computed as:

π̂_x = E_{p(f_x|y)}[π_x] ≈ E_{q(f_x|y)}[π_x] = ∫ σ(f_x) q(f_x | y) df_x,

or alternatively:

π̂_x ≈ σ( E_q[f_x] ).

(Recall that if p(x) = N(x; m, V) and p(z | x) = N(z; Ax, B), then p(z) = ∫ p(z | x) p(x) dx = N(z; Am, AVA^⊤ + B).)

Caution: the two ways of computing the prediction probabilities are


not equivalent!

Implementing the Laplace Approximation for GP classification


First, let us write down our assumptions explicitly:
p(f) = GP(f; m, k),   p(y | f_X) = ∏_{i=1}^n σ(y_i f_{x_i}),   σ(z) = 1 / (1 + e^{−z})

Now, let us compute the log-posterior:

log p(f_X | y) = log p(y | f_X) + log p(f_X) − log p(y)
               = ∑_{i=1}^n log σ(y_i f_{x_i}) − (1/2)(f_X − m_X)^| K_XX^{−1} (f_X − m_X) + const

Next, we need to compute the gradient of the log-posterior:

∇ log p(f_X | y) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) − K_XX^{−1}(f_X − m_X),
  with   ∂ log σ(y_i f_{x_i}) / ∂f_{x_j} = δ_ij ( (y_i + 1)/2 − σ(f_{x_i}) )

From there, we need to compute the Hessian:


∇∇^| log p(f_X | y) = ∑_{i=1}^n ∇∇^| log σ(y_i f_{x_i}) − K_XX^{−1},
  with   ∂² log σ(y_i f_{x_i}) / ∂f_{x_a} ∂f_{x_b} = −δ_ia δ_ib σ(f_{x_i})(1 − σ(f_{x_i})) =: −δ_ia δ_ib w_i,   0 < w_i < 1,
=: −diag(w) − K_XX^{−1} = −(W + K_XX^{−1})     (concave maximization / convex minimization!)

Finally, all we need is an optimizer which will find the local max-
imum of the log-posterior. Practically, any optimizer that follows
the gradient will do the job. For the sake of illustration, we will
consider the second order Newton Optimization method, since it is
very efficient for convex optimization problems such as ours.
1   procedure GP-Logistic-Train(K_XX, m_X, y)
2       f ← m_X                                                       initialize
3       while not converged do
4           r ← (y + 1)/2 − σ(f) = ∇ log p(y | f_X)                    gradient of log likelihood
5           W ← diag(σ(f)(1 − σ(f))) = −∇∇ log p(y | f_X)              negative Hessian of log likelihood
6           g ← r − K_XX^{−1}(f − m_X)                                 compute gradient
7           H ← −(W + K_XX^{−1})^{−1}                                  compute inverse Hessian
8           ∆ ← Hg                                                     Newton step
9           f ← f − ∆                                                  perform step
10          converged ← ‖∆‖ < ε                                        check for convergence
11      end while
12      return f
13  end procedure

Note that this particular implementation can be numerically un-


stable as it (repeatedly) requires (W + K −1 )−1 . For a numerically
stable alternative, use B := I + W 1/2 KXX W 1/2 (cf. Rasmussen &
Williams).
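The following numpy sketch (our own, loosely following Rasmussen & Williams' Algorithm 3.1 and assuming a zero prior mean) shows how the B-matrix trick avoids forming (W + K^{-1})^{-1} explicitly; all names are ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_logistic_train(K, y, tol=1e-8, max_iter=100):
    # Newton iteration for the posterior mode; zero prior mean, labels y in {-1, +1}
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = sigmoid(f)
        r = (y + 1) / 2 - pi                            # gradient of the log likelihood
        W = pi * (1 - pi)                               # negative Hessian (diagonal entries)
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]   # B = I + W^{1/2} K W^{1/2}
        L = cholesky(B, lower=True)
        b = W * f + r
        # a = b - W^{1/2} B^{-1} W^{1/2} K b, via two triangular solves
        tmp = solve_triangular(L, sW * (K @ b), lower=True)
        a = b - sW * solve_triangular(L.T, tmp, lower=False)
        f_new = K @ a                                   # Newton update f = K a
        if np.max(np.abs(f_new - f)) < tol:
            return f_new, sW, L, r
        f = f_new
    return f, sW, L, r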

Lastly, we need to specify the procedure for making predictions:


1   procedure GP-Logistic-Predict(f̂, W, R, r, k, x)             f̂, W, R = Cholesky(B), r handed over from training
2       for i = 1, ..., length(x) do
3           f̄_i ← k_{x_i X} r                                    mean prediction (note: at the mode, 0 = ∇ log p(f_X | y) = r − K_XX^{−1}(f_X − m_X))
4           s ← R^{−1}(W^{1/2} k_{Xx_i})                          pre-computation allows this step in O(n²)
5           v ← k_{x_i x_i} − s^| s                               v = cov(f_x)
6           π̄_i ← ∫ σ(f_i) N(f_i; f̄_i, v) df_i                   predictive probability for class 1, p(y | f̄) = ∫ p(y_x | f_x) p(f_x | f̄) df_x
7       end for                                                   entire loop is O(n²m) for m test cases
8       return π̄_x
9   end procedure

[Figure 50: Pictorial view of GP logistic regression: the latent function f (top panel) and the resulting class probability π (bottom panel).]

Condensed content

• Gaussian Process Classification is a Supervised method phrased


in a discriminative model with probabilistic interpretation

• It models binary outputs as a transformation of a latent func-


tion with a Gaussian process prior

• due to non-Gaussian likelihood, the posterior is non-Gaussian;


exact inference intractable

• Laplace approximation: Find MAP estimator, second order ex-


pansion for Gaussian approximation

• tune code for numerical stability, efficient computations

• Laplace approximation provides Gaussian posterior on training


points, hence evidence, predictions
Generalized Linear Models v

Gaussian Processes are super adaptive inference ma-


chines. To understand this statement, we will explore three main
topics: first, we see if Support Vectors Machine have a probabilis-
tic interpretation by relating them to GP classification; next, we
move towards Generalized Linear Models, which can be structured
in a way to learn arbitrary functions; finally, we make yet another
connection to deep learning.

Connection to Support Vector Machines v

Let’s quickly recap what we did in the previous lecture. For the
purpose of classification, we made extensive use of the sigmoid link
function:
σ(x) = 1 / (1 + e^{−x})   with   dσ(x)/dx = σ(x)(1 − σ(x))
We were interested in extending Gaussian Process regression to the
classification setting, by constructing a Gaussian approximation
q( f X | y) for the (usually non-Gaussian) posterior distribution at
the training points. In order to find the mode of the posterior, we
had to compute the gradient of the log-posterior and set it to zero:
∇ log p(f_X | y) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) − K_XX^{−1}(f_X − m_X) = 0
⇒ K_XX^{−1}(f_X − m_X) = ∑_{i=1}^n ∇ log σ(y_i f_{x_i}) = ∇ log p(y | f_X) =: r

Then at test time, recall that the mode of the approximation is given by:

E_q(f_x) = m_x + k_xX K_XX^{−1}(f̂_X − m_X) = m_x + k_xX r

Notice that in the last expression, the expected function value for the test point explicitly depends on the gradients of the log-likelihood of the training points.

In particular, observe the dashed blue line in Fig. 51 – the two training points in the middle have the largest value for the gradient. In turn, these points will have the highest contribution in determining the expected function value for a new test point.

[Figure 51: Towards Support Vector Machines. Notice how the two points around zero would already provide a strong support for a good classifier.]

So, x_i with |f_i| ≫ 1, where ∇ log p(y_i | f_i) ≈ 0, contribute almost nothing to E_q(f_x). On the other hand, the x_i with |f_i| < 1 can be considered as "support points".

This realization leads us to the idea to try and make the connection between GP classification and Support Vector Machines.

In the statistical framework, we are often interested in maximizing the posterior, which is equivalent to minimizing the negative log posterior:

−log p(f_X | y) = ∑_i −log σ(y_i f_i) + ‖f_X‖²_{K_XX}
                = ∑_i ℓ(y_i; f_i) + ‖f_X‖²_{K_XX}

[Figure 52: Hinge loss alongside the log likelihood for GP classification.]

By replacing the loss term with the Hinge Loss ℓ(y_i; f_i) = [1 − y_i f_i]_+, we obtain the Support Vector Machine learning algorithm.

At this point, one could postulate that the Hinge Loss is a limiting object for the negative log likelihood (see Fig. 52), for which the gradient for f > 1 is zero.
This would explain the phenomenon of support points in the
previous example of GP classification. In turn, this would mean
that we would have constructed a probabilistic interpretation of
Support Vector Machines.
Unfortunately, that is not the case, as the Hinge loss is not a (negative) log-likelihood:

exp(−ℓ(y_i; f_i)) + exp(−ℓ(−y_i; f_i)) = exp(−ℓ(f_i)) + exp(−ℓ(−f_i)) ≠ const

[Figure 53: The Hinge loss is not a log likelihood. Notice that while for every f we get σ(f) + σ(−f) = 1 (dotted red line), that is not the case for exp(−[1 − f]_+) + exp(−[1 + f]_+) ≠ const (dotted black line). This is a necessary requirement, since the likelihood is a probability distribution in the observed data (but not necessarily the latent parameters).]
We can conclude that SVMs are an example of a machine learn-


ing algorithm without a proper probabilistic interpretation. Never-
theless, the probabilistic view can help with intuition for the statisti-
cal interpretation.

Generalized Linear Models v

In this section, we look into various ways of extending the Laplace


approximation.
We first start off by extending GP classification to the multi-class
setting. The first necessary change that we need to introduce is to
adapt the latent function f to produce C outputs for each point. So,
at the n locations, the latent variables are:
f_X = [f_1^{(1)}, ..., f_n^{(1)}, f_1^{(2)}, ..., f_n^{(2)}, ..., f_1^{(C)}, ..., f_n^{(C)}]

At location xi , generate probabilities for each class by taking the


softmax:
p(y_i^{(c)} | f_i^{(c)}) = π_i^{(c)} = exp(f_i^{(c)}) / ∑_{c̃=1}^C exp(f_i^{(c̃)})
The remaining derivations are analogous to the binary case.
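As a small illustration (ours), the softmax link maps the C latent values at each location to a probability vector; the max-subtraction is only a standard numerical-stability trick.

import numpy as np

def softmax_link(F):
    # F has shape (n, C): latent function values f_i^(c) at n locations
    F = F - F.max(axis=1, keepdims=True)       # subtract max for numerical stability
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)    # rows are the class probabilities pi_i^(c)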

This example illustrates that, by changing the link function,


and making the appropriate adaptations, we can tune Gaussian
Processes to learn a very large class of probabilistic models.

In fact, the choice of link functions is practically limitless, as


they are only required to be continuous (and of course the Laplace
approximation should be meaningful for the given posterior). To
illustrate this point, see Fig. 54, where we make use of various link
functions.

[Figure 54: Towards Generalized Linear Models. Samples from the posterior latent function f are pushed through three different link functions: σ(x) = 1/(1 + e^{−x}), σ(x) = e^x, and σ(x) = (x/3)^8 exp(−2(x/3)).]

Definition 52 (Generalized Linear Model). (For our purposes,) a generalized linear model (GLM) is a probabilistic regression model for a function f with a Gaussian process prior p(f) and a non-Gaussian likelihood p(y | f_x). (Note the distinction to a general linear model: GP prior and Gaussian likelihood, with a non-linear kernel k.)
Until now, we have mainly focused on approximating the pos-
terior distribution. Let us show with an example that this is not
mandatory, as we can use the Laplace approximation to model the
likelihood.
Say we are modeling the number of new COVID cases as a func-
tion of the number of days since the outbreak (see Fig. 55). In stan-
dard GP regression, we assumed the likelihood and prior to be of

[Figure 55: Number of new COVID cases per day. Data: Robert Koch Institut, 22 May 2020.]

the form:

p(y | f_T) = N(y; f_T, σ²I),   p(f) = GP(f; 0, k)

One flaw with the current formulation is that GP inherently as-


sumes that the function values are in the range (−∞, ∞), whereas
our count data is strictly non-negative.
One way to work around this is to log-transform the data (see
Fig. 56), perform GP regression, and then exponentiate to go back
to the original scale. This will ensure that the predictions are in the correct
range.

[Figure 56: Transformed data using the log-transform. Other transformations are also possible.]

One crucial aspect that we notice from the data is that the count
values in the first 40 days since the outbreak differ significantly
from the rest of the days (observe the values of the solid blue line,
ignore the noise bands for now).
In the standard GP regression, our likelihood assumed that
the noise had the same scale σ across all observations. Ideally, we
would like to represent the values in the first 40 days with higher
uncertainty.
In order to account for this, we perform Laplace approximation
on the likelihood:

p(y | f_T) = N(y; exp(f_T), σ²I)

∂ log p(y | f_T) / ∂f_T |_{f_T = f̂_T} = 0   ⇒   f̂_T = log y

∂² log p(y | f_T) / ∂f_T² |_{f_T = f̂_T} = −y²/σ²

⇒ q(y | f_T) = N(f_T; log y, σ² diag(1/y²)) ≈ p(y | f_T)

Now, with this new approximation for the likelihood, we can ac-
count for different uncertainties at different points in time.
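Concretely (our own sketch, with made-up numbers standing in for the case counts), the approximation turns the problem into GP regression on log-counts with per-point noise variances:

import numpy as np

y = np.array([3.0, 12.0, 80.0, 950.0, 4200.0, 3100.0, 800.0])   # toy daily counts
sigma = 50.0                     # noise scale in count space (arbitrary)

f_hat = np.log(y)                # Laplace mean of the likelihood: f_hat = log y
noise_var = sigma ** 2 / y ** 2  # per-point variance, sigma^2 * diag(1 / y^2)

# GP regression can now use the Gaussian "pseudo-likelihood"
# N(f_T; f_hat, diag(noise_var)) in place of the homoscedastic sigma^2 * I;
# small counts automatically receive larger uncertainty in log space.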

Bayesian Deep Learning v

Having introduced Generalized Linear Models, in this section we


explore whether this view still has a connection to deep learning. In
short – the answer is yes, and we will demonstrate this with two
use cases.

Hyperparameter optimization
First, we show that computing evidences (marginal likelihoods)
is possible, which could then be used to perform hyperparameter
optimization. Let’s start by writing down the evidence term:
p(y | X) = ∫ p(y, f | X) df = ∫ exp( log[ p(y | f) p(f | X) ] ) df

In the standard GP setting, both the likelihood and the prior were Gaussian, so computing this integral only involved linear algebra. However, in the case of Generalized Linear Models, this is not the case, as the likelihood is typically non-Gaussian. For this reason, we construct a Laplace approximation:

log[ p(y | f) p(f | X) ] ≈ log[ p(y | f̂) p(f̂ | X) ] − (1/2)(f − f̂)^|(K^{−1} + W)(f − f̂) = log q(y, f | X)

From there, we have:

p(y | X) ≈ q(y | X) = exp( log[ p(y | f̂) p(f̂ | X) ] ) ∫ exp( −(1/2)(f − f̂)^|(K^{−1} + W)(f − f̂) ) df
                    = exp( log p(y | f̂) ) N(f̂; m_X, k_XX) (2π)^{n/2} |(K^{−1} + W)^{−1}|^{1/2}

Recall from earlier chapters that for type-II maximum likelihood


estimation, it is typically more convenient to compute the log of the
evidence term, since it is numerically more stable:

log q(y | X) = log p(y | f̂) − (1/2)(f̂ − m_X)^| K_XX^{−1}(f̂ − m_X) − (1/2) log( |K| · |K^{−1} + W| )
             = ∑_{i=1}^n log σ(y_i f̂_{x_i}) − (1/2)(f̂ − m_X)^| K_XX^{−1}(f̂ − m_X) − (1/2) log |B|

From there, we can use an optimization algorithm that follows the


gradient of log q(y | X ) in order to find the best set of hyperparam-
eters.

Last layer Laplace approximation


Now, we show that it is possible to construct Laplace approxima-
tion for the last layer. This brings us a step closer towards Bayesian
Deep Learning.

Consider the following assumption for the distribution of the


parameters of the last layer v and the corresponding likelihood:
p(v) = N(v; µ, Σ),   p(y | f_X) = ∏_{i=1}^n σ(y_i f_{x_i})   with   v ∈ R^F, φ_X ∈ R^{n×F}

Then, we construct the log-posterior:

log p(v | y) = log p(y | v) + log p(v) − log p(y)
             = ∑_{i=1}^n log σ(y_i φ_{x_i}^| v) − (1/2)(v − µ)^| Σ^{−1} (v − µ) + const

From there, we only need to compute the gradient and the Hessian:

∇ log p(v | y) = ∑_{i=1}^n ∇ log σ(y_i φ_{x_i}^| v) − Σ^{−1}(v − µ),
  with   ∂ log σ(y_i φ_{x_i}^| v) / ∂v_j = [φ_{x_i}]_j ( (y_i + 1)/2 − σ(φ_{x_i}^| v) )

∇∇^| log p(v | y) = ∑_{i=1}^n ∇∇^| log σ(y_i φ_{x_i}^| v) − Σ^{−1},
  with   ∂² log σ(y_i φ_{x_i}^| v) / ∂v_a ∂v_b = −[φ_{x_i}]_a [φ_{x_i}]_b σ(φ_{x_i}^| v)(1 − σ(φ_{x_i}^| v)) =: −[φ_{x_i}]_a [φ_{x_i}]_b w_i
=: −(W + Σ^{−1}) = −( φ_X^| diag(w) φ_X + Σ^{−1} ) ∈ R^{F×F}

. . . and we have all the ingredients for Laplace approximation. Note


that this is still a convex optimization problem since we only used
the last layer.
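A minimal numpy sketch (ours) of this last-layer Laplace step: given fixed features φ_X from a trained network, labels y ∈ {−1, +1} and a Gaussian prior on v, a few Newton steps on the concave log-posterior yield the MAP weights and the Laplace covariance. All names are ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def last_layer_laplace(Phi, y, Sigma, mu, max_iter=100):
    # Phi: (n, F) last-layer features, y in {-1, +1}, prior N(v; mu, Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    v = mu.copy()
    for _ in range(max_iter):
        s = sigmoid(Phi @ v)
        grad = Phi.T @ ((y + 1) / 2 - s) - Sigma_inv @ (v - mu)   # gradient of log posterior
        W = s * (1 - s)
        H = -(Phi.T * W) @ Phi - Sigma_inv                        # Hessian of log posterior
        step = np.linalg.solve(H, grad)
        v = v - step                                              # Newton step (uphill, H is negative definite)
        if np.max(np.abs(step)) < 1e-10:
            break
    Psi = np.linalg.inv(-H)                                       # Laplace covariance (W + Sigma^{-1})^{-1}
    return v, Psi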

Deep Learning
Consider a deep feedforward neural network:
p(y | W) = ∏_{i=1}^n σ(f_W(x_i)),
f_W(x) = w_L^| φ(w_{L−1} φ(... (w_1 x) ...))

We know that standard deep learning amounts to (“type-I”) maxi-


mum a-posteriori estimation:
W^* = arg max_W p(W | y) = arg min_W −∑_{i=1}^n log σ(f_W(x_i)) − log p(W)
    = arg min_W −∑_{i=1}^n log σ(f_W(x_i)) + (β/2)‖W‖² =: arg min_W J(W)

One crucial problem of deep learning is overconfidence – the network outputs predictions with high confidence even in domains that are far away from the data (see Fig. 57). This has been shown in the following theorem:

[Figure 57: The problem of overconfidence.]

Theorem 53 (Hein et al. 2019). Let R^d = ∪_{r=1}^R Q_r and f|_{Q_r}(x) = U_r x + c_r be the piecewise affine representation of the output of a ReLU network on Q_r. Suppose that U_r does not contain identical rows for all r = 1, ..., R, then for almost any x ∈ R^d and any ε > 0, there exists a δ > 0 and a class i ∈ {1, ..., k} such that it holds softmax(f(δx), i) ≥ 1 − ε. Moreover, lim_{δ→∞} softmax(f(δx), i) = 1.

This provides the motivation to progress towards Bayesian Deep


Learning. We start by replacing the point estimate prediction p(y =
1 | x ) = σ( f W ( x )) with the marginal:
Z
p(y = 1 | x ) = σ ( f W ( x )) p(W | y) dW

In order to compute this quantity, we need to approximate the


posterior on W by Laplace:

p(W | y) ≈ N(W; W^*, (∇∇^| J(W^*))^{−1}) =: N(W; W^*, Ψ)

We know that in a general deep learning setting, f is not linear in


the weights W. For this reason, we construct a linear approxima-
tion:
f_W(x) ≈ f_{W^*}(x) + G(x)(W − W^*)   where   G(x) = d f_W(x)/dW |_{W=W^*}
From there, we can approximate the probability distribution over
functions using linear algebra:
p(f_W(x)) = ∫ p(f | W) p(W | y) dW ≈ N(f(x); f_{W^*}(x), G(x)ΨG(x)^|) =: N(f(x); m(x), v(x))

and approximate the marginal (MacKay, 1992) as:


p(y = 1 | x) ≈ σ( m(x) / √(1 + π v(x)/8) )
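In code (our sketch), the approximate predictive probability of the linearized model is a one-liner given m(x) and v(x):

import numpy as np

def predictive_prob(m, v):
    # MacKay's approximation: p(y=1|x) ≈ sigmoid( m / sqrt(1 + pi * v / 8) )
    return 1.0 / (1.0 + np.exp(-m / np.sqrt(1.0 + np.pi * v / 8.0)))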

Using these findings, it is possible to bound the confidence of the predictions of the network. The following theorem formalizes the statement:

[Figure 58: Uncertainty of the predictions. The shade indicates the level of confidence.]

Theorem 54 (Kristiadi et al., 2020). Let f_W : R^n → R be a binary ReLU classification network parametrized by W ∈ R^p with p ≥ n, and let N(W | W^*, Ψ) be the approximate posterior. Then for any input x ∈ R^n, there exists an α > 0 such that for any δ ≥ α, the confidence σ(|z(δx)|) is bounded from above by the limit lim_{δ→∞} σ(|z(δx)|). Furthermore,

lim_{δ→∞} σ(|z(δx)|) ≤ σ( |u| / ( s_min(J) √(π/8 λ_min(Ψ)) ) ),

where u ∈ R^n is a vector depending only on W and the n × p matrix J := ∂u/∂W |_{W^*} is the Jacobian of u w.r.t. W at W^*.
The good news is that Bayesian Deep Learning is not necessarily costly, if one is willing to use approximations, such as:

• a low-rank approximation of the Hessian

• a block-diagonal approximation of the Hessian

• the Hessian of the last layer

• or even just the diagonal of the Hessian

(BackPACK for pytorch is a collection of lightweight extensions for second-order quantities (curvature and variance), available at backpack.pt. F. Dangel, F. Künstner, P. Hennig. BackPACK: Packing more into Backprop. ICLR 2020.)

Condensed content

Support Vector Machines

• arise if the empirical risk has zero gradient for large values of f
(→ hinge-loss)

• unfortunately, this does not amount to a log likelihood, so there


is no natural probabilistic interpretation (and thus uncertainty)
for the SVM

Generalized Linear Models

• extend Gaussian (process) regression to non-Gaussian likeli-


hoods

• the Laplace approximation yields a computationally lightweight


approximate posterior for such models. It is better than a point-
estimate, but one has to take care to ensure it is working, espe-
cially if the likelihood is not log-concave

Bayesian Deep Learning

• deep neural networks can have badly calibrated uncertainty


when used as (MAP) point estimates

• Laplace approximations can fix this issue

• Laplace approximations are not for free, but feasible for many
deep models, and easy to implement
Exponential Family v

The biggest obstacles in probabilistic inference are of


computational nature. Nevertheless, this is yet another chapter
where we introduce a new item in our toolbox that will help us
keep things tractable. The hardest part of computing the posterior
in Bayes’ rule,
p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx,

comes from the integral for the evidence Z = ∫ p(y | x) p(x) dx.
Naturally, computing expectations with respect to the posterior
E_{p(x|y)}[f(x)] = ∫ f(x) p(x | y) dx

as well as solving the optimization problem to obtain the (Type-2)


Maximum-A-Posteriori solution

x^* = arg max_x p(x | y)

also inherits from this difficulty. Using Gaussian distributions, we


have seen that we can sidestep the problem: if we have a Gaussian
prior and a Gaussian likelihood, the posterior is also Gaussian.
This resulted in the ability to compute (some of) the quantities
analytically.

Conjugate Priors v

We will extend this idea with the concept of conjugate priors. A prior
is said to be conjugate to a likelihood if the posterior arising from
the combination of the likelihood and the conjugate prior has the
same form as the prior. The Gaussian prior is the conjugate prior to
the Gaussian likelihood.

For binary distributions, such as the probability π that a coin


flip shows head, we have that the likelihood of seeing a heads and b
tails in a sequence x1 , . . . , x a+b is

p ( x | π ) = π a (1 − π ) b .

In an earlier chapter, we have seen that the Beta distribution (wikipedia.org/wiki/Beta_distribution) was

a sensible choice of prior. With hyper-parameters α, β, it is defined


as
p(π) = B(π; α, β) = π^{α−1}(1 − π)^{β−1} / B(α, β),

where the beta function, B(α, β), is the normalization constant. This choice of prior leads to a posterior that is also a Beta distribution,

p(π | x) = B(π; α + a, β + b) = π^{α+a−1}(1 − π)^{β+b−1} / B(α + a, β + b).
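As a tiny worked example (ours, with made-up numbers), the conjugate update is just a pair of additions:

import numpy as np

alpha, beta = 2.0, 2.0                       # Beta prior hyperparameters (arbitrary)
flips = np.array([1, 0, 1, 1, 0, 1])         # toy coin-flip data, 1 = heads
a, b = int(flips.sum()), int(len(flips) - flips.sum())

alpha_post, beta_post = alpha + a, beta + b  # posterior is Beta(alpha + a, beta + b)
posterior_mean = alpha_post / (alpha_post + beta_post)   # E[pi | x]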

For Categorical distributions, where an observation x can be one of K classes, if we denote n_k as the number of observations of class k, and π_k as the probability of seeing an observation of class k, then the likelihood is given by

p(x | π) = ∏_k π_k^{n_k}.

Taking a Dirichlet distribution (wikipedia.org/wiki/Dirichlet_distribution) with hyperparameters α_1, ..., α_K as the prior,

p(π) = D(α_1, ..., α_K) = (1 / B(α_1, ..., α_K)) ∏_k π_k^{α_k−1},

leads to a Dirichlet posterior, p(π_1, ..., π_K | x) = D(α_1 + n_1, ..., α_K + n_K).


For an additional example involving inferring the mean and
covariance of the Gaussian distribution, see v .

The formal definition of a conjugate prior is as follows; for more details, see E. Pitman, Sufficient statistics and intrinsic accuracy, Mathematical Proceedings of the Cambridge Philosophical Society, 1936, and P. Diaconis and D. Ylvisaker, Conjugate priors for exponential families, Annals of Statistics, 1979.

Definition 55 (Conjugate prior). Let D and x be a data-set and a variable to be inferred, connected by the likelihood p(D | x) = ℓ(D; x). A conjugate prior to ℓ for x is a probability measure with PDF p(x) = π(x, θ) of functional form π, such that

p(x | D) = ℓ(D; x) π(x; θ) / ∫ ℓ(D; x) π(x; θ) dx = π(x; θ′).
That is, the posterior arising from ` is of the same functional form
as the prior, with updated parameters.
As we will see in the next chapter, conjugate priors allow for ana-
lytic Bayesian inference.

Exponential Family v

Definition 56 (Exponential Family). A probability distribution over


a variable x ∈ X ⊂ Rn with the functional form
p_w(x) = h(x) exp( φ(x)^⊤ w − log Z(w) ) = (h(x) / Z(w)) exp( φ(x)^⊤ w ),
is called an exponential family of probability measures. The func-
tion φ : X → Rd is called the sufficient statistics, the parameters
w ∈ Rd are called the natural parameters of pw , the normalization
constant Z : Rd → R is the partition function, and the function
h( x ) : X → R+ is the base measure.

Some examples of Exponential Family distributions, with sufficient


statistics and domain, are shown in Table 1.

Table 1: (Incomplete) list of Exponential Family distributions. (See wikipedia.org/wiki/Exponential_family#Table_of_distributions for a more exhaustive list.)

Distribution    φ(x)                    X
Bernoulli       [x]                     {0, 1}
Poisson         [x]                     R+
Laplace         [1, x]                  R
χ²              [x, −log x]             R
Dirichlet       [log x]                 R+
Euler (Γ)       [x, log x]              R+
Wishart         [X, log det X]          {X ∈ R^{N×N} : v^⊤Xv ≥ 0 ∀v ∈ R^N}
Gaussian        [x, xx^⊤]               R^N
Boltzmann       [X, triag(XX^⊤)]        {0, 1}^N

Exponential Families have Conjugate priors. Taking the


Exponential Family likelihood
 
p_w(x | w) = exp( φ(x)^⊤ w − log Z(w) ),

the prior parametrized with α and ν,

p_α(w | α, ν) = exp( [w; −log Z(w)]^⊤ [α; ν] − log F(α, ν) ) = exp( α^⊤ w − ν log Z(w) − log F(α, ν) ),

where the normalization constant F(α, ν) is given by

F(α, ν) = ∫ exp( α^⊤ w − ν log Z(w) ) dw,

gives rise to the posterior

p_α(w | α, ν) ∏_{i=1}^n p_w(x_i | w) ∝ p_α( w | α + ∑_i φ(x_i), ν + n ).

The predictive distribution p(x) can similarly be computed,

p(x) = ∫ p_w(x | w) p_α(w | α, ν) dw
     = ∫ exp( (φ(x) + α)^⊤ w − (ν + 1) log Z(w) − log F(α, ν) ) dw
     = F(φ(x) + α, ν + 1) / F(α, ν).

Computing F (α, ν) can be tricky, and in general is the main chal-


lenge when constructing an Exponential Family distribution.

Sufficient statistics “suffice” for maximum likelihood


estimation. To see why, take the likelihood of n i.i.d. samples for
an exponential family,
p(x_1, ..., x_n | w) = ∏_{i=1}^n p(x_i | w) = exp( ∑_{i=1}^n φ(x_i)^⊤ w − n log Z(w) ).

The maximum likelihood estimate for w is found at

∇_w log p(x_1, ..., x_n | w) = 0   ⇒   ∇_w log Z(w) = (1/n) ∑_{i=1}^n φ(x_i).

Hence, it suffices to collect the statistics φ(x_i), compute ∇_w log Z(w), and solve for w^*.
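For instance (our own sketch), for the Gaussian with sufficient statistics φ(x) = [x, x²], matching ∇_w log Z(w) = E_{p_w}[φ(x)] to the empirical average of the sufficient statistics recovers the familiar maximum likelihood estimates:

import numpy as np

x = 1.0 + 2.0 * np.random.default_rng(2).standard_normal(1000)   # toy samples

phi_bar = np.array([x.mean(), (x ** 2).mean()])   # (1/n) sum_i phi(x_i), phi(x) = [x, x^2]

mu_ml = phi_bar[0]                      # E[x]
var_ml = phi_bar[1] - phi_bar[0] ** 2   # E[x^2] - E[x]^2
# (mu_ml, var_ml) solve grad_w log Z(w) = phi_bar for the Gaussian family.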

Exponential families allow for analytical computation


of integrals. In particular, we can analytically compute the ex-
pectation of the sufficient statistic with respect to the exponential
family. Because it holds that ∫_X dp_w(x) = 1, we have:

∇_w ∫ p_w(x | w) dx = ∫ ∇_w p_w(x | w) dx
                    = ∫ φ(x) dp_w(x | w) − ∫ ∇_w log Z(w) dp_w(x | w) = 0

⟹ E_{p_w}(φ(X)) = ∇_w log Z(w)

Note that the first expression is 0, because we are essentially taking


the derivative of a constant.

It is important to clarify that both of these nice properties (maxi-


mum likelihood estimation, and being able to analytically compute
integrals) hinge on the fact that log Z (w) is analytically known.

Conjugate prior inference v

In this section we explore how we can use exponential families


to learn distributions. In fact, we will find out that we can even do
Bayesian distribution regression, also known as conjugate prior
inference. For this purpose, we introduce the following quantity:

Definition 57 (Kullback-Leibler divergence). Let P and Q be probability distributions over X with pdf's p(x) and q(x), respectively. The KL-divergence from Q to P is defined as:

D_KL(P ∥ Q) := ∫ log( p(x) / q(x) ) dp(x)

(Some important properties: D_KL(P ∥ Q) ≠ D_KL(Q ∥ P); D_KL(P ∥ Q) ≥ 0 for all P, Q (Gibbs' inequality); D_KL(P ∥ Q) = 0 ⟺ p ≡ q almost everywhere.)

Now back to our problem: assume we are given samples [ xi ]i=1,...,n


s.t. xi ∼ p( x ). Using these samples, we would like to approximate
the distribution p( x ) using an exponential family:

p(x) ≈ p̂(x | w) = exp( φ(x)^⊤ w − log Z(w) )

Maximum Likelihood Regression on distributions


First, we tackle this problem through maximum likelihood estima-
tion. In particular, to find ŵ, consider:

ŵ = arg min_{w∈R^d} D_KL( p(x) ∥ p̂(x | w) )
  = arg min_{w∈R^d} ∫ ( log p(x) − log p̂(x | w) ) dp(x)
  = arg min_{w∈R^d} ∫ log p(x) dp(x) − E_p(φ(x))^⊤ w + log Z(w) =: arg min_{w∈R^d} L_log(w),

where the first term is the negative entropy −H(p) and does not depend on w.

Now, we can find the minimum at ∇w Llog (w) = 0, where:


E_p(φ(x)) ≈ (1/n) ∑_{i=1}^n φ(x_i) = ∇_w log Z(w)

MAP Regression on distributions


Next, we tackle the problem through MAP estimation. In this case,
we consider the following conjugate prior:
 | 
p F (w | α, ν) = exp w α − ν log Z (w) − log F (α, ν)

Note that we do not need to know its normalizer F. Then, in order


to find ŵ, consider:

ŵ = arg min_{w∈R^d} D_KL( p(x) ∥ p̂(x | w) ) − (1/n)( α^⊤ w − ν log Z(w) )
  = arg min_{w∈R^d} ∫ log p(x) dp(x) − E_p(φ(x))^⊤ w + log Z(w) − (1/n)( α^⊤ w − ν log Z(w) ) =: arg min_{w∈R^d} L̃_log(w),

where, again, ∫ log p(x) dp(x) = −H(p) does not depend on w.

Now, we can find the minimum at ∇w L̃log (w) = 0, where:


E_p(φ(x)) ≈ (1/n) ∑_{i=1}^n φ(x_i) = ((n + ν)/n) ∇_w log Z(w) − α/n

Full Bayesian Regression on distributions


Finally, we can do a full Bayesian treatment to our problem. In
particular, we can compute the posterior on w, using the above
mentioned conjugate prior:
p(w | x, α, ν) = ∏_{i=1}^n p_w(x_i | w) p_F(w | α, ν) / ∫ p(x | w) p(w | α, ν) dw = p_F( w | α + ∑_i φ(x_i), ν + n )

Note that in this case, we do need to know the normalizer of the


prior. Let us see the interpretation of this approach from the statisti-
cal viewpoint. Note that the Hessian of the posterior at the mode is
given by the following expression:
|
∇∇ p F (w | α, ν)|w? =arg max p(w|α,ν) = −νp(w? |α, ν)∇w ∇w log Z (w? )

As we increase the number of data points n → ∞, the posterior concentrates at w⋆. In particular, the curvature at the mode increases because ν = νprior + n → ∞, with:

∇w log Z(w⋆) = α/n + (1/n) ∑_{i=1}^n φ(xi) → E_p(φ(x))

In turn, this results in the maximum likelihood solution:

w⋆ = arg min_w DKL( p(x) ‖ p_w(x | w) )

For an example of constructing an exponential family and fitting a distribution, see the sketch below.
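A minimal Python sketch of these ideas, using the Bernoulli distribution written as an exponential family (natural parameter w, sufficient statistic φ(x) = x, log-partition log Z(w) = log(1 + eʷ)); the data and the hyperparameter values are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)          # iid samples from the unknown p(x)
alpha, nu = 1.0, 2.0                        # conjugate-prior hyperparameters (illustrative)

# conjugate-prior update: only the sufficient statistics of the data are needed
alpha_post = alpha + x.sum()                # alpha + sum_i phi(x_i)
nu_post = nu + x.size                       # nu + n

# MAP estimate: solve grad_w log Z(w) = sigmoid(w) = alpha_post / nu_post for w
target = alpha_post / nu_post
w_map = np.log(target / (1.0 - target))     # inverse sigmoid
print(w_map, 1.0 / (1.0 + np.exp(-w_map)))  # natural parameter and implied mean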

Condensed Content

• Conjugate Priors allow analytic inference of nuisance parameters


in probabilistic models.

• Exponential Families (EF) guarantee the existence of conjugate


priors, although not always tractable ones.

• They also allow analytic MAP inference from only a finite set of
sufficient statistics.

• Conjugate prior inference with exponential families is a form


of Bayesian regression on distributions (Gaussian process in-
ference, in this sense, is inference on the unknown mean of a
Gaussian distribution).

– Given data x1 , . . . , x N drawn iid. from unknown p( x ), consider


approximating p( x ) ≈ pw ( x | w) with an EF.
– The maximum likelihood and MAP estimates for w can be
computed in O ( N ).
– If the conjugate prior to pw (which itself is an EF) is tractable,
it allows for full Bayesian inference.
– Asymptotically, the posterior concentrates around the maxi-
mum likelihood estimate, which is the minimizer of the KL-

divergence DKL pk pw within the exponential family.

• The hardest part is finding the normalization constant. In fact,


finding the normalization constant is the only hard part.
Graphical Models v

Keeping track of the entire hypothesis space is combi-


natorially hard. In the earlier chapters, we have seen that to
represent a full joint probability distribution over four variables
would require 2⁴ − 1 = 15 parameters,

p( a, b, c, d) = p( a|b, c, d) p(b|c, d) p(c|d) p(d),

but removing irrelevant conditions (based on domain knowledge)


can reduce the number of required parameters. If, for example,

p( a, b, c, d) = p( a|b, c) p(b|d) p(c) p(d),

then we only need 8 parameters to represent the distribution.


We will see that Graphical models provide a nice language to
convey this independence information.

Directed Graphical Models v

Recall the procedure of constructing directed graphical models (or Bayesian networks):

1. For each variable in the joint distribution, draw a circle.

2. For each term p(x1, . . . | y1, . . .) in the factorized joint distribution, draw an arrow from every parent (right side, yi) to every child (left side, xi).

3. Fill in all observed variables (variables we want to condition on).

Figure 59: Directed Graphical Model for the factorization p(A, E, B, R) = p(A | E, B) p(R | E) p(E) p(B).

leading to a graphical model such as the one shown in Fig. 59.

Repeated observations and hyperparameters can be expressed using some syntactic sugar to make it easier to draw complex graphical models. A box with sharp edges drawn around a set of nodes and labeled with a number n is called a plate and denotes n copies of the content of the box. A small filled circle denotes a (hyper-)parameter that is set or optimized, and which is not part of the generative model.

Figure 60: Plates and Hyperparameters

p(y, w) = ∏_{i=1}^n N(yi; φ(xi)ᵀ w, σ²) · N(w; µ, Σ)

Independence and Directed Graphs


By the product rule, every joint probability distribution can be
factorized, but not every factorization is useful. Directed graphs are
also an imperfect representation, as a joint probability distribution
can have multiple factorizations, each leading to a different graph
expressing some of the independencies, but not all. Remember the
atomic independence structure we surveyed in an earlier chapter:

p(A, B, C)                 DAG           Independence    But!
p(C | B) p(B | A) p(A)     A → B → C     A ⊥⊥ C | B      A 6⊥⊥ C
p(A | B) p(C | B) p(B)     A ← B → C     A ⊥⊥ C | B      A 6⊥⊥ C
p(B | A, C) p(A) p(C)      A → B ← C     A ⊥⊥ C          A 6⊥⊥ C | B

Figure 61: Independence structure for tri-variate subgraphs.
A more general statement about independence, called d-separation, comes from Pearl²⁶, and the following presentation is from Bishop²⁷.

²⁶ J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
²⁷ Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. URL https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

Theorem 58 (d-separation). Consider a general directed acyclic graph, in which A, B, C are non-intersecting sets of nodes whose union may be smaller than the complete graph. To ascertain whether A ⊥⊥ B | C, consider all possible paths, regardless of the direction, from any node in A to any node in B. Any such path is considered blocked if it includes a node such that either

• the arrows on the path meet either head-to-tail or tail-to-tail at


the node, and the node is in C, or

• the arrows meet head-to-head at the node, and neither the node,
nor any of its descendants is in C.

If all paths are blocked, then A is said to be d-separated from B by


C, and A ⊥ ⊥ B|C.

Thus, all further considerations about computations on the graph


can be made in a local fashion based on Markov Blankets.

Definition 59 (Markov Blanket – for directed graphs). The Markov Blanket of node xi is the set of all parents, children, and co-parents²⁸ of xi. Conditioned on the blanket, xi is independent of the rest of the graph.

²⁸ The co-parents of x are the (other) parents of the children of x.

Figure 62: Example of a Markov Blanket for a directed graph

The directed nature of connections in Bayesian belief networks reflects the fact that a conditional probability has a left- and right-hand side, p(x | a). This is convenient since it allows writing down the graph directly from the factorization. However, conditional independence statements (d-separation) are tricky: blocking a path requires notions of parents and co-parents, and different rules depending on whether arrows meet head-to-head or head-to-tail. Moreover, there are joint distributions whose set of conditional independences cannot be represented by a single directed graph.

Undirected Graphical Models v

Undirected Graphical Models, or Markov Random Fields (MRF), are another notation in which conditional independence can be stated as "two nodes are independent if all paths connecting them are blocked".

Figure 63: Markov Random Field with separating set

Definition 60 (Markov Random Field). An undirected graph G = (V, E) is a set V of nodes and edges E. G and a set of random variables mapping to the nodes X = {Xv}_{v∈V} form a Markov Random Field if, for any subsets A, B ⊂ V and a separating set S (a set such that every path from A to B passes through S), X_A ⊥⊥ X_B | X_S.

The above definition is known as the global Markov property. It implies the weaker pairwise Markov property: any two nodes u, v that do not share an edge are conditionally independent given all other variables: X_u ⊥⊥ X_v | X_{V\{u,v}}.

Markov Blankets are simpler for Markov Random Fields;

Definition 61 (Markov Blanket – for undirected graphs). For a


xi
Markov Random Field, the Markov Blanket of node xi is the set of
all direct neighbors of xi . Conditioned on the blanket, xi is indepen-
dent of the rest of the graph.

Essentially, MRFs allow for a more compact definition of condi- Figure 64: Markov Blanket for a
Markov Random Field
tional independence compared to directed graphs. Nevertheless, the
associated joint probability distribution cannot be easily read from
the graph.

By the pairwise Markov property, any two nodes xi , x j not con-


nected by an edge have to be conditionally independent given the
rest of the graph. Thus, the joint factorizes into

p( xi , x j | x\{i,j} ) = p( xi | x\{i,j} ) p( x j | x\{i,j} ).

Hence, for the factorization to hold, nodes that do not share an edge must not be in the same factor. This leads to the use of cliques to define the factorization.

Definition 62 (Clique). Given a graph G = (V, E), a clique is a subset c ⊂ V such that there is an edge between all pairs of nodes in c. A maximal clique is a clique such that it is impossible to include any other nodes from V without it ceasing to be a clique.

Figure 65: A clique (in gold) and maximal clique (in red)
Any distribution p( x ) that satisfies the conditional independence
structure of the graph G can be written as a factorization over all
cliques – but also just over all maximal cliques, since any clique is
part of at least one maximal clique. Using the set of all maximal
cliques C gives

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} ψc({xi ∈ c}).

In directed graphs, each factor p( xch | xpa ) had to be a probability


distribution of the children, and not of the parents. In MRFs, there
is no distinction between parents and children, so we only know
that each potential function ψc ({ xi ∈ c}) ≥ 0. The normalization
constant Z is the partition function

Z := ∫ ∏_{c∈C} ψc({xi ∈ c}) dx1 · · · dxn.

Because of the loss of structure from directed to undirected graphs,


we have to explicitly compute Z. This can be NP-hard, and is the
primary downside of MRFs; for n discrete variables with k states
each, computing Z may require summing kⁿ terms.

The Boltzmann distribution


Markov Random Fields with positive potentials (ψc ({ xi ∈ c}) > 0)
are Exponential Families, since we can write

ψc ({ xi ∈ c}) = exp(− Ec ({ xi ∈ c}))

for some function Ec, and introduce the scaling factors wc to get

p(x1, . . . , xn) = exp( − ∑_{c∈C} wc Ec({xi ∈ c}) − log Z ).

This gives rise to a Boltzmann distribution (or Gibbs measure);

Definition 63 (Boltzmann distribution). A probability distribution


with PDF of the form

p( x ) = exp(− E( x ))

is called a Boltzmann or Gibbs distribution, and E( x ) is known as


the energy function.

Any Gibbs measure (and any MRF) is an exponential family;


it may not necessarily be of the helpful kind as Z (wc ) can be in-
tractable.

The Gaussian case


For a set of variables x1, . . . , xn that are jointly Gaussian distributed,

p(x) = N(x; µ, Σ),

the MRF can be constructed directly from the inverse covariance. Recall that if the inverse covariance (precision) matrix contains a zero at element [Σ⁻¹]ij, then xi ⊥⊥ xj | x\{i,j}. This implies that an edge exists between nodes xi and xj exactly when [Σ⁻¹]ij ≠ 0.
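As a small illustration (a sketch with a made-up precision matrix), the edges of the Gaussian MRF can be read directly off the nonzero off-diagonal entries of Σ⁻¹:

import numpy as np

# tridiagonal precision encoding the chain x1 - x2 - x3 - x4
P = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if P[i, j] != 0]
print(edges)                      # [(0, 1), (1, 2), (2, 3)] -- the chain graph

Sigma = np.linalg.inv(P)          # the covariance itself is dense:
print(np.round(Sigma, 2))         # conditional independence is visible in P, not in Sigma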

Condensed content

Directed Graphical Models (Bayesian Networks)

• directly encode a factorization of the joint, as it can be read off by


parsing the graph from the children to the parents.

• However, reading off conditional independence structure is


tricky as it requires considering d-separation.

• Directed graphs are a direct mapping of a generative process. For


this reason, they tend to be useful in highly structured problems
with mixed data types, such as physical, biological, chemical or
social processes, where the causal structure is known.

• When you want to model a process for which you have a “sci-
entific” theory or some generative knowledge, writing down the
directed model is a good start.

Undirected Graphical Models (Markov Random Fields)

• directly encode the conditional independence structure.

• However, reading off the joint from the graph is tricky as it re-
quires calculating the normalization constant, which is usually
intractable.

• MRFs tend to be useful in particularly regular, but high-dimensional


problems with unclear generative model, such as those encoun-
tered in computer vision and statistical physics.

• When your model has millions of parameters, and you are more
worried about computational complexity than interpretability,
the conditional independence structure of MRFs can help keep things tractable.
Factor graphs v

So far, we utilized directed and undirected graphs as


tools to graphically represent and inspect properties of
joint probability distributions. Both are primarily a design
tool, each with its strengths and weaknesses. In this chapter, we will
introduce a third type of graphical models, along with a general-
purpose algorithm for automated inference and efficient Maximum-
A-Posteriori computation.

From Directed to Undirected Graphs

Before we introduce this new form of graphs, let us first observe


how we can transition from a directed to an undirected graph.
Given a directed graph, it is possible to find an equivalent undi-
rected graph. For some models, such as Markov Chains, this is
straightforward:

p(x) = p(x1) p(x2 | x1) · · · p(xn | xn−1)
     = (1/Z) ψ1,2(x1, x2) · · · ψn−1,n(xn−1, xn).

Figure 66: Directed and undirected graph for a Markov chain over x1, x2, . . . , xn.

In general, we need to ensure that each conditional term in


the directed graph is captured in at least one clique of the undi-
rected graph (see Fig. 67). For nodes with only one parent, we can
drop the arrow, thus obtaining p( xc | x p ) = ψc,p ( xc , x p ). However,
for nodes with several parents, we have to connect all their par-
ents. This process is known as moralization, and frequently leads to
densely connected graphs, losing all value of the structure.
Figure 67: Directed to undirected, after moralization.

Strengths and weaknesses


Directed and undirected graphs offer tools to graphically represent
and inspect properties of joint probability distributions. Both are
primarily a design tool, and each framework has its strengths and
weaknesses.
In Fig. 68, the conditional independence properties of the di-
rected graph on the left cannot be represented by any MRF over
the same three variables; and the conditional independence proper-
ties of the MRF on the right cannot be represented by any directed
graph on the same four variables.

Figure 68: Example for the limits of graphical models. Left (directed graph over A, B, C): A ⊥⊥ B | ∅ and A 6⊥⊥ B | C. Right (MRF over A, B, C, D): x 6⊥⊥ y | ∅ for all x, y, while C ⊥⊥ D | A ∪ B and A ⊥⊥ B | C ∪ D.

To formalize this, consider a distribution p( x) and its graphical


representation G = (Vx , E).
If every conditional independence statement satisfied by the dis-
tribution can be read off the graph, then G is called a D-map of p.
For example, the fully disconnected graph is a trivial D-map for
every p, since this graph models all possible conditional indepen-
dence statements for the given variables, thus including the ones
implied by p.
On the other hand, if every conditional independence statement
implied by G is also satisfied by p, then G is called an I-map of p.
For example, the fully connected graph is a trivial I-map for every
p, since it doesn’t model any conditional independence statements
for the given variables, hence p satisfies this trivial empty set of
conditional independence statements.
A graph G that is both an I-map and a D-map of p is called a
perfect map of p. The set of distributions p for which there exists
a directed graph that is a perfect map is distinct from the set of p
for which there exists a perfect MRF map (see Fig 68; on the other
hand, Markov Chains are an example where both the MRF and the
directed graph are perfect). There exist p for which neither a directed nor an undirected graph is a perfect map (e.g. two coins and a bell).

In order to alleviate some of the limitations of the directed and


undirected graphs, in the next section we introduce factor graphs.

Factor Graphs

Factor Graphs are an explicit representation of functional relationships;

Definition 64 (Factor graph). A factor graph is a bipartite graph G = (V, F, E) of variables v ∈ V, factors f ∈ F and edges E, such that each edge connects a factor to a variable.

Figure 69: Example of a factor graph, with variable nodes x1, x2, x3 and factors in boxes.

To construct a factor graph from a directed graph

p(x) = ∏_ch p(x_ch | x_pa(ch)),

draw a circle for each variable xi, a box for each conditional in the factorization and connect each xi to the factorizations it appears in.

Figure 70: Conversion of a directed graph for a parametric regression to a factor graph.

To construct a factor graph from an MRF

p(x) = (1/Z) ∏_{c∈C} ψc({xi ∈ c}),

draw a circle for each variable xi , a box for each factor (clique) ψc
and connect each ψc to the variables used in the factor.

Some properties of Factor Graphs

Factor Graphs can express structure not visible in MRFs: the same MRF clique over x1, x2, x3 can correspond to different factor graphs, for example one with a single factor f connecting all three variables and one with separate factors fa and fb.
Sometimes, they can mask conditional independence structures: the factorizations p(x1, x2, x3) = p(x3 | x1, x2) p(x1) p(x2) and p(x1, x2, x3) = p(x1, x2 | x3) p(x3) can lead to the same factor graph over x1, x2, x3.

But they can also reveal functional relationships: p(x) = p23(x2, x3 | x1) p(x1) yields a single factor p23 connecting x1, x2 and x3, whereas p(x) = p2(x2 | x1) p3(x3 | x1) p(x1) yields separate factors p2 and p3.

The graphical view itself does not always capture the entire
structure. Nevertheless, when factor graphs are encoded with an
explicit functional form, part of the structure can be automatically
deduced and used for inference. For this purpose, we introduce the
Sum-Product algorithm.
The Sum-Product Algorithm v

The Sum-Product, message passing, or Belief Propagation algorithm²⁹,³⁰,³¹ leverages the structure of factor graphs to perform inference. More precisely, it computes the marginal distribution

p(xi) = ∫ p(x1, . . . , xi, . . . , xn) dx_{j≠i},

given that the joint p(x1, . . . , xn) is represented by a factor graph.

²⁹ J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
³⁰ S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, 1988.
³¹ F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 2001.

Base case: Markov chains

Filtering and Smoothing are special cases of the sum-product algorithm on chains. For simplicity, consider the Markov Chain with discrete variables xi ∈ [1, . . . , k], such that

p(x0, . . . , xn) = (1/Z) ψ0,1(x0, x1) · · · ψn−1,n(xn−1, xn).

Figure 71: Factor Graph of a Markov Chain: x0 — ψ0,1 — x1 — · · · — ψn−1,n — xn

The marginal p(xi) is then given by

p(xi) = ∑_{x_{j≠i}} p(x0, . . . , xn)
      = (1/Z) ( ∑_{xi−1} ψi−1,i(xi−1, xi) · · · ( ∑_{x0} ψ0,1(x0, x1) ) ) · ( ∑_{xi+1} ψi,i+1(xi, xi+1) · · · ( ∑_{xn} ψn−1,n(xn−1, xn) ) )
      = (1/Z) µ→(xi) µ←(xi),

where the first bracket is defined as µ→(xi) and the second as µ←(xi),
with Z = ∑_{xi} µ→(xi) µ←(xi). The terms µ→(xi) and µ←(xi) are called messages, which can be computed recursively

µ→(xi) = ∑_{xi−1} ψi−1,i(xi−1, xi) µ→(xi−1),
µ←(xi) = ∑_{xi+1} ψi,i+1(xi, xi+1) µ←(xi+1).
By storing local messages, all marginals can be computed in O(nk²),
as in filtering and smoothing. Computing a message from the pre-
ceding one can be done by taking the sum of the product of the
local factors and incoming messages. The local marginal can be
computed by taking the sum of the product of incoming messages,
hence the name of the algorithm.
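The chain case is easy to implement. The following is a minimal Python sketch with k discrete states and made-up random potentials (all names are illustrative); the brute-force computation at the end checks one of the message-passing marginals:

import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 5                                    # k states, variables x_0, ..., x_n
psi = [rng.random((k, k)) for _ in range(n)]   # made-up positive pairwise potentials

mu_fwd = [np.ones(k) for _ in range(n + 1)]    # forward messages
mu_bwd = [np.ones(k) for _ in range(n + 1)]    # backward messages
for i in range(1, n + 1):
    mu_fwd[i] = psi[i - 1].T @ mu_fwd[i - 1]   # sum over x_{i-1}
for i in range(n - 1, -1, -1):
    mu_bwd[i] = psi[i] @ mu_bwd[i + 1]         # sum over x_{i+1}

marginals = [m_f * m_b for m_f, m_b in zip(mu_fwd, mu_bwd)]
marginals = [m / m.sum() for m in marginals]   # normalize by Z

# brute-force check of the marginal of x_2 against the full joint
joint = np.ones([k] * (n + 1))
for i in range(n):
    shp = [k if d in (i, i + 1) else 1 for d in range(n + 1)]
    joint = joint * psi[i].reshape(shp)
joint /= joint.sum()
print(np.allclose(marginals[2], joint.sum(axis=tuple(d for d in range(n + 1) if d != 2))))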

An interesting insight is that we can adapt the Sum-product


algorithm to compute the most probable state – the MAP estimate
of p(x0, . . . , xn):

max_{x0,...,xn} p(x0, . . . , xn) = (1/Z) max_{x0} · · · max_{xn} ψ0,1(x0, x1) · · · ψn−1,n(xn−1, xn)
                                 = (1/Z) max_{x0,x1} ( ψ0,1(x0, x1) · · · ( max_{xn} ψn−1,n(xn−1, xn) ) )

Alternatively, for the purpose of numerical stability, we could compute the log-max:

log max_{x0,...,xn} p(x0, . . . , xn) = max_{x0,x1} ( log ψ0,1(x0, x1) + · · · + max_{xn} log ψn−1,n(xn−1, xn) ) − log Z

Finally, we might be interested in the actual maximizer (x0, . . . , xn) that achieves the maximum. For this reason, we compute the argmax in a similar fashion:

arg max_{x0,...,xn} p(x0, . . . , xn) = arg max_{x0,x1} ( log ψ0,1(x0, x1) + · · · + arg max_{xn} log ψn−1,n(xn−1, xn) )

Translating these findings into the language of message passing


yields the Viterbi Algorithm.

Figure 72: The Viterbi Algorithm on a chain (trellis over the states of x0, . . . , x3 with factors f0,1, f1,2, f2,3). First, we initialize the message passed from node x0 to the factor f0,1. Next, we perform the message passing procedure recursively until the end of the graph. Lastly, in order to find the maximizers xi^max, we need to backtrack. For efficient backtracking, one can make use of the trellis structure.

µ_{x0→f0,1} = 0
µ_{f_{i−1,i}→xi}(xi) = max_{xi−1} ( log f_{i−1,i}(xi−1, xi) + µ_{xi−1→f_{i−1,i}}(xi−1) )
µ_{xi→f_{i,i+1}}(xi) = µ_{f_{i−1,i}→xi}(xi)
φ(xi) = arg max_{xi−1} ( log f_{i−1,i}(xi−1, xi) + µ_{xi−1→f_{i−1,i}}(xi−1) )
xi−1^max = φ(xi^max)

To summarize, the sum-product algorithm splits the inference


into local messages being sent forwards and backwards along the
factor graph, allowing for inferring both the local marginals, as well
as the most-probable state.

Sum-Product on Trees

The efficiency of the sum-product algorithm is preserved when,


instead of chains, the graph is a tree.
Definition 65 (Tree). An undirected graph is a tree if there is one,
and only one, path between any pair of nodes (such graphs have no
loops). A directed graph is a tree if there is only one node which
has no parent (the root); all other nodes only have one parent.
When such graphs are transformed into undirected graphs by mor-
alization, they remain a tree. A directed graph such that every pair
of nodes is connected by one and only one path is called a polytree.
When transformed into an undirected graph, such graphs generally
get loops, but the corresponding factor graph is still a tree.

Figure 73: Directed and undirected trees, along with their factor-graph correspondences.

Consider a tree-structured factor graph over x = [x1, . . . , xn] and pick any variable x ∈ x. Because the graph is a tree, we can write the joint p(x) as:

p(x) = ∏_{s∈ne(x)} Fs(x, xs)

where ne(x) are the neighbors of x, and Fs is the sub-graph of nodes xs other than x itself that are connected to neighbor s (which is itself a tree!).

Figure 74: Messages from factors to variables

Now, consider the marginal distribution p(x) = ∑_{x\x} p(x). By expanding the joint, we obtain:

p(x) = ∑_{x\x} ∏_{s∈ne(x)} Fs(x, xs)
     = ∏_{s∈ne(x)} ( ∑_{xs} Fs(x, xs) )
     = ∏_{s∈ne(x)} µ_{fs→x}(x)

where we define µ_{fs→x}(x) := ∑_{xs} Fs(x, xs).
The sub-graphs Fs(x, xs) themselves factorize further into tree-structured subgraphs:

Fs(x, xs) = fs(x, x1, . . . , xm) G1(x1, xs1) · · · Gm(xm, xsm)

where {x1, . . . , xm} are the nodes in xs and xsi are the neighbors of xi. Then, we obtain the factor-to-variable messages:

µ_{fs→x}(x) = ∑_{x1,...,xm} fs(x, x1, . . . , xm) ∏_{i∈ne(fs)\x} ( ∑_{xsi} Gi(xi, xsi) )
            = ∑_{x1,...,xm} fs(x, x1, . . . , xm) ∏_{i∈ne(fs)\x} µ_{xi→fs}(xi)

Figure 75: Further factorization of the subgraph Fs

So, in order to compute the factor-to-variable messages µ f s → x ( x ),


one needs to sum over the product of the factor and remaining sub-
graph-sums. The latter themselves are messages from the variables
connected to f s .

To complete this "inductive" formulation, we need to formalize the variable-to-factor messages. First, notice that the subgraphs Gi(xi, xsi) further factorize as follows:

Gi(xi, xsi) = ∏_{ℓ∈ne(xi)\fs} Fℓ(xi, xiℓ)

Figure 76: Messages from the variables to the factors

Then, for the variable-to-factor messages we have:

µ_{xi→fs}(xi) = ∑_{xsi} Gi(xi, xsi) = ∑_{xsi} ∏_{ℓ∈ne(xi)\fs} Fℓ(xi, xiℓ)
              = ∏_{ℓ∈ne(xi)\fs} ( ∑_{xiℓ} Fℓ(xi, xiℓ) )
              = ∏_{ℓ∈ne(xi)\fs} µ_{fℓ→xi}(xi)

So, in order to compute the variable-to-factor message µ_{xi→fs}(xi), take the product of all incoming factor-to-variable messages.

The sum-product algorithm then repeats those steps until reaching a leaf node. To initiate the messages at the leaves of the graph, which have no neighbors left, we define them to be unit for variable leaves and identities for factor leaves;

µ_{x→f}(x) := 1          (variable leaf)
µ_{f→x}(x) := f(x)       (factor leaf)

Figure 77: Messages from leaf nodes in the sum-product algorithm

To compute the marginal p( x ), we treat x as the root of the tree,


and perform the following operations:

1. initialize the leaf nodes:

• if leaf is a factor f ( x ), initialize µ f → x ( x ) := f ( x )


• if leaf is a variable x, initialize µ x→ f ( x ) := 1

2. pass messages from the leaves towards the root x:

   µ_{fℓ→xj}(xj) = ∑_{xℓj} fℓ(xj, xℓj) ∏_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi)

   µ_{xj→fℓ}(xj) = ∏_{i∈ne(xj)\fℓ} µ_{fi→xj}(xj)

3. at the root x, take the product of all incoming messages (and


normalize).

To get the marginal of each node, once the root has received all the
messages, pass messages from the root back to the leaves. Once
every node has received the messages from all their neighbors,
take the product of all incoming messages at each variable (and
normalize).
This implies that inference on the marginal of all variables in a
tree-structure factor-graph is linear in graph size.

Incorporating observations
If one or more nodes xo in the graph are observed (xo = x̂o ), we
introduce factors f ( xio ) = δ( xio − x̂io ) into the graph. This amounts
to “clamping” the variables to their observed value.
Say x := [ xo , xh ]. Because p( xo , xh ) ∝ p( xh | xo ), the sum-
product algorithm can thus be used to compute posterior marginal
distributions over the hidden variables xh .

Generalization to any graph


There is a generalization from trees to general graphs, known as the
junction tree algorithm. The principal idea is to join sets of vari-
ables in the graph into larger maximal cliques until the resulting
graph is a tree. The exact process, however, requires care to ensure
that every clique that is a sub-set of another clique ends up in that
clique.

The computational cost of probabilistic inference on the marginal


of a variable in a joint distribution is exponential in the dimension-
ality of the maximal clique of the junction tree, and linear in the
size of the junction tree. The junction tree algorithm is exact for any
graph (i.e. it produces correct marginals), and efficient in the sense
that, given a graph, in general there does not exist a more efficient
algorithm (without using properties of the functions instead of the
graph).

The Max-Product/Max-Sum Algorithm


In this section, we again tune our sum-product algorithm in order
to find the jointly most probable state xmax = arg max_x p(x). First, let us note that arg max_x p(x) is in general not the vector of individual marginal maximizers [arg max_{x1} p(x1), . . . , arg max_{xn} p(xn)]. For example, consider the joint distribution (marginals in the margins):

            x2 = 0   x2 = 1
  x1 = 0     0.3      0.4    | 0.7
  x1 = 1     0.3      0.0    | 0.3
  ---------------------------
             0.6      0.4

Here arg max_{x1} p(x1) = 0 and arg max_{x2} p(x2) = 0, but the jointly most probable state is (x1, x2) = (0, 1).

Nevertheless, the max operation satisfies the following properties:

• max(ab, ac) = a max(b, c) (for a ≥ 0)

• max( a + b, a + c) = a + max(b, c)

• log maxx p( x) = maxx log p( x)

Thus, we can compute the most probable state xmax by taking the
sum-product algorithm and replacing all summations with maxi-
mizations (the max-product algorithm). For numerical stability, we
can further replace all products of p with sums of log p (the max-
sum algorithm). The only complication is that, if we also want to
know the arg max, we have to track it separately using an addi-
tional data structure.

To compute the most probable state xmax, we choose any xi as the root of the tree, and perform the following operations:

(For brevity, we only look at the max-sum version of the algorithm. The max-product version is the same, up to the terms highlighted in red; see the slides for the differences.)

1. initialize the leaf nodes:

   • if leaf is a factor f(x), initialize µ_{f→x}(x) := log f(x)

   • if leaf is a variable x, initialize µ_{x→f}(x) := 0

2. pass messages from leaves towards root:

   µ_{fℓ→xj}(xj) = max_{xℓj} ( log fℓ(xj, xℓj) + ∑_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi) )

   µ_{xj→fℓ}(xj) = ∑_{i∈ne(xj)\fℓ} µ_{fi→xj}(xj)

3. additionally track an indicator for the identity of the maximum (note: this is a function of xj)

   φ(xj) = arg max_{xℓj} ( log fℓ(xj, xℓj) + ∑_{i∈ne(fℓ)\xj} µ_{xi→fℓ}(xi) )

4. once the root has messages from all its neighbors, pass messages from the root towards the leaves. At each factor node, set xℓj^max = φ(xj^max) (this is known as backtracking).
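For concreteness, here is a minimal max-sum (Viterbi) sketch on a discrete chain with made-up log-potentials; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 5
log_psi = [np.log(rng.random((k, k))) for _ in range(n)]   # log f_{i,i+1}, made up

mu = np.zeros(k)                 # message into x_0 (leaf): 0
backptr = []                     # phi(x_{i+1}): best predecessor state
for i in range(n):
    scores = log_psi[i] + mu[:, None]     # indexed [x_i, x_{i+1}]
    backptr.append(scores.argmax(axis=0))
    mu = scores.max(axis=0)               # message into x_{i+1}

x_max = [int(mu.argmax())]       # best final state, then backtrack
for phi in reversed(backptr):
    x_max.append(int(phi[x_max[-1]]))
x_max.reverse()
print(x_max)                     # jointly most probable state [x_0, ..., x_n]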

Condensed content

• Factor graphs provide graphical representation of joint proba-


bility distributions that is particularly conducive to automated
inference

• In factor graphs that are trees, all marginals can be computed


in time linear in the graph size by passing messages along the
edges of the graph using the sum-product algorithm.

• Computation of each local marginal is exponential in the dimen-


sionality of the node. Thus, in general, the cost of inference is
exponential in clique-size, but linear in clique-number.

• An analogous algorithm, the max-sum algorithm, can be used to


find the joint most probable state (also in linear time).

• Both algorithms fundamentally rest on the distributive properties

a(b + c) = ab + ac max( ab, ac) = a · max(b, c)

Message passing provides the general framework for managing


computational complexity in probabilistic generative models as far
as it is caused by conditional independence. It does not, however,
address complexity arising from the algebraic form of continuous
probability distributions. We already saw that exponential families
address this latter issue. However, not every distribution is an ex-
ponential family. A main theme for the remainder will be how to
project complicated joint distributions onto factor graphs of expo-
nential families.
Extended Example: Topic Modeling v

Summarizing the history of modern civilization is hard. This chapter starts off a series of lectures revolving around the idea of building a model of history. For this purpose, we will make use of a running example concerned with the State of the Union addresses.³¹ Traditionally, these are annual speeches that have been delivered by the presidents of the United States since 1790. The purpose of these addresses is to summarize the affairs of the US federal government, which usually cover several important topics, such as the nation's budget, news, healthcare, social policy, and many more. While the State of the Union (SotU) addresses are not a perfect reflection of the US history, these speeches are very suitable to work with. In particular, they have been historically regular, are entirely available in text format, and most importantly, are inherently topical. Our task in this series of lectures is to discover the topics of US history over time.

³¹ wikipedia.org/wiki/State_of_the_Union

Disclaimer: this is not a course in natural language processing! There is an entire toolbox of models for text analysis that will not be discussed here. The point of this exercise is to build craftware: customized, effective and efficient solutions to the learning task. Nevertheless, the model ultimately developed here is unusually expressive in its structure, and more flexible than standard tools.

A first look at the data v

In total, we will make use of D = 231 documents corresponding to speeches that have been delivered in the period 1790 - 2019 (with 2 speeches having been delivered in 1961 by Dwight D. Eisenhower and John F. Kennedy). The individual documents are roughly of length Id ∼ 10³ words. Even though presidents are known to be eloquent speakers, we approximate the usage of around V ∼ 10000 words from the vocabulary.
Since we are looking to reduce complexity, we have to throw out a
bit of structure. For the purpose of our analysis, we make two great
simplifications:
1. We remove redundant stop words required for human under-
standing, but carrying only negligible information (for example:
and, or, to).

2. We disregard the position of the words in the speeches, hence


modeling the texts as Bags of Words.
That being said, let us look at the first key quantity of interest – the
word frequency matrix X (see Fig. 78). The rows of the matrix rep-
resent each of the documents, i.e. speeches, whereas the columns
represent the identity of the word in the vocabulary.
Figure 78: The word frequency matrix (D documents × V words). Due to the large vocabulary size, a truncated representation is shown here.

Each item in the matrix represents the frequency of occurrence


of a given word, in a given document. By visual observation, one
can notice that certain words occur more often than others, and
inversely, one can notice that speeches usually focus on different
subsets of the words.

Figure 79: Low rank decomposition of the word frequency matrix: X (D documents × V words) ∼ Q (D documents × K topics) × Uᵀ (K topics × V words).

At a deeper level, we would be interested in the discovery of topics that the speeches revolve around. For this purpose, we begin our journey by looking into low rank matrix decomposition. Ultimately, we would like to decompose the matrix X in such a way that the discovery of the topics arises naturally (see Fig. 79). For this purpose, we start with dimensionality reduction.

Warning: the algorithms do not convey any more structure than they are designed to do. Assigning any personal interpretation to the results is inevitably subject to error.

Dimensionality reduction v

Consider a dataset X ∈ RD×V . Dimensionality Reduction aims to


find an encoding φ : RV → RK and a decoding ψ : RK → RV with
K ≪ V such that the encoded representation

Z := φ( X ) ∈ RD ×K

is a good approximation of X in the sense that some reconstruction


loss of X̃ = ψ( Z ),

L( X, ψ( Z )) = L( X, ψ ◦ φ( X ))

is minimized or small. This may be done to:

• save memory

• construct a low-dimensional visualization

• “find structure”
Linear PCA

Let us derive the famous Principal Component Analysis (PCA) algorithm. Again, consider a dataset X ∈ R^{D×V}. Furthermore, consider an orthonormal basis {ui}_{i=1,...,V}, uiᵀ uj = δij. Then, we can represent any point xd as a linear combination of the projections onto the orthonormal basis:

xd = ∑_{i=1}^V (xdᵀ ui) ui =: ∑_{i=1}^V αdi ui,   or simply vectorized:   X = (XU) Uᵀ

An approximation of the point xd in K < V degrees of freedom is given by any set (A, b, U) as follows:

x̃d := ∑_{k=1}^K adk uk + ∑_{ℓ=K+1}^V bℓ uℓ

In order to obtain the best approximation, find (A, b, U) such that the square empirical risk is minimized:

J = (1/D) ∑_{d=1}^D ‖xd − x̃d‖² = (1/D) ∑_{d=1}^D ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )²_v

First, let’s find adk and b j . Since the vectors u are orthonormal, recall
that ∑ j uij ukj = δik . Then, we simply differentiate with respect to the
parameters that we wish to optimize, and set the derivatives to zero
in order to obtain their optimal values:

∂J/∂adℓ = (2/D) ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )_v (−uℓv) = −(2/D) xdᵀ uℓ + (2/D) adℓ = 0
  =⇒ adk = xdᵀ uk

∂J/∂bℓ = (2/D) ∑_{d=1}^D ∑_{v=1}^V ( xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj )_v (−uℓv) = (2/D) ∑_{d=1}^D (−xdᵀ uℓ) + 2bℓ = 0
  =⇒ bj = x̄ᵀ uj,   where x̄ := (1/D) ∑_d xd

From here, the residual simplifies:

xd − x̃d = xd − ∑_{k=1}^K adk uk − ∑_{j=K+1}^V bj uj
        = ∑_{ℓ=1}^V (xdᵀ uℓ) uℓ − ∑_{k=1}^K (xdᵀ uk) uk − ∑_{j=K+1}^V (x̄ᵀ uj) uj
        = ∑_{j=K+1}^V ( (xd − x̄)ᵀ uj ) uj
Using this result, along with the following notation for the sample covariance matrix S := (1/D) ∑_{d=1}^D (xd − x̄)(xd − x̄)ᵀ, we obtain:

J = (1/D) ∑_{d=1}^D ‖xd − x̃d‖² = (1/D) ∑_{d=1}^D ∑_{j=K+1}^V ( (xd − x̄)ᵀ uj )²
  = ∑_{j=K+1}^V ujᵀ ( (1/D) ∑_{d=1}^D (xd − x̄)(xd − x̄)ᵀ ) uj
  = ∑_{j=K+1}^V ujᵀ S uj

Basically, we are almost done. In order to find a set of orthonormal


vectors ui that minimize the square reconstruction error J, choose U
as the eigenvectors of the sample covariance S. From there, we can
get the best rank K reconstruction x̃d by setting

x̃d := ∑_{k=1}^K adk uk + ∑_{j=K+1}^V bj uj = ∑_{i=1}^K (xdᵀ ui) ui + ∑_{i=K+1}^V (x̄ᵀ ui) ui

Accordingly, this yields the following loss:

J = ∑_{j=K+1}^V λj

where λ j are the eigenvalues of S, sorted in a descending order.


Equivalently, if we first center the data

X̂ = X − 1 x̄ᵀ   (resulting in b = 0)

then the orthonormal vectors U are the (right) singular vectors of

X̂ = Q Σ Uᵀ.
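As a small illustration, the following sketch (on made-up toy data) computes the rank-K PCA reconstruction via the SVD of the centered data matrix, as derived above:

import numpy as np

rng = np.random.default_rng(0)
D, V, K = 200, 10, 3
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, V)) + rng.normal(size=V)   # toy low-rank data

x_bar = X.mean(axis=0)
X_hat = X - x_bar                       # centering makes b = 0
Q, S, Ut = np.linalg.svd(X_hat, full_matrices=False)

U_K = Ut[:K].T                          # top-K principal directions (V x K)
A = X_hat @ U_K                         # low-dimensional codes a_d
X_tilde = A @ U_K.T + x_bar             # rank-K reconstruction
print(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))   # = sum of the discarded eigenvalues of S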

Probabilistic PCA v
We have seen several times that various statistical algorithms have
probabilistic interpretation. In this section, we explore the prob-
abilistic aspects of PCA, in order to better understand its implicit
assumptions.

We start by treating the loss, up to scaling, as a non-normalized


negative log likelihood:

J = −c · log p(X | X̃) + log Z = (1/D) ∑_{d=1}^D ‖xd − x̃d‖²

From here, it becomes obvious that the corresponding likelihood


should factorize as a product of D independent Gaussians:

p(X | X̃) = ∏_{d=1}^D N(xd; x̃d, σ² I)

We also need to encode that we want a low-dimensional, linear embedding. Furthermore, we want this embedding to be expressed in terms of independent (orthogonal) dimensions. Thus, consider the following representation of xd:

xd = V ad + µ + ε,   where p(ad) = N(0; I_K), V ∈ R^{V×K}, p(ε) = N(0; σ²)

In particular, ad is the low-dimensional latent representation, V is the linear mapping from lower to higher-dimensional space, µ is the global dataset shift, and ε is the noise term. The corresponding graphical model can be seen on Fig. 80.

Figure 80: A graphical model of probabilistic PCA.

That being said, the marginal likelihood can be formulated as:

p(X) = ∏_{d=1}^D ∫ p(xd | ad) p(ad) dad = ∏_d N(xd; µ, C)

where C := VVᵀ + σ² I. Now, the corresponding log marginal likelihood is:

log p(X) = −(DV/2) log(2π) − (D/2) log |C| − (1/2) ∑_{d=1}^D (xd − µ)ᵀ C⁻¹ (xd − µ)

Maximizing this expression with respect to µ yields:

x̄ = arg max_µ log p(X)

Thus, by plugging the maximizer back in, the maximum (log) likelihood can be written as:

log p(X) = −(D/2) ( V log(2π) + log |C| + tr(C⁻¹ S) )
where S is again the sample covariance matrix. Furthermore, it can
be shown that the maximum likelihood estimates for V and σ² are:

V_ML = U_{1:K} (Λ_K − σ² I)^{1/2} R

σ²_ML = (1/(V − K)) ∑_{j=K+1}^V λj

where R is a rotation matrix (RR| = IK ) and S = UΛU | . Notice


that, in the (probabilistic) maximum likelihood setting of PCA,
we obtained that the optimal projection V also relies on the first
K eigenvectors of the sample covariance matrix S, up to scaling
(ΛK − σ2 I )1/2 and rotation R. Furthermore, we obtained that the
Gaussian noise occurring during reconstruction corresponds to the
average of the smallest (V − K ) eigenvalues of S.

By setting σ2 , µ, U with their maximum likelihood estimates,


along with the rotation R = I, one obtains the posterior over the
latent variable:

p(ad | xd) = N( ad; (VᵀV + σ² I)⁻¹ Vᵀ (xd − x̄), σ² (VᵀV + σ² I)⁻¹ )
           = N( ad; Λ_K⁻¹ (Λ_K − σ² I_K)^{1/2} U_{1:K}ᵀ (xd − x̄), σ² Λ_K⁻¹ )

Primary results v
Now that we have obtained the first tool to analyze our dataset
with, let us see the primary results. If we ignore the preprocessing
steps, the implementation is a one-line solution in Python:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# bag-of-words counts (D x V); `preprocessed` holds the cleaned speeches
count_vect_lsa = CountVectorizer(max_features=VOCAB_SIZE, stop_words=['000'])
X_count = count_vect_lsa.fit_transform(preprocessed).toarray()

# low-rank factorization of the (uncentered) count matrix
U_, S_, V_T_ = np.linalg.svd(X_count, full_matrices=False)

Notice that we use the SVD approach for factorization, yet we didn't subtract the mean from the dataset Xcount. Strictly speaking, this is not exactly PCA, but a rather popular algorithm in the field of Natural Language Processing known as Latent Semantic Indexing. Nevertheless, the singular value decomposition (SVD) minimizes ‖X − QΣUᵀ‖²_F, for orthonormal matrices Q ∈ R^{D×K} and U ∈ R^{V×K}, and a diagonal Σ ∈ R^{K×K} with positive diagonal entries (the singular values). We might naïvely think of Q as a mapping from documents to topics, Uᵀ from topics to words, and Σ as the relative strength of topics.

Figure 81: SVD factorization into orthonormal matrices Q ∈ R^{D×K} and U ∈ R^{V×K}, and a diagonal Σ ∈ R^{K×K}.

In fact, if we look at the first 5 topics (rows of Uᵀ), sort the columns (words in vocabulary) by their intensity in decreasing order, and select the words for each topic with the largest values, we obtain the following results:

1. tonight fight taxis faith century today enemy fellow

2. year program world new work need help america

3. dollar war program fiscal year expenditure million united



4. man law dollar business national corporation legislation labor

5. administration policy energy program continue development

For each of the samples, one could postulate about the potential
topics from which the words were generated. However, there are
several problems with our approach:

• the matrices Q and U are in general dense: Every document con-


tains contributions from every topic, and every topic involves all
words.

• the entries in Q, U, and Σ are hard to interpret: They do not


correspond to probabilities

• the entries of Q, U can be negative (what does it mean to have a


negative topic?)

In the next chapter we look into how one could resolve these issues.
Latent Dirichlet Allocation v

Designing your own craftware is critically important


for good performance. After having built most of our toolbox,
we turn to designing a probabilistic machine learning model for
topic modeling that is adequate for our dataset on the State of the
Union addresses. Here is a general overview of the main steps
when designing your own model:

1. get the data

• try to collect as much meta-data as possible


• take a close look at the data

2. build the model

• identify quantities and data structures; assign names


• design a generative process (graphical model)
• assign (conditional) distributions to factors/arrows (use expo-
nential families!)

3. design the algorithm

• consider conditional independence


• try standard methods for early experiments
• run unit-tests and sanity-checks
• identify bottlenecks, find customized approximations and
refinements

4. Test the Setup

5. Revisit the Model and try to improve it, using creativity

As we discussed the issues of PCA in the last chapter, our goal now
is to create a model with the following properties:

• document sparsity: each document d should only contain a


small number of topics

• word sparsity: each topic k should only contain a small number


of the words v in the vocabulary

• non-negativity: a topic can only contribute positively to a docu-


ment.
Figure 82: Our wanted model structure: W (D documents × V words) ∼ Π (D documents × K topics) × Θ (K topics × V words)

Since the Dirichlet distribution encodes sparsity, and its values are nonnegative, it thereby fulfills our required properties for an adequate model. For this reason, we will build a Latent Dirichlet Allocation model. We can think of LDA as a sparsity-inducing, non-negative dimensionality reduction technique. It can be applied to any kind of grouped discrete data, but we will focus on its application for natural language processing. In that sense, it is a generative probabilistic model that assumes a topic being a mixture over a set of words and a document being a mixture over a set of topic probabilities. The words are our only observed quantities, whereas everything else is a latent variable.

The Dirichlet distribution is defined as

p(π | α) = D(π; α) = ( Γ(∑k αk) / ∏k Γ(αk) ) ∏_{k=1}^K πk^{αk−1} = (1/B(α)) ∏_{k=1}^K πk^{αk−1}

Figure 83: Graphical model for Latent Dirichlet Allocation (αd → πd → cdi → wdi ← θk ← βk, with plates over i = [1, . . . , Id], d = [1, . . . , D], and k = [1, . . . , K]).

The generative model is as follows (for graphical representation, see Fig. 83); to draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:

• Draw K topic distributions θk over V words from
  p(Θ | β) = ∏_{k=1}^K D(θk; βk)

• Draw D document distributions πd over K topics from
  p(Π | α) = ∏_{d=1}^D D(πd; αd)

• Draw topic assignments cdik of word wdi from
  p(C | Π) = ∏_{i,d,k} πdk^{cdik}

• Draw word wdi from
  p(wdi = v | cdi, Θ) = ∏_k θkv^{cdik}

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1, . . . , ndkV] and ndk· = ∑v ndkv, etc.

The generative model is based on the assumption that each document d is generated by first choosing a distribution πd over topics, and then generating each word wdi at random from the word distribution θk of an assigned topic cdi chosen from πd. The two distributions, Π and Θ, are Dirichlet distributions with respective parameters α and β. Notice that α determines the document-topic density – a larger value
leads to more topics per document. Furthermore, β determines the
topic-word density – a larger value leads to more words per topic.
To simplify matters, we will fix these quantities to constant values.
Lastly, note that C is a sparse matrix containing information about
which word belongs to which topic.
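To make the generative process concrete, here is a minimal ancestral-sampling sketch; the sizes and the symmetric hyperparameters α, β are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
D, K, V, I_d = 4, 3, 20, 50          # documents, topics, vocabulary size, words per doc
alpha, beta = 0.5, 0.1               # smaller values encourage sparser mixtures

Theta = rng.dirichlet(beta * np.ones(V), size=K)   # topic-word distributions (K x V)
Pi = rng.dirichlet(alpha * np.ones(K), size=D)     # document-topic distributions (D x K)

W, C = [], []
for d in range(D):
    c_d = rng.choice(K, size=I_d, p=Pi[d])                      # topic assignment per word
    w_d = np.array([rng.choice(V, p=Theta[k]) for k in c_d])    # word drawn from its topic
    C.append(c_d); W.append(w_d)
print(W[0][:10], C[0][:10])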

Designing an algorithm v

Since this is a learning problem, we are interested in inferring the


latent variables Π, Θ and C given the observations W. This amounts
to solving

p(C, Π, Θ | W) = p(C, Π, Θ, W) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ
               = p(W | C, Π, Θ) · p(C, Π, Θ) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ
Let us take a deeper look at the joint p(C, Π, Θ, W ). Using the prop-
erties of Directed Graphical models, we can factorize the joint:
p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(W | C, Θ) · p(Θ | β)
             = ( ∏_{d=1}^D p(πd | αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} p(cdi | πd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} p(wdi | cdi, Θ) ) · ( ∏_{k=1}^K p(θk | βk) )
             = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K θ_{k,wdi}^{cdik} ) · ( ∏_{k=1}^K D(θk; βk) )

For now, let us focus on the first two factors, p(Π | α) · p(C | Π). By further expanding the Dirichlet and utilizing the notation for ndkv mentioned above, we obtain:

p(Π | α) · p(C | Π) = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} )
                    = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} )
                    = ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·}
One could perform similar steps for the remaining two factors, p(Θ | β) · p(W | C, Θ). Then, we can formulate the joint as follows:

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)
             = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )

For a moment, let us briefly go back to the original factorization of


the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D D(πd; αd) ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K πdk^{cdik} ) · ( ∏_{d=1}^D ∏_{i=1}^{Id} ∏_{k=1}^K θ_{k,wdi}^{cdik} ) · ( ∏_{k=1}^K D(θk; βk) )
Now, if we had Π, Θ (which we don't), then the posterior p(C | Θ, Π, W) would be easy to compute from the expression above:

p(C | Θ, Π, W) = p(W, C, Θ, Π) / ∑_C p(W, C, Θ, Π) = ∏_{d=1}^D ∏_{i=1}^{Id} [ ∏_{k=1}^K (πdk θ_{k,wdi})^{cdik} / ∑_{k′} πdk′ θ_{k′,wdi} ]

Note that this conditional independence can easily be read off from
the graph (see Fig. 83). Recall the definition for a Markov blanket in
directed graphical models: when for a given variable (in our case
C) we condition on its parents (Π), children (W) and co-parents (Θ),
the terms in the variable become independent i.e. factorize.

Now, let us go back to the derived factorization of the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )

If we had C (which we don't), given the above expression, the posterior p(Θ, Π | C, W) simplifies:

p(Θ, Π | C, W) = p(C, W, Π, Θ) / ∫ p(Θ, Π, C, W) dΘ dΠ
              = ( ∏_d D(πd; αd) ∏_k πdk^{ndk·} ) ( ∏_k D(θk; βk) ∏_v θkv^{n·kv} ) / p(C, W)
              = ( ∏_d D(πd; αd: + nd:·) ) ( ∏_k D(θk; βk: + n·k:) )

Note that this conditional independence can not be easily read off
from the above graph!

At this point, it is a good idea to open up our toolbox and rea-


son about which tool can help us overcome the computational
complexities of our model. An idea arises – why not use a Markov
Chain Monte Carlo method to construct an approximate posterior
distribution? Since we have derived analytical formulations for the
conditionals P(C | Θ, Π, W ), P(Θ | C, W ), and P(Π | C, W ), it seems
like a good idea to use Gibbs Sampling. That being said, for the
algorithm to work, one iterates between:

Θ ∼ p(Θ | C, W) = ∏_k D(θk; βk: + n·k:)

Π ∼ p(Π | C, W) = ∏_d D(πd; αd: + nd:·)

C ∼ p(C | Θ, Π, W) = ∏_{d=1}^D ∏_{i=1}^{Id} [ ∏_{k=1}^K (πdk θ_{k,wdi})^{cdik} / ∑_{k′} πdk′ θ_{k′,wdi} ]

by sampling from a Dirichlet distribution for Θ and Π, as well as


from a categorical distribution for C. Note that sampling from these
distributions is comparably easy and there are various libraries that
provide such methods. For this algorithm to work, all we have to
keep around are the counts n (which are sparse) and Θ, Π which

are comparably small. Thanks to factorization, much can be done in


parallel.

Unfortunately, this sampling scheme is relatively slow to move


out of initialization, because C strongly depends on Θ and Π, and
vice versa. For this reason, in order to obtain results in reasonable
time, properly vectorizing the code is of high importance.
Efficient Inference and K-Means v

The biggest adversary of the probabilistic approach is


computational complexity. In this chapter, we seek to fur-
ther optimize the sampling strategy in Latent Dirichlet Allocation,
which alternated between sampling C, and then Π and Θ. Later in
the chapter, we turn to a particularly popular algorithm for cluster-
ing data, known as K-Means. Naturally, we will seek to provide a
probabilistic interpretation.

Collapsed Gibbs Sampling v

As we have already discussed, the previous implementation of La-


tent Dirichlet Allocation alternates between sampling from C, and
then Π and Θ. A collapsed sampling method can converge much
faster by eliminating the latent variables that mediate between indi-
vidual data. For this reason, we develop an implementation which
does not need to sample Π and Θ, rather only samples C. Let’s
revisit the last formulation that we derived for the joint:

p(C, Π, Θ, W) = ( ∏_{d=1}^D (Γ(∑k αdk) / ∏k Γ(αdk)) ∏_{k=1}^K πdk^{αdk−1+ndk·} ) · ( ∏_{k=1}^K (Γ(∑v βkv) / ∏v Γ(βkv)) ∏_{v=1}^V θkv^{βkv−1+n·kv} )
             = ( ∏_{d=1}^D (B(αd + nd:·) / B(αd)) D(πd; αd + nd:·) ) · ( ∏_{k=1}^K (B(βk + n·k:) / B(βk)) D(θk; βk + n·k:) )

If we marginalize out Θ and Π, we obtain:

p(C, W) = ( ∏_{d=1}^D B(αd + nd:·) / B(αd) ) · ( ∏_{k=1}^K B(βk + n·k:) / B(βk) )
        = ∏_d [ ( Γ(∑_{k′} αdk′) / Γ(∑_{k′} (αdk′ + ndk′·)) ) ∏_k Γ(αdk + ndk·) / Γ(αdk) ] · ∏_k [ ( Γ(∑_v βkv) / Γ(∑_v (βkv + n·kv)) ) ∏_v Γ(βkv + n·kv) / Γ(βkv) ]

Now, we compute the probability of assigning topic k to word wdi, given all other topic assignments C\di and words W (recall Γ(x + 1) = x · Γ(x) for all x ∈ R+):

p(cdik = 1 | C\di, W) = (αdk + ndk·\di)(βk,wdi + n·k,wdi\di)(∑v βkv + n·kv\di)⁻¹ / ∑_{k′} (αdk′ + ndk′·\di)(βk′,wdi + n·k′,wdi\di)(∑v βk′v + n·k′v\di)⁻¹

Essentially, we are done with our improvement of the algorithm.


The algorithm we end up with is quite simple – we only need to
keep track of some counter variables, and loop over the sampling

of word-topic assignments for the desired number of iterations.


Note that the distributions Π and Θ can be inferred later using the
counts we obtain after convergence. Following is a pseudocode
implementation for the algorithm:
procedure LDA(W, α, β)
    ndkv ← 0  ∀ d, k, v                                  ▷ initialize counts
    while true do
        for d = 1, . . . , D;  i = 1, . . . , Id do       ▷ can be parallelized
            cdi ∼ p(cdik = 1 | C\di, W) ∝ (αdk + ndk·\di)(βk,wdi + n·k,wdi\di)(∑v βkv + n·kv\di)⁻¹   ▷ sample assignment
            n ← UpdateCounts(cdi)                         ▷ update counts (check whether first pass or repeat)
        end for
    end while
end procedure
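A minimal Python sketch of this collapsed Gibbs sampler (on a made-up toy corpus with symmetric hyperparameters) could look as follows; it only maintains the count arrays and resamples one assignment at a time:

import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 20
W = [rng.integers(V, size=50) for _ in range(4)]     # made-up corpus: word indices per document
alpha, beta = 0.5, 0.1
D = len(W)

n_dk = np.zeros((D, K))          # words in document d assigned to topic k
n_kv = np.zeros((K, V))          # times word v is assigned to topic k
n_k = np.zeros(K)                # total words assigned to topic k
C = [rng.integers(K, size=len(w)) for w in W]        # random initial assignments
for d, w in enumerate(W):
    for i, v in enumerate(w):
        k = C[d][i]; n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

for _ in range(200):             # Gibbs sweeps
    for d, w in enumerate(W):
        for i, v in enumerate(w):
            k = C[d][i]          # remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
            p = (alpha + n_dk[d]) * (beta + n_kv[:, v]) / (V * beta + n_k)
            k = rng.choice(K, p=p / p.sum())          # sample the new assignment
            C[d][i] = k
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

Theta = (beta + n_kv) / (V * beta + n_k)[:, None]     # point estimates from the final counts
Pi = (alpha + n_dk) / (K * alpha + np.array([len(w) for w in W]))[:, None]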

K-Means v

Until now, we have mostly looked into supervised problems – given


input-output pairs ( xi , yi )i=1,...,N , we recovered a function mapping
x to y. This problem structure includes regression, classification,
predicting time series, and structured output prediction. On the
other side of the spectrum are unsupervised problems, where we are
only given data ( xi )i=1,...,n , but no labels y. In this type of setting,
one could perform:

• Generative Modeling: assume the samples are generated indepen-


dently from some distribution p. Can we generate more samples
from p?

• Clustering: can we split the samples into [1, . . . , C] classes?

For an example of a clustering problem, consider the recording of eruptions from the Old Faithful geyser³². The data captures the relationship between the waiting time (the time since the last eruption) and the duration of the eruptions.

³² A. Azzalini and A. W. Bowman. A look at some data on the Old Faithful geyser. Applied Statistics 39, pages 357–365, 1990. URL https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat

Figure 84: Possible clustering of the Old Faithful dataset (waiting time [mins] against duration [mins]).

This dataset contains some structure, with two “blobs” of data-


points. An example of a possible clustering of this dataset is shown
in Fig. 84. In this chapter we look into how one can perform cluster-
ing, whereas in the next chapter we illustrate that clustering can be
seen as a subtype of Generative Modeling.

The K-means algorithm is one of the oldest clustering methods,


with a relatively simple implementation:

1. Initialize by creating K means {mk }k=1,...,K at random values

2. Assign each datapoint xi to the nearest mean by solving

   ki = arg min_k ‖mk − xi‖²

   and define binary responsibilities

   rki = 1 if ki = k, and 0 otherwise.

3. Update the means to be the means of each cluster

   mk ← (1/Rk) ∑_i rki xi,   where Rk = ∑_i rki.

4. Repeat the Assign and Update steps until the assignments do


not change.
Figure 85: Example run of the 2-means
algorithm on the Old Faithful dataset.
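A minimal numpy sketch of this loop (on made-up toy data) could look as follows:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]               # initialize means at random points
    for _ in range(iters):
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)   # squared distances (n x K)
        k = d2.argmin(axis=1)                                 # assign step
        new_m = np.array([X[k == c].mean(axis=0) if (k == c).any() else m[c] for c in range(K)])
        if np.allclose(new_m, m):                             # means stable -> assignments stable
            break
        m = new_m                                             # update step
    return m, k

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3])])
means, labels = kmeans(X, K=2)
print(means)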

The K-means algorithm always converges, as there is a function de-


scribing the “amount of error” of the assignment which decreases
at each step.

Definition 66 (Lyapunov Function). In the context of iterative al-


gorithms, a Lyapunov Function J is a positive function of the algo-
rithm’s state variables that decreases in each step of the algorithm.

The existence of a Lyapunov function J means that one can think


of the algorithm as an optimization routine which guarantees con-
vergence at a local minimum.

For K-means, the function in question is

J(r, m) = ∑_{i=1}^n ∑_{k=1}^K rik ‖xi − mk‖².
The Assign step decreases J by definition. The Update step also decreases J because it is convex³³ in mk, and finds a local minimum in mk:

∂J(r, m)/∂mk = −2 ∑_i rik (xi − mk) = 0   ⇒   mk = ∑_i rik xi / ∑_i rik.

³³ A function is convex if its Hessian is positive (semi-)definite everywhere. This holds if the second partial derivatives are positive everywhere, which is true in our case: ∂²J(r, m)/∂mk² = 2 ∑_i rik ≥ 0.

However, the Lyapunov function is not convex in both r and m. For

this reason it can have multiple local minima, resulting in different


outcomes based on the initialization. Even though the K-means
algorithm can work well, it has some issues:

• It has no way to set K.


Coupled with the random initialization issue, this can lead to
hard-to-interpret results, as shown in Fig. 86.

Figure 86: Two runs of 4-means on a toy dataset. Different random initializations lead to different clusters.

• It cannot set the shape of the clusters.

  Due to the choice of distance function ‖xi − mk‖², the clusters


are assumed to be spherical. This can lead to issues, as shown in
Fig. 87.
Figure 87: Two examples of shape
issues with K-means.

K-means is a simple algorithm that always finds a stable clustering.


However, the resulting clusters can seem unintuitive, as they may
not capture the structure that we would expect. A probabilistic
interpretation of K-Means will yield clarity, and allow for fitting all
the necessary parameters.
Mixture Models & EM v

Probabilistic treatment allows for deeper understand-


ing. Last chapter we introduced K-Means, which is a simple algo-
rithm that always finds a stable clustering. However, we discovered
that the resulting clusterings can be unintuitive, as they do not
capture the shape of clusters or their number, and are subject to
random fluctuations. In this chapter, we discover that a probabilistic
interpretation of K-Means yields clarity, and allows fitting all the
parameters. As a neat side effect, it will lead us to the final entry in
our toolbox.

Gaussian Mixture Models v

Soft K-Means

In the last chapter we have discussed the main challenges of K-


Means. One idea to address some of these issues is to relax the hard
assignments rik = 1{k = arg minc kmc − xi k2 }. Appropriately,
this yields the Soft K-Means algorithm, which uses the softmax
approach:
exp(− βkmk − xi k2 )
rik = 2
.
∑c exp(− βkmc − xi k )

Soft K-means allows points to be partly assigned to several clusters


at the same time. How “soft” this assignment is, depends on the
stiffness parameter β. As β → 0, the assignments get more uniform
on all clusters; and as β → ∞ we get back to K-means. Even though
this does not resolve all of the issues (in particular, the assump-
tion that the shape of the clusters is spherical), this leads the way
towards a probabilistic interpretation of K-means.

Refined Soft K-Means

We continue our refinement of the K-Means algorithm by taking


a closer look at the loss. In many (but not all) cases, we know that
minimizing the empirical risk is frequently identified with maxi-
mizing the likelihood. Given the Lyapunov function minimized by
K-Means, let us try to identify the corresponding likelihood that is
152 probabilistic machine learning

being maximized:

n K
(r, m) = arg min ∑ ∑ rik k xi − mk k2
r,m i k
n K
= arg max ∑ ∑ rik (−1/2σ−2 k xi − mk k2 ) + const
i k
n  K 
= arg max ∏ ∑ rik exp −1/2σ−2 k xi − mk k2 /Z
i k
n K
= arg max ∏ ∑ rik N ( xi ; mi , σ2 I )
i k
= arg max p( x | m, r )

Given the likelihood, we realize that one probabilistic interpreta-


tion of the clustering problem is through the lens of a (generative)
Gaussian Mixture model. Naturally, the model makes the assump-
tion that the data points are generated from Gaussian distributions.
However, there are multiple underlying Gaussians, and we do not
know which one generated each datapoint.

For K clusters with means and variances (µk , Σk )k=1,...,K , the genera-
tive process first chooses which cluster to draw from with probabil-

ity πk and then samples from N x; µk , Σk .
The likelihood model can be written as
K  K
p( x |π, µ, Σ) = ∑ πk N x; µk , Σk with πk ∈ [0, 1], ∑ πk = 1.
k =1 k =1

Figure 88: Gaussian Mixture


(a) Data generated by 3 Gaussians,
(b) Same data – but we do not know
the cluster assignment,
(c) Possible clustering after identifica-
tion of the cluster.

0.2
Given a dataset x1 , . . . , xn , we want to learn the generative model 0.3
0.5
(π, µ, Σ) (see Fig. 90 for a graphical representation), using the likeli-
hood
n K  Figure 89: Generative model matching
p( x |π, µ, Σ) = ∏ ∑ πk N xi ; µk , Σk . Fig. 92.
i =1 k =1

Ideally, we would like to do Bayesian inference

p( x |π, µ, Σ) p(π, µ, Σ)
p(π, µ, Σ| x ) = ,
p( x )

but since the likelihood is not an exponential family, there is no ob-


vious conjugate prior. Furthermore, the posterior does not factorize
over µ, π, Σ, since for example µ 6⊥⊥ π | x.
mixture models & em v 153

Therefore, we try to maximize the (log) likelihood for π, µ, Σ:


π
 
n k
log p( x | π, µ, Σ) = ∑ log ∑ π j N (xi ; µ j , Σ j ) , where
i j
1 | −1
µk Σk
e− 2 ( x −µ) Σ ( x −µ) )
N ( x; µ, Σ) =
(2π )d/2 |Σ|1/2 k

To maximize w.r.t. µ, set the gradient of the log likelihood to 0:


n π j N ( xi ; µ j , Σ j ) x
∇µ j log p( x | π, µ, Σ) = − ∑ Σ −1 ( x i − µ j )
i ∑ j0 π j N ( xi ; µ j , Σ j ) j
| {z }
=:r ji n
n
1
∇µ j log p = 0 ⇒ µj =
Rj ∑ r ji xi R j := ∑ r ji Figure 90: Graphical Model for Gaus-
sian Mixture Model
i i

To maximize w.r.t. Σ set gradient of log likelihood to 0:

1 n π j N ( x i ; µ j , Σ j )  −1 
∇Σ j log p( x | π, µ, Σ) = − ∑ Σ ( xi − µ j )( xi − µ j )| Σ−1 − Σ−
j
1
2 i ∑ j0 π j N ( xi ; µ j , Σ j )
| {z }
=:r ji
1
n ∂|Σ|−1/2 /∂Σ = − |Σ|−3/2 |Σ|Σ−1
1 | 2
∇Σ j log p = 0 ⇒ Σj =
Rj ∑ r ji (xi − µ j )(xi − µ j ) R j := ∑ r ji ∂(v| Σ−1 v)/∂Σ = −Σ−1 vv| Σ−1
i i

To maximize w.r.t. π, enforce ∑ j π j = 1 by introducing a Lagrange


multiplier λ and optimize
 
n N ( xi ; µ j , Σ j )
∇π j log p( x | π, µ, Σ) + λ ∑ π j − 1 = ∑ +λ
0 π j N ( xi ; µ j , Σ j )
j i ∑j

Set the derivative to 0 and multiply both sides by π j :

n N ( xi ; µ j , Σ j ) n
0= ∑ πj ∑ j0 π j N ( xi ; µ j , Σ j )
+ λπ j = ∑ rij + λπ j
i i

Now, if we sum the above expression over all j, and use the con-
straint ∑ j π j = 1, we obtain the optimal value for λ:

λ = −n

Therefore, for the optimal value of π j we obtain:

Rj
πj =
n
If we know the responsibilities rij , we can optimize µ, Σ, π ana-
lytically. And if we know µ, π, we can set rij ! This leads us to the
following algorithm:

1. Initialize µ, π (e.g., random µ, uniform π, identity Σ).

2. Set
π j N ( xi ; µ j , Σ j )
rij =
∑kj0 π j0 N ( xi ; µ j0 , Σ j0 )
154 probabilistic machine learning

3. Update
n
1
Rj = ∑ r ji µj =
Rj ∑ rij xi
i i
n Rj
1
Σj =
Rj ∑ rij (xi − µ j )(xi − µ j )| πj =
n
i

4. Go back to 2.
This algorithm might seem arbitrary at first, but it is a case of the
Expectation-Maximization algorithm, which fits a probabilistic model
by alternating between (1) computing the expectation of some latent
variables – the responsibilities; and (2) maximizing the likelihood of
the parameters – the cluster parameters.

To make the connection to (soft) K-Means more apparent, consider


a diagonal covariance matrix Σ j = β−1 I for all j = 1, . . . , k. Then, for
the responsibilities we obtain:

π j N ( xi ; µ j , Σ j ) R j exp(− βk xi − m j k2 )
rij = =
∑kj0 π j0 N ( xi ; µ j0 , Σ j0 ) ∑ j0 R j0 exp(− βk xi − m j0 k2 )

So, we can conclude that the EM algorithm is indeed a refinement


of soft K-means. Notice that again, for β → ∞, we get back the
classic K-means.

Interestingly, some of the implicit assumptions (or rather patholo-


gies) of K-means become apparent once we have this probabilis-
tic perspective. In particular, we can deduce that K-means is the
maximum-likelihood estimate of a hard-assignment Gaussian Mix-
ture Model, with cluster variances σ2 → 0 (since β → ∞), resulting
π
in point-mass Gaussian components.

Expectation Maximization v
zi:
Let us note that, even though we have spent a lot of time and en-
ergy deriving the probabilistic interpretation of K-Means, this was
µk Σk
in fact not our ultimate goal.
k
We wanted to find a particular algorithmic structure that can be
used for probabilistic generative models where it is not straightfor- xi
ward to find a maximum likelihood expression in closed form. This
is exactly what the EM algorithm achieves. n
Figure 91: Graphical model for the
Let us first revisit the Gaussian Mixture Model, and introduce the Gaussian mixture with latent variables
z
latent variable z so that things simplify. Consider the binary ran-
dom variable zij ∈ {0; 1} s.t. ∑ j zij = 1. We define:

p(zij = 1) = π j p ( x i | z j = 1) = N ( x i ; µ j , Σ j )

Then, for the marginal we obtain:


k
p ( xi ) = ∑ p(z = j) p(xi | z = j) = ∑ π j N (x; µ j , Σ j )
j j
mixture models & em v 155

From there, we can easily compute the posterior for zij

p(zij = 1) p( xi | zij = 1, µ j , Σ j )
p(zij = 1 | xi , µ, Σ) =
∑kj0 p(zij0 = 1) p( xi | zij0 = 1, µ j , Σ j )
π j N ( xi ; µ j , Σ j )
=
∑ j0 π j0 N ( xi ; µ j0 , Σ j0 )
= rij

So it turns out that the responsibilities rij are the marginal posterior
probability ([E]xpectation) for zij = 1! In the previous chapter, we
have seen that if we knew the cluster responsibilities rij , we could
optimize µ, Σ and π analytically, and vice-versa. We did not know
z, so we replaced it with its expectation, leading to the Expectation-
Maximization algorithm, which repeats the two following steps:

(E) Compute the expectation of the latent variables

(M) Maximize the likelihood w.r.t. the parameters

Generic EM
The EM algorithm attempts to find maximum likelihood estimates
for models with latent variables. In this section, we describe a more
abstract view of EM which can be extended to other latent variable
models. Let x be the entire set of observed variables and z the entire
set of latent variables. We are interested in finding the maximum
(log) likelihood estimate for the model:
!
θ? = arg max log( P( x | θ )) = arg max log ∑ p(x, z | θ )
θ θ z

As we noted above, the existence of the sum inside the logarithm


prevents us from applying the log to the densities which results
in a complicated expression for the MLE. Now suppose that we
observed both x and z. We call { x, z} the complete data set, and we
say x is incomplete.

Again, if we knew z (which we don’t), the maximization would


be easy, since there would be only one term in the sum. Notice
however, that the information we do have about z is contained in
the posterior of the latent variables P(z | x, θ ). Since we don’t know
the complete log-likelihood p( x, z | θ ), we consider its expectation
under this posterior. This corresponds to the E-step. In the M-step,
we maximize this expectation in order to find a new estimate for
the parameters.
156 probabilistic machine learning

Basically, we are ready to formally write down our algorithm.


Once we initialize the parameters θ0 , we iterate between:

1. Compute p(z | x, θold )

2. Set θnew to the Maximum of the Expectation of the complete-data


log likelihood:
 
θnew = arg max ∑ p(z | x, θold ) log p( x, z | θ ) = arg max E p(z| x,θold ) log p( x, z | θ )
θ z θ

3. Check for convergence of either the log likelihood, or θ.

EM for Gaussian Mixtures


Let us return to our example once more and re-write the EM algo-
rithm in its generic form. Using the notation introduced earlier in
the section, we can write the likelihood as:
n k
p( x | π, µ, Σ) = ∏ ∑ π j N ( xi ; µ j , Σ j )
i j

Ideally, we would like to directly


 maximize  the log likelihood with
respect to the parameters θ := π j , µ j , Σ j :
j=1,...,k
 
 n k
log p( x | π, µ, Σ = log ∏ ∑ π j N ( xi ; µ j , Σ j )
i j
 
n k
= ∑ log ∑ π j N ( xi ; µ j , Σ j )
i j

Instead, maximizing the complete log-likelihood is easier:


 
 n k
zij
log p( x, z | π, µ, Σ) = log ∏ ∏ π j N ( xi ; µ j , Σ j )zij 
i j
 
= ∑ ∑ zij log π j + log N ( xi ; µ j , Σ j ) )
i j | {z }
easy to optimize (exponential families!)

Putting everything together, we obtain:

1. Compute p(z | x, θ ):

p(zij = 1) p( xi | zij = 1) π j N ( xi ; µ j , Σ j )
p(zij = 1 | xi , µ, Σ) = = =: rij
∑kj0 p(zij0 = 1) p( xi | zij0 = 1) ∑ j0 π j0 N ( xi ; µ j0 , Σ j0 )

2. Maximize
  
E p(z| x,θ ) log p( x, z | θ ) = ∑ ∑ rij log π j + log N ( xi ; µ j , Σ j )
i j

(notice that E p(z| x,θ ) [zij ] = rij )


Free Energy v

Machine Learning is the application of scientific mod-


eling to everything. In the first part of this chapter, we take
a deeper look at Expectation-Maximization in order to provide
an intuition for its convergence. Then we introduce an extremely
powerful and flexible approximation method known as Variational
Inference.

Convergence of EM v

Let us start this section by introducing a very convenient theorem


that we will make use of shortly after.
Theorem 67 (Jensen’s inequality (Jensen,1906)). Let (Ω, A, µ) be a
probability space, g be a real-valued, µ-integrable function and φ be
a convex function on the real line. Then
Z  Z
φ g dµ ≤ φ ◦ g dµ.
Ω Ω

Recall that in EM we constructed an approximate distribution


q(z) = p(z | x, θ ) for our latent quantity z. Using the Jensen’s
inequality, for any approximation q(z) s.t. q(z) > 0 wherever p( x, z |
Figure 92: Intuition for Jensen’s
θ ) > 0, it holds: inequality
Z
log p( x | θ ) = log p( x, z | θ ) dz
ln p(X|θ)
Z
p( x, z | θ )
= log q(z) dz
q(z)
Z
p( x, z | θ )
≥ q(z) log dz =: L(q)
q(z)
L (q, θ)

Thus, by maximizing the RHS in θ in the M-step, we increase a θ old θ new

lower bound on the LHS, which is the target quantity we want to


maximize. Figure 93: Optimizing the EM-
To demonstrate that it indeed maximizes the LHS, we will show likelihood leads to improvements
on the original likelihood. The EM
that the E-step makes the bound tight at the local θ. Lets us further algorithm is a special case of a more
expand the lower bound that we have derived: general type of algorithms known
Z Z as minorization-maximization (or its
p( x, z | θ ) p(z | x, θ ) · p( x | θ ) converse, majorization-minimization)
L(q) = q(z) log dz = q(z) log dz algorithm. The EM-likelihood minorizes
q(z) q(z)
Z Z the original likelihood, meaning that
p(z | x, θ ) it is always below it, and is equal at
= q(z) log dz + log p( x | θ ) q(z) dz
q(z) | {z } the current estimate of the parameters.
Thus maximizing it leads to improve-
=1
ments on the original likelihood.
158 probabilistic machine learning

Thus, by rearranging the terms we obtain: Recall:


Z DKL (qk p) ≥ 0
p(z | x, θ )
log p( x | θ ) = L(q) − q(z) log DKL (qk p) = 0 ⇔q≡p
q(z)
= L(q) + DKL (qk p(z | x, θ ))

The KL-divergence is non-negative, and is 0 iff p = q. On the other


hand, the function L(q) is a lower bound for log p( x ), and is known
as the Expectation Lower Bound (ELBO) or Variational Free Energy
in physics. Conveniently, the EM algorithm fits in this framework,
because its steps can be written as:

(E) Set q(z) = p(z| x, θold ), so that DKL qk p(z| x, θold ) = 0.

(M) Maximize the ELBO;


Z
θnew = arg max q(z) log p( x, z | θ ) dz
θ
Z
p( x, z | θ )q(z)
= arg max q(z) log dz
θ q(z)
Z
= arg max L(q, θ ) + q(z) log q(z) dz
θ
= arg max L(q, θ )
θ

DKL (qk p(z | x, θnew ))

DKL (qk p(z | x, θ ))


log p( x | θnew )
L(q, θnew )
log p( x | θ ) L(q, θ ) = log p( x | θ )
L(q, θ )

Figure 94: Expectation-Maximization


If p( x, z | θ ) is an exponential family with θ as the natural as a maximization of the Expectation
Lower-Bound (1) Initialization; (2)
parameters (ex: Gaussian Mixture Models), then optimization may E-step – updating q; (3) M-step –
be analytic: maximizing the lower bound.

p( x, z) = exp(φ( x, z)| θ − log Z (θ ))


L(q(z), θ ) = Eq(z) (φ( x, z)| θ − log Z (θ ))
= Eq(z) [φ( x, z)]| θ − log Z (θ )
∇θ L(q(z), θ ) = 0 ⇒ ∇θ log Z (θ ) = E p(x,z) [φ( x, z)] = Eq(z) [φ( x, z)]
free energy v 159

It is also possible to use numerical optimization procedures to


optimize L. When we set q(z) = p(z | x, θold ), we set DKL to its
minimum DKL (qk p(z | x, θ ) = 0, thus

∇θ log p( x | θold ) = ∇θ L(q, θold ) + ∇θ DKL (qk p(z | x, θold ))


= ∇θ L(q, θold )

From here, we could use an optimizer based on this gradient to


numerically optimize L. This is known as generalized EM.

It is straightforward to extend EM to maximize a posterior instead


of a likelihood, by just adding a log prior for θ. For this version of
the algorithm, first initialize θ0 , and then iterate between:

1. Compute q(z) = p(z | x, θold ), thereby setting


DKL (qk p(z | x, θ )) = 0

2. Compute θnew by maximizing the Evidence Lower Bound


Z
!
p( x, z | θ ) p(θ )
θnew = arg max q(z) log dz = arg max L(q, θ ) + log p(θ )
θ q(z) θ

3. Check for convergence of either the log likelihood, or θ.

It is relatively easy to see that we maximize the (log) posterior:

log p(θ | x ) , log p( x | θ ) + log p(θ ) ≥ L(q, θ ) + log p(θ )

Variational Approximation v

In Expectation-Maximization, we were maximizing the lower bound


L(q, θ )

log p( x |θ ) = L(q, θ ) + DKL qk p(z| x, θ )
!
p( x, z|θ )
L(q, θ ) = ∑ q(z) log ,
z q(z)
!
 p(z| x, θ )
DKL qk p(z| x, θ ) = − ∑ q(z) log ,
z q(z)

by successively setting q(z) = p(z| x, θ ) during the E-step, and


optimizing θ in the M-step. Another way to look at the problem
would be to maximize L w.r.t. q, where q is a distribution over the
variables z and θ. In the following formulation, we call the union
of all the parameters z (i.e., z ← z ∪ θ), and q(z) is a probability
distribution over z

log p( x ) = L(q) + DKL qk p(z| x )
Z
!
p( x, z|θ )
L(q) = q(z) log ,
z q(z)
Z
!
 p(z| x )
DKL qk p(z| x ) = − q(z) log .
z q(z)
160 probabilistic machine learning

Then, instead of iterating between z and θ, we could just maximize


L(q(z)) wrt. q (not z!). Since log p( x ) is constant, this amounts to

implicitly minimizing DKL qk p(z| x ) . It is important to note that
this is an optimization in the space of distribution q, instead of the
space of parameters z, θ.
In general, this will be intractable, because the optimal choice
for q is p(z | x ). Instead, we will look for an approximate solution
by restricting q to some family of distributions34 Q that are easier 34
One example would be to assume
to handle. This optimization problem can be viewed as finding the that the distribution over z is Gaussian,
but sometimes we can get away with
probability distribution q? ∈ Q that most closely approximates the just imposing restrictions on the
true likelihood p( x |z) in KL-divergence (or the posterior p(z| x ) ∝ factorization of q, not its analytic form.
p( x |z) p(z) with the addition of a prior over z).

Lemma 68. Consider the probability distribution p( x, z) and an


arbitrary probability distribution q(z) such that q(z) > 0 whenever
p(z) = ∑ x p( x, z) > 0. Then the following equality holds:

log p( x ) = L(q(z)) + DKL (q(z)k p(z | x ))


Z
!
p( x, z)
where L(q) := q(z) log dz
q(z)
Z
!
p(z | x )
DKL (qk p) := − q(z) log dz.
q(z)

Variational inference is a general framework to construct approx-


imating probability distributions q(z) to non-analytic posterior
distributions p(z | x ) by minimizing the functional

q∗ = arg min DKL (q(z)k p(z | x )) = arg max L(q)


q∈Q q∈Q

Mean Field Theory


In general, maximizing L(q) w.r.t. q(z) is hard because the ex-
tremum is exactly at q(z) = p(z| x ), which we assume to be non-
analytic. However, if one assumes that q(z) factorizes
n n
q(z) = ∏ qi ( zi ) = ∏ qi
i =1 i =1

then the bound simplifies. Let’s focus on a particular variable z j :

Z
!
n
L(q) = ∏ qi ( zi ) log p( x, z) − ∑ log qi (zi ) dz
i i
 
Z Z Z
= q j (z j )  log p( x, z) ∏ qi (zi ) dzi  dz j − q j (z j ) log q j (z j ) dz j + const
i6= j
Z Z
= q j (z j ) log p̃( x, z j ) dz j − q j (z j ) log q j (z j ) dz j + const
 
where log p̃( x, z j ) = Eq,i6= j log p( x, z) + const.
free energy v 161

Using this as a building block to find a “good” but tractable


approximation, we can initialize qi (zi ) to some initial distribution,
and then iteratively compute
Z Z
L(q) = q j log p̃( x, z j ) dz j − q j (z j ) log q j (z j ) + const,
 
= − DKL q j (z)k p̃( x, z j ) + const,

which we maximize w.r.t. q j . In turn, this minimizes DKL (q(z j )k p̃( x, z j )),
thus obtaining the minimum q∗j with

log q∗j (z j ) = log p̃( x, z j ) = Eq,i6= j (log p( x, z)) + const (?)

This expression identifies a function q j instead of a parametric form.


The optimization converges, because −L(q) can be shown to be
convex w.r.t. q.
In physics, this trick is known as mean field theory, because an
n-body problem is separated into n separate problems of individual
particles who are affected by the “mean field” p̃ summarizing the
effect of all other particles.

The Kullback-Leibler Divergence


This section provides additional intuition for the KL divergence. Let
us revisit the definition once again:

Definition 69 (Kullback-Leibler divergence). Let P and Q be


probability distribution over X with PDF p( x ) and q( x ). The KL-
divergence from Q to P is defined as
Z
!
 p( x )
DKL Pk Q := log p( x ) dx
q( x )

As we have discussed before, the KL-divergence is non-negative


 
and, in general, not symmetric; DKL Pk Q 6= DKL Qk P . The
direction of the KL-divergence is important;
Z
!
 q(z)
DKL pkq = − p(z) log dz is large if q(z) ≈ 0 where p(z)  0,
p(z)
Z
!
 p(z)
DKL qk p = − q(z) log dz is large if q(z)  0 where p(z) ≈ 0.
q(z)

Say p is a distribution we want to approximate by finding a dis-


tribution q ∈ Q that is close to p in KL-divergence. Minimizing

DKL pkq is often referred to as “nonzero-enforcing”, or also

“support-covering”. On the other hand, minimizing DKL qk p is
said to be “zero-enforcing” or “mode-seeking”, which refers to the
search of the optimal solution, as shown in Fig. 95 and 96.
162 probabilistic machine learning

Figure 95: Optimal approximation


q (green) to p (red), where p is a
Gaussian and q is restricted to a
Gaussian with diagonal covariance
(i.e., the factorization assumption).

(left) shows
 the optimal solution to
DKL qk p (the “zero-enforcing” or
“mode-seeking” direction) and (right)

the optimal solution to DKL pkq
(the “nonzero-enforcing” or “support-
covering”).

Figure 96: Optimal approximation q


(red) to p (blue), where p is a mixture
of Gaussians and q is restricted to a
Gaussian.

(a) and (b)


 show the two local optima of
DKL qk p (“zero-enforcing” or “mode-
seeking”) and (c) the optimal solution
to DKL pkq (“nonzero-enforcing” or
“support-covering”).
(a) (b) (c)
free energy v 163

Condensed content

EM

• to find maximum likelihood (or MAP) estimate for a model involv-


ing a latent variable
 !
 
θ∗ = arg max log p( x | θ ) = arg max log ∑ p( x, z | θ ) 
θ θ z

• Initialize θ0 , then iterate between

E Compute p(z | x, θold ), thereby setting DKL (qk p(z | x, θ ) = 0

M Set θnew to the maximize the Expectation Lower Bound (or


equivalently, minimize the Variational Free Energy
!
p( x, z | θ )
θnew = arg max L(q, θ ) = arg max ∑ q(z) log
θ θ z q(z)

• Check for convergence of either the log likelihood, or θ.

Variational Inference

• is a general framework to construct approximating probability


distributions q(z) to non-analytic posterior distributions p(z | x )
by minimizing the functional

q∗ = arg min DKL (q(z)k p(z | x )) = arg max L(q)


q∈Q q∈Q

• the beauty is that we get to choose q, so one can nearly always


find a tractable approximation.

• If we impose the mean field approximation q(z) = ∏i q(zi ), get

log q∗j (z j ) = Eq,i6= j (log p( x, z)) + const.

• for Exponential Family p things are particularly simple: we only


need the expectation under q of the sufficient statistics.

• Variational Inference is an extremely flexible and powerful ap-


proximation method. Its downside is that constructing the bound
and update equations can be tedious. For a quick test, variational
inference is often not a good idea. But for a deployed product, it
can be the most powerful tool in the box.
Variational Inference v

Derive your variational bound in the time it takes for


your Monte Carlo sampler to converge. Variational In-
ference is a powerful mathematical tool to construct efficient ap-
proximations to intractable probability distributions (not just point
estimates, but entire distributions). In this chapter, we dive into
a concrete implementation through the lens of Gaussian Mixture
Models. At the end, we finally come back to our topic modelling
example.

Application of Variational Inference to GMM v

To do Bayesian Inference using Variational Inference on the Mix-


ture of Gaussians problem, we will first define the priors over the
parameters.

For the covariances we will use the Wishart distribution – the


conjugate prior to the Gaussian with unknown precision P ∈ Rd×d
(which is symmetric and positive definite). It is the multivariate
version of the Gamma distribution:
 
det( P)(ν−d−1)/2 exp −tr(W −1 P)/2 Γd ( x ) = π d(d−1)/4 ∏dj=1 Γ( x + (1 − j)/2),
W ( P; W, ν) = where Γ( x ) is the Gamma function.
2νd/2 det(W )ν/2 Γd (ν/2)
Leading to the posterior
 ! −1 
n   
∏N xi , µ, Σ W Σ−1 ; W, ν ∝ W Σ−1 ; W −1 + ∑( xi − µ)( xi − µ)> , ν + n
i i

For the means and covariances we therefore end up with a


Normal-inverse-Wishart – the conjugate prior to the Gaussian with
unknown mean and precision;
n   −1
 
−1
 
−1
 
−1

∏ N x i ; µ, Σ N µ, µ 0 , γ0 Σ W Σ ; W, ν ∝ N µ, µ n , γ n Σ W Σ ; Wn , νn .
i

For the cluster probabilities π we will take a Dirichlet prior35 35


wikipedia.org/wiki/Dirichlet_distribution

– a prior over categorical probability distributions,


Γ(∑kK=1 αk ) α −1
p(π ) = D (π; α) =
∏k Γ(αk )
∏ πk k
k
166 probabilistic machine learning

α
π

The final distribution is then given by the following equations


and represented graphically in Fig. 97,
m, β W, ν
p( x, z, π, µ, Σ) = p( x |z, µ, Σ) p(π ) p(µ|Σ) p(Σ)
zn
  
p(µ|Σ) p(Σ) = ∏ N µk ; m, Σ/β W Σ−1 ; W, ν
k µ Σ
p(π ) = D (π; α) .
xn
We know that the full posterior p(z, π, µ, Σ| x ) is intractable (check
the graph), but we can consider an approximation with the factor-
N
ization
Figure 97: Graphical representation
q(z, π, µ, Σ) = q(z)q(π, µ, Σ). of the Gaussian Mixture Model with
priors

Computing the updates


Now, we take a look at how one can update those distributions
• Given q(z), what is the optimal q? (π, µ, Σ)?

• Given q(π, µ, Σ), what is the optimal q? (z)?

The update for q ( z ) is then given by

 
log q ? ( z ) = E q ( π,µ,Σ ) log p ( x, z, π, µ, Σ ) + const ,
   
= E q ( π ) log p ( z | π ) + E q ( µ,Σ ) log p ( x | z, µ, Σ ) + const,
 h i
  1 −1 > −1
= ∑ ∑ z nk E q ( π ) log π k + E q ( µ,Σ ) log det ( Σ ) − ( x n − µ k ) Σ k ( x − µ k ) + const,
n k 2
| {z }
: = log ρ nk

which requires the computation of

z ρnk z
q? (z) ∝ ∏ ∏ ρnknk , or, writing rnk =
∑ j ρnj
, q? (z) = ∏ ∏ rnknk , with rnk = Eq(z) [z] .
n k n k

Note that q? (z) factorizes over n, even though we did not impose
this restriction; we only imposed a factorization between z and
π, µ, Σ, which leads to conditional independence.
Computing those expectation for log ρnk can be a bit difficult to
do manually, but can be done given a table of values for

ψ( x ) = log Γ( x ).
∂x
where ψ( x ) is Digamma function36 . We need to compute 36
wikipedia.org/wiki/Digamma_function

 
ED(π;αk ) log πk = ψ(αk ) − ψ(∑ αk )
k
   D  
1 νk + 1 − d
E
W Σ− 1

k ;Wk ,νk
 log det Σ−
k = ∑ ψ
2
+ D log 2 + log det(Wk ),
d =1
h i
E 
1
 ( xn − µk )> Σ−1 ( xn − µk ) = D/β k + νk ( xn − mk )> Wk ( xn − mk ).
N (µk ;mk ,Σk /β k )W Σ−
k ;Wk ,νk
variational inference v 167

To compute the update for q ( π, µ, Σ ) , let us first define a


convenient notation;
1 1
Nk : = ∑ r nk , x̄ k =
Nk ∑ r nk x n , Sk =
Nk ∑ ( x n − x̄ k )( x n − x̄ k ) > .
n n n

We can then write the optimal q as

 
log q ? ( π, µ, Σ ) = E q ( z ) log p ( x, z, π, µ, Σ ) + const,
" #
= E q ( z ) log p ( π ) + ∑ log p ( µ k , Σ k ) + log p ( z | π ) + ∑ log p ( x n | z, µ, Σ )
k n
  
= log p ( π ) + ∑ log p ( µ k , Σ k ) + E q ( z ) log p ( z | π ) + ∑ ∑ E q ( z ) [ z nk ] log N x n ; µ k , Σ k + const.
k n k

This bound exposes another induced factorization, as π is now


independent from µ, Σ, and each ( µ k , Σ k ) are independent of each
other,
q ( π, µ, Σ ) = q ( π ) ∏ q ( µ k , Σ k ) .
k
We can compute the optimal distributions independently.

For q ( π ) , this leads to


 
log q ? ( π ) = log p ( π ) + E q ( z ) log p ( z | π ) + const,
= ( α − 1 ) ∑ log π k + ∑ ∑ r nk log π k + const,
k k n
?
q ( π ) = D ( π; α k : = α + Nk ) .

For q ( µ k , Σ k ) , (skipping the details of the derivation of the update


for Gaussians with conjugate priors) this leads to
  
1
q ? ( µ k , Σ k ) = N µ k , m k , Σ k /β k W Σ −
k ; W k k .
, ν

written with the shortcuts


1
β k = β + Nk , mk = ( βm + Nk x̄ k ) ,
βk
β Nk
νk = ν + Nk , Wk− 1 = W − 1 + Nk S k + ( x̄ − m )( x̄ k − m ) > .
β + Nk k

Comparison with the “standard” EM update


The update equation for z yields
!
  1/2 D ν
E q [ z nk ] = r nk ∝ π̃ k det Σ̃ −1
exp − − k ( x n − m k ) > Wk ( x n − m k ) ,
2β k 2

    
 
with log π̃ k = E D( π;α k ) log π k , log det Σ̃ − 1 =E 
1
 log det Σ −
k
1
.
k W Σ−
k ;Wk ,νk

This is very similar to the EM update,


  1/2  
1
r nk ∝ π k det Σ − 1 exp − ( x n − µ k ) > Σ −
k
1
( x n − µ k ) .
2
168 probabilistic machine learning

Variational inference is the Bayesian version of EM;


instead of maximizing the likelihood for θ = ( µ, Σ, π ) , we have
priors that maximize a variational bound. One advantage of this
approach is that the posterior can “decide” to ignore components,
because the Dirichlet prior can be chosen to favor sparse π. For
maximum likelihood, it is always favorable to maximize the num-
ber of components, as that allows putting a lot of mass on a small
number of (or even a single) datapoints. As an example, Fig. 99
shows the state of the approximation on the Old Faithful dataset
during the optimization process.

Figure 98: Variational Inference on a


Gaussian Mixture Model on the Old
Faithful dataset after 0, 15, 60 and 120
iterations. With a sparsity inducing
Dirichlet prior on π, the posterior
selects fewer clusters.

More generally, Variational Inference is a framework to con-


struct approximating probability distribution q(z) to a posterior
distribution p(z| x ) that lack an analytic solution by minimizing the
functional

q? = arg min DKL q(z)k p(z| x ) = arg max L(q).
q∈Q q∈Q

As we get to choose q, we can always find a tractable approxima-


tion, although the quality of the approximation will suffer if we put
in too many restriction on the family Q. If we impose the mean field
approximation q(z) = ∏i qi (zi ), we get that
 
log q?j (z j ) = Eq,i6= j log p( x, z) + const.

For exponential family p, things are particularly simple as we only


need the expectation of q under the sufficient statistics.

Variational Inference is a flexible and powerful approximation


method. Its downside is that constructing the bound and update
equations can be tedious, as we’ve seen here. For a quick test,
variational inference is often not a good idea. But for a deployed
product, it can be one of the most powerful tools in the toolbox.

Figure 99: The graphical model for


Latent Dirichlet Allocation
αd βk
πd cdi wdi θk
i = [1, . . . , Id ] k = [1, . . . , K ]
d = [1, . . . , D ]
variational inference v 169

Variational Inference for topic modeling v

Let us return to our topic modeling example. Recall that the pos-
terior p(Π, Θ, C | W ) is intractable. Luckily, Variational Inference
provides a method to construct efficient approximations for in-
tractable distributions. So, we desire to find an approximation q
that factorizes:
q(Π, Θ, C ) = q(C ) · q(Π, Θ)

The best approximation will minimize the Kullback-Leibler di-


vergence DKL (qk p(Π, Θ, C | W )). We know that this is the same as
maximizing the ELBO
Z
!
p(C, Π, Θ, W )
L(q) = q(C, Θ, Π) log dC dΘ dΠ
q(C, Θ, Π)

To maximize the ELBO of a factorized approximation, recall that


the mean field is computed by:

log q∗ (zi ) = Ez j ,j6=i (log p( x, z)) + const.

We can compute the optimal factors of q similarly as before.


Starting with q? (C )
 
K  
log q∗ (C ) = Eq(Π,Θ)  ∑ cdik log(πdk θkwdi ) + const =∑ ∑ dik q(Π,Θ)
c E ( log π θ
dk dwdi ) +const
d,i,k d,i k =1 | {z }
=:log γdik

Thus, we obtain:

q(C ) = ∏ q(cdi )
d,i
c
with q(cdi ) = ∏ γ̃dikdik where γ̃dik = γdik / ∑ γdik
k k

Note that the last formulation implies: Eq (cdik ) = γ̃dik .

On the other hand, for q? (Π, Θ) we have:


 

log q∗ (Π, Θ) = E∏d,i q(cdi: )) ∑(αdk − 1 + ndk· ) log πdk + ∑( β kv − 1 + n·kv ) log θkv  + const
d,k k,v
D K K V
= ∑ ∑ (αdk − 1 + Eq(C) (ndk· )) log πdk + ∑ ∑ ( βkv − 1 + Eq(C) (n·kv )) log θkv + const
d =1 k =1 k =1 v =1
 
D  K D Id
q∗ (Π, Θ) = ∏D πd ; α̃d: := [αd: + γ̃d·: ] · ∏ D θk ; β̃ kv := [ β kv + ∑ ∑ γ̃di: I(wdi = v)]v=1,...,V  .
d =1 k =1 d i =1

Note that once again we have obtained an induced factorization for


Π and Θ, which we did not encode explicitly.
170 probabilistic machine learning

Finally, to close the loop, we have to take care of computing γdik


by utilizing the Digamma function. The final result is as follows:
   
d I
 
q(πd ) = D πd ; α̃dk := αdk + ∑ γ̃dik   ∀d = 1, . . . , D
i =1
k =1,...,K
   
D Id
 
q(θk ) = D θk ; β̃ kv :=  β kv + ∑ ∑ γ̃dik I(wdi = v)  ∀k = 1, . . . , K
d i =1
v=1,...,V
cdik
q(cdi ) = ∏ γ̃dik , ∀d i = 1, . . . , Id
k

where γ̃dik = γdik / ∑k γdik and (note that ∑k α̃dk = const)


 
γdik = exp Eq(πdk ) (log πdk ) + Eq(θdi ) (log θkwdi )
 !
= exp z(α̃ jk ) + z( β̃ kwdi ) − z ∑ β̃kv 
v

Last but not least, we could explicitly compute the ELBO. No-
tice from above, that in practice calculating the ELBO isn’t strictly
necessary. However, it could be a useful tool for monitoring the
progress and debugging the algorithm. To compute the ELBO we
need:

L(q, W ) = Eq (log p(W, C, Θ, Π)) + H(q)


Z Z
= q(C, Θ, Π) log p(W, C, Θ, Π) dC dΘ dΠ − q(C, Θ, Π) log q(C, Θ, Π) dC dΘ dΠ
Z
= q(C, Θ, Π) log p(W, C, Θ, Π) dC dΘ dΠ + ∑ H(D(θk β̃ k )) + ∑ H(D(πd α̃d )) + ∑ H(γ̃di )
k d di

The entropies can be computed from the tabulated values. For the
expectation, we use Eq(C) (ndkv ) = ∑i γdik I(wdi = v) and use
ED(πd ;α̃) (log πd ) = z(α˜d ) − z(α̃ˆ ) from above.

Variational Inference is a powerful mathematical tool to construct


efficient approximations to intractable probability distributions (not
just point estimates, but entire distributions). Often, just imposing
factorization is enough to make things tractable. The downside of
Variational Inference is that constructing the bound can take signif-
icant ELBOw grease. However, the resulting algorithms are often
highly more efficient compared to tools that require less derivation
work, like Monte Carlo.
variational inference v 171

Lastly, following is pseudocode for LDA using Variational Infer-


ence:

1 procedure LDA(W, α, β)
2 γ̃dik ^ Dirichlet_rand (α) initialize
3 L ^ −∞
4 while L not converged do
5 for d = 1, . . . , D; k = 1, . . . , K do
6 α̃dk ^ αdk + ∑i γ̃dik update document-topics distributions
7 end for
8 for k = 1, . . . , K; v = 1, . . . , V do
9 β̃ kv ^ β kv + ∑d,i γ̃dik I(wdi = v) update topic-word distributions
10 end for
11 for d = 1, . . . , D; k = 1, . . . , K; i = 1, . . . , Id do
12 γ̃dik ^ exp(z(α̃dk ) + z( β̃ kwdi ) − z(∑v β̃ kv )) update word-topic assignments
13 γ̃dik ^ γ̃dik /γ̃di·
14 end for
15 L ^ Bound (γ̃, w, α̃, β̃) update bound
16 end while
17 end procedure
Customizing models and algorithms v

Building a tailormade solution requires creativity and


mathematical stamina. In the previous chapter we saw that
Variational Inference is a powerful mathematical tool to construct
efficient approximations to intractable probability distributions. Often,
just imposing factorization is enough to make things tractable. In
the example of topic modeling, the only factorization we imposed
was between C, and Π, Θ, i.e. q(C, Π, Θ) = q(C ) · q(Π, Θ). However,
when deriving the variational bound, we obtained even further
(induced) factorization between the terms πd , θk , cdi :
   
Id
 
q(πd ) = D πd ; α̃dk := αdk + ∑ γ̃dik   ∀d = 1, . . . , D
i =1
k =1,...,K
   
D Id
 
q(θk ) = D θk ; β̃ kv :=  β kv + ∑ ∑ γ̃dik I(wdi = v)  ∀k = 1, . . . , K
d i =1
v=1,...,V
cdik
q(cdi ) = ∏ γ̃dik , ∀d i = 1, . . . , Id
k

Was the outcome of obtaining induced factorization and closed form


approximation accidental? To answer this question, consider an
exponential family joint distribution:

N 
p( x, z | η ) = ∏ exp η | φ( xn , zn ) − log Z (η )
n =1

with conjugate prior


p(η | ν, v) = exp η | v − ν log Z (η ) − log F (ν, v)

Furthermore, assume q(z, η ) = q(z) · q(η ). Then q is in the same


exponential family:

N
log q∗ (z) = Eq(η ) (log p( x, z, η )) + const = Eq(η ) (log p( x, z | η )) + const = ∑ Eq ( η ) ( η )| φ ( x n , z n )
n =1
|


q (z) = ∏ exp E(η ) φ( xn , zn ) − log Z (E(η ))
n =1
174 probabilistic machine learning

Once again, note that we obtain induced factorization in the same


exponential family. We obtain similar results for q(η ) as well:

log q∗ (η ) = Eq(z) (log p( x, z, η )) + const = log p(η | ν, v) + Eq(z) (log p( x, z | η )) + const


N
= η | v − ν log Z (η ) + ∑ − log Z(η ) + η | Eq(z) (φ(xn , zn )) + const
n =1
 ! 
N
q∗ (η ) = exp η | v+ ∑ Eq(z) (φ(xn , zn )) − (ν + N ) log Z (η ) − const
n =1

We come to the conclusion that if one considers variational approx-


imations, using conjugate exponential family priors can make life
much easier.

Collapsed Variational Bound v

In the previous chapters we saw that Monte Carlo methods tend to


converge relatively slowly, since in theory, they are only correct in
the infinite limit. On the other hand, variational approximation is
an optimization method that constructs a probabilistic approxima-
tion in finite time. However, recall that Collapsed Gibbs Sampling in-
troduced significant speedup for the Monte Carlo approach: instead
of iterating back and forth between sampling from the conditionals
for C and Π, Θ, it marginalizes over the latent quantities Θ and Π
in order to obtain a marginal distribution just for C:
\di \di \di
(αdk + ndk· )( β kwdi + n·kw )(∑v β kv + n·kv )−1
p(cdik = 1 | C \di , W ) = \di
di
\di \di
∑k0 (αdk0 + ndk0 · ) · ∑w0 ( β kw0 + n·kw0 ) · ∑v0 ( β kv0 + n·kv0 )−1

In this section we explore a similar collapsed inference technique


for our variational approximation procedure. Deriving our varia-
tional bound, we previously imposed the following factorization:

q(Π, Θ, C ) = q(Π, Θ) · ∏ q(cdi )


di

Can we impose a weaker factorization, such as:

q(Π, Θ, C ) = q(Θ, Π | C ) · ∏ q(cdi )


di

and get away with less? First, note that p(C, Θ, Π | W ) = p(Θ, Π |
C, W ) p(C | W ). Now, we minimize

Z
!
q(Π, Θ|C )q(C )
DKL (q(Π, Θ, C )k p(Π, Θ, C | W )) = q(Π, Θ | C )q(C ) log dC dΠ dΘ
p(Π, Θ | C, W ) p(C | W )
 ! !
Z
q ( Π, Θ | C ) q ( C )
= q(Π, Θ | C )q(C ) log + log  dC dΠ dΘ
p(Π, Θ | C, W ) p(C | W )

= DKL (q(Π, Θ | C )k p(Π, Θ | C, W )) + DKL (q(C )k p(C | W ))

Notice that we obtain two KL Divergence terms.


customizing models and algorithms v 175

Notice that we can immediately optimize the first term by explic-


itly setting q(Θ, Π) = p(Θ, Π | C, W ) (recall: we have previously
derived a closed form expression for p(Θ, Π | C, W )), thus making
the bound tight in Π, Θ. Continuing the analogy to the Collapsed
Gibbs Sampling method, we therefore collapse the parameters Π and
Θ. From there, all we need to do is optimize a variational bound
based on the KL-divergence between q(C ) and p(C | W ).

The remaining collapsed variational bound (ELBO) becomes


Z
L(q) = q(C ) log p(C, W ) dC + H(q(C ))

Since we make strictly less assumptions about q than before, we


will get a strictly better approximation to the true posterior. The
above bound is maximized for cdi if

log q(cdi ) = Eq(C\di ) (log p(C, W )) + const

where:
! !
Γ(∑k αdk ) Γ(αdk +ndk· ) Γ(∑v β kv ) Γ( β kv +n·kv )
p(C, W ) = ∏ ∏
Γ(∑k αdk + ndk· ) k Γ(αdk ) ∏ ∏
Γ(∑v β kv + n·kv ) v Γ( β kv )
d k

Before we continue, let us make a couple of observations. Note that


cdi ∈ {0; 1}K and ∑k cdik = 1. So, it holds that q(cdi ) = ∏k γdik with
n −1
∑k γdik = 1. Furthermore, it holds that Γ(α + n) = ∏`= 0 ( α + `), thus
n −1
log Γ(α + n) = ∑`=0 log(α + `). Now: All terms in p(C, W ) that don’t involve
cdik , as well as all sums over k, can be
moved into the constant. Furthermore,
log q(cdi ) = Eq(C\di ) (log p(C, W )) + const we can also add terms to const., such
n\di −1
as ∑`= 0 log(α + `), thus effectively
log γdik = log q(cdik = 1) cancelling terms in log Γ.
 !
= Eq(C\di ) log Γ(αdk + ndk· ) + log Γ( β kwdi + n·kwdi ) − log Γ ∑ β kv + n·kv  + const
v
 !
\di \di \di
= Eq(C\di ) log(αdk + ndk· ) + log( β kwdi + n·kw ) − log
di
∑ β kv + n·kv  + const
v

Therefore:
  
!
 \di \di \di 
γdik ∝ exp Eq(C\di ) log(αdk + ndk· ) + log( β kwdi + n·kw ) − log ∑ β kv + n·kv 
di
v

Under our assumption for the factorization q(C ) = ∏di cdi , the
counts ndk· are sums of independent Bernoulli variables (i.e. they
have a multinomial distribution). Computing their expected log-
arithm is tricky and of complexity O(n2d·· ) i.e. quadratic in the
counts that we are dealing with. That is likely why the original
paper didn’t do this. In fact, it took three years for a solution to be
provided by Yee Whye Teh and Max Welling.
176 probabilistic machine learning

Recall that the probability measure of R = ∑iN xi with discrete xi


of probability f is

N!
P( R = r | f , N ) = · f r · (1 − f ) N −r
( N − r )! · r!
!
N
= · f r · (1 − f ) N −r
r
≈ N (r; Nr, Nr (1 − r ))

Due to the Central Limit Theorem37 , a Gaussian approximation 37


wikipedia.org/wiki/Central_limit_theorem
should be good:
\di \di \di \di \di \di
p(ndk· ) ≈ N (ndk· ; Eq [ndk· ], varq [ndk· ]) with Eq [ndk· ] = ∑ γdkj , varq [ndk· ] = ∑ γdkj (1 − γdkj )
j 6 =i j 6 =i

From there, we can construct a Taylor approximation:

1 1 1
log(α + n) ≈ log(α + E(n)) + (n − E(n)) · − (n − E(n))2 ·
α + E( n ) 2 (α + E(n))2

and (approximately) obtain the quantity of interest:


\di
\di \di varq [ndk· ]
Eq [log(αdk + ndk· )] ≈ log(αdk + Eq [ndk· ]) − \di
2(αdk + Eq [ndk· ])2

Putting everything together:


! −1
\di \di \di
γdik ∝ (αdk + E[ndk· ])( β kwdi + E[n·kw ])
di
∑ β kv + E[n·kv ]
v
 
\di \di \di
 varq [ndk· ] varq [n·kw ] varq [n·k· ] 
di
· exp − \di
− \di
+ \di 
2(αdk + Eq [ndk· ])2 2( β kwdi + Eq [n·kw ])2 2(∑v β kv + Eq [n·kv ])2
di

Essentially, this is the Collapsed Variational Inference algorithm.


At each iteration of the loop, we don’t do anything about Π and Θ
– we just update the topic assignments γdik . Note that γdik doesn’t
depend on i ∈ 1, . . . , Id , as it’s the same for all wdi in d with wdi =
v! This means that we can get away with complexity O( DKV ),
instead of O( DKId ), which is highly suitable for long documents.
Following is a detailed overview of the algorithm complexity:

• memory requirement: O( DKV ), since we have to store γdik for each


value of i ∈ 1, . . . , V and

– E[ndk· ], var[ndk· ] ∈ RD×K


– E[n·kv ], var[n·kv ] ∈ RK ×V
– E[n·k· ], var[n·k· ] ∈ RK

• computational complexity: O( DKV ) We can loop over V rather


than Id (good for long documents!) Often, a document will be
sparse in V, so iteration cost can be much lower.
customizing models and algorithms v 177

1.0 Figure 100: Topic distribution of the


documents over the years
0.8

0.6
πd

0.4

0.2

0.0
1,800

1,850

1,900

1,950

2,000
year

Back to our running example v


In the past few weeks we have been continuously upgrading our
toolbox in order to provide a valid solution for the topic modeling
problem of the State of the Union addresses. One possible outcome
using the tools that we have developed in the previous chapters
is plotted above. While it is a relatively good first result, it doesn’t
quite fulfill our expectations – one would expect a rather smooth
latent structure throughout history. When issues like this occur, one
revisits the model, and ponders where further improvements can
be made. We come to the conclusion that we would like to inject a
priori knowledge about the smoothness of our model. In particular,
the changes will revolve around the document-topic distribution Π
and its prior α, which so far we’ve set to a constant.

One idea is to tune the hyperparameter α by maximizing the log-


posterior of the parameters with EM. First, let us again write down
the log-likelihood:

log p(W | α, β) = L(q, α, β) + DKL (qk p(C | W, α, β))


Z
!
p(W, Π, Θ, C | α, β)
where L(q, α, β) = q(C, Θ, Π) log
q(C, Θ, Π)

We have also shown that the log-posterior of the parameters is


bounded from below as follows:

log p(α, β | W ) ≥ L(q, α, β) + log p(α, β)

Given that we have previously derived an explicit expression for


the ELBO L, all we have to reason about is how to construct the
(log) prior p(α, β). Recall that each document is identified by the
president and the year the document was written (for example, Lin-
coln_1862.txt). This is a very powerful structure that we could make
use of, by incorporating it as a metadata in our model, thus allow-
ing each document to be placed in a latent space described by these
key characteristics. We do this by extending our probabilistic model
that we have seen many times over (see Fig. 101). Before we con-
tinue with our model development, it is crucial to note that while
178 probabilistic machine learning

Figure 101: A model incorporating


document metadata
latent function transf. doc topic dist. doc topic dist. word topic word topic word dist.
βk
f αd πd cdi wdi θk
topic prior
i = [1, . . . , Id ] k = [1, . . . , K ]

h φd
kernel
metadata
d = [1, . . . , D ]

toolboxes are extremely valuable for quick early development, their


interface often enforces and restricts model design decisions. In or-
der to really solve a probabilistic modelling task, one should build
customized craftware. For example, it is extremely hard to make the
above-mentioned model extension when one uses the pre-packaged
Scikit implementation of LDA38 . 38
github.com/scikit-learn/lda

Now, let us return to our problem at hand and see how exactly we
can incorporate the meta-data. One can pose the problem of con-
structing a smooth latent structure as a regression problem, which
motivates the use of Gaussian Processes. We will use GP regression
for the latent function f , which together with the metadata Φ in-
forms the prior α for the document topic distribution. The updated
model to generate the words W of documents d = 1, . . . , D with
features φd ∈ F is as follows:

• draw function f : F → RK from p( f | h) = GP ( f ; 0, h)

• draw document topic distribution πd from D(αd = exp( f (φd )))

• draw topic-word distributions p(Θ | β) = ∏kK=1 D(θk , β k )


I c
• draw each word’s topic p(Cd:: | Π) = ∏dD=1 ∏i=
d dik
1 ∏k πdk
c
• draw the word wdi with probability θkw
dik
.
di

Note that in order to enforce non-negativity for the parameters α,


the GP regression models log α, which is then transformed back
through the exp link function. The natural prior that arises from the
Gaussian Process is:
1 1
log p( f = log α) = − k f d k2k = − f d| k− 1
DD f d
2 2
One possible kernel function that encodes the desired properties is
the product of the following two functions:
!−α 
( x a − xb ) 2 1.00 if president( x ) = president ( x )
a b
k ( x a , xb ) = θ 2 1 + ·
2α`2 γ otherwise

The first part is the well known rational-quadratic kernel which en-
codes the smoothness properties. The second part is an indication
of change in presidency – notice that we allow for slight shift in
customizing models and algorithms v 179

topics given that there was a change in presidency. A good set of


parameters for the kernel are:

θ=5 ` = 10years α = 0.5 γ = 0.9

Figure 102: Prior document-topic


distribution arising from the Gaussian
1 Process

0.8

0.6
p(topic)

0.4

0.2

0
1,800 1,820 1,840 1,860 1,880 1,900 1,920 1,940 1,960 1,980 2,000
Year

Figure 103: Resulting Kernel Topic


Model
1
war, spain
war, national, economic labor energy, cut, oil

0.8 war, constitution, union men,

law, war,

0.6
hπk | φi

America,
good,
American, people

0.4 work

world, peace, free

0.2 public, commerce

made, business

0
1800 1820 1840 1860 1880 1900 1920 1940 1960 1980 2000
year
Making decisions v

Making a decision boils down to conditioning on a vari-


able you control. In the previous chapters we explored various
modeling and computational techniques to learn from data. In this
chapter we discuss what one can do once the inference/learning
process is finished.

Probabilistic models can provide predictions p( x | a) for a vari-


able x conditioned on an action a. An important question is: given
the choice, which value of a do you prefer? For this purpose, one
could assign a loss or utility `( x ), and then choose a such that it
minimizes the expected loss:
Z
a∗ = arg min `( x ) p( x | a) dx
a

In this lecture, we only consider a setting where an action at one


timestep does not influence the state of the system in the next step,
or in other words, we consider independent draws xi with xi ∼ p( x |
ai ). Then, one can choose all ai = a∗ to minimize the accumulated
loss: " #
L(n) = E p ∑ xi
i

But for this setting, we have assumed that we know p. Let’s illus-
trate an example for the (more common) case when we don’t know
p.

1 Figure 104: A simulated example


of Bernoulli experiments with three
0.8 different treatments

0.6
payout

0.4

0.2

0
100 101 102 103
N

Suppose we are performing Bernoulli trials for three different


“treatments”, whose expected payout is represented as horizontal
182 probabilistic machine learning

lines in the plot above. Furthermore, suppose we record the run-


ning empirical average as we increase the number of trials N. When
comparing the empirical averages for a lower number of trials (say
10), we notice that the rankings of the efficacy of each treatment is
not on par with the ranking of the expectations. Only after a sig-
nificant number of trials (say > 1000) do these estimators properly
start separating. Perhaps we shouldn’t rule out an option yet if
the posteriors over their expected return overlaps with that of our
current guess for the best option.

To formalize this, assume we have K choices. Taking choice k ∈


[1, . . . , K ] at time i yields binary (Bernoulli) reward/loss xi i.i.d.
with probability πk ∈ [0, 1]. Since we don’t know the probabilities
πk , we can infer them in a Bayesian manner, i.e. given conjugate
priors
p(πk ) = B(π, a, b) = B( a, b)−1 π a−1 (1 − π )b−1
the posteriors from nk tries of choice k with mk successes is

p(πk | nk , mk ) = B(πk ; a + mk , b + (nk − mk ))

For a, b → 0, the posterior has the following mean and variance

mk mk (nk − mk )
π̄k := E p(πk |nk ,mk ) [π ] = σk2 := var p(πk |nk ,mk ) [π ] = = O(n− 1
k )
nk n2k (nk + 1)

In fact, the smooth curved lines in the plot above are exactly the
standard deviations at each time point i, which we know is the
expected distance of the estimated quantity to the true value.

The probabilistic interpretation gives us a hint how one could


choose the treatment. We would like a policy that resolves the two
problems:

• If the true probabilities of positive outcome for two treatments


are very different from each other, then we should make a com-
mitment to focus on the better of the two alternatives.

• On the other hand, if the probabilities are extremely similar, then


we should not commit to either of them, and instead keep exper-
imenting with both alternatives. Appropriately, this exploration
phase should decrease as we increase the number of trials.

One
q idea is, at time i, to choose the option k that maximizes π̄k +
c σk2 . Now, one could pose the question: which is the best value
for c? A large c ensures uncertain options are preferred, thus lead-
ing to exploration. On the other hand, a small c ignores uncertainty,
thus leading to exploitation.
One possibility is to let c grow at rate less than O(n1/2
k ). Then,
the variance of the chosen options will drop faster than c grows, so
their exploration will stop unless their mean is good. However, as c
grows, unexplored choices will eventually become dominant, thus
always explored eventually.
making decisions v 183

Multi-Armed Bandit Algorithms

Now, we move to a more generalized case where we could reason


beyond Bernoulli variables. Consider the following theorem, which
provides us with general bounds on the deviation of the empirical
estimate for the mean from the actual mean of the distribution.

Theorem 70 (Chernoff-Hoeffding). Let X1 , . . . , Xn be random vari-


ables with common range [0, 1] and such that E[ Xt | X1 , . . . , Xt−1 ] =
µ. Let Sn = X1 + · · · + Xn . Then for all a ≥ 0,
2 /n 2 /n
p(Sn − nµ ≤ − a) ≤ e−2a and p(Sn − nµ ≥ a) ≤ e−2a

The fact that these general bounds exist is suggestive of a general


class of algorithms which are relatively independent of the specific
prior over the output of the individual choices. These algorithms
are widely known under the name Multi-Armed Bandit algorithms.

A K-armed bandit is a collection Xkn of random variables, 1 ≤


k ≤ K, n ≥ 1 where k is the arm of the bandit. Successive plays of k
yield rewards Xk1 , Xk2 , . . . which are independent and identically
distributed according to an unknown p with E p ( Xki ) = µi .

A policy A chooses the next machine to play at time n, based on


past plays and rewards.

Let Tk (n) be number of times machine k was played by the pol-


icy during the first n plays. Then, the regret of the policy A is

R A (n) = µ∗ · n − ∑ µ j · E p [ Tj (n)] with µ∗ := max µk


j 1≤ k ≤ K

Let x̄ j be the empirical average of rewards from j and n j be the


number of plays at j in n plays. Then the pseudocode implementa-
tion of the Upper Confidence Bound procedure is as follows:
1 procedure UCB(K) Upper Confidence Bound
2 play each machine once
3 while true do r !
2 log n
4 play j = arg max x̄ j + nj
5 end while
6 end procedure
Interestingly, it is possible to upper bound the expected regret of
UCB:

Theorem 71 (Auer, Cesa-Bianchi, Fischer). Consider K machines


(K > 1) having arbitrary reward distributions P1 , . . . , PK with
support in [0, 1] and expected values µi = EP ( Xi ). Let ∆i := µ∗ − µi .
Then, the expected regret of UCB after any number n of plays is at
most The sums are over K, not n. So the
  regret is O(K log n). UCB plays a sub-
  ! 
optimal arm at most logarithmically
log n π 2
EP [ R A (n)] ≤ 8 ∑ + 1+ ∑ ∆ j  often.
i:µ ≤µ∗
∆i 3 j
i
184 probabilistic machine learning

regret bound
2,000 103 expected regret
sampled regret
p = 50%

regret
∑t nit

p = 55% 101
1,000 p = 45%

10−1

0
500 1,000 1,500 2,000 2,500 3,000 100 101 102 103 104
N N

In the figure above we plot the behavior of UCB on our simu-


lated example from before. On the left you can see the payoffs of
each treatment as the number of trials grows. On the top of the left
plot you can see which treatment was chosen by the policy at each
particular time. Notice that in the initial phase the algorithm ex-
plores among all three treatments, and only slowly stabilizes after
> 2000 trials. On the right we plot the regret of the treatment with
p = 50%.

In conclusion, Multi-Armed Bandit algorithms apply to inde-


pendent, discrete choice problems with stochastic pay-off. Algo-
rithms based on upper confidence bounds incur regret bounded by
O(log n). Unfortunately, no problem is ever discrete, finite and in-
dependent. That being said, in a continuous problem, no “arm” can
and should ever be played twice. Furthermore, one should make
use of the fact that in a typical prototyping setting early exploration
is free.

2 Figure 105: Example application of


parameter optimization, with tunable
parameter x, and outcome/loss f

0
f

−2

−5 −4 −3 −2 −1 0 1 2 3 4 5
x

Continuous-Armed Bandits

In this section we will make use of a running example for parame-


ter optimization (see Fig. 105). Suppose the observations of the loss
are distributed according to:

p(y | x ) = N (y; f x , σ2 )
making decisions v 185

Given the observations of loss y for different parameters x, we


would like to determine which parameter x? yields the minimum
loss:
x∗ = arg min f ( x ) = ?
x ∈D

We define the regret as follows:

T
R( T ) := ∑ f ( xt ) − f ( x∗ )
t =1

Without an informative prior, we would have to test every single


hypothesis in order to find the minimum – which is infeasible in
the continuous domain! For this reason, we make use of a Gaussian
Process prior for where the minimum might be:

p( f ) = GP ( f ; µ, k )

Depending on the community, this problem is known under the


names of Continuous-Armed Bandit and Bayesian Optimization.

2 Figure 106: The GP posterior arising


from the prior and the three obser-
vations. In red, we see the empirical
distribution of the minimum for sam-
0 ples of the GP
f

−2

−5 −4 −3 −2 −1 0 1 2 3 4 5
x
2
For this reason, we build GP regression on the observations (see
Fig. 106). One pedestrian way to find where it is most likely that 0
the minimum lies is to iteratively draw samples from the posterior
GP and record where the minimum of each sample is. Based on this −2
f

information, one could construct an empirical distribution, which


can be seen colored in red in the plot above. −4

−6
GP Upper Confidence Bound −4 −2 0 2 4
x
A more structured and theoretically motivated algorithm is GP
Upper Confidence Bound (GP-UCB)39 . Under the posterior p( f | Figure 107: GP UCB. Top: GP poste-
rior. Bottom: Utility u( x )
y) = GP ( f ; µt−1 , σt2−1 ), we define the pointwise utility as: 39
Srinivas, Krause, Kakade, Seeger,
p ICML 2009
u i ( x ) = µ i −1 ( x ) − β t σt−1 ( x )

Then, in each iteration we choose xt as xt = arg minx∈D u( x ) (see


Fig. 107). Interestingly, the algorithm provides a theoretical guaran-
tee, which is summarized below.

Theorem 72 (Srinivas et al., 2009). Assume that f ∈ H_k with ‖f‖²_k ≤
B, and the noise is zero-mean and σ-bounded almost surely. Let
δ ∈ (0, 1) and β_t = 2B + 300 γ_t log³(t/δ). Running GP-UCB with β_t
and p( f ) = GP( f ; 0, k),

    p( R_T ≤ √( 8 T β_T γ_T / log(1 + σ²) )  ∀ T ≥ 1 ) ≥ 1 − δ,

thus lim_{T→∞} R_T / T = 0 (“no regret”).
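A minimal sketch of the resulting GP-UCB loop on a grid is given below; the kernel, the β_t schedule, and the toy objective f are illustrative assumptions and not the constants appearing in the theorem.

```python
import numpy as np

def k(a, b, ell=1.0):
    """Squared-exponential kernel on 1-d inputs (illustrative choice)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def f(x):                                   # unknown objective (illustrative)
    return np.sin(3 * x) + 0.5 * x

sigma, xs = 0.1, np.linspace(-5, 5, 300)    # noise level, grid over the domain D
X, y = [], []

for t in range(1, 21):
    if X:
        Xa, ya = np.array(X), np.array(y)
        G = k(Xa, Xa) + sigma ** 2 * np.eye(len(Xa))
        A = np.linalg.solve(G, k(Xa, xs))                       # G^{-1} k(Xa, xs)
        mu = A.T @ ya                                           # posterior mean
        var = np.clip(1.0 - np.sum(k(xs, Xa) * A.T, axis=1),    # posterior variance
                      1e-12, None)
    else:
        mu, var = np.zeros_like(xs), np.ones_like(xs)           # prior before data
    beta_t = 2.0 * np.log(t ** 2 * np.pi ** 2 / 0.6)            # simple increasing schedule (assumption)
    u = mu - np.sqrt(beta_t) * np.sqrt(var)                     # lower confidence bound
    x_t = xs[np.argmin(u)]                                      # next evaluation point
    X.append(x_t)
    y.append(f(x_t) + sigma * np.random.randn())
```

In practice one would replace the grid search over the utility by a numerical optimizer, and estimate the kernel hyperparameters from the collected data.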
Entropy Search
The limitation of GP-UCB-type algorithms is that they solely focus
on minimizing regret. It might not be true that you always want to
collect the minimum function values. Ideally we would like, in a
guided fashion, to efficiently learn where the minimum is. This is
the driving idea behind Entropy Search40. In particular, instead
of evaluating where you think the minimum lies, evaluate where
you expect to learn most about the minimum.

[Figure 108: Entropy Search. Top: GP posterior hypotheses. Bottom: Utility u(x).]
40 Villemonteix et al., 2009; Hennig & Schuler, 2012

For this, we need to
make use of the entropy:
    H(p) := − ∫ p(x) log( p(x) / b(x) ) dx

of various hypotheses with base measure b. Then, we define the


utility as follows:

    u(x) = H_t(p_min) − E_{y_{t+1}}[ H_{t+1}(p_min) ]
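Both p_min and the expectation over y_{t+1} are intractable and must be approximated. Below is a crude Monte-Carlo sketch of the idea on a discretized domain with a uniform base measure; the fantasizing scheme and the sample sizes are illustrative assumptions, not the approximations used by the cited papers.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete p_min relative to a uniform base measure b."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p * len(p)))   # H(p) = -sum p log(p / b), b = 1/len(p)

def p_min_from_samples(S):
    """Empirical distribution of the minimizer from joint posterior samples S."""
    return np.bincount(S.argmin(axis=1), minlength=S.shape[1]) / len(S)

def utility(x_idx, mu, cov, sigma, n_fantasies=20, n_samples=500):
    """u(x) = H_t(p_min) - E_{y_{t+1}}[H_{t+1}(p_min)], estimated by Monte Carlo."""
    jitter = 1e-9 * np.eye(len(mu))
    samples = np.random.multivariate_normal(mu, cov + jitter, size=n_samples)
    H_now = entropy(p_min_from_samples(samples))
    H_next = 0.0
    for _ in range(n_fantasies):
        # fantasize an observation at grid point x_idx from the predictive marginal
        y_f = mu[x_idx] + np.sqrt(cov[x_idx, x_idx] + sigma ** 2) * np.random.randn()
        # rank-one Gaussian conditioning of the grid posterior on that observation
        gain = cov[:, x_idx] / (cov[x_idx, x_idx] + sigma ** 2)
        mu_f = mu + gain * (y_f - mu[x_idx])
        cov_f = cov - np.outer(gain, cov[x_idx, :])
        samples_f = np.random.multivariate_normal(mu_f, cov_f + jitter, size=n_samples)
        H_next += entropy(p_min_from_samples(samples_f))
    return H_now - H_next / n_fantasies
```

One would then evaluate next at the candidate x with the largest estimated u(x), i.e. where the expected reduction in the entropy of p_min is greatest.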

Following are several settings in which information-based search is


preferable:

• “prototyping-phase” followed by “product release”

• structured uncertainty with variable signal-to-noise ratio

• “multi-fidelity”: Several experimental channels of different cost


and quality, e.g.

– simulations vs. physical experiments


– training a learning model for a variable time
– using variable-size datasets

Regret-based optimization is easy to implement and works well


on standard problems. But it is a strong simplification of reality, in
which many practical complications cannot be phrased.

Condensed content

• the bandit setting formalizes iid. sequential decision making


under uncertainty

• bandit algorithms can achieve “no regret” performance, even


without explicit probabilistic priors

• Bayesian optimization extends to continuous domain

• it lies right at the intersection of computational and physical


learning

• requires significant computational resources to run a numerical


optimizer inside the loop

• allows rich formulation of global, stochastic, continuous, struc-


tured, multi-channel design problems

• is currently the state of the art in the solution of challenging


optimization problems
