Machine Learning: A Probabilistic Approach
David Barber
http://www.idiap.ch/∼barber
© David Barber 2001, 2002, 2003, 2004, 2006
Contents
1 Introduction
  1.1 Machine Learning
    1.1.1 Unsupervised Learning
    1.1.2 Supervised Learning
  1.2 Supervised Learning Approaches
2 Generalisation
  2.1 Introduction
    2.1.1 Supervised Learning
    2.1.2 Training Error
    2.1.3 Test Error
    2.1.4 Validation Data
    2.1.5 Dodgy Joe and Lucky Jim
    2.1.6 Regularisation
  2.2 Problems
  2.3 Solutions
8 Autoencoders
  8.1 Introduction
    8.1.1 Linear Dimension Reduction (PCA)
    8.1.2 Manifolds: The need for non-linearity
  8.2 Non-linear Dimension Reduction
    8.2.1 Training Autoencoders
  8.3 Uses of Autoencoders
    8.3.1 A Visualisation example
9 Data Visualisation
  9.1 Classical Scaling
    9.1.1 Finding the optimal points
    9.1.2 The Algorithm
  9.2 Sammon Mapping
  9.3 A word of warning
    10.4.3 d-Separation
  10.5 Graphical Models
    10.5.1 Markov Random Fields
    10.5.2 Expressiveness of Graphical Models
  10.6 Problems
24 Sampling
  24.1 Introduction
  24.2 Markov Chain Monte Carlo (MCMC)
There are many motivations as to why one might want a machine to “learn” from
data. By “learn”, I have in mind applications that would be hard to program in a
traditional manner, such as the task of face recognition. Formally specifying why
you recognise a collection of images as examples of John’s face may be extremely
difficult. An alternative is to give examples of John’s face and let a machine “learn”
– based on the statistics of the data – what it is that differentiates John’s face from
other faces in the database. That is not to say that all information can be learned
solely on the basis of large databases – prior information about the domain is often
crucial to the successful application of machine learning.
The connection between probability and machine learning stems from the idea
that probabilistic models enable us to form a compact description of complex
phenomena underlying the generation of the data. Graphical models are simply
ways of depicting the independence assumptions behind a probabilistic model.
They are useful in modelling since they provide an elegant framework in which to express basic independence assumptions about the processes generating the data. This is useful since the calculus of probability then transfers to graph-theoretic operations and algorithms, many of which have deep roots in computer science and related areas[1].
This book is intended as a (non-rigorous) introduction to machine learning, prob-
abilistic graphical models and their applications. Formal proofs of theorems are
generally omitted. These notes are based on lectures given to both undergraduate and graduate students at Aston University, Edinburgh University, and EPF Lausanne.
A baby processes a mass of initially confusing sensory data. After a while the
baby begins to understand her environment in the sense that novel sensory data
from the same environment is familiar or expected. When a strange face presents
itself, the baby recognises that this is not familiar and may be upset. The baby
has learned a representation of the familiar and can distinguish the expected from
the unexpected, without an explicit supervisor to say what is right or wrong. Unsupervised learning just addresses how to model an environment. Clustering is an example of unsupervised learning, whereby the aim is to partition a set of data into clusters. For example, given a set of questionnaire responses, find clusters whereby in each cluster the responses are 'similar'. This area is also sometimes called descriptive modelling, where we just wish to fit a model which describes succinctly and accurately the data in the database.
This story is an example of supervised learning. Here the father is the supervisor,
and his son is the ‘learner’, or ‘machine learner’ or ‘predictor’. The nice point
about this story is that you can’t expect miracles – unless you explicitly give ex-
tra information, learning from examples may not always give you what you might
hope for. On the other hand, if they had been there the whole week, probably the
son would have learned a reasonably good model of a sports car, and helpful hints
by the father would be less important. It’s also indicative of the kinds of problems
typically encountered in machine learning in that it is not really clear anyway what
a sports car is – if we knew that, then we wouldn’t need to go through the process
of learning!
Predictive modelling: We typically have a training set of labelled data. For example, here are some data:

  nationality   British   Dutch   Taiwanese   British
  height (cm)   175       195     155         165
  sex           m         m       f           f
Classification: Given a set of inputs, predict the class (one of a finite number of discrete labels). Normally, the class is nominal (there is no intrinsic ordering information in the class label). For example, given an image of a handwritten digit, predict whether it is
0,1,2,3,4,5,6,7,8 or 9. This would be a 10-class classification problem. Many prob-
lems involve binary classes (you can always convert a multi-class problem into a
set of binary class problems – though this is not always natural or desirable). For
binary classes, there is usually no information as to whether we say the data are
labelled as class 0 or 1, or alternatively as class 1 or 2. For example, the sports-car
classification problem would have been the same if the father said ‘1’ or ‘0’ when
the car passing by was a sports car or not. A great many problems in machine learning are classification problems. Uncertainty will ultimately play
a key role in any real world application. Can we really say that Mr Smith will
definitely default on his loan? This may seem a very strong statement if there is
little obvious difference between the attributes of Mr Smith and Mr Brown.
Regression: Given a set of inputs, predict the output (a continuous value). For example, given
historical stock market data, predict the course of the FTSE for tomorrow.
Reinforcement learning: A kind of supervised learning in which the supervisor
provides rewards for actions which improve a situation and penalties for deleterious
actions.
Figure 1.2: Here, each point in this space represents a high dimensional vector
x, which has an associated class label, either Male or Female. The point x∗ is
a new point for which we would like to predict whether this should be male or
female. In the generative approach, a Male model would be produced which would
ideally generate data which would be similar to the ‘m’ points. Similarly, another
model, the Female model should ideally generate points that are similar to the
‘f’ points above. We then use Bayes’ rule to calculate the probability p(male|x∗ )
using the two fitted models, as given in the text. In the discriminative case, we
directly make a model of p(male|x∗ ), which cares less about how the points ‘m’ or
‘f’ are distributed, but more about whether or not there is a simple description of
a boundary which can separate the two classes, as given by the line.
Generative Approach
p(c|v∗) = p(v∗|c) p(c) / p(v∗)    (1.2.1)
That model c with the highest posterior probability p(c|v ∗ ) is designated the pre-
dicted class.
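To make this concrete, a minimal MATLAB sketch of the generative approach is given below. It is illustrative only: the two dimensional data and the variable names are made up, and it is not code from these notes. Each class is fitted with a Gaussian, and Bayes' rule (1.2.1) gives the posterior probability that a novel point v belongs to the first class.

% Illustrative sketch: generative classification with one Gaussian per class.
Xm = randn(2,50) + repmat([2;0],1,50);    % hypothetical data for class 'male'
Xf = randn(2,60) + repmat([-2;0],1,60);   % hypothetical data for class 'female'
v  = [0.5; 0.1];                          % novel point to classify
mum = mean(Xm,2); Sm = cov(Xm');          % fit a Gaussian to each class
muf = mean(Xf,2); Sf = cov(Xf');
pm  = size(Xm,2)/(size(Xm,2)+size(Xf,2)); % prior p(c = male)
pf  = 1 - pm;
% Gaussian density N(x; mu, S)
gauss = @(x,mu,S) exp(-0.5*(x-mu)'*(S\(x-mu)))/sqrt((2*pi)^length(x)*det(S));
% Bayes' rule, equation (1.2.1)
pmale = gauss(v,mum,Sm)*pm/(gauss(v,mum,Sm)*pm + gauss(v,muf,Sf)*pf);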
Advantages : In general, the potential attraction of a generative approach is that prior
information about the structure of the data is often most naturally specified
through the generative model p(v|c).
Disadvantages : A potential disadvantage of the generative approach is that it does not di-
rectly target the central issue which is to make a good classifier. That is, the
goal of generative training is to model the observation data v as accurately
as possible, and not to model the class distribution. If the data v is complex,
or high-dimensional, it may be that finding a suitable generative data model
is a difficult task. Furthermore, since each generative model is separately
trained for each class, there is no competition amongst the models to explain
the data. In particular, if each class model is quite poor, there may be little
confidence in the reliability of the prediction. In other words, training does
not focus explicitly on the differences between the classes, but rather on accurately modelling the data distributions from each associated class.
Discriminative Approach
Arguably all machine learning approaches are based on some notion of smooth-
ness or regularity underlying the mechanism that generated the observed data.
Roughly speaking : if two datapoints are close neighbours, they are likely to be-
have similarly.
The general procedure will be to postulate some model and then adjust its parameters to best fit the data. For example, in a regression problem, we may think
that the data {(xµ , y µ ), µ = 1, . . . , P }, where x is an input and y an output, is well
modelled by the function y = wx, and our task is to find an appropriate setting
of the parameter w. An obvious way to do this is to see how well the current
model predicts the training data that we have, and then to adjust the parameter
w to minimise the errors that our model makes in predicting the data. This general procedure will therefore usually involve optimisation methods, typically in high dimensional parameter spaces (although the above is a one-dimensional example).
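As a minimal sketch of this idea (with made-up one dimensional data, not an example from the text), the single parameter w of the model y = wx can be set by minimising the summed squared prediction error; setting the derivative to zero gives w in closed form:

% Illustrative sketch: fit y = w*x by minimising the squared training error.
x = [0.1 0.5 0.9 1.3 2.0];       % hypothetical training inputs
y = [0.2 1.1 1.7 2.8 3.9];       % hypothetical training outputs
w = sum(x.*y)/sum(x.^2);         % minimiser of sum_mu (y^mu - w*x^mu)^2
Etrain = sum((y - w*x).^2);      % training error of the fitted model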
Noise, overfitting and generalisation: In the case that there is noise on the data (sometimes the father might be inconsistent in his labelling of sports cars, or there might be essentially random perturbations on the FTSE index), we don't want to model this noise. That is, we have to be careful to make sure that our models only capture the underlying process that we are truly interested in, and not necessarily the exact details of the training data. If we have an extremely flexible model, it may overfit noisy training data and be a very poor predictor of future novel inputs (that is, it will generalise poorly). This is a very important topic and central to machine learning. We shall return to this in a later chapter.
I. Machine Learning : More Traditional Approaches
2 Generalisation
2.1 Introduction
One major goal in supervised learning is, on the basis of labelled training data, to
encapsulate the underlying mechanism which generated the data, thereby learning
a model with predictive power. That is, given a novel unlabelled instance, to make
an accurate prediction.
Figure 2.1: Left: Training Data for a regression problem. We wish to fit a function
f (x|θ) to this data. Right: A straight line fit might look reasonable.
Figure 2.2: Left: A 10th order polynomial fit. This has zero training error. Right: The "correct" clean underlying function which generated the data.
The aim is that, given a novel point x, our prediction f (x|θ) will be accurate. What do we
mean by accurate? If we had some extra data, Dtest , different from the training
data and generated in the same manner, then we would like that the error made
by our predictions is roughly the same as the error that would be made even if we
knew exactly what the clean underlying data generating process were. Of course,
this is in some sense, an impossible task. However, we can devise procedures that
can help give us some confidence in our predictions.
The typical way that we train/learn the adjustable parameters θ of our model is
to optimise some objective function. For example, if our current model outputs
f (xµ |θ) on an input xµ and the training data output for that xµ is tµ , we would like
to adjust the parameters θ such that f (xµ |θ) and tµ are close. We can measure how
close these values are by using a function d(x, y) which measures the discrepancy
between two outputs x and y. To find the best parameter settings for the whole
training set, we use
Etrain(θ) = Σ_{(xµ,tµ) ∈ Dtrain} d(f(xµ|θ), tµ)    (2.1.1)
If we adjust the parameters θ to minimise the training error, what does this tell
us about the prediction performance on a novel point x? In principle, nothing!
However, in practice, since the mechanisms which generate data are in some sense
smooth, we hope that our predictions will be accurate. We saw that in the case of
using a perceptron, we can always find a hyperplane that separates the data, provided that the dimension of the data space is larger than the number of training examples.
In this case, the training error is zero. We saw, however, that the error on the
600 test examples was non-zero. Indeed, if the training data is believed to be a
corrupted version of some clean underlying process, we may not wish to have a
zero training error solution since we would be “fitting the noise”. What kind of
error would we expect that our trained model would have on a novel set of test
data?
Imagine that we have gone through a procedure to minimise training error. How
can we assess if this will have a good predictive performance – i.e., will generalise
well? If we have an independent set of data Dtest , the test error
Etest(θ) = Σ_{(xµ,tµ) ∈ Dtest} d(f(xµ|θ), tµ)    (2.1.2)
gives an indication of the expected prediction performance on novel data.
Consider two competing prediction model classes, f1 (x|θ 1 ) and f2 (x|θ 2 ). We train
each of these by minimising the training error to end up with training error “opti-
mal” parameter settings θ∗1 and θ∗2 . Which is the better model? Is it the one with
the lower training error? No. We can say that the model with setting θ∗1 is better than the model with setting θ∗2 if the test errors satisfy Etest(θ∗1) < Etest(θ∗2).
Using test data in this way enables us to validate which is the better model.
The standard procedure is to split any training data into three sets. The first is
the training data, Dtrain , used to train any model. The second Dvalidate is used to
assess which model has a better test performance. Once we have chosen our optimal
model on the basis of using validation data, we can get an unbiased estimate of the
expected performance of this model by using a third set of independent data Dtest .
This data should not have been used in any way during the training procedure if
we wish to obtain an unbiased estimate of the expected test performance of the
model.
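A sketch of such a split in MATLAB might look as follows (the set sizes are placeholders, and the data itself is not shown):

% Illustrative sketch: randomly split P datapoints into training,
% validation and test sets.
P = 1200; ntrain = 400; nval = 200;   % hypothetical sizes
perm   = randperm(P);                 % random ordering of the datapoint indices
itrain = perm(1:ntrain);              % indices of Dtrain
ival   = perm(ntrain+1:ntrain+nval);  % indices of Dvalidate
itest  = perm(ntrain+nval+1:end);     % indices of Dtest (the remaining points)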
Perhaps the following parody will make the above arguments clearer:
Let me introduce two characters, “Lucky Jim” and “Dodgy Joe”. Lucky Jim in-
vents some new procedure, and initially, finds that it works quite well. With further
experimentation, he finds that it doesn’t always work, and that perhaps it requires
some rather fine tuning to each problem. Undeterred, this charismatic scientist
attracts both funds and attention enough to stimulate a world wide examination of
his method. Working independently of each other, sure enough, research groups from around the world begin to report that they manage to achieve zero test error
on each problem encountered. Eventually, some research group reports that they
have found a procedure, based on Lucky Jim’s method that is able to give zero test
error on every problem that has ever been known to exist. After so many years of
hard work, Lucky Jim happily announces his universal predictor (perhaps a billion
hidden unit neural network with fixed parameters), with the (indeed true) claim
that it gives zero test error on every known problem that ever existed. He markets
this product and hopes to claim a fortune.
Contrast the dogged determination of Lucky Jim now with the downright un-
scrupulous behaviour of Dodgy Joe. Quite frankly, he doesn’t have the patience
of Lucky Jim, and he simply assembles all the known problems that ever existed,
and their corresponding test sets. He then constructs his method such that, when
asked to perform the prediction on problem A with corresponding test set B, he
simply makes the output of his method the output for the test set B (which he of
course knows). That is, his algorithm is nothing more than a lookup table - if the
user says, “this is the test set B” then Dodgy Joe’s algorithm simply reads off the
predictions for test set B which, by definition, will give zero error. He then also markets his universal predictor package as giving zero test error on every known problem (which is indeed true) and also hopes to make a fortune.
If we look at this objectively, both Lucky Jim and Dodgy Joe’s programs are doing
the same thing, even though they arrived at the actual code for each method in
a different way. They are both nothing more than lookup tables. The point is
that we have no confidence whatsoever that either Lucky Jim’s or Dodgy Joe’s
package will help us in our predictions for a novel problem. We can only have
confidence that a method is suitable for our novel problem if we believe that a
particular method was successful on a similar problem to ours in the past, or the
assumptions that resulted in successful prediction on a previous problem might
well be expected to hold for a novel problem – smoothness of the problems for
example.
The above also highlights the issue that it is not enough to assess a method only
on the reported results of a subset of independent research groups. It may be that,
with the same method (eg neural nets with a fixed architecture but undetermined
parameters) one of a hundred groups which decide to tackle a particular problem
is able to find that particular set of parameter values (essentially by chance) that
gives good test performance, whilst the other 99 groups naturally do not report
their poor results. In principle, real comparison of a method on a problem requires
the collation of all results from all sources (attempts).
WowCo.com is a new startup prediction company. After years of failures, they
eventually find a neural network with a trillion hidden units that achieves zero
test error on every learning problem posted on the internet up till January 2002.
Each learning problem included a training and test set. Proud of their achievement,
they market their product aggressively with the claim that it ‘predicts perfectly
on all known problems’. Would you buy this product?
Model comparison, an example: Let us reconsider our favourite digit classification problem. There are 1200 examples of the digits 1 and 7. Let us split this to form a new training set of 400 examples and a validation set of 200 examples. We will retain a further 600 examples to measure the test error. I used PCA to reduce the dimensionality of
the inputs, and then nearest neighbours to perform the classification on the 200
validation examples. Based on the validation results, I selected 19 as the number
of PCA components retained, see fig(2.3). The independent test error on 600 in-
dependent examples using 19 dimensions is 14. Once we have used the validation
data to select the best model, can we use both training and validation data to
retrain the optimal model? In this case, we would have decided that 19 is the op-
timal dimension to use, based on 400 training and 200 validation points. Can we now, having decided to use 19 components, retrain the model on the 600 training
Figure 2.3: 400 training examples are used, and the validation error plotted on
200 further examples. Based on the validation error, we see that a dimension of
19 is reasonable.
and validation points? This is a fairly subtle point. In principle, the answer is
no, since the new procedure, strictly speaking, corresponds to a different model.
However, in practice, there is probably little to be lost by doing so, and possibly
something to be gained since we are making use of more training data in setting
many parameters. These issues highlight some of the philosophical complications
that can arise based on a frequentist interpretation of probability. No such diffi-
culties arise in the Bayesian framework, where all the training data can be used in
a clear and consistent manner for model selection.
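The selection procedure used in this example might be coded roughly as follows. This is only a sketch (not the code actually used): xtrain, ctrain, xval and cval are assumed to hold the training and validation inputs (stored columnwise) and their class labels, and a simple nearest neighbour classifier is applied in the projected space.

% Illustrative sketch: choose the number of PCA components by validation error.
mu = mean(xtrain,2);
[E,D] = eig(cov(xtrain'));                        % eigenvectors of the covariance matrix
[dummy,order] = sort(-diag(D)); E = E(:,order);   % sort by decreasing eigenvalue
for M = 1:100                                     % candidate number of components
    B = E(:,1:M);
    ptrain = B'*(xtrain - repmat(mu,1,size(xtrain,2)));
    pval   = B'*(xval   - repmat(mu,1,size(xval,2)));
    errs = 0;
    for i = 1:size(pval,2)                        % nearest neighbour classification
        d2 = sum((ptrain - repmat(pval(:,i),1,size(ptrain,2))).^2,1);
        [dummy,nn] = min(d2);
        errs = errs + (ctrain(nn) ~= cval(i));
    end
    valerror(M) = errs;
end
[dummy,bestM] = min(valerror);                    % dimension with the lowest validation error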
2.1.6 Regularisation
If the data generation process includes noise (additive), then the true, clean data
generating process will typically be smoother than the observed data would directly
suggest. To try to discover this smoother clean underlying process, we need to
ensure that our model for the clean underlying process does not fit the noise in
the observed data. That is, it is undesirable to have a zero training error, and we
need to encourage our model to be smooth. One way to achieve this is through regularisation, in which an extra "penalty" term is added to the standard training error to form the regularised training error
Eregtrain(θ, λ) = Etrain(θ) + λ Ereg(θ)    (2.1.3)
where Ereg(θ) penalises non-smooth (complex) solutions.
The larger λ is, the smoother the solution that minimises the regularised training error will be. If we regularise too much, the solution will be inappropriately smooth. If we don't regularise at all, the solution will be overly complex and will fit the noise in the data. (Regularisation only really makes
sense in the case of models which are complex enough that overfitting is a potential
problem. There is little point in taming a pussy-cat; taming a lion however, might
be worthwhile!). How do we find the “optimal” value for λ? Training is then done
in two stages:
• For a fixed value of λ, find θ∗ that optimises Eregtrain. Repeat this for each value of λ that you wish to consider. This gives rise to a set of models θ∗λi, i = 1, . . . , V.
• For each of these models, on a separate validation set of data (different from
the training data used in the first step), calculate the validation error:
Eval(θ∗) = Σ_{(xµ,tµ) ∈ Dval} d(f(xµ|θ∗), tµ)    (2.1.4)
The “optimal” model is that which has the lowest validation error.
Regularisation, an example: In fig(2.4), we fit the function t = a sin(wx) to data, learning the parameters
a and w. The unregularised solution badly overfits the data, and has a high
validation error. To encourage a smoother solution, I used a regularisation term
Ereg = w2 . I then computed the validation error based on several different values
of the regularisation parameter λ, finding that λ = 0.5 gave a low validation error.
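The experiment might be coded along the following lines. This is a sketch only: the data generating function and the candidate λ values are invented for illustration, and fminsearch is used for the non-linear optimisation of a and w.

% Illustrative sketch: regularised fit of t = a*sin(w*x), with lambda chosen
% by validation error.
xt = rand(1,20)*6 - 3; tt = sin(10*xt) + 0.3*randn(1,20);  % hypothetical training data
xv = rand(1,20)*6 - 3; tv = sin(10*xv) + 0.3*randn(1,20);  % hypothetical validation data
lambdas = [0 0.1 0.5 1 5];
for i = 1:length(lambdas)
    lambda = lambdas(i);
    Ereg = @(p) sum((tt - p(1)*sin(p(2)*xt)).^2) + lambda*p(2)^2;
    p = fminsearch(Ereg,[1 5]);                   % optimise a = p(1) and w = p(2)
    Eval(i) = sum((tv - p(1)*sin(p(2)*xv)).^2);   % validation error for this lambda
end
[dummy,best] = min(Eval);                         % lambda with the lowest validation error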
Figure 2.4: Left: The unregularised fit (λ = 0) to the training data (×). Whilst the training data is well fitted, the error on the validation examples (+) is high. Right:
the regularised fit (λ = 0.5). Whilst the training error is high, the validation error
(which is all important) is low. The true function which generated this noisy data
is the dashed line, and the function learned from the data is the solid line.
2.2 Problems
Exercise 1 WowCo.com is a new startup prediction company. After years of
failures, they eventually find a neural network with a trillion hidden units that
achieves zero test error on every learning problem posted on the internet up till
January 2002. Each learning problem included a training and test set. Proud of
their achievement, they market their product aggressively with the claim that it
‘predicts perfectly on all known problems’. Would you buy this product? Justify
your answer.
2.3 Solutions
3 Nearest Neighbour Classification
Things x which are similar (in x-space) should have the same class label – in other words, 'just say whatever your neighbour says!'
(This is a kind of smoothness assumption. Note that in this chapter, we won’t
explicitly construct a ‘model’ of the data in the sense that we could generate fake
representative data with the model. It is possible, however, to come up with a
model based on the above neighbourhood type idea which does just this. We will
see how to do this when we learn about density estimation in a later chapter.)
What does ‘similar’ mean? The key word in the above strategy is ‘similar’. Given two vectors x and y rep-
resenting two different datapoints, how can we measure similarity? Clearly, this
would seem to be rather subjective – two datapoints that one person thinks are 'similar' may seem dissimilar to someone else.
The dissimilarity function: Usually we define a function d(x, y), symmetric in its arguments (d(x, y) = d(y, x)), that measures the dissimilarity between the datapoints x and y.
It is common practice to adopt a simple measure of dissimilarity based on the
squared euclidean distance d(x, y) = (x − y)T (x − y) (often more conveniently
written (x − y)2 ) between the vector representations of the datapoints. There
can be problems with this but, in general, it’s not an unreasonable assumption.
However, one should bear in mind that more general dissimilarity measures can, and often are, used in practice.
Some say that nearest neighbours methods might be construed as machine learn-
ing’s “dirty secret” – one can often get very good results with such a simple method,
and the more sophisticated methods don't really provide much more. Well, it's no secret that machine learning depends, as we discussed, on a sense of smoothness in the data – this is no dirty secret – it's the fundamental assumption upon which
most machine learning algorithms are based. Having said that, nearest neighbour
methods are a good starting point in many applications, since they are intuitive
and easy to program. I would recommend this approach as a first step in trying to understand the problem.
The nearest neighbour algorithm to classify a novel point x, given training data (xµ, cµ), µ = 1, . . . , P, is:
1. Calculate the dissimilarity of the test point x to each of the stored points, dµ = d(x, xµ).
2. Find the training point xµ∗ which is 'closest' to x, by finding that µ∗ such that dµ∗ < dµ for all µ = 1, . . . , P.
3. Assign the class label c(x) = cµ∗.
In the case that there are two or more ‘equidistant’ (or equi-dissimilar) points with
different class labels, the most numerous class is chosen. If there is no one single
most numerous class, we can use the K-nearest-neighbours case described in the
next section.
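In MATLAB the rule can be written in a few lines. This is a sketch only: xtrain is assumed to be an N × P matrix of training points stored columnwise, ctrain their class labels, and x the novel point.

% Illustrative sketch of the nearest neighbour rule.
d = sum((xtrain - repmat(x,1,size(xtrain,2))).^2,1);  % squared euclidean distances
[dummy,mustar] = min(d);                              % index of the closest training point
c = ctrain(mustar);                                   % assign its class label to x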
The decision boundary: In general, the decision boundary is the boundary in input space such that our decision as to the class of the input changes as we cross this boundary. In the nearest neighbour algorithm above, based on the squared euclidean distance, the decision boundary is determined by the lines which are the perpendicular bisectors of the closest training points with different training labels, see fig(3.2). This is called a Voronoi tessellation.
Figure 3.2: The decision boundary for the nearest neighbour classification rule is
piecewise linear with each segment corresponding to the perpendicular bisector
between two datapoints belonging to different classes.
Figure 3.3: Consider data which lie close to (hyper)planes. The Euclidean distance
would classify ? as belonging to class 2 – an undesirable effect.
The nearest neighbours algorithm is extremely simple yet rather powerful, and
used in many applications. There are, however, some potential drawbacks:
How should we measure the distance between points? Typically one uses the
euclidean square distance, as given in the algorithm above. This may not always
be appropriate. Consider a situation such as in fig(3.3), in which the euclidean distance leads to an undesirable result. Invariance to linear transformation: if we use the euclidean distance (x − y)T (x − y), then the distance between the orthogonally transformed vectors M x and M y (where M T M is the identity matrix) remains the same. (This is not true for the Mahalanobis distance.) Since classification will be invariant to such
transformations, this shows that we do not make a sensible model of how the data
is generated – this is solved by density estimation methods – see later chapter.
Mahalanobis distance: The Mahalanobis distance (x − y)T Ai (x − y), where usually Ai is the inverse covariance matrix of the data from class i, can overcome some of these problems. I think it's better to use density estimation methods.
In the simple version of the algorithm as explained above, we need to store the
whole dataset in order to make a classification. However, it is clear that, in general,
only a subset of the training data will actually determine the decision boundary.
Data editing: This can be addressed by a method called data editing, in which datapoints which do not affect (or only very slightly affect) the decision boundary are removed from the training dataset.
Dimension reduction: Each distance calculation could be quite expensive if the datapoints are high dimensional. Principal Components Analysis (see the chapter on linear dimension reduction) is one way to address this, by first replacing each high dimensional datapoint xµ with its low dimensional PCA components vector pµ. The squared euclidean distance between two datapoints, (xa − xb)², is then approximately given by (pa − pb)² – thus we need only calculate distances among the PCA representations of the data. This can often also improve the classification accuracy.
Sensitivity to outliers: An outlier is a 'rogue' datapoint which has a strange label – this may be the result of errors in the database. If every other point that is close to this rogue point has a consistently different label, we wouldn't want a new test point to take the label of the rogue point. K nearest neighbours is a way to classify datapoints more robustly, by looking at more than just the nearest neighbour.
K nearest neighbours: To classify a novel point x, we consider a hypersphere centred on the point x with radius r. We increase the radius r until the hypersphere
contains exactly K points. The class label c(x) is then given by the most numerous
class within the hypersphere. This method is useful since classifications will be
robust against “outliers” – datapoints which are somewhat anomalous compared
with other datapoints from the same class. The influence of such outliers would
be outvoted.
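Growing a hypersphere until it contains exactly K points is equivalent to taking the K smallest distances; a sketch of the resulting classification rule (same hypothetical variable names as before) is:

% Illustrative sketch of K nearest neighbours with a majority vote.
K = 3;
d = sum((xtrain - repmat(x,1,size(xtrain,2))).^2,1);  % squared distances to x
[dummy,order] = sort(d);                              % closest training points first
neigh = ctrain(order(1:K));                           % labels of the K nearest points
labels = unique(neigh);
counts = zeros(1,length(labels));
for i = 1:length(labels)
    counts(i) = sum(neigh == labels(i));              % votes for each label
end
[dummy,imax] = max(counts);
c = labels(imax);                                     % most numerous class wins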
How do we choose K? Clearly if K becomes very large, then the classifications will become all the same
– simply classify each x as the most numerous class. We can argue therefore that
there is some sense in making K > 1, but certainly little sense in making K = P
(P is the number of training points). This suggests that there is some “optimal”
intermediate setting of K. By optimal we mean that setting of K which gives the
best generalisation performance. One way to do this is to leave aside some data
that can be used to test the performance of a setting of K, such that the predicted
class labels and the correct class labels can be compared. How we define this is
the topic of a later chapter.
time, 18 errors are found using nearest neighbour classification – a 3% error rate for
this two class problem. The 18 test points on which the nearest neighbour method
makes errors are plotted in fig(3.7). Certainly this is a more difficult task than
distinguishing between zeros and ones. If we use K = 3 nearest neighbours, the
classification error reduces to 14 – a slight improvement. State of the art: real world handwritten
digit classification is big business. The best methods classify real world digits (over
all 10 classes) to an error of less than 1% – better than human average performance.
Figure 3.5: (left) Some of the 300 training examples of the digit zero and (right)
some of the 300 training examples of the digit one.
Figure 3.6: Some of the 300 training examples of the digit seven.
Figure 3.7: The Nearest Neighbour method makes 18 errors out of the 600 test
examples. The 18 test examples that are incorrectly classified are plotted (above),
along with their nearest neighbour in the training set (below).
p(c = 0|x∗) = p(x∗|c = 0) p(c = 0) / [ p(x∗|c = 0) p(c = 0) + p(x∗|c = 1) p(c = 1) ]
which follows from using Bayes' rule. One can show (exercise) that the maximum likelihood setting of p(c = 0) is P0/(P0 + P1), and p(c = 1) = P1/(P0 + P1). A similar expression holds for p(c = 1|x∗).
Often in machine learning, the data is very high dimensional. In the case of
the hand-written digits from chapter(3), the data is 784 dimensional. Images are
a good example of high dimensional data, and a good place where some of the
basic motivations and assumptions about machine learning come to light. For
simplicity, consider the case of the handwritten digits in which each pixel is binary
– either 1 or 0. In this case, the total possible number of images that could ever exist is 2^784 ≈ 10^236 – this is an extremely large number (very much larger than
the number of atoms in the universe). However, it is clear that only perhaps
at most a hundred or so examples of a digit 7 would be sufficient (to a human)
to understand how to recognise a 7. Indeed, the world of digits must therefore
lie in a highly constrained subspace of the 784 dimensions. It is certainly not
true that each dimension is independent of the other in the case of digits. In
other words, certain directions in the space will be more important than others for describing digits. This is exactly the hope, in general, for machine learning – that only a relatively small number of directions are relevant for describing the
true process underlying the data generating mechanism. That is, any model of the
data will have a relatively low number of effective degrees of freedom. These lower
dimensional independent representations are often called ‘feature’ representations,
since it is these quintessential features which succinctly describe the data.
Linear dimension reduction: In general, it seems clear that the way dimensions depend on each other is, for a
general machine learning problem (and certainly the digits data) very complex –
certain dimensions being ‘on’ means that others are likely to be ‘off’. This suggests
that non-linear effects will, in general, be important for the efficient description of
data. However, finding non-linear representations of data is numerically difficult.
Here, we concentrate on linear dimension reduction in which a high dimensional
datapoint x is represented by y = Fx where the non-square matrix F has di-
mensions dim(y) × dim(x), dim(y) < dim(x). The matrix F represents a linear
projection from the higher dimensional x space to the lower dimensional y space.
The form of this matrix determines what kind of linear projection is performed
and, classically, there are several popular choices. The two most popular corre-
spond to Principal Components Analysis (PCA) and Linear Discriminants. The
first is an unsupervised and the latter a supervised projection. We concentrate
in this chapter on the more generic PCA, leaving linear discriminants to a later
chapter. Note that, again, these methods do not describe any model from which we could generate data, and are also non-probabilistic. However, probabilistic data
generating versions do exist which are model based but are beyond the scope of
this course.
Figure 4.1: In linear dimension reduction we hope that data that lies in a high
dimensional space lies close to a hyperplane that can be spanned by a smaller
number of vectors.
whole space; rather, it is a 'basis' which approximately spans the space where
the data is concentrated. Effectively, we are trying to choose a more appropriate
low dimensional co-ordinate system that will approximately represent the data.
Mathematically, we write
x ≈ c + Σ_{i=1}^{M} wi b^i    (4.1.1)
Figure 4.2: Projection of two dimensional data using one dimensional PCA. Plotted
are the original datapoints (crosses) and their reconstructions using 1 dimensional
PCA (circles). The two lines represent the eigenvectors and their lengths their
corresponding eigenvalues.
5. The total squared error over all the training data made by the approximation is (P − 1) Σ_{j=M+1}^{N} λj, where λj, j = M + 1, . . . , N are the eigenvalues discarded in the projection.
One can view the PCA reconstructions (though there is usually little use for these
except to check that they give an adequate representation of the original data) as
orthogonal projections of the data onto the subspace spanned by the M largest
eigenvectors of the covariance matrix, see fig(4.2).
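A sketch of the computation in MATLAB (illustrative only; X is assumed to be an N × P data matrix with one datapoint per column, and M the number of retained components):

% Illustrative sketch of PCA and the reconstructions of equation (4.1.1).
mu = mean(X,2);
Xc = X - repmat(mu,1,size(X,2));          % zero mean the data
[E,D] = eig(cov(Xc'));                    % eigen-decomposition of the covariance matrix
[dummy,order] = sort(-diag(D));
B = E(:,order(1:M));                      % the M principal directions b^i
W = B'*Xc;                                % low dimensional coefficients w_i
Xrec = repmat(mu,1,size(X,2)) + B*W;      % reconstructions c + sum_i w_i b^i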
Interpreting the eigenvectors: Do the eigenvectors themselves explicitly have any meaning? No! They only
act together to define the linear subspace onto which we project the data – in
themselves they have no meaning. We can see this since, in principle, any basis
which spans the same subspace as the eigenvectors of the covariance matrix is
equally valid as a representation of the data. For example, any rotation of the
basis vectors within the subspace spanned by the first M eigenvectors would also
have the same reconstruction error. The only case when the subspace is uniquely
defined is when we only use one basis vector – that is, the first principal component of the covariance matrix alone.
The "intrinsic" dimension of data: How many dimensions should the linear subspace have? As derived (at the end of the chapter), the reconstruction error is given by the sum of the discarded (smallest) eigenvalues of the covariance matrix. If we plot the eigenvalue spectrum (the set of eigenvalues
ordered by decreasing value), we might hope to see a few large values and many
small values. Indeed, if the data did lie very close to say a M dimensional linear
manifold (hyperplane), we would expect to see M large eigenvalues, and the rest to
be very small. This would give an indication of the number of degrees of freedom
in the data, or the intrinsic dimensionality. The directions corresponding to the
small eigenvalues are then interpreted as “noise”.
Warning! It might well be that a small reconstruction error can be made by using a small
number of dimensions. However, it could be that precisely the information required
to perform a classification task lies in the “noise” dimensions thrown away by the
above procedure (though this will hopefully be rather rare). The purpose of linear
discriminants is to try to deal with this problem.
Figure 4.3: (left) Four of the 892 images. (right) The mean of the 892 images
Figure 4.4: The eigenvalue spectrum of the covariance matrix of the digit data (the 100 largest eigenvalues, plotted against eigenvalue number).
Non-linear dimension reduction: Whilst it is straightforward to perform the above linear dimension reduction, bear in mind that we are presupposing that the data lies close to a hyperplane. Is this
really realistic? More generally, we would expect data to lie on low dimensional
curved manifolds. Also, data is often clustered – examples of handwritten ‘4’s look
similar to each other and form a cluster, separate from the ‘8’s cluster. Neverthe-
less, since linear dimension reduction is so straightforward, this is one of the most
powerful and ubiquitous techniques used in dimensionality reduction.
We have 892 examples of handwritten 5’s. Each is a 21*23 pixel image – that is,
each data point is a 483 dimensional vector. We plot 4 of these images in fig(4.3).
The mean of the data is also plotted and is, in a sense, an archetypal 5. The
covariance matrix has eigenvalue spectrum as plotted in fig(4.4), where we plot
only the 100 largest eigenvalues. The reconstructions using different numbers of
eigenvectors (10, 50 and 100) are plotted in fig(4.5). Note how using only a small
number of eigenvectors, the reconstruction more closely resembles the mean image.
X X^T E = E Λ    (4.1.5)
X^T X X^T E = X^T E Λ    (4.1.6)
X^T X Ẽ = Ẽ Λ    (4.1.7)
where we defined Ẽ = XT E. The last line above represents the eigenvector equation
for X^T X. This is a matrix of dimensions P × P – in the above example, a 500 × 500 matrix as opposed to a 10^6 × 10^6 matrix previously. We can then calculate the
eigenvectors Ẽ and eigenvalues Λ of this matrix more easily. Once found, we then
use
E = XẼΛ−1 (4.1.8)
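A sketch of this trick in MATLAB (illustrative; X is N × P, zero mean, with P much smaller than N, and non-zero eigenvalues are assumed):

% Illustrative sketch: eigenvectors of X*X' via the smaller matrix X'*X.
[Etilde,Lambda] = eig(X'*X);                        % the P x P eigenproblem (4.1.7)
E = X*Etilde*inv(Lambda);                           % equation (4.1.8)
E = E./repmat(sqrt(sum(E.^2,1)),size(E,1),1);       % rescale each column to unit length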
further use of the data – making a machine which can generalise from the lower
dimensional representations for example – has a chance. Hence, perhaps somewhat
perversely, PCA is a reasonable feature extraction method because it is such a poor
compressor!
B̃ = I − U U T
where S is the correlation matrix of the data. The objective, with the constraint included using a set of Lagrange multipliers M, can be written
−trace(S U U^T) + trace(M (U^T U − I))
Since the constraint is symmetric, we can assume that M is also symmetric. Differentiating with respect to U, we get
S U = U M
4.3 Problems
Exercise 2 Consider AA⁻¹ = I. By applying a differential operator ∂, show that ∂(A⁻¹) = −A⁻¹(∂A)A⁻¹. Assuming e^{log A} = A, show that ∂ log(A) = A⁻¹ ∂A.
Exercise 3 Consider a dataset in two dimensions where the data lies on the cir-
cumference of a circle of unit radius. What would be the effect of using PCA on
this dataset, in which we attempt to reduce the dimensionality to 1? Suggest an
alternative one dimensional representation of the data.
Exercise 4 Consider two vectors xa and xb and their corresponding PCA approximations c + Σ_{i=1}^{M} ai e^i and c + Σ_{i=1}^{M} bi e^i, where the eigenvectors e^i, i = 1, . . . , M are mutually orthogonal and have unit length. The eigenvector e^i has corresponding eigenvalue λi.
Approximate (xa − xb)² by using the PCA representations of the data, and show that this is equal to (a − b)².
Exercise 6 Let S be the covariance matrix of the data. The Mahalanobis distance between xa and xb is defined as
(xa − xb)^T S⁻¹ (xa − xb).
Explain how to approximate this distance using the M-dimensional PCA approximations, as described above.
Exercise 8 In a recent radio lecture, the following phrase was uttered by a famous
Professor of Experimental Psychology:
“In a recent data survey, 90% of people claim to have above average intelligence,
which is clearly nonsense!” [Audience Laughs]. Discuss the truth or falsity of this
statement, and justify your answer.
Exercise 9 (PCA with external inputs) In some applications, one may sus-
pect that certain variables have a strong influence on how the data x is distributed.
For example, we could have a set of variables vkµ ,k = 1, . . . K for each observation
xµ , µ = 1, . . . P , which we think will heavily influence each xµ . It may therefore
be sensible to include these variables, and assume an approximation
xµ ≈ Σ_j wµ_j b^j + Σ_k vµ_k c^k    (4.3.1)
4.4 Solutions
6 Using the approximations, we have
(xa − xb)^T S⁻¹ (xa − xb) ≈ ( Σ_i ai e^i − Σ_i bi e^i )^T S⁻¹ ( Σ_j aj e^j − Σ_j bj e^j )
Due to the orthonormality of the eigenvectors (and since S⁻¹ e^i = e^i/λi), this is Σ_i ( a_i²/λi − 2 a_i b_i/λi + b_i²/λi ) = (a − b)^T D⁻¹ (a − b), where D is a diagonal matrix containing the eigenvalues.
7 Even though 25000 is a very small number compared to 2^10000, the point is that digits are not simply random points in a 10000 dimensional space. There is a great
deal of regularity and constraint on the form that each digit can take, so that digits
will occupy only a very small fraction of the space of all possible images. Indeed,
humans are capable of learning digits based on only a small number of training
examples, and there is therefore every reason to be optimistic that a machine could
do the same.
If we wish to make a classifier that works well on a wide variety of people's handwriting, we need training data that is representative of a wide variety of styles.
Otherwise, the trained classifier may be appropriate for recognizing the handwrit-
ing of the line manager, but not necessarily anyone else.
The classification of the KNN method is based on finding the K nearest neighbours.
If none of the neighbours is very close, this will result in potentially inaccurate classification. A simple method is therefore to use an independent test set, and
set a threshold value. Measure the distance to the nearest neighbour for each
testpoint to be classified, and discard this point if it is greater than the threshold.
For the remaining undiscarded points, determine the classification. If this is not
99%, increase the threshold and repeat the procedure until a just sufficient value
of the threshold has been found.
8 Clearly false. A canny student will be able to give an example to demonstrate
this, which is surely the result of a highly non-symmetric distribution with many
(slightly) above average values and a few extremely low values. A simple demon-
stration is to assume that the average IQ is 100, and the minimum 0. If there are
only two possible scores, the above average score, and the below average score, then
one can easily show that 90 percent of people can indeed have an above average
IQ if the above average IQ score is less than 111.111.
9 To optimise equation (4.3.2), it is straightforward to show that we should first transform the data to be zero mean: Σ_µ xµ = 0 and Σ_µ vµ_k = 0, k = 1, . . . , K.
We may assume, without loss of generality, that the bj are orthonormal (since we
could rescale the wjµ if not). However, we cannot assume that the ck , k = 1, . . . , K
are orthonormal, since we cannot rescale the v. Similarly, we assume nothing, a
priori, regarding the relationship between the vectors bj and ck . Differentiating
equation (4.3.2) with respect to wµ gives (using the orthonormality constraint on the b^i)
wµ_i = (b^i)^T ( xµ − Σ_l vµ_l c^l )
The residual vector (difference between xµ and the linear reconstruction) is then
rµ = xµ − Σ_i (b^i)^T( xµ − Σ_l vµ_l c^l ) b^i − Σ_j vµ_j c^j
By defining B̃ ≡ I − Σ_i b^i (b^i)^T ≡ I − U U^T (using the notation of the previous section), the residual is
rµ = B̃ ( xµ − Σ_j vµ_j c^j )
Differentiating E = Σ_µ (rµ)^T rµ with respect to c^i, we get
Σ_µ vµ_i B̃ xµ = Σ_j Σ_µ vµ_j vµ_i B̃ c^j
Define
[Ṽ]ij = Σ_µ vµ_i vµ_j ,   [X̃]ij = Σ_µ vµ_i xµ_j ,   C = [c1, . . . , cK]
where dµ = Σ_j vµ_j c^j.
Hence, the optimal solution is given by taking the principal eigenvectors of S̃, with C set as above.
I believe this is a special case of Constrained Principal Components Analysis (CPCA)[2].
5 Linear Discriminant Analysis
We will be interested here in how we can exploit class label information to improve
the projection. That is, to make supervised projections of the data. We begin with
a discussion of a simpler method in which the projection occurs in an unsupervised
way without using class information.
Figure 5.1: Projection of data onto two dimensions, formed by the two principal
components of the data. The fives are plotted with an ’x’, and the threes as an
’o’. We can see that the classes are partially separated in this two dimensional
projection, but there is a good deal of class overlap.
Figure 5.2: Two linear projections of two data classes. The dotted lines represent
the distributions for the projection that maximises the difference between the pro-
jected means. The full curves are Fisher’s projection. Fisher’s projection clearly
provides a better one-dimensional measure by which we can separate the classes.
We restrict attention here to two classes of data – the generalisation to more classes
is given in the subsequent section (although only the algorithm is described). Also,
for simplicity, we will project the data down to one dimension. The algorithm in a
later section deals with the higher dimensional multiple class case, and is known
as canonical variates.
Gaussian Assumption
Assume that, for each class, we can model the data with a Gaussian; that is, p(x|class i) = N(mi ; Si), where mi and Si are the mean and covariance of the data in class i.
Figure 5.3: Projection of the Leptograpsus Crab data onto the canonical variate
directions. (a) Two dimensional projection. Note that the data is almost perfectly
separated by this simple projection method. (b) Increasing the dimension of the
projection to three does not significantly improve the separability of the projected
classes. This can be expected since the Eigenvalues are 7.6, 3.2 and 0.15. That is,
the third direction contributes very little to separating the classes.
Consider projecting the data onto a single direction w, yµ = wT xµ, where yµ1 denotes the projection of an input xµ that is in class one. Similarly, yµ2 is the projection of a datapoint that is in class 2. We want to find w such that,
in some sense, there is maximal separability of the classes in the one dimensional
projections. The aim of this is that a classifier can then be made based on where,
along this one dimension, a novel input lies. Because the projection is linear, the
projected distributions onto the one dimension are also Gaussian,
p(y1) = N(m1; σ1²),   p(y2) = N(m2; σ2²)    (5.2.3)
We take as our objective Fisher's criterion
F(w) = (m1 − m2)² / ( π1 σ1² + π2 σ2² )    (5.2.4)
where πi represents the fraction of the dataset in class i. What this does is try to maximise the separation of the projected means whilst at the same time penalising projections of large variance, see fig(5.2). Note that this objective function is invariant to linear rescaling of w, so that there is no need to include a restriction that wT w = 1. The optimal w is then given by
w ∝ Sw −1 (m2 − m1 ) (5.2.5)
where Sw = π1 S1 + π2 S2 .
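A sketch of the two class case in MATLAB (illustrative only; X1 and X2 are assumed to hold the datapoints of each class columnwise):

% Illustrative sketch of Fisher's linear discriminant for two classes.
m1 = mean(X1,2); S1 = cov(X1');
m2 = mean(X2,2); S2 = cov(X2');
pi1 = size(X1,2)/(size(X1,2)+size(X2,2)); pi2 = 1 - pi1;
Sw = pi1*S1 + pi2*S2;                   % within class covariance
w  = Sw\(m2 - m1);                      % optimal direction, equation (5.2.5)
y1 = w'*X1; y2 = w'*X2;                 % one dimensional projections of each class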
Figure 5.4: The projection of the two classes using canonical variates. Here we
project onto three dimensions. Note how the data is well separated in the projec-
tions, and indeed, is almost linearly separated using this projection.
We can apply the method of canonical variates as described above to project the
digit data onto a small number of dimensions (in the code below we project onto
three dimensions). We use here 600 examples of a three and 600 examples of a
five. Thus, overall, there are 1200 examples which lie in a 784 (28 × 28 pixels)
dimensional space. Since there are more datapoints than dimensions in the space,
the points cannot, a priori, be trivially separated by a linear decision boundary.
Note how the projection onto three dimensions enables the data to be separated almost perfectly, see fig(5.4). Canonical variates is a useful method for dimension
reduction for labelled data, preserving much more class relevant information in the
projection than PCA. We can use the lower dimensional representations to help
visualise the data and also for use in further processing such as building a classifier.
6 Linear Parameter Models
6.1 Introduction
Consider the data in fig(6.1), in which we plot the number of chirps per second
for crickets, versus the temperature in degrees Fahrenheit. A biologist believes
that there is a simple relation between the number of chirps and the temperature.
Modelling such a relation is a regression problem. The biologist decides to make
a straight line fit :
c = a + bt (6.1.1)
where she needs to determine the parameters a and b. How can she determine these
parameters based on the training data (cµ , tµ ), µ = 1, . . . , 15 ? For consistency with
our previous notations, let us use y rather than c, and x in place of t, so that our
model is y = a + bx. The sum squared training error is
E(a, b) = Σ_{µ=1}^{P} (yµ − a − b xµ)²    (6.1.2)
y = wT φ (6.1.8)
where w = (a, b)T and φ = (1, x)T . The training error then is
E(w) = Σ_{µ=1}^{P} (yµ − wT φµ)²    (6.1.9)
Figure 6.1: Data from crickets – the number of chirps per second, versus the
temperature in Fahrenheit.
Putting in the actual data, we get a = −0.3091, b = 0.2119. The fit is plotted in
fig(6.2). Although the solution is written in terms of the inverse matrix, we never
actually compute the inverse numerically; we use instead Gaussian elimination –
see the MATLAB code.
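The MATLAB code referred to is not reproduced in this copy; a minimal sketch of the fit would be along the following lines (x is assumed to hold the temperatures and y the chirp rates):

% Illustrative sketch of the straight line fit c = a + b*t.
Phi = [ones(length(x),1) x(:)];      % each row is phi' = (1, x)
w   = (Phi'*Phi)\(Phi'*y(:));        % solve the normal equations by Gaussian elimination
a   = w(1); b = w(2);                % intercept and slope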
Figure 6.2: Left: Straight line regression fit to the cricket data. Right: PCA fit to
the data. In regression we minimize the residuals – the fit represents the shortest
vertical distances. In PCA the fit minimizes the orthogonal projections to the line.
As we saw above, straight line regression fits to data are examples of linear parameter models. If we choose the coefficients of the vector φ to be non-linear functions of x, then
the mapping x → y will be non-linear. The phrase “linear” model here refers
to the fact that the model depends on its parameters in a linear way. This is an
extremely important point. Unfortunately, the terminology is a little confused in
places. These models are often referred to as “generalised linear models”. However,
sometimes people use this same phrase to refer to something completely different
– beware!
In the derivation above, there was nothing specific about the form of φ. Hence,
the solution in equation (6.2.5) holds in general. That is, you simply put in a
different φ vector if you wish to find a new solution. For example, consider the
case of fitting a cubic function y = w1 + w2 x + w3 x2 + w4 x3 to the data. In this
case, we would choose
φ = (1, x, x², x³)^T    (6.2.2)
The solution has the same form, except w is now a 4 dimensional vector
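The original listing has not survived in this copy; a sketch in the same spirit, with the basis functions isolated in the function phi_fn mentioned below, might be (the file and function name lpm_train is hypothetical):

function w = lpm_train(x,y)
% Illustrative sketch: fit a Linear Parameter Model y = w'*phi(x) by least squares.
Phi = zeros(length(phi_fn(x(1))),length(x));
for mu = 1:length(x)
    Phi(:,mu) = phi_fn(x(mu));       % design matrix, one column per datapoint
end
w = (Phi*Phi')\(Phi*y(:));           % solve the linear system (no explicit inverse)

function phi = phi_fn(x)
phi = [1; x; x^2; x^3];              % cubic basis, equation (6.2.2)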
The above MATLAB code implements LPM in general. All that needs to be
changed in the above code for a different model is the function phi_fn. Note that,
rather than using the inv function in MATLAB to solve the linear equations, it
is much better to use the slash operator \ – this implements Gaussian elimination to solve linear systems. This is both much faster and more accurate. As a rule, never invert matrices unless you need to – and you never need to if you only want to solve a linear system.
Choosing between different models: How would we decide if a straight line fit is preferable to a cubic polynomial fit? We saw in the previous chapter that a general way to address this problem is to
use some validation data to test how accurate each model predicts the validation
data. The more accurate model on the validation data would then be preferred.
It should be fairly clear from the above that all polynomial regression fits are simply special cases of LPMs. Also, the more terms there are in the polynomial, the more curved the fit to the data can be. One way to penalise overly complex models
is to use a penalty term
Eregtrain(w, λ) = Σ_{µ=1}^{P} (yµ − wT φµ)² + λ wT w    (6.2.4)
If we differentiate the regularised training error with respect to w for a given λ and set the derivative to zero, we find that the solution is
w = ( Σ_µ φµ (φµ)^T + λ I )⁻¹ Σ_µ yµ φµ    (6.2.5)
The mathematics follows similarly to before, and this is left as an exercise for the
interested reader.
6.2.4 Classification
One way to adapt the LPM model to classification is to use p(c = 1|x) = σ(wT φ(x)).
The logistic regression model simply used a special case in which the vector
φ(x) = x. However, there is nothing to stop us using this more general method.
The nice thing is that the decision boundary is then a non-linear function of x.
Clearly, instead of using the euclidean square distance as the error measure, we
now use the log-likelihood, exactly as in the chapter on logistic regression. Again,
however, the training to find w will not be so straightforward, since the objective
function is not quadratic. However, the surface remains well behaved so that find-
ing a solution is not numerically difficult. We leave it as an exercise for the reader
to work out the details.
Figure 6.5: Left: The validation error as a function of the basis function width.
Right: The predictions. The solid line is the correct underlying function sin(10x);
the dashed line is the best predictor based on the validation set. The dotted line
is the worst predictor based on the validation set.
This explosion in the apparent number of basis functions required is the
famous “curse of dimensionality”.
A possible solution is to make the basis functions very broad to cover more of the
high dimensional space. However, this will mean a lack of flexibility of the fitted
function. Another approach is to place basis functions centred on the training
input points that we have, and add some more basis functions randomly placed
close to the training inputs. The rationale behind this is that when we come to do
prediction, we will most likely see novel x that are close to the training points –
we do not need to make “accurate” predictions over all the space.
A further approach is to make the positions of the basis functions adaptive, allow-
ing them to be moved around in the space to minimise the error. This approach
motivates the neural network models.
The criticism of the curse of dimensionality is, in my humble opinion, rather weak,
and good results are often obtained by using basis functions which are dense around
the training inputs.
6.5 Summary
• Linear Parameter models are regression models that are linear in the parameters.
• They are very easy to train (no local minima).
• Criticised in high dimensional input spaces due to the curse of dimensionality.
• Judicious placement of the basis functions close to the training inputs is
a workaround for the curse of dimensionality. Otherwise we need to optimise
the placement of the basis functions – that is, use neural networks.
• Easily adapted to classification (though the training is now more difficult
and needs to be solved using optimisation).
7 Layered Neural Networks
A simple perceptron computes an output of the form y = g(Σ_j w_j x_j + µ)   (7.2.1)
where the weights w encode the mapping that this neuron performs. Graphically,
this is represented in fig(7.1). We can consider the case of several outputs as
follows:
y_i = g(Σ_j w_ij x_j + µ_i)
Figure 7.1: (Left) A simple perceptron. We use square boxes to emphasise the
deterministic nature of the network. (Right) We can use two perceptrons with
weights w1 and w2 to model a mapping (x1 , x2 , x3 , x4 , x5 ) → (y1 , y2 )
Figure 7.2: Linear separability: The data in (a) can be classified correctly using a
hyperplane classifier such as the simple perceptron, and the data is termed linearly
separable. This is not the case in (b) so that a simple perceptron cannot correctly
learn to classify this data without error.
Figure 7.3: A multilayer perceptron (MLP) with multiple hidden layers, modeling
the input output mapping x → y. This is a more powerful model than the single
hidden layer, simple perceptron. We used here boxes to denote the fact that the
nodes compute a deterministic function of their inputs.
For example, consider the case in which g(x) = Θ(x) – that is, the
output is a binary valued function (Θ(x) = 1 if x ≥ 0, Θ(x) = 0 if x < 0) . In this
case, we can use the perceptron for binary classification. With a single output we
can then classify an input x as belonging to one of two possible classes. Looking
at the perceptron, equation (7.2.1), we see that we will classify the input as being
in class 1 if Σ_j w_j x_j + µ ≥ 0, and as in the other class if Σ_j w_j x_j + µ < 0.
Mathematically speaking, the decision boundary then forms a hyperplane in the x
space, and which class we associate with a datapoint x depends on which side of
the hyperplane this datapoint lies, see fig(7.2).
Figure 7.4: Some commonly used transfer functions: σ(x), tanh(x) and the Gaussian exp(−0.5x²).
Transfer Functions
Each hidden node computes a non-linear function of a weighted linear sum of its
inputs. The specific non-linearity used is called the transfer function. In principle,
this can be any function, and different for each node. However, it is most common
to use an S-shaped (sigmoidal) function of the form σ(x) = 1/(1 + e^{−x}). This
particular choice is mathematically convenient since it has the nice derivative
property dσ(x)/dx = σ(x)(1 − σ(x)). Another popular choice is the sigmoidal
function tanh(x). Less “biological” transfer functions include the Gaussian, e^{−x²/2},
see fig(7.4). For example, in fig(7.3), we plot a simple single hidden layer function,
h_1 = σ(w_1^T x + b_1),  h_2 = σ(w_2^T x + b_2),  y = r(v^T h + b_3)   (7.3.1)
where the adaptable parameters are θ = {w1 , w2 , v, b1 , b2 , b3 }. Note that the out-
put function r(·) in the final layer is usually taken as the identity function r(x) = x
in the case of regression – for classification models, we use a sigmoidal function.
The biases, b1 , b2 are important in shifting the position of the “bend” in the sigmoid
function, and b3 shifts the bias in the output.
Generally, the more layers that there are in this process, the more complex becomes
the class of functions that such MLPs can model. One such example is given in
fig(7.3), in which the inputs are mapped by a non-linear function into the first layer
outputs. In turn, these are then fed into subsequent layers, effectively forming new
inputs for the layers below. However, it can be shown that, provided that there
are sufficiently many units, a single hidden layer MLP can model an arbitrarily
complex input-output regression function. This may not necessarily give rise to
the most efficient way to represent a function, but motivates why we concentrate
mainly on single hidden layer networks here.
There are a great number of software packages that automatically set up and train
the networks on provided data. However, following our general belief that our
predictions are only as good as our assumptions, if we really want to have some
faith in our model, we need to have some insight into what kinds of functions
neural networks can represent.
The central idea of neural networks is that each neuron computes some function
of a linear combination of its inputs:
h(x) = g(wT x + b) (7.3.2)
where g is the transfer function, usually taken to be some non-linear function.
Alternatively, we can write
h(x) = g(a(x)) (7.3.3)
where we define the activation a(x) = wT x + b. The parameters for the neuron are
the weight vector w and bias b. Each neuron in the network has its own weight
and bias, and in principle, its own transfer function. Consider a vector w⊥ defined
to be orthogonal to w, that is, wT w⊥ = 0. Then
a(x + w⊥) = (x + w⊥)^T w + b   (7.3.4)
= x^T w + b + w⊥^T w   (7.3.5)
= a(x)   (7.3.6)
where the last line follows since w⊥^T w = 0.
Since the output of the neuron is only a function of the activation a, this means
that any neuron has the same output along directions x which are orthogonal to
w. Such an effect is given in fig(7.6), where we see that the output of the neuron
does not change along directions perpendicular to w. This kind of effect is general,
and for any transfer function, we will always see a ridge type effect. This is why
a single neuron cannot achieve much on its own – essentially, there is only one
direction in which the function changes (I mean that unless you go in a direction
which has a contribution in the w direction, the function remains the same). If
the input is very high dimensional, we only see variation in one direction.
Combining Neurons In fig(7.7) we plot the output of a network of two neurons in a single hidden layer.
The ridges intersect to produce more complex functions than single neurons alone
can produce. Since we have now two neurons, the function will not change if we go
in a direction which is simultaneously orthogonal to both w1 and w2 . In this case,
x is only two dimensional, so there is no direction we can go along that will be
orthogonal to both neuron weights. However, if x were higher dimensional, this
would be possible. Hence, we now have variation along essentially two directions.
Figure 7.6: The output for a single neuron, w = (−2.5, 5)^T, b = 0. Left: The
network output using the transfer function exp(−0.5x²). Right: using the transfer
function σ(x). Note how the network output is the same along the direction
perpendicular (orthogonal) to w, namely w⊥ = λ(2, 1)^T.
Figure 7.7: The combined output for two neurons, w1 = (−5, 10)^T, b1 = 0, w2 =
(7, 5)^T, b2 = 0.5. The final output is linear, with weights v = (1, 1)^T and zero bias.
Left: The network output using the transfer function exp(−0.5x²). Right: using
the transfer function σ(x) – this is exactly the function in equation (7.3.1) with r
the identity function.
For regression, a suitable error function to minimise is the sum square error
E_train(θ) = Σ_µ (y^µ − f(x^µ, θ))²
where f(x^µ, θ) is the output of the network for input x^µ, given that the parameters
describing the network are θ. We can train this network by any standard (non-
linear) optimisation algorithm, such as conjugate gradient descent.
Classification A suitable choice of energy or error function to minimise for classification is the
negative log likelihood (if y^µ ∈ {0, 1})
E_train(θ) = − Σ_µ (y^µ log f^µ + (1 − y^µ) log(1 − f^µ))   (7.4.2)
where f µ = f (xµ , θ). In this case, we would need that the final output r(x) is
bounded between 0 and 1 in order that it represents a probability. The case of
more than two classes is handled in a similar way using the so-called soft-max
function (see Bishop's book for references).
Regularisation In principle, the problem of training neural networks is equivalent to the general
statistical problem of fitting models to data. One of the main problems when
fitting complex non-linear models to data is how to prevent “overfitting”, or, more
generally, how to select the model that not only fits the data, but also generalises
well to new data. We have already discussed this issue in some generality, and
found that one approach is to use a penalty term which encourages smoother
functions. In the case of MLPs, smoother functions can be encouraged if we
penalise large weight values. The reason for this is that the larger the weights wi
are, the more rapidly the function can change as x changes (since we could flip
from close to one saturated region of the sigmoid to the other with only a
small change in x).
A term which penalises large weights can be added to the training error. Differentiating the resulting regularised error with respect to a parameter θ_i gives
∂E/∂θ_i = 2 Σ_{µ=1}^P (f(x^µ, θ) − y^µ) ∂f(x^µ, θ)/∂θ_i + 2 Σ_k Σ_{j=1}^{dim(w_k)} w_{j,k} ∂w_{j,k}/∂θ_i   (7.4.6)
The final term is zero unless we are differentiating with respect to a parameter
that is included in the regularisation term. If θi is included in the regularisation
term, then the final term simply is 2θi . All that is required then is to calculate the
derivatives of f with respect to the parameters. This is a straightforward exercise
in calculus, and we leave it to the reader to show that, for example,
∂f(x^µ, θ)/∂v_1 = g(w_1^T x^µ + b_1)   (7.4.7)
and
∂f(x^µ, θ)/∂w_{1,2} = v_2 g′(w_2^T x^µ + b_2) x_1^µ   (7.4.8)
where g ′ (x) is the derivative of g(x). Example code for regression using a single
hidden layer is given below. It is straightforward to adapt this for classification.
This code is not fully vectorised for clarity, and also uses the scg.m function, part
of the NETLAB (see http://www.ncrg.aston.ac.uk) package which implements
many of the methods in these chapters.
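The example code referred to above is not reproduced in this excerpt. The sketch below captures the idea for a single hidden layer regression network; to keep it self-contained it uses plain batch gradient descent rather than the scg.m routine, and the data, network size and learning rate are hypothetical choices.

% Sketch: single hidden layer regression network trained by batch gradient descent
% (standing in for the scg.m based code described in the text).
x = linspace(0,1,20); y = sin(10*x);        % hypothetical training data
H = 5; eta = 0.05;                          % hidden units and learning rate
W = 0.5*randn(H,1); b = 0.5*randn(H,1);     % input-to-hidden weights and biases
v = 0.5*randn(H,1); c = 0;                  % hidden-to-output weights and bias
for it = 1:5000
    gW = zeros(H,1); gb = zeros(H,1); gv = zeros(H,1); gc = 0;
    for mu = 1:length(x)
        h = tanh(W*x(mu) + b);              % hidden unit outputs
        f = v'*h + c;                       % network output, r = identity
        d = 2*(f - y(mu));                  % derivative of the squared error
        gv = gv + d*h;          gc = gc + d;
        gW = gW + d*(v.*(1-h.^2))*x(mu);    % chain rule through tanh
        gb = gb + d*(v.*(1-h.^2));
    end
    W = W - eta*gW/length(x); b = b - eta*gb/length(x);
    v = v - eta*gv/length(x); c = c - eta*gc/length(x);
end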
In computing the gradient of the error function, naively it appears that we need of
the order of P W 2 operations (if W is the number of parameters in the model and
P is the number of training patterns), since computing the output of the network
involves roughly W summations for each of the P patterns, and the gradient is
a W -dimensional vector. The essence of the backpropagation procedure is that
the gradient can instead be computed in order P W operations. If the training
set is very large, standard computation of the gradient over all training patterns
is both time-consuming and sensitive to round-off errors. In that case, “on-line
learning”, with weight updates based on the gradient for individual patterns, offers
an alternative. Backpropagation is most useful in cases where there is more than
one hidden layer in the network. In this case, the gradient can be computed more
efficiently, saving time in finding the optimal parameters.
A problem with neural networks is that they are difficult to train. This is because
the surface of the error function E(θ) is very complicated and typically riddled
with local minima. No algorithm can guarantee to find the global optimum of
1 There is no need to penalise the biases, since they only really affect a translation of the
functions, and don’t affect how bumpy the functions are.
the error surface. Indeed, depending on the initial conditions that we use, the
parameters found by the optimisation routine will in general be different. How
are we to interpret these different solutions? Perhaps the simplest thing to do
is to see which of the solutions has the best error on an independent validation
set. Many algorithms have been proposed on how to combine the results of the
separate networks into a single answer and for computing error bars that indicate
the reliability of this answer. Imagine that we have used optimisation several
times, and found the different solutions θ i , i = 1, . . . , M . One simple approach
(for regression) is to combine the outputs of each of the trained models,
f̄(x) = (1/M) Σ_{i=1}^M f(x, θ_i)   (7.4.9)
This is also useful since we can make an estimate of the variance in the predictions
at a given point,
var(f(x)) = (1/M) Σ_{i=1}^M (f(x, θ_i) − f̄(x))²   (7.4.10)
This can then be used to form error bars f̄(x) ± √var(f(x)).
As previously discussed, because the output of the node only depends on a linear
combination of the inputs to the network node/neuron, essentially there is only
variability in one direction in the input space (where by input I mean the inputs
to the node). We can make a bump, but only a one dimensional bump, albeit in
a high dimensional space. To get variability in more than one direction, we need
to combine neurons together. Since it is quite reasonable to assume that we want
variability in many dimensions in the input space, particularly in regions close to
the training data, we typically want to make bumps near the data.
In the case of linear parametric models, we saw how we can approximate a func-
tion using a linear combination of fixed basis functions. Localised Radial Basis
Functions, exp(−(x − m)²), are a reasonable choice for the “bump” function type
approach. The output of this function depends on the distance between x and the
centre of the RBF m. Hence, in general, the value of the basis function will change
as x moves in any direction, apart from those that leave x the same distance from
m, see fig(7.8). Previously, we suggested that a good strategy for placing centres of
basis functions is to put one on each training point input vector. However, if there
are a great number of training patterns, this may not be feasible. Also, we may
wish to use the model for compression, and placing a basis function on each train-
ing point may not give a particularly high compression. Instead we could adapt
Figure 7.8: Left: The output of an RBF function exp(−½‖x − m_1‖²/α²). Here
m_1 = (0, 0.3)^T and α = 0.25. Right: The combined output for two RBFs, m_2 =
(0.5, −0.5)^T.
Figure 7.9: A RBF function using five basis functions. Note how the positions of
the basis function centres, given by the circles, are not uniform.
the positions of the centres of the basis functions, treating these also as adaptable
parameters. In general, an adaptive basis function network is of the form
y(x, θ) = Σ_i w_i φ_i(x, b_i)   (7.5.2)
where now each basis function φi (x, bi ) has potentially its own parameters that
can be adjusted. θ represents the set of all adjustable parameters. If the basis
functions are non-linear, then the overall model is a non-linear function of the
parameters.
However, one should always bear in mind that, in general, the training of complex
non-linear models with many parameters is extremely difficult.
Classification A suitable choice of energy or error function to minimise for classification is the
negative log likelihood (if y^µ ∈ {0, 1})
E_train(θ) = − Σ_µ (y^µ log f^µ + (1 − y^µ) log(1 − f^µ))   (7.6.2)
If we use basis functions that decay rapidly from a ‘centre’, as in the case exp(−(x−
m)2 ), the basis function value will always decay to zero once we are far away from
the training data. In the case of binary classification and a logistic sigmoid for
the class output, this may be reasonable since we would then predict any new
datapoint far away from the training data with a complete lack of confidence, and
any assignment would be essentially random. However, in regression, using say
a linear combination of basis function outputs would always give zero far from
the training data. This may give the erroneous impression that we are therefore
extremely confident that we should predict an output of zero far away from the
training data whilst, in reality, this is simply an artefact of our model. For this
reason, it is sometimes preferable to use basis functions that are non-local – that
is, they have appreciable value over all space, for example, (x − m)2 log((x − m)2 ).
Whilst any single output will tend to infinity away from the training data, this
serves to remind the user that, far from the training data, we should be wary of
our predictions.
7.7 Committees
Drawbacks of the non-linear approaches we have looked at – neural networks and
their cousins adaptive basis functions – are
1. Highly complex energy/error surfaces give rise to multiple solutions since
global optima are impossible to find.
2. We have no sense of the confidence in the predictions we make (particularly
in regression).
Whilst there are alternative (and in my view more attractive) approaches around
these problems, we can exploit the variability in the solutions found to produce
a measure of confidence in our predictions. The idea is to form a committee
of networks from the solutions found. For example, for regression, we could train
(say) M networks on the data and get M different parameter solutions θ1 , . . . , θM .
The average network function would then be
f̄(x) = (1/M) Σ_{i=1}^M f(x, θ_i).   (7.7.1)
A useful plot of confidence in our predictions is then to use one standard deviation
error bars:
f̄(x) ± √var(f(x))   (7.7.3)
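As a sketch, a committee and its error bars might be computed as follows; train_net and net_out are hypothetical helper routines (returning a trained parameter set and evaluating the corresponding network), x and y are assumed to hold the training data, and the test inputs are arbitrary.

% Sketch: committee prediction and one standard deviation error bars
% (equations 7.7.1 and 7.7.3). train_net and net_out are hypothetical helpers.
xs = linspace(-3, 3, 100);                  % test inputs
M = 6; preds = zeros(M, length(xs));
for i = 1:M
    theta = train_net(x, y);                % a solution from a random initialisation
    preds(i,:) = net_out(theta, xs);        % its predictions on the test inputs
end
fbar = mean(preds, 1);                      % committee prediction, equation (7.7.1)
fstd = std(preds, 1, 1);                    % square root of the variance over solutions
upper = fbar + fstd; lower = fbar - fstd;   % one standard deviation error bars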
Figure 7.10: Left: A single solution using Adaptive Basis Functions to fitting the
training data (crosses). The centres of the five basis functions are given by the
circles. Right: A committee prediction from six individual solutions of the form
given on the left. The central line is the average prediction – note how this still
decays to zero away from the training data. The lines around the central line are
one standard deviation confidence intervals.
Two features of artificial neural networks stand out as being of particular impor-
tance – their non-linearity, and stochasticity (although this latter aspect is not
always exploited in many applications). These properties can be used to define
local computation units which, when coupled together suitably, can combine to
produce extremely rich patterns of behaviour, whether these be dynamic, or static
input-output relationships. One of the most important consequences of neural net-
work research has been to bring the techniques and knowledge of artificial intelli-
gence and statistics much closer together. Typically it was the case that problems
in artificial intelligence were tackled from a formal specification aspect. On the
other hand, statistics makes very loose formal specifications, and lets the data try
to complete the model. Neural networks can be seen as a statistical approach
to addressing problems in artificial intelligence, obviating the need for formal
specifications of how the program works – just learn how to do it from looking
at examples. For example, rather than formally specifying what constitutes the
figure “2”, a neural network can learn the (statistical) structure of “2”s by being
asked to learn (find appropriate weights for) how to differentiate between “2”s
and non-”2”s. This idea is especially powerful in the many human computer in-
teraction applications where formally specifying, for example, what constitutes an
individual's facial characteristics that differentiate them from others, is extremely
difficult.
8 Autoencoders
8.1 Introduction
The general idea of autoencoders is that they are simply approximations of the
identity mapping. We do not really need to invoke the concepts of neural networks
to talk about these. However, many applications use neural networks to implement
autoencoders.
Dimension Reduction The major use of autoencoders is in dimension reduction, that is to replace a high
N -dimensional vector x with a lower M -dimensional vector y. Clearly, this only
makes sense when we have a set of data, xµ , µ = 1, . . . , P .
In linear dimension reduction (PCA), for example, the reduction and reconstruction are
y = E^T (x − m)   (8.1.1)
x̃ = m + E y   (8.1.2)
where m is the data mean and E is an N × M matrix whose columns are the leading eigenvectors of the data covariance matrix.
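A minimal MATLAB sketch of this linear reduction, with hypothetical data, is given below; E holds the M leading eigenvectors of the sample covariance and m the sample mean.

% Sketch: linear dimension reduction (equations 8.1.1 and 8.1.2) via PCA.
X = randn(100,5)*randn(5,5);                % hypothetical data, one datapoint per row
M = 2;                                      % reduced dimension
m = mean(X,1)';                             % data mean
[V,D] = eig(cov(X));                        % eigen-decomposition of the covariance
[dummy, order] = sort(-diag(D));            % order eigenvalues, largest first
E = V(:, order(1:M));                       % leading eigenvectors as columns
Y    = (X - repmat(m', size(X,1), 1))*E;    % y = E'(x - m) for each datapoint
Xrec = repmat(m', size(X,1), 1) + Y*E';     % reconstruction, x~ = m + E y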
Consider the situation in fig(8.2), where a “piece of paper” has been wrinkled to
form a three dimensional object. However, to describe exactly the position of any
point on the surface of the paper, we only need two co-ordinates, namely how
far to go along y1 and how far to go along y2 . Of course, the actual position
of the surface point is a three dimensional vector, but it is only a function of y1
and y2. Clearly, in this case, x is a non-linear function of y. A manifold is a
lower dimensional surface embedded in a higher dimensional space.
In this case the optimal hidden layer activations would be y1 = x1, y2 = sin⁻¹(x2) −
x1. Clearly, there are other possibilities available. Given y1 and y2, to make our
reconstruction, we use
x̃ = (y1, sin(y2), cos(y1))^T   (8.2.2)
If we use a neural network (by which we mean that the outputs of the hidden
units are non-linear functions of a weighted linear combination of the units inputs),
both the hidden unit transfer functions and output transfer functions need to (in
general) be non-linear. Note that the above would need more than one hidden layer
to be represented by an autoencoder. Graphically, we can represent a multiple
hidden layer neural network autoencoder as in fig(8.4). In principle, no restriction
on the form of the mappings from layer to layer need be made. However, it is
common to use non-linear perceptron like mappings from layer to layer, so that
the output of each node is a non-linear function of its linearly weighted inputs.
The standard approach to training autoencoders is to use the sum squared recon-
struction error. If θ are the parameters of the autoencoder, then the autoencoder
expresses a mapping f (x, θ). Since we want the output to resemble the input as
closely as possible, we form the error:
E(θ) = Σ_{µ=1}^P (x^µ − f(x^µ, θ))²   (8.2.3)
Figure 8.4: Autoencoder with multiple hidden layers. This is a more powerful
autoencoder than the single hidden layer case, provided that the hidden to
output layers encode a non-linear function.
Figure 8.5: A two dimensional data set (left) represented by a one dimensional
PCA (middle) and one dimensional autoencoder (right). Note that in the middle
and right plots, the y axis is irrelevant and simply used to aid visual separation of
the data.
Here we briefly describe the mathematics behind the solution of Classical Scaling.
A single element of the (squared) distance matrix can be written
T_ab = (x^a − x^b)^T (x^a − x^b) = E_aa − 2E_ab + E_bb,   where E = XX^T   (9.1.4)
If it were not for the terms E_aa and E_bb, life would be easy since, in that case,
we would have a known matrix, T, expressed as the outer product of an unknown
matrix, X, which would be easy to solve. What we need to do therefore is to express
the unknown matrix elements E_aa and E_bb in terms of the known matrix T. In
order to do this, we make the following extra assumption – the data has zero mean,
Σ_a x_i^a = 0. Clearly, this does not affect the solution since it is only defined up to
an arbitrary shift. In that case, Σ_a E_ab = Σ_{a,i} x_i^a x_i^b = 0. Hence,
Σ_a T_ab = Σ_a E_aa − 2 Σ_a E_ab + P E_bb   (9.1.6)
         = Σ_a E_aa + P E_bb   (9.1.7)
This means that we could express E_bb in terms of T, if only we knew what Σ_a E_aa
is. But this can also be obtained by now summing over b:
Σ_{ab} T_ab = P Σ_a E_aa + P Σ_b E_bb   (9.1.8)
            = 2P Σ_a E_aa   (9.1.9)
This means
P E_bb = Σ_a T_ab − Σ_a E_aa   (9.1.10)
       = Σ_a T_ab − (1/2P) Σ_{ab} T_ab   (9.1.11)
so that
T_ab = (1/P) Σ_a T_ab − (1/P²) Σ_{ab} T_ab + (1/P) Σ_b T_ab − 2 E_ab   (9.1.12)
(Figure: a two dimensional Classical Scaling arrangement of world cities, from Mexico City, Buenos Aires and Honolulu through to Cairo, Hong Kong and Calcutta.)
The right hand side contains the elements of a now known matrix, T′, for which we can
find an eigen-decomposition
T′ = V Λ V^T   (9.1.14)
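The whole procedure can be sketched in a few lines of MATLAB: double-centre the matrix of squared distances (the factor of −1/2 turns it into the Gram matrix E = XX^T of equation (9.1.4)), take the eigen-decomposition, and use the leading scaled eigenvectors as coordinates. The data here is hypothetical.

% Sketch of Classical Scaling from a matrix of squared distances.
X = randn(30,10);                           % hypothetical high dimensional data
P = size(X,1); T = zeros(P);
for a = 1:P
    for b = 1:P
        T(a,b) = sum((X(a,:) - X(b,:)).^2); % squared inter-point distances
    end
end
H = eye(P) - ones(P)/P;                     % centring matrix
Tprime = -0.5*H*T*H;                        % the now known, double-centred matrix
[V,L] = eig(Tprime);
[dummy, order] = sort(-diag(L));            % largest eigenvalues first
dims = 2;
Y = V(:,order(1:dims))*sqrt(L(order(1:dims),order(1:dims)));  % 2-D coordinates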
Then, given a set of target dissimilarities dij, we need to arrange the vectors y i
to minimize the (weighted) difference between the given dissimilarities and those
measured above. The parameters of the optimization are therefore the vectors y i
themselves.
Strictly speaking, the Sammon “Mapping” is not a mapping, since it does not yield
a function that describes how general points in one space are mapped to another
(it only describes how a limited set of points is related).
Making a Mapping
Given points xi in an n-dimensional space (possibly very high dimensional), we wish
to represent them by points yi in an m-dimensional space (possibly very low dimensional,
say 2) in such a way that the separations between the points in the two spaces are
roughly similar. One way to obtain this mapping is to parameterize the positions
of the objects in the lower dimensional space
y = f (x; W) (9.2.3)
The distance then between two mapped points is a function of the parameters of
the mapping W. The optimal parameters can then be found by optimization. The
method Neuroscale is one such procedure.
10 Introducing Graphical Models
10.1.1 Tracey
Tracey lives in sunny Birmingham. One morning she leaves her house and realizes
that her grass is wet. Is it due to rain, or has she forgotten to turn off the sprinkler?
Next she notices that the grass of her neighbour, Jack, is also wet. She concludes
therefore that it has probably been raining, and that “explains away” to some extent
the possibility that her sprinkler was left on.
Making a model
We can model the above situation using probability by following a general mod-
elling approach.
First we define what variables we wish to include in our model. In the above
situation, the natural variables are
R ∈ {0, 1} (R = 1 means that it has been raining, and 0 otherwise).
S ∈ {0, 1} (S = 1 means that she has forgotten to turn off the sprinkler, and 0 otherwise).
T ∈ {0, 1} (T = 1 means that Tracey's grass is wet, and 0 otherwise).
J ∈ {0, 1} (J = 1 means that Jack's grass is wet, and 0 otherwise).
To see how many states need to be specified in general, consider the following
decomposition. Without loss of generality (WLOG) and repeatedly using Bayes’
rule, we may write1 :
p(T, J, R, S) = p(T |J, R, S)p(J, R, S) = p(T |J, R, S)p(J|R, S)p(R, S)
= p(T |J, R, S)p(J|R, S)p(R|S)p(S)
That is, we may write the joint distribution as a product of conditional distri-
butions. The first term p(T|J, R, S) requires us to specify 2³ = 8 values, say
for p(T = 1|J, R, S) given the 8 possible states jointly of J, R, S. The other value
p(T = 0|J, R, S) is given by normalisation : p(T = 0|J, R, S) = 1−p(T = 1|J, R, S).
Similarly, we need 4 + 2 + 1 values for the other factors, making a total of 15 values
in all. In general, for a set of n binary variables, we need to specify 2n − 1 values
in the range [0, 1]. The important point here is that the number of values that
need to be specified in general scales exponentially with the number of variables
in the model – this is extremely bad news, and motivates simplifications.
Conditional Independence
The modeller often knows that certain simplifications occur. Indeed, it is
arguably the central role of modelling to make the simplest model that fits with
the modeller's beliefs about an environment. For example, in the scenario above,
whether or not Tracey's grass is wet depends directly only on whether or not it has
been raining and whether or not her sprinkler was on. That is, we make the conditional
independence assumption for this model that
p(T |J, R, S) = p(T |R, S)
Similarly, since whether or not Jack’s grass is wet is influenced only directly by
whether or not it has been raining, we write
p(J|R, S) = p(J|R)
and since the rain is not directly influenced by the sprinkler!
p(R|S) = p(R)
which means that our model now becomes :
p(T, J, R, S) = p(T |R, S)p(J|R)p(R)p(S)
We can represent these conditional independencies graphically, as in fig(10.1).
This reduces the number of values that we need to specify to 4 + 2 + 1 + 1 = 8, a big
saving over the previous 15 values in the case where no conditional independencies
had been assumed.
The heart of modelling is in judging which variables are dependent on each other.
Specifying the values To complete the model, we need to numerically specify the values of the conditional
1 Note that a probability distribution simply assigns a value between 0 and 1 for each of the states
jointly of the variables. For this reason, p(T, J, R, S) is considered equivalent to p(J, S, R, T )
(or any such reordering of the variables), since in each case the joint setting of the variables
is simply an index to the same probability. This situation is more clear in the set theoretic
notation p(J ∩ S ∩ T ∩ R). We abbreviate this set theoretic notation by using the commas
– however, one should be careful not to confuse the use of this indexing type notation with
functions f (x, y) which are in general dependent on the variable order. Whilst the variables
to the left of the conditioning bar may be written in any order, and equally those to the right
of the conditioning bar may be written in any order, moving variables across the bar is not
allowed, so that p(x1|x2) ≠ p(x2|x1).
Figure 10.1: Belief network structure for the “wet grass” example. Each node in
the graph represents a variable in the joint distribution, and the variables which
feed in (the parents) to another variable represent which variables are to the right
of the conditioning bar.
probability tables (CPTs). Let the prior probabilities for R and S be p(R) =
(0.2, 0.8) (that is, p(rain = yes) = 0.2 and p(rain = no) = 0.8) and p(S) =
(0.1, 0.9). Note, for clarity I use here for example p(R = y) instead of p(R = 1) –
of course, the labels we use for the states are irrelevant. Let's set the remaining
probabilities to
p(J = y|R = y) = 1, p(J = y|R = n) = 0.2 (sometimes Jack leaves his own
sprinkler on too),
p(T = y|R = y, S = y) = 1, p(T = y|R = n, S = y) = 0.9,
p(T = y|R = y, S = n) = 1, p(T = y|R = n, S = n) = 0.
Inference
Now that we’ve made a model of an environment, we can perform inference. Let’s
calculate the probability that the sprinkler was on overnight, given that Tracey’s
grass is wet: p(S = y|T = y).
To do this, we use Bayes rule:
p(S = y|T = y) = p(S = y, T = y) / p(T = y)
= Σ_{J,R} p(T = y, J, R, S = y) / Σ_{J,R,S} p(T = y, J, R, S)
= Σ_{J,R} p(J|R) p(T = y|R, S = y) p(R) p(S = y) / Σ_{J,R,S} p(J|R) p(T = y|R, S) p(R) p(S)
= Σ_R p(T = y|R, S = y) p(R) p(S = y) / Σ_{R,S} p(T = y|R, S) p(R) p(S)
= (0.9 × 0.8 × 0.1 + 1 × 0.2 × 0.1) / (0.9 × 0.8 × 0.1 + 1 × 0.2 × 0.1 + 0 × 0.8 × 0.9 + 1 × 0.2 × 0.9)
= 0.092 / 0.272 = 0.3382
so that the belief that the sprinkler is on increases above the prior probability 0.1,
due to the fact that the grass is wet.
Let us now calculate the probability that Tracey’s sprinkler was on overnight, given
that her grass is wet and that Jack’s grass is also wet, p(S = y|T = y, J = y). We
use Bayes rule again:
p(S = y|T = y, J = y) = p(S = y, T = y, J = y) / p(T = y, J = y)
= Σ_R p(T = y, J = y, R, S = y) / Σ_{R,S} p(T = y, J = y, R, S)
= Σ_R p(J = y|R) p(T = y|R, S = y) p(R) p(S = y) / Σ_{R,S} p(J = y|R) p(T = y|R, S) p(R) p(S)
= 0.0344 / 0.2144 = 0.1604
What this shows is that the probability that the sprinkler is on, given the extra
evidence that Jack's grass is wet, is lower than the probability that the sprinkler is
on given only that Tracey's grass is wet. That is, that the grass is wet due to
the sprinkler is (partly) explained away by the fact that Jack's grass is also wet
– this increases the chance that the rain has played a factor in making Tracey's
grass wet.
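The two results above can be checked by brute force enumeration, as in the following sketch. The p(T|R, S) values used are those implied by the numbers in the calculation above.

% Sketch: brute force inference in the wet grass network. Indices: 1 = yes, 2 = no.
pR = [0.2 0.8]; pS = [0.1 0.9];
pJy = [1 0.2];                   % p(J = y|R), for R = y and R = n
pTy = [1 1; 0.9 0];              % p(T = y|R,S): rows R = y/n, columns S = y/n
num1 = 0; den1 = 0; num2 = 0; den2 = 0;
for R = 1:2
    for S = 1:2
        p = pTy(R,S)*pR(R)*pS(S);            % p(T = y, R, S), with J summed out
        den1 = den1 + p; num1 = num1 + p*(S==1);
        q = p*pJy(R);                        % additionally require J = y
        den2 = den2 + q; num2 = num2 + q*(S==1);
    end
end
pS_given_T  = num1/den1                      % = 0.3382
pS_given_TJ = num2/den2                      % = 0.1604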
p(B) B E p(E)
p(A|B,E) A R p(R|E)
Figure 10.2: Belief Network for the Burglar model. Here, for pedagological pur-
poses only, we have explicitly written down which terms in the distribution each
node in the graph represents.
However, the alarm is surely not directly influenced by any report on the Radio
– that is, p(A|B, E, R) = p(A|B, E). Similarly, we can make other conditional
independence assumptions, so that the joint distribution factorises as
p(B, E, A, R) = p(A|B, E) p(R|E) p(E) p(B), as depicted in fig(10.2).
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables
and graphical structure fully specify the distribution.
Explaining Away Now consider what happens as we observe evidence.
Initial Evidence: The Alarm is sounding:
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E) / Σ_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E) ≈ 0.99
Figure 10.3: Two Belief networks for a 4 variable distribution. In this case, both
graphs are representations of the same distribution p(x1 , x2 , x3 , x4 ). The extension
of this ‘cascade’ to many variables is obvious, and always results in an acyclic graph.
Of course, if one wishes to make independence assumptions, then the initial choice
becomes significant. However, one should bear in mind that, in general, two dif-
ferent graphs may represent the same distribution.
Indeed, the observation that any distribution may be written in the cascade form
fig(10.3) gives an algorithm for constructing a belief network on variables x1 , . . . , xn
: write down the n−variable cascade graph; assign any ordering of the variables
to the nodes; you may then delete any of the directed connections.
Variable Order To ensure maximum sparsity, add “root causes” first, then the variables they
influence, and so on, until the leaves are reached. Leaves have no direct causal2
influence over the other variables.
Conditional Probability Once the graphical structure is defined, the actual values of the tables p(xi |pa (xi ))
Tables (CPTs) need to be defined. That is, for every possible state of the parental variables pa (xi ),
a value for each of the states (except one, since this is determined by normalisation)
needs to be specified. For a large number of parents, writing out a table of values
is intractable, and the tables are usually parameterised in some simple way. More
on this later.
2 ‘Causal’ is a tricky word since here there is no temporal ‘before’ and ’after’, merely correlations
or dependencies. For a distribution p(a, b), we could write this as either p(a|b)p(b) or p(b|a)p(a).
In the first, we might think that b ‘causes’ a, and in the second case, a ‘causes’ b. Clearly,
this is not very meaningful since they both represent exactly the same distribution, and any
apparent causation is merely spurious. Nevertheless, in constructing belief networks, it can be
helpful to think about dependencies in terms of causation since our intuitive understanding is
that often one variable ‘influences’ another. This is discussed much more deeply in [10], where
a true calculus of causality is developed.
Consider the three variable distribution p(x1, x2, x3). We may write this in any of
the 6 ways p(x_{i1}|x_{i2}, x_{i3}) p(x_{i2}|x_{i3}) p(x_{i3}), where (i1, i2, i3) is any of the 6 permu-
tations of (1, 2, 3). Hence, whilst all graphically different, they all represent the
same distribution which does not make any conditional independence statements.
To make an independence statement, we need to drop one of the links. This gives
rise in general to 4 graphs in fig(10.4). Are any of these graphs equivalent, in the
Figure 10.4: The four graphs (a), (b), (c), (d) on the variables x1, x2, x3 obtained by dropping one of the links.
sense that they represent the same distribution? A simple application of Bayes’
rule gives:
p(x2|x3) p(x3|x1) p(x1)   [graph (c)]
= p(x2, x3) p(x3, x1)/p(x3) = p(x1|x3) p(x2, x3)
= p(x1|x3) p(x3|x2) p(x2)   [graph (d)]
= p(x1|x3) p(x2|x3) p(x3)   [graph (b)]
and hence graphs (b),(c) and (d) represent the same distribution. However, graph
(a) represents something fundamentally different: there is no way to transform the
distribution p(x3 |x1 , x2 )p(x1 )p(x2 ) into any of the others.
Graphs (b),(c) and (d) all represent the same conditional independence assumption
that, given the state of variable x3 , variables x1 and x2 are independent. We write
this as I(x1 , x2 |x3 ).
Graph (a) represents something different, namely marginal independence : p(x1 , x2 ) =
p(x1 )p(x2 ). Here we have marginalised over the variable x3 .
10.4.2 Intuition
In a general Belief Network, with many nodes, how could we check if two variables x
and y are independent, once conditioned on another variable z? In fig(10.5)(a,b),
it is clear that x and y are independent when conditioned on z. It is clear in
collider fig(10.5)(c) that they are dependent. In this situation, variable z is called a collider
– the arrows of its neighbours are pointing towards it. What about fig(10.5)(d)?
In (d), when we condition on z, then, in general, x and y will be dependent, since
Σ_w p(z|w) p(w|x, y) p(x) p(y) ≠ p(x|z) p(y|z)
(a) (b) (c) (d)
Figure 10.5: In graphs (a) and (b), variable z is not a collider. (c) Variable z is
a collider. Graphs (a) and (b) represent conditional independence I(x, y|z). In
graphs (c) and (d), x and y are conditionally dependent given variable z.
Figure 10.6: The variable d is a collider along the path a − b − d − c, but not along
the path a − b − d − e.
10.4.3 d-Separation
Figure 10.7: Examples for d-separation – Is I(a, e|b)? Left: If we sum out variable
d, then we see that a and e are independent given b, since the variable e will ap-
pear as an isolated factor independent of all other variables, hence indeed I(a, e|b).
Whilst b is a collider which is in the conditioning set, we need all colliders on
the path to be in the conditioning set (or their descendants) for d-connectedness.
Right: Here, if we sum out variable d, then variables c and e become intrinsically
linked, and the distribution p(a, b, c, e) will not factorise into a function of a
multiplied by a function of e – hence they are dependent.
(Figure: a belief network on the variables B, G, F, T, S, used in the d-separation examples below.)
Are the variables T and F unconditionally independent, i.e. I(T, F |∅)? Remember
that the key point are the colliders along the path between the two variables. Here
there are two colliders, namely G and S – however, these are not in the condi-
tioning set (which is empty), and hence they are d-separated, and unconditionally
independent.
What about I(T, F |G)? Well, now there is a collider on the path between T and
F which is in the conditioning set. Hence T and F are d-connected conditioned
on G, and therefore T and F are not independent conditioned on G.
Note that this may seem a bit strange – initially, when there was no conditioning,
T and F were independent. However, conditioning on G makes them dependent.
An even simpler example would be the graph A → B ← C. Here A and C are
unconditionally independent. However, conditioning on B makes them dependent.
Intuitively, whilst we believe the root causes are independent, given the value of
the observation, this tells us something about the state of both the causes, coupling
them and making them dependent.
What about I(B, F |S)? Since there is a collider on the path between T and F
which is in the conditioning set, namely S, B and F are conditionally dependent
given S.
Deterministic Dependencies
More formally, Belief networks are directed acyclic graphs (DAGs), in which the
nodes in the graph represent random variables in a probability distribution. To
each variable A with parents B1 . . . Bn , there is an associated probability table
p (A|B1 . . . Bn )
If A has no parents then the table reduces to unconditional probabilities p(A).
Chain Rule Let BN be a Bayesian network over
U = {A1 , . . . An }
Then the joint probability distribution p(U ) is the product of all conditional prob-
abilities specified by the BN:
p(U) = Π_i p(A_i | pa(A_i))
An undirected graph
Consider a model in which our desire is that states of the binary valued vari-
ables x1 , . . . , x9 , arranged on a lattice (as below) should prefer their neighbouring
variables to be in the same state
x1 x2 x3
x4 x5 x6
x7 x8 x9
p(x1, . . . , x9) = (1/Z) Π_{<ij>} φ_ij(x_i, x_j)
where < ij > denotes the set of indices where i and j are neighbours in the
undirected graph. Then a set of potentials that would encourage neighbouring
variables to have the same state would be
φ_ij(x_i, x_j) = e^{−(1/T)(x_i − x_j)²}
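For the 3 × 3 lattice there are only 2⁹ = 512 joint states, so the normalisation constant Z and the full distribution can be computed by brute force, as the following sketch illustrates (the temperature T = 1 is an arbitrary choice).

% Sketch: the 3 x 3 lattice MRF with potentials exp(-(xi - xj)^2 / T).
T = 1;
pairs = [1 2;2 3;4 5;5 6;7 8;8 9;1 4;4 7;2 5;5 8;3 6;6 9];  % neighbouring pairs <ij>
pstar = zeros(1, 2^9);
for state = 0:2^9-1
    x = bitget(state, 1:9);                 % binary states of x1,...,x9
    phi = 1;
    for k = 1:size(pairs,1)
        d = x(pairs(k,1)) - x(pairs(k,2));
        phi = phi*exp(-d^2/T);              % product of pair potentials
    end
    pstar(state+1) = phi;                   % unnormalised probability
end
Z = sum(pstar);                             % partition function
p = pstar/Z;                                % normalised joint distribution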
Imagine that we define a set of local distributions p(xi |pa(xi )) > 0. When indeed
will this define a consistent joint distribution p(x1 , . . . , xn )? The Hammersley
Clifford Theorem states that the MRF defines a consistent joint distribution if
and only if p(x1 , . . . , xn ) is a so-called Gibbs distribution
p(x1, . . . , xn) = (1/Z) exp(− Σ_c V_c(x_c))
where the sum is over all cliques (maximal complete subgraphs), c and Vc (xc ) is a
real function defined over the variables in the clique c. The graph over which the
cliques are defined is an undirected graph with a link between all parents pa(xi )
and a link between xi and each parent pa(xi ), repeated over all the variables xi .
Figure 10.9: Left: An undirected model. Middle: Every DAG with the same
structure as the undirected model must have a situation where two arrows will
point to a node, such as node D. Summing over the states of variable D in this DAG
will leave a DAG on the variables A, B, C with no link between B and C – which
cannot represent (in general) the undirected model, since marginalising over D in
the undirected model adds a link between B and C.
Besag originally gave a nice proof of this, which requires the positivity constraint [12];
a counter example shows that positivity is necessary [13].
Note : It’s easy to go from the Gibbs distribution to find the local conditional
distributions. The other way round is not necessarily so easy, since we would have
to know the so-called partition function Z. This is reminiscent of Gibbs sampling
(see the appendix on sampling) : effectively, one can easily define a sampler (based
on a Gibbs distribution), but it does not mean that we know the joint distribution,
ie the partition function (normalisation constant) Z.
It’s clear that every Belief Network can be represented as an undirected graphical
model, by simple identification of the factors in the distributions.
Can every undirected model be represented by a Belief Network with the same link
structure? Consider the example in fig(10.9) (from Zoubin Ghahramani)
As a final note, of course, every probability distribution can be represented by
some Belief Network, though it may not necessarily have any obvious structure
and be simply a “fully connected” cascade style graph.
Discussion
Graphical models have become a popular framework for probabilistic models in
artificial intelligence and statistics. One of the reasons for this is that the graphical
depiction of the model contains no information about the content of the conditional
probability tables. This is advantageous in that algorithms can be formulated for
a graphical structure, independent of the details of the parameterisation of the
local tables in the model. However, despite the elegance of such an approach, the
issue of tractability can be heavily dependent on the form of the local probability
tables. For example, for Gaussian tables all marginals are tractable although, in
general, marginalising high dimensional distributions is highly non-trivial.
10.6 Problems
(Thanks to Chris Williams for some of these questions)
Exercise 12 (Elvis’ twin) Approximately 1/125 of all births are fraternal twins,
and 1/300 births are identical twins. Elvis Presley had a twin brother (who died
at birth). What is the probability that Elvis was an identical twin? You may
approximate the probability of a boy or girl birth as 1/2. (Biological information:
identical twins must be either both boys or both girls, as they are derived from one
egg.)
Exercise 14 (The Three Prisoners problem) (From Pearl, 1988) Three pris-
oners A, B and C are being tried for murder, and their verdicts will be read and
their sentences executed tomorrow. They know only that one of them will be de-
clared guilty and will be hanged while the other two will go free; the identity of the
condemned prisoner is revealed to a reliable prison guard, but not to the prisoners.
In the middle of the night Prisoner A makes the following request. “Please give
this letter to one of my friends – to one who is to be released. You and I know
that at least one of them will be released.”. The guard carries out this request.
Later prisoner A calls the guard and asks him to whom he gave the letter. The
guard tells him that he gave the letter to prisoner B. What is the probability that
prisoner A will be released?
Exercise 15 (The Monte Hall problem) I have three boxes. In one I put a
prize, and two are empty. I then mix up the boxes. You want to pick the box with
the prize in it. You choose one box. I then open another one of the boxes and show
that it is empty. I then give you the chance to change your choice of boxes—should
you do so? How is this puzzle related to the Three Prisoners problem?
1. What is the probability that a regular hamburger eater will have Kreuzfeld-
Jacob disease?
2. If the case had been that the number of people eating hamburgers was rather
small, say p(HamburgerEater) = 0.001, what is the probability that a reg-
ular hamburger eater will have Kreuzfeld-Jacob disease? Comment on the
difference with the result in the previous part of the question.
Exercise 19 Inspector Clouseau arrives at the scene of a crime. The victim lies
dead in the room, and the inspector quickly finds the murder weapon, a knife. The
Butler (B) and Maid (M) are his main suspects. The inspector has a prior be-
lief of 0.8 that the Butler is the murderer, and a prior belief of 0.2 that the Maid
is the murderer. These probabilities are independent in the sense that p(B, M ) =
p(B)p(M ). (It is possible that both the Butler and the Maid could be the murderer).
Exercise 20 The belief network shown below is the famous “Asia” example of
Lauritzen and Speigelhalter (1988). It concerns the diagnosis of lung disease (tu-
berculosis, lung cancer, or both, or neither). In this model a visit to Asia is assumed
(Figure: the “Asia” belief network, with nodes: visit to Asia?, smoking?, tuberculosis?, lung cancer?, bronchitis?, tuberculosis or lung cancer?, positive X-ray?, shortness of breath?)
p1 (a, b, c) = p(a|b)p(b|c)p(c)
where all variables are binary. How many parameters are needed to specify distri-
butions of this form?
Now consider an undirected distribution on the same set of variables,
11.1 Inference
Calculating conditional marginals, as in the Wet Grass example seen previously,
is a form of inference. Although in simple graphs, such as the Wet Grass DAG, it
is straightforward to carry out the calculations to calculate marginals by hand, in
general, this problem can be computationally non-trivial. Fortunately, for singly-
connected graphs (poly-trees) there exist efficient algorithms for inference, and it
is instructive to understand how these algorithms work.
In this chapter we will consider two main algorithms, based on simple ideas. The
first, variable elimination, works on general multiply-connected distributions (albeit
not necessarily efficiently) and is particularly appropriate for answering single
queries.
The second algorithm we consider is Pearl’s Belief Propagation[5], which works
only for singly-connected graphs, yet has the advantage that it can answer multiple
queries efficiently.
These two classes of algorithms are useful as a precursor to developing algorithms
that run, essentially as efficiently as can be reasonably made, on any graphical
model (see the Junction Tree Algorithm chapter) and which are efficient in dealing
with answering multiple queries.
Consider the chain distribution on the variables A, B, C, D,
p(a, b, c, d) = p(a|b) p(b|c) p(c|d) p(d)
and imagine that our inference task is to calculate the marginal distribution p(a).
Also, for simplicity, let’s assume that each of the variables can take one of two
states ∈ {0, 1}. Then
p(a = 0) = Σ_{b,c,d∈{0,1}} p(a = 0, b, c, d)   (11.2.1)
         = Σ_{b,c,d∈{0,1}} p(a = 0|b) p(b|c) p(c|d) p(d)   (11.2.2)
It’s clear that we could carry out this computation by simply enumerating each of
the probabilities for the 2 ∗ 2 ∗ 2 = 8 states of the variables b, c and d. However,
in a more general chain of length T, this would imply that we would need a number
of summations that grows exponentially with the chain length. A better idea is to
push the summation over d as far to the right as possible,
p(a = 0) = Σ_{b∈{0,1}} p(a = 0|b) Σ_{c∈{0,1}} p(b|c) Σ_{d∈{0,1}} p(c|d) p(d) = Σ_{b,c} p(a = 0|b) p(b|c) f_d(c)
where f_d(c) is a (two state) function. Similarly, we can distribute the summation
over c as far to the right as possible:
p(a = 0) = Σ_{b∈{0,1}} p(a = 0|b) Σ_{c∈{0,1}} p(b|c) f_d(c)
where the inner sum defines f_c(b). Then, finally,
p(a = 0) = Σ_{b∈{0,1}} p(a = 0|b) f_c(b)
In matrix notation, defining [M_AB]_{i,j} = p(a = i|b = j), [M_BC]_{i,j} = p(b = i|c = j),
[M_CD]_{i,j} = p(c = i|d = j), [M_D]_i = p(d = i) and [M_A]_i = p(a = i), the above
procedure is simply the sequence of matrix-vector products M_A = M_AB (M_BC (M_CD M_D)).
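In MATLAB the whole computation is just a sequence of matrix-vector products; the conditional probability tables in the sketch below are hypothetical (each column sums to one).

% Sketch: marginal p(a) of the chain by pushing summations to the right.
MAB = [0.3 0.6; 0.7 0.4];        % [MAB](i,j) = p(a = i|b = j)  (hypothetical values)
MBC = [0.2 0.5; 0.8 0.5];        % p(b = i|c = j)
MCD = [0.9 0.1; 0.1 0.9];        % p(c = i|d = j)
MD  = [0.4; 0.6];                % p(d = i)
fd = MCD*MD;                     % fd(c) = sum_d p(c|d) p(d)
fc = MBC*fd;                     % fc(b) = sum_c p(b|c) fd(c)
MA = MAB*fc;                     % p(a)  = sum_b p(a|b) fc(b)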
If we had somehow been rather myopic and not realised that the distribution was
a chain and, instead, had placed the summations non-optimally, we may still have
ended up with an exponentially large amount of computation in a long chain – as
in the case where we do not push variables at all, which as we saw above, may
result in extreme inefficiency.
Figure 11.1: A simple Polytree. Inference is the problem of calculating the conse-
quences of (possible) evidence injection on the individual nodes.
Figure 11.2: The bucket elimination algorithm applied to the graph fig(11.1). At each stage, at least one
node is eliminated from the graph.
For simplicity, we will consider calculating only marginals, since the generalisation
to conditional marginals is straightforward. Consider the problem of calculating
the marginal p(f ).
p(f) = Σ_{a,b,c,d,e,g} p(a, b, c, d, e, f, g) = Σ_{a,b,c,d,e,g} p(f|d) p(g|d, e) p(c|a) p(d|a, b) p(a) p(b) p(e)
We can distribute the summation over the various terms as follows: e,b and c are
Again, this defines new functions γA (d), γG (d), so that the final answer can be
found from
p(f) = Σ_d p(f|d) γ_A(d) γ_G(d)
where we have defined messages λ_{n1→n2}(n2) sending information from node n1 to
node n2 as a function of the state of node n2. It is intuitively clear that we can in
general define a recursion for these messages,
λ_{a→b}(b) = Σ_a Ψ(a, b) Π_{c∈N(a)\b} λ_{c→a}(a)
where N(a) is the set of neighbouring nodes to a. Iterating these equations results
Figure 11.4: Undirected BP: calculating a message λ_{a→b}(b) = Σ_a Ψ(a, b) λ_{c→a}(a) λ_{d→a}(a) λ_{e→a}(a).
in a convergent scheme for non-loopy graphs. The marginal is then found from
p(d) ∝ Π_{c∈N(d)} λ_{c→d}(d), the prefactor being determined from normalisation. In
contrast to directed belief propagation (described in the following section) whose
complexity scales exponentially with the number of parents of a node, the complex-
ity of calculating a message in undirected belief propagation scales only linearly
with the number of neighbours of the node.
Intuition into the derivation of the directed belief propagation algorithm can also
be gained by considering marginalisation on a simple graph, such as depicted in
fig(11.5).
Figure 11.5: Intuition for DBP can be gleaned from consideration of simple graphs.
The marginal p(d) can be calculated from information passing from its parents
a, b, c and children e, f, g.
Consider calculating the marginal p (d). This involves summing the joint distrib-
ution over the remaining variables a, b, c, e, f, g. Due to the partial factorisation
that the directed structure encodes, we can distribute this summation as follows
p(d) = Σ_{a,b,c} p(d|a, b, c) p(a) p(b) p(c) Σ_e p(e|d) Σ_f p(f|d) Σ_g p(g|d)   (11.4.2)
Here the parental terms define the messages ρ_{a→d}(a) = p(a), ρ_{b→d}(b) = p(b), ρ_{c→d}(c) = p(c),
and the child terms define λ_{e→d}(d) = Σ_e p(e|d), λ_{f→d}(d) = Σ_f p(f|d), λ_{g→d}(d) = Σ_g p(g|d).
We have defined here two types of messages for node d, λ messages that contain
information passing up from children, and ρ messages that contain information
passing down from parents.
If the children of node d are not fixed in any particular state (there is no “evi-
dence”), then the λ messages are trivially 1. If however, there is evidence so that,
for example, node e is fixed in state 1, then λe→d (d) = p (e = 1|d).
It is clear that the marginal for any node can be calculated from the local messages
incoming to that node. The issue that we now address is how to find a recursion
for calculating such messages.
Consider the case where the graph of fig(11.5) has some extra connections, as in
fig(11.6).
Figure 11.6: To calculate p(d), we only need to adjust the messages passing from
node a to node d and node g to node d. Information from all the other nodes is
as in fig(11.5)
The only messages that need to be adjusted to find the marginal p(d) are those
from a to d, namely ρa→d (a) and from g to d, namely λg→d (d). The marginal for
d will have exactly the same form as equation (11.4.2), except with the following
adjustments to the messages
ρ_{a→d}(a) = Σ_{h,i} p(a|h, i) p(h) p(i) Σ_j p(j|a)
λ_{g→d}(d) = Σ_{g,o} p(g|o, d) p(o) Σ_m p(m|g) Σ_n p(n|g).
Here p(h) = ρ_{h→a}(h), p(i) = ρ_{i→a}(i) and Σ_j p(j|a) = λ_{j→a}(a) in the first message,
and p(o) = ρ_{o→g}(o), Σ_m p(m|g) = λ_{m→g}(g), Σ_n p(n|g) = λ_{n→g}(g) in the second.
The structure of the above equations is that to pass a message from a node n1 to
a child node n2 , we need to take into account information from all the parents of
n1 and all the children of n1 , except n2 . Similarly, to pass a message from node n2
to a parent node n1 , we need to gather information from all the children of node
n2 and all the parents of n2 , except n1 .
Essentially, to formulate the messages for a node, it is as if the parents were dis-
connected from the rest of the graph, with the effect of this disconnection being a
modified prior for each parental node. Similarly, it is as if the children are discon-
nected from the rest of the graph, and the effect of the child nodes is represented
by a modified function on the link (e.g., instead of p(e|d) we have λe,d (d)).
From these intuitions, we can readily generalise the situation to a formal algorithm.
A general node d has messages coming in from parents and from children, and we
can collect all the messages from parents that will then be sent through d to any
subsequent children as[15]
ρ_d(d) = Σ_{pa(d)} p(d|pa(d)) Π_{i∈pa(d)} ρ_{i,d}(i).
Similarly, we can collect all the information coming from the children of node d
that can subsequently be passed to any parents of d,
λ_d(d) = Π_{i∈ch(d)} λ_{i,d}(d).
The message that a parent b then passes to a child d is
ρ_{b,d}(b) = ρ_b(b) Π_{i∈ch(b)\d} λ_{i,b}(b)
Figure 11.7: A simple singly connected distribution. Here variables C and E are
clamped into evidential states, and we wish to infer the marginals p(x|c, e) for the
remaining unclamped variables x ∈ {a, b, d, f, g}
Repeat the above (a,b,c,d) until all the λ and ρ messages between any two adjacent
nodes have been calculated. For all non-evidential nodes i compute ρi (i) λi (i).
The marginal p(i|evidence) is then found by normalising this value.
For binary valued nodes, both the λ and ρ messages for each node are binary
vectors, expressing the messages as a function of the two states that the node can
exist in. It is only the relative value of the messages in their two states which is
important. For this reason, we are free to normalise both λ and ρ messages, which
is useful in avoiding overflow and underflow problems.
The complexity of belief propagation is time exponential in the maximum family
size and linear in space.
One of the main benefits of BP is that we can define a completely local algorithm.
In the BP algorithm above, nodes had to wait for certain messages before they could pass other messages. However, if we initialise with arbitrary (say random) values those messages that are not set by the initialisation procedure, then we claim that we can calculate the messages in any order. Provided that a sweep is
made through all the nodes in the graph, then the messages will, after at most n
iterations, converge to the correct messages given by BP. We will not prove this
here, although the intuition why this is true is clear: Imagine a simple graph which
is a chain. If we start somewhere in the middle of the chain, and calculate the
messages, then these will be incorrect. However, provided that we sweep through
all the nodes, eventually, we will hit the end points of the chain. Since the ini-
tialisation of the end nodes is correct, the messages from these end nodes will be
calculated correctly. Similarly, when we repeat a sweep through all the nodes in
the graph, we will pass through nodes which are adjacent to the end nodes. Since
the messages from the end nodes are correct, the messages from the nodes adjacent
to the end nodes will also be correct. We see therefore that the correct messages
are filled in from the ends, one by one, and that eventually, after at most n sweeps
though all the nodes, all the messages will be calculated correctly. This intuition
holds also for the more general case of singly-connected graphs.
Let’s perform inference for the distribution fig(11.7), in which c and e are eviden-
tial. To denote that c and e are clamped into some particular state, we use the
notation p∗ (c|a) and p∗ (e) which sets these tables so that the states of c and e
which do not correspond to the clamped states, have zero probability.
Multiple Queries
Now that the messages have been calculated, we can easily find all the marginals
using final values for the messages.
For example,
$$
p(d|c,e) \propto \lambda_{f\to d}(d)\,\lambda_{g\to d}(d)\sum_{a,b} p(d|a,b)\,\rho_{a\to d}(a)\,\rho_{b\to d}(b)
$$
The party animal example corresponds to the network in fig(11.8). Given that we observe that the Boss is Angry and that the worker has a Headache, we wish to find the probability that the worker has been to a party.
To complete the specification, the probabilities are given as follows:
p(u = T |p = T, d = T ) = 0.999
p(u = T |p = F, d = T ) = 0.9
p(u = T |p = T, d = F ) = 0.9
p(u = T |p = F, d = F ) = 0.01
p(p = T ) = 0.1, p(d = T ) = 0.05, p(h = T |p = T ) = 0.9, p(h = T |p = F ) = 0.1
p(a = T |u = T ) = 0.99, p(a = T |u = F ) = 0.2
Let’s perform inference for the distribution fig(11.7), in which c and e are eviden-
tial. To denote that c and e are clamped into some particular state, we use the
100
Figure 11.8: All variables are binary. When set to 1 the statements are true: P = Been to Party, H = Got a Headache, D = Demotivated at work, U = Underperform at work, A = Boss Angry. The stars denote that the variables are observed in the true state.
$$
\lambda_{h\to p}(p) = p(h^*|p) = \begin{pmatrix}0.9\\0.1\end{pmatrix},\qquad
\lambda_{a\to u}(u) = p(a^*|u) = \begin{pmatrix}0.99\\0.2\end{pmatrix},\qquad
\rho_{d\to u}(d) = p(d) = \begin{pmatrix}0.05\\0.95\end{pmatrix}
$$
where the upper entry corresponds to the true state and the lower entry to the false state. Then
$$
\lambda_{u\to p}(p) = \sum_u \lambda_{a\to u}(u)\sum_d p(u|p,d)\,\rho_{d\to u}(d) = \begin{pmatrix}0.9150\\0.2431\end{pmatrix}
$$
$$
p(p|h,a) \propto p(p)\,\lambda_{h\to p}(p)\,\lambda_{u\to p}(p) = \begin{pmatrix}0.1\\0.9\end{pmatrix}\cdot\begin{pmatrix}0.9\\0.1\end{pmatrix}\cdot\begin{pmatrix}0.9150\\0.2431\end{pmatrix}
\qquad\Rightarrow\qquad
p(p|h,a) = \begin{pmatrix}0.7901\\0.2099\end{pmatrix}
$$
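As a quick numerical check of the above (a minimal sketch whose variable names are illustrative and not part of the original text), the message passing steps can be reproduced in a few lines; the party prior p(p = T) = 0.1 is the value used in the calculation above.

import numpy as np

# States are ordered [True, False].
p_p = np.array([0.1, 0.9])                 # p(party)
p_d = np.array([0.05, 0.95])               # p(demotivated)
p_h_given_p = np.array([0.9, 0.1])         # p(h=T|p), for p = T, F
p_a_given_u = np.array([0.99, 0.2])        # p(a=T|u), for u = T, F
p_u_given_pd = np.array([[0.999, 0.9],     # p(u=T|p,d), indexed [p, d]
                         [0.9,   0.01]])

lam_a_u = p_a_given_u                      # lambda_{a->u}(u): a observed true
lam_h_p = p_h_given_p                      # lambda_{h->p}(p): h observed true

# lambda_{u->p}(p) = sum_u lambda_{a->u}(u) sum_d p(u|p,d) p(d)
lam_u_p = np.zeros(2)
for ip in range(2):
    for iu in range(2):                    # u = True, False
        pu = p_u_given_pd[ip] if iu == 0 else 1.0 - p_u_given_pd[ip]
        lam_u_p[ip] += lam_a_u[iu] * np.dot(pu, p_d)

post = p_p * lam_h_p * lam_u_p
post /= post.sum()
print(lam_u_p)   # approx [0.9150, 0.2431]
print(post)      # approx [0.7901, 0.2099]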
We can exploit the independency structure of the graph just as we did in belief
propagation. That is, we can distribute the maximization operator over the net-
work, so that only local computations are required. In fact, the only difference
between the belief revision and the belief propagation algorithms is that wherever
there was a summation in belief propagation, it is replaced with a maximization
operation in belief revision.
To see more clearly why this is the case, consider a simple function which can be represented as an undirected chain, say $f(x) = f_1(x_1,x_2)\,f_2(x_2,x_3)\cdots f_{n-1}(x_{n-1},x_n)$, and that we wish to find the joint state x* which maximises f. Firstly, let's calculate the maximal value of f (the corresponding state is straightforward to find by backtracking):
$$
\max_x f(x) = \max_{x_n}\cdots\max_{x_2}\Big(\max_{x_1} f_1(x_1,x_2)\Big)f_2(x_2,x_3)\cdots f_{n-1}(x_{n-1},x_n)
$$
It is clear, from the chain structure of the function and the fact that the maximisation operation may be distributed over the factors, that the maximal value (and its state) can be computed in time which scales linearly with the number of factors in the function.
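As an illustration (a minimal sketch with arbitrary illustrative factor tables, not taken from the text), distributing the maximisation over a chain of pairwise factors and backtracking to recover the maximising state looks as follows.

import numpy as np

# f(x) = f1(x1,x2) f2(x2,x3) f3(x3,x4), each xi in {0,1}
rng = np.random.default_rng(0)
factors = [rng.random((2, 2)) for _ in range(3)]   # illustrative tables

# Forward pass: m_k(x_{k+1}) = max_{x_k} m_{k-1}(x_k) f_k(x_k, x_{k+1})
m = np.ones(2)
argmax_back = []                  # store argmax for backtracking
for f in factors:
    scores = m[:, None] * f       # shape (x_k, x_{k+1})
    argmax_back.append(scores.argmax(axis=0))
    m = scores.max(axis=0)

max_value = m.max()
state = [int(m.argmax())]         # backtrack from the last variable
for back in reversed(argmax_back):
    state.append(int(back[state[-1]]))
state.reverse()                   # state = (x1*, x2*, x3*, x4*)

# Brute-force check of the maximal value
best = max(factors[0][a, b] * factors[1][b, c] * factors[2][c, d]
           for a in range(2) for b in range(2)
           for c in range(2) for d in range(2))
assert np.isclose(max_value, best)
print(state, max_value)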
Note that there is no requirement here that the function f corresponds to a prob-
ability distribution. This hints that there is a more general class of functions and
operations on them that permit computational simplifications, as described in the
next section.
Commutative Semiring
(a · b) + (a · c) = a · (b + c)
Consider, for example, the message $\rho_{b,d}(b)$:
$$
\rho_{b,d}(b) = \sum_a p(b|a)\,\rho_{a,b}(a)
$$
$$
\lambda_{c,a}(a) = \sum_c p(c|a)\,\lambda_{d,c}(c)
$$
Our initial assignment for ρa,b (a) gets updated after we have cycled once around
the loop. If there are other loops in the graph, the messages from this loop will
get passed to other loops, which will feed back messages to this loop in return.
There is no guarantee that the messages will ultimately converge to something
consistent.
11.7.1 Conditioning
One way to solve the difficulties of multiply connected (loopy) graphs is to identify nodes that, if they were not present, would render the reduced graph singly connected[5].
Consider the example of fig(11.11). Imagine that we wish to calculate a marginal, say p(d). Then
$$
p(d) = \sum_c \sum_{a,b,e,f,g} \underbrace{p(c|a)\,p(a)}_{p^*(a)}\; p(d|a,b)\,p(b)\;\underbrace{p(f|c,d)}_{p^*(f|d)}\; p(g|d,e)
$$
where the p∗ definitions are not necessarily distributions. It is clear that, for
each state of c, the form of the products of factors remaining as a function of
a, b, e, f, g is singly-connected, and that therefore, standard propagation (sum-
product) methods can be used to perform inference. We will need to do this for as
many states as there are in variable c, each state defining a new singly-connected
graph (with the same structure) but modified potentials.
More generally, we can define a set of variables C, called the loop-cut set. So,
for the price of a factor exponential in the loop-cut size, we can calculate the
Figure 11.11: (a) A multiply connected graph reduced to a singly connected graph (b) by conditioning on the variable C.
Figure 11.12: (a) A belief network. (b) A cluster graph representation of the
network. The cluster potentials are defined on the round/oval nodes, and the
separator potentials are defined on the square nodes, which share common variables
with their neighbours.
marginals for any (loopy) DAG. The computationally difficult part of conditioning is in determining a small cut set, and there is no guarantee that a small cut set exists for a given graph.
Whilst this method is able to handle loops in a general manner, it is not particularly
elegant, and no more efficient than the Junction Tree approach, as described in
the next chapter.
For every cluster representation, we claim that there exists another cluster rep-
resentation for which the clusters contain the marginals of the distribution. For
example, from the definition of conditional probability, we can rewrite equation
(11.8.1) as
$$
p(U) = \frac{p(a,b)\,p(b,c)\,p(c,d)}{p(b)\,p(c)} = \frac{\Psi^*(a,b)\,\Psi^*(b,c)\,\Psi^*(c,d)}{\Psi^*(b)\,\Psi^*(c)}
$$
where the cluster and separator potentials are set to Ψ*(a, b) = p(a, b), Ψ*(b, c) = p(b, c), Ψ*(c, d) = p(c, d), and Ψ*(b) = p(b), Ψ*(c) = p(c). It turns out that every singly-connected graph can always be represented as a product of the clique marginals divided by the product of the separator marginals – this is a useful and widely applied result in the development of exact and approximate inference algorithms.
Directed graphs which are trees will always have a cluster representation of this
form, since each factor in the distribution can be written using Bayes’ rule, which
will add a clique marginal and corresponding separator.
In the previous section, we saw an example of a directed graphical model that has
a cluster representation. What about undirected graphical models – do they also
have a cluster graph representation? In an undirected graph, each link contains a
In order to find the settings of q that make a perfect match between the distributions q and p, we can minimise the Kullback-Leibler divergence between the two distributions¹.
¹ In the directed case, we need to carry out a moralisation step – see later section.
However, we have to bear in mind that there are consistency constraints, namely
$$
\sum_a q(a,b) = q(b) = \sum_c q(b,c) \qquad\text{and}\qquad \sum_b q(b,c) = q(c) = \sum_d q(c,d).
$$
In addition, there are constraints that each probability must sum to 1. (I'll ignore these since they do not interact with the other terms, and will simply constitute a rescaling of the tables.)
We can enforce these constraints by adding Lagrange multipliers (the reason for
the labeling of the Lagrange multipliers will become clearer later):
$$
\sum_b \gamma_{21}(b)\Big(\sum_a q(a,b) - q(b)\Big),\qquad
\sum_b \gamma_{12}(b)\Big(\sum_c q(b,c) - q(b)\Big),
$$
$$
\sum_c \gamma_{32}(c)\Big(\sum_b q(b,c) - q(c)\Big),\qquad
\sum_c \gamma_{23}(c)\Big(\sum_d q(c,d) - q(c)\Big)
$$
Let’s now differentiate wrt q(b, c). This gives, at the extreme point
Or
and
Or
Similarly,
or
X
φ1 (a, b) ∝ λ12 (b)
a
Similarly, $\sum_c q(b,c) = q(b)$ gives
$$
\sum_c \phi_2(b,c)\,\lambda_{12}(b)\,\lambda_{32}(c) \propto \lambda_{21}(b)\,\lambda_{12}(b)
\qquad\text{or}\qquad
\sum_c \phi_2(b,c)\,\lambda_{32}(c) \propto \lambda_{21}(b)
$$
The constraint $\sum_b q(b,c) = q(c)$ gives
$$
\sum_b \phi_2(b,c)\,\lambda_{12}(b)\,\lambda_{32}(c) \propto \lambda_{32}(c)\,\lambda_{23}(c)
\qquad\text{or}\qquad
\sum_b \phi_2(b,c)\,\lambda_{12}(b) \propto \lambda_{23}(c)
$$
And finally, $\sum_d q(c,d) = q(c)$ gives
$$
\sum_d \phi_3(c,d)\,\lambda_{23}(c) \propto \lambda_{32}(c)\,\lambda_{23}(c)
\qquad\text{or}\qquad
\sum_d \phi_3(c,d) \propto \lambda_{32}(c)
$$
Note that the normalisation constants in the λ messages can be dropped since we
do not need to know these – they are only used at the end to calculate the nor-
malisation of the marginals. Hence, the proportions may be written as equalities.
This gives
$$
\lambda_{12}(b) = \sum_a \phi_1(a,b) \qquad (11.9.2)
$$
$$
\lambda_{23}(c) = \sum_b \phi_2(b,c)\,\lambda_{12}(b) \qquad (11.9.3)
$$
$$
\lambda_{32}(c) = \sum_d \phi_3(c,d) \qquad (11.9.4)
$$
$$
\lambda_{21}(b) = \sum_c \phi_2(b,c)\,\lambda_{32}(c) \qquad (11.9.5)
$$
These equations have a definite starting point (at each end), and can then be
solved deterministically. This means that, for trees (and singly-connected struc-
tures) there is only one fixed point, and this is indeed the minimum zero KL
divergence solution. These equations define the Belief Propagation algorithm and demonstrate why (given that a tree has this representation) the solution found by iterating these equations is unique and is the global minimum. Note that when the equations are used in the order defined above, then
we always have enough information in the preceding equations in order to define
the next equation. This is always the case for trees, and any ordering which starts
from the leaves inwards will be fine. Note that the final marginals are determined
directly from the λ messages, as given in the previous equations – just a simple
normalisation is required.
There is an interesting property of these Belief Propagation equations, namely
that they can be parallelised, since the forward and backward equations for the
messages do not interfere. Once these two sets of messages have been calculated,
they can be locally combined to produce marginals. This is in contrast to the
Hugin scheme below which cannot be parallelised in the same way – however, in
the case of trees with many branches, the Hugin approach may be slightly more
efficient since it does not need to calculate products of messages, which is the case
for the Belief Propagation style approach.
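To make equations (11.9.2)–(11.9.5) concrete, the following minimal sketch runs them on the chain a–b–c–d; the potential tables are illustrative random values, not from the text. The single-variable marginals are obtained by multiplying the two incoming λ messages and normalising, and are checked against brute-force summation.

import numpy as np

rng = np.random.default_rng(1)
phi1 = rng.random((2, 2))   # phi1(a,b), indexed [a,b]
phi2 = rng.random((2, 2))   # phi2(b,c), indexed [b,c]
phi3 = rng.random((2, 2))   # phi3(c,d), indexed [c,d]

# Equations (11.9.2)-(11.9.5)
lam12 = phi1.sum(axis=0)                 # sum_a phi1(a,b)
lam23 = phi2.T @ lam12                   # sum_b phi2(b,c) lam12(b)
lam32 = phi3.sum(axis=1)                 # sum_d phi3(c,d)
lam21 = phi2 @ lam32                     # sum_c phi2(b,c) lam32(c)

qb = lam12 * lam21; qb /= qb.sum()       # marginal p(b)
qc = lam23 * lam32; qc /= qc.sum()       # marginal p(c)

# Brute-force check against p(a,b,c,d) proportional to phi1 phi2 phi3
joint = np.einsum('ab,bc,cd->abcd', phi1, phi2, phi3)
joint /= joint.sum()
assert np.allclose(qb, joint.sum(axis=(0, 2, 3)))
assert np.allclose(qc, joint.sum(axis=(0, 1, 3)))

Note that the forward messages (λ12, λ23) and the backward messages (λ32, λ21) can indeed be computed independently, reflecting the parallelisability discussed above.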
An Alternative : Hugin
We can rewrite the above equations in a manner that will be more efficient in some
cases, particularly when there are many branches in the cluster graph.
Recall that for the running example we have been considering, the cluster graph
is of the form :
$$
q(a,b,c,d) = \frac{\phi(a,b)\,\phi(b,c)\,\phi(c,d)}{\phi(b)\,\phi(c)}
$$
where we initially set φ(b) and φ(c) to be the identity functions, and potentials in
the numerator to those that make a match with the distribution p.
The forward equations for λ can be run independently of the backward equations
since the two sets of equations do not interfere. We consider here an alterna-
tive, in which we run first the forward equations, and subsequently the backward
equations.
Let’s start with λ12 (b). We shall call this a new potential:
$$
\phi^*(b) = \sum_a \phi(a,b)
$$
If we wish this new potential to replace the old φ(b), but q to remain unchanged (this is desired if the initialisation corresponds to the correct distribution p), then we need to make the change:
$$
q = \frac{\phi(a,b)\,\phi(b,c)\,\frac{\phi^*(b)}{\phi(b)}\,\phi(c,d)}{\phi^*(b)\,\phi(c)}
$$
Let's therefore define
$$
\phi^*(b,c) = \phi(b,c)\,\frac{\phi^*(b)}{\phi(b)}
$$
so that q is simply
$$
q = \frac{\phi(a,b)\,\phi^*(b,c)\,\phi(c,d)}{\phi^*(b)\,\phi(c)}
$$
Now consider λ23(c). We shall call this a new potential φ*(c). A simple substitution reveals
$$
\phi^*(c) = \sum_b \phi^*(b,c)
$$
and, in general, a neighbouring cluster potential φ(w) is updated through its separator s as
$$
\phi^*(w) = \phi(w)\,\frac{\phi^*(s)}{\phi(s)}
$$
Once one has then done a full forward sweep and backward sweep of these equa-
tions, the final potentials will be proportional to the marginals. Similarly, one can show that the final separators will also be proportional to the marginal of the separator variable. This is a reformulation of the Belief Propagation equations
based on defining separators and new cluster potentials. The benefit of doing this
in lieu of the BP approach is not immediately obvious. Hopefully it will become
more apparent in the general case below.
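For concreteness, here is a minimal sketch of the forward and backward Hugin sweeps on the same chain (again with illustrative potentials, not values from the text). After both sweeps, each cluster potential equals the corresponding pairwise marginal of the unnormalised distribution and each separator equals its single-variable marginal.

import numpy as np

rng = np.random.default_rng(2)
phi_ab0 = rng.random((2, 2))   # phi(a,b), indexed [a,b]
phi_bc0 = rng.random((2, 2))   # phi(b,c), indexed [b,c]
phi_cd0 = rng.random((2, 2))   # phi(c,d), indexed [c,d]

phi_ab, phi_bc, phi_cd = phi_ab0.copy(), phi_bc0.copy(), phi_cd0.copy()
phi_b, phi_c = np.ones(2), np.ones(2)            # separators initialised to unity

# Forward sweep: (a,b) -> (b,c) -> (c,d)
new_b = phi_ab.sum(axis=0)                       # phi*(b) = sum_a phi(a,b)
phi_bc *= (new_b / phi_b)[:, None]               # phi*(b,c) = phi(b,c) phi*(b)/phi(b)
phi_b = new_b
new_c = phi_bc.sum(axis=0)                       # phi*(c) = sum_b phi*(b,c)
phi_cd *= (new_c / phi_c)[:, None]
phi_c = new_c

# Backward sweep: (c,d) -> (b,c) -> (a,b)
new_c = phi_cd.sum(axis=1)                       # sum_d phi(c,d)
phi_bc *= (new_c / phi_c)[None, :]
phi_c = new_c
new_b = phi_bc.sum(axis=1)                       # sum_c phi(b,c)
phi_ab *= (new_b / phi_b)[None, :]
phi_b = new_b

# Final potentials are the marginals of p(a,b,c,d) = phi(a,b) phi(b,c) phi(c,d)
joint = np.einsum('ab,bc,cd->abcd', phi_ab0, phi_bc0, phi_cd0)
assert np.allclose(phi_ab, joint.sum(axis=(2, 3)))
assert np.allclose(phi_bc, joint.sum(axis=(0, 3)))
assert np.allclose(phi_b, joint.sum(axis=(0, 2, 3)))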
Let vi be a set of variables on a cluster φ(vi ). Our task is, from the general
distribution
$$
p(v) = \prod_i \phi(v_i)
$$
to find a representation
$$
q(v) = \frac{\prod_i q(v_i)}{\prod_{\langle ij\rangle} q(v_{ij})}
$$
Here $v_{ij} = v_i \cap v_j$, $\langle ij\rangle$ denotes the set of pairs $i,j$ for which cluster $i$ is a neighbour of cluster $j$, and $v$ is the union of all the $v_i$. Below, $\sum_{v_i} f(v_i)$ means the sum of the clique function $f(v_i)$ over all the joint states of the set of variables $v_i$.
Then KL(q||p) may be written as
$$
\sum_i\sum_{v_i} q(v_i)\log q(v_i) - \sum_{\langle ij\rangle}\sum_{v_{ij}} q(v_{ij})\log q(v_{ij}) - \sum_i\sum_{v_i} q(v_i)\log \phi(v_i)
- \sum_{\langle ij\rangle}\sum_{v_{ij}} \gamma_{ji}(v_{ij})\Big(\sum_{v_i\setminus v_{ij}} q(v_i) - q(v_{ij})\Big) \qquad (11.9.6)
$$
(as before, the normalisation constraints have been omitted since they play only a
trivial non-interacting role).
Differentiating wrt q(v_i) and q(v_{ij}) we arrive at
$$
q(v_i) \propto \phi(v_i)\prod_{j\in n(i)} \lambda_{ji}(v_{ij}), \qquad q(v_{ij}) \propto \lambda_{ij}(v_{ij})\,\lambda_{ji}(v_{ij})
$$
so that, imposing the consistency constraint $\sum_{v_i\setminus v_{ij}} q(v_i) = q(v_{ij})$, we have
$$
\sum_{v_i\setminus v_{ij}} \phi(v_i)\prod_{k\in n(i)} \lambda_{ki}(v_{ik}) = \lambda_{ij}(v_{ij})\,\lambda_{ji}(v_{ij})
$$
which are the usual Belief Propagation equations. This is also called the Shenoy-Shafer updating scheme. However, if we didn't cancel the λ terms, we could rewrite
$$
q^{\mathrm{new}}(v_{ij}) = \sum_{v_i\setminus v_{ij}} q^{\mathrm{old}}(v_i) \qquad (11.9.7)
$$
Hence, we can form updates for the separator potentials in this fashion. What
about the potentials in the numerator? In order to do this, we need to invoke an-
other piece of information. One approach is that if we made an initial assignment
that the q was equal to p (say by making a simple assignment of the numerators,
with the separators set to unity), then we need q to be invariant under the trans-
formation equation (11.9.7). The separator q(vij ) always occurs with a numerator
term q(vj ), and hence the requirement that the distribution q remains equal to p
reduces to the requirement
$$
q^{\mathrm{new}}(v_j) = q^{\mathrm{old}}(v_j)\,\frac{q^{\mathrm{new}}(v_{ij})}{q^{\mathrm{old}}(v_{ij})} \qquad (11.9.8)
$$
This form of update is called Hugin propagation, and together equation (11.9.8) and equation (11.9.7) form a procedure called absorption.
11.10 Problems
Exercise 22 Consider the distribution p(a, b, c) = p(c|a, b)p(a)p(b). If we clamp
b into an evidential state, what effect will this have on a? Explain your answer
intuitively.
Exercise 23 Consider the belief network given below, which concerns the proba-
bility of a car starting.
The network relates the variables Battery (b), Fuel (f), Gauge (g), Turn Over (t) and Start (s), with conditional probability tables including
P(g=empty|b=good, f=not empty) = 0.04
P(g=empty|b=good, f=empty) = 0.97
P(g=empty|b=bad, f=not empty) = 0.10
P(g=empty|b=bad, f=empty) = 0.99
P(t=no|b=good) = 0.03
P(t=no|b=bad) = 0.98
Calculate P (f = empty|s = no), the probability of the fuel tank being empty con-
ditioned on the observation that the car does not start. Do this calculation “by
hand”, i.e. do not use or create a computer program to do this.
11.11 Solutions
24 (Thanks to Peter Mattsson for typesetting this) The marginal p(d) can be cal-
culated as
$$
p(d) = \sum_{a,s,t,l,b,e,x} p(a,s,t,l,b,e,x,d)
= \sum_{a,s,t,l,b,e,x} p(a)p(s)p(t|a)p(l|s)p(b|s)p(e|t,l)p(x|e)p(d|b,e)
$$
$$
= \sum_{a,s,t,l,b,e} p(a)p(s)p(t|a)p(l|s)p(b|s)p(e|t,l)p(d|b,e)\Big(\sum_x p(x|e)\Big)
$$
$$
= \sum_{s,t,l,b,e} p(s)p(l|s)p(b|s)p(e|t,l)p(d|b,e)\Big(\sum_a p(t|a)p(a)\Big),
$$
where we have noted that $\sum_x p(x|e) = 1$. The final term on the RHS is just the marginal p(t), which is
p(t = yes) = p(t = yes|a = yes) × p(a = yes) + p(t = yes|a = no) × p(a = no)
= 0.05 × 0.01 + 0.01 × 0.99 = 0.01 × 1.04
= 0.0104.
Armed with this, we can further simplify the expression for p(d) to
$$
p(d) = \sum_{s,l,b,e} p(s)p(l|s)p(b|s)p(d|b,e)\Big(\sum_t p(e|t,l)p(t)\Big).
$$
p(d = yes|b = yes, s = yes) =p(d = yes|b = yes, e = yes) × p(e = yes|s = yes)
+ p(d = yes|b = yes, e = no) × p(e = no|s = yes)
=0.9 × 0.10936 + 0.2 × 0.89064 = 0.276552,
p(d = yes|b = yes, s = no) =p(d = yes|b = yes, e = yes) × p(e = yes|s = no)
+ p(d = yes|b = yes, e = no) × p(e = no|s = no)
=0.9 × 0.020296 + 0.2 × 0.979704 = 0.2142072,
p(d = yes|b = no, s = yes) =p(d = yes|b = no, e = yes) × p(e = yes|s = yes)
+ p(d = yes|b = no, e = no) × p(e = no|s = yes)
=0.3 × 0.10936 + 0.1 × 0.89064 = 0.121872,
p(d = yes|b = no, s = no) =p(d = yes|b = no, e = yes) × p(e = yes|s = no)
+ p(d = yes|b = no, e = no) × p(e = no|s = no)
=0.3 × 0.020296 + 0.1 × 0.979704 = 0.1040592.
We now have
We now have
$$
p(d) = \sum_s p(s)\sum_b p(d|b,s)\,p(b|s),
$$
p(d = yes|s = yes) =p(d = yes|b = yes, s = yes) × p(b = yes|s = yes)
+ p(d = yes|b = no, s = yes) × p(b = no|s = yes)
=0.276552 × 0.6 + 0.121872 × 0.4 = 0.21468,
p(d = yes|s = no) =p(d = yes|b = yes, s = no) × p(b = yes|s = no)
+ p(d = yes|b = no, s = no) × p(b = no|s = no)
=0.2142072 × 0.3 + 0.1040592 × 0.7 = 0.1371036.
Now, at last, we can calculate $p(d) = \sum_s p(d|s)\,p(s)$, which is
p(d = yes) =p(d = yes|s = yes) × p(s = yes) + p(d = yes|s = no) × p(s = no)
=0.21468 × 0.5 + 0.1371036 × 0.5 = 0.1758918.
Thus we have p(d = yes) = 0.1758918, p(d = yes|s = yes) = 0.21468 and
p(d = yes|s = no) = 0.1371036.
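The arithmetic of the final steps above is easy to verify numerically. The following sketch simply re-runs those steps from the conditional values quoted in the solution; it does not re-derive p(e|s), whose computation used tables not reproduced here.

# Values quoted in the solution above
p_e_given_s = {'yes': 0.10936, 'no': 0.020296}           # p(e=yes|s)
p_d_given_be = {('yes', 'yes'): 0.9, ('yes', 'no'): 0.2,  # p(d=yes|b,e)
                ('no', 'yes'): 0.3, ('no', 'no'): 0.1}
p_b_given_s = {'yes': 0.6, 'no': 0.3}                     # p(b=yes|s)
p_s = {'yes': 0.5, 'no': 0.5}

def p_d_bs(b, s):
    # p(d=yes|b,s) = sum_e p(d=yes|b,e) p(e|s)
    pe = p_e_given_s[s]
    return p_d_given_be[(b, 'yes')] * pe + p_d_given_be[(b, 'no')] * (1 - pe)

def p_d_s(s):
    # p(d=yes|s) = sum_b p(d=yes|b,s) p(b|s)
    pb = p_b_given_s[s]
    return p_d_bs('yes', s) * pb + p_d_bs('no', s) * (1 - pb)

p_d = sum(p_d_s(s) * p_s[s] for s in ('yes', 'no'))
print(p_d_s('yes'), p_d_s('no'), p_d)   # 0.21468, 0.1371036, 0.1758918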
12 The Junction Tree Algorithm
Parts of this chapter are based on Expert Systems and Probabilistic Network Models by E. Castillo, J. Gutierrez, and A. Hadi (Springer, 1997), and also An Introduction to Bayesian Networks by F. V. Jensen (Springer, 1996). Both are excellent introductions to the field.
One may have had the nagging suspicion during the discussion of the algorithms
relating to DAGs, that what we are really exploiting is the underlying graphical
structure, whether this be directed or not. Does it really matter if the graph
is directed? For example, in bucket elimination, we are essentially removing the
directedness of the graph and defining (undirected) functions in their place. What
this suggests is that the complexity of the calculations on directed graphs can be
transformed into an undirected graph, possibly of greater connectivity than the
directed graph from which it was derived.
Indeed, there is an algorithm that does this. A graph, directed or undirected, is
transformed into an undirected graph on which the relevant computations can be
performed. This is called the junction tree algorithm.
I’ll derive the algorithm in the context of graphical models (probability distribu-
tions). The reader should bear in mind that a more general approach shows how
the Junction Tree algorithm is also appropriate for forming a structure for which
the Generalised Distributive Law of the previous chapter is guaranteed to work for
a graph of any structure, singly or multiply-connected.
1 Note that, in the beginning, the assignment of the cluster potentials does not satisfy the
consistency requirement. The aim is to find an algorithm that modifies them so that ultimately
consistency is achieved.
Similarly,
$$
p(s) = \sum_{v\setminus s} \Psi(v)
$$
Absorption  Let v and w be neighbours in a cluster tree, let s be their separator, and let Ψ(v), Ψ(w) and Ψ(s) be their potentials. Absorption replaces the tables Ψ(s) and Ψ(w) with
$$
\Psi^*(s) = \sum_{v\setminus s} \Psi(v)
$$
$$
\Psi^*(w) = \Psi(w)\,\frac{\Psi^*(s)}{\Psi(s)}
$$
The idea behind this definition is that, under the update of the table for v, the
table for the separator s and neighbour w are updated such that the link remains
consistent. To see this consider
Note that if Ψ (s) can be zero, then we need also Ψ (w) to be zero when Ψ (s) is
zero for this procedure to be well defined. In this case, the potential takes the value
unity at that point. (This requirement is on Ψ (w) and not Ψ∗ (s) since we are
considering whether or not it is possible to transmit the information through the
current state of the link). We say that a link is supportive if it allows absorption
in both directions (that is Ψ (v) and Ψ (w) will both be zero when Ψ (s) is zero).
Note that supportiveness is preserved under absorption.
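As a minimal sketch of the absorption operation on table potentials (the helper function, its arguments and the example tables are illustrative assumptions, not from the text), with the zero-separator convention handled as described above:

import numpy as np

def absorb(psi_v, psi_w, psi_s, axes_v, axes_w):
    """w absorbs from v through separator s.
    axes_v / axes_w: axes of psi_v / psi_w summed out to reach the separator.
    Returns the updated (psi_s, psi_w)."""
    psi_s_new = psi_v.sum(axis=axes_v)
    # where the old separator is zero, the update ratio is defined to be unity
    ratio = np.ones_like(psi_s)
    nz = psi_s != 0
    ratio[nz] = psi_s_new[nz] / psi_s[nz]
    # broadcast the ratio over the axes of psi_w that are summed out
    shape = [psi_w.shape[ax] if ax not in axes_w else 1 for ax in range(psi_w.ndim)]
    return psi_s_new, psi_w * ratio.reshape(shape)

# Example: clusters (a,b) and (b,c) with separator b, illustrative tables
rng = np.random.default_rng(3)
psi_ab, psi_bc, psi_b = rng.random((2, 2)), rng.random((2, 2)), np.ones(2)
psi_b, psi_bc = absorb(psi_ab, psi_bc, psi_b, axes_v=(0,), axes_w=(1,))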
Invariance of Cluster Tree Let T be a supportive cluster tree. Then the product of all cluster potentials
under Absorption divided by the product of all separator potentials is invariant under absorption.
Proof: When w absorbs v through the separator s, only the potentials of w and s are changed. It is enough therefore to show that the ratio of the w and s tables is unchanged. We have
$$
\frac{\Psi^*(w)}{\Psi^*(s)} = \frac{\Psi(w)\,\Psi^*(s)/\Psi(s)}{\Psi^*(s)} = \frac{\Psi(w)}{\Psi(s)}
$$
Figure 12.2: An example of the updating procedure for Hugin message passing.
The Hugin scheme is slightly advantageous over the standard Belief Propagation scheme since one doesn't need to take the products of all the messages at junctions. However, it is not as readily parallelisable as the BP scheme, since the messages are not independent.
What we would like to do for a general distribution is to define a potential rep-
resentation of the graph such that, coupled with a suitable algorithm to modify
these potentials, the effect will be, as above, that the marginals of individual or
groups (in fact cliques) can be read off directly from the modified potentials. This
is the idea of the junction tree algorithm.
Comment : When the original distribution p is singly-connected, one can show
that there is a cluster potential representation that is also singly-connected (actu-
ally, it will also be guaranteed to satisfy the running-intersection property so that
locally consistent links will propagate to ensure global consistency). In that case,
Hugin Belief Propagation may be performed. However, in the more general case of
multiply-connected graphs, it may not be clear that an appropriate cluster repre-
sentation exists (apart from trivial ones which essentially put all the variables into
one cluster). How to construct efficient and suitable cluster-trees is at the heart
of the JTA.
Junction Tree A cluster tree is a junction tree if, for each pair of nodes, v and w, all nodes on
Running Intersection the path between v and w contain the intersection v ∩ w. This is also called the
running intersection property.
From this definition, it is clear that, in a consistent junction tree, the local consis-
Local = Global consistency tency will be passed on to any neighbours. That is, a consistent junction tree is
globally consistent.
Marginals Let T be a consistent junction tree over U , and let Ψ (U ) be the product of all
potentials divided by the product of all separator potentials. Let v be a node with
potential Ψ (v). Then
$$
\Psi(v) = \sum_{U\setminus v} \Psi(U) = p(v)
$$
To gain some intuition about the meaning of this theorem, consider the junction
tree in fig(12.3). After a full round of message passing on this tree, each link
is consistent, and the product of the potentials divided by the product of the
separator potentials is just the original distribution itself. Imagine that we are
interested in calculating the marginal for the node ABC. That requires summing
over all the other variables, D, E, F, G, H. If we consider summing over H then,
because the link is consistent,
$$
\sum_h \Psi(e,h) = \Psi(e)
$$
so that the ratio $\sum_h \Psi(e,h)/\Psi(e)$ is unity, so that the effect of summing over node H is that the link between EH and DCE can be removed, along with the separator.
Figure 12.3: A junction tree with cliques ABC, DCE, CF, EG and EH: ABC is connected to DCE and to CF through separators C, and DCE is connected to EG and to EH through separators E. This satisfies the running intersection property: for any two nodes which contain a variable a, the path linking the two nodes also contains the variable a.
The same happens for the link between node EG and DCE, and also for CF to
ABC. The only nodes remaining are now DCE and ABC and their separator C,
which have so far been unaffected by the summations. We still need to sum out
over D and E. Again, because the link is consistent,
$$
\sum_{d,e} \Psi(d,c,e) = \Psi(c)
$$
so that the ratio $\sum_{d,e}\Psi(d,c,e)/\Psi(c) = 1$. The result of the summation of all variables not in ABC therefore produces unity for the cliques and their separators, and the summed potential representation reduces simply to the potential Ψ(a, b, c), which is the marginal p(a, b, c). It is clear that a similar effect will happen for other
nodes. Formally, one can prove this using induction.
We can then obtain the marginals for individual variables by simple brute force
summation over the other variables in that potential. In the case that the number
of variables in each node is small, this will not give rise to any computational
difficulties. However, since the complexity is exponential in the clique size of the
Junction Tree, it is prudent to construct the Junction Tree to have minimal clique
sizes. Although, for a general graph, this is itself computationally difficult, there
exist efficient heuristics for this task.
Figure 12.4: (a) DAG, (b) moral graph, (c) junction graph, (d) junction tree.
Figure 12.5: (a) A singly connected graph and (b) its junction graph. By removing any of the links in (b) with separator F you get a junction tree.
Figure 12.6: If we were to form a clique graph from the graph on the left, this would
not satisfy the running intersection property, namely that if a node appears in two
cliques, it appears everywhere on the path between the cliques. By introducing the
extra link (middle picture), this forms larger cliques, of size three. The resulting
clique graph (right) does satisfy the running intersection property (separator set
not shown). Hence it is clear that loops of length four or more certainly require
the addition of such chordal links to ensure the running intersection property in
the resulting clique graph. It turns out that adding a chord in for all loops of
length four or more is sufficient to ensure the running intersection property for
any resulting clique graph.
Junction Graph and Tree Then, between any two clusters with a non-empty intersection add a link with the
intersection as the separator. The resulting graph is called a junction graph. All
separators consist of a single variable, and if the junction graph contains loops,
then all separators on the loop contain the same variable. Therefore any of the
links can be removed to break the loop, and by removing links until you have a
tree, you get a junction tree.
Consider the graph in fig(12.5a). Following the above procedure, we get the junction graph fig(12.5b). By breaking the loop BCF - F - FI - F - FJ - F - BCF anywhere, we obtain a junction tree.
The previous section showed how to construct a JT for a singly-connected graph.
If we attempt to do this for a multiply connected (loopy) graph, we find that the
above procedure generally does not work since the resulting graph will not neces-
sarily satisfy the running intersection property. The idea is to grow larger clusters, such that the resulting graph does satisfy the running intersection property.
Clearly, a trivial solution would be to include all the variables in the graph in one
cluster, and this will complete our requirements. However, of course, this does not
help in finding an efficient algorithm for computing marginals. What we need is
a sufficient approach that will guarantee that we can always form a junction tree
from the resulting junction graph. This operation is called triangulation, and it
generally increases the minimum clique size, sometimes substantially.
Figure 12.7: (a) An undirected graph with a loop. (b) Eliminating node D adds a link between A and C in the subgraph. (c) The induced representation for the graph in (a). (d) An alternative equivalent induced representation.
$$
p(a,b,c,d) = p(a,b,c)\,\frac{\phi(c,d)\,\phi(d,a)}{\sum_d \phi(c,d)\,\phi(d,a)}
$$
Let's try to replace the numerator terms with probabilities. We can do this by considering
$$
p(a,c,d) = \phi(c,d)\,\phi(d,a)\sum_b \phi(a,b)\,\phi(b,c)
$$
so that
$$
p(a,b,c,d) = \frac{p(a,b,c)\,p(a,c,d)}{\sum_d \phi(c,d)\phi(d,a)\,\sum_b \phi(a,b)\phi(b,c)} = \frac{p(a,b,c)\,p(a,c,d)}{p(a,c)}.
$$
Figure 12.8: (a) An undirected graph with a loop. (b) An induced representation.
Figure 12.9: (a) An undirected graph which is not triangulated. (b) We start the algorithm, labeling nodes until we reach node 11. This has neighbours 6 and 8 that are not adjacent. (c) We can correct this then by adding a link between 6 and 8, and restarting the algorithm. (d) The reader may check that this is a correct triangulation.
we decide to eliminate. However, the reader may convince herself that one such
induced representation is given by fig(12.8b).
Generally, the result from variable elimination and re-representation in terms of
the induced graph is that a link between any two variables on a loop (of length 4
or more) which do not have a chord is added. This is called triangulation. Any
triangulated graph can be written in terms of the product of marginals divided by
the product of separators.
Armed with this new induced representation, we can carry out a message propa-
gation scheme as before.
The following is an algorithm that terminates with success if and only if the graph
is triangulated[18]. It processes each node and the time to process a node is
quadratic in the number of adjacent nodes
(see http://www.cs.wisc.edu/∼dpage/cs731/).
The weight of a tree is defined to be the sum of all the separator weights of the tree,
where the separator weight is defined as the number of variables in the separator.
A simple algorithm to find the spanning tree with maximal weight is as follows.
Start by picking the edge with the largest weight, and add this to the edge set.
Then pick the next candidate edge which has the largest weight and add this to
the edge set – if this results in an edge set with cycles, then reject the candidate
edge, and find the next largest edge weight.
Note that there may be many maximal weight spanning trees. This algorithm
provides one.
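The greedy procedure just described is straightforward to code. Below is a minimal sketch (the example edge weights are illustrative, not from the text) using a union-find structure to reject candidate edges that would create a cycle.

def max_weight_spanning_tree(n_cliques, edges):
    """edges: list of (weight, i, j) candidate separators between cliques i and j.
    Returns the edges of a maximal weight spanning tree (greedy selection)."""
    parent = list(range(n_cliques))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, i, j in sorted(edges, reverse=True):   # largest weight first
        ri, rj = find(i), find(j)
        if ri != rj:                  # accept only if no cycle is created
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Illustrative example: 4 cliques, weight = number of shared variables
edges = [(2, 0, 1), (1, 0, 2), (2, 1, 2), (1, 1, 3), (1, 2, 3)]
print(max_weight_spanning_tree(4, edges))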
Figure 12.10: Example of the JTA. In (a) is the original loopy graph. (b) The moralisation links are between nodes E and F and between nodes F and G. The other additional links come from triangulation. The clique size of the resulting clique tree (not shown) is four.
Form the Junction Tree Form a Junction Tree by forming a cluster representation from cliques of the
triangulated graphs, removing any unnecessary links in a loop on the cluster
graph. Algorithmically, this can be achieved by finding a tree with maximal
spanning weight.
Potential Assignment Assign the potentials to the cliques on the Junction Tree and assign the
separator potentials on the JT to unity.
Message Propagation Then carry out the absorption procedure until updates have been passed
along both directions of every link on the JT.
Then the clique marginals can be read off from the JT. An example is given in
fig(12.10).
There are some interesting points about the JTA. It provides an upper bound
on the computation required to calculate marginals in the graph. This means
that there may indeed exist more efficient algorithms in particular cases, although
generally it is believed that there cannot be much more efficient approaches than
the JTA since every other approach must perform a triangulation[19, 20].
However, there are, in general, many different ways to carry out the triangulation step. Ideally, we would like to find a triangulated graph which has minimal clique size. However, finding the most efficient triangulation can be shown to be a computationally hard (NP-hard) problem. In practice, the triangulation algorithms used are somewhat heuristic, and are chosen to provide reasonable, but clearly not optimal, performance.
Figure 12.11: p*(x) ∝ p(x)^10. In both figures the vertical dashed line indicates (on the x-axis) the mean value of x. Note how p* becomes much more peaked around its most probable value, and how the mean value of p* shifts to be close to the most likely value. In the limit p*(x) ∝ (p(x))^β, β → ∞, the mean of the distribution p* tends to the most likely value.
distribution? There is a simple trick which will enable us to adapt the JTA to answer this².
In general, a probability distribution may be written as
$$
p = \frac{1}{Z}\prod_c \phi(x_c)
$$
where φ(x_c) is the potential for cluster c. Consider a modified distribution in which we wish to re-weight the states, making the higher probability states exponentially more likely than lower probability states. This can be achieved by defining
$$
p^* = \frac{1}{Z_\beta}\prod_c \phi^\beta(x_c)
$$
where β is a very large positive quantity. This makes the distribution p∗ very
peaked around the most-likely value of p, see fig(12.11).
In the JTA, we need to carry out summations over states. However, in the limit
β → ∞ it is clear that only the most-likely state will contribute, and hence that the
summation operation can be replaced by a maximisation operation in the definition
of absorption. The algorithm thus proceeds as normal, replacing the summations
with maximisations, until the final stage, whereby one reads off $\arg\max_{x_c}\phi(x_c)$ from the modified final potential on cluster c to find the most likely state.
Consider, for example, the distribution
$$
p(a,b,c) = p(a|b)\,p(b|c)\,p(c)
$$
2 As with the comments at the beginning of the chapter, the reader should bear in mind that
the Generalised Distributive Law can be extended to the loopy case by using the updating
equations on the Junction Tree. In this sense, any operations within the semiring algebra are
admissible.
Figure 12.12: (a) A belief network over a, b, c. (b) Junction tree for the network: cliques (a, b) and (b, c) with separator b.
There are three questions we are interested in: (i) What is p(b)? (ii) What is p(b|a = 1, c = 1)? (iii) What is the likelihood of the evidence, p(a = 1, c = 1)?
For this simple graph, the moralisation and triangulation steps are trivial, and the
JTA is given immediately by fig(12.12b). A valid assignment is Ψ (a, b) = p(a|b),
Ψ (b) = 1, Ψ (b, c) = p(b|c)p(c).
First let’s absorb from (a, b) through the separator b to (b, c):
First we clamp the evidential variables in their states. Then we claim that the
effect of running the JTA is to produce on the cliques, the joint marginals p(a =
1, b, c = 1), p(a = 1, b, c = 1) and p(a = 1, b, c = 1) for the final potentials on the
two cliques and their separator. We demonstrate this below:
• In general, the new separator is given by $\Psi^*(b) = \sum_a \Psi(a,b) = \sum_a p(a|b) = 1$. However, since a is clamped in state a = 1, the summation is not carried out over a, and we have instead $\Psi^*(b) = p(a=1|b)$.
• The new potential on the (b, c) clique is given by
$$
\Psi^*(b,c) = \frac{\Psi(b,c)\,\Psi^*(b)}{\Psi(b)} = \frac{p(b|c=1)\,p(c=1)\,p(a=1|b)}{1}.
$$
• The new separator is normally given by $\Psi^{**}(b) = \sum_c \Psi^*(b,c)$. However, since c is clamped in state 1, the summation is not carried out over c, and we have instead $\Psi^{**}(b) = p(b|c=1)\,p(c=1)\,p(a=1|b)$.
• The new potential on (a, b) is given by
$$
\Psi^*(a,b) = \frac{\Psi(a,b)\,\Psi^{**}(b)}{\Psi^*(b)} = \frac{p(a=1|b)\,p(b|c=1)\,p(c=1)\,p(a=1|b)}{p(a=1|b)} = p(a=1|b)\,p(b|c=1)\,p(c=1).
$$
Hence, here in this special case, all the cliques contain the joint distribution
p(a = 1, b, c = 1).
In general, the effect of clamping a set of variables V in their evidential states and running the JTA is that, for a clique i which contains the set of non-evidential variables $H^i$, the potential at the end of the JTA contains the marginal $p(H^i, V)$.
Then calculating the conditional marginal p(b|a = 1, c = 1) is a simple matter since
p(b|a = 1, c = 1) ∝ p(a = 1, b, c = 1), where the proportionality is determined by
the normalisation constraint.
By the above procedure, the effect of clamping the variables in their evidential states and running the JTA produces the joint marginals, such as Ψ*(a, b) = p(a = 1, b, c = 1). Then calculating the likelihood is easy, since we just sum out the non-evidential variables of any converged potential: $p(a=1,c=1) = \sum_b \Psi^*(a,b) = \sum_b p(a=1,b,c=1)$.
Whilst we have demonstrated these results only on such a simple graph, the same story holds in the general case, so conditional marginals and likelihoods can be obtained in exactly the same way. The main thing to remember is that clamping the variables in their evidential states means that what is found at the end of the JTA is, for each clique, the joint distribution of its non-evidential variables together with all the evidential variables clamped in their evidential states. From this, conditionals and the likelihood are straightforward to calculate.
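To illustrate this on the small chain example above, the following sketch clamps a = 1 and c = 1, runs the two absorptions, and reads off p(b|a = 1, c = 1) and the likelihood p(a = 1, c = 1). The CPT numbers are illustrative assumptions, not values from the text.

import numpy as np

# Illustrative CPTs for p(a,b,c) = p(a|b) p(b|c) p(c); all variables binary
p_c = np.array([0.3, 0.7])
p_b_given_c = np.array([[0.6, 0.2],     # p(b|c), indexed [b, c]
                        [0.4, 0.8]])
p_a_given_b = np.array([[0.9, 0.4],     # p(a|b), indexed [a, b]
                        [0.1, 0.6]])

a_obs, c_obs = 1, 1                     # evidence a = 1, c = 1

# Valid assignment: Psi(a,b)=p(a|b), Psi(b)=1, Psi(b,c)=p(b|c)p(c); clamp evidence
psi_ab = np.zeros((2, 2)); psi_ab[a_obs, :] = p_a_given_b[a_obs, :]
psi_bc = p_b_given_c * p_c[None, :]
psi_bc[:, 1 - c_obs] = 0.0
psi_b = np.ones(2)

# Absorb (a,b) -> b -> (b,c), then (b,c) -> b -> (a,b)
new_b = psi_ab.sum(axis=0); psi_bc *= (new_b / psi_b)[:, None]; psi_b = new_b
new_b = psi_bc.sum(axis=1); psi_ab *= (new_b / psi_b)[None, :]; psi_b = new_b

likelihood = psi_ab.sum()                        # p(a=1, c=1)
p_b_given_evidence = psi_b / psi_b.sum()         # p(b|a=1, c=1)

# Brute-force check
joint = np.einsum('ab,bc,c->abc', p_a_given_b, p_b_given_c, p_c)
ev = joint[a_obs, :, c_obs]
assert np.isclose(likelihood, ev.sum())
assert np.allclose(p_b_given_evidence, ev / ev.sum())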
12.8 Problems
Exercise 25 Consider the following undirected graphical model on the chain x1 − x2 − x3 − x4:
$$
p(x_1,x_2,x_3,x_4) \propto \phi(x_1,x_2)\,\phi(x_2,x_3)\,\phi(x_3,x_4)
$$
1. Draw a clique graph that represents this distribution, and indicate the sepa-
rators on the graph.
2. Write down an alternative formula for the distribution p(x1 , x2 , x3 , x4 ) in
terms of the marginal probabilities p(x1 , x2 ), p(x2 , x3 ), p(x3 , x4 ), p(x2 ), p(x3 )
12.9 Solutions
13 Variational Learning and EM
in which a single ‘best’ estimate for Θ is chosen. If the user does not feel able
to specify any prior preference for Θ (a so-called “flat” prior p(Θ) = const), the
parameters are given by Maximum Likelihood
$$
\Theta_{ML} = \arg\max_\Theta p(V|\Theta)
$$
which simply says that we set the parameter Θ to that value for which the observed
data was most likely to have been generated.
Belief Nets
Figure 13.1: A simple model for the relationship between lung Cancer, Asbestos exposure and Smoking.
The distribution is p(a, s, c) = p(c|a, s)p(a)p(s), which is depicted in fig(13.1). Here we have made the assumption that Cancer is
dependent on both exposure to Asbestos and being a Smoker, but that there is no
direct relationship between Smoking and exposure to Asbestos. This is the kind
of assumption that we may be able to elicit from experts such as doctors who have
good intuition/understanding of the relationship between variables.
Furthermore, we assume that we have a list of individuals' characteristics in the population, where each row of the table below represents a training example. This is perhaps taken from hospital records or a general survey of the population.
A S C
1 1 1
1 0 0
0 1 1
0 1 0
1 1 1
0 0 0
1 0 1
Intuitive Table Settings Looking at instances where A = 0, S = 0, we find always C = 0, and hence
p(C = 1|A = 0, S = 0) = 0. Similarly, we can count other cases to form a
CPT table. Counting the instances of A = 1, we find p(A = 1) = 4/7, and
similarly, p(S = 1) = 4/7. These three CPTs then complete the full distribution
specification.
A S p(C = 1|A, S)
0 0 0
0 1 0.5
1 0 0.5
1 1 1
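These table settings are easily reproduced by counting; the following minimal sketch (not from the text) recovers the CPT above together with p(A = 1) = p(S = 1) = 4/7 from the seven listed cases.

import numpy as np

# Columns: A, S, C -- the seven training cases listed above
data = np.array([[1, 1, 1],
                 [1, 0, 0],
                 [0, 1, 1],
                 [0, 1, 0],
                 [1, 1, 1],
                 [0, 0, 0],
                 [1, 0, 1]])
A, S, C = data[:, 0], data[:, 1], data[:, 2]

print(A.mean(), S.mean())        # p(A=1) = p(S=1) = 4/7

# p(C=1|A=a, S=s) by counting
for a in (0, 1):
    for s in (0, 1):
        match = (A == a) & (S == s)
        print(a, s, C[match].mean() if match.any() else None)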
We want to learn the entries of the CPTs. For convenience, let pa (x1 ) = {x2 , x3 },
and say we want to find the CPT entry p(x1 = 1|x2 = 1, x3 = 0).
Counting the Occurrences  Naively, the number of times p(x1 = 1|x2 = 1, x3 = 0) occurs in the log likelihood is equal to c(1, 1, 0), the number of such occurrences in the training set. However, since (by the normalisation constraint) p(x1 = 0|x2 = 1, x3 = 0) = 1 − p(x1 = 1|x2 = 1, x3 = 0), the total contribution of p(x1 = 1|x2 = 1, x3 = 0) to the log likelihood is
$$
c(x_1{=}1,x_2{=}1,x_3{=}0)\log p(x_1{=}1|x_2{=}1,x_3{=}0) + c(x_1{=}0,x_2{=}1,x_3{=}0)\log\big(1-p(x_1{=}1|x_2{=}1,x_3{=}0)\big),
$$
which is maximised by
$$
p(x_1=1|x_2=1,x_3=0) = \frac{c(x_1=1,x_2=1,x_3=0)}{c(x_1=1,x_2=1,x_3=0)+c(x_1=0,x_2=1,x_3=0)}
$$
From the above example, it is clear that we can set values for the all the table
entries. However, consider a smaller dataset:
A S C
1 1 1
1 0 0
0 1 1
0 1 0
According to the ML principle above, we will not be able to determine the entry p(c|a = 0, s = 0), since there are no entries in the database which jointly contain the setting a = 0 and s = 0. In this case, we either need additional information, or assumptions about how to set the missing table entries. One approach that may lead to a fuller specification is to require that not only all the jointly observed training data should be maximally likely, but also that any marginal observations should be maximally likely – that is, we restrict attention to a subset of the variables, say here C alone, and require that the model is maximally likely to generate the observed statistics for the C variable alone. Since calculating the marginal likelihood p(c) involves summing over all the states, $\sum_{s,a} p(c|s,a)p(s)p(a)$, we obtain an objective function that contains at least the parameters p(c|s = 0, a = 0). How to choose such marginals, and how to weight this requirement with the
Imagine that we have a node with n parents, in state x = (x1, . . . , xn). For binary variables, there are therefore 2^n entries in the CPT to specify for that node. This rapidly becomes infeasible, and we need to use a functional specification of the table – for example, a sigmoidal function of a weighted sum of the parental states.
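A common concrete choice (stated here as an assumption, since the text's formula is not reproduced) is p(x = 1|x1, . . . , xn) = σ(Σ_i w_i x_i + b) with σ(z) = 1/(1 + e^(−z)), so that only n + 1 parameters are needed instead of 2^n table entries:

import numpy as np

def sigmoid_cpt(parent_states, w, b):
    """p(x=1 | parents) = sigmoid(w . parents + b); parent_states in {0,1}^n."""
    z = np.dot(w, parent_states) + b
    return 1.0 / (1.0 + np.exp(-z))

# n = 5 parents parameterised by 6 numbers rather than a 2^5-entry table
w = np.array([1.5, -0.7, 0.3, 2.0, -1.2])   # illustrative weights
b = -0.5
print(sigmoid_cpt(np.array([1, 0, 1, 1, 0]), w, b))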
Previously, we assumed that all the variables are 'visible', or 'evidential'. That is, each training data point specified the values of all the variables in the distribution. In many cases, we simply will not be able to directly measure certain variables, although we know that they are important in the modelling process. Such variables are 'hidden'. In cases where we have a training set in which the variables are specified for some examples but not for others, we have a 'missing data' problem. Both of these cases can be dealt with by using Maximum Likelihood – however, we calculate the likelihood for each training example on only those visible variables.
For example, consider two training examples, x¹ and x², in which x¹ is fully observed, but x² has an unobserved first component, i.e. $x^2 = (?, x_2^2, x_3^2, \ldots, x_n^2)$. The log likelihood of the data is then
$$
L = \log p(x^1) + \log \sum_{x_1^2} p(x_1^2, x_2^2, \ldots, x_n^2)
$$
The structure of the second term is a summation over the missing values for the
variable missing for that example. If there are many missing datapoints, calculat-
ing the summations may be difficult. However, one can see that the log-likelihood
remains a function of the tables, and one can optimise this as usual. However,
this direct approach is rarely taken in practice. An alternative, general and more
elegant approach to this problem, is given by the EM algorithm, as described
below.
In the above, we could use marginalisation to calculate the likelihood of the visible
data,
X
p(v µ |Θ) = p(v µ , hµ |Θ)
hµ
However, there are reasons why this may not be such a good idea. For example,
there are so many hidden units that we cannot carry out the summation (Junction
Tree Cliques are too large). Or the resulting log likelihood is difficult to optimise
using standard approaches – the objective function is extremely complicated.
There exists a useful general procedure for learning with hidden units. Special
cases of this approach include the Expectation Maximisation (EM) algorithm,
Generalised EM (GEM) algorithms and (probably all) the other EM variants. In
the machine learning literature, Neal and Hinton[21] made the connection between
the traditional EM algorithm and the more general variational treatment. See
[22] for a more standard exposition.
The variational EM has several potential positive aspects
• Can (often but not always) help deal with intractability
• Provides a rigorous lower bound on the likelihood
• May make larger parameter updates than gradient based approaches.
Before we can talk about the Variational Learning algorithm, which is a special
variational technique based on the Kullback-Leibler divergence between two dis-
tributions, we need a digression into Information Theory.
Figure 13.3: (a) The probability density functions for two different distributions p(x) and q(x). We would like to numerically characterise the difference between these distributions. (b) A simple linear bound on the logarithm enables us to define a useful distance measure between distributions (see text).
where the notation $\langle f(x)\rangle_{r(x)}$ denotes the average of the function f(x) with respect to the distribution r(x). For a continuous variable, this would be $\langle f(x)\rangle_{r(x)} = \int f(x)\,r(x)\,dx$, and for a discrete variable, $\langle f(x)\rangle_{r(x)} = \sum_x f(x)\,r(x)$. The advantage of this notation is that much of the following holds independently of whether the variables are discrete or continuous.
KL(q, p) ≥ 0  The KL divergence is always ≥ 0. To see this, consider the following simple linear bound on the function log(x) (see fig(13.3b)):
$$
\log(x) \le x - 1
$$
Furthermore, one can show that the KL divergence is zero if and only if the two distributions are exactly the same.
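With the standard definition KL(q, p) = ⟨log q(x) − log p(x)⟩_{q(x)}, the non-negativity follows in one line from the bound above (a short derivation sketch, not reproduced from the text):
$$
\mathrm{KL}(q,p) = -\Big\langle \log\frac{p(x)}{q(x)}\Big\rangle_{q(x)} \;\ge\; \Big\langle 1 - \frac{p(x)}{q(x)}\Big\rangle_{q(x)} = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0,
$$
with equality only when p(x) = q(x) wherever q(x) > 0.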
Figure 13.4: The idea in variational learning is to bound a possibly complex likelihood L(Θ) by a simpler function, for which the maximum is easy to find, here say finding Θ1. Subsequently, a new lower bound on the likelihood is fitted, using the previous best value for the optimal Θ, and a new optimal bound value Θ2 is found. This is repeated until convergence, each time pushing up the lower bound, and therefore hopefully improving our estimate of where the maximum of the likelihood is. This iterative procedure can often lead to the rapid finding of (locally) optimal values for the parameters.
Summing over the training data, we get the bound on the marginal likelihood
$$
\log p(V|\Theta) \;\ge\; \underbrace{-\sum_{\mu=1}^P \langle \log q^\mu(h|v)\rangle_{q^\mu(h|v)}}_{\text{Entropy}} \;+\; \underbrace{\sum_{\mu=1}^P \langle \log p(h^\mu, v^\mu|\Theta)\rangle_{q^\mu(h|v)}}_{\text{Energy}}
$$
This bound is exact (that is, it is equal to the log marginal likelihood) when we
set q µ (h|v) = p(hµ |v µ ). Recalling that our aim is to find an algorithm that will
adjust any parameters of p to maximize the likelihood, a reasonable thing to do
would be a relaxed version, namely to maximize a lower bound on the likelihood.
That is, to iteratively adjust the parameters Θ to push up the lower bound on
the (marginal) likelihood, and in so doing hopefully push up the true (marginal)
likelihood.
Variational Learning
Since the parameter Θ only occurs in the Energy term, this suggests that we can
iteratively firstly set the optimal parameters Θ by optimising the Energy term
1 This is analogous to the Mean Field bound on the partition function in statistical physics, and
motivates the terminology ‘energy’ and ‘entropy’.
(for fixed q µ (h|v)). And then, we can optimise (push up the lower bound) by
finding a better set of fixed q µ (h|v), by optimising with respect to the variational
distributions q µ (h|v):
1. Expectation (E) step : Choose a set of distributions
q µ (h|v), µ = 1 . . . P
from a chosen class of distributions, for which each q µ (h|v) minimises the
KL divergence KL(q µ (h|v), p(hµ |v µ )).
2. Maximisation (M) step: Set
$$
\Theta \leftarrow \arg\max_\Theta \sum_{\mu=1}^P \langle \log p(h^\mu, v^\mu|\Theta)\rangle_{q^\mu(h|v)}
$$
Iterate (1,2) until parameter convergence. Steps (1) and (2) are guaranteed to
increase the lower bound on the likelihood.
Whilst, by definition, the EM algorithm cannot decrease the bound on the likelihood, an important question is whether or not the iterations can decrease the likelihood itself.
Another way to rephrase our bound on the likelihood, log p(v|θ′) ≥ LB(θ′|θ), is as
$$
\mathrm{KL}\big(p(h|v,\theta),\, p(h|v,\theta')\big) = \log p(v|\theta') - LB(\theta'|\theta)
$$
That is, the KL divergence is simply the difference between the lower bound and the true likelihood. Similarly, we may write
$$
\mathrm{KL}\big(p(h|v,\theta),\, p(h|v,\theta)\big) = \log p(v|\theta) - LB(\theta|\theta) = 0
$$
Hence
$$
\log p(v|\theta') - \log p(v|\theta) = \underbrace{LB(\theta'|\theta) - LB(\theta|\theta)}_{\ge 0} + \underbrace{\mathrm{KL}\big(p(h|v,\theta),\, p(h|v,\theta')\big)}_{\ge 0}
$$
The first assertion is true since, by definition, we search for a θ′ which has a higher
value for the bound than our starting value θ. The second assertion is trivially true
by the property of the KL divergence. Hence we reach the important conclusion
that the EM (or GEM/variational implementation), not only essentially increases
the lower bound on the likelihood, but also increases the likelihood itself (or, at
least, the EM cannot decrease these quantities).
EM Algorithm  Clearly, if we do not restrict the class of distributions that the q can take, the optimal choice is
$$
q^\mu(h|v) = p(h^\mu|v^\mu)
$$
Figure 13.5: The EM approach is an axis-aligned way to find a maximum of the lower bound B(θ, q). This proceeds by, for fixed q, finding the best parameters θ (the M-step), and then, for fixed θ, finding the best distributions q (the E-step). Of course, any other optimisation procedure is valid, and indeed may result in faster convergence than this simple axis-aligned approach. However, an advantage of the EM style is that it leads to a simple-to-implement-and-interpret algorithm.
In general, it may be that we can only carry out the average over q for a very restricted class of distributions, for example factorised distributions $q^\mu(h|v) = \prod_j q^\mu(h_j|v)$. Hence, in practice, we often choose a simpler class of distributions Q, e.g. the factorised class $q^\mu(h|v) = \prod_i q^\mu(h_i|v)$, which may make the averaging required for the energy simpler.
Determining the best Imagine we parameterise our distribution class Q using a parameter θQ . We can
distribution in the class find the best distribution in class Q by minimising the KL divergence between
q µ (h|v, θQ ) and p(hµ |v µ , Θ) numerically using a non-linear optimisation routine.
Alternatively, one can assume a certain structured form for the q distribution, and
learn the optimal factors of the distribution by free form functional calculus.
Using a class of simpler q distributions like this corresponds to a Generalised EM
algorithm (GEM).
The previous variational learning theory is very general. To make things more
concrete, we apply the previous theory to learning the CPTs in a BN in which
certain variables are hidden. We first apply it to a very simple network.
Imagine, as in table fig(13.1), we have a set of data, but that we do not know the
states of variable a. That is,
S C
1 1
0 0
1 1
1 0
1 1
0 0
0 1
Firstly, let’s assume that we have chosen some values for the distributions q µ (a|c, s),
e.g. q 1 (a = 1|c = 1, s = 1) = 0.6, q 2 (a = 1|c = 0, s = 0) = 0.3, q 3 (a = 1|c3 = 1, s =
1) = 0.7, q 4 (a = 1|c = 0, s = 1) = 0.1 . . .. Now we write down the Energy term:
7
X
E= hlog p(cµ |aµ , sµ ) + log p(aµ ) + log p(sµ )iqµ (a|c,s)
µ=1
7 n
X o
E= hlog p(cµ |aµ , sµ )iqµ (a|c,s) + hlog p(aµ )iqµ (a|c,s) + log p(sµ )
µ=1
Remember that our goal is to learn the CPTs p(c|a, s) and p(a) and p(s).
Pleasingly, the final term is simply the log likelihood of the variable s, and p(s)
appears explicitly only in this term. Hence, the usual maximum likelihood rule
applies, and p(s = 1) is simply given by the relative number of times that s = 1
occurs in the database (hence p(s = 1) = 4/7, p(s = 0) = 3/7).
The parameter p(a = 1) occurs in the terms
$$
\sum_\mu \big\{q^\mu(a=0|c,s)\log p(a=0) + q^\mu(a=1|c,s)\log p(a=1)\big\}
$$
That is, whereas in the standard ML estimate, we would have the real counts of
the data in the above formula, here they have been replaced with our guessed
values q µ (a = 0|c, s) and q µ (a = 1|c, s).
A similar story holds for the more complex case of, say, p(c = 1|a = 0, s = 1). The contribution of this term to the Energy is
$$
\sum_{\mu:\,c^\mu=1,s^\mu=1} q^\mu(a=0|c=1,s=1)\log p(c=1|a=0,s=1) \;+\; \sum_{\mu:\,c^\mu=0,s^\mu=1} q^\mu(a=0|c=0,s=1)\log\big(1-p(c=1|a=0,s=1)\big)
$$
which is
$$
\log p(c=1|a=0,s=1)\sum_{\mu:\,c^\mu=1,s^\mu=1} q^\mu(a=0|c=1,s=1) \;+\; \log\big(1-p(c=1|a=0,s=1)\big)\sum_{\mu:\,c^\mu=0,s^\mu=1} q^\mu(a=0|c=0,s=1)
$$
Optimising with respect to p(c = 1|a = 0, s = 1) gives
$$
p(c=1|a=0,s=1) = \frac{\displaystyle\sum_{\mu:\,c^\mu=1,s^\mu=1} q^\mu(a=0|c=1,s=1)}{\displaystyle\sum_{\mu:\,c^\mu=1,s^\mu=1} q^\mu(a=0|c=1,s=1) + \sum_{\mu:\,c^\mu=0,s^\mu=1} q^\mu(a=0|c=0,s=1)}
$$
Again, this has an intuitive relationship to ML for the complete data case, in which
the missing data has been filled in by the assumed distributions q.
What about the distributions qµ(a|c, s)? If we use the standard EM algorithm, we should set these to
$$
q^\mu(a|c,s) = p(a|c^\mu,s^\mu) \propto p(a, c^\mu, s^\mu) = p(c^\mu|a, s^\mu)\,p(a)\,p(s^\mu)
$$
where the current set of values for the p's is taken from the previous calculation. These two stages are then iterated: in the next step, we use these new values of qµ(a|c, s) to calculate the next p's, etc. These equations will converge to a local optimum of the bound.
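A minimal sketch of these iterations for the table above (the code and the initial values are illustrative assumptions, not from the text): the E-step sets q^µ(a = 1) ∝ p(c^µ|a = 1, s^µ)p(a = 1), and the M-step re-estimates the tables from these soft counts.

import numpy as np

# Training cases (s, c) with a hidden, from the table above
data = [(1, 1), (0, 0), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1)]

p_a1 = 0.5                                  # initial guess for p(a=1)
p_c1 = np.full((2, 2), 0.5)                 # initial p(c=1|a,s), indexed [a, s]
p_s1 = sum(s for s, _ in data) / len(data)  # p(s=1): ordinary ML, no hidden parent

for _ in range(50):
    # E-step: q^mu(a=1) proportional to p(c^mu | a=1, s^mu) p(a=1)
    q = []
    for s, c in data:
        like = lambda a: (p_c1[a, s] if c == 1 else 1 - p_c1[a, s])
        w1, w0 = like(1) * p_a1, like(0) * (1 - p_a1)
        q.append(w1 / (w1 + w0))
    q = np.array(q)

    # M-step: soft counts replace the hard counts of complete-data ML
    p_a1 = q.mean()
    for a in (0, 1):
        for s in (0, 1):
            w = q if a == 1 else 1 - q
            num = sum(w[m] for m, (sm, cm) in enumerate(data) if sm == s and cm == 1)
            den = sum(w[m] for m, (sm, cm) in enumerate(data) if sm == s)
            p_c1[a, s] = num / den if den > 0 else 0.5

print(p_a1, p_s1)
print(p_c1)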
$$
\sum_\mu \langle\log p(h^\mu, v^\mu)\rangle_{q^\mu(h|v)} = \sum_\mu\sum_i \langle\log p(x_i^\mu|\mathrm{pa}(x_i^\mu))\rangle_{q^\mu(h|v)}
$$
where each xi is either clamped into a visible state, or is a hidden unit. Note that
p(xµi |pa (xµi )) is only a function of the variables xi ∪ pa (xi ), the family of node xi
and that, in general, some of these may be hidden. The hidden nodes of this family
are giµ ≡ xi ∪ pa (xi ) \v µ . Since the term p(xµi |pa (xµi )) therefore only depends on
giµ , we only require to average with respect to q µ (giµ |v). If we use the optimal
choice (EM setting), q µ (giµ |v) = p(giµ |v µ ), it is clear that this marginal is easy to
calculate for any (poly)tree, since the marginal can be calculated by the JTA, and
that therefore this term can be computed efficiently.
To be more specific, consider a simple five variable distribution with discrete variables,
$$
p(x_1,x_2,x_3,x_4,x_5) = p(x_1|x_2)\,p(x_2|x_3)\,p(x_3|x_4)\,p(x_4|x_5)\,p(x_5),
$$
in which the variables x2 and x4 are consistently hidden in the training data, and training data for x1, x3, x5 are always present. In this case, the contributions to the energy have the form
$$
\sum_\mu \big\langle \log p(x_1^\mu|x_2)\,p(x_2|x_3^\mu)\,p(x_3^\mu|x_4)\,p(x_4|x_5^\mu)\,p(x_5^\mu)\big\rangle_{q^\mu(x_2,x_4|x_1,x_3,x_5)}
$$
A useful property can now be exploited, namely that each term depends on only
those hidden variables in the family that that term represents. Thus we may write
$$
\sum_\mu \langle\log p(x_1^\mu|x_2)\rangle_{q^\mu(x_2|x_1,x_3,x_5)}
+ \sum_\mu \langle\log p(x_2|x_3^\mu)\rangle_{q^\mu(x_2|x_1,x_3,x_5)}
+ \sum_\mu \langle\log p(x_3^\mu|x_4)\rangle_{q^\mu(x_4|x_1,x_3,x_5)}
$$
$$
+ \sum_\mu \langle\log p(x_4|x_5^\mu)\rangle_{q^\mu(x_4|x_1,x_3,x_5)}
+ \sum_\mu \log p(x_5^\mu) \qquad (13.2.2)
$$
It is clear that the final term causes us no difficulties, and this table can be set
using the standard ML framework. Let us consider therefore a more difficult table,
namely p(x1 |x2 ). When will the table entry p(x1 = i|x2 = j) occur in the energy?
This happens whenever xµ1 is in state i. Since there is a summation over all the
states of variables x2 (due to the average), there is also a single time when variable
x2 is in state j. Hence the contribution to the energy from terms of the form
p(x1 = i|x2 = j) is
$$
\sum_\mu I[x_1^\mu = i]\, q^\mu(x_2=j|x_1,x_3,x_5)\,\log p(x_1=i|x_2=j)
$$
where the indicator function I[xµ1 = i] equals 1 if xµ1 is in state i and is zero
otherwise. To ensure normalisation of the table, we add a Lagrange term:
$$
\sum_\mu I[x_1^\mu = i]\, q^\mu(x_2=j|x_1,x_3,x_5)\,\log p(x_1=i|x_2=j) \;+\; \lambda\Big(1 - \sum_k p(x_1=k|x_2=j)\Big)
$$
Hence
$$
p(x_1=i|x_2=j) = \frac{\sum_\mu I[x_1^\mu=i]\, q^\mu(x_2=j|x_1,x_3,x_5)}{\sum_{\mu,k} I[x_1^\mu=k]\, q^\mu(x_2=j|x_1,x_3,x_5)}
$$
Using the EM algorithm, we would use q µ (x2 = j|x1 , x3 , x5 ) = p(x2 = j|xµ1 , xµ3 , xµ5 ).
Note that this optimal distribution is easy to find for any polytree since this just
corresponds to the marginal on the family, given some nodes in the graph are
clamped in their evidential state. Hence, for EM, an update for the table would
be
$$
p^{\mathrm{new}}(x_1=i|x_2=j) = \frac{\sum_\mu I[x_1^\mu=i]\, p^{\mathrm{old}}(x_2=j|x_1^\mu,x_3^\mu,x_5^\mu)}{\sum_{\mu,k} I[x_1^\mu=k]\, p^{\mathrm{old}}(x_2=j|x_1^\mu,x_3^\mu,x_5^\mu)} \qquad (13.2.3)
$$
Similar expressions can be derived for the other tables. The important thing to
note is that we only ever need local marginals for the variables in a family. These
are always easy to obtain in polytrees (assuming that the number of states in a
family is not too large), since this corresponds to inference in a tree conditioned
on some evidence. Hence all updates in the EM algorithm are computable.
What about the table p(x2 = i|x3 = j)?
To ensure normalisation of the table, we add a Lagrange term:
$$
\sum_\mu I[x_3^\mu = j]\, q^\mu(x_2=i|x_1,x_3,x_5)\,\log p(x_2=i|x_3=j) \;+\; \lambda\Big(1 - \sum_k p(x_2=k|x_3=j)\Big)
$$
All that we do, therefore, in the general EM case, is to replace those determin-
istic functions such as I[xµ2 = i] by their missing variable equivalents pold (x2 =
i|xµ1 , xµ3 , xµ5 ).
Consider the log likelihood of the visible data, L(θ) ≡ log p(v|θ). Then
$$
\partial_\theta L(\theta) = \frac{1}{p(v|\theta)}\,\partial_\theta\, p(v|\theta) = \frac{1}{p(v|\theta)}\,\partial_\theta \int_h p(v,h|\theta)
$$
At this point, it may seem that computing the derivative is difficult. However, we
may observe
\partial_\theta L(\theta) = \int_h \frac{p(v, h|\theta)}{p(v|\theta)}\, \partial_\theta \log p(v, h|\theta)
\partial_\theta L(\theta) = \int_h p(h|v, \theta)\, \partial_\theta \log p(v, h|\theta)
The rhs is just the average of the derivative of the complete likelihood. This is
closely related to the EM algorithm, though note that the average is performed with respect to the current distribution parameters \theta, and not \theta^{old} as in the EM case.
Used in this way, computing the derivatives of latent variable models is relatively
straightforward. These derivatives may then be used as part of a standard opti-
misation routine such as conjugate gradients[23].
What makes this problem awkward is that the parameters also occur in Z, and
hence the objective function does not split into a set of isolated parameter terms.
An upper bound on Z
Since \log x \le x - 1, we have -\log x \ge 1 - x. Hence
-\log \frac{Z}{Z'} \ge 1 - \frac{Z}{Z'} \quad\Rightarrow\quad -\log Z \ge -\log Z' + 1 - \frac{Z}{Z'}
Let's call the parameters \theta. Then we can write the bound (for a single datapoint, P = 1) as
L(\theta) \ge \underbrace{E(\theta) - \log Z(\theta^{old}) + 1 - \frac{Z(\theta)}{Z(\theta^{old})}}_{LB(\theta,\theta^{old})}
where E(\theta) represents \sum_{\mu,c} \log \phi_c(v^\mu). Hence
L(\theta) - L(\theta^{old}) \ge LB(\theta, \theta^{old}) - L(\theta^{old})
Using the property that L(\theta^{old}) = LB(\theta^{old}, \theta^{old}), we have
L(\theta) - L(\theta^{old}) \ge LB(\theta, \theta^{old}) - LB(\theta^{old}, \theta^{old})
Hence, provided we can find a θ that increases the lower bound on the likelihood,
we are guaranteed to increase the likelihood itself. This is similar to the guarantees
provided by the EM algorithm. The generalisation to multiple datapoints P > 1
just follows by summing the above over the datapoints.
The IPF procedure then follows by iteratively maximising LB(\theta, \theta^{old}) with respect to \theta. The potential advantage of this method over gradient based procedures is apparent if the optimum of LB(\theta, \theta^{old}) with respect to \theta can be achieved in closed form. Otherwise, there may be little advantage[26].
13.6 Problems
Exercise 27 (Printer Nightmare) Cheapco is, quite honestly, a pain in the neck. Not only did they buy a dodgy old laser printer from StopPress and use it mercilessly, but they also try to get away with using substandard components and materials. Unfortunately for StopPress, they have a contract to maintain Cheapco's old warhorse, and end up frequently sending the mechanic out to repair the printer.
After the 10th visit, they decide to make a model of Cheapco’s printer, so that
they will have a reasonable idea of the fault based only on the information that
Cheapco’s secretary tells them on the phone. In that way, StopPress hopes to be
able to send out to Cheapco only a junior repair mechanic.
Based on the manufacturer’s information, StopPress has a good idea of the depen-
dencies in the printer, and what is likely to directly affect other printer components.
However, the way that Cheapco abuse their printer is a mystery, so that the ex-
act probabilistic relationships between the faults and problems is idiosyncratic to
Cheapco. However, StopPress has the following table of faults for each of the 10
visits. (Each column represents a visit, the transpose of the normal format).
fuse assembly malfunction 0 0 0 1 0 0 0 0 0 0
drum unit 0 0 0 0 1 0 0 1 0 0
toner out 1 1 0 0 0 1 0 1 0 0
poor paper quality 1 0 1 0 1 0 1 0 1 1
worn roller 0 0 0 0 0 0 1 0 0 0
burning smell 0 0 0 1 0 0 0 0 0 0
poor print quality 1 1 1 0 1 1 0 1 0 0
wrinkled pages 0 0 1 0 0 0 0 0 1 0
multiple pages fed 0 0 1 0 0 0 1 0 1 0
paper jam 0 0 1 1 0 0 1 1 1 1
p(x|C_2) = \begin{cases} 200(x-1) & 1.0 \le x \le 1.1 \\ 0 & \text{otherwise} \end{cases}
The prior probabilities P (C1 ) = 0.6 and P (C2 ) = 0.4 are also known from experi-
ence. Calculate the optimal Bayes’ classifier and P (error).
The notation P(x) \sim N(\mu, \sigma^2) is shorthand for the Gaussian distribution P(x) = e^{-(x-\mu)^2/2\sigma^2}/\sqrt{2\pi\sigma^2}. Assume that P(x) \sim N(0, \sigma_0^2) and P(y_i|x) \sim N(x, \sigma^2) for i = 0, \ldots, n-1. Show that P(x|y_0, \ldots, y_{n-1}) is Gaussian with mean
\mu = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \bar{y}
where \bar{y} is the mean of the observations y_0, \ldots, y_{n-1}.
Exercise 31 (Bayesian analysis) Consider the beta distribution p(\theta) = c(\alpha, \beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}, where c(\alpha, \beta) is a normalizing constant. The mean of this distribution is E[\theta] = \alpha/(\alpha+\beta). For \alpha, \beta > 1 the distribution is unimodal (i.e. it has only one maximum). Find the value \theta^* where this maximum is attained, and compare it to the mean. For what values of \alpha and \beta do the mean and \theta^* coincide?
Exercise 33 Suppose that instead of using the Bayes' decision rule to choose class k if P(C_k|x) > P(C_j|x) for all j \neq k, we use a randomized decision rule, choosing class j with probability Q(C_j|x). Calculate the error for this decision rule, and show that the error is minimized by using Bayes' decision rule.
Show that \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu.
Show that \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx = \sigma^2.
\frac{1}{P} \sum_{i=1}^{P} (x^i - \mu)^2
Exercise 37 A training set consists of one dimensional examples from two classes.
The training examples from class 1 are
0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25
Fit a (one dimensional) Gaussian using Maximum Likelihood to each of these two
classes. Also estimate the class probabilities p1 and p2 using Maximum Likelihood.
What is the probability that the test point x = 0.6 belongs to class 1?
13.7 Solutions
III. Probabilistic models in Machine Learning
14 Introduction to Bayesian Methods
Introduction
Regarding the general problem of fitting models to data, we are rarely certain
about either our data measurements (they may be inherently ‘noisy’) or model
beliefs. It is natural to use probabilities to account for these uncertainties. How
can we combine our data observations with these modelling uncertainties in a
consistent and meaningful manner? The Bayesian approach provides a consistent
framework for formulating a response to these difficulties, and is noteworthy for
its conceptual elegance[29, 27, 30, 31]. Indeed, throughout these chapters, we have
been using the Bayesian framework, since this is simply inherent in the correct
use of probabilities in graphical models and, in particular, inference in graphical
models. However, here we look a little more closely at Bayes' rule and its applications in some very simple graphical models.
As a reminder to the reader, the fundamental probabilistic relationship required
for inference is the celebrated Bayes’ rule which, for general events A,B,C is
p(A|B, C) = \frac{p(B|A, C)\, p(A|C)}{p(B|C)}     (14.0.1)
In modelling data, it is convenient to think of different levels of uncertainty in
formulating a model. At the lowest level, we may assume that we have the correct
model, but are uncertain as to the parameter settings θ for this model. This
assumption details how observed data is generated, p (data|θ, model). The task
of inference at this level is to calculate the posterior distribution of the model
parameter. Using Bayes’ rule, this is
p(\theta|data, model) = \frac{p(data|\theta, model)\, p(\theta|model)}{p(data|model)}     (14.0.2)
Thus, if we wish to infer model parameters from data we need two assumptions:
(1) How the observed data is generated under the assumed model, the likelihood
p (data|θ, model) and (2) Beliefs about which parameter values are appropriate,
before the data has been observed, the prior p(θ|model). (The denominator in
equation (14.0.2) is the normalising constant for the posterior and plays a role in
uncertainty at the higher, model level). That these two assumptions are required
is an inescapable consequence of Bayes’ rule, and forces the Bayesian to lay bare
all necessary assumptions underlying the model.
Let θ be the probability that a coin will land up heads. An experiment yields the
data, D = {h, h, t, h, t, h, . . .}, which contains H heads and T tails in H + T flips
of the coin. What can we infer about θ from this data? Assuming that each coin
is flipped independently, the likelihood of the observed data is
p(D|\theta, model) = \theta^H (1 - \theta)^T     (14.0.3)
Figure 14.1: Coin Tossing: (a) The prior: this indicates our belief that the coin
is heavily biased. (b) The likelihood after 13 Tails and 12 Heads are recorded,
θML = 0.48. (c) The posterior: the data has moderated the strong prior beliefs
resulting in a posterior less certain that the coin is biased. θMAP = 0.25, θ̄ = 0.39
The Bayesian approach is more flexible than maximum likelihood since it allows
(indeed, instructs) the user to calculate the effect that the data has in modifying
prior assumptions about which parameter values are appropriate. For example,
if we believe that the coin is heavily biased, we may express this using the prior
distribution in fig(14.1a). The likelihood as a function of θ is plotted in fig(14.1b)
for data containing 13 Tails and 12 Heads. The resulting posterior fig(14.1c) is
bi-modal, but less extreme than the prior. It is often convenient to summarise the posterior by either the maximum a posteriori (MAP) value, or the mean, \bar{\theta} = \int \theta\, p(\theta|D)\, d\theta. Such a summary is not strictly required by the Bayesian framework,
and the best choice of how to summarise the posterior depends on other loss
criteria[27].
The above showed how we can use the Bayesian framework to assess which parameters of a model are a posteriori appropriate, given the data at hand. We can carry out a similar procedure at a higher, model level to assess which models are more appropriate fits to the data. In general, the model posterior is given by
p(model|data) = \frac{p(data|model)\, p(model)}{p(data)}
In the coin example, we can use this to compare the biased coin hypothesis (model M1 with prior given in fig(14.1a)) with a less biased hypothesis formed by using a Gaussian prior p(\theta|M2) with mean 0.5 and variance 0.1^2 (model M2). This gives a Bayes factor p(D|M1)/p(D|M2) \approx 0.00018. If we have no prior preference for either model M1 or M2, the data more strongly favours model M2, as intuition would suggest. If we desired, we could continue in this way, forming a hierarchy of models, each less constrained than the submodels it contains.
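As a concrete illustration of both steps above (summarising the posterior, and comparing models by their evidence), here is a minimal MATLAB sketch on a discretised grid of \theta values. The bimodal 'biased coin' prior used for M1 is only an assumption (the text specifies its prior only through fig(14.1a)), so the resulting numbers will not exactly reproduce those quoted above; M2 uses the Gaussian prior with mean 0.5 and standard deviation 0.1 mentioned in the text.

theta = linspace(0.001, 0.999, 1000); dth = theta(2) - theta(1);
H = 12; T = 13;                                   % the coin data of fig(14.1)
lik = theta.^H .* (1-theta).^T;                   % likelihood (14.0.3)
% an assumed bimodal prior expressing the belief that the coin is heavily biased:
priorM1 = theta.^0.2 .* (1-theta).^3 + theta.^3 .* (1-theta).^0.2;
priorM1 = priorM1 / (sum(priorM1)*dth);
post = priorM1 .* lik; post = post / (sum(post)*dth);   % posterior p(theta|D) on the grid
[~, imax]  = max(post);
theta_map  = theta(imax);                         % MAP summary of the posterior
theta_mean = sum(theta .* post) * dth;            % mean summary of the posterior
% model comparison: Gaussian prior (mean 0.5, std 0.1) for the fairer model M2
priorM2 = exp(-(theta-0.5).^2/(2*0.1^2)); priorM2 = priorM2/(sum(priorM2)*dth);
bayes_factor = sum(lik.*priorM1) / sum(lik.*priorM2)    % p(D|M1)/p(D|M2)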
For simplicity, consider first two models M1 and M2 whose parameter spaces are of the same dimension, and for which the prior is flat. The model likelihood p(D|M1) = \int_\theta p(D|\theta, M1)\, p(\theta|M1) is therefore essentially dominated by only the high likelihood region. If model M2 has roughly the same likelihood values where it fits well, but has a higher volume of them, then the likelihood for model M2 is higher than for model M1, see fig(14.2). This also explains why a model which has a higher dimensional parameter space will usually be rejected in favour of a model which fits equally well with a smaller parameter space, since the prior in the latter case is a unit mass spread out over a smaller number of dimensions, which will therefore have a higher weight.
The philosophy in the Bayesian approach is that parameter (or model) uncertainty is reduced in the presence of observed data. However, except in pathological cases (such as an infinite amount of training data) there still remains uncertainty in the parameters. Whilst the Bayesian principle is well-established, not everyone is convinced. A popular, non-Bayesian method for model or parameter determination runs somewhat as follows. We split the available training data into two sets, a training set and a validation set. Model M1 and model M2 are both trained on the training data, giving a single 'optimal' parameter for each model. Each model with its optimal parameter is then tested on the validation set. The model which has the better performance on the validation set is then preferred. In this sense, the uncertainty is not in the parameter space (since only a single optimal parameter is retained). Rather, the uncertainty is in the predictive performance of each model.
Need to talk about classical hypothesis testing....... The predictive performance
is (often, but not necessarily) assumed to be Gaussian. Then we would perhaps
Figure 14.2: The points represent data for which we wish to find a function that goes through the datapoints well. We consider two model classes M1 and M2 – each with their own parameter spaces. For example, M1 might represent polynomials of order 20, and M2 polynomials of order 10. In M1, there is only a small region of parameter space for which the function fits well, for example, the solid red curve. If we move slightly away from this region, and use the red-dashed function, due to the complex nature of the model, by 'definition' this means that it will fit many other kinds of data, and hence be sensitive in this way. On the other hand, for model M2, there is a large area of the parameter space for which the functions fit well. Since p(D|M) = \int_\theta p(D|\theta, M)\, p(\theta|M), this means that the 'evidence' for how well the model fits is roughly the volume of the space for which the likelihood is very high. For two models for which the likelihood is equally high, since the prior p(\theta|M) is a unit mass spread out over the parameter space, the model for which the likelihood covers a higher amount of the space will be preferred.
observe a certain performance, and judge whether or not this is more typical of
model 1 or 2....blah...
Need to explain the differences in these approaches, (see also the Laplace to su-
pernova paper).
Error Analysis
Consider a situation where two classifiers A and B have been tested on some data,
so that we have, for each example \mu in the test set, an error pair
(e_a(\mu), e_b(\mu)), \quad \mu = 1, \ldots, P
where P is the number of test data points, and e_a \in \{1, \ldots, Q\} (and similarly for
eb ). That is, there are Q possible types of error that can occur. This is useful
in text classification, where TruePositive, FalseNegative, TrueNegative and False-
Positive might form four kinds of ‘errors’. For notational simplicity we also call
a TruePositive an ‘error’. It might be more appropriate to use a term such as
‘outcome label’, although this should also not be confused with the class label of
1 The theory is readily extensible to multiple classifiers, and is left as an exercise for the interested
reader.
2 Ideally, a true Bayesian will use a Bayesian Classifier, for which there will always, in principle,
be a direct way to estimate the suitability of the model in explaining the experimental data.
We consider here the less fortunate situation where two non-Bayesian classifiers have been
used, and only their test performances are available for evaluating the classifiers.
the classifier – it is their evaluation against the truth that we are interested in.
Let’s call ea = {ea (µ), µ = 1, . . . , P }, the sample set A, and similarly for eb .
‘How much evidence is there supporting that the two classifiers are performing dif-
ferently?’
Mathematically, our major assumption here is that this is the same question as :
‘How much evidence is there in favour of two sample sets being from different
multinomial distributions?’
The main question that we address here is to test whether or not two classifiers
are essentially performing the same. To do this, we have two hypotheses :
1. Hindep : The sample sets are from different distributions.
2. Hsame : The sample sets are from the same distribution.
We need then to formally mathematically state what these two hypotheses mean.
In both cases, however, we will make the independence of trials assumption
p(e_a, e_b) = \prod_\mu p(e_a(\mu), e_b(\mu)).
Since each classifier can make one of Q types of errors, we need to specify what
the probability of making such an error could be. For classifier A, we write
\alpha = (\alpha_1, \ldots, \alpha_Q), \qquad \sum_q \alpha_q = 1
and similarly for β. (These are the values of the probability tables for generating
errors).
so that the likelihood of classifier A's errors is p(e_a|\alpha) = \prod_\mu \alpha_{e_a(\mu)} = \prod_q \alpha_q^{c_q^a}, where c_q^a is the number of times that classifier A makes error q. A similar expression holds for classifier B.
Dirichlet Prior
We use a Dirichlet prior on \alpha, p(\alpha) = \frac{1}{Z(u)} \prod_{q=1}^Q \alpha_q^{u_q - 1}, where
Z(u) = \frac{\prod_{q=1}^Q \Gamma(u_q)}{\Gamma\left(\sum_{q=1}^Q u_q\right)}
The prior parameter u controls how strongly the mass of the distribution is pushed
to the corners of the simplex. Setting uq = 1 for all q corresponds to a uniform
prior. The uniform prior assumption is reasonable, although there may be situations where it would be preferable to use non-uniform priors.
Posterior
With a Dirichlet prior and a multinomial likelihood term, the posterior is another
Dirichlet distribution (dropping the a index, since this result is general),
p(\alpha|e) = \frac{1}{Z(u+c)} \prod_{q=1}^Q \alpha_q^{c_q + u_q - 1}     (14.0.8)
Figure 14.3: (a) Hindep : Corresponds to the errors for the two classifiers being
independently generated. (b) Hsame : both errors are generated from the same
distribution. (c) Hdep : the errors are dependent (‘correlated’). (d) Hrelated : In
this case the distributions α and β which generate ea and eb are related in some
way – for example they may be constrained to be similar through the variable r.
This case is not considered in the text.
where p(Hindep ) is our prior belief that Hindep is the correct hypothesis. Note that
the normalising constant p(e_a, e_b) does not depend on the hypothesis. Then
p(e_a, e_b|H_{indep}) = \int p(e_a|\alpha)\, p(\alpha)\, d\alpha \int p(e_b|\beta)\, p(\beta)\, d\beta
Hence
p(e_a, e_b)\, p(H_{indep}|e_a, e_b) = p(H_{indep})\, \frac{Z(u+c_a)}{Z(u)}\, \frac{Z(u+c_b)}{Z(u)}
In H_{same}, the hypothesis is that the errors for the two classifiers are generated from the same multinomial distribution. Hence
p(e_a, e_b)\, p(H_{same}|e_a, e_b) = p(H_{same})\, \frac{Z(u+c_a+c_b)}{Z(u)}
Bayes Factor
Assuming we have no prior preference for either hypothesis, p(H_{indep}) = p(H_{same}), the Bayes factor is
\frac{p(H_{indep}|e_a, e_b)}{p(H_{same}|e_a, e_b)} = \frac{Z(u+c_a)\, Z(u+c_b)}{Z(u)\, Z(u+c_a+c_b)}
This is the evidence to suggest that the data were generated by two different multinomial distributions. In other words, this is the evidence in favour of the two classifiers being different.
Examples
In the experiments that I demonstrate here and elsewhere, I’ll assume that there
are three kinds of ‘errors’, Q = 3.
• We have the two error counts ca = [39, 26, 35] and cb = [63, 12, 25]
Then, the above Bayes factor is 20.7 – strong evidence in favour of the two
classifiers being different. (This is consistent with the model I used to gen-
erate the data – they were indeed from different multinomial distributions).
• Alternatively, if we have the two error counts ca = [52, 20, 28] and cb =
[44, 14, 42]
Then, the above Bayes factor is 0.38 – weak evidence against the two clas-
sifiers being different. (This is consistent with the model I used to generate
the data – they were indeed from the same multinomial distributions)
• As a final example, consider counts ca = [459, 191, 350] and cb = [465, 206, 329].
This gives a Bayes factor of 0.008 – strong evidence that the two classifiers
are statistically the same (Indeed, the errors were in this case generated by
the same multinomial).
These results show that the Bayesian analysis performs in a way that is consistent
with the intuition that the more test data we have, the more confident we are in
our statements about which is the better model.
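Using the Z(u) function above, the Bayes factor is straightforward to compute numerically. The following minimal MATLAB sketch (an illustration, assuming equal prior probabilities for the two hypotheses and the uniform Dirichlet prior u = 1) uses gammaln for numerical stability and reproduces the value quoted for the first pair of counts above.

ca = [39 26 35]; cb = [63 12 25];                 % error counts for classifiers A and B
u  = [1 1 1];                                     % uniform Dirichlet prior
logZ = @(v) sum(gammaln(v)) - gammaln(sum(v));    % log Z(v) = log[ prod_q Gamma(v_q) / Gamma(sum_q v_q) ]
logBF = logZ(u+ca) + logZ(u+cb) - logZ(u) - logZ(u+ca+cb);
BF = exp(logBF)                                   % approximately 20.7, as quoted above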
Here we consider the (perhaps more common) case that errors are dependent. For
example, it is often the case that if classifier A works well, then classifier B will
also work well. Similarly, if one classifier performs poorly, then often the other
will too. Here, we assume that dependencies exist, but we make no preferences for
one to another (of course, such preferences would be straightforward to include if
desired). (There may be some interest in situations where if classifier A performs
poorly, then classifier B is likely to perform well ). Thus we want to consider the
Hypothesis
Hdep : the errors that the two classifiers make are dependent.
We write e = (e_a, e_b) for the joint errors, and define a Q \times Q probability matrix P with elements
[P]_{ij} = p(e_a = i, e_b = j)
namely that the ij element of P is the probability that A makes error i, and B makes error j.
Then, as before,
\frac{1}{p(H_{dep})}\, p(e)\, p(H_{dep}|e) = p(e|H_{dep}) = \int p(e, P|H_{dep})\, dP = \int p(e|P, H_{dep})\, p(P|H_{dep})\, dP
Hence
p(e)\, p(H_{dep}|e) = p(H_{dep})\, \frac{Z(\mathrm{vec}(U + C))}{Z(\mathrm{vec}(U))}
where vec(D) simply forms a vector by concatenating the rows of the matrix D.
Here C is the count matrix, with [C]ij equal to the number of times that joint
error (ea = i, eb = j) occurred in the P datapoints. As before, we can then use
this in a Bayes factor calculation. For the uniform prior, [U ]ij = 1, ∀i, j.
Imagine that we wish to test whether or not the errors of the classifiers are depen-
dent Hdep , against the hypothesis that they are independent Hindep .
Examples
\frac{p(H_{indep}|e_a, e_b)}{p(H_{dep}|e_a, e_b)} = 3020
\frac{p(H_{indep}|e_a, e_b)}{p(H_{dep}|e_a, e_b)} = 2 \times 10^{-18}
Perhaps the most useful test that can be done practically is between the Hdep versus
Hsame . This is because, in practice, it is reasonable to believe that dependencies
are quite likely in the errors that classifiers make (both classifiers will do well on
‘easy’ test examples, and badly on ‘difficult’ examples). In this sense, it is natural
to believe that dependencies will most likely exist in practice. The relevant question
is : are these dependencies strong enough to make us believe in fact that the errors
are coming from the same process? In this sense, we want to test
\frac{p(H_{same}|e_a, e_b)}{p(H_{dep}|e_a, e_b)} = 4.5 \times 10^{-38}
\frac{p(H_{same}|e_a, e_b)}{p(H_{dep}|e_a, e_b)} = 42
– strong evidence that the classifiers are performing the same (this is consis-
tent with the way I generated this data set).
14.1 Problems
Exercise 39 blah
14.2 Solutions
39
15 Bayesian Regression
Bayesian Regression
Regression refers to inferring an unknown input-output mapping on the basis of
observed data D = {(xµ , tµ ), µ = 1, . . . P }, where (xµ , tµ ) represents an input-
output pair. For example, fit a function to the crosses in fig(15.1a). Since there
is the possibility that each observed output tµ has been corrupted by noise, we
would like to recover the underlying clean input-output function. We assume that
each (clean) output is generated from the model f (x; w) where the parameters w
of the function f are unknown and that the observed outputs tµ are generated by
the addition of noise η to the clean model output,
t = f (x; w) + η (15.0.1)
If the noise is Gaussian distributed, η ∼ N (0, σ 2 ), the model M generates an
output t for input x with probability
p(t|w, x, M) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (t - f(x; w))^2 \right)     (15.0.2)
If we assume that each data input-output pair is generated identically and inde-
pendently from the others, the data likelihood is
p(D|w, M) = \prod_{\mu=1}^P p(t^\mu|w, x^\mu, M)     (15.0.3)
where β = 1/σ 2 . Note the similarity between equation (15.0.4) and the sum square
regularised training error used in standard approaches to fitting functions to data,
for example using neural networks [33]. In the Bayesian framework, we can mo-
tivate the choice of a sum square error measure as equivalent to the assumption
of additive Gaussian noise. Typically, we wish to encourage smoother functions
so that the phenomenon of overfitting is avoided. One approach to solving this
problem is to use a regulariser penalty term to the training error. In the Bayesian
framework, we use a prior to achieve a similar effect. In principle, however, the
Bayesian should make use of the full posterior distribution, and not just a single
weight value. In standard neural network training, it is good practice to use com-
mittees of networks, rather than relying on the prediction of a single network[33].
In the Bayesian framework, the posterior automatically specifies a committee (in-
deed, a distribution) of networks, and the importance attached to each committee
member's prediction is simply the posterior probability of that network weight.
Figure 15.1: Along the horizontal axis we plot the input x and along the vertical
axis the output t. (a) The raw input-output training data. (b) Prediction using
regularised training and fixed hyperparameters. (c) Prediction with error bars,
using ML-II optimised hyperparameters.
where Γ represents the hyperparameter set {α, β, λ}. (We drop the fixed model
dependency wherever convenient). The weight posterior is therefore a Gaussian,
p(w|Γ, D) = N (w̄, S) where
S = \left( \alpha I + \beta \sum_{\mu=1}^P \Phi(x^\mu)\, \Phi^T(x^\mu) \right)^{-1}, \qquad \bar{w} = \beta S \sum_{\mu=1}^P t^\mu\, \Phi(x^\mu)     (15.0.8)
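As an illustration, the following MATLAB sketch evaluates (15.0.8) for a one dimensional regression problem; the radial basis functions, their centres, and the hyperparameter values \alpha, \beta, \lambda are assumptions made purely for the example.

P = 20; k = 10;                                   % number of datapoints and basis functions
x = linspace(-3,3,P); t = sin(x) + 0.1*randn(1,P);    % synthetic noisy targets
cen = linspace(-3,3,k); lambda = 1;               % assumed basis centres and width
Phi = exp(-lambda*(cen' - x).^2);                 % column mu holds Phi(x^mu), a k x P matrix
alpha = 1; beta = 100;                            % assumed weight and noise precisions
S    = inv(alpha*eye(k) + beta*(Phi*Phi'));       % posterior covariance (15.0.8)
wbar = beta * S * Phi * t';                       % posterior mean weights (15.0.8)
xtest = linspace(-3,3,100);
fbar  = wbar' * exp(-lambda*(cen' - xtest).^2);   % mean predictor at the test inputs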
How would the mean predictor be calculated if we were to include the hyperpara-
meters Γ as part of a hierarchical model? Formally, this becomes
\bar{f}(x) = \int\!\!\int f(x; w)\, p(w, \Gamma|D)\, dw\, d\Gamma = \int \left\{ \int f(x; w)\, p(w|\Gamma, D)\, dw \right\} p(\Gamma|D)\, d\Gamma     (15.0.9)
The term in curly brackets is the mean predictor for fixed hyperparameters. We
therefore weight each mean predictor by the posterior probability of the hyperpa-
rameter p(Γ|D). Equation (15.0.9) shows how to combine different models in an
ensemble – each model prediction is weighted by the posterior probability of the
model. There are other non-Bayesian approaches to model combination in which
the determination of the combination coefficients is motivated heuristically.
Provided the hyperparameters are well determined by the data, we may instead
approximate the above hyperparameter integral by finding the MAP hyperpara-
meters Γ∗ = arg maxΓ p(Γ|D). Since p(Γ|D) = p(D|Γ)p(Γ)/p(D), if the prior belief
about the hyperparameters is weak (p(Γ) ≈ const.), we can estimate the optimal
hyperparameters by optimising the hyperparameter likelihood
p(D|\Gamma) = \int p(D|\Gamma, w)\, p(w|\Gamma)\, dw     (15.0.10)
This approach to setting hyperparameters is called ‘ML-II’ [33, 27] and assumes
that we can calculate the integral in equation (15.0.10). In the case of GLMs, this
involves only Gaussian integration, giving
2 \log p(D|\Gamma) = -\beta \sum_{\mu=1}^P (t^\mu)^2 + d^T S^{-1} d - \log|S| + k \log\alpha + P \log\beta + \text{const.}     (15.0.11)
where d = \beta \sum_\mu \Phi(x^\mu)\, t^\mu. Using the hyperparameters \alpha, \beta, \lambda that optimise the
above expression gives the results in fig(15.1c) where we plot both the mean pre-
dictions and standard predictive error bars. This solution is more acceptable than
the previous one in which the hyperparameters were not optimised, and demon-
strates that overfitting is avoided automatically. A non-Bayesian approach to
model fitting based on minimising a regularised training error would typically use
a procedure such as cross validation to determine the regularisation parameters
(hyperparameters). Such approaches require the use of validation data[33]. An
advantage of the Bayesian approach is that hyperparameters can be set without
the need for validation data, and thus all the data can be used directly for training.
(See also the section on logistic regression). We can write the solution to w in the
form
w = \sum_\mu \alpha_\mu \phi(x^\mu)
This means we can just use the kernels K throughout, and use the \alpha_\mu as the parameters. This is analogous to the treatment for classification.
in mind though is that the predictions usually will decay to zero away from the
data (this depends on the choice of the kernel, but is usually the case). This
means that we will predict very confidently that the regression should be zero, far
from the training data1 . This is not really what we want – we want to be highly
uncertain away from the training data. This isn’t a problem if we use finite basis
functions φ which are non-local, for example they grow to infinity at infinity. To
be continued...relationships to Gaussian Processes.
The use of GLMs can be difficult in cases where the input dimension is high since
the number of basis functions required to cover the input space fairly well grows
exponentially with the input dimension – the so called ‘curse of dimensionality’[33].
If we specify n points of interest x^i, i \in 1, \ldots, n in the input space, the GLM specifies an n-dimensional Gaussian distribution on the function values f_1, \ldots, f_n with mean \bar{f}_i = \bar{w}^T \Phi(x^i) and covariance matrix with elements c_{ij} = c(x^i, x^j) = \Phi(x^i)^T \Sigma\, \Phi(x^j). The idea behind a GP is that we can free ourselves from the restriction to choosing a covariance function c(x^i, x^j) of the form provided by the GLM prior – any valid covariance function can be used instead. Similarly, we are free to choose the mean function \bar{f}_i = m(x^i). A common choice for the covariance function is c(x^i, x^j) = \exp(-|x^i - x^j|^2). The motivation is that the
function space distribution will have the property that for inputs xi and xj which
are close together, the outputs f (xi ) and f (xj ) will be highly correlated, ensuring
smoothness. This is one way of obviating the curse of dimensionality since the
matrix dimensions depend on the number of training points, and not on the number
of basis functions used. However, for problems with a large number of training
points, computational difficulties can arise, and approximations again need to be
considered.
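To make the connection concrete, here is a minimal MATLAB sketch of GP regression with the squared exponential covariance. The predictive equations used are the standard Gaussian conditioning results (they are not stated explicitly above), and the noise variance s2 and the synthetic data are assumptions.

n = 15; s2 = 0.01;                                % number of training points and noise variance
x = linspace(-3,3,n); t = sin(x)' + sqrt(s2)*randn(n,1);   % synthetic training inputs / noisy targets
K  = exp(-(x' - x).^2);                           % covariance between training inputs
xs = linspace(-4,4,200);
Ks = exp(-(x' - xs).^2);                          % covariance between training and test inputs
A  = K + s2*eye(n);
fmean = Ks' * (A \ t);                            % predictive mean at the test inputs
fvar  = 1 - sum(Ks .* (A \ Ks), 1)';              % predictive variance: c(x*,x*) - k*' A^{-1} k*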
15.1 Problems
Exercise 40 The question relates to Bayesian regression.
• Show that for
f = wT x
and p(w) ∼ N (0, Σ), that p(f |x) is Gaussian distributed. Furthermore, find
the mean and covariance of this Gaussian.
• Consider a target point t which is related to the function f by additive noise
σ 2 . What is p(f |t, x)? Hint : use p(f |t, x) ∝ p(t|f, x)p(f |x).
15.2 Solutions
40
1 For classification, this isn’t a problem since the argument of the sigmoid function goes to zero,
which means that there is complete uncertainty in the class prediction.
16 Logistic Regression
16.1 Introduction
We’ve talked about using Generative Models to do classification. Now we look at
a discriminative approach.
Linear (Hyperplane) Decision Boundary The hyperplane b + x^T w = 0 forms the decision boundary (where p(c = 1|x) = 0.5) – on the one side, examples are classified as 1's, and on the other, 0's. The "bias" parameter b simply shifts the decision boundary by a constant amount. The orientation of the decision boundary is determined by w – indeed, w represents the normal to the decision boundary.
Figure 16.1: Training data for the two classes, each point labelled by its class (0 or 1).
[Figure: the logistic sigmoid function \sigma(x) plotted for x from -10 to 10.]
Figure 16.3: The decision boundary p(c = 1|x) = 0.5 (solid line). For two dimen-
sional data, the decision boundary is a line. If all the training data for class 1 lie
on one side of the line, and for class 0 on the other, the data is said to be linearly
separable.
Figure 16.4: The logistic sigmoid function \sigma(h) = 1/(1 + e^{-h}) plotted as a function of the two-dimensional input x, with h = w^T x + b, for w(1)=7, w(2)=-3.5, b=0.
16.1.1 Training
Given a data set D, how can we adjust/“learn” the weights to obtain a good
classification? Probabilistically, if we assume that each data point has been drawn
independently from the same distribution that generates the data (the standard i.i.d. assumption).
Figure 16.5: The logistic sigmoid function \sigma(w^T x + b) plotted as a function of the two-dimensional input x.
Figure 16.6: The decision boundary p(c = 1|x) = 0.5 (solid line) and confidence boundaries p(c = 1|x) = 0.9 and p(c = 1|x) = 0.1, after 1000 iterations of training.
We wish to maximise the likelihood of the observed data. To do this, we can make
use of gradient information of the likelihood, and then ascend the likelihood.
1 Note that this is not quite the same strategy that we used in density estimation. There we
made, for each class, a model of how x is distributed. That is, given the class c, make a model
of x, p(x|c). We saw that, using Bayes rule, we can use p(x|c) to make class predictions p(c|x).
Here, however, we assume that, given x, we wish to make a model of the class probability,
p(c|x) directly. This does not require us to use Bayes rule to make a class prediction. Which
approach is best depends on the problem, but my personal feeling is that density estimation
p(x|c) is worth considering first.
w^{new} = w + \eta \sum_{\mu=1}^P (c^\mu - \sigma(x^\mu; w))\, x^\mu
b^{new} = b + \eta \sum_{\mu=1}^P (c^\mu - \sigma(x^\mu; w))     (16.1.13)
This is called a “batch” update since the parameters w and b are updated only
after passing through the whole batch of training data – see the MATLAB code below, which implements the batch version (written for readability rather than speed). We use a stopping criterion so that if the gradient of the
objective function (the log likelihood) becomes quite small, we are close to the
optimum (where the gradient will be zero), and we stop updating the weights.
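The original listing does not survive in this extract; the following is a minimal sketch of such a batch gradient-ascent routine (the synthetic data, learning rate and stopping threshold are assumptions made for illustration).

P = 100; eta = 0.1;                               % number of points and learning rate
x = rand(2,P); c = double(x(1,:) > x(2,:));       % synthetic, linearly separable labels
w = zeros(2,1); b = 0;
for loop = 1:10000
  sg = 1 ./ (1 + exp(-(w'*x + b)));               % sigma(w'x + b) for every datapoint
  gw = x * (c - sg)';                             % gradient of the log likelihood wrt w
  gb = sum(c - sg);                               % gradient wrt b, as in (16.1.13)
  w  = w + eta*gw;  b = b + eta*gb;               % batch update
  if norm([gw; gb]) < 1e-3, break; end            % stop when the gradient is small
end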
Online version An alternative that is often preferred to batch updating is to update the parameters after each training example has been considered:
w^{new} = w + \frac{\eta}{P} (c^\mu - \sigma(x^\mu; w))\, x^\mu     (16.1.14)
b^{new} = b + \frac{\eta}{P} (c^\mu - \sigma(x^\mu; w))     (16.1.15)
These rules introduce a natural source of stochastic (random) type behaviour in
the updates, and can be useful in avoiding local minima. However, as we shall see
below, the error surface for logistic regression is bowl shaped, and hence there are
no local minima. However, it is useful to bear in mind the online procedure for
other optimisation problems with local minima.
One important point about the training is that, provided the data is linearly
separable, the weights will continue to increase, and the classifications will become
extreme. This may be an undesirable situation in case some of the training data
has been mislabelled, or a test point needs to be classified – it is rare that we
could be absolutely sure that a test point belongs to a particular class. For non-
linearly separable data, the predictions will be less certain, as reflected in a broad
confidence interval – see fig(16.7).
The error surface is bowl-shaped The Hessian of the log likelihood is
H_{ij} \equiv \frac{\partial^2 L}{\partial w_i \partial w_j} = -\sum_\mu x_i^\mu x_j^\mu\, \sigma^\mu (1 - \sigma^\mu)     (16.1.16)
Figure 16.7: The decision boundary p(c = 1|x) = 0.5 (solid line) and confidence boundaries p(c = 1|x) = 0.9 and p(c = 1|x) = 0.1 for non-linearly separable data, after 10000 iterations of training. Note how the confidence interval remains broad.
This means that the error surface has a bowl shape, and gradient ascent is guar-
anteed to find the best solution, provided that the learning rate η is small enough.
Perceptron Convergence Theorem One can show that, provided that the data is linearly separable, the above procedure used in an online fashion for the perceptron (replacing \sigma(x) with \theta(x)) converges in a finite number of steps. The details of this proof are not important for this course, but the interested reader may consult Neural Networks for Pattern Recognition, by Chris Bishop. Note that the online version will not converge if the data is not linearly separable. The batch version will converge (provided that the learning rate \eta is small) since the error surface is bowl shaped.
We saw that in the case that data is linearly separable, the weights will tend to
increase indefinitely (unless we use some stopping criterion). One way to avoid this
is to penalise weights that get too large. This can be done by adding a penalty
term to the objective function L(θ) where θ is a vector of all the parameters,
θ = (w, b),
L'(\theta) = L(\theta) - \alpha\, \theta^T \theta.     (16.1.18)
The scalar constant α > 0 encourages smaller values of θ (remember that we wish
to maximise the log likelihood). How do we choose an appropriate value for α?
We shall return to this issue in a later chapter on generalisation.
In previous chapters, we have looked at first using PCA to reduce the dimension
of the data, so that a high dimensional datapoint x is represented by a lower
dimensional vector y.
If e1 , . . . , em are the eigenvectors with largest eigenvalues of the covariance matrix
of the high-dimensional data, then the PCA representation is
yi = (ei )T (x − c) = (ei )T x + ai (16.1.19)
where c is the mean of the data, and ai is a constant for each datapoint. Using
vector notation, we can write
y = E^T x + a     (16.1.20)
where E is the matrix whose ith column is the eigenvector e^i. If we were to use logistic regression on the y, the argument of the sigmoid \sigma(h) would be
h = w^T y + b = w^T (E^T x + a) + b     (16.1.21)
  = (Ew)^T x + b + w^T a = \tilde{w}^T x + \tilde{b}     (16.1.22)
Hence, there is nothing to be gained by first using PCA to reduce the dimension of
the data. Mathematically, PCA is a linear projection of the data. The argument
of the logistic function is also a linear function of the data, and a linear function
combined with another is simply another linear function.
However, there is a subtle point here. If we use PCA first, then use logistic regres-
sion afterwards, although overall, this is still representable as a logistic regression
problem, the problem is constrained since we have forced the logistic regression to work in the subspace spanned by the PCA vectors. Consider 100 training vectors randomly positioned in a 1000 dimensional space, each with a random class 0 or 1.
With very high probability, these 100 vectors will be linearly separable. Now
project these vectors onto a 10 dimensional space: with very high probability, 100
vectors plotted in a 10 dimensional space will not be linearly separable. Hence, ar-
guably, we should not use PCA first since we could potentially transform a linearly
separable problem into a non-linearly separable problem.
The XOR problem
Consider the following four training points and class labels
{([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)}.
This data represents a basic logic function, the XOR function, and is plotted in
fig(16.8). This function is clearly not representable by a linear decision boundary,
an observation much used in the 1960’s to discredit work using perceptrons. To
overcome this, we clearly need to look at methods with more complex, non-linear
decision boundaries – indeed, we encountered a quadratic decision boundary in a
previous chapter. Historically, another approach was used to increase the complex-
ity of the decision boundary, and this helped spawn the area of neural networks,
to which we will return in a later chapter.
number in the test data, the training data is found to be linearly separable. This
may surprise you, but consider that there are 784 dimensions, and only 600 training
points. The stopping criterion used was the same as in the example MATLAB code
in this chapter. Using the linear decision boundary, the number of errors made on
the 600 test points is 12.
(adding a bias is a trivial extra modification). Note that ψ(x) does not have
to be of the same dimension as w. For example, the one-dimensional input x
could get mapped to a two dimensional vector (x2 , sin(x)). (Lower dimensional
mappings are also possible, but less popular since this can make it more difficult
to find a simple classifier). The usual motivation for this is that mapping into
a high dimensional space makes it easier to find a separating hyperplane in the
high dimensional space (remember that any set of points that are independent
can be linearly separated provided we have as many dimensions as datapoints –
this motivates the use of non-linear mappings since the related high-dimensional
datapoints will then usually be independent).
If we wish to use the ML criterion, we can use exactly the same algorithm as in
the standard case, except wherever there was a x before, this gets replaced with
ψ(x).
where γ is some scalar function. The point is that, by iterating the above equation,
any solution will therefore be of the form of a linear combination of the points ψ µ ,
where for simplicity, we write ψ µ ≡ ψ(xµ ). Hence, we may assume a solution
w = \sum_\mu \alpha_\mu \psi^\mu
and try to find a solution in terms of the vector of parameters αµ . This is poten-
tially advantageous since there may be less training points than dimensions of ψ.
The classifier depends only on scalar products
w^T \psi(x) = \sum_\mu \alpha_\mu\, \psi(x^\mu)^T \psi(x)
Hence, the only role that \psi plays is in the form of a scalar product:
K(x, x') \equiv \psi(x)^T \psi(x')
Since the right hand side is a scalar product, it defines a positive definite (kernel) function (see section (E)). These are symmetric functions for which, roughly speaking, the corresponding matrix defined on a set of points x^i, i = 1, \ldots, N is positive definite.
Indeed, Mercer’s Theorem states that a function defines a positive definite kernel
function if and only if it has such an inner product representation. What this
means is that we are then free to define a function which is positive definite kernel
function, since this is the only thing that the classifier depends on. (This is well
established in classical statistics, and forms the basis of Gaussian Processes – see
later chapter). Hence, we can define
p(c = 1|x) = \sigma\left( \sum_\mu \alpha_\mu K(x, x^\mu) \right)
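A minimal MATLAB sketch of training this kernelised classifier by gradient ascent on the log likelihood is given below, assuming a squared exponential kernel and synthetic data; the learning rate and number of iterations are arbitrary assumptions.

P = 40;
x = [randn(2,P/2)-1, randn(2,P/2)+1];             % synthetic 2D inputs, two clusters
c = [zeros(1,P/2), ones(1,P/2)];                  % class labels
D = sum(x.^2,1)' + sum(x.^2,1) - 2*(x'*x);        % squared distances between training points
K = exp(-D);                                      % kernel matrix K(x^nu, x^mu)
alpha = zeros(P,1); eta = 0.05;
for loop = 1:2000
  sg    = 1 ./ (1 + exp(-K*alpha));               % sigma( sum_mu alpha_mu K(x^nu, x^mu) )
  alpha = alpha + eta * K' * (c' - sg);           % dL/dalpha_mu = sum_nu (c^nu - sg^nu) K(x^nu, x^mu)
end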
The realisation that the higher the dimension of the space is, the easier it is
to find a hyperplane that linearly separates the data, forms the basis for the
Support Vector Machine method. The main idea (contrary to PCA) is to map each
vector in a much higher dimensional space, where the data can then be linearly
separated. Training points which do not affect the decision boundary can then
be discarded. We will not go into the details of how to do this in this course,
but the interested reader can consult http://www.support-vector.net. Related
methods currently produce the best performance for classifying handwritten digits
– better than average human performance. Essentially, however, the distinguishing
feature of the SVM approach is not in the idea of a high-dimensional projection, but
rather in the manner of finding the hyperplane. The idea is to find the hyperplane
such that the distance between the hyperplane and (only) those points which
determine the placement of the plane, should be maximal. This is a quadratic
programming problem. In this case, usually only a small set of the training data
affects the decision boundary. However, this method is not probabilistic, and no
satisfactory manner of formulating the SVM directly as a probabilistic model has
been achieved (although there have been numerous approaches, all of which contain
a fudge somewhere). More later......
A similar method which retains the benefits of a probabilistic analysis is the Rel-
evance Vector Machine.
In the case that K is a Kernel, we can interpret this as essentially fitting a hy-
perplane through a set of points, where the points are the data training points
projected into a (usually) higher dimensional space. However, if one does not
require this condition, we can define a more general classifier
p(c = 1|x) = \sigma\left( \sum_i \alpha_i K_i(x) \right)
where the Ki (x), i = 1, . . . , F are a fixed set of functions mapping the vector x
to a scalar. For example, if we set Ki (x) = tanh(xT wi ), and treat also wi as
a parameter, the solution will not be representable in a Kernel way. In these
more general settings, training is more complex, since the error surface cannot
be guaranteed to be convex, and simple gradient ascent methods (indeed, any
optimisation method) will potentially get trapped in a local optimum. In this
sense, Kernels are useful since they mean we can avoid training difficulties.
As an aside, consider if we set Ki (x) = tanh(xT wi ), for fixed wi , and treat only
the αi as adjustable parameters, then the solution is representable as a Kernel
(since the argument of the sigmoid is representable as a scalar product between a
parameter vector and a fixed vector function of x).
16.3.1 Mixtures
How can we increase the power of the above methods? One way to do this is to
write
p(c = 1|x) = \sum_{h=1}^H p(c = 1, h|x) = \sum_{h=1}^H p(c = 1|h, x)\, p(h|x)
Usually, the hidden variable h is taken to be discrete. Here, then p(c = 1|h, x) is
one of a set of H classifiers.
In a standard mixture model, we assume independence, p(h|x) = p(h).
In a mixture of experts (cite Jordan) model, we assume that p(h|x) has some
parametric form, for example using a softmax function
p(h|x) = \frac{e^{(w^h)^T x}}{\sum_{h'} e^{(w^{h'})^T x}}
In both cases, the natural way to train them is to use the variational EM approach,
since the h is a hidden variable.
Note that these methods are completely general, and not specific to logistic regres-
sion. To do... example of mixture of experts applied to handwritten digits.
(In the following, I’ll set b to zero, just for notational clarity).
Note: a potential confusion here with notation. I’m now using w where previously
I used α. Now α refers to the precision.
How can we prevent the classifications becoming too severe? If the Kernel val-
ues themselves are bounded (for the squared exponential kernel, this is clearly
the case), then putting a soft constraint on the size of the components wµ will
discourage overly confident classifications.
A convenient prior p(w) is a Gaussian constraint:
p(w|\alpha) = \frac{\alpha^{P/2}}{(2\pi)^{P/2}}\, e^{-\alpha w^T w / 2}
where α is the inverse variance (also called the precision) of the Gaussian distrib-
ution. (Remember that here the dimension of w is equal to the number of training
points).
More formally, we could put another distribution on p(α), say a Gamma distribu-
tion (see section (C)), as part of an hierarchical prior.
p(w) = \int_\alpha p(w|\alpha)\, p(\alpha)\, d\alpha \propto \int_\alpha e^{-\frac{\alpha}{2} w^T w}\, \alpha^{\gamma-1}\, e^{-\alpha/\beta}\, d\alpha
It's clear that the integrand is, as a function of \alpha, of Gamma form (the Gaussian and Gamma distributions are conjugate), so the integral can be performed analytically. Indeed (exercise) the reader can easily show that the resulting distribution of w is a t-distribution.
Here we’ll generally keep life simple, and assume the above ‘flat’ prior on α. We
have therefore a GM of the form
p(w, \alpha|D) = \frac{1}{Z}\, p(w|\alpha)\, p(\alpha) \prod_{\mu=1}^P p(c^\mu|x^\mu, w)
What about the setting of p(α)? If we assume a flat prior on α, this will effectively
mean that we favour smaller values of the variance, and hence small values of
the weights. In this case, finding the α that maximises p(α|D) is called ML-II
estimation (we don’t use ML at the first level to determine w, but rather use ML
at the second, hyperparameter level).
Alternatively, we notice that we can integrate out analytically (usually) over the
one-dimensional α to obtain
p(w|D) = \frac{1}{Z}\, p(w) \prod_{\mu=1}^P p(c^\mu|x^\mu, w)
where p(w) = \int p(w|\alpha)\, p(\alpha)\, d\alpha and
Z = \int_w p(w) \prod_{\mu=1}^P p(c^\mu|x^\mu, w)
The main difficulty in both approaches above is that we cannot analytically integrate over w, since the distribution is not of a standard form, and we are forced to make an approximation. The best way to make an approximation has been a topic of some intense
debate. Should we integrate out the hyperparameter distribution first, and then at-
tempt to approximate the posterior distribution p(w|D), or approximate the joint
distribution p(w, α|D)? Since we have a good idea that p(α|D) will be sharply
peaked, and p(w|α, D) is unimodal, the argument goes that it makes sense to
make the simple unimodal Laplace approximation on the simple p(w|α, D), rather
than the more complex p(w|D).
p(\alpha|D) \propto p(\alpha) \int_w p(w|\alpha) \prod_\mu \sigma\left( (2c^\mu - 1)(w^T k^\mu) \right)
where [k^\mu]_i \equiv K(x^\mu, x^i). A simple Laplace approximation (section (F)) gives
\log p(\alpha|D) \approx \log p(\alpha) - E(w^*) - \frac{1}{2} \log\det(2\pi H) + \frac{P}{2} \log\alpha + \text{const.}
and
E(w) = \frac{\alpha}{2} w^T w - \sum_{\mu=1}^P \log \sigma\left( w^T h^\mu \right)
where h^\mu = (2c^\mu - 1)k^\mu. The Laplace approximation states that we need to find the minimum of E(w). Differentiating, we get
\nabla E = \alpha w - \sum_\mu (1 - \sigma^\mu)\, h^\mu
where \sigma^\mu \equiv \sigma(w^T h^\mu). We could then use a simple gradient descent algorithm. However, since the surface is convex and the Hessian is simple to calculate,
H(w) = \alpha I + \underbrace{\sum_{\mu=1}^P \sigma^\mu (1 - \sigma^\mu)\, h^\mu (h^\mu)^T}_{J}
it is natural to use a Newton update for w instead.
To optimise L with respect to \alpha, we only need consider the terms with an explicit \alpha dependence,
L(\alpha) \approx -\frac{\alpha}{2} (w^*)^T w^* - \frac{1}{2} \log\det(\alpha I + J) + \frac{P}{2} \log\alpha + \text{const.}
Differentiating wrt \alpha, using \partial \log\det(M) = \mathrm{trace}\left( M^{-1} \partial M \right), and setting to zero, we can make a fixed point iteration
\alpha^{new} = \frac{P}{(w^*)^T w^* + \mathrm{trace}\left( (\alpha I + J)^{-1} \right)}     (16.3.2)
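Putting the pieces together, the following MATLAB sketch alternates Newton updates for w with the fixed point update (16.3.2) for \alpha. The kernel, the synthetic data and the iteration counts are assumptions for illustration only.

P = 40;
x = [randn(2,P/2)-1, randn(2,P/2)+1]; c = [zeros(1,P/2), ones(1,P/2)];   % synthetic data
D = sum(x.^2,1)' + sum(x.^2,1) - 2*(x'*x); K = exp(-D);    % assumed squared exponential kernel
Hm = K .* (2*c - 1);                              % column mu holds h^mu = (2c^mu - 1) k^mu
alpha = 1; w = zeros(P,1);
for outer = 1:20                                  % alternate between w and alpha updates
  for newton = 1:10                               % Newton's method to minimise E(w)
    sg = 1 ./ (1 + exp(-Hm'*w));                  % sigma(w' h^mu) for every mu
    gE = alpha*w - Hm*(1 - sg);                   % gradient of E(w)
    J  = Hm * diag(sg.*(1-sg)) * Hm';             % the matrix J appearing in the Hessian
    w  = w - (alpha*eye(P) + J) \ gE;             % Newton step
  end
  sg = 1 ./ (1 + exp(-Hm'*w));                    % recompute J at the (approximate) minimiser w*
  J  = Hm * diag(sg.*(1-sg)) * Hm';
  alpha = P / (w'*w + trace(inv(alpha*eye(P) + J)));   % fixed point update (16.3.2)
end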
The class prediction is then obtained from p(c = 1|x, D) = \int \sigma(h)\, p(h|x, D)\, dh, where p(h|x, D) is the distribution of the quantity h = x^T w. Under the Laplace approximation, w is Gaussian distributed, p(w|D) \approx N(\mu, \Sigma), where \mu = w^* and \Sigma = (H(w^*))^{-1}. This means that h (which is linearly related to w) is also Gaussian distributed.
Figure 16.10: An example using Logistic Regression with the squared exponential kernel, e^{-(x-x')^2}. The green points are training data from class 1, and the red points are training data from class 0. The contours represent the probability of being in class 1. The optimal value of \alpha found by the evidence procedure in this case is 0.45.
The reader may verify that the only alterations in the previous evidence procedure
are simply
[\nabla E]_i = \alpha_i w_i - \sum_\mu (1 - \sigma^\mu)\, h_i^\mu
H(w) = \mathrm{diag}(\alpha) + J
These are used in the Newton update formula as before. The implicit equation for the \alpha's is given by
\alpha_i = \frac{1}{w_i^2 + \Sigma_{ii}}
where Σ = (H(w))−1 . Running this procedure, one typically finds that many of
the α’s tend to infinity, and may be effectively pruned from the dataset. Those
remaining tend to be rather in the centres of mass of a bunch of datapoints of the
same class. Contrast this with the situation in SVMs, where the retained data-
points tend to be on the decision boundaries. In that sense, the RVM and SVM
have very different characteristics. The number of training points retained by the
RVM tends to be very small – smaller indeed than the number retained in the SVM framework. However, the RVM is a little more computationally expensive than SVMs, but otherwise retains the advantages inherited from a probabilistic framework[34]. Naturally enough, one can extend this idea of sparseness to many other probabilistic models; it is a special case of the automatic relevance determination (ARD) method introduced by MacKay and Neal[32]. Finding such sparse representations has obvious applications in compression. Speeding up the training of RVMs is a hot research issue.
Figure 16.11: An example using RVM classification with the squared exponential kernel, e^{-(x-x')^2}. The green points are training data from class 1, and the red points are training data from class 0. The contours represent the probability of being in class 1. On the left are plotted the training points. On the right we plot the training points weighted by their relevance value 1/\alpha^\mu. Nearly all the points have a value so small that they effectively vanish.
16.4 Problems
Exercise 41 Show that
K(x, x') = e^{-\lambda (x - x')^2}
is a positive definite kernel function. (Hint: one simple way to show this is to consider expanding the exponent, and then to consider the properties of the power series expansion of the exponential function.)
16.5 Solutions
41
17 Naive Bayes
where each of the variables R1, R2, R3, R4 can take the values either ‘like’ or
‘dislike’, and the ‘age’ variable can take the value either ‘young’ or ‘old’. Thus the
information about the age of the customer is so powerful that this determines the
individual product preferences without needing to know anything else. This kind
of assumption is indeed rather ‘naive’, but can lead to surprisingly good results.
In this chapter, we will take the conditioning variable to represent the class of the
datapoint x. Coupled then with a suitable choice for the conditional distribution
p(xi |c), we can then use Bayes rule to form a classifier. We can generalise the
situation of two variables to a conditional independence assumption for a set of
variables x1 , . . . , xN , conditional on another variable c:
p(x|c) = \prod_{i=1}^N p(x_i|c)     (17.2.1)
See fig(17.2) for the graphical model. In this chapter, we will consider two cases
of different conditional distributions, one appropriate for discrete data and the
other for continuous data. Furthermore, we will demonstrate how to learn any
free parameters of these models.
A vector x = (1, 0, 1, 1, 0)T would describe that a person likes shortbread, does
not like lager, drinks whiskey, eats porridge, and has not watched England play
football. Together with each vector xµ , there is a class label describing the na-
tionality of the person: Scottish, or English. We wish to classify a new vector
x = (1, 0, 1, 1, 0)T as either Scottish(S) or English(E). We can use Bayes rule to
calculate the probability that x is Scottish or English:
p(S|x) = \frac{p(x|S)\, p(S)}{p(x)}
p(E|x) = \frac{p(x|E)\, p(E)}{p(x)}
Since we must have p(S|x) + p(E|x) = 1, we could also write
p(S|x) = \frac{p(x|S)\, p(S)}{p(x|S)\, p(S) + p(x|E)\, p(E)}
It is straightforward to show that the “prior” class probability p(S) is simply given
by the fraction of people in the database that are Scottish, and similarly p(E) is
given as the fraction of people in the database that are English. What about
p(x|S)? This is where our density model for x comes in. In the previous chapter, we looked at using a Gaussian distribution. Here we will make a different, very
strong conditional independence assumption:
What this assumption means is that knowing whether or not someone is Scottish,
we don’t need to know anything else to calculate the probability of their likes and
dislikes.
Matlab code for Naive Bayes on a small dataset is given below, where each column of the data matrices contains the attribute vector, of the form of equation (17.3.1), of one person.
xE=[0 1 1 1 0 0; % english
0 0 1 1 1 0;
1 1 0 0 0 0;
1 1 0 0 0 1;
1 0 1 0 1 0];
xS=[1 1 1 1 1 1 1; % scottish
0 1 1 1 1 0 0;
0 0 1 0 0 1 1;
1 0 1 1 1 1 0;
1 1 0 0 1 0 0];
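The implementation itself does not survive in this extract; run directly after the definitions of xE and xS above, the following minimal sketch (not the original code) computes the maximum likelihood estimates and the class posterior for the query vector.

pS = size(xS,2)/(size(xS,2)+size(xE,2));          % p(Scottish) = 7/13
pE = 1 - pS;                                      % p(English)  = 6/13
thS = mean(xS,2); thE = mean(xE,2);               % ML estimates of p(x_i = 1|S) and p(x_i = 1|E)
x = [1 0 1 1 0]';                                 % the test vector x*
lS = prod(thS.^x .* (1-thS).^(1-x));              % p(x*|S) under the independence assumption
lE = prod(thE.^x .* (1-thE).^(1-x));              % p(x*|E)
pSx = lS*pS / (lS*pS + lE*pE)                     % p(S|x*), which evaluates to 0.8076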
Based on the training data in the code above, we have the following : p(x1 = 1|E) =
1/2,p(x2 = 1|E) = 1/2,p(x3 = 1|E) = 1/3,p(x4 = 1|E) = 1/2,p(x5 = 1|E) = 1/2,
p(x1 = 1|S) = 1,p(x2 = 1|S) = 4/7,p(x3 = 1|S) = 3/7,p(x4 = 1|S) = 5/7,p(x5 =
1|S) = 3/7 and the prior probabilities are p(S) = 7/13 and p(E) = 6/13.
For x∗ = (1, 0, 1, 1, 0)T , we get
p(S|x^*) = \frac{1 \times \frac{3}{7} \times \frac{3}{7} \times \frac{5}{7} \times \frac{4}{7} \times \frac{7}{13}}{1 \times \frac{3}{7} \times \frac{3}{7} \times \frac{5}{7} \times \frac{4}{7} \times \frac{7}{13} + \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{1}{2} \times \frac{1}{2} \times \frac{6}{13}}     (17.3.2)
which is 0.8076. Since this is greater than 0.5, we would classify this person as
being Scottish.
Consider trying to classify the vector x = (0, 1, 1, 1, 1)T . In the training data, all
Scottish people say they like shortbread. This means that p(x, S) = 0, and hence
that p(S|x) = 0. This demonstrates a difficulty with sparse data – very extreme
class probabilities can be made. One way to ameliorate this situation is to smooth
the probabilities in some way, for example by adding a certain small number M to
the frequency counts of each class. This ensures that there are no zero probabilities
in the model:
Continuous Data
Fitting continuous data is also straightforward using Naive Bayes. For example, if
we were to model each attribute's distribution as a Gaussian, p(x_i|c) = N(\mu_i, \sigma_i^2),
this would be exactly equivalent to using a conditional Gaussian density estimator
with a diagonal covariance matrix.
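A minimal sketch of this continuous-attribute version is given below (the data is synthetic and the attribute dimension an arbitrary choice): each attribute of each class is fitted with a Gaussian by maximum likelihood, and a test point is classified via Bayes' rule.

n  = 3;                                           % number of attributes
X1 = randn(n,20) + 1; X2 = randn(n,30) - 1;       % class 1 and class 2 training data (columns)
m1 = mean(X1,2); v1 = var(X1,1,2);                % ML mean and variance per attribute, class 1
m2 = mean(X2,2); v2 = var(X2,1,2);                % and for class 2
p1 = 20/50; p2 = 30/50;                           % ML class probabilities
xt = [0.5; 0.2; -0.1];                            % a test point
l1 = prod( exp(-(xt-m1).^2 ./ (2*v1)) ./ sqrt(2*pi*v1) );   % p(xt|class 1), factorised over attributes
l2 = prod( exp(-(xt-m2).^2 ./ (2*v2)) ./ sqrt(2*pi*v2) );
pclass1 = l1*p1 / (l1*p1 + l2*p2)                 % posterior probability of class 1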
Naive Bayes has been often applied to classify documents in classes. We will outline
here how this is done. Refer to a computational linguistics course for details of
how exactly to do this.
Bag of words Consider a set of documents about politics, and a set about sport. We search through all documents to find the, say, 100 most commonly occurring words. Each
document is then represented by a 100 dimensional vector representing the number
of times that each of the words occurs in that document – the so called ‘bag of
words’ representation (this is clearly a very crude assumption since it does not take
into account the order of the words). We then fit a Naive Bayes model by fitting
a distribution of the number of occurrences of each word for all the documents of,
first sport, and then politics.
The reason Naive Bayes may be able to classify documents reasonably well in this
way is that the conditional independence assumption is not so silly : if we know
people are talking about politics, this perhaps is almost sufficient information to
specify what kinds of other words they will be using – we don’t need to know
anything else. (Of course, if you want ultimately a more powerful text classifier,
you need to relax this assumption).
For each class of the two classes, we then need to estimate the values p(xi = 1|c) ≡
θic . (The other probability, p(xi = 0|c) is simply given from the normalisation
requirement, p(xi = 0|c) = 1 − p(xi = 1|c) = 1 − θic ). Using the standard assump-
tion that the data is generated identically and independently, the likelihood of the
model generating the dataset X c (the data X belonging to class c) is
p(X^c) = \prod_{\mu\, \text{from class}\, c} p(x^\mu|c)     (17.5.1)
Optimising with respect to \theta_i^c \equiv p(x_i = 1|c) (differentiate with respect to \theta_i^c and equate to zero) gives the ML setting
\theta_i^c = \frac{1}{P_c} \sum_{\mu\, \text{from class}\, c} I[x_i^\mu = 1]
where P_c is the number of training datapoints in class c.
If we just wish to find the most likely class for a new point x, we can compare the
log probabilities, classifying x∗ as class 1 if
log p(c = 1|x∗ ) > log p(c = 0|x∗ ) (17.5.6)
Using the definition of the classifier, this is equivalent to (since the normalisation
constant − log p(x∗ ) can be dropped from both sides)
\sum_i \log p(x_i^*|c = 1) + \log p(c = 1) > \sum_i \log p(x_i^*|c = 0) + \log p(c = 0)
This decision rule can be expressed in the form: classify x^* as class 1 if \sum_i w_i x_i^* + a > 0 for some suitable choice of weights w_i and constant a (the reader is invited
to find the explicit values of these weights). The interpretation of this is that w
specifies a hyperplane in the x space and x∗ is classified as a 1 if it lies on one side
of the hyperplane. We shall talk about other such “linear” classifiers in a later
chapter.
or
L = \sum_{\mu=1}^P \sum_{i=1}^N \sum_{s=1}^S \sum_{c=1}^C I[x_i^\mu = s]\, I[c^\mu = c]\, \log p(x_i = s|c)
The parameters are p(x_i = s|c). If we optimize this with respect to these parameters, using a Lagrange multiplier to ensure normalisation (one for each of the outputs i):
L = \sum_{\mu=1}^P \sum_{i=1}^N \sum_{s=1}^S \sum_{c=1}^C I[x_i^\mu = s]\, I[c^\mu = c]\, \log p(x_i = s|c) + \sum_{c=1}^C \sum_{i=1}^N \lambda_i^c \left( 1 - \sum_{s=1}^S p(x_i = s|c) \right)
Hence, by normalisation,

p(x_i = s|c) = \frac{\sum_{\mu} I[x_i^\mu = s] \, I[c^\mu = c]}{\sum_{s'} \sum_{\mu'} I[x_i^{\mu'} = s'] \, I[c^{\mu'} = c]}

In words, this means that the optimal ML setting for the parameter p(x_i = s|c) is,
for the class c data, the relative number of times that attribute i is in state s
(analogous to the binary example before).
where

p(\alpha_i(c)|D) \propto p(\alpha_i(c)) \prod_{\mu : c^\mu = c} p(x_i^\mu | \alpha_i(c))

Here the parameter u_i(c) describes the form of the prior. If we take this to be the
unit vector, the distribution on the (infinite set of) distributions p(\alpha_i(c)) is flat.
This is not an unreasonable assumption to make. This then becomes

p(c|x^*, D) \propto p(c) \prod_i \frac{Z(u_i^*(c))}{Z(\hat{u}_i(c))}

where

u_{is}^*(c) = \hat{u}_{is}(c) + I[x_i^* = s]
17.7 Problems
Exercise 42 A local supermarket specializing in breakfast cereals decides to ana-
lyze the buying patterns of its customers.
They make a small survey asking 6 randomly chosen people which of the breakfast
cereals (Cornflakes, Frosties, Sugar Puffs, Branflakes) they like, and also asking
for their age (older or younger than 60 years). Each respondent provides a vector
with entries 1 or 0 corresponding to whether they like or dislike the cereal. Thus
a respondent with (1101) would like Cornflakes, Frosties and Branflakes, but not
Sugar Puffs.
The older than 60 years respondents provide the following data :
(0110), (1110)
A novel customer comes into the supermarket and says she only likes Frosties and
Sugar Puffs. Using Naive Bayes trained with maximum likelihood, what is the
probability that she is younger than 60?
In addition, each respondent gives a value c = 1 if they are content with their
lifestyle, and c = 0 if they are not.
Thus, a response (1, 0, 1) would indicate that the respondent was ‘rich’, ‘unmar-
ried’, ‘healthy’.
The following responses were obtained from people who claimed also to be ‘content’
:
Using Naive Bayes on this data, what is the probability that a person who is ‘not
rich’, ‘married’ and ‘healthy’ is ‘content’?
What is the probability that a person who is ‘not rich’ and ‘married’ is ‘content’?
(That is, we do not know whether or not they are ‘healthy’).
Consider the following vector of attributes :
Point out any potential difficulties with using your previously described approach
to training using Naive Bayes. Hence describe how to extend your previous Naive
Bayes method to deal with this dataset. Describe in detail how maximum likelihood
could be used to train this model.
17.8 Solutions
42 Looking at the data, the estimates using maximum likelihood are

p(C = 1|Young) = 0.5, p(F = 1|Young) = 1, p(SP = 1|Young) = 1, p(B = 1|Young) = 0

and

p(C = 1|Old) = 0.75, p(F = 1|Old) = 0.25, p(SP = 1|Old) = 0.25, p(B = 1|Old) = 0.75

and p(Young) = 2/6 and p(Old) = 4/6. For the novel customer x = (0110) we have
p(x|Young) = 0.5 × 1 × 1 × 1 = 0.5 and p(x|Old) = 0.25^4. Plugging this into Bayes' formula,

p(Young|x) = \frac{0.5 \times \frac{2}{6}}{0.5 \times \frac{2}{6} + 0.25^4 \times \frac{4}{6}} \approx 0.98
Figure 18.1: A mixture model has a trivial graphical representation as a DAG with
a single hidden node, which can be in one of H states, i = 1 . . . H.
Figure 18.2: It is clear that the black dots, which represent the one dimensional
data values, are naturally clustered into three groups. Hence, a reasonable model
of this data would be p(x) = p(x|1)p(1) + p(x|2)p(2) + p(x|3)p(3) = \sum_{i=1}^3 p(x|i)p(i),
where p(x|i) is the model for the data in cluster i, and \sum_i p(i) = 1.
Figure 18.3: Gaussian Mixture Models place blobs of probability mass in the space.
Here we have 4 mixture components in a 2 dimensional space. Each mixture has
a different covariance matrix and mean.
\sum_{i,\mu} q^\mu(i) \log \left( p(x^\mu|\theta_i, i) \, p(i) \right)    \qquad (18.1.2)
Provided that the parameters are not shared by the mixture components, we have
simply, for each θi
\theta_i^{new} = \arg\max_{\theta_i} \sum_\mu q^\mu(i) \log p(x^\mu|\theta_i, i)
The choice for the variational distributions is user dependent. The optimal EM
setting is
q µ (i) = p(i|xµ , θi ) ∝ p (xµ |θi , i) p (i) (18.1.5)
then the bound serves as a way to assess which converged parameters are to be
preferred.
All computational methods which aim to fit mixtures of Gaussians using ML there-
fore either succeed by getting trapped in serendipitous local maxima, or by the ad
hoc addition of “extra constraints” on the width of the Gaussians. A more reason-
able approach is to incorporate such necessary assumptions on the widths of the
Gaussians in the prior beliefs. This has the advantage of transparency and clarity
in the line of thought. In this sense, therefore, we can use the MAP approach in
preference to the ML solution. In practice, however, it is more commonplace to
use a simple criterion which prevents the eigenvalues of the covariance matrices
from becoming too small. A Bayesian solution to this problem is possible and re-
quires a prior on covariance matrices. The natural prior in this case is the so-called
Wishart Distribution, which we shall discuss later.
where P is the number of training examples. The EM choice for the variational
distributions is
q^\mu(i) \propto p(i) \exp\left( -\frac{1}{2} (x^\mu - m_i)^T S_i^{-1} (x^\mu - m_i) \right)    \qquad (18.2.9)
Figure 18.4: Training a mixture of 10 Gaussians (a) If we start with large variances
for the Gaussians, even after one iteration, the Gaussians are centred close to the
mean of the data. (b) The Gaussians begin to separate (c) One by one, the
Gaussians move towards appropriate parts of the data (d) The final converged
solution. Here the Gaussians were constrained so that the variances could not go
below 0.01.
Symmetry Breaking An interesting observation about the performance of EM applied to mixture models
is that, initially, it appears as if little is happening, as each model jostles
with the others to try to explain the data. Eventually, some almost seemingly
random effect causes one model to break away from the jostling and explain data
close to that model. The origin of this jostling is an inherent symmetry in the
solution to this problem: it makes no difference to the likelihood if we relabel
what the components are called. This permutation symmetry causes the initial
confusion amongst the models as to who is going to explain which parts of the
data. Eventually, this symmetry is broken, and a local solution is found. This
can severely handicap the performance of EM when there are a large number of
models in the mixture. A heuristic is therefore to begin with a small number of
models, say two, for which symmetry breaking is less problematic. Once a local
broken solution has been found, then more models are included into the mixture,
initialised close to the currently found broken solutions. In this way, a hierarchical
breaking scheme is envisaged. Clearly, there is potentially some bias introduced in
this scheme as to the kinds of solutions found – however, this may be a small price
to pay in the light of waiting for a very long time as the models jostle unnecessarily.
Another popular method for initialisation is to center the means to those found
by the K-means algorithm – however, this itself requires a heuristic initialisation.
Parzen Estimator
The Parzen estimator is one of the simplest density estimators. The idea is simply
to put a Gaussian at each datapoint xk , k = 1 . . . P . Usually, this Gaussian is
chosen to be isotropic – that is, with covariance matrix Σ = σI, where σ is some
pre-specified value controlling the width of the Gaussian bumps.
Whilst an intuitively reasonable thing to do, if one is working with large datasets in
high dimensional spaces, one needs to store all the datapoints in order to calculate
the density, which can be prohibitive.
Unless the widths of the Gaussians are chosen to be broad, only a small region of
the space is covered by each Gaussian bump. In high dimensional spaces therefore,
the Parzen estimator will only have appreciable density very close to the data or,
if the Gaussians are broad, the density will be underestimated close to the data.
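A minimal sketch of such an estimator (the function name and argument layout are illustrative assumptions):

% Sketch : Parzen density estimate with isotropic Gaussian bumps.
% X is a d x P matrix of datapoints, sigma2 the chosen variance, and
% xquery a d x 1 point at which the density is evaluated.
function p = parzen(xquery, X, sigma2)
[d,P] = size(X);
dist2 = sum((X - repmat(xquery,1,P)).^2, 1);             % squared distances to the data
p = mean(exp(-0.5*dist2./sigma2))./(2*pi*sigma2)^(d/2);  % average of the Gaussian bumps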
% EM for a mixture of isotropic Gaussians. The data X is d x n; the lines
% marked (added) complete the fragment with an initialisation and an EM loop.
h = 5; % number of mixtures
d = size(X,1); % dimension of the space
n = size(X,2); % number of training patterns
Smin = 0.001; % minimum variance of Gaussians
r = randperm(n); M = X(:,r(1:h)); % (added) initialise means to random datapoints
S = ones(1,h); P = ones(1,h)./h;  % (added) initial variances and mixing weights
for emloop = 1:50                 % (added) EM iterations
for i = 1:h
for k = 1:n
v = X(:,k) - M(:,i);
Q(k,i) = exp(-0.5*(v'*v)/S(i)).*P(i)./sqrt((S(i))^d);
end
end
su = sum(Q,2);
for k =1:n
Q(k,:) = Q(k,:)./su(k); % responsibilities p(i|x^n)
end
for i = 1:h % now get the new parameters for each component
N(i) = sum(Q(:,i));
Mnew(:,i) = X*Q(:,i)./N(i);
Snew(i) = (1/d)*sum( (X - repmat(Mnew(:,i),1,n)).^2 )*Q(:,i)./N(i);
if Snew(i) < Smin % don't decrease the variance below Smin
Snew(i) = Smin;
end
end
M = Mnew; S = Snew; P = N./n;     % (added) update the parameters
end
18.3 K Means
A non-probabilistic limit of fitting Gaussian mixtures to data is given by the K
means algorithm, in which we simply represent an original set of P datapoints by
K points.
K = 3; % number of clusters
r = randperm(size(x,2));
m(:,1:K) = x(:,r(1:K)); % initialise the clusters to K randomly chosen datapoints
mold = m;
for iter = 1:100 % (added) iterate assignments and mean updates
for n = 1:size(x,2) % (added) assign each datapoint to its nearest centre
[dum,b(n)] = min(sum((repmat(x(:,n),1,K)-m).^2,1));
end
for k = 1:K
if length(find(b==k))>0
m(:,k) = mean(x(:,find(b==k)),2); % mean of the points assigned to cluster k
end
end
if mean(sum( (m-mold).^2)) < 0.001; break; end; mold = m; % termination criterion
end
Note that the K means algorithm can be derived as the limit σ → 0 of fitting
isotropic Gaussian mixture components.
The K means algorithm, despite its simplicity, is very useful. Firstly, it converges
extremely quickly and often gives a reasonable clustering of the data, provided that
the centres are initialised reasonably (using the above procedure for example). We
can use the centres we found as positions in which to place basis function centres
in the linear parametric models chapter.
and a mixture model to the data from class 2. (One could use a different number
of mixture components for the different classes, although in practice, one might
need to avoid overfitting one class more than the other. Using the same number
of mixture components for both classes avoids this problem.)
p(x|c = 2) = \sum_{k=1}^K p(x|k, c = 2) \, p(k|c = 2)    \qquad (18.4.2)
So that each class has its own set of mixture model parameters. We can then form
a classifier by using Bayes rule :
p(c = i|x) = \frac{p(x|c = i) \, p(c = i)}{p(x)}    \qquad (18.4.3)
Figure 18.6: A mixture model has a trivial graphical representation as a DAG with
a single hidden node, which can be in one of H states, i = 1 . . . H, and which
generates the visible variables v1 , v2 , v3 .
Only the numerator is important in determining the classification since the de-
nominator is the same for the case of p(c = 2|x). This is a more powerful approach
than our original approach in which we fitted a single Gaussian to each digit class.
Using more Gaussians enables us to get a better model for how the data in each
class is distributed and this will usually result in a better classifier.
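A minimal sketch of this classification step (the helper names loglik1, loglik2 are illustrative placeholders for the fitted class-conditional mixture log likelihoods):

% Sketch : class posterior from two class-conditional mixture models.
% loglik1(x), loglik2(x) return log p(x|c=1), log p(x|c=2); pc1, pc2 are priors.
l1 = loglik1(x) + log(pc1);
l2 = loglik2(x) + log(pc2);
m = max(l1,l2);                              % subtract the max for numerical stability
post1 = exp(l1-m)/(exp(l1-m) + exp(l2-m));   % p(c=1|x); the denominator p(x) cancels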
where each term p(vi |h) is a Binomial distribution – that is, there are two states
vi ∈ {0, 1}. The generalisation to many states is straightforward.
EM training
In order to train the above model, we can use the EM algorithm, since we have a
hidden variable. Formally, we can figure out the algorithm by, as usual, writing
down the energy:
\sum_\mu \langle \log p(v_1^\mu, v_2^\mu, v_3^\mu, h) \rangle_{q^\mu(h)} = \sum_\mu \sum_i \langle \log p(v_i^\mu|h) \rangle_{q^\mu(h)} + \sum_\mu \langle \log p(h) \rangle_{q^\mu(h)}
and then performing the maximisation over the table entries. However, from our
general intuition, we may immediately jump to the results:
p(v_i = 1|h = j) \propto \sum_\mu I[v_i^\mu = 1] \, q^\mu(h = j)

p(h = j) \propto \sum_\mu q^\mu(h = j)

q^\mu(h = j) \propto p(v^\mu|h = j) \, p(h = j) = p(h = j) \prod_i p(v_i^\mu|h = j)
Figure 18.7: Data from questionnaire responses. There are 10 questions, and
60 people responded. White denotes a 'yes' and black denotes 'no'. Gray denotes
the absence of a response (missing data). This training data was generated by a
three component Binomial mixture. Missing data was then simulated by randomly
removing values from the dataset.
These equations are iterated until convergence. Code that implements the above
method is provided later in this chapter.
One of the pleasing aspects of this model is that if one of the attribute values
is missing in the data set, the only modification to the algorithm is to drop the
corresponding factor p(viµ |h) from the algorithm. The verification that this is a
valid thing to do is left to the reader.
Example : Questionnaire
Data from a questionnaire is presented below in fig(18.7). The data has a great
number of missing values. We have reason to believe that there are three kinds of
respondents.
Running the EM algorithm on this data, with random initial values for the tables,
gives an evolution for the lower bound on the likelihood as presented in fig(18.8a).
The EM algorithm finds a good solution. The 3 hidden state probabilities learned
are p(h = 1) ≈ 1/3,p(h = 2) ≈ 1/3,p(h = 3) ≈ 1/3, which is in rough agreement
with the data generating process. The solution is permuted, but otherwise fine,
and the three basic kinds of respondents have been well identified. Note how
difficult this problem is to solve by visual inspection of the data fig(18.7).
Code that implements this problem is given at the end of this chapter.
Mixtures of HMM
A useful way to cluster temporal sequences is to use a mixture of HMMs. This
can be trained in the usual EM way, and is useful for clustering temporal sequences
in, for example, bioinformatics and music.
(a) (b) “true” p(vi = 1|h) values. (c) learned p(vi = 1|h) values
Figure 18.8: (a) The evolution of the lower bound on the likelihood as we go
through eventually 50 iterations of EM. The different regimes initially in the evo-
lution are signatures of the symmetry breaking. (b) The ‘true’ value for the pa-
rameters p(vi = 1|h). Black corresponds to the value 0.95 and white to the value
0.1. (c) The solution p(vi = 1|h) found by the converged EM approach. (b) and
(c) are closely related, except for a trivial permutation of the hidden label.
The key intuition in this chapter is that clustering corresponds to using a mixture
model. These may be trained with EM, although some care is required with
initialisation and symmetry breaking.
ANSWER: It's true that the likelihood will be higher for the more complex model.
What we need to do is to introduce a prior over the parameters of the model, and
then integrate over them (as opposed to finding the single set of parameters that
maximises the likelihood). This provides the effective Occam factor that penalises
the overly complex model. Note that this phenomenon only kicks in
when we integrate over the unknown model parameters. Essentially, it's similar
to the fact that a more complex model will always have a lower training error –
however, what we need to measure is something more like the effective volume
of parameter space that has a low training error – this is given by the Bayesian
solution.
function [pv,ph]=fitmixbern(v,nh,num_em_loops)
% EM fitting of a mixture of Bernoulli (binary) distributions.
% v is an n x P binary data matrix; lines marked (added) complete the fragment.
n=size(v,1); P=size(v,2);
pv = rand(n,nh); % random initialisation for the probs
ph = rand(1,nh); ph = ph./sum(ph);
for em = 1:num_em_loops
for mu = 1:P
for i = 1:nh
p(i,mu)=bern_prob(v(:,mu),pv(:,i))*ph(i); % p(i|vmu)*const.
end
end
p = p./repmat(sum(p,1),nh,1); % normalise : p(i|vmu)
pv1 = zeros(n,nh); pv0 = zeros(n,nh); % (added) reset the M-step accumulators
for i = 1:nh
for datadim = 1:n
for mu = 1:P
pv1(datadim,i) = pv1(datadim,i) + (v(datadim,mu)==1).*p(i,mu);
pv0(datadim,i) = pv0(datadim,i) + (v(datadim,mu)==0).*p(i,mu);
end
end
end
for i = 1:nh
for datadim = 1:n
pvnew(datadim,i) = pv1(datadim,i)./(pv0(datadim,i)+pv1(datadim,i));
end
end
phnew = sum(p,2)'; phnew = phnew./sum(phnew); % (added) new mixture weights p(h)
energy = 0; % (added) lower bound on the log likelihood, can be monitored
for mu=1:P
for datadim=1:n
energy = energy + (v(datadim,mu)==1)*sum(log(0.0000001+pvnew(datadim,:)).*(p(:,mu)'));
energy = energy + (v(datadim,mu)==0)*sum(log(0.0000001 + 1-pvnew(datadim,:)).*(p(:,mu)'));
end
end
energy=energy+sum((log(0.0000001+phnew))*p);
pv = pvnew; ph = phnew; % (added) update the parameters for the next iteration
end
function p = bern_prob(c,p)
p = prod(p(find(c==1)))*prod((1-p(find(c==0))));
18.6 Problems
Exercise 44 If a and b are d × 1 column vectors and M is a d × d symmetric
matrix, show that aT M b = bT M a.
Exercise 45 Write the quadratic forms x21 −4x1 x2 +7x22 and (x1 +x2 )2 +(x3 +x4 )2
in the form xT Cx where C is a symmetric matrix.
Exercise 46 Consider data points generated from two different classes. Class 1
has the distribution P (x|C1 ) ∼ N (µ1 , σ 2 ) and class 2 has the distribution P (x|C2 ) ∼
N (µ2 , σ 2 ). The prior probabilities of each class are P (C1 ) = P (C2 ) = 1/2. Show
that the posterior probability P (C1 |x) is of the form
P(C_1|x) = \frac{1}{1 + \exp(-(ax + b))}
and determine a and b in terms of µ1 , µ2 and σ 2 . The function f (z) = 1/(1 + e−z )
is known as the logistic function, and is a commonly-used transfer function in
artificial neural networks.
P(x) = \frac{e^{-\lambda} \lambda^x}{x!} , \qquad x = 0, 1, 2, \ldots
You are given a sample of n observations x1 , . . . , xn drawn from this distribution.
Determine the maximum likelihood estimator of the Poisson parameter λ.
18.7 Solutions
19 Factor Analysis and PPCA
Introduction
The notion of a continuous mixture is somewhat less clear than in the discrete case,
where each discrete mixture component has the intuitive meaning of representing
a “cluster”. Continuous mixture models generally do not have the same cluster
type intuition, since the hidden space will usually be connected. In the continuous
case, the model instead expresses a preference for how the hidden variables are
distributed.
Such models have many extremely useful properties, and are widely applied. They
correspond to our belief that there is some continuous hidden process p(h), from
which (usually continuous) visible variables are observed, p(v|h). The literature in
this area is vast, and in this chapter we will consider only some of the most well
known examples, beginning with some relatively numerically tractable modelling
of subspaces.
v = Wh + b + ǫ (19.1.1)
where the noise ǫ is Gaussian distributed, ǫ ∼ N (0; Ψ), and the matrix W para-
meterises the linear mapping. The constant bias b essentially sets the origin of the
Figure 19.1: In linear modelling of a subspace, we hope that data in the high di-
mensional space lies close to a hyperplane that can be spanned by a smaller number
of vectors. Here, each three-dimensional datapoint can be roughly described by
using only two components.
coordinate system. The essential difference between PCA and Factor Analysis is
in the choice of Ψ.
Factor Analysis
In factor analysis, one assumes that the covariance for the noise is diagonal Ψ =
diag (ψ1 , . . . , ψn ). This is a reasonable assumption if we believe that each com-
ponent of the data, vi , has Gaussian measurement error, independent of the other
components. We see therefore that, given h, the data is assumed to be Gaussian
distributed with mean W h + b and covariance Ψ.
To complete the model, we need to specify the hidden distribution p(h). Since
tractability is always a concern for continuous distributions, an expedient choice
is a Gaussian
p(h) \propto e^{-h^T h / 2}    \qquad (19.1.3)
This therefore means that the coordinates h will be limited, and will most likely
be concentrated around values close to 0. If we were to sample from such a p(h)
and then draw a value for v using p(v|h), we would see that the v vectors that we
sample would look like a saucer in the v space.
Indeed, in this case we can easily calculate the exact form of p(v):
p(v) = \int p(v|h) \, p(h) \, dh    \qquad (19.1.4)
Since v = W h + b + η and we know that p(h) is a zero mean Gaussian with unit
covariance, and η is zero mean with Covariance Ψ, v will be Gaussian distributed
with mean b and covariance matrix W W T + Ψ.
The form of the covariance matrix is interesting and tells us something about the
solution: since the matrix W only appears in the final model p(v) in the form
W W^T , the solution for W is determined only up to an arbitrary rotation –
replacing W by W R, with R orthogonal, leaves W W^T unchanged.
Figure 19.2: Graphical representation of factor analysis for a model with 3 hidden
or latent variables, which generate the visible or output variable v = (v1 , . . . , v5 )T .
Warning! Since the so-called factor loadings W are equally likely as any rotated version of
them, one should be very careful about interpreting the coefficients of the W –
in particular, about attributing meaning to each of the values. Such practice is
commonplace in the social sciences and, in general, is very poor science.
Training FA using EM
A natural way to train Factor Analysis, is to use our standard variational learning
framework. Of course, one could also attempt to maximise the likelihood di-
rectly (and the likelihood is relatively simple to calculate here). However, as usual,
the variational procedure tends to converge rather more quickly, and is the one we
shall describe here.
It is left as an exercise for the interested reader to show that the following con-
ditions hold at the maximum of the energy. Maximising E with respect to b
gives
b = \frac{1}{P} \sum_\mu v^\mu - W \, \frac{1}{P} \sum_\mu \langle h \rangle_{q^\mu(h)}
W = AH −1
where
A = \frac{1}{P} \sum_\mu c^\mu \langle h \rangle_{q^\mu(h)}^T , \qquad c^\mu = v^\mu - b , \qquad H = \frac{1}{P} \sum_\mu \langle h h^T \rangle_{q^\mu(h)}
Finally
\Psi = \frac{1}{P} \sum_\mu \text{diag} \left\langle (c^\mu - W h)(c^\mu - W h)^T \right\rangle_{q^\mu(h)} = \text{diag} \left\{ \frac{1}{P} \sum_\mu c^\mu (c^\mu)^T - 2 W A^T + W H W^T \right\}
The above recursions depend on the statistics \langle h \rangle_{q^\mu(h)} and \langle h h^T \rangle_{q^\mu(h)}. Using the
EM optimal choice, q^\mu(h) = p(h|v^\mu), which is a Gaussian with covariance

\Sigma = \left( I + W^T \Psi^{-1} W \right)^{-1}

and mean

m^\mu = \langle h \rangle_{q^\mu(h)} = \left( I + W^T \Psi^{-1} W \right)^{-1} W^T \Psi^{-1} c^\mu

From which

H = \Sigma + \frac{1}{P} \sum_\mu m^\mu (m^\mu)^T
The above equations then define recursions in the usual EM manner. Unfortu-
nately, the lack of a closed form solution to these equations means that FA is less
widely used than the simpler PCA (and its probabilistic variant).
A nice feature of FA is that one can perform the calculations on very high dimen-
sional data without difficulty. (In the standard PCA this is an issue, although
these problems can be avoided – see the text).
Also, unlike in PCA, the matrix W that is learned need not be orthogonal.
p(h) = N (0, ΣH )
Does this really improve the representative power of the model? For notational
simplicity, let’s consider
v = Wh + ǫ
v \sim N(0, \, W \Sigma_H W^T + \sigma^2 I) = N(0, \, W' W'^T + \sigma^2 I)

where W' \equiv W \Sigma_H^{1/2}. Hence, there is nothing to be gained from using a correlated
Gaussian prior p(h).
Figure 19.3: A comparison of factor analysis and PCA. The underlying data gen-
erating process is y = x + ǫ, where ǫ is Gaussian noise of standard deviation σ. In
the plots from left to right, σ takes the values 0.5, 1.2, 2, 3, 4. The FA solution is
given by the solid arrow, and the PCA solution by the dashed arrow. The correct
direction is given by the solid line. Note how the PCA solution “rotates” upwards
as the noise level increases, whereas the FA solution remains a better estimate of
the underlying correct direction.
where the H column vectors in UH are the first H eigenvectors of the sample
covariance matrix S,
S = \frac{1}{P} \sum_{\mu=1}^P (v^\mu - m)(v^\mu - m)^T    \qquad (19.1.6)

where m is the sample mean \sum_\mu v^\mu / P. \Lambda_H is a diagonal matrix containing the
corresponding eigenvalues of S. R is an arbitrary orthogonal matrix (representing
an arbitrary rotation). For this choice of W, the optimal ML noise is given by

\sigma^2 = \frac{1}{V - H} \sum_{j=H+1}^V \lambda_j    \qquad (19.1.7)

where \lambda_j is the jth eigenvalue of S. This has the interpretation as the variance
lost in the projection, averaged over the lost dimensions.
This means that we can rapidly find a ML linear subspace fit based on the eigen-
decomposition of the sample covariance matrix and sample mean.
An advantage of a proper probabilistic approach to PCA is that one can then,
in a principled manner, for example, contemplate discrete mixtures of Principal
Component Analysers, or indeed, a mixture of different kinds of models. Without
a probabilistic framework, it is difficult to justify how a set models should be
combined.
There are several ways to understand PCA. However, in the current context, PCA
is defined as the limit of PPCA in which σ → 0 and R = I. That is, the mapping
from the latent space to the data space is deterministic. In this case, the columns
of W are given by simply the eigenvectors of the sample covariance matrix, scaled
by the square root of their corresponding eigenvalues.
X X^T E = E \Lambda    \qquad (19.1.8)

X^T X X^T E = X^T E \Lambda    \qquad (19.1.9)

X^T X \tilde{E} = \tilde{E} \Lambda    \qquad (19.1.10)
where we defined Ẽ = XT E. The last line above represents the eigenvector equation
for XT X. This is a matrix of dimensions P × P – in the above example, a 500 × 500
matrix as opposed to a 10^6 × 10^6 matrix previously. We can then calculate the
eigenvectors Ẽ and eigenvalues Λ of this matrix more easily. Once found, we then
use
E = XẼΛ−1 (19.1.11)
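A minimal sketch of this computation (variable names are illustrative; here the recovered eigenvectors are rescaled to unit length, which differs from (19.1.11) only by a scaling):

% Sketch : leading eigenvectors of X*X' via the smaller P x P eigenproblem.
% X is a dim x P matrix of (zero mean) datapoints, H the number of components.
H = 2;
[Etilde, Lambda] = eig(X'*X);                      % P x P eigenproblem
[vals, order] = sort(diag(Lambda), 'descend');
Etilde = Etilde(:, order(1:H)); vals = vals(1:H);  % keep the H largest
E = X*Etilde*diag(1./sqrt(vals));                  % unit-norm eigenvectors of X*X'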
Certainly one advantage of these probabilistic approaches is that they may now be
used in discrete mixture models in a principled way, and this can indeed improve
performance considerably.
Figure 19.4: For a 5 hidden unit model, here are plotted the results of training
PPCA and FA on 100 examples of the handwritten digit seven. Along with the
PPCA mean and FA bias, the 5 columns of W are plotted for FA, and the 5 largest
eigenvectors from PPCA are plotted.
Figure 19.5: (a) 25 samples from the learned FA model. (b) 25 samples from the
learned PPCA model.

function [W,Psi,b]=fa(v,H,num_em_loops)
% EM for factor analysis. v is a cell array of P column vectors of dimension V.
% Lines marked (added) complete the fragment with an initialisation and EM loop.
P = length(v); V = size(v{1},1);
W = randn(V,H); Psi = ones(V,1); b = zeros(V,1); % (added) initialisation
for emloop = 1:num_em_loops                      % (added) EM iterations
Sigma = inv(eye(H) + W'*diag(1./Psi)*W);         % (added) posterior covariance of h
mtot = zeros(H,H);                               % (added) accumulator for sum_mu m^mu*m^mu'
diagcont = zeros(V,1);
A = zeros(V,H);
btot = zeros(V,1);
for mu=1:P
c{mu} = v{mu}-b;
diagcont = diagcont + c{mu}.^2;
m{mu} = Sigma*W'*diag(1./Psi)*c{mu};
mtot = mtot + m{mu}*m{mu}';
A = A + c{mu}*m{mu}';
btot = btot + v{mu}-W*m{mu};
end
Hmat = Sigma + mtot./P;
A = A./P;
diagcont = diagcont./P;
diagWA = diag(W*A');
Psi = diagcont -2*diagWA+diag(W*Hmat*W');
b = btot./P;
W = A/Hmat;
end
The idea is the same as in FA, except that the transformation from the latent
space to the data space is non-linear. That is
p(x|h) \propto \exp\left( -\frac{1}{2} (x - u)^T \Psi^{-1} (x - u) \right)    \qquad (19.3.1)
where u = φ (h) where φ(t) is, in general, a non-linear function of t. If we take the
same Gaussian prior as before, in general, we cannot calculate the integral over
the latent space analytically anymore.
This can be approximated by
p(x) \approx \frac{1}{L} \sum_{l=1}^L p(x|h^l)    \qquad (19.3.2)
where we have sampled L latent points from the density p (h). This is straightfor-
ward to do in the case of using a Gaussian prior on h.
What this means is that the density model is therefore a mixture of Gaussians,
constrained somewhat through the non-linear function.
One approach is to parameterise the non-linearity as
\phi(h) = \sum_i w_i \phi_i(h)    \qquad (19.3.3)
where the φi are fixed functions and the weights wi form the parameters of the
mapping. These parameters, along with the other parameters can then be found
using variational learning (EM).
Figure 19.6: (a) The latent space usually corresponds to a low dimensional space,
here 2 dimensional, so that a point h represented as the black dot in this space
is specified by coordinates (h1 , h2 ). Associated with this latent space is a prior
belief about where the latent parameters are. Here this is a Gaussian distribution.
(b) Each point in the latent space is mapped to some point in a typically higher
dimensional space, here 3 dimensional. The mapping here is linear so that the
object in the higher dimensional space is simply a plane – that is, a point in
the lower dimensional space gets mapped to corresponding point (black dot) in
the plane. Similarly, there will be an associated density function in this higher
dimensional space, inherited from the density function in latent space. (c) Here
the mapping from latent space to data space is non-linear, and produces a two
dimensional manifold embedded in the three dimensional space.
Figure 19.7: Graphical representation of PPCA for a model with 3 hidden or latent
variables, which generate the visible or output variable v = (v1 , . . . , v5 )T .
Since p(v) is going to be Gaussian, all we need to do is find its mean and covariance.
Since the noise is zero mean, then v will be zero mean. The covariance is given by
\langle v v^T \rangle = \left\langle (W h + \epsilon)(W h + \epsilon)^T \right\rangle
where the angled brackets denote an average with respect to all sources of fluctua-
tions, namely the noise and the hidden distribution. Since these noise sources are
uncorrelated, we have
\langle v v^T \rangle = W W^T + \sigma^2 I
Hence
p(v) = N (0, Σ = W W T + σ 2 I)
W = SΣ−1 W
Hence U are the eigenvectors of the correlation matrix S and \lambda_i = \sigma^2 + l_i^2 are the
eigenvalues. This constraint therefore requires l_i = (\lambda_i - \sigma^2)^{1/2}, which means that
the solutions are of the form

W = E \left( \Lambda - \sigma^2 I \right)^{1/2} R
where R is an arbitrary orthogonal matrix. The reader may verify, by plugging this
solution back into the log-likelihood expression, that the eigenvalues and associated
eigenvectors which maximise the likelihood correspond to the H largest eigenvalues
of S. The standard, non-probabilistic variant PCA is given as the limiting case
σ 2 → 0 of PPCA.
Let’s order the eigenvalues so that λ1 ≥ λ2 , ... ≥ λV . The value for the log
likelihood is then (see exercises)
L = -\frac{P}{2} \left( V \log(2\pi) + \sum_{i=1}^H \log \lambda_i + \frac{1}{\sigma^2} \sum_{i=H+1}^V \lambda_i + (V - H) \log \sigma^2 + H \right)
The reader may then verify that the optimal ML setting for σ 2 is
\sigma^2 = \frac{1}{V - H} \sum_{j=H+1}^V \lambda_j
Of course, we could have trained the above method using the standard EM algo-
rithm. What’s convenient about PPCA is that the solution is analytic, and boils
down to a simple eigenproblem.
Note: in the above, we clearly need λi ≥ σ 2 for the retained eigenvectors. However,
for the ML solution this is guaranteed, since σ 2 is set to the average of the discarded
eigenvalues, which must therefore be smaller than any of the retained eigenvalues.
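A minimal sketch of the resulting fitting procedure (the variable names are illustrative):

% Sketch : analytic ML fit of PPCA with H latent dimensions, following the
% equations above. X is a V x P data matrix.
H = 2;
[V,P] = size(X);
m = mean(X,2);
S = (X - repmat(m,1,P))*(X - repmat(m,1,P))'/P;     % sample covariance
[E,L] = eig(S); [lambda,order] = sort(diag(L),'descend'); E = E(:,order);
sigma2 = mean(lambda(H+1:end));                     % average discarded eigenvalue
W = E(:,1:H)*diag(sqrt(lambda(1:H) - sigma2));      % ML factor loadings, taking R = I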
Mixtures of PPCA
19.5 Problems
Exercise 49 In one dimension, dim(x) = 1, the Gaussian distribution is defined
as
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}
You decide to fit a Gaussian to each class and use the ML estimates of the means
µ̂1 and µ̂2 . From the data, you find that the ML estimates of σ12 and σ22 are
equal, that is, σ̂12 = σ̂22 . Write down the explicit x value that defines the decision
boundary.
Point out any potential numerical difficulties in directly comparing the values p(c =
1|x) and p(c = 2|x) and explain how you might overcome this.
In more than one dimension, the multi-variate Gaussian is defined as
p(x) = \frac{1}{\sqrt{\det 2\pi S}} \, e^{-\frac{1}{2}(x - \mu)^T S^{-1} (x - \mu)}
The posterior distribution for z is given by P (z|x) ∝ P (z)P (x|z) (we don’t need to
worry too much about the normalization of the posterior distribution in this ques-
tion).
Show that the posterior distribution is Gaussian with mean (Im +W T Ψ−1 W )−1 W T Ψ−1 x,
and state the covariance matrix of the posterior distribution.
Exercise 51 Factor analysis and scaling. Assume that a m-factor model holds
for x. Now consider the transformation y = Cx, where C is a non-singular
diagonal matrix. Show that factor analysis is scale invariant, i.e. that the m-factor
model also holds for y, with the factor loadings appropriately scaled. How must the
specific factors be scaled?
Exercise 52 Consider data points generated from two different classes. Class 1
has the distribution P (x|C1 ) ∼ N (µ1 , σ 2 ) and class 2 has the distribution P (x|C2 ) ∼
N (µ2 , σ 2 ). The prior probabilities of each class are P (C1 ) = P (C2 ) = 1/2. Show
that the posterior probability P (C1 |x) is of the form
P(C_1|x) = \frac{1}{1 + \exp(-(ax + b))}
and determine a and b in terms of µ1 , µ2 and σ 2 . The function f (z) = 1/(1 + e−z )
is known as the logistic function, and is a commonly-used transfer function in
artificial neural networks.
HINT : use the fact that the determinant of a matrix is the product of its eigen-
values.
Using S e_a = \tilde{\lambda}_a e_a , a = 1, . . . , V , calculate explicitly the value of the expression

\text{trace}\left( (E D E^T + \sigma^2 I)^{-1} S \right)
in terms of the λ̃a and σ 2 . HINT: use the fact that the trace of a matrix is the
sum of its eigenvalues.
19.6 Solutions
20 Dynamic Bayesian Networks : Discrete Hidden Variables
where ǫ(t) is a random variable sampled from some distribution. For example, if
the noise is Gaussian with zero mean and variance σ 2 , then
p(x(t+1)|x(t)) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x(t+1) - x(t) - c)^2}
Markov Process This is an example of a Markov chain. Processes are Markov if the future state
x(t + 1) only depends on the current state x(t). This is called a first order Markov
process which refers to the dependency on only the first immediately preceding
state in time. More formally,

p(x(t+1)|x(t), x(t-1), \ldots, x(1)) = p(x(t+1)|x(t))

\frac{d^2 x}{dt^2} = k_1 , \qquad k_1 = \text{const}
Figure 20.1: Filtering, smoothing and prediction. The shaded bars denote the
extent of the data available in each case, relative to the time point t at which the
hidden state is inferred.
Second Order Here the state of the future world depends on the present and the immediate past1 .
In general, then, we would have

p(x(T), \ldots, x(1)) = p(x(1), x(2)) \prod_{t=1}^{T-2} p(x(t+2)|x(t+1), x(t))
Inference Problems
Models with discrete variables are common, and have significant application in
many fields, ranging from sequence modelling in Bioinformatics to text and speech
processing. Here is a simple example:
1 It is a deep (and at least to me, somewhat mysterious) property that all laws of physics are
only maximally second order differential equations, and hence can be well approximated by
second order stochastic differential equations.
Hilde is an interesting chimp. She has been trained to press the buttons 1, 2 and
3 always in sequence, although the starting state doesn’t matter. For example,
2,3,1,2,3,1,2,3,1,2,3. Hilde is quite good at this, but sometimes makes a mistake
and presses a button out of sequence. The probability that she makes a transition
from state j to state i, p(i|j) is given by the matrix elements below:
1 2 3
1 0.1 0.1 0.8
2 0.8 0.1 0.1
3 0.1 0.8 0.1
which can be represented as a matrix pij ≡ p(i|j). Alternatively, a state transition
diagram can be used, as in fig(20.2) below. To make this more informative, one
sometimes also shows the values of the transitions on the links. This is a generalisation
of Finite State Automata. In FSAs, the transitions are deterministic, the
corresponding table entries being either 0 or 1.

Figure 20.2: A state transition diagram for a three state Markov chain. Note that
a state transition diagram is not a graphical model – it simply graphically displays
the non-zero entries of the transition matrix p(i|j).
Vernon is another, slightly less reliable chimp. He has been trained such that
whenever he sees that Hilde has pressed either button 1 or 2, he grunts A and
whenever he sees a 3 he grunts B. However, he also makes mistakes, as characterised
by the probability state transition matrix below,
1 2 3
A 0.7 0.6 0.25
B 0.3 0.4 0.75
which can be represented by pij ≡ p(v(t) = i|h(t) = j).
Flippa is a super clever dolphin. She is sent a sequence of grunts from
Vernon, e.g. B,A,A,B,A,B, and has been trained to figure out from the sequence of
Vernon's grunts what it was that Hilde pressed. Of course, this is not strictly solvable
in an exact sense. Flippa reports back to her trainer the most likely sequence of
buttons pressed by Hilde.
Figure: the belief network of the HMM, with hidden variables h1 , . . . , h4 and
visible variables v1 , . . . , v4 .
– that is, the model for each time step holds for all times. The
distribution is therefore fully described by the transition matrix

A_{i'i} = p(h_{t+1} = i'|h_t = i)

an emission matrix

B_{ji} = p(v_t = j|h_t = i)

and an initial distribution

\pi_i = p(h_1 = i).
How can Flippa solve her problem? That is, find \arg\max_{h_1,\ldots,h_T} p(h_1, \ldots, h_T|v_1, \ldots, v_T)?
As we saw previously, such most probable state calculations can be carried out
by a slight modification of the JTA. The first step then is to find a Junction Tree
for the HMM. The HMM is already moralised and triangularised. A suitable JT,
along with a valid assignment of the potentials is given in fig(20.4).
Figure 20.4: A junction tree for the HMM, with cliques (h1 , h2 ), (h2 , h3 ), (h3 , h4 )
and (v1 , h1 ), (v2 , h2 ), (v3 , h3 ), (v4 , h4 ).
We are interested in the so-called ‘smoothed’ posterior p(ht |v1:T ). There are two
main approaches to computing this.
Parallel Method
Sequential Method
p(h_t|v_{1:T}) \propto \sum_{h_{t+1}} p(h_t, h_{t+1}, v_{1:T}) \propto \sum_{h_{t+1}} p(h_t|h_{t+1}, v_{1:t}) \, p(h_{t+1}|v_{1:T})    \qquad (20.1.2)
This then gives a backwards recursion for p(ht |v1:T ). As we will see below, the
term p(ht |ht+1 , v1:t ) may be computed based on the filtered results p(ht |v1:t ).
where xt ≡ ht , and φ (xt−1 , vt−1 , xt , vt ) = p(xt |xt−1 )p(vt |xt ). Our aim is to define
‘messages’ ρ, λ (these correspond to the α and β messages in the Hidden Markov
Model framework[37, 39]) which contain information from past observations and
future observations respectively. Explicitly, we define ρt (xt ) ∝ p(xt |v1:t ) to repre-
sent knowledge about xt given all information from time 1 to t. Similarly, λt (xt )
represents knowledge about state xt given all information from the future obser-
vations from time T to time t + 1. In the sequel, we drop the time suffix for
notational clarity. An important point is that λ(xt ) is not a distribution in xt , but
rather implicitly defined through the requirement that the marginal inference is
then given by
Taking the above equation as a starting point, we can calculate the marginal from
this
p(x_t|v_{1:T}) \propto \sum_{x_{t-1}} \rho(x_{t-1}) \, \phi(x_{t-1}, v_{t-1}, x_t, v_t) \, \lambda(x_t)    \qquad (20.1.5)
Similarly, we can integrate equation (22.2.2) over x_t to get the marginal at time
t-1, which by consistency should be proportional to \rho(x_{t-1}) \lambda(x_{t-1}). From such
considerations we arrive at

\rho(x_t) \propto \frac{\sum_{x_{t-1}} \rho(x_{t-1}) \, \phi(x_{t-1}, x_t) \, \lambda(x_t)}{\lambda(x_t)} ,    \qquad (20.1.7)

\lambda(x_{t-1}) \propto \frac{\sum_{x_t} \rho(x_{t-1}) \, \phi(x_{t-1}, x_t) \, \lambda(x_t)}{\rho(x_{t-1})}    \qquad (20.1.8)
Forward Recursion: \rho(x_t) \propto \sum_{x_{t-1}} \rho(x_{t-1}) \, \phi(x_{t-1}, v_{t-1}, x_t, v_t)    \qquad (20.1.9)

Backward Recursion: \lambda(x_{t-1}) \propto \sum_{x_t} \phi(x_{t-1}, v_{t-1}, x_t, v_t) \, \lambda(x_t)    \qquad (20.1.10)
which are the usual definitions of the messages defined as a set of independent
recursions. In engineering, the ρ message is called the α message, and the λ
message is called the β message. This method of performing inference is called a
parallel method since the α and β recursions are independent of each other and
can therefore be implemented in parallel. After computation, they may then be
combined to compute the smoothed posterior.
The extension to more general singly connected structures is straightforward and
results in partially independent recursions which communicate only at branches of
the tree [7].
From equation (22.2.1) it is straightforward to see that λ (xt ) ∝ p(vt+1:T |xt , v1:t ) =
p(vt+1:T |xt ). By definition ρ(xt ) ∝ p(xt |v1:t ) is the filtered estimate.
Logs or Normalise?
The repeated application of the recursions equation (20.1.9) and equation (20.1.10)
may lead to numerical under/over flow. There are two strategies for dealing with
this. One is to work in log space, so that only the logs of the messages are defined.
The other (which is more common in the machine learning literature) is to nor-
malise the messages ρ and λ at each stage of the iteration, so that the messages
sum to unity. Normalisation is valid since both the filtered p(xt |v1:t ) ∝ ρ (xt )
and smoothed inferences p(xt |v1:T ) ∝ ρ (xt ) λ (xt ) are simply proportional to the
messages. The missing proportionality constants can be worked out easily since
we know that distributions must sum to one.
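A minimal sketch of the normalised recursions for the HMM (the matrix layout A(i',i) = p(h(t+1)=i'|h(t)=i), B(j,i) = p(v(t)=j|h(t)=i) and the variable names are illustrative assumptions):

% Sketch : normalised forward (rho/alpha) and backward (lambda/beta) passes.
% A(i',i)=p(h(t+1)=i'|h(t)=i), B(j,i)=p(v(t)=j|h(t)=i), pih(i)=p(h(1)=i) (column),
% v is the observed symbol sequence.
T = length(v); H = length(pih);
rho(:,1) = pih.*B(v(1),:)'; rho(:,1) = rho(:,1)./sum(rho(:,1)); % filtered p(h1|v1)
for t = 2:T                                  % forward recursion
    rho(:,t) = B(v(t),:)'.*(A*rho(:,t-1));
    rho(:,t) = rho(:,t)./sum(rho(:,t));      % normalise to avoid underflow
end
lam(:,T) = ones(H,1);
for t = T-1:-1:1                             % backward recursion
    lam(:,t) = A'*(B(v(t+1),:)'.*lam(:,t+1));
    lam(:,t) = lam(:,t)./sum(lam(:,t));
end
smoothed = rho.*lam;
smoothed = smoothed./repmat(sum(smoothed,1),H,1);  % p(ht|v(1:T)) for all t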
Here we derive an alternative way to compute the smoothed inference p(ht |v1:T )
by correcting these filtered results. We start with the recursion
\gamma(h_t) \equiv p(h_t|v_{1:T}) = \sum_{h_{t+1}} p(h_t, h_{t+1}|v_{1:T})    \qquad (20.1.14)

= \sum_{h_{t+1}} p(h_t|h_{t+1}, v_{1:t}) \, p(h_{t+1}|v_{1:T})    \qquad (20.1.15)
Hence, we can form a backwards recursion for the smoothed inference. p(ht , ht+1 |v1:T )
is given by the above before summing over ht+1 . We therefore need

p(h_t|h_{t+1}, v_{1:t}) = \frac{p(h_{t+1}, h_t|v_{1:t})}{p(h_{t+1}|v_{1:t})} = \frac{p(h_{t+1}|h_t) \, p(h_t|v_{1:t})}{p(h_{t+1}|v_{1:t})}
smoothed recursion makes explicit use of the filtered results. In contrast to the α−β
independent recursion, the above procedure is called a sequential procedure since
we need to first complete the α recursions, after which the γ recursion may begin.
Formally, the α − β and α − γ recursions are related through γ(ht ) ∝ α(ht )β(ht ).
The likelihood of the observed sequence is p(v_{1:T}) = \prod_t p(v_t|v_{1:t-1}). Each factor

p(v_t|v_{1:t-1}) = \sum_{h_t} p(v_t, h_t|v_{1:t-1})    \qquad (20.1.16)

= \sum_{h_t} p(v_t|h_t) \, p(h_t|v_{1:t-1})    \qquad (20.1.17)

= \sum_{h_t} p(v_t|h_t) \sum_{h_{t-1}} p(h_t|h_{t-1}) \, p(h_{t-1}|v_{1:t-1})    \qquad (20.1.18)
where the final term p(ht−1 |v1:t−1 ) are just the filtered inferences. Note, therefore,
that the likelihood of an output sequence requires only a forward computation.
20.1.3 Viterbi
Consider the general HMM problem: Find the most likely state of
p(h1:T |y1:T )
This is easy to find, by using the max version of the JTA/BP algorithms. To make
this explicit though, we write down exactly how this would proceed:
To make the notation a little easier, let's define the potential functions

\phi(h_{t-1}, h_t) \equiv p(y_t|h_t) \, p(h_t|h_{t-1})

where for the first time step we just define \phi(h_1) = p(y_1|h_1) p(h_1). Finding
the most likely hidden state sequence is then equivalent to finding the state h_{1:T} that
maximises the function

\phi = \phi(h_1) \prod_{t=2}^T \phi(h_{t-1}, h_t)
The dependency on h1 appears only in the first two terms φ(h1 ) and φ(h1 , h2 ).
Hence when we perform the max over h1 , we can write
\max_{h_{1:T}} \phi = \max_{h_{2:T}} \underbrace{\max_{h_1} \phi(h_1) \phi(h_1, h_2)}_{f(h_2)} \prod_{t=3}^T \phi(h_{t-1}, h_t)

\max_{h_{1:T}} \phi = \max_{h_{3:T}} \underbrace{\max_{h_2} f(h_2) \phi(h_2, h_3)}_{f(h_3)} \prod_{t=4}^T \phi(h_{t-1}, h_t)
We can continue this procedure, at each stage defining the new potential

f(h_t) = \max_{h_{t-1}} f(h_{t-1}) \, \phi(h_{t-1}, h_t)

until we reach the end of the chain, and we have defined f (h2 ), . . . , f (hT ). Then,
to find which states actually correspond to the maxima, we need to backtrack: we
have at the end of the chain f (hT ). Hence, the most likely final state is given by

h_T^* = \arg\max_{h_T} f(h_T)

and similarly, working backwards,

h_{t-1}^* = \arg\max_{h_{t-1}} f(h_{t-1}) \, \phi(h_{t-1}, h_t^*)
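A minimal sketch of this max/backtrack procedure, applied to the Hilde and Vernon example above (the uniform initial state distribution and the coding 1=A, 2=B for the grunts are illustrative assumptions):

% Sketch : Viterbi decoding in log space for the button pressing example.
A = [0.1 0.1 0.8; 0.8 0.1 0.1; 0.1 0.8 0.1];  % p(h(t+1)=i|h(t)=j), Hilde
B = [0.7 0.6 0.25; 0.3 0.4 0.75];             % p(v(t)=i|h(t)=j), Vernon
v = [2 1 1 2 1 2];                            % observed grunts B,A,A,B,A,B
T = length(v); H = 3;
f = log(ones(H,1)/H) + log(B(v(1),:)');       % log phi(h1), uniform initial state
for t = 2:T
    [f, back(:,t)] = max(repmat(f,1,H) + log(A') + repmat(log(B(v(t),:)),H,1), [], 1);
    f = f';                                   % f(h_t), maximised over h_(t-1)
end
[dum, h(T)] = max(f);                         % most likely final state
for t = T-1:-1:1
    h(t) = back(h(t+1), t+1);                 % backtrack the most likely path
end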
We can look at more complex time dependencies in the hidden variables by in-
creasing the range of the temporal dependencies. For example, a second order
HMM is given in fig(20.5). The inference can again be carried out using the JTA,
Figure 20.5: A second order HMM; a suitable junction tree has cliques (h1 , h2 , h3 ),
(h2 , h3 , h4 ), (h3 , h4 , h5 ) and (v1 , h1 ), . . . , (v5 , h5 ).

Now the complexity is still linear in time, but O(T H(V + H^2 )). In general, the
complexity will be exponential in the order of the interactions.
Learning HMMs
Baum-Welch Algorithm
A HMM is trained by treating the output nodes as evidence nodes and the state
nodes as hidden nodes. This is clearly tractable since the moralization and trian-
gulation steps do not add any extra links. The cliques are of size N^2 where N is
the dimension of the state nodes. Inference therefore scales as O(N^2 T) where T
is the length of the times series.
To find the parameters of the model, A, B, π, a variational type (EM) procedure
can be used, which can be constructed using our previous EM framework.
To make the notation reasonably simple, we write v = (v1 , v2 , . . . , vT ), and simi-
larly, h = (h1 , h2 , . . . , hT ).
To avoid potential confusion, we write pnew (h1 = i) to denote the (new) table
entry for the probability that the initial hidden variable is in state i. The prior term,
by the previously derived EM approach then gives
\pi_i^{new} \equiv p^{new}(h_1 = i) \propto \sum_\mu p^{old}(h_1 = i|v^\mu)    \qquad (20.1.19)
which is the average number of times that the first hidden variable is in state i.
Similarly,
A_{i',i}^{new} \equiv p^{new}(h_{t+1} = i'|h_t = i) \propto \sum_\mu \sum_{t=1}^{T-1} p^{old}(h_t = i, h_{t+1} = i'|v^\mu)    \qquad (20.1.20)
which is the number of times that a transition from hidden state i to hidden state
i′ occurs, averaged over all times (since we assumed stationarity) and training
sequences. Finally,
B_{j,i}^{new} \equiv p^{new}(v_t = j|h_t = i) \propto \sum_\mu \sum_{t=1}^T I[v_t^\mu = j] \, p^{old}(h_t = i|v^\mu)    \qquad (20.1.21)
which is the expected number of times that, for the observation being in state j, we
are in hidden state i. The proportionalities are trivially determined by the normal-
isation constraint. Together, the above three equations define the new prior, tran-
sition and emission probabilities. Using these values for the HMM CPTs, at the
next step we can calculate the quantities pold (h1 = i|v µ ), pold (ht = i, ht+1 = i′ |v µ )
and pold (ht = i|v µ ) using the JTA (or the so-called ‘Forward-Backward’ algo-
rithm, which is equivalent). The equations (20.1.19,20.1.20,20.1.21) are repeated
until convergence.
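A minimal sketch of these updates for a single training sequence (the posterior arrays gamma and xi, assumed here to come from the forward-backward pass, and the variable names are illustrative):

% Sketch : the M-step updates (20.1.19)-(20.1.21) for one sequence v(1:T),
% given gamma(i,t) = p(ht=i|v) and xi(i,idash,t) = p(ht=i,h(t+1)=idash|v).
H = size(gamma,1); V = max(v); T = length(v);   % (assumes every symbol occurs in v)
pinew = gamma(:,1)./sum(gamma(:,1));            % new initial state distribution
Anew = sum(xi(:,:,1:T-1),3)';                   % Anew(idash,i) propto sum_t p(ht=i,ht+1=idash|v)
Anew = Anew./repmat(sum(Anew,1),H,1);           % normalise over the new state idash
for j = 1:V
    Bnew(j,:) = sum(gamma(:, v==j), 2)';        % expected counts of emitting symbol j
end
Bnew = Bnew./repmat(sum(Bnew,1),V,1);           % normalise over the symbols j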
Parameter Initialisation
(Of course, if we were to use a restricted class of q µ functions, we would only con-
verge to a local maximum of the lower bound on the likelihood). There is no
guarantee that the algorithm will find the global maximum, and indeed, the value
of the local maximum found is often critically dependent on the initial settings of
the parameters. How best to initialise the parameters is a thorny issue. According
to Rabiner :
“Experience has shown that either random (subject to the stochastic and the
nonzero value constraints) or uniform initial estimates of the π and A parameters
is adequate for giving useful re-estimates of these parameters in almost all cases.
However, for the B parameters, experience has shown that good initial estimates
are helpful in the discrete case, and are essential in the continuous distribution case
(see later). Such initial estimates can be obtained in a number of ways, including
manual segmentation of the observation sequence(s) into states with averaging of
observations within states, maximum likelihood segmentation of observations with
averaging, and segmental k-means segmentation with clustering.”
Figure 20.7: Graphical model of the IOHMM. Nodes represent the random vari-
ables and arrows indicate direct dependence between variables. In our case the
output variable yt is discrete and represents the class label, while the input variable
xt is the continuous (feature extracted from the) EEG observation. The yellow
(shaded) nodes indicate that these variables are given, so that no associated dis-
tributions need be defined for x1:T .
Related Models
Input-Output HMM
The IOHMM is just a HMM augmented with outputs (visible variables) y1:T and
hidden states h1:T . However, we now consider that we are given for each time step
an input xt . This input can be continuous or discrete and affects the transitions
as
p(y_{1:T}, h_{1:T}|x_{1:T}) = \prod_t p(y_t|h_t, x_t) \, p(h_t|h_{t-1}, x_t)
This is just another HMM, and extending inference and learning to this case is
straightforward. IOHMM is usually used as a conditional classifier, where the out-
puts yt represent a class label at time t. (There are other ways to train this model,
say by specifying a label only at the end of the sequence). In the case of continuous
inputs, the tables p(yt |ht , xt ) and p(ht |ht−1 , xt ) are usually parameterised using a
non-linear function, eg.
p(y_t = y|h_t = h, x_t = x) \propto e^{w_{h,y}^T x}
Inference then follows the same line as for the standard HMM:
p(h_t|x, y) = \sum_{h_{t+1}} p(h_t, h_{t+1}|x, y)    \qquad (20.1.22)

= \sum_{h_{t+1}} p(h_t|h_{t+1}, x_{1:t+1}, y_{1:t}) \, p(h_{t+1}|x, y)    \qquad (20.1.23)
Hence, we can form a backwards recursion. p(ht , ht+1 |x1:T , y1:T ) is given by the
above before summing over ht+1 .
We therefore need
p(h_t|h_{t+1}, x_{1:t}, y_{1:t}) = \frac{p(h_{t+1}, h_t|x_{1:t+1}, y_{1:t})}{p(h_{t+1}|x_{1:t+1}, y_{1:t})} = \frac{p(h_{t+1}|h_t, x_{t+1}) \, p(h_t|x_{1:t}, y_{1:t})}{p(h_{t+1}|x_{1:t+1}, y_{1:t})}
The denominator is just found by normalisation. To find the rest, we use a forward
pass
p(h_t|x_{1:t}, y_{1:t}) \propto \sum_{h_{t-1}} p(h_t, h_{t-1}, x_{1:t}, y_{1:t-1}, y_t)    \qquad (20.1.25)

= \sum_{h_{t-1}} p(y_t|y_{1:t-1}, x_{1:t}, h_t, h_{t-1}) \, p(h_t|y_{1:t-1}, x_{1:t}, h_{t-1}) \, p(y_{1:t-1}, x_{1:t}, h_{t-1})    \qquad (20.1.26)

\propto \sum_{h_{t-1}} p(y_t|x_t, h_t) \, p(h_t|x_t, h_{t-1}) \, p(h_{t-1}|x_{1:t-1}, y_{1:t-1})    \qquad (20.1.27)
The likelihood is found from the recursion p(y|x) = \prod_t p(y_t|y_{1:t-1}, x).
Direction Bias
The IOHMM and related conditionally trained models ‘suffer’ from the fact that
any prediction p(vt |h1:T ) in fact depends only on the past p(vt |h1:t ). This is not
true, of course, of the most likely output sequence. Such ‘direction bias’ is identified
in some sections of the literature (particularly in natural language modelling) as
problematic, and motivates the use of undirected models, such as the Conditional
Random Field.
Figure 20.8: A simple way to transform continuous signals into discrete signals is
to use vector quantisation. (a) After preprocessing, a section of speech is repre-
sented by a trajectory through a high dimensional space (here depicted as three
dimensions). For example, we represent one trajectory by the dotted line. Many
different utterances of the same word will hopefully produce similar trajectories to
the mean trajectory (here shown as the solid curve). Codebook vectors are rep-
resented by circles. Points in this space on the mean trajectory that are equally
separated in time are represented by a small dot. (b) A novel trajectory (the tri-
angles) is compared to the codebook vectors so that it can be transformed into a
string, here abcdefhhhjk. Note, however, that this string does not take the time
aspect into account. To map this into a string which represents the state of the
system at equally spaced times, this would be aabcddeffhhhjjkk.
at a reasonable reading rate, occupy over two megabytes of disk space. If printed,
it would occupy around a kilobyte. There is therefore a considerable amount of
compression involved in Automatic Speech Recognition (ASR).
There are various methods of proceeding from this point, but the most fundamental
and conceptually simplest is to take a Discrete Fourier Transform (DFT) of a short
chunk of the signal, referred to as the part of the signal inside a window. Imagine
that we have a sound and something like a harp, the strings of which can resonate
to particular frequencies. For any sound whatever, each string of the harp will
resonate to some extent, as it absorbs energy at the resonant frequency from the
input sound. So we can represent the input sound by giving the amount of energy
in each frequency which the harp extracts, the so-called energy spectrum.
We take, then, some time interval, compute the fast Fourier transform (FFT) and
then obtain the power spectrum of the wave form of the speech signal in that time
interval of, perhaps, 32 msec. Then we slide the time interval, the window, down
the signal, leaving some overlap in general, and repeat. We do this for the entire
length of the signal, thus getting a sequence of perhaps ninety vectors, each vector
in dimension perhaps 256, each of the 256 components being an estimate of the
energy in some frequency interval between, say, 80 Hertz and ten kHz.
Practical problems arise from trying to sample a signal having one frequency with a
sampling rate at another; this is called ‘aliasing’ in the trade, and is most commonly
detected when the waggon wheels on the Deadwood Stage go backwards, or a news
program cameraman points his camera at somebody’s computer terminal and gets
that infuriating black band drifting across the screen and the flickering that makes
the thing unwatchable. There is a risk that high frequencies in the speech signal
will be sampled at a lower frequency and will manifest themselves as a sort of
flicker. So it is usual to kill off all frequencies not being explicitly looked for,
by passing the signal through a filter which will not pass very high or very low
frequencies. Very high usually means more than half the sampling frequency, and
very low means little more than the mains frequency.
The 256 numbers may usefully be ‘binned’ into some smaller number of frequency
bands, perhaps sixteen of them, also covering the acoustic frequency range.
This approach turns the utterance into a longish sequence of vectors representing
the time development of the utterance or, more productively, a trajectory.
Many repetitions of the same word by the same speaker might reasonably be ex-
pected to be described as trajectories which are fairly close together. If we have a
family of trajectories corresponding to one person saying ‘yes’ and another family
corresponding to the same person saying ‘no’, then if we have an utterance of one
of those words by the same speaker and wish to know which it is, then some com-
parison between the new trajectory and the two families we already have, should
allow us to make some sort of decision as to which of the two words we think most
likely to have been uttered.
Put in this form, we have opened up a variant of traditional pattern recognition
which consists of distinguishing not between different categories of point in a space,
but different categories of trajectory in the space. Everything has become time
dependent; we deal with changing states.
An example of such a procedure is given in fig(20.8). There we see that a particular
speech signal has been transformed into a string of states. What we would like
to do is then, given a set of different utterances of the same word, with their
corresponding strings, to learn a representation for the transitions between the
states of these strings. This is where the HMM comes in. For each state (one of
the symbols in the string) there is a probability (that we need to learn) of either
staying in the same state (holding probability) or switching to one of the other
states. We can use standard HMM algorithms to learn such transitions.
Given then two models, say one of “yes” and the other of “no”, how do we use these
models to classify a novel utterance? The way that we do this is to find for which
of the two models was the sequence more likely. For example, imagine that we have
an utterance, "yeh". Then we wish to find under which model this utterance is the
most likely. That is, we compare p("yeh"|model "yes") with p("yeh"|model "no").
To calculate these likelihoods, we can use the standard marginalisation techniques
for graphical models.
A book by one of the leading speech recognition experts is available online at
http://labrosa.ee.columbia.edu/doc/HTKBook21/HTKBook.html.
BioInformatics
Biological sequences are often successfully modelled by HMMs, and have many
interesting and powerful applications in BioInformatics, for example for multiple
sequence alignment.
See http://www.comp.lancs.ac.uk/ucrel/annotation.html#POS
for a statement of the problem and some probabilistic solutions. For example, we
have a sentence, as below, in which each word has been linguistically tagged (eg
NN is the singular common noun tag, ATI is the article tag etc.).
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_,
but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV
to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_!
One can attempt a solution to these tagging problems by using a HMM to model
the way that tag to tag transitions tend to occur from a corpus of tagged word
sequences. This forms the hidden space dynamics. An emission probability to go
from a tag to an observed word is also used, so that then for a novel sequence of
words, the most likely tag (hidden) sequence can be inferred.
One of the original, and still very common applications of HMMs is in tracking.
They have been particularly successful in tracking moving objects, whereby an
understanding of newtonian dynamics in the hidden space, coupled with an un-
derstanding of how an object with a known position and momentum would appear
on the screen/radar image, enables one to infer the position and momentum of an
object based only on radar. This has obvious military applications and is one of
the reasons that some of the algorithms associated with HMMs and related mod-
els were classified until recently (although doing the inference was probably well
understood anyway!).
236
20.3 Problems
Exercise 56 Consider a HMM with 3 states (M = 3) and 2 output symbols, with
a left-to-right state transition matrix
0.5 0.0 0.0
A = 0.3 0.6 0.0
0.2 0.4 1.0
where Aij ≡ p(h(t + 1) = i|h(t) = j), an output probabilities matrix Bij ≡ p(v(t) =
i|h(t) = j)
0.7 0.4 0.8
B=
0.3 0.6 0.2
and an initial state probabilities vector π = (0.9 0.1 0.0)T . Given that the observed
symbol sequence is 011, compute
(i) P (v1:T )
(ii) P (h1 |v1:T ). [As there are 3 observations the HMM will have three time
slices—you are asked to compute the posterior distribution of the state vari-
able in the second time slice, numbering the times 0, 1, 2.] You can check
this calculation by setting up the HMM in JavaBayes.
(iii) Find the best hidden state sequence given a sequence of observations, and
apply it to the model (Viterbi algorithm)
Exercise 57 Suppose the matrix A above had its columns all equal to the initial
probabilities vector π. In this case the HMM reduces to a simpler model—what is
it?
Exercise 59 Consider the problem : Find the most likely joint output sequence
v1:T for a HMM. That is,
arg max p(v1:T )
v1:T
where
p(h_{1:T}, v_{1:T}) = \prod_t p(v_t|h_t) \, p(h_t|h_{t-1})
(i) Explain how the above problem can be formulated as a mixed max-product/sum-product
criterion.
(ii) Explain why a local message passing algorithm cannot, in general, be found
for this problem to guarantee to find the optimal solution.
(iii) Explain how to adapt the Expectation-Maximisation algorithm to form a re-
cursive algorithm, with local message passing, to guarantee at each stage of
the EM algorithm an improved joint output state.
Exercise 60 Explain how to train a HMM using EM, but with a constrained tran-
sition matrix. In particular, explain how to learn a transition matrix with a trian-
gular structure.
20.4 Solutions
21 Dynamic Continuous Hiddens : Linear Dynamical Systems
If, for example, the hidden variables follow the deterministic linear dynamics
h_{t+1} = A h_t
for a rotation matrix A, and we observe only
v(t) = [h_t]_1,
namely the projection of the hidden variable dynamics, then v would describe a
sinusoid through time. More generally, we could consider a model
v(t) = B h(t)
which linearly relates the visible variable v(t) to the hidden dynamics at time t.
This is therefore a linear dynamical system.
A drawback to the above models is that they are all deterministic. To account for
possible stochastic behaviour, we generalise the above to
h_t = A h_{t-1} + η_t^h
v_t = B h_t + η_t^v
where η_t^h and η_t^v are noise vectors. As a graphical model, we write
p(h_{1:T}, v_{1:T}) = p(h_1) p(v_1|h_1) \prod_{t=2}^{T} p(h_t|h_{t-1}) p(v_t|h_t)
where p(h_{t+1}|h_t) = N(A h_t, Σ_H), which states that h_{t+1} has a mean equal to A h_t and has Gaussian fluctuations
described by the covariance matrix Σ_H.
Figure 21.1: A LGSSM with hidden variables h_1, . . . , h_4 and visible variables v_1, . . . , v_4. Both hidden and visible variables are Gaussian distributed.
Similarly,
p(v_t|h_t) = \frac{1}{\sqrt{|2πΣ_V|}} \exp\left( -\frac{1}{2} (v_t - B h_t)^T Σ_V^{-1} (v_t - B h_t) \right)
p(h_1) ∼ N(µ, Σ)
The above defines a stationary LGSSM since the parameters of the model are fixed
through time. The non-stationary case allows for different parameters at each time
step, for example ΣV (t). The above definitions are for the first order LGSSM, since
the hidden state depends on only the first previous hidden state – the extension
to higher order variants, where the hidden state depends on several past hidden
states, is straightforward. We could also consider having an external known input
at each time, which will change the mean of the hidden variable. The generalisa-
tion to this case is straightforward, and left as an exercise for the interested reader.
21.1 Inference
Consider an observation sequence v1:T . How can we infer the marginals of the
hiddens p(ht |v1:T )?
We cannot in this case directly use the JTA, since we cannot in general pass the
table entries for a continuous distribution – there are effectively an infinite number
of table entries. However, since Gaussians are fully specified by a small set of
parameters (the sufficient statistics), namely their mean and covariance matrix,
we can alternatively pass parameters during the absorption procedure to ensure
consistency. The reader is invited to carry out this procedure (or the alternative
Belief Propagation method). Whilst this scheme is valid, the resulting recursions
are numerically complex, and may be unstable. One approach to avoid this is
to make use of the Matrix Inversion Lemma to reduce the recursions to avoid
unnecessary matrix inversions. An alternative is to use the RTS-smoothing style
scheme that we introduced in the HMM chapter.
We can find the joint distribution p(ht , vt |v1:t−1 ), and then condition on vt to easily
find the distribution p(ht |v1:t ). The term p(ht , vt |v1:t−1 ) is a Gaussian and can be
found easily using the relations
v_t = B h_t + η^v,    h_t = A h_{t-1} + η^h.
Conditioning on v_t then gives a Gaussian p(h_t|v_{1:t}) with mean f_t and covariance
F_t ≡ ⟨Δh_t Δh_t^T | v_{1:t-1}⟩ − ⟨Δh_t Δv_t^T | v_{1:t-1}⟩ ⟨Δv_t Δv_t^T | v_{1:t-1}⟩^{-1} ⟨Δv_t Δh_t^T | v_{1:t-1}⟩
A nice thing about the above approach is that we work always in the moment
representation, and the iteration is expected to be numerically stable when the
noise covariances are small. This procedure is called the Forward Pass in the
LGSSM inference algorithm (albeit with a change to the standard notation in the
literature for representing the filtered posterior).
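To make the recursion concrete, here is a minimal Python sketch of this moment-form Forward Pass. The variable names (A, B, SigmaH, SigmaV, mu0, Sigma0) are my own labels for the model parameters; this is an illustrative sketch, not a numerically hardened implementation.

import numpy as np

def lgssm_filter(v, A, B, SigmaH, SigmaV, mu0, Sigma0):
    """Forward (filtering) pass for a stationary LGSSM.

    Returns the filtered means f[t] and covariances F[t] of p(h_t | v_{1:t}),
    working throughout in the moment representation.
    """
    T = len(v)
    H = A.shape[0]
    f = np.zeros((T, H))
    F = np.zeros((T, H, H))
    mu_prior, S_prior = mu0, Sigma0           # statistics of p(h_1)
    for t in range(T):
        # Joint Gaussian p(h_t, v_t | v_{1:t-1}) from h = A h_prev + eta_h, v = B h + eta_v.
        Shh = S_prior
        Svv = B @ Shh @ B.T + SigmaV
        Shv = Shh @ B.T
        mu_v = B @ mu_prior
        # Condition on the observed v_t (Gaussian conditioning formula).
        K = Shv @ np.linalg.inv(Svv)
        f[t] = mu_prior + K @ (v[t] - mu_v)
        F[t] = Shh - K @ Shv.T
        # Propagate one step to obtain the statistics of p(h_{t+1} | v_{1:t}).
        mu_prior = A @ f[t]
        S_prior = A @ F[t] @ A.T + SigmaH
    return f, F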
In principle, we can apply the Belief Propagation method to form a backpass to find
p(h_t|v_{1:T}) (see barberieee). However, we would like to avoid defining λ messages
here since it is awkward to extend BP to the SKF case. Here we show, for
the simple case of the Kalman Filter, how a smoothing backpass can be formed
without defining λ messages. Instead we form directly a recursion for the smoothed
distribution p(ht |v1:T ).
Imagine that we have completed a forward pass, so that we have, for the KF,
the filtered distributions p(ht |v1:t ). We can form a recursion for the smoothed
posteriors p(ht |v1:T ), directly without using λ recursions as follows:
p(h_t|v_{1:T}) ∝ \sum_{h_{t+1}} p(h_t|v_{1:T}, h_{t+1}) p(h_{t+1}|v_{1:T})    (21.1.1)
∝ \sum_{h_{t+1}} p(h_t|v_{1:t}, h_{t+1}) p(h_{t+1}|v_{1:T})    (21.1.2)
The term p(ht |v1:t , ht+1 ) can be found by conditioning the joint distribution p(ht , ht+1 |v1:t ) =
p(ht+1 |ht )p(ht |v1:t ). We can work out this joint distribution in the usual manner
1 p(x|y) is a Gaussian with mean µ_x + Σ_{xy} Σ_{yy}^{-1} (y − µ_y) and covariance Σ_{xx} − Σ_{xy} Σ_{yy}^{-1} Σ_{yx}.
by finding its mean and covariance. The term p(ht |v1:t ) is a known Gaussian from
the Forward Pass with mean ft and covariance Ft . Hence the joint distribution
p(h_t, h_{t+1}|v_{1:t}) has means ⟨h_t|v_{1:t}⟩ = f_t and ⟨h_{t+1}|v_{1:t}⟩ = A f_t.
To find the conditional distribution p(h_t|v_{1:t}, h_{t+1}), we use the conditioned Gaussian
results, which say that the conditional mean is
⟨h_t|v_{1:t}⟩ + ⟨Δh_t Δh_{t+1}^T | v_{1:t}⟩ ⟨Δh_{t+1} Δh_{t+1}^T | v_{1:t}⟩^{-1} (h_{t+1} − ⟨h_{t+1}|v_{1:t}⟩)
This constitutes a reversal of the dynamics,
h_t = ←A_t h_{t+1} + ←m_t + ←η_t
where
←A_t ≡ ⟨Δh_t Δh_{t+1}^T | v_{1:t}⟩ ⟨Δh_{t+1} Δh_{t+1}^T | v_{1:t}⟩^{-1}
←m_t ≡ ⟨h_t|v_{1:t}⟩ − ⟨Δh_t Δh_{t+1}^T | v_{1:t}⟩ ⟨Δh_{t+1} Δh_{t+1}^T | v_{1:t}⟩^{-1} ⟨h_{t+1}|v_{1:t}⟩
and ←η_t ∼ N(0, ←Σ_t). Then p(h_t|v_{1:T}) is a Gaussian distribution with mean
g_t ≡ ⟨h_t|v_{1:T}⟩ = ←A_t ⟨h_{t+1}|v_{1:T}⟩ + ←m_t = ←A_t g_{t+1} + ←m_t
and covariance
G_t ≡ ⟨Δh_t Δh_t^T | v_{1:T}⟩ = ←A_t ⟨Δh_{t+1} Δh_{t+1}^T | v_{1:T}⟩ ←A_t^T + ←Σ_t ≡ ←A_t G_{t+1} ←A_t^T + ←Σ_t
In this way, we directly find the smoothed posterior without defining the problem-
atic λ messages. This procedure is equivalent to the Rauch-Tung-Striebel Kalman
smoother[40]. A key trick was dynamics reversal. This is sometimes called a ‘cor-
rection’ method since it takes the filtered estimate p(ht |v1:t ) and ‘corrects’ it to
form a smoothed estimate p(ht |v1:T ).
This procedure is called the Backward Pass in the LGSSM inference algorithm
(albeit with a change to the standard notation in the literature for representing
the smoothed posterior).
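A minimal Python sketch of this Backward Pass is given below, using the filtered means and covariances f, F returned by the forward-pass sketch above; the variable names are my own.

import numpy as np

def lgssm_smooth(f, F, A, SigmaH):
    """Backward (RTS-style) pass: 'corrects' the filtered p(h_t|v_{1:t})
    into the smoothed p(h_t|v_{1:T}) via dynamics reversal."""
    T, H = f.shape
    g = f.copy()        # smoothed means; g[T-1] equals the filtered f[T-1]
    G = F.copy()        # smoothed covariances
    for t in range(T - 2, -1, -1):
        # Joint statistics of (h_t, h_{t+1}) given v_{1:t}.
        S_cross = F[t] @ A.T                       # <dh_t dh_{t+1}^T | v_{1:t}>
        S_next = A @ F[t] @ A.T + SigmaH           # <dh_{t+1} dh_{t+1}^T | v_{1:t}>
        Aback = S_cross @ np.linalg.inv(S_next)    # reversed dynamics matrix
        mback = f[t] - Aback @ (A @ f[t])          # reversed dynamics offset
        Sback = F[t] - Aback @ S_cross.T           # reversed dynamics noise covariance
        # Average the reversed dynamics over the smoothed p(h_{t+1}|v_{1:T}).
        g[t] = Aback @ g[t + 1] + mback
        G[t] = Aback @ G[t + 1] @ Aback.T + Sback
    return g, G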
An advantage of the probabilistic interpretation given above is that the cross moment,
which is required for learning, is given by
⟨h_{t-1} h_t^T⟩_{p(h_{t-1}, h_t | v_{1:T})} = ←A_{t-1} G_t + g_{t-1} g_t^T
The likelihood p(v_{1:T}) is often required. To compute this, the simplest way is to
use the recursion p(v_{1:t}) = p(v_t|v_{1:t-1}) p(v_{1:t-1}), where each factor p(v_t|v_{1:t-1})
is a Gaussian, N(µ_t, Σ_t), whose statistics follow from the forward pass, and where, at time 1,
µ_1 ≡ Bµ,    Σ_1 ≡ BΣB^T + Σ_V
MAP vs Marginal
In general, we have seen that there is a difference between the most probable joint
posterior state, and the joint posterior mean. However, in the case of Gaussians,
there is no difference between these two. The interested reader is invited to show
formally that the most likely state of a Gaussian is its mean. Hence, inferring the
most likely hidden state is equivalent to finding the marginal, that is, the mean of
the hidden variables.
A^{new} = \left( \sum_{t=1}^{T-1} ⟨h_{t+1} h_t^T⟩ \right) \left( \sum_{t=1}^{T-1} ⟨h_t h_t^T⟩ \right)^{-1}
B^{new} = \left( \sum_{t=1}^{T} v_t ⟨h_t⟩^T \right) \left( \sum_{t=1}^{T} ⟨h_t h_t^T⟩ \right)^{-1}
If B is updated according to the above, the reader may show that the first equation
can be simplified to
Σ_V^{new} = \frac{1}{T} \sum_t \left( v_t v_t^T − v_t ⟨h_t⟩^T B^T \right)
Restricted forms of the matrices are also easy to deal with. For example,
it may be that one wishes to search for independent generating processes, in which
case A will have a block diagonal structure. This restriction is straightforward to
impose and is left as an exercise for the reader.
The last two equations are solved by Gaussian Elimination. The averages in the
above equations are the posterior averages conditioned on the visible variables –
these are given by the Kalman Smoother routine.
The extension of learning to multiple time series is straightforward and left as an
exercise for the reader.
These equations are then discrete time difference equations indexed by t̃. However,
the instrument which measures x(t) and y(t) is not completely accurate. What is
actually measured is x̂(t) and ŷ(t), which are noisy versions of x(t) and y(t). For
simplicity, we relabel ax (t) = fx (t)/m(t), ay (t) = fy (t)/m(t) – these accelerations
will be assumed to be roughly constant, but unknown:
a_x(t̃ + 1) = a_x(t̃) + η_x
where η_x is a very small noise term. The prior for a_x is chosen to be vague – a
zero mean Gaussian with large variance. A similar equation holds for the ay . (Of
course, another approach would be to assume strictly constant accelerations and
learn them).
One way to describe the above approach is to consider x(t), y(t), x′ (t), y ′ (t), ax (t)
and ay (t) as hidden variables. We can put a large variance prior on their initial
values, and attempt to infer the unknown trajectory. A simple demonstration for
this is given in fig(24.3), for which the code is given in the text. It is pleasing
how well the Kalman Filter infers the object trajectory despite the large amount
of measurement noise.
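The code referred to above is not reproduced here; the following is a rough Python sketch of how the hidden state and system matrices for such a tracking model might be set up. The time step dt and noise levels are illustrative values, not those used for the figure.

import numpy as np

dt = 1.0     # illustrative discretisation step
# Hidden state: (x, y, x', y', ax, ay) -- position, velocity, acceleration.
A = np.array([[1, 0, dt, 0, 0, 0],
              [0, 1, 0, dt, 0, 0],
              [0, 0, 1, 0, dt, 0],
              [0, 0, 0, 1, 0, dt],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)
# Only the noisy positions are observed.
B = np.array([[1, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0]], dtype=float)
SigmaH = np.diag([0, 0, 0, 0, 1e-4, 1e-4])   # tiny noise on the roughly constant accelerations
SigmaV = 50.0**2 * np.eye(2)                 # large measurement noise
mu0 = np.zeros(6)
Sigma0 = 1e6 * np.eye(6)                     # vague prior on the initial hidden state
# Filtering and smoothing then proceed exactly as in the lgssm_filter/lgssm_smooth sketches above.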
21.3 Problems
Exercise 61 A scalar R-th order Autoregressive (AR) model is defined as
v_{t+1} = \sum_{i=1}^{R} a_i v_{t-i} + η_{t+1}
21.4 Solutions
22 Switching Linear Dynamical Systems
The (augmented) SLDS defines a joint distribution
p(v_{1:T}, h_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(s_t|s_{t-1}, h_{t-1}) p(h_t|h_{t-1}, s_t) p(v_t|h_t, s_t)
with
p(v_t|h_t, s_t) = N(v̄(s_t) + B(s_t) h_t, Σ_v(s_t)),    p(h_t|h_{t-1}, s_t) = N(h̄(s_t) + A(s_t) h_{t-1}, Σ_h(s_t))
At time t = 1, p(s_1|h_0, s_0) simply denotes the prior p(s_1), and p(h_1|h_0, s_1) denotes
p(h_1|s_1).
The SLDS is used in many disciplines, from econometrics to machine learning
[41, 42, 43, 44, 45, 46]. The aSLDS has been used, for example, in state-duration
modelling in acoustics [47] and econometrics [48]. See [49] and [50] for recent
reviews of work.
The SLDS can be thought of as a marriage between a Hidden Markov Model and
a Linear Dynamical system. Each of these two models are tractable. However, the
SLDS is computationally intractable, and requires specialised approximations.
1 These systems also go under the names Jump Markov model/process, switching Kalman Filter,
Switching Linear Gaussian State Space models, Conditional Linear Gaussian Models.
2 The notation x_{1:T} is shorthand for x_1, . . . , x_T.
Figure 22.1: The independence structure of the aSLDS, with discrete switch variables s_1, . . . , s_4, continuous hidden variables h_1, . . . , h_4 and visible variables v_1, . . . , v_4. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered.
Inference
We consider here the filtered estimate p(ht , st |v1:t ) and the smoothed estimate
p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both filtered and smoothed inference in the
SLDS is intractable, scaling exponentially with time [49]. To see this informally,
consider the filtered posterior, which may be recursively computed using
p(s_t, h_t|v_{1:t}) = \sum_{s_{t-1}} \int_{h_{t-1}} p(s_t, h_t|s_{t-1}, h_{t-1}, v_t) p(s_{t-1}, h_{t-1}|v_{1:t-1})    (22.0.3)
At timestep 1, p(s1 , h1 |v1 ) = p(h1 |s1 , v1 )p(s1 |v1 ) is an indexed set of Gaussians.
At timestep 2, due to the summation over the states s1 , p(s2 , h2 |v1:2 ) will be an
indexed set of S Gaussians; similarly at timestep 3, it will be S 2 and, in general,
gives rise to S t Gaussians.
Readers familiar with Assumed Density Filtering may wish to continue directly
to section (22.1.3). Our aim is to form a recursion for p(st , ht |v1:t ), based on a
Gaussian mixture approximation3 of p(ht |st , v1:t ). Without loss of generality, we
may decompose the filtered posterior as
p(s_t, h_t|v_{1:t}) = p(h_t|s_t, v_{1:t}) p(s_t|v_{1:t})
The exact representation of p(h_t|s_t, v_{1:t}) is a mixture with O(S^t) components.
We therefore approximate this with a smaller I-component mixture
p(h_t|s_t, v_{1:t}) ≈ \sum_{i_t=1}^{I} p(h_t|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t})
3 This derivation holds also for the aSLDS, unlike that presented in [52].
where p(ht |it , st , v1:t ) is a Gaussian parameterised with mean4 f (it , st ) and covari-
ance F (it , st ). To find a recursion for these parameters, consider
p(h_{t+1}|s_{t+1}, v_{1:t+1}) = \sum_{s_t, i_t} p(h_{t+1}, s_t, i_t|s_{t+1}, v_{1:t+1})
= \sum_{s_t, i_t} p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) p(s_t, i_t|s_{t+1}, v_{1:t+1})    (22.1.2)
We find p(ht+1 |st , it , st+1 , v1:t+1 ) from the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ),
which is a Gaussian with covariance and mean elements5
Σ_{hh} = A(s_{t+1}) F(i_t, s_t) A^T(s_{t+1}) + Σ_h(s_{t+1})
Σ_{vv} = B(s_{t+1}) Σ_{hh} B^T(s_{t+1}) + Σ_v(s_{t+1})
Σ_{vh} = B(s_{t+1}) Σ_{hh}
µ_v = B(s_{t+1}) A(s_{t+1}) f(i_t, s_t)
µ_h = A(s_{t+1}) f(i_t, s_t)    (22.1.3)
These results are obtained from integrating the forward dynamics, Equations
(22.0.1,22.0.2) over ht , using the results in Appendix (G.2). To find p(ht+1 |st , it , st+1 , v1:t+1 )
we may then condition p(ht+1 , vt+1 |st , it , st+1 , v1:t ) on vt+1 using the results in Ap-
pendix (G.1).
p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t )
(22.1.4)
The first factor in equation (22.1.4), p(vt+1 |it , st , st+1 , v1:t ) is given as a Gaussian
with mean µv and covariance Σvv , as given in equation (22.1.3). The last two
factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally,
p(st+1 |it , st , v1:t ) is found from
p(st+1 |it , st , v1:t ) = hp(st+1 |ht , st )ip(ht |it ,st ,v1:t ) (22.1.5)
where h·ip denotes expectation with respect to p. In the standard SLDS, equation
(22.1.5) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however,
equation (22.1.5) will generally need to be computed numerically. A simple ap-
proximation is to evaluate equation (22.1.5) at the mean value of the distribution
p(ht |it , st , v1:t ). To take covariance information into account an alternative would
be to draw samples from the Gaussian p(ht |it , st , v1:t ) and thus approximate the
average of p(st+1 |ht , st ) by sampling6 .
4 Strictly speaking, we should use the notation ft (it , st ) since, for each time t, we have a set of
means indexed by it , st . This mild abuse of notation is used elsewhere in the paper.
5 We derive this for h̄_{t+1}, v̄_{t+1} ≡ 0, to ease notation.
6 Whilst we suggest sampling as part of the aSLDS update procedure, this does not equate this
with a sequential sampling procedure, such as Particle Filtering. The sampling here is a form of
exact sampling, for which no convergence issues arise, being used only to numerically compute
equation (22.1.5).
We are now in a position to calculate equation (22.1.2). For each setting of the
variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse
back to I Gaussians to form
p(h_{t+1}|s_{t+1}, v_{1:t+1}) ≈ \sum_{i_{t+1}=1}^{I} p(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1}) p(i_{t+1}|s_{t+1}, v_{1:t+1})
where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ).
The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ),
where
p(v_{t+1}|v_{1:t}) = \sum_{i_t, s_t, s_{t+1}} p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t})
In the above expression, all terms have been computed in forming the recursion
for the filtered posterior p(ht+1 , st+1 |v1:t+1 ).
The user may provide any algorithm of their choice for collapsing a set of Gaussians
to a smaller set of Gaussians [53]. Here, to be explicit, we present a simple one
which is fast, but has the disadvantage that no spatial information about the
mixture is used.
First, we describe how to collapse a mixture to a single Gaussian: we may collapse
a mixture of Gaussians p(x) = \sum_i p_i N(x|µ_i, Σ_i) to a single Gaussian with mean
µ ≡ \sum_i p_i µ_i and covariance \sum_i p_i (Σ_i + µ_i µ_i^T) − µµ^T.
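In code, this moment-matching collapse might look as follows (a minimal Python sketch):

import numpy as np

def collapse(weights, means, covs):
    """Collapse a Gaussian mixture sum_i p_i N(x | mu_i, Sigma_i)
    to a single moment-matched Gaussian."""
    mu = sum(p * m for p, m in zip(weights, means))
    Sigma = sum(p * (S + np.outer(m, m)) for p, S, m in zip(weights, covs, means))
    Sigma -= np.outer(mu, mu)
    return mu, Sigma

# Example: collapse a two-component mixture in one dimension.
mu, Sigma = collapse([0.3, 0.7],
                     [np.array([-1.0]), np.array([2.0])],
                     [np.array([[0.5]]), np.array([[1.0]])])
print(mu, Sigma)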
More sophisticated methods which retain some spatial information would clearly
be potentially useful. The method presented in [43] is a suitable approach which
considers removing Gaussians which are spatially similar (and not just low-weight
components), thereby retaining a sense of diversity over the possible solutions.
The main difficulty is to find a suitable way to ‘correct’ the filtered posterior
p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ).
We initially derive this for the case of a single Gaussian representation. The ex-
tension to the mixture case is straightforward and is given in section (22.1.5). Our
derivation holds for both the SLDS and aSLDS. We approximate the smoothed
posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ), and
our aim is to find a recursion for these parameters. A useful starting point for a
recursion is:
p(h_t, s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, v_{1:T})
The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as
p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (22.1.7)
The difficulty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared
exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian. One
possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a
Gaussian (or mixture thereof) by minimising the Kullback-Leibler divergence be-
tween the two, or performing moment matching in the case of a single Gaussian. A
simpler alternative is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ),
see fig(22.2). This makes life easy since p(ht+1 |st+1 , v1:T ) is already known from
the previous backward recursion. Under this assumption, the recursion becomes
p(h_t, s_t|v_{1:T}) ≈ \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(s_t|s_{t+1}, v_{1:T}) ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}    (22.1.8)
hp(ht |ht+1 , st , st+1 , v1:t )ip(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will
now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from
the joint distribution
p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (22.1.9)
which itself can be found from a forward dynamics from the filtered estimate
p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of
p(ht |st , v1:t ), since st+1 carries no extra information about ht 8 . The only remaining
7 Equation (22.1.8) has the pleasing form of an RTS backpass for the continuous part (analogous
to LDS case), and a discrete smoother (analogous to a smoother recursion for the HMM). In
the standard Forward-Backward algorithm for the HMM [37], the posterior γt ≡ p(st |v1:T )
is formed from the product of αt ≡ p(st |v1:t ) and βt ≡ p(vt+1:T |st ). This approach is also
analogous to EP [38]. In the correction approach, a direct recursion for γt in terms of γt+1
and αt is formed, without explicitly defining βt . The two approaches to inference are known
as α − β and α − γ recursions.
8 Integrating over h
t+1 means that the information from st+1 passing through ht+1 via the term
p(ht+1 |st+1 , ht ) vanishes. Also, since st is known, no information from st+1 passes through st
to ht .
uncomputed statistics are the mean of h_{t+1}, the covariance of h_{t+1} and the cross-covariance
between h_t and h_{t+1}, which are given by
⟨h_{t+1}⟩ = A(s_{t+1}) f(s_t),    cov(h_{t+1}) = A(s_{t+1}) F(s_t) A^T(s_{t+1}) + Σ_h(s_{t+1}),    cov(h_t, h_{t+1}) = F(s_t) A^T(s_{t+1})
Given the statistics of equation (22.1.9), we may now condition on ht+1 to find
p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}). Doing so effectively constitutes a reversal of the dynamics,
h_t = ←A(s_t, s_{t+1}) h_{t+1} + ←η(s_t, s_{t+1})
where ←A(s_t, s_{t+1}) and ←η(s_t, s_{t+1}) ∼ N(←m(s_t, s_{t+1}), ←Σ(s_t, s_{t+1})) are easily found using the
conditioned Gaussian results in Appendix (G.1). Averaging the above reversed dynamics
over p(h_{t+1}|s_{t+1}, v_{1:T}), we find that ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}
is a Gaussian with statistics
µ_t = ←A(s_t, s_{t+1}) g(s_{t+1}) + ←m(s_t, s_{t+1}),    Σ_{t,t} = ←A(s_t, s_{t+1}) G(s_{t+1}) ←A^T(s_t, s_{t+1}) + ←Σ(s_t, s_{t+1})
The remaining factor in equation (22.1.8) may be written as
p(s_t|s_{t+1}, v_{1:T}) = ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}    (22.1.10)
Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the
forward pass, equation (22.1.5). In equation (22.1.11), p(ht+1 |st+1 , st , v1:t ) is found
by marginalising equation (22.1.9).
Computing the average of equation (22.1.11) with respect to p(ht+1 |st+1 , v1:T ) may
be achieved by any numerical integration method desired. The simplest approxi-
mation is to evaluate the integrand at the mean value of the averaging distribution9
p(ht+1 |st+1 , v1:T ). Otherwise, sampling from the Gaussian p(ht+1 |st+1 , v1:T ), has
the advantage that covariance information is used10 .
We have now computed both the continuous and discrete factors in equation
(22.1.8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) =
p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the
joint equation (22.1.8) by conditioning on st to form the mixture
p(h_t|s_t, v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|s_t, v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T})
which may be collapsed to a single Gaussian (or mixture if desired). The smoothed
posterior p(s_t|v_{1:T}) is given by
p(s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(s_t|s_{t+1}, v_{1:T})
= \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.    (22.1.12)
Numerical Stability
Numerical stability is a concern even in the LDS, and the same is to be expected for
the aSLDS. Since the standard LDS recursions LDSFORWARD and LDSBACKWARD
are embedded within the EC algorithm, we may immediately take advantage of
the large body of work on stabilizing the LDS recursions, such as the Joseph or
square root forms [54].
22.1.4 Remarks
The standard-EC Backpass procedure is closely related to Kim’s method [55, 45].
In both standard-EC and Kim’s method, the approximation
p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple
backward pass. The other ‘approximation’ in EC is to numerically compute the
average in equation (22.1.12). In Kim’s method, however, an update for the dis-
crete variables is formed by replacing the required term in equation (22.1.12) by
hp(st |ht+1 , st+1 , v1:t )ip(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (22.1.13)
This approximation11 decouples the discrete backward pass in Kim’s method from
the continuous dynamics, since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t )
9 Replacing h_{t+1} by its mean gives the simple approximation
⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} ≈ \frac{1}{Z} \frac{\exp\left( -\frac{1}{2} z_{t+1}^T(s_t, s_{t+1}) Σ^{-1}(s_t, s_{t+1}|v_{1:t}) z_{t+1}(s_t, s_{t+1}) \right)}{\sqrt{\det Σ(s_t, s_{t+1}|v_{1:t})}} p(s_t|s_{t+1}, v_{1:t})
where z_{t+1}(s_t, s_{t+1}) ≡ ⟨h_{t+1}|s_{t+1}, v_{1:T}⟩ − ⟨h_{t+1}|s_t, s_{t+1}, v_{1:t}⟩ and Z ensures normalisation
over s_t. Σ(s_t, s_{t+1}|v_{1:t}) is the filtered covariance of h_{t+1} given s_t, s_{t+1} and the observations
v_{1:t}, which may be taken from Σ_{hh} in equation (22.1.3).
10 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should
not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo
sampling scheme.
11 In the HMM, this is exact, but in the SLDS the future observations carry information about
st .
can be computed simply from the filtered results alone. The fundamental differ-
ence therefore between EC and Kim’s method is that the approximation, equation
(22.1.13), is not required by EC. The EC backward pass therefore makes fuller
use of the future information, resulting in a recursion which intimately couples the
continuous and discrete variables. Unlike [55] and [43], where gt , Gt ≡ ft , Ft and
only the backward pass mixture weights are updated from the forward pass, EC
actually changes the Gaussian parameters gt , Gt in a non-trivial way. The result-
ing effect on the quality of the approximation can be profound, as we will see in
the experiments.
The Expectation Propagation algorithm, discussed in more detail in section (22.2),
makes the central assumption, as in EC, of collapsing the posteriors to a Gaussian
family [50]. However, in EP, collapsing to a mixture of Gaussians is difficult –
indeed, even working with a single Gaussian may be numerically unstable. In con-
trast, EC works largely with moment parameterisations of Gaussians, for which
relatively few numerical difficulties arise. As explained in the derivation of equa-
tion (22.1.8), the conditional independence assumption p(ht+1 |st , st+1 , v1:T ) ≈
p(ht+1 |st+1 , v1:T ) is not strictly necessary in EC. We motivate it by computa-
tional simplicity, since finding an appropriate moment matching approximation
of p(ht+1 |st , st+1 , v1:T ) in equation (22.1.7) requires a relatively expensive non-
Gaussian integration. The important point here is that, if we did treat p(ht+1 |st , st+1 , v1:T )
more correctly, the only assumption in EC would be a collapse to a mixture of
Gaussians, as in EP. As a point of interest, as in EC, the exact computation re-
quires only a single forward and backward pass, whilst EP is an ‘open’ procedure
requiring iteration to convergence.
The average in the last line of the above equation can be tackled using the same
techniques as outlined in the single Gaussian case. To approximate p(ht |jt+1 , st+1 , it , st , v1:T )
we consider this as the marginal of the joint distribution
p(ht , ht+1 |it , st , jt+1 , st+1 , v1:T ) = p(ht |ht+1 , it , st , jt+1 , st+1 , v1:t )p(ht+1 |it , st , jt+1 , st+1 , v1:T )
As in the case of a single mixture, the problematic term is p(ht+1 |it , st , jt+1 , st+1 , v1:T ).
Analogously to before, we may make the assumption
p(h_{t+1}|i_t, s_t, j_{t+1}, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|j_{t+1}, s_{t+1}, v_{1:T}).
This mixture can then be collapsed to a smaller mixture using any method of choice,
to give
p(h_t|s_t, v_{1:T}) ≈ \sum_{j_t} p(j_t|s_t, v_{1:T}) p(h_t|j_t, v_{1:T})
The resulting algorithm is presented in 3.?? which includes using mixtures in both
forward and backward passes.
A similar approach is also taken in [43], which performs the collapse by removing spatially similar
Gaussians, thereby retaining diversity.
Several smoothing approaches directly use the results from ADF. The most
popular is Kim’s method, which updates the filtered posterior weights to
form the smoother. As discussed in section (22.1.4), Kim’s smoother cor-
responds to a potentially severe loss of future information and, in general,
cannot be expected to improve much on the filtered results from ADF. The
more recent work of [43] is similar in spirit to Kim’s method, whereby the
contribution from the continuous variables is ignored in forming an approx-
imate recursion for the smoothed p(st |v1:T ). The main difference is that for
the discrete variables, Kim’s method is based on a correction smoother, [40],
whereas Lerner’s method uses a Belief Propagation style backward pass [6].
Neither method correctly integrates information from the continuous vari-
ables. How to form a recursion for a mixture approximation, which does
not ignore information coming through the continuous hidden variables is a
central contribution of our work.
[44] used a two-filter method in which the dynamics of the chain are reversed.
Essentially, this corresponds to a Belief Propagation method which defines a
Gaussian sum approximation for p(vt+1:T |ht , st ). However, since this is not
a density in ht , st , but rather a conditional likelihood, formally one cannot
treat this using density propagation methods. In [44], the singularities re-
sulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically
finessed.
Expectation Propagation : EP [51] corresponds to an approximate implementation of Belief Propaga-
tion12 [6, 38]. Whilst EP may be applied to multiply-connected graphs,
it does not fully exploit the numerical advantages present in the singly-
connected aSLDS structure. Nevertheless, EP is the most sophisticated rival
to Kim’s method and EC, since it makes the least assumptions. For this rea-
son, we'll explain briefly how EP works. First, let's simplify the notation,
and write the distribution as p = \prod_t φ(x_{t-1}, v_{t-1}, x_t, v_t), where x_t ≡ h_t ⊗ s_t,
and φ (xt−1 , vt−1 , xt , vt ) ≡ p(xt |xt−1 )p(vt |xt ). EP defines ‘messages’ ρ, λ13
which contain information from past and future observations respectively14 .
Explicitly, we define ρt (xt ) ∝ p(xt |v1:t ) to represent knowledge about xt
given all information from time 1 to t. Similarly, λt (xt ) represents knowl-
edge about state xt given all observations from time T to time t + 1. In
the sequel, we drop the time suffix for notational clarity. We define λ(xt )
implicitly through the requirement that the marginal smoothed inference is
given by
p(xt |v1:T ) ∝ ρ (xt ) λ (xt ) (22.2.1)
Hence λ (xt ) ∝ p(vt+1:T |xt , v1:t ) = p(vt+1:T |xt ) and represents all future
knowledge about p(xt |v1:T ). From this
p(xt−1 , xt |v1:T ) ∝ ρ (xt−1 ) φ (xt−1 , vt−1 , xt , vt ) λ (xt ) (22.2.2)
12 Non-parametric belief propagation [57], which performs approximate inference in general con-
tinuous distributions, is also related to EP applied to the aSLDS, in the sense that the messages
cannot be represented easily, and are approximated by mixtures of Gaussians.
13 These correspond to the α and β messages in the Hidden Markov Model framework [37].
14 In this Belief Propagation/EP viewpoint, the backward messages, traditionally labeled as β,
Sequential Monte Carlo (Particle Filtering) : These methods form an approximate implementation of equation
(22.0.3), using a sum of delta functions to represent the posterior (see, for
example, [58]). Whilst potentially powerful, these non-analytic methods typ-
ically suffer in high-dimensional hidden spaces since they are often based on
naive importance sampling, which restricts their practical use. ADF is gen-
erally preferential to Particle Filtering since in ADF the approximation is
a mixture of non-trivial distributions, which is better at capturing the vari-
ability of the posterior. In addition, for applications where an accurate com-
putation of the likelihood of the observations is required (see, for example
[59]), the inherent stochastic nature of sampling methods is undesirable.
23 Gaussian Processes
VERY DRAFTY!!!
We model the underlying input-output relation as
y = f(x)    (23.1.1)
and we aim to choose a function f that fits the observed data well. We assume that the
data output we observe, t, has been corrupted with additive Gaussian noise,
t = f(x) + η    (23.1.2)
Note that it is not necessary to explicitly write normalising constants for distributions
since they are uniquely given by the normalisation condition for probabilities. In this
case the normalising constant is 1/(2πσ²)^{1/2}.
Together, the likelihood and prior on the function f complete the specification of
the model. We can then use Bayes rule to find the posterior distribution of f in
light of the observed data D.
P(f|t) = \frac{p(t|f) p(f)}{p(t)}    (23.1.5)
1 Although not necessary, this assumption is convenient so that the following theory is analytically tractable.
2 In equation (23.1.4) we use the notation v² to mean \sum_i v_i^2 for an arbitrary vector v.
which gives a measure of the confidence in the prediction hf (x)i. Note that if we
wish to make predictive error bars, we need to include possible noise corrupting
processes. For example, in the case of additive Gaussian noise, we believe that
the actual data points we observe are modelled by the process f (x) + η, where
η ∼ normal(0, σ²). Given that the posterior distribution is independent of the
noise process, this means that predictive error bars are given by simply adding σ²
to equation (23.1.8).
In the Bayesian framework, the quantity that we need in order to assess the fidelity
of a model M is its likelihood, p(t|M). This is also sometimes called the
"evidence". It will be simpler to explain how to use such quantities in the context
of a specific model, and we defer this discussion till section (23.2.4).
Figure 23.1: A set of 5 Gaussian basis functions. The model output for a particular x is given by a linear combination of the basis function values at that x, here given by the intersection of the basis curves with the line x = 0.8.
That is, the model is linear in the parameters w, although the output y depends
non-linearly on the input x. If we have several points that we wish to make a
prediction at, x1 . . . xl , then we write the prediction vector (y 1 . . . y l )T = y = Φw
where we have defined the design matrix Φji = φi (xj ). Note that an upper index
refers to the datapoint number and the lower to a basis component. In fig(23.1)
we plot 5 Gaussian basis functions. The model's value for, say, x = 0.8 is then a
linear combination of the basis function values at that input (given here by the
intersection of the basis curves with the vertical line x = 0.8).
We shall assume that, in addition to w, there may be some tunable (hyper)parameters,
such as the width of the basis functions and our belief about the noise level.
One needs to cover the input region of interest sufficiently well so that the func-
tional form of the output distribution p(f ) is expressive enough to capture the
kinds of models that we are interested in finding.
We assume that our prior belief about the function f can be expressed by a belief
about the distribution of the parameters w. Here we assume that this takes the
262
form of a Gaussian distribution, with zero mean, and a user specified variance:
p(w) = normal(0, α^{-1} I) ∝ \exp\left( -\frac{α}{2} w^2 \right)    (23.2.2)
Using this distribution, we can draw (say 6) random weight vectors w1 . . . w6 and
plot the corresponding functions w1 ·φ . . . w6 ·φ, where φ = {φ1 (x) . . . φk (x)} is a
vector of basis function values.
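The following is a minimal Python sketch (not part of the original practical) of drawing such random weight vectors and forming the corresponding functions; the basis centres and width are illustrative choices.

import numpy as np

alpha = 1.0                       # prior precision of the weights
centres = np.linspace(0, 1, 5)    # 5 Gaussian basis functions on [0, 1]
width = 0.1

def phi(x):
    """Vector of Gaussian basis function values at input x."""
    return np.exp(-0.5 * ((x - centres) / width) ** 2)

xs = np.linspace(0, 1, 200)
Phi = np.array([phi(x) for x in xs])          # design matrix, one row per input

rng = np.random.default_rng(0)
for _ in range(6):                            # 6 random functions from the prior
    w = rng.normal(0, 1 / np.sqrt(alpha), size=len(centres))
    y = Phi @ w                               # function values w . phi(x) at each x
    print(y[:3])                              # (plot y against xs in practice)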
The predictions becoming more confident towards the edges in the Gaussian basis
function case is simply a consequence of the form of the basis functions. This is
an important point - you only get out from the method answers consistent with
your model assumptions.
What is the posterior distribution p(y(x_1) . . . y(x_l)|t), induced by the Gaussian
weight posterior equation (23.2.4), for a set of chosen x_1 . . . x_l?
As we have seen, the distribution of the weights w of the GLM are determined
automatically through the Bayesian procedure. The only parameters that are left
to the user to control are the width of the basis functions, the noise belief, the scale
α and the number and type of the basis functions. Let’s denote such parameters
by Γ. It may be that we would like to carry out a Bayesian analysis for these
parameters too, so that we can assess the relevance of different model parameter
settings in light of the observed data.
In principle, this can be viewed as just another level in a hierarchy of models. The
determined Bayesian would assign a (hyper)prior to these parameters p(Γ) and
perform model averaging over them (just as we did in the weights w case),
⟨f(x)⟩ = \int f(x|w) p(w, Γ|t) \, dw \, dΓ = \int \left( \int f(x|w) p(w|Γ, t) \, dw \right) p(Γ|t) \, dΓ    (23.2.7)
where p(Γ|t) = p(t|Γ)p(Γ)/p(t) and p(Γ) is our prior belief about the (hyper)parameters.
The "evidence" p(t|Γ) is obtained by integrating over the weights, p(t|Γ) = \int p(t|w) p(w|Γ) \, dw.
Typically, the integrations in equation (23.2.7) are extremely difficult to carry out
(even if p(Γ|t) is tractable) and one needs to resort to techniques such as Monte
Carlo.
A simpler alternative is to consider using those Γ that correspond to a maximum
of the model posterior p(Γ|t). Provided that the posterior p(Γ|t) is sharply peaked
around its optimum value, this may still give a faithful value for the average in
equation (23.2.7). Assuming a flat prior on Γ, this corresponds to using the Γ that
maximize the likelihood p(t|Γ).
In the linear case here, and with the Gaussian noise model assumption, calculating
the model likelihood involves only Gaussian integrals, giving
\log p(t|Γ) = -\frac{β}{2} t^2 + \frac{β^2}{2} t^T Φ C^{-1} Φ^T t - \frac{1}{2} \log\det(C) + \frac{k}{2} \log α - \frac{P}{2} \log(2π/β)    (23.2.8)
GLMs can be very flexible regression models and one advantage from the Bayesian
point of view is that the model likelihood p(t|Γ) can be calculated exactly. This
makes combining models which have say different numbers of basis functions easy
to do – we just use equation (23.2.7).
From equation (23.1.4) we see that, since we already have a (Gaussian) definition
for the likelihood p(t|f), all we need to do is specify a prior distribution p(f) on
the function space f to complete the model specification. The most natural choice
is to specify a Gaussian distribution here, since that will mean that the posterior
is also Gaussian.
Imagine that we are given a set of inputs x1 . . . xl . Consider a particular xi and its
corresponding possible function value y i . If we have a space of possible functions,
then they will pass through different y i for the same xi (see say x1 in fig(23.2)).
Indeed, we can construct the prior on functions so that the distribution of these
values should be Gaussian, centered around some mean value (we will take this to
be zero for simplicity) with a certain variance.
Consider now two inputs, xi and xj and their separation, |xi − xj |. Note that y i
and y j fluctuate as different functions are sampled from some function space prior.
How can we incorporate ideas of smoothness? If |xi − xj | is small, we may expect
that a set of values at y i and a set at y j should be highly correlated (as in fig(23.2)
for x1 and x2 ) . This means that we might well think that the output values y i and
y j should be highly correlated if |xi − xj | is small. Conversely, if |xi − xj | is large,
we (probably) do not expect that y i and y j will be at all correlated (as for values at
Figure 23.2: Sample functions from a Gaussian Process prior. The correlation between y^i and y^j decreases with the distance |x_i − x_j|.
x1 and x3 in fig(23.2)). We can express these beliefs about the correlations of the
components of a vector of values y = (y 1 . . . y l ) at inputs x1 . . . xl as a multivariate
Gaussian distribution
p(y) ∝ \exp\left( -\frac{1}{2} y^T K^{-1} y \right)    (23.3.1)
where K is the covariance matrix of the outputs y. The elements of K are specified
by the covariance function c(x_i, x_j). As we argued above, we might expect that
the correlation between y^i and y^j decreases the further apart x_i and x_j are.
In the Gaussian case, the covariance function is c(x_i, x_j) = α \exp(-0.5 λ (x_i − x_j)^2).
Note that the shape of this function is smooth at zero.
In the Ornstein-Uhlenbeck case, the covariance function is c(x_i, x_j) = α \exp(-0.5 λ |x_i − x_j|).
Note how the Ornstein Uhlenbeck process gives rise to much less smooth functions
than those formed with the Gaussian covariance function.
Changing the length scale of the covariance function affects the range over which
the functions are correlated. See how changing α alters the scale of the outputs.
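To see the effect of these choices, here is a small Python sketch (not from the text) that draws sample functions from a zero-mean GP prior under the Gaussian and Ornstein-Uhlenbeck covariance functions; α, λ and the jitter term are illustrative values.

import numpy as np

def cov_gaussian(x, alpha=1.0, lam=10.0):
    """Squared-exponential covariance c(x_i, x_j) = alpha * exp(-0.5*lam*(x_i-x_j)^2)."""
    d = x[:, None] - x[None, :]
    return alpha * np.exp(-0.5 * lam * d ** 2)

def cov_ou(x, alpha=1.0, lam=10.0):
    """Ornstein-Uhlenbeck covariance c(x_i, x_j) = alpha * exp(-0.5*lam*|x_i - x_j|)."""
    d = np.abs(x[:, None] - x[None, :])
    return alpha * np.exp(-0.5 * lam * d)

x = np.linspace(-1, 2, 200)
rng = np.random.default_rng(1)
for cov in (cov_gaussian, cov_ou):
    K = cov(x) + 1e-8 * np.eye(len(x))                 # small jitter for numerical stability
    y = rng.multivariate_normal(np.zeros(len(x)), K)   # one sample function from the prior
    print(cov.__name__, y[:3])                         # (plot y against x to compare smoothness)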
Imagine that we have some new inputs x∗ and we wish to make some predictions
for their outputs y ∗ . According to the Bayesian philosophy, we need to specify a
likelihood and a prior. We already specified a prior in section (23.3.1),
p(y∗ , y) = normal(0, K) (23.3.2)
where K can be partitioned into the matrices k, K_{x_* x_*}, K_{x x_*} and K_{x x_*}^T. Here k has elements c(x_i, x_j) for the training inputs.
Since the prior and likelihood are Gaussian, it is clear that the posterior p(y∗ , y|t) ∝
p(t|y∗ , y)p(y∗ , y) is also Gaussian in y∗ , y. The marginal distribution p(y∗ |t) is
therefore also Gaussian. You might like to convince yourselves in your own time
that it takes the form
p(y_*|t) = normal\left( K_{x x_*}^T (k + σ^2 I)^{-1} t, \; K_{x_* x_*} − K_{x x_*}^T (k + σ^2 I)^{-1} K_{x x_*} \right)    (23.3.4)
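As an illustration, a minimal Python sketch of the predictive mean and covariance of equation (23.3.4) is given below; the covariance function, noise level and training data are illustrative.

import numpy as np

def cov(a, b, alpha=1.0, lam=10.0):
    """Squared-exponential covariance between two sets of 1-d inputs."""
    d = a[:, None] - b[None, :]
    return alpha * np.exp(-0.5 * lam * d ** 2)

def gp_predict(x_train, t, x_star, sigma2):
    """GP regression predictions: mean and covariance of p(y_* | t), as in eq. (23.3.4)."""
    k = cov(x_train, x_train)                 # train-train covariance
    K_xs = cov(x_train, x_star)               # train-test covariance
    K_ss = cov(x_star, x_star)                # test-test covariance
    M = np.linalg.inv(k + sigma2 * np.eye(len(x_train)))
    mean = K_xs.T @ M @ t
    covariance = K_ss - K_xs.T @ M @ K_xs
    return mean, covariance

x_train = np.array([0.2, 0.5, 0.9])
t = np.sin(2 * np.pi * x_train) + 0.1 * np.random.default_rng(2).normal(size=3)
mean, covariance = gp_predict(x_train, t, np.linspace(0, 1, 5), sigma2=0.01)
print(mean)
print(np.sqrt(np.diag(covariance)))           # error bars on the noise-free function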
First we see predictions for one training point. The red curve is the mean prediction
and the green curve are the error bars (one standard deviation). The blue crosses are
the training data points.
We can now try to understand the posterior as in the case of GLMs. In the same
way, we alter the noise belief and actual noise and see what happens to the pre-
dictions.
Note how the error bars collapse onto the data for a single datapoint. See how
this also happens for two datapoints as well.
Can you observe any differences between the GP predictions and the GLM predic-
tions?
What do you think could be the connection between GPs and GLMs?
There are two basic methods for making predictions in classification problems (see,
e.g Ripley, 1996); (i) the sampling paradigm, where a class-conditional density
p(x|k) and a prior are created for each class k, and Bayes’ theorem is used to
determine p(k|x) given a new input x, or (ii) the diagnostic paradigm, where the
aim is to predict p(k|x) directly via some function of the input. As p(k|x) must
lie in [0, 1], this condition is usually achieved by using an output (or transfer)
function which enforces the constraint. For the two class problem a common
choice is the logistic function σ(y) = 1/(1 + e−y ). For a k > 2 class problem a
simple generalization of the logistic function, the softmax function, is frequently
used.
We will follow the diagnostic paradigm and use the logistic function, an approach
also used widely in the neural networks literature. In the simplest method of this
kind, logistic regression, the input to the sigmoid function y is simply computed as
a linear combination of the inputs, plus a bias, i.e. y = wT x + b. Neural networks
and other flexible methods allow y to be a non-linear function of the inputs.
Figure 23.3: π(x) is obtained from y(x) by "squashing" it through the sigmoid function σ.
We show in section 23.4.3 how to perform the integral in 23.4.1 over the hyper-
parameters P (θ|t). Here we consider the hyperparameters to be fixed, and are
interested in the posterior distribution P (π∗ |t) = P (π(x∗ )|t) for a new input x∗ .
This can be calculated by finding the distribution P (y∗ |t) (y∗ is the activation of
π∗ ) and then using the appropriate Jacobian to transform the distribution. For-
mally the equations for obtaining P (y∗ |t) are identical to equation ??. However,
even if we use a GP prior so that P(y_*, y) is Gaussian, the usual expression for
P(t|y) = \prod_i π_i^{t_i} (1 − π_i)^{1−t_i} for classification data (where the t's take on values
of 0 or 1) means that the average over π in equation 23.4.1 is no longer exactly
analytically tractable.
After transforming equation 23.4.1 to an integral over activations, we will employ
Laplace’s approximation, i.e. we shall approximate the integrand P (y∗ , y|t) by a
Gaussian distribution centred at a maximum of this function with respect to y∗ , y
with an inverse covariance matrix given by −∇∇ log P (y∗ , y|t). The necessary
integrations (marginalization) can then be carried out analytically (see, e.g. Green
and Silverman (1994) §5.3) and, we provide a derivation in the following section.
The averages over the hyperparameters will be carried out using Monte Carlo
techniques, which we describe in section 23.4.3.
Let y+ denote (y∗ , y), the complete set of activations. By Bayes’ theorem log P (y+ |t) =
log P (t|y) + log P (y+ ) − log P (t), and let Ψ+ = log P (t|y) + log P (y+ ). As P (t)
does not depend on y+ (it is just a normalizing factor), the maximum of P (y+ |t)
is found by maximizing Ψ+ with respect to y+ . We define Ψ similarly in relation
to P (y|t). Using log P (ti |yi ) = ti yi − log(1 + eyi ), we obtain
Ψ_+ = t^T y − \sum_{i=1}^{n} \log(1 + e^{y_i}) − \frac{1}{2} y_+^T K_+^{-1} y_+ − \frac{1}{2} \log|K_+| − \frac{n+1}{2} \log 2π    (23.4.2)
Ψ = t^T y − \sum_{i=1}^{n} \log(1 + e^{y_i}) − \frac{1}{2} y^T K^{-1} y − \frac{1}{2} \log|K| − \frac{n}{2} \log 2π    (23.4.3)
where K+ is the covariance matrix of the GP evaluated at x1 , . . . xn , x∗ . K+ can
be partitioned in terms of an n × n matrix K, a n × 1 vector k and a scalar k∗ ,
viz.
K_+ = \begin{pmatrix} K & k \\ k^T & k_* \end{pmatrix}    (23.4.4)
As y∗ only enters into equation 23.4.2 in the quadratic prior term and has no data
point associated with it, maximizing Ψ+ with respect to y+ can be achieved by first
maximizing Ψ with respect to y and then doing the further quadratic optimization
to determine y∗ . To find a maximum of Ψ we use the Newton-Raphson (or Fisher
scoring) iteration y new = y − (∇∇Ψ)−1 ∇Ψ. Differentiating equation 23.4.3 with
respect to y we find
∇Ψ = (t − π) − K^{-1} y    (23.4.5)
∇∇Ψ = −K^{-1} − N    (23.4.6)
where the ‘noise’ matrix is given by N = diag(π1 (1 − π1 ), .., πn (1 − πn )). This
results in the iterative equation
y^{new} = (K^{-1} + N)^{-1} (N y + (t − π)).
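A minimal Python sketch of this Newton-Raphson iteration for the two class case is given below; the variable names K (prior covariance over activations) and t (0/1 targets) are assumed inputs, and no particular numerical safeguards are included.

import numpy as np

def laplace_mode(K, t, iterations=20):
    """Newton-Raphson iteration for the mode of Psi (two-class GP classification).

    K : n x n covariance matrix of the GP prior over activations
    t : length-n vector of 0/1 targets
    Returns the mode y of P(y|t) and the sigmoid outputs pi at the mode.
    """
    n = len(t)
    y = np.zeros(n)
    Kinv = np.linalg.inv(K)
    for _ in range(iterations):
        pi = 1.0 / (1.0 + np.exp(-y))             # logistic outputs
        N = np.diag(pi * (1 - pi))                # 'noise' matrix
        grad = (t - pi) - Kinv @ y                # gradient of Psi
        hess = -(Kinv + N)                        # Hessian of Psi
        y = y - np.linalg.solve(hess, grad)       # Newton update
    return y, 1.0 / (1.0 + np.exp(-y))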
likelihood estimator for a model with a finite number of parameters. This is because
the dimension of the problem grows with the number of data points. However, if
we consider the “infill asymptotics”, where the number of data points in a bounded
region increases, then a local average of the training data at any point x will
provide a tightly localized estimate for π(x) and hence y(x), so we would expect
the distribution P (y) to become more Gaussian with increasing data.
There are many reasonable choices for the covariance function. Formally, we are
required to specify functions which will generate a non-negative definite covariance
matrix for any set of points (x1 , . . . , xk ). From a modelling point of view we wish
to specify covariances so that points with nearby inputs will give rise to similar
predictions. We find that the following covariance function works well:
C(x, x') = v_0 \exp\left\{ -\frac{1}{2} \sum_{l=1}^{d} w_l (x_l − x'_l)^2 \right\}    (23.4.9)
The activations for class m at datapoint n are denoted by y_m^n. Formally, the softmax link function relates the activations and
probabilities through
π_m^n = \frac{\exp y_m^n}{\sum_{m'} \exp y_{m'}^n}
which automatically enforces the constraint \sum_m π_m^n = 1. The targets are similarly
represented by t_m^n, which are specified using a one-of-m coding.
The log likelihood takes the form L = \sum_{n,m} t_m^n \ln π_m^n, which for the softmax link
function gives
L = \sum_{n,m} t_m^n \left( y_m^n − \ln \sum_{m'} \exp y_{m'}^n \right)    (23.5.1)
As for the two class case, we shall assume that the GP prior operates in activation
space; that is, we specify the correlations between the activations y_m^n.
One important assumption we make is that our prior knowledge is restricted to cor-
relations between the activations of a particular class. Whilst there is no difficulty
in extending the framework to include inter-class correlations, we have not yet
encountered a situation where we felt able to specify such correlations. Formally,
the activation correlations take the form,
⟨y_m^n y_{m'}^{n'}⟩ = δ_{m,m'} K_m^{n,n'}    (23.5.2)
where K_m^{n,n'} is the (n, n') element of the covariance matrix for the m-th class. Each
individual correlation matrix K_i has the form given by equation 23.4.9 (as for the two
class case). We shall make use of the same intraclass correlation structure, with a
separate set of hyperparameters for each class.
For simplicity, we introduce the augmented vector notation,
y_+ = (y_1^1, . . . , y_1^n, y_1^*, y_2^1, . . . , y_2^n, y_2^*, . . . , y_m^1, . . . , y_m^n, y_m^*)
where, as in the two class case, yi∗ denotes the target activation for class i; this
notation is also used to define t+ and π + . In a similar manner, we define y, t and
π by excluding the corresponding target values, denoted by a ‘*’ index.
With this definition of the augmented vectors, the GP prior takes the form,
P(y_+) ∝ \exp\left( -\frac{1}{2} y_+^T (K^+)^{-1} y_+ \right)    (23.5.3)
where, from equation 23.5.2, the covariance matrix K^+ is block diagonal in the
matrices K_1^+, . . . , K_m^+. Each individual matrix K_i^+ expresses the correlations of
activations within class i, with covariance function given by equation 23.4.9, as for
the two class case.
The GP prior and likelihood, defined by equations 23.5.3, 23.5.1 respectively, define
the posterior distribution of activations, P (y+ |t). Again, as in section 23.4.1 we
are interested in a Laplace approximation to this posterior, and therefore need to
find the mode with respect to y+ . Dropping unnecessary constants, the multi-class
analogue of equation 23.4.2 for terms involving y+ in the exponent of the posterior
is:
Ψ_+ = −\frac{1}{2} y_+^T K_+^{-1} y_+ + t^T y − \sum_n \ln \sum_m \exp y_m^n
∇∇Ψ = −K^{-1} − N
Although this is in the same form as for the two class case, equation (23.4.6),
there is a slight change in the definition of the ‘noise’ matrix, N . A convenient
way to define N is by introducing the matrix Π, which is an (m · n_+) × (n_+) matrix
of the form Π = (diag(π_1^1 .. π_1^{n_+}), . . . , diag(π_m^1 .. π_m^{n_+})). Using this notation, we can
write the noise matrix in the form of a diagonal matrix and an outer product,
N = −diag(π_1^1 .. π_1^{n_+}, . . . , π_m^1 .. π_m^{n_+}) + Π Π^T    (23.5.4)
The update equation for the iterative optimization of Ψ with respect to the activations
y then follows the same form as in the two class case. The advantage of the
representation of the noise matrix in equation 23.5.4 is that we can then invert
matrices and find their determinants using the identities,
(A + H H^T)^{-1} = A^{-1} − A^{-1} H (I + H^T A^{-1} H)^{-1} H^T A^{-1}    (23.5.5)
and
\det(A + H H^T) = \det(A) \det(I + H^T A^{-1} H)
23.6 Discussion
One should always bear in mind that all models are wrong! (If we knew the cor-
rect model, we wouldn’t need to bother with this whole business). Also, there is
no such thing as assumption free predictions, or a “universal” method that will
always predict well, regardless of the problem. In particular, there is no way that
one can simply look at data and determine what is signal and what is noise. The
separation of a signal into such components is done on the basis of belief about the
noise/signal process.
Note that our stated aim in this practical was to find a good regression model and
not to try to interpret the data. This is an important difference and should be kept
in mind. It may well be that using a non-linear model, we can (also) fit the data
well using far fewer adjustable parameters. In that case, we may be able to place
more emphasis on interpreting such lower dimensional representations and perform
feature extraction (as potentially in neural networks). However, linear models are
generally easier to work with and are a useful starting place in our search for
a good regression model. Coupled with the Gaussian noise assumption, using a
Gaussian prior on the weights of a linear model defines a Gaussian Process in the
output space. In this sense, generalised linear models are Gaussian Processes with a
particular covariance function. Once this is realised, one is free to directly specify
the form of the covariance function, as we did in the latter half of the practical,
and this obviates the need for a weight space. This is in some cases convenient
since it therefore also deals with the problem of the curse of dimensionality. As far
as the Bayesian is concerned in this regression context, without any explicit belief
about the data generating process, the only requirements/prior belief one has are
typically expressed in terms of the smoothness of the function itself. That is, the
question of parameter complexity is irrelevant - the Bayesian is perfectly happy
to use a model with a billion parameters or one with 10 parameters. Whichever
model most aptly captures his/her belief about the data generating function is the
preferred choice.
IV. Approximate Inference Methods
24 Sampling
Readers are also invited to read the chapter on sampling methods in
David MacKay’s book.
24.1 Introduction
Consider the distribution p(x). Sampling is the process of generating a vector
x from the distribution p(x), with probability given by p(x). One way to view
this is that if we have a procedure S(p) from which we can generate a set of P
samples x1 , . . . , xP , then, in the limit of P → ∞, the relative frequency that the
sample value x occurs tends to p(x). (In the continuous distribution case, this can
be defined as the limiting case of the relative frequency of x ∈ ∆ tending to \int_{x∈∆} p(x) \, dx.)
In both cases, sampling simply means drawing examples from the distribution with
the correct frequency.
In the sequel, we assume that a random number generator exists which is able to
produce a value uniformly at random from the unit interval [0, 1]. We will make
use of this uniform random number generator to draw samples from non-uniform
distributions.
[Figure: the unit interval [0, 1] partitioned into three regions labelled 1, 2 and 3, with a uniformly placed sample point ×.]
This represents a partitioning of the unit interval [0, 1] in which the interval [0, 0.6]
has been labelled as state 1, [0.6, 0.7] as state 2, and [0.7, 1.0] as state 3. If we
were to drop a point × anywhere at random, uniformly in the interval [0, 1], the
chance that × would land in interval 1 is 0.6, and the chance that it would be in
interval 2 is 0.1 and, similarly, for interval 3, 0.3. This therefore defines for us a
valid sampling procedure for discrete one-dimensional distributions:
Cumulant Let p_i, i = 1, . . . , K label the K state probabilities. Calculate the so-called cumulant,
c_i = \sum_{j ≤ i} p_j, and set c_0 = 0. (In the above, we have (c_0, c_1, c_2, c_3) =
(0, 0.6, 0.7, 1).) Draw a value u uniformly at random from the unit interval [0, 1].
Find that i for which c_{i−1} ≤ u ≤ c_i. The sampled state is then i. In our example,
we may have sampled u = 0.66. Then the sampled state would be x = 2,
since this is in the interval [c_1, c_2].
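A minimal Python sketch of this cumulant method:

import numpy as np

def sample_discrete(p, rng):
    """Sample a state from a discrete distribution p using the cumulant method."""
    c = np.cumsum(p)                     # cumulant: c_i = sum_{j <= i} p_j
    u = rng.uniform()                    # uniform draw from [0, 1]
    return int(np.searchsorted(c, u))    # smallest i with u <= c_i (0-based state index)

rng = np.random.default_rng(0)
p = [0.6, 0.1, 0.3]
samples = [sample_discrete(p, rng) for _ in range(10000)]
print([samples.count(i) / len(samples) for i in range(3)])   # relative frequencies approach p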
Continuous Case
Intuitively, the generalisation of the discrete case to the continuous case is clear.
First we calculate the cumulant density function
C(y) = \int_{-∞}^{y} p(x) \, dx
Then we generate a random u uniformly from [0, 1], and then obtain the corre-
sponding sample value x by solving C(x) = u. For some special distributions, such
as Gaussians, very efficient equivalent ways to achieve this are usually employed.
One way to generalise the one dimensional case to a higher dimensional case
p(x1 , . . . , xn ) would be to translate the higher dimensional case into an equiva-
lent one-dimensional distribution. We can enumerate all the possible joint states
(x1 , . . . , xn ), giving each a unique integer y from 1 to the total number of states
accessible. This then transforms the multi-dimensional distribution into an equiv-
alent one-dimensional distribution, and sampling can be achieved as before. Of
course, in high dimensional distributions, we would have, in general, exponentially
many states of x, and an explicit enumeration would be impractical.
Belief Networks
For example
p(x1 , . . . , x6 ) = p(x1 )p(x2 )p(x3 |x1 , x2 )p(x4 |x3 )p(x5 |x3 )p(x6 |x4 , x5 )
as shown below.
[Figure: the belief network with edges x1 → x3, x2 → x3, x3 → x4, x3 → x5, x4 → x6, x5 → x6.]
By making a so-called ancestral ordering (in which parents always come before
children), as in the equation above, one can sample first from those nodes that
do not have any parents (here, x1 and x2 ). Given these values, one can then
sample x3 , and then x4 ,x5 and finally x6 . Hence, despite the presence of loops in
the graph, such a forward sampling procedure is straightforward. Any quantity of
interest, for example, a marginal p(x5 ), is approximated by counting the relative
number of times that x5 is in a certain state in the samples.
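A minimal Python sketch of this forward (ancestral) sampling for the network above, using hypothetical binary conditional probability tables:

import numpy as np

rng = np.random.default_rng(3)

def bern(p):
    """Sample a binary variable which is 1 with probability p."""
    return int(rng.uniform() < p)

def forward_sample():
    """One joint sample from the belief network, in ancestral order."""
    # Hypothetical conditional probability tables, for illustration only.
    x1 = bern(0.3)
    x2 = bern(0.6)
    x3 = bern([[0.1, 0.5], [0.4, 0.9]][x1][x2])   # p(x3=1 | x1, x2)
    x4 = bern([0.2, 0.7][x3])                     # p(x4=1 | x3)
    x5 = bern([0.8, 0.3][x3])                     # p(x5=1 | x3)
    x6 = bern([[0.1, 0.6], [0.5, 0.95]][x4][x5])  # p(x6=1 | x4, x5)
    return x1, x2, x3, x4, x5, x6

# Approximate the marginal p(x5 = 1) by the relative frequency over many samples.
samples = [forward_sample() for _ in range(20000)]
print(sum(s[4] for s in samples) / len(samples))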
How can we sample from a distribution in which certain variables are clamped
in evidential values? One approach would be to proceed as above with forward
sampling, and then discard any samples which do not match the evidential states.
This can be extremely inefficient, and is not recommended.
Gibbs Sampling
One of the simplest ways to more effectively account for evidence is to employ a
recursive procedure. One way to motivate the procedure is to assume that someone
has presented you with a sample x^1 from the distribution p(x). (For the moment,
we leave aside the issue of evidence.) We then consider a particular variable, x_i.
We may write
p(x) = p(x_i|x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) p(x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n)
(One may view this decomposition as x_i given all its parents, multiplied by the
probability of the parents.) Since we assume that someone has already provided us
with a sample x^1, from which we can read off the ‘parental’ state x_1^1, . . . , x_{i−1}^1, x_{i+1}^1, . . . , x_n^1,
we can then draw a sample from
p(x_i|x_1^1, . . . , x_{i−1}^1, x_{i+1}^1, . . . , x_n^1).
There are a couple of important remarks about this procedure. Clearly, if the
initial sample x^1 is not representative – that is, it is in fact a part of the state space
that is relatively extremely unlikely, then we should not really expect that the
samples we draw will initially be very representative either. This motivates the
so-called ‘burn in’ stage in which, perhaps 1/3 of the samples are discarded. An-
other remark is that it is clear there will be a high degree of correlation in any two
successive samples, since only one variable is updated. What we would really like
is that each sample x is simply drawn ‘at random’ from p(x) – clearly, in general,
such random samples will not possess the same degree of correlation as those from
Gibbs sampling. This motivates so-called subsampling, in which, say, every 10th
sample, x^K, x^{K+10}, x^{K+20}, . . ., is taken, and the rest discarded.
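A minimal Python sketch of Gibbs sampling for a toy two-variable discrete distribution, including burn in and subsampling; the joint table is an illustrative example.

import numpy as np

# Toy joint distribution p(x1, x2) over two binary variables, for illustration.
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])      # p[x1, x2]

rng = np.random.default_rng(4)

def gibbs(n_samples, burn_in=1000, thin=10):
    x1, x2 = 0, 0                                 # arbitrary initial sample
    kept = []
    for it in range(burn_in + n_samples * thin):
        # Sample x1 from p(x1 | x2), then x2 from p(x2 | x1).
        cond1 = p[:, x2] / p[:, x2].sum()
        x1 = int(rng.uniform() < cond1[1])
        cond2 = p[x1, :] / p[x1, :].sum()
        x2 = int(rng.uniform() < cond2[1])
        if it >= burn_in and (it - burn_in) % thin == 0:
            kept.append((x1, x2))
    return kept

samples = gibbs(5000)
freq = np.zeros((2, 2))
for x1, x2 in samples:
    freq[x1, x2] += 1
print(freq / freq.sum())          # should approach the table p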
Evidence
Evidence is easy to deal with in the Gibbs sampling procedure. One simply clamps
for all time those variables that are evidential into their evidential states. There
is also no need to sample for these variables, since their states are known.
Despite its simplicity, Gibbs sampling is one of the most useful and popular sam-
pling methods, especially in discrete cases. However, one should bear in mind that
convergence is a major issue – that is, answering questions such as ‘how many
samples are needed to be reasonably sure that my sample estimate p(x5 ) is accu-
rate?’, is, to a large extent, an unknown. Despite many mathematical results in
this area, little is really known about these issues, and general rules of thumb, and
sensible awareness on behalf of the user are rather required. (Indeed, if one were
able to answer such questions, one would understand the distribution well enough
that usually some exact technique would be preferable).
Caution
As with most sampling schemes, a word of caution is required. Whilst there are
some formal results that show that Gibbs sampling (under certain restrictions) is
a correct sampling procedure, one can easily construct cases where it will fail. In
fig(24.2), we show such a case in which the two dimensional continuous distribution
has mass only in the lower left and upper right regions. In that case, if we start in
the lower left region, we will always remain there, and never explore the upper right
region. This problem occurs essentially because there are two regions which are not
connected by a path which is reachable by Gibbs sampling. Such multi-modality
is the scourge of sampling in general, and is very difficult to address.
Importance Sampling
The aim here is to replace sampling with respect to the intractable distribution
p(x), and instead sample from a tractable, simpler distribution q(x). We need to
in someway adjust/reweight the samples from q(x) such that, in the limit of a large
Figure 24.2: A two dimensional distribution for which Gibbs sampling fails. The upper right region is never explored. This is a case where the sampler is non-ergodic. For an ergodic sampler, there is a non-zero chance that any region of the space will be visited.
Consider the average
<f(x)> = ∫ f(x) p(x) dx = ∫ f(x) p*(x) dx / ∫ p*(x) dx    (24.1.1)
       = ∫ f(x) [p*(x)/q(x)] q(x) dx / ∫ [p*(x)/q(x)] q(x) dx    (24.1.2)
where p*(x) denotes the unnormalised version of p(x). Drawing samples x^1, ..., x^P from q(x), this average is approximated by
<f(x)> ≈ Σ_{μ=1}^{P} r^μ f(x^μ)
where
r^μ = [p*(x^μ)/q(x^μ)] / Σ_{ν=1}^{P} [p*(x^ν)/q(x^ν)]
Hence, in principle, this reweighting of the samples from q will give the correct
result. In high dimensional spaces x, however, the rµ will tend to have only one
dominant value close to 1, and the rest will be zero, particularly if the sampling
distribution q is not well matched to p, since then the ratio q/p will not be close to
unity. However, in a moderate number of dimensions, perhaps less than 10 or so,
this method can produce reasonable results. Indeed, it forms the basis for a simple
class of algorithms called particle filters, which are essentially importance sampling
for temporal Belief Networks (eg non-linear Kalman Filters), in which one forward
samples from a proposal distribution q, and one can exploit the simplified Markov
structure to recursively define reweighting factors.
See http://www-sigproc.eng.cam.ac.uk/smc/index.html for references.
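As a small illustration of the reweighting in equation (24.1.2), the following MATLAB sketch estimates <x^2> under an (illustrative, deliberately unnormalised) target p*(x) = exp(-(x-1)^2/2) using samples from a broader Gaussian proposal q(x) = N(0, 2^2); the exact answer here is 2.
P=100000;
xq=2*randn(1,P);                       % samples from q(x)=N(0,4)
logpstar=-(xq-1).^2/2;                 % log of the unnormalised target
logq=-xq.^2/8 - log(2*sqrt(2*pi));     % log q(x)
r=exp(logpstar-logq); r=r/sum(r);      % normalised importance weights r^mu
est=sum(r.*xq.^2)                      % estimate of <x^2> under p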
Understanding MCMC
Consider the conditional distribution p(x_{t+1}|x_t). If we are given an initial sample x_1, then we can recursively generate samples x_1, x_2, ..., x_t. After a long time t ≫ 1, we can plot the samples x_t. Are the samples x_t samples from some distribution and, if so, which distribution? The answer is (generally) yes: they are samples from the stationary distribution p_∞(x), which is defined by
p_∞(x') = ∫_x p(x'|x) p_∞(x)
This equation defines the stationary distribution, from which we see that the stationary distribution is the eigenfunction with unit eigenvalue of the transition kernel. Under mild conditions (ergodicity is usually required), every transition distribution has a stationary distribution, and this stationary distribution is unique.
The idea in MCMC is to reverse this process. If we are given the distribution p(x),
can we find a transition p(x′ |x) which has p(x) as its stationary distribution? If we
can, then we can draw samples from the Markov Chain, and use these as samples
from p(x).
Note that whilst (usually) every Markov transition p(x′ |x) has a unique stationary
distribution, every distribution p(x) has a great many different transitions p(x′ |x)
with p(x) as their equilibrium distributions. (This is why there are very many
different MCMC sampling methods for the same distribution).
Detailed Balance
How do we construct transitions p(x'|x) with a given p(x) as their stationary distribution? One convenient trick is to assume detailed balance. This is the assumption
p(x'|x) p(x) = p(x|x') p(x')
Integrating (or summing) both sides over x, and using ∫_x p(x|x') = 1, gives ∫_x p(x'|x) p(x) = p(x'), so that p(x) is indeed a stationary distribution of the transition. That is, detailed balance is a sufficient condition for stationarity. It is not, however, a necessary condition.
For example, consider drawing samples from the uniform distribution U[0, 1]. One (rather silly!) way to do this would be to draw x' as follows. Draw a random number y from a small interval uniform distribution, y ∼ U[0, ε] where, say, ε = 0.5. Then take x' to be the value x + y with wrapped boundary conditions. That is, a point x' = 1 + δ gets mapped to δ. Clearly, under this scheme, we will eventually sample correctly from the uniform distribution U[0, 1], although in a left to right manner. This clear irreversibility of the chain shows that detailed balance is not a necessary criterion for correct MCMC sampling.
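A two-line MATLAB sketch of this (rather silly!) wrapped sampler; the histogram of the resulting chain is indeed approximately uniform on [0,1], even though the chain only ever moves to the right.
eps0=0.5; n=100000; x=zeros(1,n);
for t=2:n
  x(t)=mod(x(t-1)+eps0*rand, 1);   % move right by y ~ U[0,eps0], wrapping at 1
end
hist(x,50)                          % approximately flat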
Metropolis/Hastings Sampling
Given the distribution p(x) from which we wish to sample, a proposal distribution q(x'|x) and an 'acceptance function' 0 ≤ f(x', x) ≤ 1, the transition
p(x'|x) = q(x'|x) f(x', x) + δ(x', x) [1 − ∫_{x''} q(x''|x) f(x'', x)]
would be a candidate transition¹. The reader may verify that this is indeed a distribution since
∫_{x'} p(x'|x) = ∫_{x'} q(x'|x) f(x', x) + 1 − ∫_{x''} q(x''|x) f(x'', x) = 1
The above transition clearly splits into two cases, namely x' = x and x' ≠ x. When x' = x, detailed balance trivially holds. In the case x' ≠ x, detailed balance requires
q(x'|x) f(x', x) p(x) = q(x|x') f(x, x') p(x'),   that is   f(x', x)/f(x, x') = q(x|x') p(x') / [q(x'|x) p(x)]
Consider the acceptance function
f(x', x) = min(1, q(x|x') p(x') / [q(x'|x) p(x)])
Then if q(x|x')p(x') > q(x'|x)p(x), we have f(x', x) = 1 and f(x, x') = q(x'|x)p(x)/[q(x|x')p(x')], and hence
f(x', x)/f(x, x') = 1 / (q(x'|x)p(x)/[q(x|x')p(x')]) = q(x|x')p(x') / [q(x'|x)p(x)]
as required. The reader may show that, conversely, if q(x|x')p(x') ≤ q(x'|x)p(x), we also get f(x', x)/f(x, x') = q(x|x')p(x')/[q(x'|x)p(x)]. Hence the function f(x', x) defined above is a suitable function to ensure that p(x'|x) satisfies detailed balance. This function is called the Metropolis-Hastings acceptance function. Other acceptance functions may also be derived.
¹ One could contemplate, for example, a normalisation-by-division style method. However, it is not necessarily easy to sample from such a transition distribution. The beauty of the Metropolis method is that this subtractive normalisation results in a distribution that is easy to sample from, as we will see.
How do we then sample from p(x'|x)? Imagine we draw a candidate sample x' from q(x'|x). If q(x|x')p(x') > q(x'|x)p(x), then f(x', x) = 1 and we must have x' ≠ x (since, otherwise, we would require p(x) > p(x), which cannot be true), and we simply have p(x'|x) = q(x'|x) – namely, we accept the sample x'. Conversely, if q(x|x')p(x') ≤ q(x'|x)p(x), then f(x', x) = q(x|x')p(x')/[q(x'|x)p(x)], and we cannot rule out that x' = x. Hence
p(x'|x) = q(x'|x) f(x', x) + δ(x', x) [1 − ∫_{x''} q(x''|x) f(x'', x)]
To sample from this mixture, we draw a candidate x' from q(x'|x) and accept it with probability f(x', x); otherwise we take the sample to be x' = x. A common mistake in MCMC is, when we reject the candidate x', simply to restart the procedure. The correct approach is that, if the candidate x' is rejected, we take the original x as a new sample. Hence, another copy of x is included in the sample set – ie each iteration of the algorithm produces a sample – either a copy of the current sample, or the candidate sample.
The reader may show that Gibbs sampling can be put in this framework for a suitable proposal q(x'|x).
Whilst all of this is quite cool, the reader should bear in mind a couple of points.
Firstly, having a ‘correct’ transition does not guarantee that the samples will indeed
be from p(x). The proposal distribution q(x′ |x) may not explore all regions, in
which case we have not shown detailed balance holds for all points x and x′ . This
is what can happen in the example of the Gibbs sampler, which is not ergodic –
we locally satisfy detailed balance in a region, but not over all space, hence the
samples are not samples from p(x).
Another related point is that, even if we can guarantee ergodicity of the chain, we
have no clue as to how long we need to wait until we have drawn a representative
sample from p(x). The reason for this is essentially that, if the chain is ergodic, then indeed, eventually, we will be drawing samples from the stationary distribution – but when? Assessing convergence is a major headache in MCMC and is, in some sense, just as difficult as sampling from p(x) itself, since we need some global idea of the distribution in order to know how long it will take before we are likely to have reached a representative point of p(x). In practice, there are some heuristics....
If we use a symmetric Gaussian proposal
q(x'|x) ∝ e^{−(x'−x)²/(2σ²)}
then q(x'|x) = q(x|x') and the acceptance function simplifies to a = min(1, p(x')/p(x)): the candidate x' is accepted with probability a and, if it is rejected, the new sample is taken to be x. This is the Metropolis algorithm.
Figure 24.3: One thousand samples from a non-Gaussian distribution. The contours plotted are iso-probability contours. Here, Metropolis sampling was used with a standard deviation of 1 in each dimension.
The Metropolis algorithm with isotropic Gaussians above is intuitive, and simple
to implement. However, it is not necessarily very efficient. Intuitively, we will
certainly accept the candidate if the unnormalised probability is higher than the
probability at the current state. We attempt to find a higher point on the distribu-
tion essentially by making a small jump in a random direction. In high dimensions,
it is unlikely that a random direction will result in a value of the probability which
is higher than the current value. Because of this, only very small jumps (which will typically still result in a < 1) are likely to be accepted. However, if only very small jumps
are accepted, the speed with which we explore the space x is extremely slow, and
a tremendous number of samples would be required.
Hybrid Monte Carlo
This is a method for continuous systems that aims to make non-trivial jumps in
the samples and, in so doing, to jump potentially from one mode to another.
It is customary (though not necessary) to derive Hybrid MCMC in terms of Hamil-
tonians. We will follow this approach here as well.
Let's define the difficult distribution from which we wish to sample as
p(x) = (1/Z_x) e^{H_x(x)}
for some given 'Hamiltonian' H_x(x). We then define another, 'easy' distribution
p(y) = (1/Z_y) e^{H_y(y)}
– in what follows, p(y) is taken to be Gaussian, H_y(y) = −yᵀy/2.
We now consider the joint distribution
p(x, y) = p(x) p(y)
Samples from this joint distribution, with the y components discarded, are samples from the marginal p(x).
Figure 24.4: Hybrid Monte Carlo. Starting from the point x, y, we first draw a
new value for y from the Gaussian p(y). Then we use Hamiltonian dynamics to
traverse the distribution at roughly constant energy H(x, y) to reach a point x′ , y ′ .
We accept this point if H(x′ , y ′ ) > H(x, y ′ ). Otherwise this candidate is accepted
with probability exp(H(x′ , y ′ ) − H(x, y ′ )).
The dynamic step is a Metropolis step, with a very special kind of proposal dis-
tribution. In the dynamic step, the main idea is to go from one point of the space
x, y to a new point x′ , y ′ that is a non-trivial distance from x, y and which will
be accepted with a high probability. In the basic formulation using a symmetric
proposal distribution, we will accept the candidate x', y' if the value H(x', y') is higher than or close to the value H(x, y'). How can such a non-trivial distance
between x, y and x′ , y ′ be accomplished? One way to do this is to use Hamiltonian
dynamics.
Hamiltonian Dynamics
Writing H(x, y) ≡ H_x(x) + H_y(y), the aim of the dynamic step is to find a new point for which the 'Hamiltonian' is conserved,
H(x', y') = H(x, y)
We can satisfy this (up to first order) by considering the Taylor expansion
H(x + ∆x, y + ∆y) ≈ H(x, y) + ∆xᵀ ∇_x H(x, y) + ∆yᵀ ∇_y H(x, y)
so that we require
∆xᵀ ∇_x H(x, y) + ∆yᵀ ∇_y H(x, y) = 0
This is a single scalar requirement, and there are therefore many different solutions for ∆x and ∆y that satisfy this single condition. In physics, it is customary to assume isotropy, which limits dramatically the number of possible solutions to essentially just the following:
∆x = ε ∇_y H(x, y),   ∆y = −ε ∇_x H(x, y)
where ε is a small value to ensure that the Taylor expansion is accurate. Hence
x' = x + ε ∇_y H(x, y),   y' = y − ε ∇_x H(x, y)
and these updates may be iterated. There are specific ways to implement the dynamic equations above (called Leapfrog discretisation, a special case of symplectic discretisation) that are more accurate – see the Radford Neal reference.
We can then follow the Hamiltonian dynamics for many time steps (usually of the
order of several hundred) to reach a candidate point x′ , y ′ . If the Hamiltonian
dynamics was well behaved, H(x′ , y ′ ) will have roughly the same value as H(x, y).
We then do a Metropolis step, and accept the point x′ , y ′ if H(x′ , y ′ ) > H(x, y)
and otherwise accept it with probability exp(H(x′ , y ′ ) − H(x, y)). If rejected, we
take the initial point x, y as the sample.
In order to make a symmetric proposal distribution, at the start of the dynamic
step, we choose ǫ = +ǫ0 or ǫ = −ǫ0 uniformly. This means that there is the same
chance that we go back to the point x, y starting from x′ , y ′ , as vice versa.
Combined with the Gibbs step, we then have the general procedure.
1. Start from x, y. Draw a new sample y ′ from p(y).
2. Then, starting from x, y', choose a random direction (forwards or backwards, ε = +ε₀ or ε = −ε₀) and perform Hamiltonian dynamics for some number of time steps until we reach a candidate x', y'. Accept x', y' if H(x', y') > H(x, y'), otherwise accept it with probability exp(H(x', y') − H(x, y')). If rejected, we take the sample to be x, y'.
3. The above steps are repeated.
One obvious feature of HMC is that we now use not just the potential H_x(x) to define candidate samples, but its gradient as well. An intuitive reason for the success of the algorithm is that it is less myopic than straightforward Metropolis, since the use of the gradient enables the algorithm to feel its way to other regions of high probability, by following at all times likely paths in the augmented space. One can also view the auxiliary variables as momentum variables – it is as if the sample now has a momentum. Provided this momentum is high enough, we can escape local minima......more later.
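A minimal MATLAB sketch of the above procedure, using the simple first-order dynamics described above (rather than the more accurate Leapfrog scheme) and assuming, purely for illustration, that the 'difficult' distribution is a standard Gaussian with H_x(x) = −xᵀx/2. The step size, number of dynamics steps and starting point are illustrative.
nsamples=2000; L=50; eps0=0.05;
x=zeros(2,nsamples); x(:,1)=[3;3];
Hx=@(x) -0.5*(x'*x);                 % assumed 'difficult' Hamiltonian
gradHx=@(x) -x;                      % its gradient
for mu=2:nsamples
  xold=x(:,mu-1);
  y=randn(2,1);                      % Gibbs step : y ~ p(y) with Hy(y)=-y'y/2
  ep=eps0*sign(randn);               % random direction, so the proposal is symmetric
  xc=xold; yc=y;
  for t=1:L                          % first-order Hamiltonian dynamics
    xc=xc+ep*(-yc);                  % dx =  eps * dH/dy  (dHy/dy = -y)
    yc=yc-ep*gradHx(xc);             % dy = -eps * dH/dx
  end
  dH=(Hx(xc)-0.5*(yc'*yc))-(Hx(xold)-0.5*(y'*y));
  if log(rand)<dH                    % accept if H increases, else with prob exp(dH)
    x(:,mu)=xc;
  else
    x(:,mu)=xold;                    % rejected : keep a copy of the current point
  end
end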
Slice Sampling
Blah
Swendsen-Wang
This is a classic algorithm used for discrete variables. The main motivation here is to introduce p(y|x) in such a way that the distribution p(x|y) is easy to sample from.
Originally, the SW method was introduced to alleviate the problems encountered in sampling from Ising Models close to their critical temperature, where Gibbs sampling completely breaks down.
In its simplest form, the Ising model with no external fields, on a set of variables x_1, ..., x_n, x_i ∈ {0, 1}, takes the form
p(x) = (1/Z) ∏_{i∼j} e^{β I[x_i = x_j]}
which means that this is a pairwise Markov network with a potential contribution e^β if neighbouring nodes i and j are in the same state, and a contribution 1 otherwise. We assume that β > 0, which encourages neighbours to be in the same state. The lattice based neighbourhood structure makes this difficult to sample from, especially when the inverse temperature β is large, encouraging large scale islands of like-state variables to form. In that case, the probability of an individual variable being flipped by Gibbs sampling is negligible.
It's clear that we need to employ p(y|x) to cancel the terms e^{βI[x_i=x_j]}. We can do this by making
p(y|x) = ∏_{i∼j} p(y_ij | x_i, x_j) = ∏_{i∼j} (1/z_ij) I[0 < y_ij < e^{βI[x_i=x_j]}]
where I[0 < y_ij < e^{βI[x_i=x_j]}] denotes a uniform distribution between 0 and e^{βI[x_i=x_j]}, and z_ij is the normalisation constant z_ij = e^{βI[x_i=x_j]}. Hence
p(x, y) = p(y|x) p(x) ∝ ∏_{i∼j} I[0 < y_ij < e^{βI[x_i=x_j]}]
so that the awkward exponential factors have cancelled, and p(x|y) is uniform over all states x consistent with the constraints imposed by y.
Let's assume that we have a sample y_ij. If y_ij > 1 then, to draw a sample from p(x|y), we must have 1 < e^{βI[x_i=x_j]}, which means that x_i and x_j must be in the same state. Otherwise, if y_ij < 1, then what constraint does this place on what the x can be? None! Hence, wherever y_ij > 1, we bond x_i and x_j to be in the same state. The probability of such a bond forming, when x_i and x_j are currently in the same state, is
p(y_ij > 1) = ∫_1^{e^β} (1/z_ij) dy_ij = (e^β − 1)/e^β = 1 − e^{−β}
Temporal Distributions
Many applications involve temporal distributions of the generic form
p(v_{1:T}, h_{1:T}) = p(v_1|h_1) p(h_1) ∏_{t=2}^{T} p(v_t|h_t) p(h_t|h_{t−1})
We encountered a few already, namely the Kalman Filter, Hidden Markov Model
and Switching Kalman Filter. Our interest here will be in the calculation of
p(ht |v1:T ).
In the mentioned models, we have used either exact inference methods, or devel-
oped (in the SKF case) approximate inference methods. However, there are cases
where the transitions are such that it may not be clear how to form an appropriate
analytic approximation procedure (although, in my experience, such situations are
rare), and more general numerical approximations are sought.
It should be borne in mind that tailoring the approximation method to the model at hand is usually vital for reasonable performance. Nevertheless, we'll discuss below some fairly general sampling procedures that may be brought to bear, and that have proved popular, mainly due to their implementational simplicity.
Figure 24.5: A Switching Kalman Filter. The variables h and v are Gaussian dis-
tributed. The Switch variables s are discrete, and control the means and variances
of the Gaussian transitions.
Particle Filters
Despite our interest in p(ht |v1:T ), PFs make the assumption that the so-called
‘filtered estimate’ p(ht |v1:t ) would be a reasonable approximation or, at least, a
quantity of interest.
The traditional viewpoint of a Particle Filter is as a recursive importance sampler.
Here, we show how it can be viewed as a (somewhat severe) approximation of the
Forward Pass in Belief Propagation.
ρ(h_t) ∝ p(v_t|h_t) ∫_{h_{t−1}} p(h_t|h_{t−1}) ρ(h_{t−1})    (24.2.5)
The proportionality constant is chosen to make ρ(h_t) a distribution. Although ρ(h_{t−1}) was a simple
sum of delta peaks, in general ρ(ht ) will not be – the peaks get ‘broadened’ by the
hidden-to-hidden and hidden-to-observation factors. One can think of many ways
to approximate ρ(ht ).
In PFs, we make another approximation of ρ(ht ) in the form of a weighted sum of
delta-peaks. There are many ways to do this, but the simplest is to just sample
a set of points from the (unnormalised) distribution equation (24.2.7). There are
many ways we could carry out this sampling. One simple way is to note that equation (24.2.7) represents a mixture distribution.
Consider
p(h) = (1/Z) Σ_{i=1}^{I} w_i φ_i(h)
How can we sample from this distribution? Clearly, there are many approaches.
A simple idea is to use importance sampling.
<f(h)> = (1/Z) Σ_i w_i ∫_h q_i(h) [φ_i(h)/q_i(h)] f(h)    (24.2.8)
       ≈ (1/Z) Σ_i w_i Σ_μ f(h_i^μ) φ_i(h_i^μ)/q_i(h_i^μ)    (24.2.9)
       ≈ [Σ_i w_i Σ_μ f(h_i^μ) φ_i(h_i^μ)/q_i(h_i^μ)] / [Σ_i w_i Σ_μ φ_i(h_i^μ)/q_i(h_i^μ)]    (24.2.10)
       ≈ Σ_{i,μ} r_i^μ f(h_i^μ) / Σ_{i,μ} r_i^μ    (24.2.11)
where r_i^μ ≡ w_i φ_i(h_i^μ)/q_i(h_i^μ) and the samples h_i^μ are drawn from q_i(h). If, say, for each mixture component i we generate a set of P samples h_i^μ, μ = 1, ..., P, then we will have I × P weights r_i^μ.
We then need to select from this set a smaller set (usually of size I again) of points. This can be done either by discarding the points with small r_i^μ, or by sampling from the unnormalised distribution defined by the r_i^μ. Once done, we have a set of retained points and weights r_i^μ, from which a new set of mixture weights w_i^* can be found by normalising the selected weights. Heuristics are usually required since, as is nearly always the case with naive importance sampling, only a few of the weights will have significant value – the exponential dominance problem. In practice, repopulation heuristics are usually employed to get around this. And other hacks.....
In my humble opinion, there is little advantage in using the (very poor) importance
sampler. Rather, it is better to look again at the equation
ρ(h_t) ∝ p(v_t|h_t) ∫_{h_{t−1}} p(h_t|h_{t−1}) ρ(h_{t−1})    (24.2.12)
Assuming that we have a sample h_{t−1}^μ from ρ(h_{t−1}), we can then draw a sample h_t from ρ(h_t, h_{t−1}) by Gibbs sampling, ie by sampling from the unnormalised conditional ρ(h_t, h_{t−1} = h_{t−1}^μ). For each sample, we can then proceed to the next time step. This will then generate a single sample path h_1^μ, h_2^μ, ..., h_T^μ. We repeat this procedure to get a set of sample paths (this can, of course, also be done in parallel, so that we generate, at each time step, a set of samples h_t^μ, μ = 1, ..., P).
The advantage of this approach is that any of the more powerful sampling methods
developed over the last 50 years can be used, and one is not hampered by the
miserable performance of importance sampling.
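A minimal sketch of the sequential sampling recursion just described, under an assumed one dimensional linear-Gaussian model h_t = a h_{t−1} + noise of variance q, v_t = h_t + noise of variance r, for which the unnormalised conditional ρ(h_t, h_{t−1} = h_{t−1}^μ) is itself Gaussian and can therefore be sampled directly (no Gibbs iterations are needed). The parameter values are illustrative.
a=0.9; q=1; r=0.5; T=50; P=100;          % model and sampler parameters (illustrative)
htrue=zeros(1,T); v=zeros(1,T);           % generate synthetic observations from the model
htrue(1)=randn; v(1)=htrue(1)+sqrt(r)*randn;
for t=2:T
  htrue(t)=a*htrue(t-1)+sqrt(q)*randn; v(t)=htrue(t)+sqrt(r)*randn;
end
h=zeros(P,T);
s=1/(1/q+1/r);                            % conditional variance (prior variance q assumed for h_1)
h(:,1)=s*(v(1)/r)+sqrt(s)*randn(P,1);     % samples of h_1 given v_1
for t=2:T
  % rho(h_t, h_{t-1}=h^mu_{t-1}) propto N(h_t; a h^mu_{t-1}, q) N(v_t; h_t, r) : Gaussian in h_t
  m=s*(a*h(:,t-1)/q + v(t)/r);
  h(:,t)=m+sqrt(s)*randn(P,1);
end
% each row of h is now a sample path h_1,...,h_T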
The following MATLAB listing demonstrates Metropolis sampling, with an isotropic Gaussian proposal, from a two-mode distribution (cf. fig(24.3)). The initialisation lines are assumed; the functions metropolis and logp would normally sit in their own files metropolis.m and logp.m.
s=1; x=zeros(2,1000); x(:,1)=[0;0];     % assumed initialisation : proposal width and start point
figure; hold on;
for mu=2:1000
  x(:,mu)=metropolis(x(:,mu-1),s,'logp');   % one Metropolis step
  plot(x(1,mu),x(2,mu),'.');
  if mod(mu,100)==1
    drawnow
  end
end

function xnew=metropolis(x,s,logp)
% a single Metropolis step with isotropic Gaussian proposal of standard deviation s
xcand=x+s*randn(size(x));                % candidate
loga=feval(logp,xcand)-feval(logp,x);    % log acceptance ratio
if loga>0
  xnew=xcand;                            % uphill moves are always accepted
else
  r=rand;
  if log(r)<loga
    xnew=xcand;                          % downhill move accepted with probability exp(loga)
  else
    xnew=x;                              % rejected : keep a copy of the current sample
  end
end

function l=logp(x)
% log of an unnormalised two-mode distribution
l1 = exp(- (x(1)^2+x(2)^2+sin(x(1)+x(2))^2));
f=3;
l2 = exp(- ((x(1)-f)^2+(x(2)-f)^2+sin(x(1)+x(2)-2*f)^2));
l=log(l1+l2);
Rao-Blackwellisation
Explain why this is very often a red-herring, since it assumes that one has a good sampler (which is itself the whole problem of sampling!). Give a picture where it is easy to sample in the high dimensional space, but where the distribution is more multi-modal in the lower, projected dimension, compounding the difficulty of sampling. (Rao-Blackwellisation says that the variance of the sampler will always be higher in the higher dimensional space – but that is in fact a good thing in many practical cases.) My feeling is that RB is just another inappropriate piece of theory that misses the point.
Appendix A Basic Concepts in Probability
Sally performs a simple coin tossing experiment, recording the outcome (heads H or tails T) of each of a sequence of throws:
H, T, H, H, T, T, H, H, H, T, H, T, H
Her professor asks her to make an analysis of the results of her experiments. Ini-
tially, Sally is tempted to say that she does not wish to summarise the experiments,
since this would constitute a loss of information. However, she realises that there is
little to be gained from simply reporting the outcomes of the experiment, without
any summarisation of the results.
Independence → compact description
She therefore makes an assumption (which she states in her report), namely that the outcome of tossing a coin at one time does not influence the outcome at any other time (so, if the coin comes up heads now, this will not influence whether or not the coin comes up heads or tails on the next throw). This is a common assumption, called independence of trials. Under this assumption, the ordering of the data has no relevance, and the only quantities which are therefore invariant under this assumption are the total number of heads and the total number of tails observed in the experiment. See how this assumption has enabled us to make a compact description of the data. Indeed, it is such independence assumptions, their characterisation and exploitation, that are the subject of graphical models.
Random → compact model description
Sally repeats the experiment many times, and notices that the ratio of the number of heads observed to the total number of throws tends to roughly 0.5. She therefore
summarises the results of her experiments by saying that, in the long run, on
average, she believes half the time the coin will end up heads, and half the time
tails. Key words in the above sentence are believes and long run. We say believes
since Sally cannot repeat the experiment an infinite number of times, and it is
therefore her belief that if she were to toss the coin an infinite number of times, the
number of heads occurring would be half that of the total. Hence, she invokes the
concept of randomness/probability to describe a model that she believes accurately
reflects the kind of experimental results she has found.
In a sense, Sally (and her environment) operates like a random number generator.
(If we knew all the possible things that could influence Sally, and the coin, and how
she might toss it, we might conclude that we could predict the outcome of the coin
toss – there is nothing ‘random’ about it. However, to do so would not render itself
to a compact description – hence the usefulness of the concept of randomness).
Events are possible outcomes of a generating process. For example, 'the coin is heads' is an event, as is 'the coin is tails'. In this case, these two events are mutually exclusive, since they cannot both simultaneously occur.
• We’ll take the pragmatic viewpoint that the probability of an event occurring
is simply a number p(event occurs) between 0 and 1.
• p(event occurs) = 0 means that it is certain that the event cannot occur.
Similarly, p(event occurs) = 1 means that it is certain that the event occurs.
• We need a rule for how events interact :
I’ll use p(x, y) to denote the joint event p(x and y).
p(x|y) ≡ p(x, y) / p(y)
Here’s one way to think about conditional probability : Imagine we have a dart
board, split into 20 sections. We may think of a drunk dart thrower, and describe
our lack of knowledge about the throwing by saying that the probability that a
dart occurs in any one of the 20 regions is p(region i) = 1/20. Imagine then, that
our friend, Randy, the random dart thrower is blindfolded, and throws a dart.
With pint in hand, a friend of Randy tells him that he hasn’t hit the 20 region.
What is the probability that Randy has hit the 5 region?
Well, if Randy hasn't hit the 20 region, then only the regions 1 to 19 remain and, since there is no preference for Randy for any of these regions, the probability is 1/19. To see how we would calculate this with the rules of probability:
p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20) = (1/20) / (19/20) = 1/19
Degree of Belief
The above dart board example is easy to think about in terms of probability –
it’s straightforward to imagine many repetitions of the experiment, and one could
think about the ‘long’ run way of defining probability.
Here’s another problem :
What’s the probability that I will like the present my grandmother has bought
me for Christmas? In a purely technical sense, if we were to define probability
as a limiting case of infinite repetitions of the same experiment, this wouldn’t
make much sense in this case. We can't repeat the experiment. However, the degree of belief (or predictability) interpretation sidesteps this issue – it is simply a consistent framework for manipulating real values in a way that matches our intuition about probability.
TODO: Confidence intervals, Central limit theorem, Asymptotic Equipartition
Theorem.
Appendix B Graph Terminology:
Directed and Undirected Links Let G = (X, L) be a graph. When L_ij ∈ L but L_ji ∉ L, the link L_ij is called a directed link. Otherwise, the link is called undirected.
Directed and Undirected Graphs A graph in which all the links are directed is called a directed graph, and a graph in which all the links are undirected is called an undirected graph. A graph containing both directed and undirected links is called a chain graph.
Adjacency Set Given a graph G = (X, L) and a node X_i, the adjacency set of X_i is the set of nodes directly attainable from X_i, that is Adj(X_i) = {X_j | L_ij ∈ L}. That is, in a directed graph, the adjacency set of a node is the set of nodes it points to; in an undirected graph, it is the set of nodes it is directly connected to.
The adjacency sets of the directed graph in fig(B.1a) can therefore be read off as the sets of nodes that each node points to.
Figure B.1: (a) A directed graph and (b) an undirected graph, on the nodes A, B, C, D, E, F.
Complete set A subset of nodes S of a graph G is said to be complete if there are links between
every pair of nodes in S. Correspondingly, a graph is complete if there is a link
between every pair of nodes in the graph.
For example, fig(B.2.1) is a complete graph of 5 nodes.
Figure B.3: Two different graphs and their associated cliques. (a) Here there are
two cliques of size 3. (b) Here there are several cliques of size three and one clique
of size 4. Links belonging to cliques of size two have not been coloured.
Clique A complete set of nodes C is called a clique if C is not a subset of another complete
set. In other words, a clique cannot be expanded since the expanded set would
not be complete.
For example, in fig(B.3a) the cliques are the two coloured sets of three nodes.
ABE is not a clique since it is a subset of a larger complete set, namely ABED.
Figure B.4: (a) A disconnected graph. (b) A Tree (c) A loopy graph
Loop A loop is a closed path (a series of nodes with intersecting adjacency sets) in an
undirected graph.
For example, B − D − F − C − D in fig(B.1b) is a loop.
Neighbours of a node The set of nodes adjacent to a node Xi in an undirected graph is referred to as
the neighbours of Xi , N br (Xi ) = {Xj |Xj ∈ Adj (Xi )}
So, in an undirected graph, the neighbours of a node are identical to the adjacency
set of that node.
Connected Undirected Graphs An undirected graph is connected if there is at least one path between every pair of nodes. Otherwise, the graph is disconnected.
For example, fig(B.4a) is a disconnected graph.
Tree A connected undirected graph is a tree if for every pair of nodes there exists a
unique path.
For example fig(B.4b) is a tree.
Multiply-connected or Loopy A connected undirected graph is multiply-connected (or loopy) if it contains at
Graphs least one pair of nodes that are joined by more than one path, or equivalently, if
it contains at least one loop.
Figure B.5: (a) Parents (A, B) and children (E, F ) of node D (b) C → E → G →
I → C is a cycle.
Parents and Children When there is a directed link from Xi to Xj , then Xi is said to be a parent of Xj ,
and Xj is a child of Xi .
For example, in fig(B.5a) the parents of node E are C and D. Node E has only
one child, node G.
Cycle A cycle is a closed directed path in a directed graph. If a graph contains no cycles,
it is called acyclic.
For example, in fig(B.5b), C → E → G → I → C is a cycle. The nodes along the path connecting D, F, H, J, G, E, D do not form a cycle since the directions of the links are not consistent.
Family The set consisting of a node and its parents is called the family of the node.
Simple Trees and Polytrees A directed tree is called a simple tree if every node has at most one parent. Otherwise, it is called a polytree.
Gamma Distribution
p(x) = [1/(β Γ(γ))] ((x − x₀)/β)^{γ−1} e^{−(x−x₀)/β},   x ≥ x₀, β > 0    (C.0.1)
γ is called the shape parameter, x0 the location parameter, β is the scale parameter
and
Γ(a) = ∫₀^∞ t^{a−1} e^{−t} dt
Dirichlet Distribution
For a probability vector α = (α₁, ..., α_Q) (α_q ≥ 0, Σ_q α_q = 1), the Dirichlet distribution with parameter u is
Dirichlet(α|u) = (1/Z(u)) ∏_{q=1}^{Q} α_q^{u_q − 1}
where
Z(u) = ∏_{q=1}^{Q} Γ(u_q) / Γ(Σ_{q=1}^{Q} u_q)
The parameter u controls how strongly the mass of the distribution is pushed to
the corners of the simplex. Setting uq = 1 for all q corresponds to a uniform
distribution.
In the binary case Q = 2, this is also called a Beta distribution.
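A standard way to draw a sample from a Dirichlet (a sketch; this assumes the Statistics Toolbox function gamrnd is available) is to draw independent Gamma variates with shape parameters u_q and normalise:
u=[2 3 4];               % illustrative parameters (Q=3)
g=gamrnd(u,1);           % independent Gamma(u_q,1) draws
alpha=g/sum(g)           % one sample from Dirichlet(alpha|u)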
Appendix D Bounds on Convex Functions
Figure D.1: (a) The probability density functions for two different distributions
p(x) and q(x). We would like to numerically characterise the difference between
these distributions. (b) A simple linear bound on the logarithm enables us to
define a useful distance measure between distributions (see text).
A useful such measure is the Kullback–Leibler (KL) divergence
KL(q||p) ≡ <log q(x) − log p(x)>_{q(x)}
where the notation <f(x)>_{r(x)} denotes the average of the function f(x) with respect to the distribution r(x). For a continuous variable, this is <f(x)>_{r(x)} = ∫ f(x) r(x) dx, and for a discrete variable we would have <f(x)>_{r(x)} = Σ_x f(x) r(x).
The advantage of this notation is that much of the following holds independent of
whether the variables are discrete or continuous.
KL(q||p) ≥ 0 The KL divergence is always ≥ 0. To see this, consider the following simple linear bound on the function log(x) (see fig(D.1b)):
log(x) ≤ x − 1
Replacing x by p(x)/q(x) in the above bound gives
log(p(x)/q(x)) ≤ p(x)/q(x) − 1   ⇒   q(x) log p(x) − q(x) log q(x) ≤ p(x) − q(x)
Now integrate (or sum in the case of discrete variables) both sides. Using ∫ p(x) dx = 1 and ∫ q(x) dx = 1, and rearranging, gives
∫ {q(x) log q(x) − q(x) log p(x)} dx ≡ <log q(x) − log p(x)>_{q(x)} ≥ 0
Furthermore, one can show that the KL divergence is zero if and only if the two
distributions are exactly the same.
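A tiny MATLAB illustration on two (illustrative) discrete distributions; the computed value is non-negative, and is zero only when q and p coincide.
q=[0.5 0.3 0.2]; p=[0.4 0.4 0.2];   % two illustrative discrete distributions
KL=sum(q.*log(q./p))                % always >= 0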
Consider the quantity
J = log ∫_x p(x) f(x)
Clearly these two bounds are equal for the setting q(x) = p(x) (this reminds me of
the difference in sampling routines, between sampling from the prior and sampling
from a distribution which is more optimal for calculating the average). The first
term encourages q to be close to p. The second encourages q to be close to f , and
the third encourages q to be sharply peaked.
Appendix E Positive Definite Matrices and Kernel Functions:
For a real symmetric matrix A, with eigenvalues λ_i and eigenvectors e^i, we have
A e^i = λ_i e^i
Hence, for any two eigenvectors,
(e^j)ᵀ A e^i = λ_i (e^j)ᵀ e^i   and, by the symmetry of A,   (e^j)ᵀ A e^i = ((e^i)ᵀ A e^j)ᵀ = λ_j (e^j)ᵀ e^i
Hence we must have λ_i (e^j)ᵀ e^i = λ_j (e^j)ᵀ e^i. If λ_i ≠ λ_j, the only way this condition can be satisfied is if (e^j)ᵀ e^i = 0 – ie, the eigenvectors are orthogonal.
This means also that we can represent a symmetric matrix as
A = Σ_i λ_i e^i (e^i)ᵀ
Hermitian Matrices
A Hermitian matrix satisfies
A = (Aᵀ)*
where * denotes the conjugate operator, ie A equals its conjugate transpose. The reader can easily show that for Hermitian matrices, the eigenvectors form an orthogonal set, and furthermore that the eigenvalues are real.
Positive Definite Matrices
A real symmetric matrix A is positive definite if, for all non-zero vectors y,
yᵀ A y > 0
Writing y in the basis of eigenvectors of A, y = Σ_i α_i e^i, gives yᵀ A y = Σ_i λ_i α_i². This is clearly greater than zero for all non-zero y if and only if all the eigenvalues are positive. Positivity of the eigenvalues is therefore an equivalent condition for positive definiteness.
Kernel
Kernels are closely related to the idea of Gaussian Processes. See [63] for an
excellent review. Consider a collection of points, x1 , . . . , xP , and a symmetric
function K(xi , xj ) = K(xj , xi ). We can use this function to define a symmetric
matrix K with elements
[K]ij = K(xi , xj )
The function K is called a Kernel if the corresponding matrix (for all P ) is positive
definite.
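A quick MATLAB check of this definition for the squared exponential function K(x_i, x_j) = exp(−(x_i − x_j)²) on a random set of points; the eigenvalues of the resulting matrix should all be non-negative (up to numerical round-off). The point set is illustrative.
P=20; x=randn(P,1);
D=repmat(x,1,P)-repmat(x',P,1);   % matrix of differences x_i - x_j
K=exp(-D.^2);                     % squared exponential kernel matrix
min(eig(K))                       % non-negative (up to round-off) for any P and any points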
Eigenfunctions
∫_x K(x', x) φ_a(x) = λ_a φ_a(x')
By an analogous argument that proves the theorem of linear algebra above, the
eigenfunctions are orthogonal:
∫_x φ_a(x) φ_b^*(x) = δ_ab
where φ^*(x) is the complex conjugate of φ(x). From the previous results, we know that a symmetric real matrix K must have a decomposition in terms of eigenvectors with positive, real eigenvalues. Since this is to be true for any dimension of matrix, it suggests that we need the (real symmetric) kernel function itself to have a decomposition (provided the eigenvalues are countable)
K(x_i, x_j) = Σ_μ λ_μ φ_μ(x_i) φ_μ^*(x_j)
since then
Σ_{i,j} y_i K(x_i, x_j) y_j = Σ_{i,j,μ} λ_μ y_i φ_μ(x_i) φ_μ^*(x_j) y_j = Σ_μ λ_μ (Σ_i y_i φ_μ(x_i)) (Σ_j y_j φ_μ^*(x_j)) = Σ_μ λ_μ z_μ z_μ^*
which is greater than zero if the eigenvalues are all positive (since for complex z,
zz ∗ ≥ 0).
If the eigenvalues are uncountable (which happens when the domain of the kernel is unbounded), the appropriate decomposition is
K(x_i, x_j) = ∫ λ(s) φ(x_i, s) φ^*(x_j, s) ds
Similarly, if the kernel has a decomposition
K(x, x') = Σ_i ψ_i(x) ψ_i(x')
for real functions ψ_i(x), then K is a Kernel. Note that we do not require the functions to be orthogonal. However, we know that if K is a Kernel, then an alternative expansion in terms of orthogonal functions exists. More generally, K is a Kernel if it has a representation
K(x, x') = ∫ ψ(x, s) ψ(x', s) ds
Translation Invariance
In the case that k(x, x′ ) = k(x − x′ ) (note that we do not impose symmetry here
of the function k(x, x′ ), so that this section holds for more general functions than
kernels), the function is called stationary. The eigenproblem becomes
∫ k(x − x') e(x') dx' = λ e(x)
In this case, the LHS is in the form of a convolution. It makes sense therefore to take the Fourier Transform:
k̃(s) ẽ(s) = λ ẽ(s)
This means that ẽ is a delta function, and that therefore the eigenfunctions are e^{isx}.
When k is Hermitian, it has an expansion in terms of eigenfunctions. Note that the form of the conjugate expansion automatically ensures that the Kernel is translation invariant, since φ(x + a) φ^*(x' + a) = e^{is(x+a−x'−a)} = e^{is(x−x')} = φ(x) φ^*(x'). (Indeed, this shows generally why Fourier representations are central to systems which possess translation invariance.) Exercise: what happens if we take the Laplace transform of a translation invariant operator?
For the squared exponential kernel, the Fourier Transform of the kernel is a Gaussian, and hence λ(s) = e^{−s²}. Hence, we have a representation of the kernel as
e^{−(x−x')²} = ∫ e^{−s²} e^{isx} e^{−isx'} ds = ∫ e^{−s²+is(x−x')} ds    (E.0.1)
The reader may verify that this is an identity (up to constant factors and a rescaling) by considering the Fourier Transform of a Gaussian,
∫ e^{−x²} e^{iwx} dx = √π e^{−w²/4}
The form of the representation equation (E.0.1) of the kernel verifies that it is indeed a kernel, since the eigenvalues λ(s) = e^{−s²} are all positive.
Bochner's Theorem
For a stationary process (the Kernel is translation invariant), we can define the so-called correlation function which, for K(0) > 0, is
ρ(x) = K(x)/K(0)
Bochner's theorem states that ρ(x) is positive semidefinite if and only if it is the characteristic function of a random variable ω,
ρ(x) = ∫ e^{ixω} f(ω) dω
for a probability distribution f(ω). This means that we can prove positive semi-definiteness of K by checking that its Fourier Transform is non-negative.
For stationary Kernels which are also isotropic (K(x) = K(|x|)), one can show that ρ must be representable as a Hankel transform [63].
Mercer’s Theorem
(Not really sure why this is useful!!) For x, x' in a bounded domain of R^N, any positive (semi) definite Kernel has an expansion of the form
K(x, x') = Σ_i φ_i(x) φ_i(x')
Note that this may possibly be an infinite series. Also, note that there is no requirement that the functions be orthogonal.
Aside: it is interesting to think about the unbounded case since, for example, the conditions of Mercer's theorem do not apply to the squared exponential Kernel e^{−(x−x')²}. From Bochner's Theorem, or using simple Fourier Analysis, we can indeed form an expansion of this Kernel in terms of an integral representation of complex orthogonal functions. What is interesting about the squared exponential case is that we can indeed also find a Mercer-style representation (exercise!), even though the conditions of the theorem do not hold.
Appendix F Approximating Integrals:
Consider a distribution
p(w) = (1/Z) e^{−E(w)}
In many cases (due to some form of asymptotics), distributions tend to become rather sharply peaked. The Laplace method aims to make a Gaussian approximation of p(w). The Laplace approach is a simple perturbation expansion which assumes that the distribution is sharply peaked around some value w*. If we find w* = arg min_w E(w) and expand to second order around this point,
E(w) ≈ E(w*) + ½ (w − w*)ᵀ H (w − w*)
where H is the Hessian of E(w) evaluated at w* (the linear term vanishing at the minimum), the resulting Gaussian integral can be computed analytically. Hence
∫_w e^{−E(w)} ≈ e^{−E(w*)} √(det 2πH^{−1})
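A small MATLAB check of this approximation on the (illustrative) one dimensional example E(w) = w⁴/4 + w²/2, for which w* = 0 and H = E''(0) = 1:
E=@(w) w.^4/4 + w.^2/2;                      % illustrative E(w), minimum at w*=0
wstar=0; H=1;                                % Hessian E''(0) = 3*0^2 + 1 = 1
laplace=exp(-E(wstar))*sqrt(2*pi/H)          % Laplace estimate, sqrt(2*pi) approx 2.51
numeric=integral(@(w) exp(-E(w)), -10, 10)   % numerical value of the integral, for comparison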
Appendix G Inference with Gaussian Random Variables
We have two variables X and Y. Imagine that X models the position of an object in the world (in one dimension) and Y is an observation, say in a camera, of the
position of the object in the camera. A camera calibration procedure tells us the
relationship between X and Y ; in our case we assume
Y = 2X + 8 + Ny
where Ny is some Gaussian measurement noise with zero mean and variance 1.
Thus our model for P(y|x) is
P(y|x) = (1/√(2π)) exp{−(y − 2x − 8)²/2}.
Also we assume that x ∼ N(0, 1/α) so that
P(x) = √(α/(2π)) exp{−α x²/2}.
For the covariance matrix, we have that var(X) = 1/α. For var(Y) we find
var(Y) = E[(Y − µ_y)²] = E[(2X + N_y)²] = 4/α + 1,
and for covar(X, Y) we find
covar(X, Y) = E[(X − µ_x)(Y − µ_y)] = E[X(2X + N_y)] = 2/α
and thus
Σ = [ 1/α   2/α ;  2/α   4/α + 1 ].
Given a vector of random variables split into two parts X₁ and X₂, with
µ = [ µ₁ ; µ₂ ]   and   Σ = [ Σ₁₁   Σ₁₂ ;  Σ₂₁   Σ₂₂ ]
the standard result for conditioning a Gaussian gives
p(x₁|x₂) = N( µ₁ + Σ₁₂ Σ₂₂⁻¹ (x₂ − µ₂),  Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁ )
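A small MATLAB sketch applying this to the example above: given an observed camera reading y, the posterior over the object position x follows directly from the conditioning formula (the values of α and y are illustrative).
alpha=1; y=10;                                   % illustrative prior precision and observation
mu=[0; 8];                                       % [E[X]; E[Y]] for the model above
Sigma=[1/alpha, 2/alpha; 2/alpha, 4/alpha+1];
postmean=mu(1)+Sigma(1,2)/Sigma(2,2)*(y-mu(2))   % E[x | y]
postvar =Sigma(1,1)-Sigma(1,2)^2/Sigma(2,2)      % var(x | y)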