Expectation Maximization: Grading an Exam Without an Answer Key

A hen is only an egg’s way of making another egg. – Samuel Butler

Learning Objectives:
• Explain the relationship between parameters and hidden variables.
• Construct generative stories for clustering and dimensionality reduction.
• Draw a graph explaining how EM works by constructing convex lower bounds.
• Implement EM for clustering with mixtures of Gaussians, and contrast it with k-means.
• Evaluate the differences between EM and gradient descent for hidden variable models.

Dependencies:

Suppose you were building a naive Bayes model for a text categorization problem. After you were done, your boss told you that it became prohibitively expensive to obtain labeled data. You now have a probabilistic model that assumes access to labels, but you don’t have any labels! Can you still do something?

Amazingly, you can. You can treat the labels as hidden variables, and attempt to learn them at the same time as you learn the parameters of your model. A very broad family of algorithms for solving problems just like this is the expectation maximization family. In this chapter, you will derive expectation maximization (EM) algorithms for clustering and dimensionality reduction, and then see why EM works.
Each student n has an unknown score sn ∈ [0, 1] that denotes how well they do on the exam. The score is what we really want to compute. For each question m and each student n, the student has provided an answer an,m, which is either zero or one. There is also an unknown ground truth answer for each question m, which we’ll call tm, which is also either zero or one.
As a starting point, let’s consider a simple heuristic and then complexify it. The heuristic is the “majority vote” heuristic and works as follows. First, we estimate tm as the most common answer for question m: $t_m = \arg\max_t \sum_n \mathbb{1}[a_{n,m} = t]$. Once we have a guess for each true answer, we estimate each student’s score as the fraction of their answers that match this guessed key: $s_n = \frac{1}{M} \sum_m \mathbb{1}[a_{n,m} = t_m]$.
Once we have these scores, however, we might want to trust some of the students more than others. In particular, answers from students with high scores are perhaps more likely to be correct, so we can recompute the ground truth according to weighted votes, where the weight of each student’s vote is precisely that student’s score:

$$t_m = \arg\max_t \sum_n s_n \, \mathbb{1}[a_{n,m} = t]$$
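As a concrete illustration, here is a minimal sketch of one round of this heuristic in Python with NumPy (the array name `answers`, shaped N × M, and the function name are ours, not the book’s): majority vote for the key, scores against that key, then a score-weighted re-vote.

```python
import numpy as np

def majority_vote_round(answers):
    """One round of the heuristic; answers is an (N, M) array of 0/1 student answers."""
    N, M = answers.shape

    # Guess each true answer t_m as the majority answer to question m.
    t = (answers.sum(axis=0) >= N / 2).astype(int)

    # Score each student by the fraction of answers matching the guessed key.
    s = (answers == t).mean(axis=1)

    # Re-estimate the key with votes weighted by the students' scores.
    votes_for_one = (s[:, None] * answers).sum(axis=0)
    votes_for_zero = (s[:, None] * (1 - answers)).sum(axis=0)
    t_weighted = (votes_for_one >= votes_for_zero).astype(int)

    return t_weighted, s
```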
You can recognize this as a chicken and egg problem. If you knew the
students’ scores, you could estimate an answer key. If you had an
answer key, you could compute student scores. A very common
strategy in computer science for dealing with such chicken and egg
problems is to iterate. Take a guess at the first, compute the second,
recompute the first, and so on.
In order to develop this idea formally, we have to cast the problem in terms of a probabilistic model with a generative story. The generative story we’ll use is:

1. For each question m, choose a true answer $t_m \sim \text{Ber}(0.5)$.
2. For each student n, choose a score $s_n \sim \text{Uni}(0, 1)$.
3. For each question m and student n, choose an answer $a_{n,m} \sim \text{Ber}(s_n)^{t_m} \, \text{Ber}(1 - s_n)^{1 - t_m}$.

In the first step, each true answer is generated by flipping a fair coin. In the second step, each student’s score is a uniform random number between zero and one. The third step says that if the true answer is one, the student answers one with probability sn (the answer is drawn from Ber(sn)); if the true answer is zero, the student answers one with probability 1 − sn (the answer is drawn from Ber(1 − sn)). For instance, a student with score sn = 0.9 answers a true-zero question with a one only 10% of the time: they draw the answer from Ber(0.1). The exponents in step 3 select which of the two Bernoulli distributions to draw from, and thereby implement this rule.
This can be translated into the following likelihood:
$$
\begin{aligned}
p(a, t, s)
&= \left[ \prod_m 0.5^{t_m} \, 0.5^{1-t_m} \right] \times \left[ \prod_n 1 \right] \times \left[ \prod_n \prod_m s_n^{a_{n,m} t_m} (1-s_n)^{(1-a_{n,m}) t_m} \, s_n^{(1-a_{n,m})(1-t_m)} (1-s_n)^{a_{n,m}(1-t_m)} \right] && (16.2) \\
&= 0.5^M \prod_n \prod_m s_n^{a_{n,m} t_m} (1-s_n)^{(1-a_{n,m}) t_m} \, s_n^{(1-a_{n,m})(1-t_m)} (1-s_n)^{a_{n,m}(1-t_m)} && (16.3)
\end{aligned}
$$
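To make the generative story concrete, here is a small sampler sketch (our own code, not the book’s); the likelihood in (16.3) is exactly the probability of the `answers` array it produces, given t and s.

```python
import numpy as np

def sample_exam(num_students, num_questions, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: true answers are fair coin flips, t_m ~ Ber(0.5).
    t = rng.binomial(1, 0.5, size=num_questions)

    # Step 2: student scores are uniform, s_n ~ Uni(0, 1).
    s = rng.uniform(0.0, 1.0, size=num_students)

    # Step 3: a_{n,m} ~ Ber(s_n) if t_m = 1, otherwise Ber(1 - s_n).
    prob_answer_one = np.where(t == 1, s[:, None], 1.0 - s[:, None])
    answers = rng.binomial(1, prob_answer_one)

    return answers, t, s

answers, t, s = sample_exam(num_students=100, num_questions=50)
```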
Suppose we knew the true labels t. We can take the log of this likelihood and differentiate it with respect to the score sn of some student (note: we can drop the $0.5^M$ term because it is just a constant):
$$
\log p(a, t, s) = \sum_n \sum_m \Big[ a_{n,m} t_m \log s_n + (1-a_{n,m})(1-t_m) \log s_n + (1-a_{n,m}) t_m \log(1-s_n) + a_{n,m}(1-t_m) \log(1-s_n) \Big] \qquad (16.4)
$$

$$
\frac{\partial \log p(a, t, s)}{\partial s_n} = \sum_m \left[ \frac{a_{n,m} t_m + (1-a_{n,m})(1-t_m)}{s_n} - \frac{(1-a_{n,m}) t_m + a_{n,m}(1-t_m)}{1-s_n} \right] \qquad (16.5)
$$
The derivative has the form $\frac{A}{s_n} - \frac{B}{1-s_n}$. If we set this equal to zero and solve for sn, we get an optimum of $s_n = \frac{A}{A+B}$. In this case:

$$
\begin{aligned}
A &= \sum_m a_{n,m} t_m + (1-a_{n,m})(1-t_m) && (16.6) \\
B &= \sum_m (1-a_{n,m}) t_m + a_{n,m}(1-t_m) && (16.7) \\
A + B &= \sum_m 1 = M && (16.8)
\end{aligned}
$$
In the case of known ts, this matches exactly what we had in the heuristic: A counts the questions on which student n agrees with the answer key, so $s_n = \frac{A}{M}$ is precisely the fraction of matching answers.

However, we do not know t, so instead of using the “true” values of t, we’re going to use their expectations. In particular, we will compute sn by maximizing its likelihood under the expected values of t.
The full solution is then to alternate between these two. You can
start by initializing the ground truth values at the majority vote (this
seems like a safe initialization). Given those, compute new scores.
Given those new scores, compute new ground truth values. And
repeat until tired.
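Here is a hedged sketch of that alternation in Python (our own code and naming, not the book’s): initialize the key at the majority vote, score students against the current expected key, and re-estimate the expected key from the posterior over each tm implied by the likelihood in (16.3) with a Ber(0.5) prior.

```python
import numpy as np

def grade_exam(answers, num_iters=50):
    """Alternating estimation for the exam model; answers is an (N, M) array of 0/1."""
    N, M = answers.shape

    # Initialize the expected answer key at the (hard) majority vote.
    t = (answers.sum(axis=0) >= N / 2).astype(float)

    for _ in range(num_iters):
        # Score each student against the expected key:
        # s_n = (1/M) sum_m [ a_{n,m} E[t_m] + (1 - a_{n,m})(1 - E[t_m]) ].
        s = (answers * t + (1 - answers) * (1 - t)).mean(axis=1)
        s = np.clip(s, 1e-6, 1 - 1e-6)  # keep the logs finite

        # Expected key: E[t_m] = p(t_m = 1 | a, s), via the log-odds of
        # t_m = 1 versus t_m = 0 (the Ber(0.5) prior cancels out).
        log_odds = ((2 * answers - 1) *
                    (np.log(s[:, None]) - np.log(1 - s[:, None]))).sum(axis=0)
        t = 1.0 / (1.0 + np.exp(-log_odds))

    return t, s
```

With the sampler from earlier, you could call `grade_exam(answers)` and compare the returned expectations against the sampled t and s.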
In the next two sections, we will consider a more complex unsu-
pervised learning model for clustering, and then a generic mathe-
matical framework for expectation maximization, which will answer
questions like: will this process converge, and, if so, to what?
If you had access to labels, this would be all well and good, and you could obtain closed form solutions for the maximum likelihood estimates of all parameters by taking a log and then taking gradients of the log likelihood:

$$
\theta_k = \frac{1}{N} \sum_n [y_n = k], \qquad
\mu_k = \frac{\sum_n [y_n = k] \, x_n}{\sum_n [y_n = k]}, \qquad
\sigma_k^2 = \frac{\sum_n [y_n = k] \, ||x_n - \mu_k||^2}{\sum_n [y_n = k]}
$$

(You should be able to derive the maximum likelihood solution results formally by now.)

Suppose that you don’t have labels. Analogously to the K-means algorithm, one potential solution is to iterate. You can start off with
guesses for the values of the unknown variables, and then iteratively
improve them over time. In K-means, the approach was to assign examples to labels (or clusters). This time, instead of making hard assignments (“example 10 belongs to cluster 4”), we’ll make soft assignments (“example 10 belongs half to cluster 4, a quarter to cluster 2 and a quarter to cluster 5”). So as not to confuse ourselves too much, we’ll introduce a new variable, zn = ⟨zn,1, . . . , zn,K⟩ (that sums to one), to denote a fractional assignment of examples to clusters.

This notion of soft-assignments is visualized in Figure 16.1. Here, we’ve depicted each example as a pie chart, and its coloring denotes the degree to which it has been assigned to each (of three) clusters. The sizes of the pie pieces correspond to the zn values.
All that has happened here is that the hard assignments “[yn = k]”
have been replaced with soft assignments “zn,k ”. As a bit of fore-
shadowing of what is to come, what we’ve done is essentially replace
known labels with expected labels, hence the name “expectation maxi-
mization.”
Putting this together yields Algorithm 16.2. This is the GMM (“Gaussian Mixture Models”) algorithm, because the probabilistic model being learned describes a dataset as being drawn from a mixture distribution, where each component of this distribution is a Gaussian.

Just as in the K-means algorithm, this approach is susceptible to local optima and quality of initialization. The heuristics for computing better initializers for K-means are also useful here.

(Aside from the fact that GMMs use soft assignments and K-means uses hard assignments, there are other differences between the two approaches. What are they?)
Algorithm 38 GMM(X, K)
1: for k = 1 to K do
2:   µk ← some random location                      // randomly initialize mean of cluster k
3:   σk² ← 1                                        // initialize variances
4:   θk ← 1/K                                       // each cluster equally likely a priori
5: end for
6: repeat
7:   for n = 1 to N do
8:     for k = 1 to K do
9:       zn,k ← θk (2πσk²)^(−D/2) exp[− ||xn − µk||² / (2σk²)]   // compute (unnormalized) fractional assignments
10:     end for
11:     zn ← (1 / ∑k zn,k) zn                        // normalize fractional assignments
12:   end for
13:   for k = 1 to K do
14:     θk ← (1/N) ∑n zn,k                           // re-estimate prior probability of cluster k
15:     µk ← (∑n zn,k xn) / (∑n zn,k)                // re-estimate mean of cluster k
16:     σk² ← (∑n zn,k ||xn − µk||²) / (∑n zn,k)     // re-estimate variance of cluster k
17:   end for
18: until converged
19: return z                                         // return cluster assignments
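Below is a compact NumPy sketch of Algorithm 38, under the same assumptions the pseudocode makes (spherical Gaussians with one variance σk² per cluster, random initialization); the function and variable names are ours.

```python
import numpy as np

def gmm(X, K, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]  # random initial means
    sigma2 = np.ones(K)                           # initial variances
    theta = np.full(K, 1.0 / K)                   # uniform cluster priors

    for _ in range(num_iters):
        # Fractional assignments z_{n,k} (lines 7-12 of the pseudocode).
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = theta * (2 * np.pi * sigma2) ** (-D / 2) * np.exp(-sq_dists / (2 * sigma2))
        z /= z.sum(axis=1, keepdims=True)

        # Parameter re-estimation (lines 13-17 of the pseudocode).
        Nk = z.sum(axis=0)
        theta = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (z * sq_dists).sum(axis=0) / Nk

    return z, theta, mu, sigma2
```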
The questions that remain are: (1) can you apply this idea more generally, and (2) why is it even a reasonable thing to do? Expectation maximization is a family of algorithms for performing maximum likelihood estimation in probabilistic models with hidden variables.
The general flavor of how we will proceed is as follows. We want to maximize the log likelihood L, but this will turn out to be difficult to do directly. Instead, we’ll pick a surrogate function L̃ that’s a lower bound on L (i.e., L̃ ≤ L everywhere) that’s (hopefully) easier to maximize. We’ll construct the surrogate in such a way that increasing it will force the true likelihood to also go up. After maximizing L̃, we’ll construct a new lower bound and optimize that. This process is shown pictorially in Figure 16.2.

Figure 16.2: A figure showing successive lower bounds.
To proceed, consider an arbitrary probabilistic model p( x, y | θ),
where x denotes the observed data, y denotes the hidden data and
θ denotes the parameters. In the case of Gaussian Mixture Models,
x was the data points, y was the (unknown) labels and θ included
the cluster prior probabilities, the cluster means and the cluster vari-
ances. Now, given access only to a number of examples x1 , . . . , x N ,
you would like to estimate the parameters (θ) of the model.
Probabilistically, this means that some of the variables are un-
known and therefore you need to marginalize (or sum) over their
possible values. Now, your data consists only of X = ⟨x1, x2, . . . , xN⟩,
not the (x, y) pairs in D. You can then write the likelihood as:

$$
p(X \mid \theta) = \prod_n p(x_n \mid \theta) = \prod_n \sum_{y_n} p(x_n, y_n \mid \theta)
$$

At this point, the natural thing to do is to take logs and then start taking gradients. However, once you start taking logs, you run into a problem: the log cannot eat the sum!

$$
\mathcal{L}(X \mid \theta) = \sum_n \log \sum_{y_n} p(x_n, y_n \mid \theta)
$$

Namely, the log gets “stuck” outside the sum and cannot move in to decompose the rest of the likelihood term!
The next step is to apply the somewhat strange, but strangely
useful, trick of multiplying by 1. In particular, let q(·) be an arbitrary
probability distribution. We will multiply the p(. . . ) term above by
q(yn )/q(yn ), a valid step so long as q is never zero. This leads to:
$$
\begin{aligned}
\mathcal{L}(X \mid \theta) &= \sum_n \log \sum_{y_n} q(y_n) \, \frac{p(x_n, y_n \mid \theta)}{q(y_n)} && (16.27) \\
&\geq \sum_n \sum_{y_n} q(y_n) \log \frac{p(x_n, y_n \mid \theta)}{q(y_n)} && (16.28) \\
&= \sum_n \sum_{y_n} \Big[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \Big] && (16.29) \\
&\triangleq \tilde{\mathcal{L}}(X \mid \theta) && (16.30)
\end{aligned}
$$

The inequality in (16.28) is Jensen’s inequality: because the log is concave, the log of an average under q is at least the average of the logs.
Note that this inequality holds for any choice of function q, so long as it is non-negative and sums to one. In particular, it needn’t even be the same q for every n.
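As a quick numerical sanity check of this bound (a toy example of our own, not from the text), the sketch below builds a two-component 1-D Gaussian mixture, picks an arbitrary valid q for each example, and verifies that L̃ ≤ L.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture model p(x, y | theta): y in {0, 1}, unit-variance Gaussian components.
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 2.0])
X = np.concatenate([rng.normal(-2.0, 1.0, 30), rng.normal(2.0, 1.0, 70)])

# Joint p(x_n, y | theta) for every example n and both values of y: shape (N, 2).
joint = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)

# True log likelihood L: log of the marginal p(x_n | theta), summed over examples.
L = np.log(joint.sum(axis=1)).sum()

# An arbitrary distribution q(y_n): non-negative and sums to one for each n.
q = rng.dirichlet([1.0, 1.0], size=len(X))

# Surrogate from (16.29): sum_n sum_y [ q log p(x, y | theta) - q log q ].
L_tilde = (q * (np.log(joint) - np.log(q))).sum()

print(L_tilde, "<=", L)   # the lower bound holds for any valid q
```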