
16 | Expectation Maximization

A hen is only an egg's way of making another egg. – Samuel Butler

Learning Objectives:
• Explain the relationship between parameters and hidden variables.
• Construct generative stories for clustering and dimensionality reduction.
• Draw a graph explaining how EM works by constructing convex lower bounds.
• Implement EM for clustering with mixtures of Gaussians, and contrast it with k-means.
• Evaluate the differences between EM and gradient descent for hidden variable models.

Suppose you were building a naive Bayes model for a text categorization problem. After you were done, your boss told you that it became prohibitively expensive to obtain labeled data. You now have a probabilistic model that assumes access to labels, but you don't have any labels! Can you still do something?

Amazingly, you can. You can treat the labels as hidden variables,
and attempt to learn them at the same time as you learn the param-
eters of your model. A very broad family of algorithms for solving
problems just like this is the expectation maximization family. In this
chapter, you will derive expectation maximization (EM) algorithms
for clustering and dimensionality reduction, and then see why EM
works.

16.1 Grading an Exam without an Answer Key

Alice's machine learning professor Carlos gives out an exam that consists of 50 true/false questions. Alice's class of 100 students takes
the exam and Carlos goes to grade their solutions. If Carlos made
an answer key, this would be easy: he would just count the fraction
of correctly answered questions each student got, and that would be
their score. But, like many professors, Carlos was really busy and
didn’t have time to make an answer key. Can he still grade the exam?
There are two insights that suggest that he might be able to. Suppose he knew ahead of time that Alice is an awesome student, basically guaranteed to get 100% on the exam. In that case, Carlos
can simply use Alice’s answers as the ground truth. More generally,
if Carlos assumes that on average students are better than random
guessing, he can hope that the majority answer for each question is
likely to be correct. Combining this with the previous insight, when
doing the “voting”, he might want to pay more attention to the an-
swers of the better students.
To be a bit more pedantic, suppose there are N = 100 students and M = 50 questions. Each student n has a score s_n, between 0 and 1, that denotes how well they do on the exam. The score is what we really want to compute. For each question m and each student n, the student has provided an answer a_{n,m}, which is either zero or one. There is also an unknown ground truth answer for each question m, which we'll call t_m, which is also either zero or one.
As a starting point, let's consider a simple heuristic and then complexify it. The heuristic is the "majority vote" heuristic, and it works as follows. First, we estimate t_m as the most common answer for question m: t_m = argmax_t ∑_n 1[a_{n,m} = t]. Once we have a guess for each true answer, we estimate each student's score as the fraction of their answers that match this guessed key: s_n = (1/M) ∑_m 1[a_{n,m} = t_m].
Once we have these scores, however, we might want to trust some
of the students more than others. In particular, answers from stu-
dents with high scores are perhaps more likely to be correct, so we
can recompute the ground truth according to weighted votes. The weight of each student's vote will be precisely that student's score:

t_m = argmax_t ∑_n s_n 1[a_{n,m} = t]   (16.1)

You can recognize this as a chicken and egg problem. If you knew the students' scores, you could estimate an answer key. If you had an
answer key, you could compute student scores. A very common
strategy in computer science for dealing with such chicken and egg
problems is to iterate. Take a guess at the first, compute the second,
recompute the first, and so on.
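In code, this iteration is straightforward. Below is a minimal sketch in Python with numpy; the function name, the fixed number of iterations, and breaking ties toward an answer of 0 are illustrative assumptions, not part of the text:

import numpy as np

def iterate_heuristic(answers, n_iters=10):
    """Alternate between guessing an answer key and scoring students.

    answers: hypothetical (N, M) array of 0/1 student answers.
    """
    N, M = answers.shape
    scores = np.ones(N)                               # initially trust everyone equally
    for _ in range(n_iters):
        # weighted "vote" for each question: total score voting 1 vs. voting 0
        votes_for_1 = answers.T @ scores              # shape (M,)
        votes_for_0 = (1 - answers).T @ scores        # shape (M,)
        key = (votes_for_1 > votes_for_0).astype(int)     # guessed t_m
        # re-score each student as the fraction of answers matching the guessed key
        scores = (answers == key).mean(axis=1)        # s_n in [0, 1]
    return key, scores

With uniform initial scores, the first pass is exactly the majority-vote heuristic; later passes weight each question's vote by the current scores, as in Eq (16.1).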
In order to develop this idea formally, we have to cast the problem in terms of a probabilistic model with a generative story. The generative story we'll use is:

1. For each question m, choose a true answer t_m ∼ Ber(0.5)

2. For each student n, choose a score s_n ∼ Uni(0, 1)

3. For each question m and each student n, choose an answer a_{n,m} ∼ Ber(s_n)^{t_m} Ber(1 − s_n)^{1−t_m}

In the first step, we generate the true answers independently by flipping a fair coin. In the second step, each student's overall score is determined to be a uniform random number between zero and one. The tricky step is step three, where each student's answer is generated for each question. Consider student n answering question m, and suppose that s_n = 0.9. If t_m = 1, then a_{n,m} should be 1 (i.e., correct) 90% of the time; this can be accomplished by drawing the answer from Ber(0.9). On the other hand, if t_m = 0, then a_{n,m} should be 1 (i.e., incorrect) only 10% of the time; this can be accomplished by drawing

the answer from Ber(0.1). The exponent in step 3 selects which of two
Bernoulli distributions to draw from, and then implements this rule.
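To make this concrete, here is a minimal sketch of sampling from this generative story in Python with numpy; N and M follow the running example, and the seed and function name are arbitrary choices for illustration:

import numpy as np

def sample_exam(N=100, M=50, seed=0):
    """Sample (true answers, scores, answers) from the generative story above."""
    rng = np.random.default_rng(seed)
    t = rng.integers(0, 2, size=M)              # step 1: t_m ~ Ber(0.5)
    s = rng.uniform(0.0, 1.0, size=N)           # step 2: s_n ~ Uni(0, 1)
    # step 3: student n answers question m correctly (i.e., matches t_m)
    # with probability s_n, which is what Ber(s_n)^{t_m} Ber(1-s_n)^{1-t_m} encodes
    correct = rng.random((N, M)) < s[:, None]
    a = np.where(correct, t[None, :], 1 - t[None, :])
    return t, s, a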
This can be translated into the following likelihood:

p(a, t, s) = [∏_m 0.5^{t_m} 0.5^{1−t_m}] × [∏_n 1]
           × [∏_n ∏_m s_n^{a_{n,m} t_m} (1 − s_n)^{(1−a_{n,m}) t_m} s_n^{(1−a_{n,m})(1−t_m)} (1 − s_n)^{a_{n,m}(1−t_m)}]   (16.2)

         = 0.5^M ∏_n ∏_m s_n^{a_{n,m} t_m} (1 − s_n)^{(1−a_{n,m}) t_m} s_n^{(1−a_{n,m})(1−t_m)} (1 − s_n)^{a_{n,m}(1−t_m)}   (16.3)

Suppose we knew the true labels t. We can take the log of this likelihood and differentiate it with respect to the score s_n of some student (note: we can drop the 0.5^M term because it is just a constant):

log p(a, t, s) = ∑_n ∑_m [ a_{n,m} t_m log s_n + (1 − a_{n,m})(1 − t_m) log s_n
                           + (1 − a_{n,m}) t_m log(1 − s_n) + a_{n,m} (1 − t_m) log(1 − s_n) ]   (16.4)

∂ log p(a, t, s) / ∂s_n = ∑_m [ (a_{n,m} t_m + (1 − a_{n,m})(1 − t_m)) / s_n − ((1 − a_{n,m}) t_m + a_{n,m} (1 − t_m)) / (1 − s_n) ]   (16.5)

The derivative has the form A/s_n − B/(1 − s_n). If we set this equal to zero, we get A(1 − s_n) = B s_n, and solving for s_n gives an optimum of s_n = A/(A + B). In this case:

A = ∑_m [ a_{n,m} t_m + (1 − a_{n,m})(1 − t_m) ]   (16.6)

B = ∑_m [ (1 − a_{n,m}) t_m + a_{n,m} (1 − t_m) ]   (16.7)

A + B = ∑_m 1 = M   (16.8)

Putting this together, we get:

s_n = (1/M) ∑_m [ a_{n,m} t_m + (1 − a_{n,m})(1 − t_m) ]   (16.9)

In the case of known ts, this matches exactly what we had in the
heuristic.
However, we do not know t, so instead of using the “true” val-
ues of t, we’re going to use their expectations. In particular, we will
compute sn by maximizing its likelihood under the expected values

of t, hence the name expectation maximization. If we are going to compute expectations of t, we have to say: expectations according to which probability distribution? We will use the distribution p(t_m | a, s). Let t̃_m denote E_{t_m ∼ p(t_m | a, s)}[t_m]. Because t_m is a binary variable, its expectation is equal to the probability that it equals one; namely: t̃_m = p(t_m = 1 | a, s).
How can we compute this? We will compute C = p(t_m = 1, a, s) and D = p(t_m = 0, a, s) and then compute t̃_m = C/(C + D). The computation is straightforward:

C = 0.5 ∏_n s_n^{a_{n,m}} (1 − s_n)^{1−a_{n,m}} = 0.5 ∏_{n: a_{n,m}=1} s_n ∏_{n: a_{n,m}=0} (1 − s_n)   (16.10)

D = 0.5 ∏_n s_n^{1−a_{n,m}} (1 − s_n)^{a_{n,m}} = 0.5 ∏_{n: a_{n,m}=1} (1 − s_n) ∏_{n: a_{n,m}=0} s_n   (16.11)

If you inspect the value of C, it is basically "voting" (in a product form, not a sum form): it multiplies together the scores of those students who agree that the answer is 1 and one-minus-the-scores of those students who do not.
The value of D is doing the reverse. This is a form of multiplicative
voting, which has the effect that if a given student has a perfect score
of 1.0, their results will carry the vote completely.
We now have a way to:

1. Compute expected ground truth values t̃m , given scores.

2. Optimize scores sn given expected ground truth values.

The full solution is then to alternate between these two. You can
start by initializing the ground truth values at the majority vote (this
seems like a safe initialization). Given those, compute new scores.
Given those new scores, compute new ground truth values. And
repeat until tired.
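Here is a minimal sketch of this alternation in Python with numpy; it uses Eq (16.9) for the scores and Eqs (16.10)–(16.11) for the expected answers, computed in log space for numerical stability. The function name, iteration cap, and clipping constant are illustrative assumptions:

import numpy as np

def grade_without_key(answers, n_iters=50):
    """EM for exam grading. answers: (N, M) array of 0/1 student answers."""
    N, M = answers.shape
    # initialize the expected true answers t̃_m at the majority vote
    t_tilde = (answers.mean(axis=0) > 0.5).astype(float)
    for _ in range(n_iters):
        # maximization: scores from expected answers, Eq (16.9)
        s = (answers * t_tilde + (1 - answers) * (1 - t_tilde)).mean(axis=1)
        s = np.clip(s, 1e-6, 1 - 1e-6)                     # avoid log(0)
        # expectation: t̃_m = C / (C + D), Eqs (16.10)-(16.11), in log space;
        # the shared 0.5 prior cancels in the ratio
        log_c = (answers * np.log(s[:, None])
                 + (1 - answers) * np.log(1 - s[:, None])).sum(axis=0)
        log_d = ((1 - answers) * np.log(s[:, None])
                 + answers * np.log(1 - s[:, None])).sum(axis=0)
        t_tilde = 1.0 / (1.0 + np.exp(log_d - log_c))      # = C / (C + D)
    return s, t_tilde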
In the next two sections, we will consider a more complex unsu-
pervised learning model for clustering, and then a generic mathe-
matical framework for expectation maximization, which will answer
questions like: will this process converge, and, if so, to what?

16.2 Clustering with a Mixture of Gaussians

In Chapter 9, you learned about probabilistic models for classification based on density estimation. Let's start with a fairly simple classification model that assumes we have labeled data. We will shortly remove this assumption. Our model will state that we have K classes, and data from class k is drawn from a Gaussian with mean µ_k and variance σ_k^2. The choice of classes is parameterized by θ. The generative story for this model is:

1. For each example n = 1 . . . N:

   (a) Choose a label y_n ∼ Disc(θ)

   (b) Choose example x_n ∼ Nor(µ_{y_n}, σ_{y_n}^2)

This generative story can be directly translated into a likelihood as before:

p(D) = ∏_n Mult(y_n | θ) Nor(x_n | µ_{y_n}, σ_{y_n}^2)   (16.12)

     = ∏_n θ_{y_n} [2π σ_{y_n}^2]^{−D/2} exp( −(1/(2σ_{y_n}^2)) ||x_n − µ_{y_n}||^2 )   (16.13)

(In each factor, θ_{y_n} chooses the label and the bracketed Gaussian term chooses the feature values.)

If you had access to labels, this would be all well and good, and
you could obtain closed form solutions for the maximum likelihood
estimates of all parameters by taking a log and then taking gradients
of the log likelihood:

θ_k = fraction of training examples in class k = (1/N) ∑_n [y_n = k]   (16.14)

µ_k = mean of training examples in class k = ∑_n [y_n = k] x_n / ∑_n [y_n = k]   (16.15)

σ_k^2 = variance of training examples in class k = ∑_n [y_n = k] ||x_n − µ_k||^2 / ∑_n [y_n = k]   (16.16)

(? You should be able to derive the maximum likelihood solution results formally by now.)
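As a quick sketch of these closed-form estimates in Python with numpy (assuming X is an (N, D) array, y an integer label array, and every class appears at least once; the function name is illustrative):

import numpy as np

def supervised_mle(X, y, K):
    """Closed-form estimates for the labeled model, Eqs (16.14)-(16.16)."""
    theta = np.array([(y == k).mean() for k in range(K)])        # class priors
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])    # class means
    # per-class variance: average squared distance to the class mean, as in Eq (16.16)
    sigma2 = np.array([((X[y == k] - mu[k]) ** 2).sum(axis=1).mean()
                       for k in range(K)])
    return theta, mu, sigma2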

Suppose that you don't have labels. Analogously to the K-means algorithm, one potential solution is to iterate. You can start off with guesses for the values of the unknown variables, and then iteratively improve them over time. In K-means, the approach was to assign examples to labels (or clusters). This time, instead of making hard assignments ("example 10 belongs to cluster 4"), we'll make soft assignments ("example 10 belongs half to cluster 4, a quarter to cluster 2 and a quarter to cluster 5"). So as not to confuse ourselves too much, we'll introduce a new variable, z_n = ⟨z_{n,1}, . . . , z_{n,K}⟩ (that sums to one), to denote a fractional assignment of examples to clusters. This notion of soft assignment is visualized in Figure 16.1. Here, we've depicted each example as a pie chart, and its coloring denotes the degree to which it has been assigned to each (of three) clusters. The sizes of the pie pieces correspond to the z_n values.

Figure 16.1: Soft cluster assignments depicted as pie charts.

Formally, z_{n,k} denotes the probability that example n is assigned to cluster k:

z_{n,k} = p(y_n = k | x_n)   (16.17)
        = p(y_n = k, x_n) / p(x_n)   (16.18)
        = (1/Z_n) Mult(k | θ) Nor(x_n | µ_k, σ_k^2)   (16.19)
Here, the normalizer Z_n is to ensure that z_n sums to one.
Given a set of parameters (the θs, µs and σ^2 s), the fractional assignments z_{n,k} are easy to compute. Now, akin to K-means, given fractional assignments, you need to recompute estimates of the model parameters. In analogy to the maximum likelihood solution (Eqs (16.14)–(16.16)), you can do this by counting fractional points rather than full points. This gives the following re-estimation updates:

θ_k = fraction of training examples in class k = (1/N) ∑_n z_{n,k}   (16.20)

µ_k = mean of fractional examples in class k = ∑_n z_{n,k} x_n / ∑_n z_{n,k}   (16.21)

σ_k^2 = variance of fractional examples in class k = ∑_n z_{n,k} ||x_n − µ_k||^2 / ∑_n z_{n,k}   (16.22)

All that has happened here is that the hard assignments “[yn = k]”
have been replaced with soft assignments “zn,k ”. As a bit of fore-
shadowing of what is to come, what we’ve done is essentially replace
known labels with expected labels, hence the name “expectation maxi-
mization.”
Putting this together yields Algorithm 16.2. This is the GMM ("Gaussian Mixture Models") algorithm, because the probabilistic model being learned describes a dataset as being drawn from a mixture distribution, where each component of this distribution is a Gaussian.
Just as in the K-means algorithm, this approach is susceptible to local optima and quality of initialization. The heuristics for computing better initializers for K-means are also useful here. (? Aside from the fact that GMMs use soft assignments and K-means uses hard assignments, there are other differences between the two approaches. What are they?)


Algorithm 38 GMM(X, K)
 1: for k = 1 to K do
 2:   µ_k ← some random location   // randomly initialize mean for kth cluster
 3:   σ_k^2 ← 1   // initialize variances
 4:   θ_k ← 1/K   // each cluster equally likely a priori
 5: end for
 6: repeat
 7:   for n = 1 to N do
 8:     for k = 1 to K do
 9:       z_{n,k} ← θ_k [2π σ_k^2]^{−D/2} exp( −(1/(2σ_k^2)) ||x_n − µ_k||^2 )   // compute (unnormalized) fractional assignments
10:     end for
11:     z_n ← z_n / ∑_k z_{n,k}   // normalize fractional assignments
12:   end for
13:   for k = 1 to K do
14:     θ_k ← (1/N) ∑_n z_{n,k}   // re-estimate prior probability of cluster k
15:     µ_k ← ∑_n z_{n,k} x_n / ∑_n z_{n,k}   // re-estimate mean of cluster k
16:     σ_k^2 ← ∑_n z_{n,k} ||x_n − µ_k||^2 / ∑_n z_{n,k}   // re-estimate variance of cluster k
17:   end for
18: until converged
19: return z   // return cluster assignments
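A minimal numpy sketch of this algorithm follows (spherical Gaussians as in the pseudocode; initializing the means at random data points, running a fixed number of iterations instead of a convergence test, and the seed are illustrative assumptions):

import numpy as np

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a mixture of spherical Gaussians, following Algorithm 38."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]       # random initial means
    sigma2 = np.ones(K)                                # initial variances
    theta = np.full(K, 1.0 / K)                        # uniform cluster priors
    for _ in range(n_iters):
        # E-step: (unnormalized) fractional assignments z_{n,k} (line 9)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (N, K)
        z = theta * (2 * np.pi * sigma2) ** (-D / 2) * np.exp(-sq_dist / (2 * sigma2))
        z = z / z.sum(axis=1, keepdims=True)                             # line 11
        # M-step: re-estimate parameters from fractional counts (lines 14-16)
        Nk = z.sum(axis=0)                             # effective cluster sizes
        theta = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # with updated means
        sigma2 = (z * sq_dist).sum(axis=0) / Nk
    return z, theta, mu, sigma2

On well-separated data this behaves like a soft-assignment version of K-means; for high-dimensional or poorly scaled data a log-sum-exp formulation of the E-step would be needed, which is omitted here for brevity.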

16.3 The Expectation Maximization Framework

At this point, you've seen a method for learning in a particular probabilistic model with hidden variables. Two questions remain: (1) can you apply this idea more generally and (2) why is it even a reasonable thing to do? Expectation maximization is a family of algorithms for performing maximum likelihood estimation in probabilistic models with hidden variables.
The general flavor of how we will proceed is as follows. We want
to maximize the log likelihood L, but this will turn out to be diffi-
cult to do directly. Instead, we’ll pick a surrogate function L̃ that’s a
lower bound on L (i.e., L̃ ≤ L everywhere) that’s (hopefully) easier
to maximize. We’ll construct the surrogate in such a way that increas-
ing it will force the true likelihood to also go up. After maximizing
L̃, we’ll construct a new lower bound and optimize that. This process
is shown pictorially in Figure 16.2.

Figure 16.2: A figure showing successive lower bounds.
To proceed, consider an arbitrary probabilistic model p( x, y | θ),
where x denotes the observed data, y denotes the hidden data and
θ denotes the parameters. In the case of Gaussian Mixture Models,
x was the data points, y was the (unknown) labels and θ included
the cluster prior probabilities, the cluster means and the cluster vari-
ances. Now, given access only to a number of examples x1 , . . . , x N ,
you would like to estimate the parameters (θ) of the model.
Probabilistically, this means that some of the variables are unknown and therefore you need to marginalize (or sum) over their possible values. Now, your data consists only of X = ⟨x_1, x_2, . . . , x_N⟩,

not the ( x, y) pairs in D. You can then write the likelihood as:

p(X | θ) = ∑_{y_1} ∑_{y_2} · · · ∑_{y_N} p(X, y_1, y_2, . . . , y_N | θ)   (16.23)   [marginalization]

         = ∑_{y_1} ∑_{y_2} · · · ∑_{y_N} ∏_n p(x_n, y_n | θ)   (16.24)   [examples are independent]

         = ∏_n ∑_{y_n} p(x_n, y_n | θ)   (16.25)   [algebra]

At this point, the natural thing to do is to take logs and then start
taking gradients. However, once you start taking logs, you run into a
problem: the log cannot eat the sum!

L(X | θ) = ∑_n log ∑_{y_n} p(x_n, y_n | θ)   (16.26)

Namely, the log gets “stuck” outside the sum and cannot move in to
decompose the rest of the likelihood term!
The next step is to apply the somewhat strange, but strangely
useful, trick of multiplying by 1. In particular, let q(·) be an arbitrary
probability distribution. We will multiply the p(. . . ) term above by
q(yn )/q(yn ), a valid step so long as q is never zero. This leads to:

L(X | θ) = ∑_n log ∑_{y_n} q(y_n) [ p(x_n, y_n | θ) / q(y_n) ]   (16.27)

We will now construct a lower bound using Jensen's inequality. This is a very useful (and easy to prove!) result that states that f(∑_i λ_i x_i) ≥ ∑_i λ_i f(x_i), so long as (a) λ_i ≥ 0 for all i, (b) ∑_i λ_i = 1, and (c) f is concave. If this looks familiar, that's just because it's a direct result of the definition of concavity. Recall that f is concave if f(ax + by) ≥ a f(x) + b f(y) whenever a + b = 1. (? Prove Jensen's inequality using the definition of concavity and induction.)
You can now apply Jensen's inequality to the log likelihood by identifying the list of q(y_n)s as the λs, log as f (which is, indeed, concave) and each "x" as the p/q term. This yields:

L(X | θ) ≥ ∑_n ∑_{y_n} q(y_n) log [ p(x_n, y_n | θ) / q(y_n) ]   (16.28)

         = ∑_n ∑_{y_n} [ q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n) ]   (16.29)

         ≜ L̃(X | θ)   (16.30)

Note that this inequality holds for any choice of function q, so long as it is non-negative and sums to one. In particular, it needn't even be the same function q for each n. We will need to take advantage of both of these properties.
We have succeeded in our first goal: constructing a lower bound
on L. When you go to optimize this lower bound for θ, the only part
that matters is the first term. The second term, q log q, drops out as a
function of θ. This means that the maximization you need to be able to compute, for fixed q_n s, is:

θ^{(new)} ← argmax_θ ∑_n ∑_{y_n} q_n(y_n) log p(x_n, y_n | θ)   (16.31)

This is exactly the sort of maximization done for Gaussian mixture models when we recomputed new means, variances and cluster prior probabilities.
The second question is: what should qn (·) actually be? Any rea-
sonable q will lead to a lower bound, so in order to choose one q over
another, we need another criterion. Recall that we are hoping to max-
imize L by instead maximizing a lower bound. In order to ensure
that an increase in the lower bound implies an increase in L, we need
to ensure that L(X | θ) = L̃(X | θ). In words: L̃ should be a lower
bound on L that makes contact at the current point, θ.
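Although this excerpt does not carry out the derivation, the standard choice that achieves this contact is the posterior over the hidden variables at the current parameters, q_n(y_n) = p(y_n | x_n, θ). A quick check (a sketch, not from the text): with this choice,

∑_{y_n} p(y_n | x_n, θ) log [ p(x_n, y_n | θ) / p(y_n | x_n, θ) ] = ∑_{y_n} p(y_n | x_n, θ) log p(x_n | θ) = log p(x_n | θ),

so each term of L̃ equals the corresponding term of L and the bound makes contact at the current θ. This is exactly the choice made implicitly for Gaussian mixture models, where z_{n,k} = p(y_n = k | x_n) as in Eq (16.17).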

16.4 Further Reading

TODO further reading
