Expectation Maximization
FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun
Lecture notes adapted from:
• Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press (V1.0)
• Hertzmann and Fleet 2010, Machine Learning and Data Mining Lecture Notes, CSC 411/D11, Computer Science Department, University of Toronto
• C. Bishop 2003, Mixture of Gaussians and EM Part I and II, BCS Summer School, Exeter
• Tengyu Ma and Andrew Ng, 2019, CS229 Lecture Notes, Stanford University
Maximum Likelihood for GMM
• For a mixture of Gaussians (MoG) we have shown that
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
• We learn the model parameters by minimizing the negative log-likelihood:
-\ell(\theta) = -\sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k) \right]
• There is no closed-form solution.
Expectation Maximization
• The solutions are not in closed form because the equations are coupled
• This suggests an iterative scheme for solving them:
• Make initial guesses for the parameters
• Alternate between the following steps:
• E-Step: Evaluate the responsibilities given the current parameters
• M-Step: Update the parameters using maximum likelihood, with the responsibilities held fixed
Jensen’s Inequality
• Let f be a convex function (f''(x) ≥ 0 for all x ∈ R) and let X be a random variable. Then E[f(X)] ≥ f(E[X]).
[Figure: a convex f, with X = a or X = b each with probability 0.5; the midpoint of the chord, E[f(X)], lies above the curve value at the midpoint, f(E[X])]
Image is obtained from Tengyu Ma and Andrew Ng, 2019 CS229 Lecture notes, Stanford University
Jensen’s Inequality
• If f is a strictly convex function (f''(x) > 0), then E[f(X)] = f(E[X]) holds if and only if X = E[X] with probability 1 (i.e., X is constant).
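As a small worked example (my own, not from the slides), take f(x) = x² and X equal to a or b with probability ½ each:

\mathbb{E}[f(X)] - f(\mathbb{E}[X]) = \frac{a^2 + b^2}{2} - \left(\frac{a+b}{2}\right)^2 = \frac{(a-b)^2}{4} \ge 0,

with equality only when a = b, i.e. when X is constant. In the EM derivation below, the inequality is applied to the concave function log, so it is reversed: E[log X] ≤ log E[X].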
• For a model with latent (unobserved) variables z, the marginal likelihood of an observation is
p(x; \theta) = \sum_{z} p(x, z; \theta)
• We wish to fit the parameters θ by maximizing the log-likelihood of the data:
\ell(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
• Let Q_i be any distribution over the values of z^{(i)}. Multiplying and dividing by Q_i(z^{(i)}) turns the inner sum into an expectation:
\ell(\theta) = \sum_{i=1}^{n} \log \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]
• Since log is concave, Jensen's inequality gives a lower bound:
\sum_{i=1}^{n} \log \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right] \ge \sum_{i=1}^{n} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right] = \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
Image is obtained from C. Bishop 2003 Mixture of Gaussians and EM Part I and II, BCS Summer School, Exeter
• We choose Q_i(z^{(i)}) such that the inequality above holds with equality. Jensen's inequality is tight when the quantity inside the expectation is constant, so we require
\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = \text{const} \quad\Longrightarrow\quad Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)
• Since \sum_{z^{(i)}} Q_i(z^{(i)}) = 1, normalizing gives
Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)
i.e. Q_i is the posterior distribution of z^{(i)} given x^{(i)} under the current parameters.
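As a small illustration (a sketch of my own, not from the slides; the array names are assumptions), computing Q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; θ) for a discrete latent variable is just a row-wise normalization of the joint probabilities:

import numpy as np

def posterior_over_z(joint_probs):
    """Given joint_probs[i, k] = p(x_i, z = k; theta),
    return Q[i, k] = p(z = k | x_i; theta) by normalizing each row."""
    evidence = joint_probs.sum(axis=1, keepdims=True)  # p(x_i; theta)
    return joint_probs / evidence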
EM Algorithm
Start with initial parameters \theta^{0}
Repeat until convergence {
E-Step: For each i, set Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{t}); this defines the expected complete-data log-likelihood
Q(\theta \mid \theta^{t}) = \mathbb{E}_{Z}\!\left[\log p(x^{(i)}, z^{(i)}; \theta)\right]
M-Step: Re-estimate the parameters by maximizing the lower bound:
\theta^{t+1} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
}
Image is obtained from Prof. Olga Veksler Pattern Recognition Lecture Notes
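A minimal sketch of this loop (my own pseudocode-style Python, not code from the course; e_step, m_step, and log_likelihood are hypothetical placeholders for the model-specific computations):

import numpy as np

def run_em(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM loop: alternate E- and M-steps until the log-likelihood
    stops improving by more than tol."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        Q = e_step(X, theta)          # E-step: Q_i(z) = p(z | x_i; theta)
        theta = m_step(X, Q)          # M-step: maximize the lower bound
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:        # EM never decreases the log-likelihood
            break
        prev_ll = ll
    return theta

Monitoring the log-likelihood is a natural stopping criterion, since each EM iteration can only increase it.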
EM for Gaussian Mixture Models
For 1 ≤ i ≤ N and 1 ≤ j ≤ K, define hidden variables z^{(i)} and responsibilities
w_j^{(i)} = Q_i(z^{(i)} = j),
the probability of z^{(i)} taking the value j under the distribution Q_i.
The lower bound becomes
\sum_{i=1}^{N} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}
= \sum_{i=1}^{N} \sum_{j=1}^{K} Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}
= \sum_{i=1}^{N} \sum_{j=1}^{K} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu_j)^{T} \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{w_j^{(i)}}
• In the M-step, maximizing this bound with respect to the parameters (with the w_j^{(i)} held fixed) gives, for each component l,
\mu_l = \frac{\sum_{i=1}^{N} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{N} w_l^{(i)}}, \qquad
\Sigma_l = \frac{\sum_{i=1}^{N} w_l^{(i)} (x^{(i)} - \mu_l)(x^{(i)} - \mu_l)^{T}}{\sum_{i=1}^{N} w_l^{(i)}}, \qquad
\phi_l = \frac{1}{N} \sum_{i=1}^{N} w_l^{(i)}
• In the E-step we re-estimate the w_l^{(i)}:
w_l^{(i)} = \frac{\phi_l\, |\Sigma_l|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu_l)^{T} \Sigma_l^{-1} (x^{(i)} - \mu_l)\right)}{\sum_{j=1}^{K} \phi_j\, |\Sigma_j|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu_j)^{T} \Sigma_j^{-1} (x^{(i)} - \mu_j)\right)} = p(z^{(i)} = l \mid x^{(i)}; \phi, \mu, \Sigma)
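A compact NumPy sketch of these two steps (my own illustrative implementation of the update equations above, not code from the course; the function name and the small covariance regularizer are assumptions):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities (E-step)
    and the closed-form parameter updates (M-step)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    phi = np.full(K, 1.0 / K)                   # mixing proportions phi_j
    mu = X[rng.choice(N, K, replace=False)]     # initialize means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iter):
        # E-step: w[i, j] proportional to phi_j * N(x_i | mu_j, Sigma_j)
        w = np.column_stack([
            phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(K)
        ])
        w /= w.sum(axis=1, keepdims=True)

        # M-step: re-estimate phi, mu, Sigma from the responsibilities
        Nj = w.sum(axis=0)                      # effective number of points per component
        phi = Nj / N
        mu = (w.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            # small diagonal term (my addition) keeps Sigma_j invertible
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma, w

In practice one would also monitor the log-likelihood for convergence and run several random restarts, since EM only finds a local maximum (see the next slide).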
EM Algorithm
• It can be proven that the EM algorithm converges to a local maximum of the log-likelihood.
• Convergence of EM is usually fast; very large steps are taken at the beginning.
• Gradient descent, in contrast, is not guaranteed to converge.
• Gradient descent also has the difficulty of choosing an appropriate learning rate.