
BLG 527E Machine Learning

FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun

Expectation Maximization
Lecture Notes from Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) AND
Hertzmann and Fleet 2010 Machine Learning and Data Mining Lecture Notes, CSC 411/D11,
Computer Science Department, University of Toronto AND
C. Bishop 2003 Mixture of Gaussians and EM Part I and II, BCS Summer School, Exeter AND
Tengyu Ma and Andrew Ng, 2019 CS229 Lecture notes, Stanford University
Maximum Likelihood for GMM
• For MoG we have shown that

$$p(\mathbf{x}) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

• We learn the model parameters by minimizing the negative log-likelihood

$$E = -\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

• There is no closed-form solution.
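As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this negative log-likelihood; the toy data and the parameter values `pis`, `mus`, `covs` are made up for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_neg_log_likelihood(X, pis, mus, covs):
    """E = -sum_i log sum_j pi_j N(x^(i) | mu_j, Sigma_j)."""
    nll = 0.0
    for x in X:
        p_x = sum(pi * gaussian_pdf(x, mu, cov)
                  for pi, mu, cov in zip(pis, mus, covs))
        nll -= np.log(p_x)
    return nll

# Toy two-component mixture in 2-D (illustrative values only)
X = np.array([[0.1, 0.2], [2.9, 3.1], [0.0, -0.1]])
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(gmm_neg_log_likelihood(X, pis, mus, covs))
```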
Expectation Maximization
• The solutions are not in closed form, since they are coupled
• This suggests an iterative scheme for solving them (a sketch of the loop is given below):
• Make initial guesses for the parameters
• Alternate between the following steps:
• E-step: Evaluate responsibilities
• M-step: Update parameters using ML
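A high-level sketch of this alternation in Python (illustrative only; `e_step` and `m_step` are placeholders whose GMM formulas are derived later in these notes):

```python
def run_em(X, init_params, e_step, m_step, n_iters=100):
    """Generic EM loop: alternate responsibility evaluation (E) and
    ML parameter updates (M). `e_step` and `m_step` are placeholders
    for the formulas derived later in these notes."""
    params = init_params          # initial guesses for the parameters
    for _ in range(n_iters):
        resp = e_step(X, params)  # E-step: evaluate responsibilities
        params = m_step(X, resp)  # M-step: update parameters using ML
    return params
```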
Jensen’s Inequality
• Let f be a convex function (f''(x) ≥ 0 for all x ∈ ℝ) and X be a random
variable. Then E[f(X)] ≥ f(E[X])

[Figure: a convex f evaluated at X = a and X = b, each with probability 0.5, illustrating E[f(X)] ≥ f(E[X])]

Image is obtained from Tengyu Ma and Andrew Ng, 2019 CS229 Lecture notes, Stanford University
Jensen’s Inequality
• If f is a strictly convex function (f''(x) > 0), then E[f(X)] = f(E[X]) holds
true if and only if X = E[X] with probability 1 (i.e. if X is constant).

• Jensen's inequality also holds for concave functions (f''(x) ≤ 0), but with
the direction of all the inequalities reversed: E[f(X)] ≤ f(E[X])
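A quick numeric check of the inequality (the values a = 1 and b = 3 are arbitrary choices for illustration), using the convex function f(x) = x² and the concave function log:

```python
import math

# Jensen's inequality check: f(x) = x^2 is convex (f''(x) = 2 > 0)
a, b = 1.0, 3.0                       # X = a or X = b, each with probability 0.5
E_X = 0.5 * a + 0.5 * b               # E[X] = 2.0
E_fX = 0.5 * a ** 2 + 0.5 * b ** 2    # E[f(X)] = 5.0
assert E_fX >= E_X ** 2               # E[f(X)] >= f(E[X])  (5.0 >= 4.0)

# For the concave function log, the direction reverses: E[log X] <= log(E[X])
E_logX = 0.5 * math.log(a) + 0.5 * math.log(b)
assert E_logX <= math.log(E_X)
```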
Problem Definition
• Suppose we have an estimation problem in which we have n independent
examples {x^(1), x^(2), …, x^(n)}.
• We have a latent variable model p(x, z; θ).
• The density for x can be obtained by marginalizing over the latent variable z:

$$p(x; \theta) = \sum_{z} p(x, z; \theta)$$

• We wish to fit the parameters θ by maximizing the log-likelihood of the data:

$$\ell(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$

• Here the z^(i)'s are latent random variables.
• If the z^(i)'s were observed, then the maximum likelihood estimate would be
easy (see the sketch below).
• Maximizing ℓ(θ) explicitly might be difficult.
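To see why observed z^(i)'s make things easy, here is an illustrative NumPy sketch (not from the lecture) of the complete-data maximum likelihood estimates for a K-component Gaussian model: each parameter is just a per-component sample statistic.

```python
import numpy as np

def complete_data_mle(X, z, K):
    """ML estimates of (phi, mu, Sigma) when the latent labels z are observed."""
    phi = np.array([np.mean(z == j) for j in range(K)])        # component priors
    mu = np.array([X[z == j].mean(axis=0) for j in range(K)])  # per-component means
    Sigma = []
    for j in range(K):
        diff = X[z == j] - mu[j]
        Sigma.append(diff.T @ diff / np.sum(z == j))           # per-component covariances
    return phi, mu, np.array(Sigma)

# Toy labelled data (values are illustrative only)
X = np.array([[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.8, 2.9]])
z = np.array([0, 0, 1, 1])
print(complete_data_mle(X, z, K=2))
```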
• The EM algorithm repeatedly constructs a lower bound on ℓ (E-step) and
then optimizes that lower bound (M-step).
• For each i, let Q_i be some distribution over the z's, i.e.
$\sum_{z} Q_i(z) = 1$ and $Q_i(z) \ge 0$. Then

$$\ell(\theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
= \sum_{i=1}^{n} \log \mathbb{E}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$

where the expectation is with respect to z^(i) drawn according to the
distribution given by Q_i.

• Jensen's inequality for the concave function log gives log(E[X]) ≥ E[log(X)], so

$$\sum_{i=1}^{n} \log \mathbb{E}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \;\ge\; \sum_{i=1}^{n} \mathbb{E}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$

• This is a lower bound on ℓ(θ):

$$\ell(\theta) \;\ge\; \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \, \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
• We want the lower bound to be equal to ℓ at the previous value of θ

Image is obtained from C. Bishop 2003 Mixture of Gaussians and EM Part I and II, BCS Summer School, Exeter
• We choose Q_i(z^(i)) such that the inequality above holds with equality.
This requires

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = \text{const} \;\;\Rightarrow\;\; Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$$

• Since $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, then

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
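Continuing the small numeric example above (illustrative code), choosing Q_i to be the posterior p(z^(i) | x^(i); θ) indeed makes the bound coincide with ℓ(θ):

```python
import numpy as np

phi = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
X = np.array([0.5, 3.8, 4.2])

def joint(x, z):
    return phi[z] * np.exp(-0.5 * (x - mu[z]) ** 2) / np.sqrt(2 * np.pi)

ll, bound = 0.0, 0.0
for x in X:
    p_x = joint(x, 0) + joint(x, 1)                   # marginal p(x; theta)
    ll += np.log(p_x)
    Q = np.array([joint(x, 0), joint(x, 1)]) / p_x    # Q_i(z) = p(z | x; theta)
    bound += sum(Q[z] * np.log(joint(x, z) / Q[z]) for z in (0, 1))

assert np.isclose(ll, bound)   # with the posterior choice, the bound is tight
```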
EM Algorithm
Start with initial parameters θ^(0)
Repeat until convergence {

E-step: Compute the expectation of the log-likelihood with respect to the
current estimate θ^(t) and X. Let's call it Q(θ | θ^(t)):

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z}\!\left[ \log p(x^{(i)}, z^{(i)}; \theta) \right]$$

M-step:

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \, \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

}
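A short standard argument (as in the CS229 notes cited above) for why each EM iteration can never decrease the log-likelihood, using the fact that the bound holds for every θ and is tight at θ^(t):

$$\ell(\theta^{(t+1)}) \;\ge\; \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \;\ge\; \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \;=\; \ell(\theta^{(t)})$$

The first inequality is the Jensen lower bound (valid for any θ), the second holds because the M-step maximizes the bound over θ, and the final equality holds because the bound is tight at θ^(t), where $Q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$.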
[Figures omitted; images obtained from Prof. Olga Veksler Pattern Recognition Lecture Notes]
EM for Gaussian Mixture Models
For 1 ≤ i ≤ N, 1 ≤ j ≤ K define hidden variables z^(i) and

$$w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$

where $Q_i(z^{(i)} = j)$ is the probability of z^(i) taking the value j under the distribution Q_i.
The lower bound becomes

$$\sum_{i=1}^{N} \sum_{z^{(i)}} Q_i(z^{(i)}) \, \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}
= \sum_{i=1}^{N} \sum_{j=1}^{K} Q_i(z^{(i)} = j) \, \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}$$

$$= \sum_{i=1}^{N} \sum_{j=1}^{K} w_j^{(i)} \, \log \frac{\frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x^{(i)} - \mu_j)^{T} \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \phi_j}{w_j^{(i)}}$$
Maximizing with respect to μ_l (taking the derivative with respect to μ_l and
setting it to zero), and similarly for φ_l and Σ_l, gives the M-step updates

$$\mu_l = \frac{\sum_{i=1}^{N} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{N} w_l^{(i)}}, \qquad \phi_l = \frac{1}{N} \sum_{i=1}^{N} w_l^{(i)}$$

$$\Sigma_l = \frac{\sum_{i=1}^{N} w_l^{(i)} (x^{(i)} - \mu_l)(x^{(i)} - \mu_l)^{T}}{\sum_{i=1}^{N} w_l^{(i)}}$$
• In the E-step we re-estimate w_l^(i):

$$w_l^{(i)} = \frac{\phi_l \, |\Sigma_l|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x^{(i)} - \mu_l)^{T} \Sigma_l^{-1} (x^{(i)} - \mu_l) \right)}{\sum_{j=1}^{K} \phi_j \, |\Sigma_j|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x^{(i)} - \mu_j)^{T} \Sigma_j^{-1} (x^{(i)} - \mu_j) \right)}$$

(the $(2\pi)^{d/2}$ factors cancel between numerator and denominator).
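And a matching NumPy sketch of this E-step, under the same illustrative conventions (X is N×d; phi, mu, Sigma are the current parameter estimates):

```python
import numpy as np

def e_step(X, phi, mu, Sigma):
    """E-step: responsibilities w_l^(i) = p(z^(i) = l | x^(i); phi, mu, Sigma)."""
    N, K = X.shape[0], len(phi)
    W = np.zeros((N, K))
    for l in range(K):
        inv = np.linalg.inv(Sigma[l])
        diff = X - mu[l]
        # unnormalized: phi_l |Sigma_l|^{-1/2} exp(-1/2 (x - mu_l)^T Sigma_l^{-1} (x - mu_l))
        mahal = np.einsum('nd,de,ne->n', diff, inv, diff)
        W[:, l] = phi[l] * np.linalg.det(Sigma[l]) ** -0.5 * np.exp(-0.5 * mahal)
    W /= W.sum(axis=1, keepdims=True)   # normalize over components j = 1..K
    return W
```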


EM Algorithm
• It can be proven that the EM algorithm converges to a local maximum of the
log-likelihood (a simple convergence check is sketched below).
• Convergence of EM is usually faster than gradient descent; in the beginning
very large steps are made.
• Gradient descent is not guaranteed to converge.
• Gradient descent also has the difficulty of choosing an appropriate learning rate.
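In practice, convergence of EM is usually detected by monitoring the log-likelihood, which EM never decreases. A minimal sketch, assuming `e_step`, `m_step`, and `log_likelihood` helpers like the earlier sketches (all names here are illustrative):

```python
import numpy as np

def em_until_converged(X, phi, mu, Sigma, e_step, m_step, log_likelihood,
                       tol=1e-6, max_iters=500):
    """Stop when the log-likelihood improves by less than `tol` per iteration."""
    prev_ll = -np.inf
    for _ in range(max_iters):
        W = e_step(X, phi, mu, Sigma)        # E-step: responsibilities
        phi, mu, Sigma = m_step(X, W)        # M-step: ML parameter updates
        ll = log_likelihood(X, phi, mu, Sigma)
        if ll - prev_ll < tol:               # monotone increase => simple test
            break
        prev_ll = ll
    return phi, mu, Sigma
```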
