CS771A, Lecture 16
Latent Variable Models and Expectation Maximization
Piyush Rai
Intro to Machine Learning (CS771A)
Recap: Latent Variable Models

Assume each observation x_n to be associated with a "local" latent variable z_n

Given data X = {x_1, ..., x_N}, the goal is to estimate the parameters Θ, the latent variables Z, or both
(note: we can usually estimate Θ given Z, and vice versa)
Why Estimation is Difficult in LVMs?

Suppose we want to estimate the parameters Θ. If we knew both x_n and z_n, then we could do

    \Theta_{MLE} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n, z_n | \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \left[ \log p(z_n | \phi) + \log p(x_n | z_n, \theta) \right]

Simple to solve (usually in closed form) if p(z_n | φ) and p(x_n | z_n, θ) are "simple" (e.g., exp-family distributions)

However, in LVMs where z_n is "hidden", the MLE problem becomes

    \Theta_{MLE} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n | \Theta) = \arg\max_{\Theta} \log p(X | \Theta)

The form of p(x_n | Θ) may not be simple, since we need to sum over the unknown z_n's possible values

    p(x_n | \Theta) = \sum_{z_n} p(x_n, z_n | \Theta)    ... or, if z_n is continuous:    p(x_n | \Theta) = \int p(x_n, z_n | \Theta) \, dz_n

The summation/integral may be intractable, and may lead to a complex expression for p(x_n | Θ) (in fact, almost never an exponential family distribution). MLE for Θ won't have a closed-form solution!
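As a concrete illustration (my own, not from the slides): the snippet below evaluates log p(X | Θ) for a toy 1-D mixture of two Gaussians by explicitly summing over the possible values of z_n. The resulting log-sum-of-exponentials is exactly what breaks the simple closed-form MLE.

```python
# A minimal sketch (illustration only): marginal log-likelihood of a
# toy 1-D two-component Gaussian mixture, obtained by summing out z_n.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def marginal_loglik(X, pi, mu, sigma):
    """log p(X | Theta) = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    # log [pi_k * N(x_n | mu_k, sigma_k^2)] for every (n, k) pair
    log_joint = np.log(pi)[None, :] + norm.logpdf(X[:, None], mu[None, :], sigma[None, :])
    # summing over z_n (the mixture components) gives a log-sum-exp,
    # which is not an exponential-family expression in Theta
    return logsumexp(log_joint, axis=1).sum()

X = np.array([-1.2, 0.3, 2.5, 3.1])
print(marginal_loglik(X, pi=np.array([0.5, 0.5]),
                      mu=np.array([0.0, 3.0]), sigma=np.array([1.0, 1.0])))
```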
An Important Identity

Assuming Z is discrete, the identity below holds for any choice of the distribution q(Z):

    \log p(X | \Theta) = L(q, \Theta) + KL(q \| p_z)

Maximizing L(q, Θ) will also improve log p(X|Θ). Also, as we'll see, it's easier to maximize L(q, Θ)
Maximizing L(q, Θ)

Note that L(q, Θ) depends on two things, q(Z) and Θ. Let's do ALT-OPT for these.

First recall the identity we had: log p(X|Θ) = L(q, Θ) + KL(q || p_z), with

    L(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z | \Theta)}{q(Z)} \right\}
    \qquad \text{and} \qquad
    KL(q \| p_z) = - \sum_{Z} q(Z) \log \left\{ \frac{p(Z | X, \Theta)}{q(Z)} \right\}

Maximize L w.r.t. q with Θ fixed at Θ^old: since log p(X|Θ^old) is a constant in this case, maximizing L(q, Θ^old) is the same as minimizing KL(q || p_z), which is achieved by setting q(Z) = p(Z|X, Θ^old)

Maximize L w.r.t. Θ with q fixed at p(Z|X, Θ^old): the −Σ_Z q(Z) log q(Z) part of L does not depend on Θ, so this amounts to maximizing E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)]

.. therefore, Θ^new = arg max_Θ Q(Θ, Θ^old), where Q(Θ, Θ^old) = E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)]

Q(Θ, Θ^old) = E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)] is known as the expected complete-data log-likelihood (CLL)
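To make the identity concrete, here is a small numerical check (my own illustration, not from the slides) for a single observation with one discrete latent variable: for any q, the lower bound L(q, Θ) plus KL(q || p(z|x, Θ)) equals log p(x | Θ).

```python
# A minimal numerical check (illustration only) of
#   log p(x|Theta) = L(q, Theta) + KL(q || p(z|x, Theta))
# for one observation with a discrete latent z in {0, ..., K-1}.
import numpy as np

rng = np.random.default_rng(0)
K = 3
p_joint = rng.random(K)          # stands in for p(x, z=k | Theta), k = 0..K-1
log_px = np.log(p_joint.sum())   # log p(x | Theta), marginalizing out z
post = p_joint / p_joint.sum()   # p(z | x, Theta)

q = rng.random(K); q /= q.sum()  # an arbitrary distribution q(z)

elbo = np.sum(q * np.log(p_joint / q))   # L(q, Theta) = sum_z q(z) log [p(x,z|Theta)/q(z)]
kl = np.sum(q * np.log(q / post))        # KL(q || p(z|x, Theta))

print(log_px, elbo + kl)   # the two numbers agree (up to floating point)
```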
What's Going On: A Visual Illustration

[Figure: the lower bound L(q, Θ) being pushed up against log p(X|Θ) over successive iterations; the panels alternate between "after updating q" and "after maximizing L w.r.t. Θ"]
What's Going On: Another Illustration

The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing)

To see this, consider both steps: (1) optimize q given Θ = Θ^old; (2) optimize Θ given this q

[Figure: illustration of Step 1 and Step 2]

Step 1 keeps Θ fixed, so p(X|Θ) obviously can't decrease (it stays unchanged in this step)

Step 2 maximizes the lower bound L(q, Θ) w.r.t. Θ. Thus p(X|Θ) can't decrease!
The Expectation Maximization (EM) Algorithm

The ALT-OPT of L(q, Θ) that we saw leads to the EM algorithm (Dempster, Laird, and Rubin, 1977)

The EM Algorithm
1. Initialize Θ as Θ^(0), set t = 1
2. Step 1: Compute the posterior of the latent variables given the current parameters Θ^(t−1)

       p(z_n^{(t)} | x_n, \Theta^{(t-1)}) = \frac{p(z_n^{(t)} | \Theta^{(t-1)}) \; p(x_n | z_n^{(t)}, \Theta^{(t-1)})}{p(x_n | \Theta^{(t-1)})} \;\propto\; \text{prior} \times \text{likelihood}

3. Step 2: Now maximize the expected complete-data log-likelihood w.r.t. Θ

       \Theta^{(t)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(t-1)}) = \arg\max_{\Theta} \sum_{n=1}^{N} \mathbb{E}_{p(z_n^{(t)} | x_n, \Theta^{(t-1)})} \left[ \log p(x_n, z_n^{(t)} | \Theta) \right]

4. If not yet converged, set t = t + 1 and go to step 2.

Note: If we instead take the MAP estimate ẑ_n of z_n (not the full posterior) in Step 1 and maximize the CLL in Step 2 using that estimate, i.e., do arg max_Θ Σ_{n=1}^N log p(x_n, ẑ_n^{(t)} | Θ), this is identical to ALT-OPT
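A generic sketch of this loop in code (my own illustration; `e_step` and `m_step` are hypothetical placeholders for the model-specific computations in Steps 1 and 2):

```python
# A minimal, model-agnostic EM skeleton (illustration only).
# e_step and m_step are hypothetical placeholders: e_step returns the
# posterior p(z_n | x_n, Theta) (or whatever statistics of it the model
# needs) plus the current log-likelihood, and m_step maximizes the
# expected CLL given those posteriors.
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    theta = theta0
    prev_ll = -np.inf
    for t in range(1, max_iters + 1):
        posteriors, loglik = e_step(X, theta)   # Step 1: posterior of latent variables
        theta = m_step(X, posteriors)           # Step 2: maximize expected CLL w.r.t. Theta
        if loglik - prev_ll < tol:              # log p(X|Theta) never decreases, so this is safe
            break
        prev_ll = loglik
    return theta
```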
Writing Down the Expected CLL

Deriving the EM algorithm for any model requires finding the expression of the expected CLL

    Q(\Theta, \Theta^{old}) = \sum_{n=1}^{N} \mathbb{E}_{p(z_n | x_n, \Theta^{old})} \left[ \log p(x_n, z_n | \Theta) \right]
                            = \sum_{n=1}^{N} \mathbb{E}_{p(z_n | x_n, \Theta^{old})} \left[ \log p(x_n | z_n, \Theta) + \log p(z_n | \Theta) \right]

If p(x_n | z_n, Θ) and p(z_n | Θ) are exp-family distributions, the expected CLL will have a simple form

Finding the expression for the expected CLL in such cases is fairly straightforward:
  First write down the expressions for p(x_n | z_n, Θ) and p(z_n | Θ) and simplify as much as possible
  In the resulting expressions, replace all terms containing z_n by their respective expectations, e.g.,
    z_n replaced by E_{p(z_n | x_n, Θ^old)}[z_n], i.e., the posterior mean of z_n
    z_n z_n^⊤ replaced by E_{p(z_n | x_n, Θ^old)}[z_n z_n^⊤]
    .. and so on

The expected CLL may not always be computable and may need to be approximated
EM for Gaussian Mixture Model

Let's first look at the CLL. It is similar to generative classification with Gaussian class-conditionals:

    \log p(X, Z | \Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n | \mu_k, \Sigma_k) \right]    (we've seen how we get this)

The expected CLL simply replaces each z_nk by its expectation E[z_nk]

.. where the expectation is w.r.t. the current posterior of z_n, i.e., p(z_n | x_n, Θ^old)

In this case, we only need E[z_nk], which can be computed as

    \mathbb{E}[z_{nk}] = \gamma_{nk} = 0 \times p(z_{nk} = 0 | x_n, \Theta^{old}) + 1 \times p(z_{nk} = 1 | x_n, \Theta^{old}) = p(z_{nk} = 1 | x_n, \Theta^{old})
                       \propto p(z_{nk} = 1) \, p(x_n | z_{nk} = 1)    (from Bayes rule)

Thus E[z_nk] ∝ π_k N(x_n | μ_k, Σ_k)    (the posterior probability that x_n was generated by the k-th Gaussian)

Note: We can finally normalize E[z_nk] as

    \mathbb{E}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{\ell=1}^{K} \pi_\ell \, \mathcal{N}(x_n | \mu_\ell, \Sigma_\ell)}    since    \sum_{k=1}^{K} \mathbb{E}[z_{nk}] = 1
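A small numpy sketch (my own, not from the slides) of this E-step computation of the responsibilities γ_nk, done in log space for numerical stability:

```python
# A minimal E-step sketch for a GMM (illustration only): compute the
# responsibilities gamma[n, k] = E[z_nk] = p(z_nk = 1 | x_n, Theta_old).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step_gmm(X, pi, mu, Sigma):
    N, K = X.shape[0], pi.shape[0]
    log_r = np.empty((N, K))
    for k in range(K):
        # log [ pi_k * N(x_n | mu_k, Sigma_k) ]  (unnormalized log-responsibility)
        log_r[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
    # normalize over k so that sum_k gamma[n, k] = 1
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
```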
EM for Gaussian Mixture Model

[Algorithm slide; only fragments recovered] With N_k = Σ_{n=1}^N γ_{nk} denoting the effective number of points assigned to component k, the M step updates the mixing weights as

    \pi_k^{(t)} = \frac{N_k}{N}

Set t = t + 1 and go to step 2 if not yet converged
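The remaining updates follow from maximizing the expected CLL. Below is a hedged numpy sketch of one full M step using the standard GMM updates (my own reconstruction, not copied from the slide):

```python
# A minimal M-step sketch for a GMM (illustration only), using the standard
# updates obtained by maximizing the expected CLL with responsibilities gamma.
import numpy as np

def m_step_gmm(X, gamma):
    N, D = X.shape
    Nk = gamma.sum(axis=0)                        # N_k = sum_n gamma_nk
    pi = Nk / N                                   # pi_k = N_k / N
    mu = (gamma.T @ X) / Nk[:, None]              # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigma = np.empty((gamma.shape[1], D, D))
    for k in range(gamma.shape[1]):
        Xc = X - mu[k]                            # data centered at component k's mean
        # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi, mu, Sigma
```

Combined with the E-step sketch above, alternating `e_step_gmm` and `m_step_gmm` gives one full EM iteration for the mixture model.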
Another Example: (Probabilistic) Dimensionality Reduction

Let's consider a latent factor model for dimensionality reduction (we will revisit this later)

Plugging in the expressions for p(x_n | z_n, W, σ²) and p(z_n) and simplifying (exercise),

    \text{CLL} = - \sum_{n=1}^{N} \left[ \frac{D}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \|x_n\|^2 - \frac{1}{\sigma^2} z_n^\top W^\top x_n + \frac{1}{2\sigma^2} \mathrm{tr}(z_n z_n^\top W^\top W) + \frac{1}{2} \mathrm{tr}(z_n z_n^\top) \right]

The expected CLL requires E[z_n] and E[z_n z_n^⊤]; these expectations can be obtained from the posterior p(z_n | x_n) (easy to compute due to conjugacy)

The M step then maximizes the expected CLL w.r.t. the parameters (W and σ² in this case)
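For reference, here is a short numpy sketch (my own, assuming the standard probabilistic PCA posterior p(z_n | x_n) = N(M⁻¹Wᵀx_n, σ²M⁻¹) with M = WᵀW + σ²I, which follows from the Gaussian conjugacy mentioned above) of the E-step quantities E[z_n] and E[z_n z_nᵀ]:

```python
# E-step quantities for the probabilistic PCA model (illustration only),
# assuming p(x_n | z_n) = N(W z_n, sigma^2 I) and p(z_n) = N(0, I), so that
# p(z_n | x_n) = N(M^{-1} W^T x_n, sigma^2 M^{-1}) with M = W^T W + sigma^2 I.
import numpy as np

def ppca_e_step(X, W, sigma2):
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)              # K x K
    Minv = np.linalg.inv(M)
    Ez = X @ W @ Minv.T                           # E[z_n] = M^{-1} W^T x_n, stored row-wise (N x K)
    # E[z_n z_n^T] = sigma^2 M^{-1} + E[z_n] E[z_n]^T, one K x K matrix per n
    Ezz = sigma2 * Minv[None, :, :] + Ez[:, :, None] * Ez[:, None, :]
    return Ez, Ezz
```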
The EM Algorithm: Some Comments

The E and M steps may not always be possible to perform exactly. Some reasons:

The posterior of the latent variables p(Z|X, Θ) may not be easy to find
  Would need to approximate p(Z|X, Θ) in such a case

Even if p(Z|X, Θ) is easy, the expected CLL, i.e., E[log p(X, Z|Θ)], may still not be tractable

    \mathbb{E}[\log p(X, Z | \Theta)] = \int \log p(X, Z | \Theta) \; p(Z | X, \Theta) \, dZ

  .. which can be approximated, e.g., using a Monte-Carlo expectation (called Monte-Carlo EM; a small sketch is given below)

Maximization of the expected CLL may not be possible in closed form
  EM works even if the M step is only solved approximately (Generalized EM)

If the M step has multiple parameters whose updates depend on each other, they are updated in an alternating fashion; this is called the Expectation Conditional Maximization (ECM) algorithm

Other advanced probabilistic inference algorithms are based on ideas similar to EM, e.g., Variational Bayesian (VB) inference
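A minimal sketch of the Monte-Carlo approximation mentioned above (my own illustration): replace the intractable expectation with an average of the complete-data log-likelihood over samples drawn from the posterior. `sample_posterior` and `complete_data_loglik` are hypothetical model-specific functions.

```python
# Monte-Carlo approximation of the expected CLL (illustration only):
#   E_{p(Z|X, Theta_old)}[log p(X, Z | Theta)] ~= (1/S) sum_s log p(X, Z^(s) | Theta),
# where Z^(1), ..., Z^(S) are samples from p(Z | X, Theta_old).
import numpy as np

def monte_carlo_expected_cll(X, theta, theta_old, sample_posterior,
                             complete_data_loglik, S=100, seed=0):
    rng = np.random.default_rng(seed)
    samples = [sample_posterior(X, theta_old, rng) for _ in range(S)]  # Z^(s) ~ p(Z|X, Theta_old)
    return np.mean([complete_data_loglik(X, Z, theta) for Z in samples])
```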