771 A18 Lec16

The document discusses latent variable models and the challenges of parameter estimation. It explains that if the latent and observed variables were known, maximum likelihood estimation would be straightforward. However, in latent variable models the latent variables are unknown, making direct maximum likelihood intractable. The Expectation Maximization algorithm provides an approach for approximating maximum likelihood in these models.
Latent Variable Models and Expectation Maximization

Piyush Rai

Introduction to Machine Learning (CS771A)

September 27, 2018

Recap: Latent Variable Models

Assume each observation x_n is associated with a "local" latent variable z_n.

The parameters of p(x|z, θ) and p(z|φ) are collectively referred to as "global" parameters. For brevity, we usually refer to the global parameters θ and φ jointly as Θ = (θ, φ).

A Gaussian mixture model is an example of such a model:
    z_n ∈ {1, . . . , K} with p(z_n | φ) = multinoulli(π_1, . . . , π_K)
    x_n ∈ R^D with p(x_n | z_n, θ) = N(x_n | µ_{z_n}, Σ_{z_n})
    Here Θ = (φ, θ) = {π_k, µ_k, Σ_k}_{k=1}^K

Given data X = {x_1, . . . , x_N}, the goal is to estimate the parameters Θ, the latent variables Z, or both
(note: we can usually estimate Θ given Z, and vice versa).
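To make the generative story concrete, here is a minimal sketch (not from the slides; the toy parameter values are assumptions chosen for illustration) of how data arises under this GMM-style latent variable model, given fixed global parameters Θ = {π_k, µ_k, Σ_k}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy global parameters Theta = (pi, mu, Sigma), K = 2 components in D = 2 dimensions
pi = np.array([0.4, 0.6])                        # mixing proportions (multinoulli parameters)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means mu_k
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances Sigma_k

N = 500
z = rng.choice(len(pi), size=N, p=pi)            # local latent variable: z_n ~ multinoulli(pi)
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # x_n ~ N(mu_{z_n}, Sigma_{z_n})
print(X.shape)                                   # (500, 2): only X is observed; z stays hidden
```

In a clustering application only X is available, which is exactly why estimating Θ has to cope with the unknown z_n's.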
Why is Estimation Difficult in LVMs?

Suppose we want to estimate the parameters Θ. If we knew both x_n and z_n, then we could do

    Θ_MLE = arg max_Θ Σ_{n=1}^N log p(x_n, z_n | Θ) = arg max_Θ Σ_{n=1}^N [log p(z_n | φ) + log p(x_n | z_n, θ)]

This is simple to solve (usually in closed form) if p(z_n | φ) and p(x_n | z_n, θ) are "simple" (e.g., exponential-family distributions).

However, in LVMs z_n is "hidden", so the MLE problem becomes

    Θ_MLE = arg max_Θ Σ_{n=1}^N log p(x_n | Θ) = arg max_Θ log p(X | Θ)

The form of p(x_n | Θ) may not be simple, since we need to sum over the unknown z_n's possible values:

    p(x_n | Θ) = Σ_{z_n} p(x_n, z_n | Θ)    ... or, if z_n is continuous:    p(x_n | Θ) = ∫ p(x_n, z_n | Θ) dz_n

The summation/integral may be intractable and may lead to complex expressions for p(x_n | Θ); in fact, the result is almost never an exponential-family distribution, so MLE for Θ won't have a closed-form solution!
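As a concrete instance, for the Gaussian mixture model from the recap the marginal log-likelihood works out to

$$
\log p(\mathbf{X} \mid \Theta) \;=\; \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \Theta)
\;=\; \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
$$

Because the sum over the K possible values of z_n sits inside the logarithm, setting the gradients w.r.t. π_k, µ_k, Σ_k to zero gives coupled equations (each involves the parameter-dependent posterior over z_n) rather than decoupled closed-form updates.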
An Important Identity

Define p_z = p(Z|X, Θ) and let q(Z) be some distribution over Z.

Assuming discrete Z, the identity below holds for any choice of the distribution q(Z):

    log p(X|Θ) = L(q, Θ) + KL(q||p_z)

where

    L(q, Θ) = Σ_Z q(Z) log { p(X, Z|Θ) / q(Z) }
    KL(q||p_z) = − Σ_Z q(Z) log { p(Z|X, Θ) / q(Z) }

(Exercise: verify the above identity; one derivation is sketched below.)

Since KL(q||p_z) ≥ 0, L(q, Θ) is a lower bound on log p(X|Θ):

    log p(X|Θ) ≥ L(q, Θ)

Maximizing L(q, Θ) will also improve log p(X|Θ). Also, as we'll see, it is easier to maximize L(q, Θ).
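A short verification of the identity (the exercise above), using only the product rule p(X, Z|Θ) = p(Z|X, Θ) p(X|Θ) and the fact that q(Z) sums to 1:

$$
L(q, \Theta) + \mathrm{KL}(q\,\|\,p_z)
= \sum_{Z} q(Z)\log\frac{p(X, Z\mid\Theta)}{q(Z)} - \sum_{Z} q(Z)\log\frac{p(Z\mid X,\Theta)}{q(Z)}
= \sum_{Z} q(Z)\log\frac{p(X, Z\mid\Theta)}{p(Z\mid X,\Theta)}
= \sum_{Z} q(Z)\log p(X\mid\Theta)
= \log p(X\mid\Theta).
$$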
Maximizing L(q, Θ)

Note that L(q, Θ) depends on two things, q(Z) and Θ. Let's do ALT-OPT (alternating optimization) over these.

First recall the identity we had: log p(X|Θ) = L(q, Θ) + KL(q||p_z), with

    L(q, Θ) = Σ_Z q(Z) log { p(X, Z|Θ) / q(Z) }    and    KL(q||p_z) = − Σ_Z q(Z) log { p(Z|X, Θ) / q(Z) }

Maximize L w.r.t. q with Θ fixed at Θ^old: since log p(X|Θ) is a constant in this case,

    q̂ = arg max_q L(q, Θ^old) = arg min_q KL(q||p_z) = p_z = p(Z|X, Θ^old)

Maximize L w.r.t. Θ with q fixed at q̂ = p(Z|X, Θ^old):

    Θ^new = arg max_Θ L(q̂, Θ) = arg max_Θ Σ_Z p(Z|X, Θ^old) log { p(X, Z|Θ) / p(Z|X, Θ^old) } = arg max_Θ Σ_Z p(Z|X, Θ^old) log p(X, Z|Θ)

.. therefore, Θ^new = arg max_Θ Q(Θ, Θ^old), where Q(Θ, Θ^old) = E_{p(Z|X, Θ^old)}[log p(X, Z|Θ)]

Q(Θ, Θ^old) = E_{p(Z|X, Θ^old)}[log p(X, Z|Θ)] is known as the expected complete-data log-likelihood (CLL).
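The fact that the bound becomes tight at q̂ = p(Z|X, Θ^old) is easy to check numerically. Below is a small sketch (the toy numbers and the helper name `bound_and_kl` are assumptions, not from the slides; requires NumPy and SciPy) for a single observation under a two-component 1-D Gaussian mixture: L(q, Θ) + KL(q||p_z) always recovers log p(x|Θ), and the KL term vanishes exactly when q is the posterior.

```python
import numpy as np
from scipy.stats import norm

# Toy model: one scalar observation x under a K=2 Gaussian mixture, so Z is a single
# discrete variable and all sums over Z are exact.
x = 1.3
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])

joint = pi * norm.pdf(x, mu, sigma)        # p(x, z=k | Theta) for k = 1..K
evidence = joint.sum()                     # p(x | Theta), obtained by summing z out
posterior = joint / evidence               # p(z | x, Theta)

def bound_and_kl(q):
    L = np.sum(q * np.log(joint / q))           # L(q, Theta)
    kl = -np.sum(q * np.log(posterior / q))     # KL(q || p_z)
    return L, kl

for q in (np.array([0.5, 0.5]), posterior):     # an arbitrary q, then q = posterior
    L, kl = bound_and_kl(q)
    # L + kl equals log p(x|Theta) for any q; kl is 0 only for q = posterior
    print(f"L={L:.4f}  KL={kl:.4f}  L+KL={L + kl:.4f}  log evidence={np.log(evidence):.4f}")
```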
What's Going On: A Visual Illustration

Step 1: We set q̂ = p(Z|X, Θ^old); L(q̂, Θ) touches log p(X|Θ) at Θ^old.

Step 2: We maximize L(q̂, Θ) w.r.t. Θ (equivalent to maximizing Q(Θ, Θ^old)).

[Figure: the lower bound is alternately re-tightened ("after updating q") and maximized ("after maximizing w.r.t. Θ"), moving Θ uphill on log p(X|Θ) until a local maximum is found.]
What's Going On: Another Illustration

The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing).

To see this, consider both steps: (1) optimize q given Θ = Θ^old; (2) optimize Θ given this q.

[Figure: the bound L(q, Θ) and log p(X|Θ) after Step 1 and after Step 2.]

Step 1 keeps Θ fixed, so p(X|Θ) obviously can't decrease (it stays unchanged in this step).

Step 2 maximizes the lower bound L(q, Θ) w.r.t. Θ. Since Step 1 made the bound tight at Θ^old, we have log p(X|Θ^new) ≥ L(q̂, Θ^new) ≥ L(q̂, Θ^old) = log p(X|Θ^old). Thus p(X|Θ) can't decrease!
The Expectation Maximization (EM) Algorithm

The ALT-OPT of L(q, Θ) that we saw leads to the EM algorithm (Dempster, Laird, and Rubin, 1977).

The EM Algorithm
1. Initialize Θ as Θ^(0), set t = 1.
2. Step 1: Compute the posterior of the latent variables given the current parameters Θ^(t−1):

    p(z_n^(t) | x_n, Θ^(t−1)) = p(z_n^(t) | Θ^(t−1)) p(x_n | z_n^(t), Θ^(t−1)) / p(x_n | Θ^(t−1)) ∝ prior × likelihood

3. Step 2: Now maximize the expected complete-data log-likelihood w.r.t. Θ:

    Θ^(t) = arg max_Θ Q(Θ, Θ^(t−1)) = arg max_Θ Σ_{n=1}^N E_{p(z_n^(t) | x_n, Θ^(t−1))}[log p(x_n, z_n^(t) | Θ)]

4. If not yet converged, set t = t + 1 and go to step 2.

Note: If we instead take the MAP estimate ẑ_n of z_n (not the full posterior) in Step 1 and maximize the CLL in Step 2 using that estimate, i.e., do arg max_Θ Σ_{n=1}^N log p(x_n, ẑ_n^(t) | Θ), this is identical to ALT-OPT.
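Abstractly, the algorithm above is just a loop around these two steps. The following is a minimal, model-agnostic sketch; the callbacks `e_step` and `m_step` are hypothetical placeholders to be supplied per model (as in the GMM case later in the lecture):

```python
def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop (a sketch). `e_step(X, theta)` should return the posterior /
    expected sufficient statistics of Z plus the current log-likelihood log p(X|theta);
    `m_step(X, stats)` should maximize the expected complete-data log-likelihood."""
    theta = theta0
    prev_ll = float("-inf")
    for t in range(max_iters):
        stats, ll = e_step(X, theta)   # Step 1: posterior of latent variables under current theta
        theta = m_step(X, stats)       # Step 2: maximize the expected CLL w.r.t. theta
        if ll - prev_ll < tol:         # the monitored log-likelihood never decreases
            break
        prev_ll = ll
    return theta
```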
Writing Down the Expected CLL

Deriving the EM algorithm for any model requires finding the expression of the expected CLL:

    Q(Θ, Θ^old) = Σ_{n=1}^N E_{p(z_n | x_n, Θ^old)}[log p(x_n, z_n | Θ)]
                = Σ_{n=1}^N E_{p(z_n | x_n, Θ^old)}[log p(x_n | z_n, Θ) + log p(z_n | Θ)]

If p(x_n | z_n, Θ) and p(z_n | Θ) are exp-family distributions, the expected CLL will have a simple form.
Finding the expression for the expected CLL in such cases is fairly straightforward:
    First write down the expressions for p(x_n | z_n, Θ) and p(z_n | Θ) and simplify as much as possible.
    In the resulting expressions, replace all terms containing z_n by their respective expectations, e.g.,
        z_n replaced by E_{p(z_n | x_n, Θ^old)}[z_n], i.e., the posterior mean of z_n
        z_n z_n^T replaced by E_{p(z_n | x_n, Θ^old)}[z_n z_n^T]
        .. and so on.

The expected CLL may not always be computable and may need to be approximated.
EM for Gaussian Mixture Model

Let's first look at the CLL. It is similar to generative classification with Gaussian class-conditionals:

    log p(X, Z|Θ) = Σ_{n=1}^N Σ_{k=1}^K z_{nk} [log π_k + log N(x_n | µ_k, Σ_k)]    (we've seen how we get this)

The expected CLL Q(Θ, Θ^old) will be

    Q(Θ, Θ^old) = E[log p(X, Z|Θ)] = Σ_{n=1}^N Σ_{k=1}^K E[z_{nk}] [log π_k + log N(x_n | µ_k, Σ_k)]

.. where the expectation is w.r.t. the current posterior of z_n, i.e., p(z_n | x_n, Θ^old).

In this case, we only need E[z_{nk}], which can be computed as

    E[z_{nk}] = γ_{nk} = 0 × p(z_{nk} = 0 | x_n, Θ^old) + 1 × p(z_{nk} = 1 | x_n, Θ^old) = p(z_{nk} = 1 | x_n)
              ∝ p(z_{nk} = 1) p(x_n | z_{nk} = 1)    (from Bayes rule)

Thus E[z_{nk}] ∝ π_k N(x_n | µ_k, Σ_k)    (the posterior probability that x_n was generated by the k-th Gaussian)

Note: We can finally normalize E[z_{nk}] as E[z_{nk}] = π_k N(x_n | µ_k, Σ_k) / Σ_{ℓ=1}^K π_ℓ N(x_n | µ_ℓ, Σ_ℓ), since Σ_{k=1}^K E[z_{nk}] = 1.
EM for Gaussian Mixture Model

1. Initialize Θ = {π_k, µ_k, Σ_k}_{k=1}^K as Θ^(0), set t = 1.
2. E step: compute the expectation of each z_n (we need it in the M step):

    E[z_{nk}^(t)] = γ_{nk}^(t) = π_k^(t−1) N(x_n | µ_k^(t−1), Σ_k^(t−1)) / Σ_{ℓ=1}^K π_ℓ^(t−1) N(x_n | µ_ℓ^(t−1), Σ_ℓ^(t−1))    ∀ n, k

3. M step: given the "responsibilities" γ_{nk} = E[z_{nk}] and N_k = Σ_{n=1}^N γ_{nk}, re-estimate Θ via MLE:

    µ_k^(t) = (1/N_k) Σ_{n=1}^N γ_{nk}^(t) x_n

    Σ_k^(t) = (1/N_k) Σ_{n=1}^N γ_{nk}^(t) (x_n − µ_k^(t)) (x_n − µ_k^(t))^T

    π_k^(t) = N_k / N

4. Set t = t + 1 and go to step 2 if not yet converged.
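For concreteness, here is a compact NumPy/SciPy sketch of the algorithm above. The function name `gmm_em`, the random initialization, and the lack of covariance regularization or log-space computation are choices made here for brevity, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, max_iters=200, tol=1e-6, seed=0):
    """A minimal EM-for-GMM sketch following the E/M steps on this slide."""
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Theta^(0): uniform weights, K means picked from the data, shared data covariance
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    prev_ll = float("-inf")
    for t in range(max_iters):
        # E step: gamma_{nk} = E[z_{nk}], proportional to pi_k N(x_n | mu_k, Sigma_k)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])                                              # shape (N, K), unnormalized
        ll = np.log(dens.sum(axis=1)).sum()             # log p(X | Theta), for monitoring
        gamma = dens / dens.sum(axis=1, keepdims=True)  # normalize over k

        # M step: re-estimate Theta = {pi_k, mu_k, Sigma_k} via responsibility-weighted MLE
        Nk = gamma.sum(axis=0)                          # effective counts N_k
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]

        if ll - prev_ll < tol:                          # EM never decreases log p(X | Theta)
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```

In practice one would compute the responsibilities in log space (log-sum-exp) for numerical stability and run several random restarts, since EM only finds a local optimum.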
Another Example: (Probabilistic) Dimensionality Reduction

Let's consider a latent factor model for dimensionality reduction (we will revisit this later):

    p(x_n | z_n, W, σ²) = N(W z_n, σ² I_D)        p(z_n) = N(0, I_K)

A low-dimensional z_n ∈ R^K is mapped to a high-dimensional x_n ∈ R^D via a projection matrix W ∈ R^{D×K}.

The complete-data log-likelihood for this model will be

    log p(X, Z|W, σ²) = log Π_{n=1}^N p(x_n, z_n | W, σ²) = log Π_{n=1}^N p(x_n | z_n, W, σ²) p(z_n) = Σ_{n=1}^N {log p(x_n | z_n, W, σ²) + log p(z_n)}

Plugging in the expressions for p(x_n | z_n, W, σ²) and p(z_n) and simplifying (exercise):

    CLL = − Σ_{n=1}^N { (D/2) log σ² + (1/(2σ²)) ||x_n||² − (1/σ²) z_n^T W^T x_n + (1/(2σ²)) tr(z_n z_n^T W^T W) + (1/2) tr(z_n z_n^T) }

The expected CLL will require replacing z_n by E[z_n] and z_n z_n^T by E[z_n z_n^T].

These expectations can be obtained from the posterior p(z_n | x_n) (easy to compute due to conjugacy).

The M step maximizes the expected CLL w.r.t. the parameters (W and σ² in this case).
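The slide defers the actual E and M step expressions; for reference, below is a minimal NumPy sketch using the standard probabilistic-PCA EM updates (Tipping & Bishop), which is what these E/M steps work out to for this model. The function name `ppca_em`, the zero-mean-data assumption, the random initialization, and the fixed iteration count are assumptions made here, not from the slides.

```python
import numpy as np

def ppca_em(X, K, max_iters=200, seed=0):
    """EM sketch for the latent factor model p(x_n|z_n) = N(W z_n, sigma^2 I), p(z_n) = N(0, I).
    Assumes the data X (N x D) has been centered."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.standard_normal((D, K))
    sigma2 = 1.0

    for t in range(max_iters):
        # E step: p(z_n | x_n) = N(Minv W^T x_n, sigma^2 Minv), with M = W^T W + sigma^2 I
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(K))   # (K, K)
        Ez = X @ W @ Minv                                    # rows are E[z_n], shape (N, K)
        EzzT_sum = N * sigma2 * Minv + Ez.T @ Ez             # sum_n E[z_n z_n^T], shape (K, K)

        # M step: maximize the expected CLL w.r.t. W and sigma^2
        W = (X.T @ Ez) @ np.linalg.inv(EzzT_sum)             # (D, K)
        sigma2 = (np.sum(X ** 2)
                  - 2.0 * np.sum(Ez * (X @ W))
                  + np.trace(EzzT_sum @ W.T @ W)) / (N * D)
    return W, sigma2, Ez
```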
The EM Algorithm: Some Comments

The E and M steps may not always be possible to perform exactly. Some reasons:

The posterior of the latent variables p(Z|X, Θ) may not be easy to find.
    We would need to approximate p(Z|X, Θ) in such a case.

Even if p(Z|X, Θ) is easy, the expected CLL, i.e., E[log p(X, Z|Θ)], may still not be tractable:

    E[log p(X, Z|Θ)] = ∫ log p(X, Z|Θ) p(Z|X, Θ) dZ

    .. which can be approximated, e.g., using a Monte-Carlo expectation (called Monte-Carlo EM; see the estimator sketched after this slide).

Maximization of the expected CLL may not be possible in closed form.
    EM works even if the M step is only solved approximately (Generalized EM).
    If the M step has multiple parameters whose updates depend on each other, they are updated in an alternating fashion; this is called the Expectation Conditional Maximization (ECM) algorithm.

Other advanced probabilistic inference algorithms are based on ideas similar to EM, e.g., Variational Bayesian (VB) inference.
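For the Monte-Carlo EM case mentioned above, the expected CLL is replaced by a sample average over draws from the current posterior:

$$
Q(\Theta, \Theta^{\text{old}}) \;\approx\; \frac{1}{S}\sum_{s=1}^{S} \log p(\mathbf{X}, \mathbf{Z}^{(s)} \mid \Theta),
\qquad \mathbf{Z}^{(s)} \sim p(\mathbf{Z} \mid \mathbf{X}, \Theta^{\text{old}}),
$$

where the samples Z^(s) may themselves come from an approximate sampler (e.g., MCMC) when the exact posterior cannot be sampled directly.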
