CS771A, Lecture 16
Latent Variable Models and Expectation Maximization
Piyush Rai
Intro to Machine Learning (CS771A)
Recap: Latent Variable Models

Assume each observation x_n to be associated with a "local" latent variable z_n

Given data X = {x_1, ..., x_N}, the goal is to estimate the parameters Θ, the latent variables Z, or both
(note: we can usually estimate Θ given Z, and vice versa)
Why Estimation is Difficult in LVMs?

Suppose we want to estimate the parameters Θ. If we knew both x_n and z_n, then we could do

    \Theta_{MLE} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n, z_n | \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \left[ \log p(z_n | \phi) + \log p(x_n | z_n, \theta) \right]

Simple to solve (usually in closed form) if p(z_n | φ) and p(x_n | z_n, θ) are "simple" (e.g., exp-family distributions)

However, in LVMs where z_n is "hidden", the MLE problem becomes

    \Theta_{MLE} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n | \Theta) = \arg\max_{\Theta} \log p(X | \Theta)

The form of p(x_n | Θ) may not be simple, since we need to sum over the unknown z_n's possible values

    p(x_n | \Theta) = \sum_{z_n} p(x_n, z_n | \Theta)    ... or, if z_n is continuous:    p(x_n | \Theta) = \int p(x_n, z_n | \Theta) \, dz_n

The summation/integral may be intractable, and may lead to a complex expression for p(x_n | Θ) (in fact, almost never an exponential family distribution). MLE for Θ won't have a closed-form solution!
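As a concrete illustration (my own, not from the slides): the snippet below evaluates log p(X | Θ) for a toy 1-D mixture of two Gaussians by explicitly summing over the possible values of z_n. The resulting log-sum-of-exponentials is exactly what breaks the simple closed-form MLE.

```python
# A minimal sketch (illustration only): marginal log-likelihood of a
# toy 1-D two-component Gaussian mixture, obtained by summing out z_n.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def marginal_loglik(X, pi, mu, sigma):
    """log p(X | Theta) = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    # log [pi_k * N(x_n | mu_k, sigma_k^2)] for every (n, k) pair
    log_joint = np.log(pi)[None, :] + norm.logpdf(X[:, None], mu[None, :], sigma[None, :])
    # summing over z_n (the mixture components) gives a log-sum-exp,
    # which is not an exponential-family expression in Theta
    return logsumexp(log_joint, axis=1).sum()

X = np.array([-1.2, 0.3, 2.5, 3.1])
print(marginal_loglik(X, pi=np.array([0.5, 0.5]),
                      mu=np.array([0.0, 3.0]), sigma=np.array([1.0, 1.0])))
```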
An Important Identity

Assuming Z is discrete, the identity below holds for any choice of the distribution q(Z):

    \log p(X | \Theta) = L(q, \Theta) + KL(q \| p_z)

Maximizing L(q, Θ) will also improve log p(X|Θ). Also, as we'll see, it's easier to maximize L(q, Θ)
Maximizing L(q, Θ)

Note that L(q, Θ) depends on two things, q(Z) and Θ. Let's do ALT-OPT for these.

First recall the identity we had: log p(X|Θ) = L(q, Θ) + KL(q || p_z), with

    L(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z | \Theta)}{q(Z)} \right\}
    \qquad \text{and} \qquad
    KL(q \| p_z) = - \sum_{Z} q(Z) \log \left\{ \frac{p(Z | X, \Theta)}{q(Z)} \right\}

Maximize L w.r.t. q with Θ fixed at Θ^old: since log p(X|Θ^old) is a constant in this case, maximizing L(q, Θ^old) is the same as minimizing KL(q || p_z), which is achieved by setting q(Z) = p(Z|X, Θ^old)

Maximize L w.r.t. Θ with q fixed at p(Z|X, Θ^old): the −Σ_Z q(Z) log q(Z) part of L does not depend on Θ, so this amounts to maximizing E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)]

.. therefore, Θ^new = arg max_Θ Q(Θ, Θ^old), where Q(Θ, Θ^old) = E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)]

Q(Θ, Θ^old) = E_{p(Z|X,Θ^old)}[log p(X, Z|Θ)] is known as the expected complete-data log-likelihood (CLL)
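To make the identity concrete, here is a small numerical check (my own illustration, not from the slides) for a single observation with one discrete latent variable: for any q, the lower bound L(q, Θ) plus KL(q || p(z|x, Θ)) equals log p(x | Θ).

```python
# A minimal numerical check (illustration only) of
#   log p(x|Theta) = L(q, Theta) + KL(q || p(z|x, Theta))
# for one observation with a discrete latent z in {0, ..., K-1}.
import numpy as np

rng = np.random.default_rng(0)
K = 3
p_joint = rng.random(K)          # stands in for p(x, z=k | Theta), k = 0..K-1
log_px = np.log(p_joint.sum())   # log p(x | Theta), marginalizing out z
post = p_joint / p_joint.sum()   # p(z | x, Theta)

q = rng.random(K); q /= q.sum()  # an arbitrary distribution q(z)

elbo = np.sum(q * np.log(p_joint / q))   # L(q, Theta) = sum_z q(z) log [p(x,z|Theta)/q(z)]
kl = np.sum(q * np.log(q / post))        # KL(q || p(z|x, Theta))

print(log_px, elbo + kl)   # the two numbers agree (up to floating point)
```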
What's Going On: A Visual Illustration

[Figure: the lower bound L(q, Θ) being pushed up against log p(X|Θ) over successive iterations; the panels alternate between "after updating q" and "after maximizing L w.r.t. Θ"]
What's Going On: Another Illustration

The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing)

To see this, consider both steps: (1) optimize q given Θ = Θ^old; (2) optimize Θ given this q

[Figure: illustration of Step 1 and Step 2]

Step 1 keeps Θ fixed, so p(X|Θ) obviously can't decrease (it stays unchanged in this step)

Step 2 maximizes the lower bound L(q, Θ) w.r.t. Θ. Thus p(X|Θ) can't decrease!
The Expectation Maximization (EM) Algorithm

The ALT-OPT of L(q, Θ) that we saw leads to the EM algorithm (Dempster, Laird, and Rubin, 1977)

The EM Algorithm
1. Initialize Θ as Θ^(0), set t = 1
2. Step 1: Compute the posterior of the latent variables given the current parameters Θ^(t−1)

       p(z_n^{(t)} | x_n, \Theta^{(t-1)}) = \frac{p(z_n^{(t)} | \Theta^{(t-1)}) \; p(x_n | z_n^{(t)}, \Theta^{(t-1)})}{p(x_n | \Theta^{(t-1)})} \;\propto\; \text{prior} \times \text{likelihood}

3. Step 2: Now maximize the expected complete-data log-likelihood w.r.t. Θ

       \Theta^{(t)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(t-1)}) = \arg\max_{\Theta} \sum_{n=1}^{N} \mathbb{E}_{p(z_n^{(t)} | x_n, \Theta^{(t-1)})} \left[ \log p(x_n, z_n^{(t)} | \Theta) \right]

4. If not yet converged, set t = t + 1 and go to step 2.

Note: If we instead take the MAP estimate ẑ_n of z_n (not the full posterior) in Step 1 and maximize the CLL in Step 2 using that estimate, i.e., do arg max_Θ Σ_{n=1}^N log p(x_n, ẑ_n^{(t)} | Θ), this is identical to ALT-OPT
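A generic sketch of this loop in code (my own illustration; `e_step` and `m_step` are hypothetical placeholders for the model-specific computations in Steps 1 and 2):

```python
# A minimal, model-agnostic EM skeleton (illustration only).
# e_step and m_step are hypothetical placeholders: e_step returns the
# posterior p(z_n | x_n, Theta) (or whatever statistics of it the model
# needs) plus the current log-likelihood, and m_step maximizes the
# expected CLL given those posteriors.
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    theta = theta0
    prev_ll = -np.inf
    for t in range(1, max_iters + 1):
        posteriors, loglik = e_step(X, theta)   # Step 1: posterior of latent variables
        theta = m_step(X, posteriors)           # Step 2: maximize expected CLL w.r.t. Theta
        if loglik - prev_ll < tol:              # log p(X|Theta) never decreases, so this is safe
            break
        prev_ll = loglik
    return theta
```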
Writing Down the Expected CLL

Deriving the EM algorithm for any model requires finding the expression of the expected CLL

    Q(\Theta, \Theta^{old}) = \sum_{n=1}^{N} \mathbb{E}_{p(z_n | x_n, \Theta^{old})} \left[ \log p(x_n, z_n | \Theta) \right]
                            = \sum_{n=1}^{N} \mathbb{E}_{p(z_n | x_n, \Theta^{old})} \left[ \log p(x_n | z_n, \Theta) + \log p(z_n | \Theta) \right]

If p(x_n | z_n, Θ) and p(z_n | Θ) are exp-family distributions, the expected CLL will have a simple form

Finding the expression for the expected CLL in such cases is fairly straightforward:
  First write down the expressions for p(x_n | z_n, Θ) and p(z_n | Θ) and simplify as much as possible
  In the resulting expressions, replace all terms containing z_n by their respective expectations, e.g.,
    z_n replaced by E_{p(z_n | x_n, Θ^old)}[z_n], i.e., the posterior mean of z_n
    z_n z_n^⊤ replaced by E_{p(z_n | x_n, Θ^old)}[z_n z_n^⊤]
    .. and so on

The expected CLL may not always be computable and may need to be approximated
EM for Gaussian Mixture Model

Let's first look at the CLL. It is similar to generative classification with Gaussian class-conditionals:

    \log p(X, Z | \Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n | \mu_k, \Sigma_k) \right]    (we've seen how we get this)

The expected CLL simply replaces each z_nk by its expectation E[z_nk]

.. where the expectation is w.r.t. the current posterior of z_n, i.e., p(z_n | x_n, Θ^old)

In this case, we only need E[z_nk], which can be computed as

    \mathbb{E}[z_{nk}] = \gamma_{nk} = 0 \times p(z_{nk} = 0 | x_n, \Theta^{old}) + 1 \times p(z_{nk} = 1 | x_n, \Theta^{old}) = p(z_{nk} = 1 | x_n, \Theta^{old})
                       \propto p(z_{nk} = 1) \, p(x_n | z_{nk} = 1)    (from Bayes rule)

Thus E[z_nk] ∝ π_k N(x_n | μ_k, Σ_k)    (the posterior probability that x_n was generated by the k-th Gaussian)

Note: We can finally normalize E[z_nk] as

    \mathbb{E}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{\ell=1}^{K} \pi_\ell \, \mathcal{N}(x_n | \mu_\ell, \Sigma_\ell)}    since    \sum_{k=1}^{K} \mathbb{E}[z_{nk}] = 1
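A small numpy sketch (my own, not from the slides) of this E-step computation of the responsibilities γ_nk, done in log space for numerical stability:

```python
# A minimal E-step sketch for a GMM (illustration only): compute the
# responsibilities gamma[n, k] = E[z_nk] = p(z_nk = 1 | x_n, Theta_old).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step_gmm(X, pi, mu, Sigma):
    N, K = X.shape[0], pi.shape[0]
    log_r = np.empty((N, K))
    for k in range(K):
        # log [ pi_k * N(x_n | mu_k, Sigma_k) ]  (unnormalized log-responsibility)
        log_r[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
    # normalize over k so that sum_k gamma[n, k] = 1
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
```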
EM for Gaussian Mixture Model

[Algorithm slide; only fragments recovered] With N_k = Σ_{n=1}^N γ_{nk} denoting the effective number of points assigned to component k, the M step updates the mixing weights as

    \pi_k^{(t)} = \frac{N_k}{N}

Set t = t + 1 and go to step 2 if not yet converged
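The remaining updates follow from maximizing the expected CLL. Below is a hedged numpy sketch of one full M step using the standard GMM updates (my own reconstruction, not copied from the slide):

```python
# A minimal M-step sketch for a GMM (illustration only), using the standard
# updates obtained by maximizing the expected CLL with responsibilities gamma.
import numpy as np

def m_step_gmm(X, gamma):
    N, D = X.shape
    Nk = gamma.sum(axis=0)                        # N_k = sum_n gamma_nk
    pi = Nk / N                                   # pi_k = N_k / N
    mu = (gamma.T @ X) / Nk[:, None]              # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigma = np.empty((gamma.shape[1], D, D))
    for k in range(gamma.shape[1]):
        Xc = X - mu[k]                            # data centered at component k's mean
        # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi, mu, Sigma
```

Combined with the E-step sketch above, alternating `e_step_gmm` and `m_step_gmm` gives one full EM iteration for the mixture model.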
Another Example: (Probabilistic) Dimensionality Reduction

Let's consider a latent factor model for dimensionality reduction (we will revisit this later)

Plugging in the expressions for p(x_n | z_n, W, σ²) and p(z_n) and simplifying (exercise),

    \text{CLL} = - \sum_{n=1}^{N} \left[ \frac{D}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \|x_n\|^2 - \frac{1}{\sigma^2} z_n^\top W^\top x_n + \frac{1}{2\sigma^2} \mathrm{tr}(z_n z_n^\top W^\top W) + \frac{1}{2} \mathrm{tr}(z_n z_n^\top) \right]

The expected CLL requires E[z_n] and E[z_n z_n^⊤]; these expectations can be obtained from the posterior p(z_n | x_n) (easy to compute due to conjugacy)

The M step then maximizes the expected CLL w.r.t. the parameters (W and σ² in this case)
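For reference, here is a short numpy sketch (my own, assuming the standard probabilistic PCA posterior p(z_n | x_n) = N(M⁻¹Wᵀx_n, σ²M⁻¹) with M = WᵀW + σ²I, which follows from the Gaussian conjugacy mentioned above) of the E-step quantities E[z_n] and E[z_n z_nᵀ]:

```python
# E-step quantities for the probabilistic PCA model (illustration only),
# assuming p(x_n | z_n) = N(W z_n, sigma^2 I) and p(z_n) = N(0, I), so that
# p(z_n | x_n) = N(M^{-1} W^T x_n, sigma^2 M^{-1}) with M = W^T W + sigma^2 I.
import numpy as np

def ppca_e_step(X, W, sigma2):
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)              # K x K
    Minv = np.linalg.inv(M)
    Ez = X @ W @ Minv.T                           # E[z_n] = M^{-1} W^T x_n, stored row-wise (N x K)
    # E[z_n z_n^T] = sigma^2 M^{-1} + E[z_n] E[z_n]^T, one K x K matrix per n
    Ezz = sigma2 * Minv[None, :, :] + Ez[:, :, None] * Ez[:, None, :]
    return Ez, Ezz
```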
The EM Algorithm: Some Comments

The E and M steps may not always be possible to perform exactly. Some reasons:

The posterior of the latent variables p(Z|X, Θ) may not be easy to find
  Would need to approximate p(Z|X, Θ) in such a case

Even if p(Z|X, Θ) is easy, the expected CLL, i.e., E[log p(X, Z|Θ)], may still not be tractable

    \mathbb{E}[\log p(X, Z | \Theta)] = \int \log p(X, Z | \Theta) \; p(Z | X, \Theta) \, dZ

  .. which can be approximated, e.g., using a Monte-Carlo expectation (called Monte-Carlo EM; a small sketch is given below)

Maximization of the expected CLL may not be possible in closed form
  EM works even if the M step is only solved approximately (Generalized EM)

If the M step has multiple parameters whose updates depend on each other, they are updated in an alternating fashion; this is called the Expectation Conditional Maximization (ECM) algorithm

Other advanced probabilistic inference algorithms are based on ideas similar to EM, e.g., Variational Bayesian (VB) inference
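A minimal sketch of the Monte-Carlo approximation mentioned above (my own illustration): replace the intractable expectation with an average of the complete-data log-likelihood over samples drawn from the posterior. `sample_posterior` and `complete_data_loglik` are hypothetical model-specific functions.

```python
# Monte-Carlo approximation of the expected CLL (illustration only):
#   E_{p(Z|X, Theta_old)}[log p(X, Z | Theta)] ~= (1/S) sum_s log p(X, Z^(s) | Theta),
# where Z^(1), ..., Z^(S) are samples from p(Z | X, Theta_old).
import numpy as np

def monte_carlo_expected_cll(X, theta, theta_old, sample_posterior,
                             complete_data_loglik, S=100, seed=0):
    rng = np.random.default_rng(seed)
    samples = [sample_posterior(X, theta_old, rng) for _ in range(S)]  # Z^(s) ~ p(Z|X, Theta_old)
    return np.mean([complete_data_loglik(X, Z, theta) for Z in samples])
```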