CS772 Lec10

The document discusses latent variable models and the Expectation-Maximization (EM) algorithm. In latent variable models, the likelihood is difficult to maximize directly because the latent variables (such as cluster assignments) are unobserved. The EM algorithm instead maximizes an easier quantity, a lower bound on the log likelihood: in the expectation (E) step it computes the conditional posterior of the latent variables, and in the maximization (M) step it estimates the model parameters by maximizing the expected complete-data log likelihood (the Q function), where the expectation is taken with respect to that conditional posterior.


Latent Variable Models and EM Algorithm

CS772A: Probabilistic Machine Learning


Piyush Rai

Conditional Posterior
▪ Consider a model with 𝐾 unknown params/hyperparams Θ = (𝜃1, 𝜃2, … , 𝜃𝐾). The joint posterior is

  𝑝(Θ|𝑿) = 𝑝(Θ) 𝑝(𝑿|Θ) / 𝑝(𝑿) = 𝑝(Θ) 𝑝(𝑿|Θ) / ∫ 𝑝(Θ) 𝑝(𝑿|Θ) 𝑑𝜃1 𝑑𝜃2 … 𝑑𝜃𝐾

  The denominator is usually an intractable integral, so the full posterior can’t be computed exactly

▪ We can, however, compute conditional posteriors (CP), which for each 𝜃𝑖 look like

  𝑝(𝜃𝑖 | whatever 𝜃𝑖 depends on)

  where the conditioning can be on data and/or other params/hyperparams given their fixed values (or current estimates)

▪ To compute each CP, look at the joint distribution 𝑝(𝑿, Θ)

  𝑝(𝑿, Θ) = 𝑝(𝑿, 𝜃1, 𝜃2, … , 𝜃𝐾) = 𝑝(𝑿|𝜃1, 𝜃2, … , 𝜃𝐾) 𝑝(𝜃1|𝜃2, … , 𝜃𝐾) 𝑝(𝜃2|𝜃3, … , 𝜃𝐾) … 𝑝(𝜃𝐾)

▪ The CP of 𝜃𝑖 will be proportional to the product of all the terms involving 𝜃𝑖

▪ If those terms are conjugate to each other, it is called local conjugacy; the CP is then easy to compute (see the example below)

▪ Many algorithms for computing point estimates/full posteriors use the CPs
▪ Expectation Maximization, Variational Inference, MCMC (especially Gibbs sampling)
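
As a concrete illustration of reading off a CP and of local conjugacy, consider a hypothetical example (not from the slides): Gaussian observations with unknown mean 𝜇 and unknown precision 𝜏, each with its own independent prior (the prior parameters 𝜇0, 𝜎0², 𝑎, 𝑏 are assumed notation):

```latex
\begin{align*}
% Joint distribution of the data and the two unknowns
p(\mathbf{X}, \mu, \tau)
  &= \mathrm{Gamma}(\tau \mid a, b)\;
     \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)
     \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1}) \\
% CP of mu: keep only the factors involving mu
% (Gaussian prior x Gaussian likelihood => Gaussian, i.e., locally conjugate)
p(\mu \mid \mathbf{X}, \tau)
  &\propto \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)
     \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1}) \\
% CP of tau: keep only the factors involving tau
% (Gamma prior x Gaussian likelihood in tau => Gamma, again locally conjugate)
p(\tau \mid \mathbf{X}, \mu)
  &\propto \mathrm{Gamma}(\tau \mid a, b)
     \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1})
\end{align*}
```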

Latent Variable Models
▪ Application 1: Can use latent variables to learn latent properties/features of data, e.g.,
▪ Cluster assignment of each observation (in mixture models)
▪ Low-dim rep. or “code” of each observation (e.g., prob. PCA, variational autoencoders, etc)

[Figure: plate notation of a generic LVM: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛, 𝒙𝑛 inside a plate of size 𝑁 and 𝜙, 𝜃 outside]
𝑝(𝒛𝑛|𝜙): a suitable prior distribution based on the nature of 𝒛𝑛
𝑝(𝒙𝑛|𝒛𝑛, 𝜃): a suitable likelihood based on the nature of 𝒙𝑛

▪ In such apps, the latent variables (the 𝒛𝑛’s) are called “local variables” (specific to individual obs.) and the other unknown parameters/hyperparams (𝜃, 𝜙 above) are called “global variables”

Latent Variable Models
▪ Application 2: Sometimes, augmenting a model with latent variables simplifies inference
▪ These latent variables aren’t part of the original model definition
▪ Some popular examples of such augmentation include
▪ In probit regression for binary classification, we can model each label 𝑦𝑛 ∈ {0,1} as 𝑦𝑛 = 𝕀[𝑧𝑛 > 0], where 𝑧𝑛 ∼ 𝒩(𝒘⊤𝒙𝑛, 1) is an auxiliary latent variable, and use EM etc. to infer the unknowns 𝒘 and the 𝑧𝑛’s (PML-2, Sec 15.4); a generative-process sketch is given below
▪ Many sparse priors on weights can be thought of as Gaussian “scale-mixtures”, e.g., 𝑝(𝑤𝑑) = ∫ 𝒩(𝑤𝑑|0, 𝜏𝑑) 𝑝(𝜏𝑑) 𝑑𝜏𝑑, where the 𝜏𝑑’s are latent variables. Can use EM to infer 𝒘, 𝝉 (MLAPP 13.4.4 – EM for LASSO)
▪ Such augmentations can often make a non-conjugate model a locally conjugate one
▪ Conditional posteriors of the unknowns often have closed form in such cases
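
For intuition, a small generative-process sketch of the probit augmentation above (illustrative; the weights `w_true` and the data sizes are assumed toy values, not from the slides):

```python
# Probit regression via the auxiliary-variable augmentation described above:
# z_n ~ N(w^T x_n, 1),  y_n = I[z_n > 0].  Marginalizing z_n recovers the
# probit likelihood p(y_n = 1 | x_n, w) = Phi(w^T x_n).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, D = 1000, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])          # assumed toy weights

z = X @ w_true + rng.normal(size=N)          # auxiliary latent variables
y = (z > 0).astype(int)                      # observed binary labels

# Sanity check: the empirical fraction of y = 1 should roughly match the
# average probit probability Phi(w^T x) over the dataset
print(y.mean(), norm.cdf(X @ w_true).mean())
```
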
Nomenclature/Notation Alert
▪ Why are some unknowns called parameters and others latent variables?
▪ There is no deep reason; it is mostly a convention adopted by some algorithms
▪ EM: unknowns estimated in the E step are referred to as latent variables; those in the M step as parameters
▪ Usually: latent variables – (conditional) posterior computed; parameters – point estimation

▪ Some algos won’t make such a distinction and will infer a posterior over all unknowns
▪ Sometimes the “global” vs “local” unknown distinction makes it clear
▪ Local variables = latent variables, global variables = parameters

▪ But remember that this nomenclature isn’t cast in stone; there is no need to be confused so long as you are clear about the role of each unknown, how we want to estimate it (posterior or point estimate), and what type of inference algorithm is used


Hybrid Inference (posterior infer. + point est.)
▪ In many models, we infer the posterior on some unknowns and do point estimation for others

▪ We have already seen MLE-II for linear regression, which alternates between
▪ Inferring the CP over the main parameter given the point estimates of the hyperparams: 𝑝(𝒘|𝑿, 𝒚, 𝜆̂, 𝛽̂)
▪ Maximizing the marginal likelihood to do point estimation for the hyperparams: (𝜆̂, 𝛽̂) = argmax𝜆,𝛽 𝑝(𝒚|𝑿, 𝜆, 𝛽)

▪ The Expectation-Maximization algorithm (will see today) also does something similar
▪ In the E step, the CP of the latent variables is inferred, given the current point estimates of the params
▪ The M step maximizes the expected complete data log-likelihood to get point estimates of the params

▪ If we can’t (due to computational or other reasons) infer the posterior over all unknowns, how do we decide which variables to infer a posterior on, and for which to do point estimation?

▪ Usual approach: Infer the posterior over local variables and point estimates for global variables
▪ Reason: We typically have plenty of data to reliably estimate the global variables, so it is okay even if we just do point estimation for those

Inference/Parameter Estimation in Latent Variable Models using Expectation-Maximization (EM)

Parameter Estimation in Latent Variable Models
▪ Assume each observation 𝒙𝑛 is associated with a “local” latent variable 𝒛𝑛
[Figure: the same plate notation as before: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛, 𝒙𝑛 inside a plate of size 𝑁]
𝑝(𝒛𝑛|𝜙): a suitable prior distribution based on the nature of 𝒛𝑛
𝑝(𝒙𝑛|𝒛𝑛, 𝜃): a suitable likelihood based on the nature of 𝒙𝑛

▪ Although we can do fully Bayesian inference for all the unknowns, suppose we only want a point estimate of the “global” parameters Θ = (𝜃, 𝜙) via MLE/MAP
▪ Such MLE/MAP problems in LVMs are difficult to solve in a “clean” way
▪ They would typically require gradient-based methods, with no closed-form updates for Θ
▪ However, EM gives a clean way to obtain closed-form updates for Θ

Why is MLE/MAP of Params Hard for LVMs?
▪ Suppose we want to estimate Θ = (𝜃, 𝜙) via MLE. If we knew the 𝒛𝑛’s, we could solve

  Θ̂ = argmaxΘ ∑𝑛=1..𝑁 log 𝑝(𝒙𝑛, 𝒛𝑛|Θ) = argmaxΘ ∑𝑛=1..𝑁 [log 𝑝(𝒛𝑛|𝜙) + log 𝑝(𝒙𝑛|𝒛𝑛, 𝜃)]

▪ This is easy to solve: usually closed form if 𝑝(𝒛𝑛|𝜙) and 𝑝(𝒙𝑛|𝒛𝑛, 𝜃) have simple forms (in particular, if they are exp-fam distributions)
▪ However, since in LVMs 𝒛𝑛 is hidden, the MLE problem for Θ will be the following

  Θ̂ = argmaxΘ ∑𝑛=1..𝑁 log 𝑝(𝒙𝑛|Θ) = argmaxΘ ∑𝑛=1..𝑁 log ∑𝒛𝑛 𝑝(𝒙𝑛, 𝒛𝑛|Θ)

  which is basically the marginal likelihood after summing/integrating out 𝒛𝑛
▪ log 𝑝(𝒙𝑛|Θ) will not have a simple expression, since 𝑝(𝒙𝑛|Θ) requires a sum/integral over 𝒛𝑛
▪ MLE now becomes difficult (basically MLE-II now), with no closed-form expression for Θ
▪ Can we maximize some other quantity instead of log 𝑝(𝒙𝑛|Θ) for this MLE?

An Important Identity
▪ Let 𝑝𝑧 = 𝑝(𝒁|𝑿, Θ) and let 𝑞(𝒁) be some probability distribution over 𝒁 (assume 𝒁 discrete). Then

  log 𝑝(𝑿|Θ) = ℒ(𝑞, Θ) + KL(𝑞||𝑝𝑧)      (verify the identity)

▪ In the above, ℒ(𝑞, Θ) = ∑𝒁 𝑞(𝒁) log [𝑝(𝑿, 𝒁|Θ)/𝑞(𝒁)]

▪ KL(𝑞||𝑝𝑧) = − ∑𝒁 𝑞(𝒁) log [𝑝(𝒁|𝑿, Θ)/𝑞(𝒁)]

▪ KL is always non-negative, so log 𝑝(𝑿|Θ) ≥ ℒ(𝑞, Θ)

▪ Thus ℒ(𝑞, Θ) is a lower bound on log 𝑝(𝑿|Θ)

▪ Thus if we maximize ℒ(𝑞, Θ), it will also improve log 𝑝(𝑿|Θ)

▪ Also, as we’ll see, it’s easier to maximize ℒ(𝑞, Θ)
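
As the slide suggests (“verify the identity”), the decomposition follows by writing log 𝑝(𝑿|Θ) as an expectation under 𝑞 and splitting the ratio; a short derivation:

```latex
\begin{align*}
\log p(\mathbf{X}\mid\Theta)
  &= \sum_{\mathbf{Z}} q(\mathbf{Z}) \log p(\mathbf{X}\mid\Theta)
     && \text{(since $\textstyle\sum_{\mathbf{Z}} q(\mathbf{Z}) = 1$)} \\
  &= \sum_{\mathbf{Z}} q(\mathbf{Z})
     \log \frac{p(\mathbf{X},\mathbf{Z}\mid\Theta)}{p(\mathbf{Z}\mid\mathbf{X},\Theta)}
     && \text{(product rule: $p(\mathbf{X}\mid\Theta) = p(\mathbf{X},\mathbf{Z}\mid\Theta)/p(\mathbf{Z}\mid\mathbf{X},\Theta)$)} \\
  &= \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z})
     \log \frac{p(\mathbf{X},\mathbf{Z}\mid\Theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q,\Theta)}
   + \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z})
     \log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid\mathbf{X},\Theta)}}_{\mathrm{KL}(q\,\|\,p_z)}
\end{align*}
```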

Maximizing ℒ(𝑞, Θ)
(log 𝑝(𝑿|Θ), the log of the marginal likelihood with 𝒁 integrated out, is called the Incomplete-Data Log Likelihood (ILL))

▪ ℒ(𝑞, Θ) depends on 𝑞 and Θ. We’ll use ALT-OPT to maximize it

▪ Let’s maximize ℒ(𝑞, Θ) w.r.t. 𝑞 with Θ fixed at some Θold. Since log 𝑝(𝑿|Θ) = ℒ(𝑞, Θ) + KL(𝑞||𝑝𝑧) is constant when Θ is held fixed at Θold,

  𝑞̂ = argmax𝑞 ℒ(𝑞, Θold) = argmin𝑞 KL(𝑞||𝑝𝑧) = 𝑝𝑧 = 𝑝(𝒁|𝑿, Θold)

  i.e., the posterior distribution of 𝒁 given the current parameters Θold

▪ Now let’s maximize ℒ(𝑞, Θ) w.r.t. Θ with 𝑞 fixed at 𝑞̂ = 𝑝𝑧 = 𝑝(𝒁|𝑿, Θold)

  Θnew = argmaxΘ ℒ(𝑞̂, Θ) = argmaxΘ ∑𝒁 𝑝(𝒁|𝑿, Θold) log [𝑝(𝑿, 𝒁|Θ)/𝑝(𝒁|𝑿, Θold)]
       = argmaxΘ ∑𝒁 𝑝(𝒁|𝑿, Θold) log 𝑝(𝑿, 𝒁|Θ)
       = argmaxΘ 𝔼𝑝(𝒁|𝑿, Θold)[log 𝑝(𝑿, 𝒁|Θ)]
       = argmaxΘ 𝒬(Θ, Θold)

  This is the maximization of the expected CLL, where log 𝑝(𝑿, 𝒁|Θ) is the Complete-Data Log Likelihood (CLL) and the expectation is w.r.t. the posterior distribution of 𝒁 given the current parameters Θold. This is much easier than maximizing the ILL, since the CLL has simple expressions (it is akin to knowing 𝒁)

The Expectation-Maximization (EM) Algorithm
▪ ALT-OPT of ℒ(𝑞, Θ) w.r.t. 𝑞 and Θ gives the EM algorithm (Dempster, Laird, Rubin, 1977)
▪ EM is primarily designed for doing point estimation of the parameters Θ, but it also gives the (CP of the) latent variables 𝒛𝑛
▪ Usually, computing the CP + the expected CLL is referred to as the E step, and maximization of the expected CLL w.r.t. Θ as the M step
▪ With the latent variables assumed independent a priori, the CP of each latent variable 𝒛𝑛 can be computed separately, and the (expected) CLL 𝔼𝑝(𝒁|𝑿, Θold)[log 𝑝(𝑿, 𝒁|Θ)] factorizes over all observations (see the summary below)

▪ Note: If we take the MAP estimate 𝒛̂𝑛 of 𝒛𝑛 (not the full posterior) in the E step and maximize the CLL in the M step using that, i.e., do argmaxΘ ∑𝑛=1..𝑁 log 𝑝(𝒙𝑛, 𝒛̂𝑛^(𝑡)|Θ), this will be ALT-OPT
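
A compact summary of the two steps in the notation above (this is the standard EM formulation, using the per-observation factorization just mentioned):

```latex
\begin{align*}
&\textbf{E step: } \text{compute } p(\mathbf{z}_n \mid \mathbf{x}_n, \Theta^{(t)})
  \text{ for } n = 1,\dots,N, \text{ and form} \\
&\qquad \mathcal{Q}(\Theta, \Theta^{(t)})
   = \sum_{n=1}^{N}
     \mathbb{E}_{p(\mathbf{z}_n \mid \mathbf{x}_n, \Theta^{(t)})}
     \big[ \log p(\mathbf{x}_n, \mathbf{z}_n \mid \Theta) \big] \\
&\textbf{M step: } \Theta^{(t+1)}
   = \operatorname*{argmax}_{\Theta} \; \mathcal{Q}(\Theta, \Theta^{(t)})
\end{align*}
```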

The Expected CLL
▪ The expected CLL in EM is given by (assuming observations are i.i.d.)

  𝒬(Θ, Θold) = ∑𝑛=1..𝑁 𝔼𝑝(𝒛𝑛|𝒙𝑛, Θold)[log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)]

▪ If 𝑝(𝒛𝑛|Θ) and 𝑝(𝒙𝑛|𝒛𝑛, Θ) are exp-family distributions, 𝒬(Θ, Θold) has a very simple form
▪ In the resulting expressions, replace terms containing the 𝒛𝑛’s by their respective expectations, e.g.,
▪ 𝒛𝑛 replaced by 𝔼𝑝(𝒛𝑛|𝒙𝑛, Θ̂)[𝒛𝑛]
▪ 𝒛𝑛𝒛𝑛⊤ replaced by 𝔼𝑝(𝒛𝑛|𝒙𝑛, Θ̂)[𝒛𝑛𝒛𝑛⊤]
(an illustration of why only these expectations are needed is shown below)

▪ However, in some LVMs, these expectations are intractable to compute and need to be approximated (will see some examples later)
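
To see why only such expectations are needed, consider a hypothetical case (not from the slides) where the CLL is linear and quadratic in 𝒛𝑛, as in probabilistic-PCA-style models; by linearity of expectation and the trace trick:

```latex
\begin{align*}
\log p(\mathbf{x}_n, \mathbf{z}_n \mid \Theta)
  &= c(\Theta) + \mathbf{a}(\Theta)^\top \mathbf{z}_n
     - \tfrac{1}{2}\, \mathbf{z}_n^\top \mathbf{B}(\Theta)\, \mathbf{z}_n \\
\mathbb{E}\big[\log p(\mathbf{x}_n, \mathbf{z}_n \mid \Theta)\big]
  &= c(\Theta) + \mathbf{a}(\Theta)^\top \mathbb{E}[\mathbf{z}_n]
     - \tfrac{1}{2}\, \mathrm{tr}\!\big(\mathbf{B}(\Theta)\,
       \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^\top]\big)
\end{align*}
```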

What’s Going On?
▪ As we saw, the maximization of the lower bound ℒ(𝑞, Θ) had two steps, alternated until convergence to some local optimum
▪ Step 1 finds the optimal 𝑞 (call it 𝑞̂) by setting it to the posterior of 𝒁 given the current Θ. The KL term becomes zero and ℒ(𝑞, Θ) becomes equal to log 𝑝(𝑿|Θ); thus their curves touch at the current Θ
▪ Step 2 maximizes ℒ(𝑞̂, Θ) w.r.t. Θ, which gives a new Θ. Note that Θ only changes in Step 2, so the objective log 𝑝(𝑿|Θ) can only change in Step 2

[Figure: the curve of log 𝑝(𝑿|Θ) with the lower bound ℒ(𝑞̂, Θ) (green curve, after setting 𝑞 to 𝑞̂) touching it at the current Θ; the iterates Θ(0), Θ(1), Θ(2), Θ(3), … approach Θ(𝑀𝐿𝐸), a local optimum found for Θ𝑀𝐿𝐸]

▪ Good initialization matters; otherwise EM may converge to a poor local optimum
▪ EM is also kind of similar to Newton’s method (and has second-order-like convergence behavior in some cases). Unlike Newton’s method, we don’t construct and optimize a quadratic approximation, but a lower bound
▪ Even though the original MLE problem argmaxΘ log 𝑝(𝑿|Θ) could be solved using gradient methods, EM often works faster and has cleaner updates

EM vs Gradient-based Methods
▪ Can also estimate params using gradient-based optimization instead of EM
▪ We can usually explicitly sum over or integrate out the latent variables 𝒁, e.g.,

  ℒ(Θ) = log 𝑝(𝑿|Θ) = ∑𝑛=1..𝑁 log ∑𝒛𝑛 𝑝(𝒙𝑛, 𝒛𝑛|Θ)

▪ Now we can optimize ℒ(Θ) using first/second-order optimization to find the optimal Θ
▪ EM is usually preferred over this approach because
▪ The M step often has simple closed-form updates for the parameters Θ
▪ Constraints (e.g., PSD covariance matrices) are often automatically satisfied due to the form of the updates
▪ In some cases†, EM converges faster (and often like second-order methods), e.g., mixture of Gaussians when the data is reasonably well-clustered
▪ EM applies even when the explicit summing over/integrating out is expensive/intractable
▪ EM also provides the conditional posterior over the latent variables 𝒁 (from the E step)

†Optimization with EM and Expectation-Conjugate-Gradient (Salakhutdinov et al., 2003); On Convergence Properties of the EM Algorithm for Gaussian Mixtures (Xu and Jordan, 1996); Statistical Guarantees for the EM Algorithm: From Population to Sample-based Analysis (Balakrishnan et al., 2017)

Some Applications of EM
▪ Mixture models and dimensionality reduction/representation learning
▪ Mixture models: mixture of Gaussians, mixture of experts, etc.
▪ Dim. reduction/representation learning: probabilistic PCA, variational autoencoders
▪ Problems with missing features or missing labels (which are treated as latent variables)

  Θ̂ = argmaxΘ log 𝑝(𝒙𝑜𝑏𝑠|Θ) = argmaxΘ log ∫ 𝑝(𝒙𝑜𝑏𝑠, 𝒙𝑚𝑖𝑠𝑠|Θ) 𝑑𝒙𝑚𝑖𝑠𝑠

  Θ̂ = argmaxΘ ∑𝑛=1..𝑁 log 𝑝(𝒙𝑛, 𝑦𝑛|Θ) + ∑𝑛=𝑁+1..𝑁+𝑀 log ∑𝑐=1..𝐾 𝑝(𝒙𝑛, 𝑦𝑛 = 𝑐|Θ)

▪ Hyperparameter estimation in probabilistic models (an alternative to MLE-II)
▪ MLE-II estimates hyperparams by maximizing the marginal likelihood, e.g., for a Bayesian linear regression model,

  (𝜆̂, 𝛽̂) = argmax𝜆,𝛽 𝑝(𝒚|𝑿, 𝜆, 𝛽) = argmax𝜆,𝛽 ∫ 𝑝(𝒚|𝒘, 𝑿, 𝛽) 𝑝(𝒘|𝜆) 𝑑𝒘

▪ With EM, we can instead treat 𝒘 as a latent variable, and 𝜆, 𝛽 as the “parameters”
▪ The E step will estimate the CP of 𝒘 given the current estimates of 𝜆, 𝛽
▪ The M step will re-estimate 𝜆, 𝛽 by maximizing the expected CLL (expectations w.r.t. the CP of 𝒘); a sketch of this scheme is shown below
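
A minimal sketch of this EM scheme for Bayesian linear regression, assuming a prior 𝒘 ∼ 𝒩(0, 𝜆⁻¹𝑰) and noise precision 𝛽; the function name, initialization, and the closed-form M-step updates below are standard but written here as an illustrative assumption rather than the lecture's code:

```python
# EM for hyperparameters (lambda, beta) of Bayesian linear regression,
# treating the weights w as the latent variable.
import numpy as np

def em_bayes_linreg(X, y, n_iters=100):
    N, D = X.shape
    lam, beta = 1.0, 1.0                      # initial hyperparameter guesses (assumed)
    for _ in range(n_iters):
        # E step: conditional posterior of w given current (lam, beta),
        #   p(w | X, y, lam, beta) = N(w | m, S)
        S = np.linalg.inv(lam * np.eye(D) + beta * X.T @ X)
        m = beta * S @ X.T @ y
        # M step: maximize the expected CLL w.r.t. (lam, beta), using
        #   E[w] = m and E[w w^T] = S + m m^T
        lam = D / (m @ m + np.trace(S))
        resid = y - X @ m
        beta = N / (resid @ resid + np.trace(X @ S @ X.T))
    return lam, beta, m, S

# Example usage with synthetic data (assumed toy values)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)
lam_hat, beta_hat, m, S = em_bayes_linreg(X, y)
```
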

An Example: Mixture Models
▪ If the 𝒛𝑛 were known, this would just become generative classification, for which we know how to estimate 𝜃 and 𝜙 given training data

▪ Assume 𝐾 probability distributions (e.g., Gaussians), one for each cluster. The resulting 𝑝(𝒙) is a Gaussian mixture model (GMM)
▪ 𝒛𝑛 is a discrete latent variable (with 𝐾 possible values), or equivalently a one-hot vector of length 𝐾, modeled by a multinoulli distribution as its prior:

  𝑝(𝒛𝑛|𝜙) = multinoulli(𝝅), i.e., 𝑝(𝒛𝑛 = 𝑘|𝜙) = 𝜋𝑘

  where 𝝅 = (𝜋1, 𝜋2, … , 𝜋𝐾) is the parameter vector of the multinoulli distribution
▪ 𝒙𝑛 is assumed generated from one of the 𝐾 distributions depending on the true (but unknown) value of 𝒛𝑛 (which clustering will find). The likelihood is

  𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)

  where 𝜃 = {𝜇𝑘, Σ𝑘}𝑘=1..𝐾 are the parameters of the 𝐾 distributions
▪ The log-likelihood will be

  log 𝑝(𝒙𝑛|Θ) = log ∑𝑘=1..𝐾 𝑝(𝒙𝑛, 𝒛𝑛 = 𝑘|Θ) = log ∑𝑘=1..𝐾 𝑝(𝒛𝑛 = 𝑘|𝜙) 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = log ∑𝑘=1..𝐾 𝜋𝑘 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)

  MLE on this objective won’t give a closed-form solution for the parameters (evaluating this objective is sketched below)
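
A small sketch (illustrative, not from the lecture) of evaluating this GMM log-likelihood with the log-sum-exp trick for numerical stability; the toy parameter values are assumptions:

```python
# Incomplete-data log-likelihood of a GMM:
#   sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    # log_pk[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_pk = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
         for pi, mu, Sigma in zip(pis, mus, Sigmas)],
        axis=1,
    )
    # log p(x_n | Theta) = logsumexp_k log_pk[n, k]; then sum over observations
    return logsumexp(log_pk, axis=1).sum()

# Example usage with a 2-component, 2-D mixture (assumed toy values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2 * np.eye(2)]
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```
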

Detour: MLE for Generative Classification
▪ Assume a 𝐾-class generative classification model with Gaussian class-conditionals
▪ Assume class 𝑘 = 1,2, … , 𝐾 is modeled by a Gaussian with mean 𝜇𝑘 and covariance matrix Σ𝑘
▪ The labels 𝒛𝑛 (known) are one-hot vectors, with 𝑧𝑛𝑘 = 1 if 𝒛𝑛 = 𝑘, and 𝑧𝑛𝑘 = 0 otherwise
▪ Assuming the class prior is 𝑝(𝒛𝑛 = 𝑘) = 𝜋𝑘, the model has params Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}𝑘=1..𝐾
▪ Given training data {𝒙𝑛, 𝒛𝑛}𝑛=1..𝑁, the MLE solution will be (see also the sketch below)

  𝜋̂𝑘 = (1/𝑁) ∑𝑛=1..𝑁 𝑧𝑛𝑘 = 𝑁𝑘/𝑁, where 𝑁𝑘 is the number of training examples for which 𝒛𝑛 = 𝑘

  𝜇̂𝑘 = (1/𝑁𝑘) ∑𝑛=1..𝑁 𝑧𝑛𝑘 𝒙𝑛 = (1/𝑁𝑘) ∑𝑛:𝒛𝑛=𝑘 𝒙𝑛

  Σ̂𝑘 = (1/𝑁𝑘) ∑𝑛=1..𝑁 𝑧𝑛𝑘 (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤ = (1/𝑁𝑘) ∑𝑛:𝒛𝑛=𝑘 (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤
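
A short sketch (illustrative) of these closed-form MLE formulas, given data X (N × D) and one-hot labels Z (N × K); the helper name gen_classification_mle and the toy data are assumptions:

```python
import numpy as np

def gen_classification_mle(X, Z):
    N, D = X.shape
    Nk = Z.sum(axis=0)                              # N_k: count of examples per class
    pis = Nk / N                                    # pi_k = N_k / N
    mus = (Z.T @ X) / Nk[:, None]                   # mu_k = mean of class-k examples
    Sigmas = np.empty((Z.shape[1], D, D))
    for k in range(Z.shape[1]):
        Xc = X - mus[k]                             # center w.r.t. the class-k mean
        # Sigma_k = (1/N_k) sum_n z_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigmas[k] = (Z[:, k, None] * Xc).T @ Xc / Nk[k]
    return pis, mus, Sigmas

# Example usage (assumed toy data): 2 classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
Z = np.zeros((100, 2)); Z[:50, 0] = 1; Z[50:, 1] = 1
pis, mus, Sigmas = gen_classification_mle(X, Z)
```
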

Detour: MLE for Generative Classification
▪ Here is a formal derivation of the MLE solution for Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}𝑘=1..𝐾

  Θ̂ = argmaxΘ 𝑝(𝑿, 𝒁|Θ) = argmaxΘ ∏𝑛=1..𝑁 𝑝(𝒙𝑛, 𝒛𝑛|Θ)
     = argmaxΘ ∏𝑛=1..𝑁 𝑝(𝒛𝑛|Θ) 𝑝(𝒙𝑛|𝒛𝑛, Θ)          (multinoulli prior, Gaussian likelihood)
     = argmaxΘ ∏𝑛=1..𝑁 [∏𝑘=1..𝐾 𝜋𝑘^{𝑧𝑛𝑘}] [∏𝑘=1..𝐾 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ)^{𝑧𝑛𝑘}]
     = argmaxΘ ∏𝑛=1..𝑁 ∏𝑘=1..𝐾 [𝜋𝑘 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ)]^{𝑧𝑛𝑘}
     = argmaxΘ log ∏𝑛=1..𝑁 ∏𝑘=1..𝐾 [𝜋𝑘 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ)]^{𝑧𝑛𝑘}
     = argmaxΘ ∑𝑛=1..𝑁 ∑𝑘=1..𝐾 𝑧𝑛𝑘 [log 𝜋𝑘 + log 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)]

▪ In general, in models with probability distributions from the exponential family, the MLE problem will usually have a simple analytic form
▪ Also, due to the form of the likelihood (Gaussian) and prior (multinoulli), the MLE problem has a nice separable structure after taking the log
▪ We can see that, when estimating the parameters (𝜋𝑘, 𝜇𝑘, Σ𝑘) of the 𝑘-th Gaussian, we will only need training examples from the 𝑘-th class, i.e., examples for which 𝑧𝑛𝑘 = 1
▪ The form of this expression is important; we will encounter it in GMM too

EM for Mixture Models
▪ So how do we estimate the parameters of a GMM where the 𝒛𝑛’s are unknown?
▪ Well, we kind of already know how to do this (remember generative classification?): make a guess for the value of each 𝒛𝑛, then estimate 𝜃 and 𝜙 as we do in generative classification. However, we will need to repeat the guess-and-estimate steps a few times until convergence

▪ The guess about 𝒛𝑛 can be in one of two forms
▪ A “hard” guess – a single best value 𝒛̂𝑛 (some “optimal” value of the random variable 𝒛𝑛)
▪ The “expected” value 𝔼[𝒛𝑛] of the random variable 𝒛𝑛

▪ Using the hard guess 𝒛̂𝑛 of 𝒛𝑛 will result in an ALT-OPT-like algorithm
▪ Using the expected value of 𝒛𝑛 will give the so-called Expectation-Maximization (EM) algorithm, which is pretty much like ALT-OPT but with soft/expected values of the latent variables

EM for Gaussian Mixture Model (GMM)
▪ EM finds Θ𝑀𝐿𝐸 by maximizing the expectation of the CLL, 𝔼[log 𝑝(𝑿, 𝒁|Θ)], rather than log 𝑝(𝑿, 𝒁|Θ) itself
▪ Note: The expectation will be w.r.t. the conditional posterior distribution of 𝒁, i.e., 𝑝(𝒁|𝑿, Θ̂). It is a “conditional” posterior because it is also conditioned on Θ (which it requires knowing), not just the data 𝑿
▪ The EM algorithm for GMM operates as follows
▪ Initialize Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}𝑘=1..𝐾 as Θ̂
▪ Repeat until convergence
▪ Compute the conditional posterior 𝑝(𝒁|𝑿, Θ̂) (needed to get the expected CLL). Since the observations are i.i.d., compute it separately for each 𝑛 (and for 𝑘 = 1,2, … , 𝐾)

  𝑝(𝒛𝑛 = 𝑘|𝒙𝑛, Θ̂) ∝ 𝑝(𝒛𝑛 = 𝑘|Θ̂) 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ̂) = 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘)

  (same as 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ̂), just a different notation)
▪ Update Θ by maximizing the expected complete data log-likelihood

  Θ̂ = argmaxΘ 𝔼𝑝(𝒁|𝑿, Θ̂)[log 𝑝(𝑿, 𝒁|Θ)] = argmaxΘ ∑𝑛=1..𝑁 𝔼𝑝(𝒛𝑛|𝒙𝑛, Θ̂)[log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)]
     = argmaxΘ 𝔼[∑𝑛=1..𝑁 ∑𝑘=1..𝐾 𝑧𝑛𝑘 (log 𝜋𝑘 + log 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘))]
     = argmaxΘ ∑𝑛=1..𝑁 ∑𝑘=1..𝐾 𝔼[𝑧𝑛𝑘] (log 𝜋𝑘 + log 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘))

  The solution has a similar form as ALT-OPT (or generative classification), except that we now use the expectation of 𝑧𝑛𝑘:

  𝜋̂𝑘 = (1/𝑁) ∑𝑛=1..𝑁 𝔼[𝑧𝑛𝑘]    𝜇̂𝑘 = (1/𝑁𝑘) ∑𝑛=1..𝑁 𝔼[𝑧𝑛𝑘] 𝒙𝑛    Σ̂𝑘 = (1/𝑁𝑘) ∑𝑛=1..𝑁 𝔼[𝑧𝑛𝑘] (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤

  where 𝑁𝑘 is the effective number of points in cluster 𝑘

EM for GMM (Contd)
▪ The EM algo for GMM requires 𝔼[𝑧𝑛𝑘]. Note 𝑧𝑛𝑘 ∈ {0,1}, so

  𝔼[𝑧𝑛𝑘] = 𝛾𝑛𝑘 = 0 × 𝑝(𝑧𝑛𝑘 = 0|𝒙𝑛, Θ̂) + 1 × 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ̂) = 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ̂) ∝ 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘)

▪ We need to normalize (so that ∑𝑘=1..𝐾 𝛾𝑛𝑘 = 1):

  𝔼[𝑧𝑛𝑘] = 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘) / ∑ℓ=1..𝐾 𝜋̂ℓ 𝒩(𝒙𝑛|𝜇̂ℓ, Σ̂ℓ)

▪ The 𝜋̂𝑘 term accounts for the fraction of points in each cluster; the 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘) term accounts for the cluster shapes (since each cluster is a Gaussian)
▪ Soft 𝐾-means, which is more of a heuristic to get soft clustering, also gave us probabilities but doesn’t account for cluster shapes or the fraction of points in each cluster
▪ The effective number of points in the 𝑘-th cluster is 𝑁𝑘 = ∑𝑛=1..𝑁 𝛾𝑛𝑘, which is used in the M-step updates shown on the previous slide (a full E/M sketch is given below)
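
Putting the E and M steps together, a minimal EM-for-GMM sketch (illustrative, not the lecture's code); the initialization scheme, iteration count, and toy data are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization (assumed): uniform mixing weights, random data points as
    # means, identity covariances
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] = E[z_nk]
        #   gamma_nk proportional to pi_k N(x_n | mu_k, Sigma_k), normalized over k
        gamma = np.stack(
            [pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
             for k in range(K)],
            axis=1,
        )
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: closed-form updates using the responsibilities
        Nk = gamma.sum(axis=0)                    # effective number of points per cluster
        pis = Nk / N                              # pi_k = N_k / N
        mus = (gamma.T @ X) / Nk[:, None]         # mu_k = responsibility-weighted mean
        for k in range(K):
            Xc = X - mus[k]
            # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T
            Sigmas[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return pis, mus, Sigmas, gamma

# Example usage on toy data drawn from two well-separated Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
pis, mus, Sigmas, gamma = em_gmm(X, K=2)
```
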

EM: Some Final Comments
▪ The E and M steps may not always be possible to perform exactly. Some reasons:
▪ The conditional posterior of the latent variables 𝑝(𝒁|𝑿, Θ) may not be easy to compute; we will then need to approximate 𝑝(𝒁|𝑿, Θ) using methods such as MCMC or variational inference
▪ Even if 𝑝(𝒁|𝑿, Θ) is easy, the expected CLL may not be easy to compute; it can often be approximated by Monte Carlo using samples from the CP of 𝒁, which results in Monte-Carlo EM (sketched below)
▪ Maximization of the expected CLL may not be possible in closed form

▪ EM works even if the M step is only solved approximately (Generalized EM)
▪ If the M step has multiple parameters whose updates depend on each other, they are updated in an alternating fashion, called Expectation Conditional Maximization (ECM)
▪ Other advanced probabilistic inference algos are based on ideas similar to EM
▪ E.g., Variational EM, Variational Bayes (VB) inference, a.k.a. Variational Inference (VI)
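
A brief sketch of the Monte Carlo approximation mentioned above (standard form; the number of samples 𝑆 is assumed notation):

```latex
\begin{align*}
&\mathbf{Z}^{(s)} \sim p(\mathbf{Z} \mid \mathbf{X}, \Theta^{\mathrm{old}}),
   \qquad s = 1, \dots, S \\
&\mathcal{Q}(\Theta, \Theta^{\mathrm{old}})
   = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \Theta^{\mathrm{old}})}
     \big[ \log p(\mathbf{X}, \mathbf{Z} \mid \Theta) \big]
   \;\approx\; \frac{1}{S} \sum_{s=1}^{S}
     \log p\big(\mathbf{X}, \mathbf{Z}^{(s)} \mid \Theta\big)
\end{align*}
```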
