CS772 Lec10
▪ We can however compute conditional posteriors (CPs), which for each 𝜃𝑖 look like
𝑝(𝜃𝑖 | whatever 𝜃𝑖 depends on)
where the conditioning set can be data and/or other params/hyperparams, given their fixed values (or current estimates)
▪ To compute each CP, look at the joint distribution 𝑝 𝑿, Θ
𝑝(𝑿, Θ) = 𝑝(𝑿, 𝜃1, 𝜃2, … , 𝜃𝐾) = 𝑝(𝑿|𝜃1, 𝜃2, … , 𝜃𝐾) 𝑝(𝜃1|𝜃2, … , 𝜃𝐾) 𝑝(𝜃2|𝜃3, … , 𝜃𝐾) ⋯ 𝑝(𝜃𝐾)
▪ Many algorithms for computing point estimate/full posterior use the CPs
▪ Expectation Maximization, Variational Inference, MCMC (especially Gibbs sampling)
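As a small worked illustration (my own notation, not from the slides), each CP can be read off from the joint by treating everything except 𝜃𝑖 as fixed and keeping only the factors of the above decomposition that contain 𝜃𝑖:

```latex
% theta_{-i} denotes all parameters except theta_i (my notation)
p(\theta_i \mid \mathbf{X}, \theta_{-i})
  \;=\; \frac{p(\mathbf{X}, \theta_1, \ldots, \theta_K)}{p(\mathbf{X}, \theta_{-i})}
  \;\propto\; p(\mathbf{X}, \theta_1, \ldots, \theta_K)
  \quad \text{viewed as a function of } \theta_i \text{ alone}
```

Any factor of the chain-rule decomposition above that does not involve 𝜃𝑖 gets absorbed into the normalization constant.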
Latent Variable Models
▪ Application 1: Can use latent variables to learn latent properties/features of data, e.g.,
▪ Cluster assignment of each observation (in mixture models)
▪ Low-dim rep. or “code” of each observation (e.g., prob. PCA, variational autoencoders, etc)
[Plate notation of a generic LVM: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ In such apps, latent variables (𝒛𝑛's) are called "local variables" (specific to individual obs.) and other unknown parameters/hyperparams (𝜃, 𝜙 above) are called "global variables"
Latent Variable Models
▪ Application 2: Sometimes, augmenting a model by latent variables simplifies inference
▪ These latent variables aren’t part of the original model definition
▪ Some of the popular examples of such augmentation include
▪ In Probit regression for binary classification, we can model each label 𝑦𝑛 ∈ {0,1} as
𝑦𝑛 = 𝕀[𝑧𝑛 > 0] where 𝑧𝑛 ∼ 𝒩(𝒘⊤ 𝒙𝑛 , 1) is an auxiliary latent variable
… and use EM etc. to infer the unknowns 𝒘 and 𝑧𝑛's (PML-2, Sec 15.4)
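A minimal numeric sketch (assuming NumPy/SciPy; the weight vector and input below are purely illustrative) of why this augmentation reproduces the probit likelihood: marginalizing 𝑧𝑛 gives 𝑝(𝑦𝑛 = 1|𝒙𝑛) = Φ(𝒘⊤𝒙𝑛).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, S = 3, 200000
w = rng.normal(size=D)                      # illustrative weight vector
x = rng.normal(size=D)                      # a single illustrative input

# Augmented view: z ~ N(w^T x, 1), then y = I[z > 0]
z = rng.normal(loc=w @ x, scale=1.0, size=S)
y = (z > 0).astype(int)

# Marginalizing z recovers the probit likelihood P(y = 1 | x) = Phi(w^T x)
print(y.mean(), norm.cdf(w @ x))            # the two numbers should be close
```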
▪ Many sparse priors on weights can be thought of as Gaussian "scale-mixtures", e.g., each weight 𝑤𝑑 ∼ 𝒩(0, 𝜏𝑑) with a suitable prior on the scale 𝜏𝑑 (an exponential prior on 𝜏𝑑 yields the Laplace prior on 𝑤𝑑)
… where 𝜏𝑑's are latent vars. Can use EM to infer 𝒘, 𝜏 (MLAPP 13.4.4: EM for LASSO)
▪ Such augmentations can often make a non-conjugate model a locally conjugate one
▪ Conditional posteriors of the unknowns often have closed form in such cases
Nomenclature/Notation Alert
▪ Why call some unknowns parameters and others latent variables?
▪ Well, no specific reason. Sort of a convention adopted by some algorithms
▪ EM: Unknowns estimated in E step referred to as latent vars; those in M step as params
▪ Usually: Latent vars – (Conditional) posterior computed; parameters – point estimation
▪ Some algos won’t make such distinction and will infer posterior over all unknowns
▪ Sometimes the “global” or “local” unknown distinction makes it clear
▪ Local variables = latent variables, global variables = parameters
▪ But remember that this nomenclature isn't really cast in stone; there is no need to be confused as long as you are clear about the role of each unknown, how we want to estimate it (posterior or point estimate), and what type of inference algorithm we use
Hybrid Inference (posterior infer. + point est.)
▪ In many models, we infer posterior on some unknowns and do point est. for others
▪ We have already seen MLE-II for lin. reg., which alternates between
▪ Inferring the CP over the main parameter given the point estimates of the hyperparams: 𝑝(𝒘|𝑿, 𝒚, 𝜆̂, 𝛽̂)
▪ Maximizing the marginal lik. to do point estimation for the hyperparams: (𝜆̂, 𝛽̂) = argmax_{𝜆,𝛽} 𝑝(𝒚|𝑿, 𝜆, 𝛽)
▪ The Expectation-Maximization algorithm (will see today) also does something similar
▪ In E step, the CP of latent variables is inferred, given current point-est of params
▪ M step maximizes expected complete data log-lik. to get point estimates of params
▪ If we can’t (due to computational or other reasons) infer posterior over all unknowns,
how to decide which variables to infer posterior on, and for which to do point-est?
▪ Usual approach: Infer posterior over local vars and point estimates for global vars
▪ Reason: We typically have plenty of data to reliably estimate the global variables so it is okay even
if we just do point estimation for those
Inference/Parameter Estimation in Latent Variable Models using Expectation-Maximization (EM)
Parameter Estimation in Latent Variable Models
▪ Assume each observation 𝒙𝑛 to be associated with a “local” latent variable 𝒛𝑛
𝑝(𝒛𝑛|𝜙): A suitable prior distribution based on the nature of 𝒛𝑛
𝑝(𝒙𝑛|𝒛𝑛, 𝜃): A suitable likelihood based on the nature of 𝒙𝑛
[Plate notation: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ Although we can do fully Bayesian inference for all the unknowns, suppose we
only want a point estimate of the “global” parameters Θ = (𝜃, 𝜙) via MLE/MAP
▪ Such MLE/MAP problems in LVMs are difficult to solve in a “clean” way
▪ Would typically require gradient based methods with no closed form updates for Θ
▪ However, EM gives a clean way to obtain closed form updates for Θ
Why is MLE/MAP of Params Hard for LVMs?
▪ Suppose we want to estimate Θ = (𝜃, 𝜙) via MLE. If we knew 𝒛𝑛, we could solve
Θ̂ = argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)      (easy to solve)
▪ Since 𝒛𝑛 is unknown, we instead have to maximize Σ_{n=1}^N log 𝑝(𝒙𝑛|Θ) = Σ_{n=1}^N log Σ_{𝒛𝑛} 𝑝(𝒙𝑛, 𝒛𝑛|Θ)
▪ MLE now becomes difficult (basically MLE-II now), no closed form expression for Θ
▪ Can we maximize some other quantity instead of log 𝑝(𝒙𝑛|Θ) for this MLE?
CS772A: PML
10
An Important Identity
▪ Assume 𝑝𝑧 = 𝑝(𝒁|𝑿, Θ) and 𝑞(𝒁) to be some prob. distribution over 𝒁, then
log 𝑝(𝑿|Θ) = ℒ(𝑞, Θ) + KL(𝑞||𝑝𝑧)      (verify the identity; assume 𝒁 discrete)
▪ In the above, ℒ(𝑞, Θ) = Σ_𝒁 𝑞(𝒁) log [𝑝(𝑿, 𝒁|Θ)/𝑞(𝒁)]
▪ KL(𝑞||𝑝𝑧) = −Σ_𝒁 𝑞(𝒁) log [𝑝(𝒁|𝑿, Θ)/𝑞(𝒁)]
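A small numeric check of this identity, as a sketch with an arbitrary discrete 𝒁 taking K values and an arbitrary 𝑞 (all numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

# Arbitrary joint p(x, z | Theta) for one observed x and a discrete z with K values,
# stored as the K numbers p(x, z = k | Theta) (they need not sum to 1)
p_xz = rng.random(K) * 0.1

log_px = np.log(p_xz.sum())                 # log p(x | Theta), marginalizing z out
p_z_given_x = p_xz / p_xz.sum()             # conditional posterior p_z = p(z | x, Theta)

q = rng.random(K)
q /= q.sum()                                # some arbitrary distribution q(z)

lower_bound = np.sum(q * np.log(p_xz / q))  # L(q, Theta)
kl = np.sum(q * np.log(q / p_z_given_x))    # KL(q || p_z)

print(log_px, lower_bound + kl)             # identical up to floating-point error
```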
[From the EM algorithm slide: Step 1 computes the conditional posterior of each latent variable 𝑧𝑛, i.e., 𝑝(𝒁|𝑿, Θ^old); the expected CLL 𝔼_{𝑝(𝒁|𝑿, Θ^old)}[log 𝑝(𝑿, 𝒁|Θ)] is assumed to factorize over all observations, with the latent variables also assumed indep. a priori]
▪ Note: If we can take the MAP estimate 𝑧̂𝑛 of 𝑧𝑛 (not full posterior) in Step 1 and maximize the CLL in Step 2 using that, i.e., do argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝑧̂𝑛^(𝑡)|Θ), this will be ALT-OPT
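Spelled out as a worked pair of updates (my own indexing of the note above), the ALT-OPT scheme alternates:

```latex
\hat{z}_n^{(t)} = \arg\max_{z_n}\; p\!\left(z_n \mid x_n, \Theta^{(t-1)}\right),
\qquad
\Theta^{(t)} = \arg\max_{\Theta}\; \sum_{n=1}^{N} \log p\!\left(x_n, \hat{z}_n^{(t)} \mid \Theta\right)
```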
The Expected CLL
▪ Expected CLL in EM is given by (assume observations are i.i.d.)
𝒬(Θ, Θ^old) = Σ_{n=1}^N 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)]
▪ If 𝑝(𝒛𝑛|Θ) and 𝑝(𝒙𝑛|𝒛𝑛, Θ) are exp-family distributions, 𝒬(Θ, Θ^old) has a very simple form
▪ In the resulting expressions, replace terms containing 𝒛𝑛's by their respective expectations, e.g.,
▪ 𝒛𝑛 replaced by 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[𝒛𝑛]
▪ 𝒛𝑛𝒛𝑛⊤ replaced by 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[𝒛𝑛𝒛𝑛⊤]
▪ However, in some LVMs, these expectations are intractable to compute and need to be
approximated (will see some examples later)
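A minimal numeric sketch (values illustrative) of the two expectations above for the common case where the conditional posterior of 𝒛𝑛 is Gaussian (as in, e.g., probabilistic PCA): 𝔼[𝒛𝑛] is the posterior mean and 𝔼[𝒛𝑛𝒛𝑛⊤] is the posterior covariance plus the outer product of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the E step gives a Gaussian conditional posterior p(z_n | x_n, Theta_old)
# with (illustrative) mean m and covariance S
m = np.array([0.5, -1.0])
A = rng.normal(size=(2, 2))
S = A @ A.T + np.eye(2)                     # some positive-definite covariance

# Expectations needed when plugging z_n into the expected CLL
E_z = m                                     # E[z_n] = posterior mean
E_zzT = S + np.outer(m, m)                  # E[z_n z_n^T] = Cov[z_n] + E[z_n] E[z_n]^T

# Monte-Carlo sanity check
z = rng.multivariate_normal(m, S, size=200000)
print(E_z, z.mean(axis=0))                                  # should match closely
print(E_zzT, (z[:, :, None] * z[:, None, :]).mean(axis=0))  # should match closely
```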
What's Going On?
▪ As we saw, the maximization of the lower bound ℒ(𝑞, Θ) had two steps; we alternate between them until convergence to some local optima
▪ When 𝑞 is set to the CP 𝑝𝑧, the KL becomes zero and ℒ(𝑞, Θ) becomes equal to log 𝑝(𝑿|Θ); thus their curves touch at the current Θ
▪ Alternatively, we can optimize ℒ(Θ) = log 𝑝(𝑿|Θ) directly using first/second order optimization to find the optimal Θ
▪ EM is usually preferred over this approach because
▪ The M step has often simple closed-form updates for the parameters Θ
▪ Often constraints (e.g., PSD matrices) are automatically satisfied due to form of updates
▪ In some cases†, EM converges faster (and often like second-order methods)
▪ E.g., mixture of Gaussians when the data is reasonably well-clustered
▪ EM applies even when the explicit summing over/integrating out is expensive/intractable
▪ EM also provides the conditional posterior over the latent variables Z (from E step)
†Optimization with EM and Expectation-Conjugate-Gradient (Salakhutdinov et al, 2003), On Convergence Properties of the EM Algorithm for Gaussian Mixtures (Xu and Jordan, 1996),
Statistical guarantees for the EM algorithm: From population to sample-based analysis (Balakrishnan et al, 2017)
Some Applications of EM
▪ Mixture Models and Dimensionality Reduction/Representation Learning
▪ Mixture Models: Mixture of Gaussians, Mixture of Experts, etc
▪ Dim. Reduction/Representation Learning: Probabilistic PCA, Variational Autoencoders
▪ Problems with missing features or missing labels (which are treated as latent variables)
▪ Θ̂ = argmax_Θ log 𝑝(𝒙_obs|Θ) = argmax_Θ log ∫ 𝑝(𝒙_obs, 𝒙_miss|Θ) 𝑑𝒙_miss      (missing features)
▪ Θ̂ = argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝑦𝑛|Θ) + Σ_{n=N+1}^{N+M} log Σ_{c=1}^K 𝑝(𝒙𝑛, 𝑦𝑛 = 𝑐|Θ)      (missing labels)
▪ Hyperparameter estimation in probabilistic models (an alternative to MLE-II)
▪ MLE-II estimates hyperparams by maximizing the marginal likelihood, e.g.,
(𝜆̂, 𝛽̂) = argmax_{𝜆,𝛽} 𝑝(𝒚|𝑿, 𝜆, 𝛽) = argmax_{𝜆,𝛽} ∫ 𝑝(𝒚|𝒘, 𝑿, 𝛽) 𝑝(𝒘|𝜆) 𝑑𝒘      (for a Bayesian linear regression model)
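For concreteness, a worked form of this marginal likelihood, assuming the usual choices 𝒘 ∼ 𝒩(𝟎, 𝜆⁻¹𝐈) and 𝒚|𝒘 ∼ 𝒩(𝑿𝒘, 𝛽⁻¹𝐈) (these specific forms are an assumption; the slide only states the general objective):

```latex
p(\mathbf{y} \mid \mathbf{X}, \lambda, \beta)
  = \int \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{X}\mathbf{w},\, \beta^{-1}\mathbf{I}_N\right)
         \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \lambda^{-1}\mathbf{I}_D\right) d\mathbf{w}
  = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\; \beta^{-1}\mathbf{I}_N + \lambda^{-1}\mathbf{X}\mathbf{X}^{\top}\right)
```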
▪ Example: Gaussian mixture model, where 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)
▪ The log-likelihood will be
log 𝑝(𝒙𝑛|Θ) = log Σ_{k=1}^K 𝑝(𝒙𝑛, 𝒛𝑛 = 𝑘|Θ) = log Σ_{k=1}^K 𝑝(𝒛𝑛 = 𝑘|𝜙) 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = log Σ_{k=1}^K 𝜋𝑘 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)
▪ MLE on this objective won't give a closed form solution for the parameters
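A minimal sketch (assuming NumPy/SciPy; the data and parameter values are illustrative) of evaluating this GMM log-likelihood; the log of a sum, computed stably with logsumexp below, is exactly what blocks closed-form MLE updates.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, D, K = 5, 2, 3
X = rng.normal(size=(N, D))                 # illustrative data

# Illustrative GMM parameters Theta = (pi, mu, Sigma)
pi = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(K, D))
Sigma = np.stack([np.eye(D)] * K)

# log p(x_n | Theta) = log sum_k pi_k N(x_n | mu_k, Sigma_k), done stably via logsumexp
log_pk = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                   for k in range(K)], axis=1)      # shape (N, K)
log_lik = logsumexp(log_pk, axis=1).sum()           # total log-likelihood of the data
print(log_lik)
```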
Detour: MLE for Generative Classification
▪ Assume a 𝐾 class generative classification model with Gaussian class-conditionals
▪ Assume class 𝑘 = 1,2, … , 𝐾 is modeled by a Gaussian with mean 𝜇𝑘 and cov matrix Σ𝑘
▪ The labels 𝒛𝑛 (known) are one-hot vecs. Also, 𝑧𝑛𝑘 = 1 if 𝒛𝑛 = 𝑘, and 𝑧𝑛𝑘 = 0 otherwise
▪ Assuming class prior as 𝑝(𝒛𝑛 = 𝑘) = 𝜋𝑘, the model has params Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}_{k=1}^K
▪ Given training data {𝒙𝑛, 𝒛𝑛}_{n=1}^N, the MLE solution will be
𝜋̂𝑘 = (1/𝑁) Σ_{n=1}^N 𝑧𝑛𝑘, same as 𝑁𝑘/𝑁, where 𝑁𝑘 is the # of training ex. for which 𝒛𝑛 = 𝑘
𝜇̂𝑘 = (1/𝑁𝑘) Σ_{n=1}^N 𝑧𝑛𝑘 𝒙𝑛, same as (1/𝑁𝑘) Σ_{n: 𝒛𝑛 = 𝑘} 𝒙𝑛
Σ̂𝑘 = (1/𝑁𝑘) Σ_{n=1}^N 𝑧𝑛𝑘 (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤, same as (1/𝑁𝑘) Σ_{n: 𝒛𝑛 = 𝑘} (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤
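A minimal NumPy sketch of these closed-form updates (the function name and array layout are my own; X holds the inputs row-wise and Z the one-hot labels):

```python
import numpy as np

def gaussian_gen_classifier_mle(X, Z):
    """MLE for a generative classifier with Gaussian class-conditionals.
    X: (N, D) inputs; Z: (N, K) one-hot labels. (Names/layout are my own.)"""
    N, D = X.shape
    K = Z.shape[1]
    Nk = Z.sum(axis=0)                          # per-class counts N_k
    pi = Nk / N                                 # pi_k = N_k / N
    mu = (Z.T @ X) / Nk[:, None]                # mu_k = (1/N_k) sum_n z_nk x_n
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        Xc = X - mu[k]
        # Sigma_k = (1/N_k) sum_n z_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma[k] = (Z[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi, mu, Sigma
```

With the one-hot labels known, this is just per-class counting and averaging, which is why the MLE here is "easy".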
Detour: MLE for Generative Classification
▪ Here is a formal derivation of the MLE solution for Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}_{k=1}^K
Θ̂ = argmax_Θ 𝑝(𝑿, 𝒁|Θ) = argmax_Θ Π_{n=1}^N 𝑝(𝒙𝑛, 𝒛𝑛|Θ)
  = argmax_Θ Π_{n=1}^N 𝑝(𝒛𝑛|Θ) 𝑝(𝒙𝑛|𝒛𝑛, Θ)      (multinoulli × Gaussian)
  = argmax_Θ Π_{n=1}^N [Π_{k=1}^K 𝜋𝑘^{𝑧𝑛𝑘}] [Π_{k=1}^K 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ)^{𝑧𝑛𝑘}]
▪ In general, in models with probability distributions from the exponential family, the MLE problem will usually have a simple analytic form
[Plate notation: 𝜙 → 𝒛𝑛 → 𝒙𝑛, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ The guess about 𝒛𝑛 can be in one of two forms
▪ A "hard" guess: a single best value 𝒛̂𝑛 (some "optimal" value of the random variable 𝒛𝑛)
▪ The "expected" value 𝔼[𝒛𝑛] of the random variable 𝒛𝑛
▪ Using the hard guess 𝒛̂𝑛 of 𝒛𝑛 will result in an ALT-OPT like algorithm
▪ Using the expected value of 𝒛𝑛 will give the so-called Expectation-Maximization (EM) algo (EM is pretty much like ALT-OPT but with soft/expected values of the latent variables)
EM for Gaussian Mixture Model (GMM)
▪ The EM algo for GMM requires 𝔼[𝑧𝑛𝑘] for the expectation of the CLL. Note 𝑧𝑛𝑘 ∈ {0,1}
𝔼[𝑧𝑛𝑘] = 𝛾𝑛𝑘 = 0 × 𝑝(𝑧𝑛𝑘 = 0|𝒙𝑛, Θ) + 1 × 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ) = 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ) ∝ 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘)
Need to normalize: 𝔼[𝑧𝑛𝑘] = 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘) / Σ_{ℓ=1}^K 𝜋̂ℓ 𝒩(𝒙𝑛|𝜇̂ℓ, Σ̂ℓ)
M-step: maximize the expected CLL, which gives the same form of updates as the generative classification MLE, but with each 𝑧𝑛𝑘 replaced by its expectation 𝛾𝑛𝑘 (and 𝑁𝑘 = Σ_{n=1}^N 𝛾𝑛𝑘)
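A minimal NumPy sketch of one EM iteration for a GMM along these lines (function name and array layout are my own): the E step computes the responsibilities 𝛾𝑛𝑘 and the M step reuses the generative-classification MLE updates with 𝑧𝑛𝑘 replaced by 𝛾𝑛𝑘.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, pi, mu, Sigma):
    """One EM iteration for a GMM. X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, K = X.shape[0], pi.shape[0]

    # E step: responsibilities gamma_nk = E[z_nk], i.e. normalized pi_k * N(x_n | mu_k, Sigma_k)
    gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: the closed-form MLE updates with z_nk replaced by its expectation gamma_nk
    Nk = gamma.sum(axis=0)
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.zeros_like(Sigma)
    for k in range(K):
        Xc = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi_new, mu_new, Sigma_new, gamma
```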
EM: Some Final Comments
▪ The E and M steps may not always be possible to perform exactly. Some reasons:
▪ The conditional posterior of latent variables 𝑝(𝒁|𝑿, Θ) may not be easy to compute
▪ Will need to approximate 𝑝(𝒁|𝑿, Θ) using methods such as MCMC or variational inference
▪ Even if 𝑝(𝒁|𝑿, Θ) is easy, the expected CLL may not be easy to compute
▪ It can often be approximated by Monte-Carlo averaging using samples from the CP of 𝒁; this results in Monte-Carlo EM
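A tiny sketch (arbitrary made-up numbers, and a discrete 𝑧 for simplicity) of such a Monte-Carlo approximation of the expected CLL using samples from the CP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte-Carlo approximation of the expected CLL for one observation:
#   E_{p(z|x,Theta_old)}[log p(x, z | Theta)]  ~=  (1/S) * sum_s log p(x, z^(s) | Theta)
# with z^(s) drawn from the conditional posterior p(z | x, Theta_old).
K, S = 4, 100000
log_p_xz = np.log(rng.random(K) * 0.1)      # log p(x, z = k | Theta), arbitrary numbers
cp = rng.random(K)
cp /= cp.sum()                              # p(z | x, Theta_old), arbitrary

z_samples = rng.choice(K, size=S, p=cp)     # samples from the CP of z
mc_estimate = log_p_xz[z_samples].mean()    # Monte-Carlo estimate of the expected CLL
exact = (cp * log_p_xz).sum()               # exact expectation, for comparison
print(mc_estimate, exact)                   # should be close
```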