CS772 Lec10
▪ We can however compute conditional posteriors (CPs), which for each 𝜃𝑖 look like
𝑝(𝜃𝑖 | whatever 𝜃𝑖 depends on)
where the conditioning set can be data and/or other params/hyperparams, given their fixed values (or current estimates)
▪ To compute each CP, look at the joint distribution 𝑝 𝑿, Θ
𝑝(𝑿, Θ) = 𝑝(𝑿, 𝜃1, 𝜃2, … , 𝜃𝐾) = 𝑝(𝑿|𝜃1, 𝜃2, … , 𝜃𝐾) 𝑝(𝜃1|𝜃2, … , 𝜃𝐾) 𝑝(𝜃2|𝜃3, … , 𝜃𝐾) ⋯ 𝑝(𝜃𝐾)
▪ Many algorithms for computing point estimate/full posterior use the CPs
▪ Expectation Maximization, Variational Inference, MCMC (especially Gibbs sampling)
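As a small worked illustration (my own notation, not from the slides), each CP can be read off from the joint by treating everything except 𝜃𝑖 as fixed and keeping only the factors of the above decomposition that contain 𝜃𝑖:

```latex
% theta_{-i} denotes all parameters except theta_i (my notation)
p(\theta_i \mid \mathbf{X}, \theta_{-i})
  \;=\; \frac{p(\mathbf{X}, \theta_1, \ldots, \theta_K)}{p(\mathbf{X}, \theta_{-i})}
  \;\propto\; p(\mathbf{X}, \theta_1, \ldots, \theta_K)
  \quad \text{viewed as a function of } \theta_i \text{ alone}
```

Any factor of the chain-rule decomposition above that does not involve 𝜃𝑖 gets absorbed into the normalization constant.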
Latent Variable Models
▪ Application 1: Can use latent variables to learn latent properties/features of data, e.g.,
▪ Cluster assignment of each observation (in mixture models)
▪ Low-dim rep. or “code” of each observation (e.g., prob. PCA, variational autoencoders, etc)
[Plate notation of a generic LVM: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ In such apps, latent variables (𝒛𝑛's) are called "local variables" (specific to individual obs.) and other unknown parameters/hyperparams (𝜃, 𝜙 above) are called "global variables"
Latent Variable Models
▪ Application 2: Sometimes, augmenting a model by latent variables simplifies inference
▪ These latent variables aren’t part of the original model definition
▪ Some of the popular examples of such augmentation include
▪ In Probit regression for binary classification, we can model each label 𝑦𝑛 ∈ {0,1} as
𝑦𝑛 = 𝕀[𝑧𝑛 > 0] where 𝑧𝑛 ∼ 𝒩(𝒘⊤ 𝒙𝑛 , 1) is an auxiliary latent variable
… and use EM etc. to infer the unknowns 𝒘 and 𝑧𝑛's (PML-2, Sec 15.4)
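A minimal numeric sketch (assuming NumPy/SciPy; the weight vector and input below are purely illustrative) of why this augmentation reproduces the probit likelihood: marginalizing 𝑧𝑛 gives 𝑝(𝑦𝑛 = 1|𝒙𝑛) = Φ(𝒘⊤𝒙𝑛).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, S = 3, 200000
w = rng.normal(size=D)                      # illustrative weight vector
x = rng.normal(size=D)                      # a single illustrative input

# Augmented view: z ~ N(w^T x, 1), then y = I[z > 0]
z = rng.normal(loc=w @ x, scale=1.0, size=S)
y = (z > 0).astype(int)

# Marginalizing z recovers the probit likelihood P(y = 1 | x) = Phi(w^T x)
print(y.mean(), norm.cdf(w @ x))            # the two numbers should be close
```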
▪ Many sparse priors on weights can be thought of as Gaussian "scale-mixtures", e.g., each weight 𝑤𝑑 ∼ 𝒩(0, 𝜏𝑑) with a suitable prior on the scale 𝜏𝑑 (an exponential prior on 𝜏𝑑 yields the Laplace prior on 𝑤𝑑)
… where 𝜏𝑑's are latent vars. Can use EM to infer 𝒘, 𝜏 (MLAPP 13.4.4: EM for LASSO)
▪ Such augmentations can often make a non-conjugate model a locally conjugate one
▪ Conditional posteriors of the unknowns often have closed form in such cases
Nomenclature/Notation Alert
▪ Why call some unknowns parameters and others latent variables?
▪ Well, no specific reason. Sort of a convention adopted by some algorithms
▪ EM: Unknowns estimated in E step referred to as latent vars; those in M step as params
▪ Usually: Latent vars – (Conditional) posterior computed; parameters – point estimation
▪ Some algos won’t make such distinction and will infer posterior over all unknowns
▪ Sometimes the “global” or “local” unknown distinction makes it clear
▪ Local variables = latent variables, global variables = parameters
▪ But remember that this nomenclature isn't really cast in stone; there is no need to be confused as long as you are clear about the role of each unknown, how we want to estimate it (posterior or point estimate), and what type of inference algorithm we use
Hybrid Inference (posterior infer. + point est.)
▪ In many models, we infer posterior on some unknowns and do point est. for others
▪ We have already seen MLE-II for lin. reg., which alternates between
▪ Inferring the CP over the main parameter given the point estimates of the hyperparams: 𝑝(𝒘|𝑿, 𝒚, 𝜆̂, 𝛽̂)
▪ Maximizing the marginal lik. to do point estimation for the hyperparams: (𝜆̂, 𝛽̂) = argmax_{𝜆,𝛽} 𝑝(𝒚|𝑿, 𝜆, 𝛽)
▪ The Expectation-Maximization algorithm (will see today) also does something similar
▪ In E step, the CP of latent variables is inferred, given current point-est of params
▪ M step maximizes expected complete data log-lik. to get point estimates of params
▪ If we can’t (due to computational or other reasons) infer posterior over all unknowns,
how to decide which variables to infer posterior on, and for which to do point-est?
▪ Usual approach: Infer posterior over local vars and point estimates for global vars
▪ Reason: We typically have plenty of data to reliably estimate the global variables so it is okay even
if we just do point estimation for those
Inference/Parameter Estimation in Latent Variable Models using Expectation-Maximization (EM)
Parameter Estimation in Latent Variable Models
▪ Assume each observation 𝒙𝑛 to be associated with a “local” latent variable 𝒛𝑛
𝑝(𝒛𝑛|𝜙): A suitable prior distribution based on the nature of 𝒛𝑛
𝑝(𝒙𝑛|𝒛𝑛, 𝜃): A suitable likelihood based on the nature of 𝒙𝑛
[Plate notation: 𝜙 → 𝒛𝑛 → 𝒙𝑛 ← 𝜃, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ Although we can do fully Bayesian inference for all the unknowns, suppose we
only want a point estimate of the “global” parameters Θ = (𝜃, 𝜙) via MLE/MAP
▪ Such MLE/MAP problems in LVMs are difficult to solve in a “clean” way
▪ Would typically require gradient based methods with no closed form updates for Θ
▪ However, EM gives a clean way to obtain closed form updates for Θ
Why is MLE/MAP of Params Hard for LVMs?
▪ Suppose we want to estimate Θ = (𝜃, 𝜙) via MLE. If we knew 𝒛𝑛, we could solve
Θ̂ = argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)      (easy to solve)
▪ Since 𝒛𝑛 is unknown, we instead have to maximize Σ_{n=1}^N log 𝑝(𝒙𝑛|Θ) = Σ_{n=1}^N log Σ_{𝒛𝑛} 𝑝(𝒙𝑛, 𝒛𝑛|Θ)
▪ MLE now becomes difficult (basically MLE-II now), no closed form expression for Θ
▪ Can we maximize some other quantity instead of log 𝑝(𝒙𝑛|Θ) for this MLE?
CS772A: PML
10
An Important Identity
▪ Assume 𝑝𝑧 = 𝑝(𝒁|𝑿, Θ) and 𝑞(𝒁) to be some prob. distribution over 𝒁, then
log 𝑝(𝑿|Θ) = ℒ(𝑞, Θ) + KL(𝑞||𝑝𝑧)      (verify the identity; assume 𝒁 discrete)
▪ In the above, ℒ(𝑞, Θ) = Σ_𝒁 𝑞(𝒁) log [𝑝(𝑿, 𝒁|Θ)/𝑞(𝒁)]
▪ KL(𝑞||𝑝𝑧) = −Σ_𝒁 𝑞(𝒁) log [𝑝(𝒁|𝑿, Θ)/𝑞(𝒁)]
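A small numeric check of this identity, as a sketch with an arbitrary discrete 𝒁 taking K values and an arbitrary 𝑞 (all numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

# Arbitrary joint p(x, z | Theta) for one observed x and a discrete z with K values,
# stored as the K numbers p(x, z = k | Theta) (they need not sum to 1)
p_xz = rng.random(K) * 0.1

log_px = np.log(p_xz.sum())                 # log p(x | Theta), marginalizing z out
p_z_given_x = p_xz / p_xz.sum()             # conditional posterior p_z = p(z | x, Theta)

q = rng.random(K)
q /= q.sum()                                # some arbitrary distribution q(z)

lower_bound = np.sum(q * np.log(p_xz / q))  # L(q, Theta)
kl = np.sum(q * np.log(q / p_z_given_x))    # KL(q || p_z)

print(log_px, lower_bound + kl)             # identical up to floating-point error
```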
[From the EM algorithm slide: Step 1 computes the conditional posterior of each latent variable 𝑧𝑛, i.e., 𝑝(𝒁|𝑿, Θ^old); the expected CLL 𝔼_{𝑝(𝒁|𝑿, Θ^old)}[log 𝑝(𝑿, 𝒁|Θ)] is assumed to factorize over all observations, with the latent variables also assumed indep. a priori]
▪ Note: If we can take the MAP estimate 𝑧̂𝑛 of 𝑧𝑛 (not full posterior) in Step 1 and maximize the CLL in Step 2 using that, i.e., do argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝑧̂𝑛^(𝑡)|Θ), this will be ALT-OPT
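Spelled out as a worked pair of updates (my own indexing of the note above), the ALT-OPT scheme alternates:

```latex
\hat{z}_n^{(t)} = \arg\max_{z_n}\; p\!\left(z_n \mid x_n, \Theta^{(t-1)}\right),
\qquad
\Theta^{(t)} = \arg\max_{\Theta}\; \sum_{n=1}^{N} \log p\!\left(x_n, \hat{z}_n^{(t)} \mid \Theta\right)
```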
The Expected CLL
▪ Expected CLL in EM is given by (assume observations are i.i.d.)
𝒬(Θ, Θ^old) = Σ_{n=1}^N 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[log 𝑝(𝒙𝑛, 𝒛𝑛|Θ)]
▪ If 𝑝(𝒛𝑛|Θ) and 𝑝(𝒙𝑛|𝒛𝑛, Θ) are exp-family distributions, 𝒬(Θ, Θ^old) has a very simple form
▪ In the resulting expressions, replace terms containing 𝒛𝑛's by their respective expectations, e.g.,
▪ 𝒛𝑛 replaced by 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[𝒛𝑛]
▪ 𝒛𝑛𝒛𝑛⊤ replaced by 𝔼_{𝑝(𝒛𝑛|𝒙𝑛, Θ^old)}[𝒛𝑛𝒛𝑛⊤]
▪ However, in some LVMs, these expectations are intractable to compute and need to be
approximated (will see some examples later)
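A minimal numeric sketch (values illustrative) of the two expectations above for the common case where the conditional posterior of 𝒛𝑛 is Gaussian (as in, e.g., probabilistic PCA): 𝔼[𝒛𝑛] is the posterior mean and 𝔼[𝒛𝑛𝒛𝑛⊤] is the posterior covariance plus the outer product of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the E step gives a Gaussian conditional posterior p(z_n | x_n, Theta_old)
# with (illustrative) mean m and covariance S
m = np.array([0.5, -1.0])
A = rng.normal(size=(2, 2))
S = A @ A.T + np.eye(2)                     # some positive-definite covariance

# Expectations needed when plugging z_n into the expected CLL
E_z = m                                     # E[z_n] = posterior mean
E_zzT = S + np.outer(m, m)                  # E[z_n z_n^T] = Cov[z_n] + E[z_n] E[z_n]^T

# Monte-Carlo sanity check
z = rng.multivariate_normal(m, S, size=200000)
print(E_z, z.mean(axis=0))                                  # should match closely
print(E_zzT, (z[:, :, None] * z[:, None, :]).mean(axis=0))  # should match closely
```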
What's Going On?
▪ As we saw, the maximization of the lower bound ℒ(𝑞, Θ) had two steps; we alternate between them until convergence to some local optima
▪ When 𝑞 is set to the CP 𝑝𝑧, the KL becomes zero and ℒ(𝑞, Θ) becomes equal to log 𝑝(𝑿|Θ); thus their curves touch at the current Θ
▪ Alternatively, we can optimize ℒ(Θ) = log 𝑝(𝑿|Θ) directly using first/second order optimization to find the optimal Θ
▪ EM is usually preferred over this approach because
▪ The M step has often simple closed-form updates for the parameters Θ
▪ Often constraints (e.g., PSD matrices) are automatically satisfied due to form of updates
▪ In some cases†, EM converges faster (and often like second-order methods)
▪ E.g., mixture of Gaussians when the data is reasonably well-clustered
▪ EM applies even when the explicit summing over/integrating out is expensive/intractable
▪ EM also provides the conditional posterior over the latent variables Z (from E step)
†Optimization with EM and Expectation-Conjugate-Gradient (Salakhutdinov et al, 2003), On Convergence Properties of the EM Algorithm for Gaussian Mixtures (Xu and Jordan, 1996),
Statistical guarantees for the EM algorithm: From population to sample-based analysis (Balakrishnan et al, 2017)
Some Applications of EM
▪ Mixture Models and Dimensionality Reduction/Representation Learning
▪ Mixture Models: Mixture of Gaussians, Mixture of Experts, etc
▪ Dim. Reduction/Representation Learning: Probabilistic PCA, Variational Autoencoders
▪ Problems with missing features or missing labels (which are treated as latent variables)
▪ Θ̂ = argmax_Θ log 𝑝(𝒙_obs|Θ) = argmax_Θ log ∫ 𝑝(𝒙_obs, 𝒙_miss|Θ) 𝑑𝒙_miss      (missing features)
▪ Θ̂ = argmax_Θ Σ_{n=1}^N log 𝑝(𝒙𝑛, 𝑦𝑛|Θ) + Σ_{n=N+1}^{N+M} log Σ_{c=1}^K 𝑝(𝒙𝑛, 𝑦𝑛 = 𝑐|Θ)      (missing labels)
▪ Hyperparameter estimation in probabilistic models (an alternative to MLE-II)
▪ MLE-II estimates hyperparams by maximizing the marginal likelihood, e.g.,
(𝜆̂, 𝛽̂) = argmax_{𝜆,𝛽} 𝑝(𝒚|𝑿, 𝜆, 𝛽) = argmax_{𝜆,𝛽} ∫ 𝑝(𝒚|𝒘, 𝑿, 𝛽) 𝑝(𝒘|𝜆) 𝑑𝒘      (for a Bayesian linear regression model)
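For concreteness, a worked form of this marginal likelihood, assuming the usual choices 𝒘 ∼ 𝒩(𝟎, 𝜆⁻¹𝐈) and 𝒚|𝒘 ∼ 𝒩(𝑿𝒘, 𝛽⁻¹𝐈) (these specific forms are an assumption; the slide only states the general objective):

```latex
p(\mathbf{y} \mid \mathbf{X}, \lambda, \beta)
  = \int \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{X}\mathbf{w},\, \beta^{-1}\mathbf{I}_N\right)
         \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \lambda^{-1}\mathbf{I}_D\right) d\mathbf{w}
  = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\; \beta^{-1}\mathbf{I}_N + \lambda^{-1}\mathbf{X}\mathbf{X}^{\top}\right)
```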
▪ Example: Gaussian mixture model, where 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)
▪ The log-likelihood will be
log 𝑝(𝒙𝑛|Θ) = log Σ_{k=1}^K 𝑝(𝒙𝑛, 𝒛𝑛 = 𝑘|Θ) = log Σ_{k=1}^K 𝑝(𝒛𝑛 = 𝑘|𝜙) 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, 𝜃) = log Σ_{k=1}^K 𝜋𝑘 𝒩(𝒙𝑛|𝜇𝑘, Σ𝑘)
▪ MLE on this objective won't give a closed form solution for the parameters
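A minimal sketch (assuming NumPy/SciPy; the data and parameter values are illustrative) of evaluating this GMM log-likelihood; the log of a sum, computed stably with logsumexp below, is exactly what blocks closed-form MLE updates.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, D, K = 5, 2, 3
X = rng.normal(size=(N, D))                 # illustrative data

# Illustrative GMM parameters Theta = (pi, mu, Sigma)
pi = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(K, D))
Sigma = np.stack([np.eye(D)] * K)

# log p(x_n | Theta) = log sum_k pi_k N(x_n | mu_k, Sigma_k), done stably via logsumexp
log_pk = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                   for k in range(K)], axis=1)      # shape (N, K)
log_lik = logsumexp(log_pk, axis=1).sum()           # total log-likelihood of the data
print(log_lik)
```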
Detour: MLE for Generative Classification
▪ Assume a 𝐾 class generative classification model with Gaussian class-conditionals
▪ Assume class 𝑘 = 1,2, … , 𝐾 is modeled by a Gaussian with mean 𝜇𝑘 and cov matrix Σ𝑘
▪ The labels 𝒛𝑛 (known) are one-hot vecs. Also, 𝑧𝑛𝑘 = 1 if 𝒛𝑛 = 𝑘, and 𝑧𝑛𝑘 = 0 otherwise
▪ Assuming class prior as 𝑝(𝒛𝑛 = 𝑘) = 𝜋𝑘, the model has params Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}_{k=1}^K
▪ Given training data {𝒙𝑛, 𝒛𝑛}_{n=1}^N, the MLE solution will be
𝜋̂𝑘 = (1/𝑁) Σ_{n=1}^N 𝑧𝑛𝑘, same as 𝑁𝑘/𝑁, where 𝑁𝑘 is the # of training ex. for which 𝒛𝑛 = 𝑘
𝜇̂𝑘 = (1/𝑁𝑘) Σ_{n=1}^N 𝑧𝑛𝑘 𝒙𝑛, same as (1/𝑁𝑘) Σ_{n: 𝒛𝑛 = 𝑘} 𝒙𝑛
Σ̂𝑘 = (1/𝑁𝑘) Σ_{n=1}^N 𝑧𝑛𝑘 (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤, same as (1/𝑁𝑘) Σ_{n: 𝒛𝑛 = 𝑘} (𝒙𝑛 − 𝜇̂𝑘)(𝒙𝑛 − 𝜇̂𝑘)⊤
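A minimal NumPy sketch of these closed-form updates (the function name and array layout are my own; X holds the inputs row-wise and Z the one-hot labels):

```python
import numpy as np

def gaussian_gen_classifier_mle(X, Z):
    """MLE for a generative classifier with Gaussian class-conditionals.
    X: (N, D) inputs; Z: (N, K) one-hot labels. (Names/layout are my own.)"""
    N, D = X.shape
    K = Z.shape[1]
    Nk = Z.sum(axis=0)                          # per-class counts N_k
    pi = Nk / N                                 # pi_k = N_k / N
    mu = (Z.T @ X) / Nk[:, None]                # mu_k = (1/N_k) sum_n z_nk x_n
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        Xc = X - mu[k]
        # Sigma_k = (1/N_k) sum_n z_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma[k] = (Z[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi, mu, Sigma
```

With the one-hot labels known, this is just per-class counting and averaging, which is why the MLE here is "easy".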
Detour: MLE for Generative Classification
▪ Here is a formal derivation of the MLE solution for Θ = {𝜋𝑘, 𝜇𝑘, Σ𝑘}_{k=1}^K
Θ̂ = argmax_Θ 𝑝(𝑿, 𝒁|Θ) = argmax_Θ Π_{n=1}^N 𝑝(𝒙𝑛, 𝒛𝑛|Θ)
  = argmax_Θ Π_{n=1}^N 𝑝(𝒛𝑛|Θ) 𝑝(𝒙𝑛|𝒛𝑛, Θ)      (multinoulli × Gaussian)
  = argmax_Θ Π_{n=1}^N [Π_{k=1}^K 𝜋𝑘^{𝑧𝑛𝑘}] [Π_{k=1}^K 𝑝(𝒙𝑛|𝒛𝑛 = 𝑘, Θ)^{𝑧𝑛𝑘}]
▪ In general, in models with probability distributions from the exponential family, the MLE problem will usually have a simple analytic form
[Plate notation: 𝜙 → 𝒛𝑛 → 𝒙𝑛, with 𝒛𝑛 and 𝒙𝑛 inside a plate over 𝑁 observations]
▪ The guess about 𝒛𝑛 can be in one of two forms
▪ A "hard" guess: a single best value 𝒛̂𝑛 (some "optimal" value of the random variable 𝒛𝑛)
▪ The "expected" value 𝔼[𝒛𝑛] of the random variable 𝒛𝑛
▪ Using the hard guess 𝒛̂𝑛 of 𝒛𝑛 will result in an ALT-OPT like algorithm
▪ Using the expected value of 𝒛𝑛 will give the so-called Expectation-Maximization (EM) algo (EM is pretty much like ALT-OPT but with soft/expected values of the latent variables)
EM for Gaussian Mixture Model (GMM)
▪ The EM algo for GMM requires 𝔼[𝑧𝑛𝑘] for the expectation of the CLL. Note 𝑧𝑛𝑘 ∈ {0,1}
𝔼[𝑧𝑛𝑘] = 𝛾𝑛𝑘 = 0 × 𝑝(𝑧𝑛𝑘 = 0|𝒙𝑛, Θ) + 1 × 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ) = 𝑝(𝑧𝑛𝑘 = 1|𝒙𝑛, Θ) ∝ 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘)
Need to normalize: 𝔼[𝑧𝑛𝑘] = 𝜋̂𝑘 𝒩(𝒙𝑛|𝜇̂𝑘, Σ̂𝑘) / Σ_{ℓ=1}^K 𝜋̂ℓ 𝒩(𝒙𝑛|𝜇̂ℓ, Σ̂ℓ)
M-step: maximize the expected CLL, which gives the same form of updates as the generative classification MLE, but with each 𝑧𝑛𝑘 replaced by its expectation 𝛾𝑛𝑘 (and 𝑁𝑘 = Σ_{n=1}^N 𝛾𝑛𝑘)
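A minimal NumPy sketch of one EM iteration for a GMM along these lines (function name and array layout are my own): the E step computes the responsibilities 𝛾𝑛𝑘 and the M step reuses the generative-classification MLE updates with 𝑧𝑛𝑘 replaced by 𝛾𝑛𝑘.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, pi, mu, Sigma):
    """One EM iteration for a GMM. X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, K = X.shape[0], pi.shape[0]

    # E step: responsibilities gamma_nk = E[z_nk], i.e. normalized pi_k * N(x_n | mu_k, Sigma_k)
    gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: the closed-form MLE updates with z_nk replaced by its expectation gamma_nk
    Nk = gamma.sum(axis=0)
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.zeros_like(Sigma)
    for k in range(K):
        Xc = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi_new, mu_new, Sigma_new, gamma
```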
EM: Some Final Comments
▪ The E and M steps may not always be possible to perform exactly. Some reasons:
▪ The conditional posterior of latent variables 𝑝(𝒁|𝑿, Θ) may not be easy to compute
▪ Will need to approximate 𝑝(𝒁|𝑿, Θ) using methods such as MCMC or variational inference
▪ Even if 𝑝(𝒁|𝑿, Θ) is easy, the expected CLL may not be easy to compute
▪ It can often be approximated by Monte-Carlo averaging using samples from the CP of 𝒁; this results in Monte-Carlo EM
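A tiny sketch (arbitrary made-up numbers, and a discrete 𝑧 for simplicity) of such a Monte-Carlo approximation of the expected CLL using samples from the CP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte-Carlo approximation of the expected CLL for one observation:
#   E_{p(z|x,Theta_old)}[log p(x, z | Theta)]  ~=  (1/S) * sum_s log p(x, z^(s) | Theta)
# with z^(s) drawn from the conditional posterior p(z | x, Theta_old).
K, S = 4, 100000
log_p_xz = np.log(rng.random(K) * 0.1)      # log p(x, z = k | Theta), arbitrary numbers
cp = rng.random(K)
cp /= cp.sum()                              # p(z | x, Theta_old), arbitrary

z_samples = rng.choice(K, size=S, p=cp)     # samples from the CP of z
mc_estimate = log_p_xz[z_samples].mean()    # Monte-Carlo estimate of the expected CLL
exact = (cp * log_p_xz).sum()               # exact expectation, for comparison
print(mc_estimate, exact)                   # should be close
```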