Variational Inference
Presenter: Shuai Zhang, CSE, UNSW
Content
1. Brief Introduction
2. Core Idea of VI; Optimization
3. Example: Bayesian Mixture of Gaussians
What is Variational Inference?
Variational Bayesian methods are a family of techniques for
approximating intractable integrals arising in Bayesian inference
and machine learning [Wiki].
It is widely used to approximate posterior densities for Bayesian
models as an alternative strategy to Markov chain Monte Carlo (MCMC)
sampling; it tends to be faster and easier to scale to large data.
It has been applied to problems such as document analysis,
computational neuroscience and computer vision.
Core Idea
Consider a general problem of Bayesian inference. Let the latent
variables in our problem be z and the observed data be x.
Inference in a Bayesian model amounts to conditioning on the data
and computing the posterior p(z | x).
Approximate Inference
The inference problem is to compute the conditional density of the
latent variables given the observations:
p(z | x) = p(z, x) / p(x).
The denominator is the marginal distribution of the data (the evidence),
obtained by marginalizing all the latent variables out of the joint
distribution p(x, z):
p(x) = ∫ p(z, x) dz.
For many models, this evidence integral is unavailable in closed
form or requires exponential time to compute. The evidence is
what we need to compute the conditional from the joint; this is
why inference in such models is hard.
MCMC
In MCMC, we first construct an ergodic Markov chain on z whose
stationary distribution is the posterior p(z | x).
Then, we sample from the chain to collect samples from the
stationary distribution.
Finally, we approximate the posterior with an empirical estimate
constructed from the collected samples.
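The original slides do not include code; as a minimal illustration of this recipe, the sketch below runs a random-walk Metropolis sampler (one simple MCMC algorithm) on a made-up conjugate Gaussian model and forms an empirical estimate of the posterior mean from the collected samples.

```python
import numpy as np

# Toy model (assumed for illustration): z ~ N(0, 1), x_i | z ~ N(z, 1).
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)          # observed data

def log_joint(z):
    # log p(z) + log p(x | z), up to constants that cancel in the acceptance ratio
    return -0.5 * z**2 - 0.5 * np.sum((x - z) ** 2)

# Random-walk Metropolis: the chain's stationary distribution is p(z | x).
z, samples = 0.0, []
for _ in range(5000):
    prop = z + rng.normal(0.0, 0.5)        # symmetric proposal
    if np.log(rng.uniform()) < log_joint(prop) - log_joint(z):
        z = prop                           # accept the move
    samples.append(z)

# Empirical estimate of the posterior mean from the collected samples.
print(np.mean(samples[1000:]))             # discard burn-in
```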
VI vs. MCMC
• Computational cost: MCMC is more computationally intensive; VI is less intensive.
• Guarantees: MCMC produces asymptotically exact samples from the target distribution; VI offers no such guarantee.
• Speed: MCMC is slower; VI is faster, especially for large data sets and complex distributions.
• Use case: MCMC is best for precise inference; VI is useful for exploring many scenarios quickly or for large data sets.
Core Idea
Rather than use sampling, the main idea behind variational
inference is to use optimization.
We restrict ourselves to a family of approximate distributions D over
the latent variables, and then try to find the member of that family
that minimizes the Kullback-Leibler (KL) divergence to the exact
posterior. This turns inference into an optimization problem:
q*(z) = argmin_{q(z) ∈ D} KL( q(z) || p(z | x) ).
The goal is to approximate p(z | x) with the resulting q(z).
Core Idea
This objective is not directly computable, because the KL divergence
involves the evidence log p(x), which is exactly the quantity we cannot compute:
KL( q(z) || p(z | x) ) = E_q[ log q(z) ] - E_q[ log p(z, x) ] + log p(x).
Because we cannot compute the KL itself, we optimize an alternative
objective that equals the negative KL up to the added constant log p(x).
We know this decomposition from our discussion of EM.
Core Idea
Thus, we have the objective function, the evidence lower bound (ELBO):
ELBO(q) = E_q[ log p(z, x) ] - E_q[ log q(z) ].
Maximizing the ELBO is equivalent to minimizing the KL divergence.
Intuition: we can rewrite the ELBO as the expected log likelihood of
the data minus the KL divergence between q(z) and the prior p(z):
ELBO(q) = E_q[ log p(x | z) ] - KL( q(z) || p(z) ).
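As an illustrative sketch (not from the slides), the ELBO for a Gaussian q(z) can be estimated by Monte Carlo; the toy model below (standard normal prior on z, unit-variance Gaussian likelihood) is assumed purely for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)            # observed data for the toy model

# Toy model (assumed): z ~ N(0, 1), x_i | z ~ N(z, 1); constants dropped.
def log_joint(z):
    return -0.5 * z**2 - 0.5 * np.sum((x - z) ** 2)

# Variational family: q(z) = N(m, s^2); ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)].
def elbo(m, s, num_samples=1000):
    z = m + s * rng.normal(size=num_samples)                 # samples from q
    log_q = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((z - m) / s) ** 2
    return np.mean([log_joint(zi) for zi in z]) - np.mean(log_q)

# The q closer to the true posterior (mean near the data mean, small variance)
# scores a higher ELBO than a poorly matched q.
print(elbo(0.0, 1.0), elbo(2.0, 0.2))
```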
Mean field approximation
Now that we have specified the variational objective function, the
ELBO, we need to specify the variational family of distributions
from which we pick the approximate variational distribution.
A common choice is the mean-field variational family. Here, the
latent variables are mutually independent and each is governed by a
distinct factor in the variational distribution:
q(z) = ∏_{j=1}^{m} q_j(z_j).
Coordinate ascent mean-field VI
Having specified our objective function and the variational family
of distributions from which to pick the approximation, we can now optimize.
Coordinate ascent variational inference (CAVI) maximizes the ELBO by
iteratively optimizing each factor of the mean-field variational
distribution while holding the others fixed. It does not, however,
guarantee finding the global optimum of the (generally non-convex) ELBO.
Coordinate ascent mean-field VI
Given that we fix all the other variational factors q_l(z_l) (l ≠ j),
the optimal q_j(z_j) is proportional to the exponentiated expected log
of the complete conditional:
q_j*(z_j) ∝ exp( E_{-j}[ log p(z_j | z_{-j}, x) ] ).
This is in turn proportional to the exponentiated expected log of the
joint, exp( E_{-j}[ log p(z_j, z_{-j}, x) ] ), because the complete
conditional differs from the joint only by a factor that does not depend
on z_j, and the expectation is taken with respect to the other factors,
which the mean-field family assumes to be mutually independent.
Coordinate ascent mean-field VI
Below, we rewrite the first term using iterated expectation, and for
the second term we retain only the factor that depends on z_j:
ELBO(q_j) = E_j[ E_{-j}[ log p(z_j, z_{-j}, x) ] ] - E_j[ log q_j(z_j) ] + const.
Writing A = E_{-j}[ log p(z_j, z_{-j}, x) ], the right-hand side is
(up to an additive constant) the negative KL divergence between q_j and
the distribution proportional to exp(A). Maximizing this expression is
therefore the same as minimizing that KL divergence.
This occurs when q_j(z_j) ∝ exp(A).
Coordinate ascent mean-field VI
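The original slide shows the CAVI algorithm as a figure. As a hedged sketch of that loop, the coordinate-ascent structure can be written generically in Python, with the per-factor update functions and the ELBO evaluator assumed to be supplied by the model.

```python
def cavi(update_fns, init_factors, elbo_fn, max_iter=100, tol=1e-6):
    """Generic coordinate ascent variational inference loop (illustrative sketch).

    update_fns  : dict mapping factor name -> function(factors) that returns the
                  optimal factor while all other factors are held fixed
    init_factors: dict of initial variational factors (e.g. parameter arrays)
    elbo_fn     : function(factors) returning the current ELBO, used only to
                  monitor convergence (the ELBO is non-decreasing under exact updates)
    """
    factors = dict(init_factors)
    prev_elbo = float("-inf")
    for _ in range(max_iter):
        for name, update in update_fns.items():
            factors[name] = update(factors)      # q_j <- argmax ELBO, others fixed
        cur_elbo = elbo_fn(factors)
        if cur_elbo - prev_elbo < tol:           # converged to a local optimum
            break
        prev_elbo = cur_elbo
    return factors
```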
Bayesian Mixture of Gaussians
The full hierarchical model (following reference [4], with K components,
unit observation variance, and prior variance σ²) is:
μ_k ~ N(0, σ²), k = 1, ..., K
c_i ~ Categorical(1/K, ..., 1/K), i = 1, ..., n
x_i | c_i, μ ~ N(c_i^T μ, 1), i = 1, ..., n.
The joint distribution factorizes as
p(μ, c, x) = p(μ) ∏_{i=1}^{n} p(c_i) p(x_i | c_i, μ).
Bayesian Mixture of Gaussians
The mean-field variational family contains approximate posterior
densities of the form
q(μ, c) = ∏_{k=1}^{K} q(μ_k; m_k, s_k²) ∏_{i=1}^{n} q(c_i; φ_i),
where q(μ_k; m_k, s_k²) is a Gaussian factor on the k-th mixture
component and q(c_i; φ_i) is a categorical factor on the i-th cluster assignment.
Bayesian Mixture of Gaussians
We first derive the ELBO as a function of the variational factors
(m, s², φ), which CAVI then maximizes.
Bayesian Mixture of Gaussians
Next, we derive the CAVI updates for the variational factors: the
assignment probabilities φ_i and the Gaussian parameters (m_k, s_k²).
A sketch of the resulting algorithm follows.
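As a concrete sketch of these updates (following the closed-form updates in reference [4], assuming unit observation variance; the one-dimensional data generation below is made up for illustration), a minimal CAVI implementation for the Bayesian mixture of Gaussians might look like this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a 3-component 1-D mixture (assumed for illustration).
true_means = np.array([-4.0, 0.0, 5.0])
n, K, sigma2 = 300, 3, 10.0                       # sigma2: prior variance of mu_k
x = rng.normal(true_means[rng.integers(K, size=n)], 1.0)

# Variational parameters: phi (n x K) for q(c_i), (m, s2) for q(mu_k).
phi = rng.dirichlet(np.ones(K), size=n)
m = rng.normal(0.0, 1.0, size=K)
s2 = np.ones(K)

def elbo():
    # ELBO up to additive constants; used only to monitor convergence.
    t1 = -np.sum(m**2 + s2) / (2 * sigma2)                    # E[log p(mu)]
    t2 = np.sum(phi * (np.outer(x, m) - 0.5 * (m**2 + s2)))   # E[log p(x | c, mu)]
    t3 = 0.5 * np.sum(np.log(s2))                             # entropy of q(mu)
    t4 = -np.sum(phi * np.log(phi + 1e-12))                   # entropy of q(c)
    return t1 + t2 + t3 + t4

prev = -np.inf
for _ in range(200):
    # Update q(c_i): phi_{ik} proportional to exp(E[mu_k] x_i - E[mu_k^2] / 2).
    logits = np.outer(x, m) - 0.5 * (m**2 + s2)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    phi = np.exp(logits)
    phi /= phi.sum(axis=1, keepdims=True)

    # Update q(mu_k): Gaussian with closed-form mean and variance.
    nk = phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / (1.0 / sigma2 + nk)
    s2 = 1.0 / (1.0 / sigma2 + nk)

    cur = elbo()
    if cur - prev < 1e-8:                                      # ELBO has converged
        break
    prev = cur

print("estimated component means:", np.sort(m))
```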
References
1. https://am207.github.io/2017/wiki/VI.html
2. https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
3. https://www.cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture13.pdf
4. https://arxiv.org/pdf/1601.00670.pdf
Week Report
• Last week
• Metric Factorization model
• Learning Group
• This week
• Submit the ICDE paper