
More details on variational methods

Variational methods used to be complicated. After writing down the standard KL-divergence
objective (previous note), researchers would try to derive clever fixed-point update equations
to optimize it. For some models, including simple ones like logistic regression, this strategy
didn’t work out. Special-case variational objectives would be crafted for particular models.
As a result, most text-book treatments of the applications of variational methods are fairly
complicated, and beyond what’s required for this course.
Fortunately the stochastic variational inference (SVI) methods developed in the last few
years are simpler to understand, more general, and scale to enormous datasets. This note
will outline enough of the idea to explain the demonstration code on the website. The
demonstration is applied to logistic regression, but in principle its log-likelihood and
gradients could be replaced with any other model with real-valued parameters.
As a reminder, we wish to minimize

J(m, V) = E_q[log q(w)] − E_q[log p(D | w)] − E_q[log p(w)],  (1)

with respect to our variational parameters {m, V}, the mean and covariance of our Gaussian
approximate posterior. For any {m, V} we have a lower bound on the log-marginal likelihood,
or the model evidence¹:

log p(D) ≥ −J(m, V).  (2)
We would like to maximize the model likelihood p(D) with respect to any hyperparameters,
such as the prior variance on the weights, σ_w². We can't do that exactly, but we can instead
minimize J with respect to these parameters. So, we will jointly minimize J with respect to
the variational distribution and model hyperparameters, aiming for a tight bound² and a
large model likelihood.

1 Unconstrained optimization
Stochastic gradient descent on parameters V and σ_w² will sometimes propose negative
variances, or covariance matrices that aren't positive definite. Instead we should optimize
unconstrained quantities, such as log σ_w.

To optimize a covariance matrix, we can first write it in terms of its Cholesky decomposition:
V = LL^⊤. The diagonal elements of L are positive³, the other elements are unconstrained.
We take the log of the diagonal elements of the Cholesky factor, leave the other elements
as they are, and optimize that unconstrained matrix.
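To make this concrete, here is a minimal NumPy sketch (not the course demonstration code) of one possible mapping between a Cholesky factor and an unconstrained vector. The layout of the vector, log-diagonal entries first and then the strictly-lower-triangular entries, is an arbitrary choice for illustration.

import numpy as np

def cholesky_to_unconstrained(L):
    # Take logs of the (positive) diagonal; copy the strictly-lower-triangular
    # elements across unchanged.
    D = L.shape[0]
    below = np.tril_indices(D, k=-1)
    return np.concatenate([np.log(np.diag(L)), L[below]])

def unconstrained_to_cholesky(theta, D):
    # Inverse map: exponentiating the first D entries keeps the diagonal positive,
    # so V = L @ L.T is a valid covariance matrix for any real-valued theta.
    L = np.zeros((D, D))
    L[np.diag_indices(D)] = np.exp(theta[:D])
    L[np.tril_indices(D, k=-1)] = theta[D:]
    return L

Gradient-based optimization is then run on theta (and, say, log σ_w), and every setting of these unconstrained quantities corresponds to a legal covariance and prior variance.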
[The website version of this note has a question here.]

2 The entropy terms


The negative entropy term E_q[log q(w)] and cross-entropy term E_q[log p(w)] can both be
computed in closed form, if the prior is Gaussian. Substituting in the definition of a general
multivariate Gaussian we get:

E_{N(w; m,V)}[log N(w; µ, Σ)] = E_{N(w; m,V)}[ −½ (w − µ)^⊤ Σ^{−1} (w − µ) − ½ log |2πΣ| ].  (3)

1. Hence −J, a lower bound on the log of the model evidence, is often called the “Evidence Lower BOund” or ELBO.


2. The bound can’t actually be tight, and give the exact marginal likelihood, unless the true posterior is Gaussian.
3. The diagonal elements could be negative in the same sense that we could set σ_w negative and get a positive σ_w².
However, optimizing the Cholesky decomposition L directly, allowing negative diagonal values, is not a good idea:
the gradients become extreme when a diagonal element approaches zero.



The remaining expectation is of a quadratic form, and so can be expressed in terms of m
and V. For the negative entropy term, things simplify considerably:

E_{N(w; m,V)}[log N(w; m, V)] = −D/2 − ½ log |2πV|,  (4)

where as usual, D is the number of parameters in w. For a spherical Gaussian prior, the
cross entropy term turns out to be:

−E_{N(w; m,V)}[log N(w; 0, σ_w² I)] = (Trace(V) + m^⊤m) / (2σ_w²) + (D/2) log(2πσ_w²).  (5)

We can evaluate both entropy terms, and they are both differentiable. The terms involving V
are simple functions of the Cholesky decomposition: ½ log |V| = ∑_i log L_ii, and
Trace(V) = ∑_ij L_ij², making it easy to get the gradients we need.
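As an illustration, here is a small NumPy sketch (not the course demo code) of these closed-form terms, written directly in terms of m and the Cholesky factor L, assuming V = LL^⊤:

import numpy as np

def negative_entropy(L):
    # E_q[log q(w)] for q = N(m, V) with V = L @ L.T; Equation (4).
    # Note that it does not depend on the mean m.
    D = L.shape[0]
    half_logdet_V = np.sum(np.log(np.diag(L)))   # (1/2) log|V| = sum_i log L_ii
    return -0.5 * D - 0.5 * D * np.log(2 * np.pi) - half_logdet_V

def cross_entropy_term(m, L, sigma_w):
    # -E_q[log p(w)] for a spherical Gaussian prior N(0, sigma_w^2 I); Equation (5).
    D = L.shape[0]
    trace_V = np.sum(L ** 2)                     # Trace(V) = sum_ij L_ij^2
    return (trace_V + m @ m) / (2 * sigma_w ** 2) \
           + 0.5 * D * np.log(2 * np.pi * sigma_w ** 2)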

3 The log-likelihood term


The final term, the average negative log-likelihood under the variational posterior,

−E_{N(w; m,V)}[log p(D | w)] = −E_{N(w; m,V)}[ ∑_{n=1}^N log p(y^(n) | x^(n), w) ]  (6)

cannot be computed in closed form. We could convert the integral of each term into a 1D
integral and compute it numerically. However, that is expensive.
We only need unbiased estimates of the gradients to perform stochastic gradient descent.
We can get a simple “Monte Carlo” unbiased estimate by sampling a random weight
from the variational posterior:

−E_{N(w; m,V)}[log p(D | w)] ≈ −∑_{n=1}^N log p(y^(n) | x^(n), w),   w ∼ N(m, V).  (7)

We can also replace the sum by scaling up the contribution of a random example, and still
get an unbiased estimate:

− EN (w; m,V ) [log p(D | w)] ≈ − N log p(y(n) | x(n) , w),


w ∼ N (m, V ), n ∼ Uniform{1..N }.
(8)
Alternatively we could take the average log-likelihood for a random minibatch of M exam-
ples, and scale it up by N. The demonstration code just uses all of the data in each update,
as the number of datapoints was small.
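For concreteness, here is a sketch of these single-sample estimators, Equations (7) and (8), in NumPy. It is not the course demo code, and neg_log_lik(w, X, y) is an assumed helper that returns −∑ log p(y | x, w) over the rows it is given (for logistic regression, the usual negative log-likelihood).

import numpy as np

def neg_log_lik_estimate(m, L, neg_log_lik, X, y, rng, batch_size=None):
    # Draw one weight vector from the variational posterior N(m, V), V = L @ L.T.
    nu = rng.standard_normal(m.size)
    w = m + L @ nu
    N = X.shape[0]
    if batch_size is None:
        # Equation (7): use every datapoint, as the demonstration code does.
        return neg_log_lik(w, X, y)
    # Minibatch variant of Equation (8): scale a random subset up by N / batch_size
    # so the estimate of -E_q[log p(D|w)] stays unbiased.
    idx = rng.integers(N, size=batch_size)
    return (N / batch_size) * neg_log_lik(w, X[idx], y[idx])

Here rng would be a NumPy generator such as np.random.default_rng(0).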

4 Gradients of the log-likelihood term


We could obtain gradients of the log-likelihood term by differentiating the expectation
symbolically, and then finding a Monte Carlo approximation of the result. However, it's
easier, and lower-variance, to use a “reparameterization trick”.
The standard way to sample a random weight w from the variational posterior is to sample
a vector of standard normals ν ∼ N (0, I) and transform it: w = m + Lν. Similarly, the
expectation can be rewritten:

E_{N(w; m,V)}[f(w)] = E_{N(ν; 0,I)}[f(m + Lν)].  (9)

The expectation is now under a constant distribution, so it’s easy to write down derivatives
with respect to the variational parameters:

∇_m E_{N(w; m,V)}[f(w)] = E_{N(ν; 0,I)}[∇_m f(m + Lν)]  (10)
                        ≈ ∇_m f(m + Lν),   ν ∼ N(0, I),  (11)



and

∇_L E_{N(w; m,V)}[f(w)] ≈ ∇_L f(m + Lν),   ν ∼ N(0, I)  (12)
                        = [∇_w f(w)] ν^⊤,   w = m + Lν,  ν ∼ N(0, I).  (13)

To estimate both gradients, we just need to be able to evaluate gradients of the log-likelihood
function with respect to the weights, which we already know how to do if we can do
maximum likelihood fitting.
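As a sketch (again, not the course demo code), a single-sample estimate of both gradients only needs a function grad_f(w), assumed to return ∇_w f(w), for example the gradient of the (scaled) negative log-likelihood with respect to the weights:

import numpy as np

def reparameterized_gradients(m, L, grad_f, rng):
    # Sample nu ~ N(0, I) and transform: w = m + L nu, as in Equation (9).
    nu = rng.standard_normal(m.size)
    w = m + L @ nu
    g = grad_f(w)                      # nabla_w f at the sampled weights
    grad_m = g                         # Equation (11): nabla_m f(m + L nu) = nabla_w f(w)
    grad_L = np.tril(np.outer(g, nu))  # Equation (13); keep only the lower triangle,
                                       # because L is lower-triangular
    return grad_m, grad_L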
[The website version of this note has a question here.]

5 Black box variational inference


Unbiased estimates of gradients of “the ELBO”, the variational lower bound on the marginal
likelihood, don’t require any new derivations for new models. Tricks were required to
make the parameterization of the Gaussian unconstrained, and to derive the working above.
However, all of that work only needs doing once. In the demonstration code these details are
put into one standard function. This routine can be used as a “black box” — like a generic
optimizer. We pass the routine a function that evaluates our log likelihood and its gradients,
and we can perform variational inference in the new model.
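To show the shape of such a routine, here is a minimal sketch in the spirit of, but not identical to, the demonstration code. To keep it short, the approximate posterior is a diagonal Gaussian N(m, diag(s²)); a full-covariance version would use the Cholesky parameterization from Section 1. neg_log_lik(w) is an assumed user-supplied function, written with autograd.numpy operations, that returns −log p(D | w).

import autograd.numpy as np
from autograd import grad
from numpy.random import default_rng

def black_box_svi(neg_log_lik, D, sigma_w=1.0, n_iters=2000, step_size=0.01, seed=0):
    rng = default_rng(seed)
    params = np.zeros(2 * D)                # unconstrained parameters: [m, log s]

    def neg_elbo(params, nu):
        m, log_s = params[:D], params[D:]
        s = np.exp(log_s)
        w = m + s * nu                      # reparameterized sample, Equation (9)
        # Closed-form entropy terms for a diagonal Gaussian and spherical prior:
        neg_entropy = -np.sum(log_s) - 0.5 * D * (1.0 + np.log(2 * np.pi))
        cross_entropy = (np.sum(s ** 2) + np.dot(m, m)) / (2 * sigma_w ** 2) \
                        + 0.5 * D * np.log(2 * np.pi * sigma_w ** 2)
        # Single-sample Monte Carlo estimate of J, Equation (1):
        return neg_entropy + cross_entropy + neg_log_lik(w)

    grad_neg_elbo = grad(neg_elbo)          # autograd supplies all the gradients
    for _ in range(n_iters):
        nu = rng.standard_normal(D)         # fresh noise at every step
        params = params - step_size * grad_neg_elbo(params, nu)
    return params[:D], np.exp(params[D:])   # posterior mean and standard deviations

The prior variance could also be optimized by including log σ_w among the unconstrained parameters, as discussed earlier in this note.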

6 Check your understanding


If you want to reproduce all of the derivations in this note, you might find the trace trick
from the maths cribsheet in the background materials useful.⁴

7 References
Shakir Mohamed has recent tutorial slides on variational inference. The final slide has a
reading list of both the classic and modern variational inference papers that developed the
theory in this note.
Black-box stochastic variational inference in five lines of Python, David Duvenaud, Ryan
P. Adams, NeurIPS Workshop on Black-box Learning and Inference, 2015. The associated
Python code and neural net demo require autograd.

4. As an example, here is how to find the following expectation over a D-dimensional vector z:

E_{N(z; 0,V)}[log N(z; 0, V)].  (14)

Using standard manipulations, including the “trace trick”:

E_{N(z; 0,V)}[log N(z; 0, V)] + ½ log |2πV| = E_{N(z; 0,V)}[ −½ Tr(z^⊤ V^{−1} z) ]  (15)
                                            = E_{N(z; 0,V)}[ −½ Tr(z z^⊤ V^{−1}) ]  (16)
                                            = −½ Tr( E_{N(z; 0,V)}[z z^⊤] V^{−1} )  (17)
                                            = −½ Tr(V V^{−1}) = −½ Tr(I_D) = −D/2.  (18)

MLPR:w11c Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
