Machine Learning and Pattern Recognition - Variational - Details
Variational methods used to be complicated. After writing down the standard KL-divergence
objective (previous note), researchers would try to derive clever fixed-point update equations
to optimize it. For some models, including simple ones like logistic regression, this strategy
didn’t work out. Special-case variational objectives would be crafted for particular models.
As a result, most text-book treatments of the applications of variational methods are fairly
complicated, and beyond what’s required for this course.
Fortunately the stochastic variational inference (SVI) methods developed in the last few
years are simpler to understand, more general, and scale to enormous datasets. This note
will outline enough of the idea to explain the demonstration code on the website. The
demonstration is applied to logistic regression, but in principle its log-likelihood and
gradients could be replaced with any other model with real-valued parameters.
As a reminder, we wish to minimize

    J(m, V) = E_{N(w; m,V)}[ log N(w; m, V) − log p(w) − log p(D | w) ],    (1)

with respect to our variational parameters {m, V}, the mean and covariance of our Gaussian
approximate posterior. For any {m, V} we have a lower bound on the log-marginal likelihood,
or the model evidence:

    log p(D) ≥ −J(m, V).    (2)
We would like to maximize the model likelihood p(D) with respect to any hyperparameters,
such as the prior variance on the weights, σ_w^2. We can’t do that exactly, but we can instead
minimize J with respect to these parameters. So, we will jointly minimize J with respect to
the variational distribution and the model hyperparameters, aiming for a tight bound and a
large model likelihood.
1 Unconstrained optimization
Stochastic gradient descent on the parameters V and σ_w^2 will sometimes propose negative
variances, or covariance matrices that aren’t positive definite. Instead we should optimize
unconstrained quantities, such as log σ_w.
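For example, a minimal sketch of this idea (the variable names are illustrative, not from the demonstration code):

    import numpy as np

    log_sigma_w = 0.0                       # unconstrained; any real value is valid
    sigma_w2 = np.exp(2.0 * log_sigma_w)    # implied prior variance, always positive
    # A gradient step on log_sigma_w can never produce a negative variance,
    # whereas a step directly on sigma_w2 could.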
To optimize a covariance matrix, we can first write it in terms of its Cholesky decomposition:
V = LL^⊤. The diagonal elements of L are positive, and the other elements are unconstrained.
We take the log of the diagonal elements of L, leave the off-diagonal elements as they are, and
optimize the resulting unconstrained matrix.
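A sketch of this unconstrained parameterization in NumPy (the function names are illustrative, not from the demonstration code):

    import numpy as np

    def covariance_from_unconstrained(U):
        # Map an unconstrained square matrix U to a positive-definite covariance:
        # keep the strict lower triangle of U and exponentiate its diagonal, giving
        # a Cholesky factor L with positive diagonal, then return V = L L^T.
        L = np.tril(U, k=-1) + np.diag(np.exp(np.diag(U)))
        return L @ L.T

    def unconstrained_from_covariance(V):
        # Inverse map, useful for initializing the unconstrained matrix.
        L = np.linalg.cholesky(V)
        return np.tril(L, k=-1) + np.diag(np.log(np.diag(L)))

    # Any real-valued matrix U gives a valid covariance:
    U = np.random.default_rng(0).standard_normal((3, 3))
    V = covariance_from_unconstrained(U)
    assert np.all(np.linalg.eigvalsh(V) > 0)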
[The website version of this note has a question here.]
The expected log-likelihood term of the cost, E_{N(w; m,V)}[log p(D | w)], cannot be computed
in closed form for models like logistic regression. We could convert the integral for each
likelihood term into a 1D integral and compute it numerically. However, that is expensive.
We only need unbiased estimates of the gradients to perform stochastic gradient descent.
We can get a simple “Monte Carlo” unbiased estimate by sampling a random weight vector
from the variational posterior:
    −E_{N(w; m,V)}[log p(D | w)] ≈ −∑_{n=1}^{N} log p(y^(n) | x^(n), w),    w ∼ N(m, V).    (7)
We can also replace the sum by scaling up the contribution of a single random example, and
still get an unbiased estimate:

    −E_{N(w; m,V)}[log p(D | w)] ≈ −N log p(y^(n) | x^(n), w),    n ∼ Uniform{1, …, N},    w ∼ N(m, V).
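A sketch of both estimators for logistic regression, assuming labels in {−1, +1} (the function names are illustrative, not from the demonstration code):

    import numpy as np

    def log_lik(w, X, y):
        # Logistic-regression log-likelihood, log p(y | X, w), with labels y in {-1, +1}.
        return -np.sum(np.log1p(np.exp(-y * (X @ w))))

    def nll_estimate(m, L, X, y, rng):
        # One-sample unbiased estimate of -E_{N(w; m, V)}[log p(D | w)], with V = L L^T.
        w = m + L @ rng.standard_normal(m.shape)    # w ~ N(m, V)
        return -log_lik(w, X, y)

    def nll_estimate_one_example(m, L, X, y, rng):
        # Doubly-stochastic version: one sampled weight vector and one random example,
        # whose contribution is scaled up by N to keep the estimate unbiased.
        N = X.shape[0]
        n = rng.integers(N)
        w = m + L @ rng.standard_normal(m.shape)
        return -N * log_lik(w, X[n:n+1], y[n:n+1])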
Rewriting the weights with the reparameterization w = m + Lν, where ν ∼ N(0, I) and V = LL^⊤,
the expectation is now under a constant distribution, so it’s easy to write down derivatives
with respect to the variational parameters:

    ∇_m E_{N(ν; 0,I)}[log p(D | m + Lν)] ≈ ∇_w log p(D | w),
    ∇_L E_{N(ν; 0,I)}[log p(D | m + Lν)] ≈ [∇_w log p(D | w)] ν^⊤,    with w = m + Lν, ν ∼ N(0, I).
To estimate both gradients, we just need to be able to evaluate gradients of the log-likelihood
function with respect to the weights, which we already know how to do if we can do
maximum likelihood fitting.
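A sketch of these one-sample gradient estimates for logistic regression (again with illustrative names; an autodiff tool such as autograd could compute the same weight gradients automatically):

    import numpy as np

    def grad_log_lik(w, X, y):
        # Gradient of the logistic-regression log-likelihood with respect to the
        # weights, for labels y in {-1, +1}: sum_n y_n x_n sigma(-y_n x_n^T w).
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        return X.T @ (y * s)

    def grad_nll_estimate(m, L, X, y, rng):
        # One-sample unbiased estimates of the gradients of -E_{N(w; m, V)}[log p(D | w)]
        # with respect to the mean m and the Cholesky factor L, where V = L L^T.
        nu = rng.standard_normal(m.shape)
        w = m + L @ nu                  # reparameterization: w ~ N(m, V)
        g = grad_log_lik(w, X, y)       # gradient with respect to the weights
        grad_m = -g
        grad_L = -np.outer(g, nu)       # only the lower triangle is used, as L is lower triangular
        return grad_m, grad_L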
[The website version of this note has a question here.]
7 References
Shakir Mohamed has recent tutorial slides on variational inference. The final slide has a
reading list of both the classic and modern variational inference papers that discovered the
theory in this note.
Black-box stochastic variational inference in five lines of Python, David Duvenaud, Ryan
P. Adams, NeurIPS Workshop on Black-box Learning and Inference, 2015. The associated
Python code and neural net demo require autograd.
4. As an example, here is how to find the following expectation over a D-dimensional vector z:

    E_{N(z; 0,V)}[log N(z; 0, V)].    (14)

Using standard manipulations, including the “trace trick”:

    E_{N(z; 0,V)}[log N(z; 0, V)] + (1/2) log |2πV| = E_{N(z; 0,V)}[ −(1/2) Tr(z^⊤ V^{−1} z) ]    (15)
        = E_{N(z; 0,V)}[ −(1/2) Tr(z z^⊤ V^{−1}) ]    (16)
        = −(1/2) Tr( E_{N(z; 0,V)}[z z^⊤] V^{−1} )    (17)
        = −(1/2) Tr(V V^{−1}) = −(1/2) Tr(I_D) = −D/2.    (18)
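The result is easy to check numerically; a quick Monte Carlo sketch (not part of the note’s derivation):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 3
    A = rng.standard_normal((D, D))
    V = A @ A.T + D * np.eye(D)                                 # a positive-definite covariance
    Z = rng.multivariate_normal(np.zeros(D), V, size=200_000)   # samples z ~ N(0, V)

    sign, logdet = np.linalg.slogdet(2 * np.pi * V)
    log_pdf = -0.5 * logdet - 0.5 * np.einsum('ni,ij,nj->n', Z, np.linalg.inv(V), Z)

    print(np.mean(log_pdf))          # Monte Carlo estimate of E[log N(z; 0, V)]
    print(-0.5 * logdet - D / 2)     # closed form from (18): -(1/2) log|2piV| - D/2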