Advances in Bayesian Machine Learning: From Uncertainty to Decision Making
Chao Ma
I hereby declare that except where specific reference is made to the work of others, the
contents of this dissertation are original and have not been submitted in whole or in part
for consideration for any other degree or qualification in this, or any other university. This
dissertation is my own work and contains nothing which is the outcome of work done in
collaboration with others, except as specified in the text and Acknowledgements. This
dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes,
tables and equations and has fewer than 150 figures.
Chao Ma
December 2021
Advances in Bayesian Machine Learning:
From Uncertainty to Decision Making
Chao Ma
Bayesian uncertainty quantification is a key element of many machine learning applications. To this end, approximate inference algorithms [176] have been developed to perform inference at a relatively low cost. Despite recent advances in scaling approximate inference to “big model × big data” regimes, many open challenges remain. For instance, how do we properly quantify parameter uncertainties for complicated, non-identifiable models (such as neural networks)? How do we properly handle the uncertainties caused by missing data, and perform learning and inference in a scalable way? Furthermore, how do we optimally collect new information, so that missing data uncertainties can be further reduced, and better decisions can be made?
In this work, we propose new research directions and new technical contributions towards these research questions. This thesis is organized in two parts (theme A and theme B). In theme A, we consider quantifying model uncertainty in the supervised learning setting. To sidestep some of the difficulties of parameter-space inference, we propose a new research direction called function space approximate inference. That is, by treating supervised probabilistic models as stochastic processes (measures over functions), we can approximate the true posterior over predictive functions by another class of (simpler) stochastic processes. We provide two different methodologies for function space inference and demonstrate that they return better uncertainty estimates, as well as improved empirical performance on complicated models.
In theme B, we consider the quantification of missing data uncertainty in the unsupervised learning setting. We propose a new approach for quantifying missing data uncertainty based on deep generative models. It allows us to sidestep the computational burden of traditional methods, and perform accurate and scalable missing data imputation.
Furthermore, by utilizing the uncertainty estimates returned by the generative models, we
propose an information-theoretic framework for efficient, scalable, and personalized active
information acquisition. This allows us to maximally reduce missing data uncertainty, and
make improved decisions with new information.
This thesis is dedicated to my loving family and friends.
Acknowledgements
Without the help of many, this thesis would have been impossible to finish. I wish to express my deepest gratitude to my supervisor, Professor José Miguel Hernández-Lobato. In many ways, Miguel is the best supervisor that I could ever imagine. He is such a knowledgeable, humble, patient, and lovely person to work with, and I cannot thank him enough for recognizing my potential and enthusiasm and taking me on as one of his first PhD students. He encouraged me to explore original research ideas out of pure curiosity, which led to many unexpected successes. Thanks to Miguel, these four years became one of the most memorable experiences of my life, and being his student has been a great privilege.
I especially thank Yingzhen Li (Imperial College) and Cheng Zhang (Microsoft Research). Yingzhen’s deep insights and enthusiasm inspired me to pursue PhD research in this field, despite all the difficulties that I had to overcome. She has been a fantastic mentor during my PhD, who always had time for the research conversations that were necessary for pushing my PhD forward. Cheng Zhang deserves special thanks for hosting me at Microsoft Research Cambridge, and for providing me with the fantastic opportunity to conduct novel research in such a vibrant and exciting environment. She is a great person to work with, and constantly encouraged me to step out of my comfort zone. It has been my absolute pleasure to participate in project Azua (now Causica) since its early days and witness how our theoretical research makes a difference in real-life applications.
My amazing collaborators and fellow PhD students have been a rich source of inspiration. In particular, I would like to thank my colleagues from both the CBL group at the University of Cambridge and the Causica group at Microsoft Research Cambridge. It has been a great pleasure to be part of those intelligent research forces and learn from the like-minded Bayesian cool kids in town. I am grateful to Prof. Carl Henrik Ek and Prof. Jes Frellsen, who kindly agreed to serve as my viva examiners and provided honest feedback for polishing this thesis.
Finally, the support of my family and friends has always been the pillar of my PhD. I dedicate this thesis to my parents and my maternal grandparents, for everything that I have accomplished.
Contents
1 Introduction 1
1.1 Uncertainty: the only consequence of rationality . . . . . . . . . . . . . . . 2
1.2 Bayesian expected-utility maximization . . . . . . . . . . . . . . . . . . . 3
1.3 Challenges: model uncertainty and missing data
uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Challenge I: accurate and scalable approximate inference for
supervised learning models . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Challenge II: Unsupervised learning and inference under the
presence of missing data . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 Challenge III: Replicating human expert’s ability to collect
high-value information . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Outline of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 List of publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Bibliography 223
List of Figures
2.1 The relationships among log p(D), L(q), and DKL [qλ (θθ )||p(θθ |D)] in vari-
ational inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Chapter 3 preview: parameter space inference versus function space in-
ference. Left: parameter space VI minimizes KL divergence over the pa-
rameters θ = {θ1 , θ2 }, which suffers from overparameterization and model
unidentifiability. Right: on the contrary, function space inference directly
performs Bayesian inference in the space of functions, and minimizes KL di-
vergence over functions. This is equivalent to perform inference on minimal
sufficient parameters, which forms an identifiable reparameterization of the
regression model, and resolves the pathologies of posterior inconsistency of
parameter space inference. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Examples of IPs: (a) Neural samplers; (b) Warped GPs; (c) Bayesian neural
networks; (d) Bayesian RNNs. . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 First row: Predictions returned from VIP (left), VDO (middle) and exact GP
with RBF + Periodic kernel (right), respectively. Dark grey dots: noisy observa-
tions; dark line: clean ground truth function; dark grey line: predictive means;
grey shaded area: confidence intervals with 2 standard deviations. Second row:
Corresponding predictive uncertainties. . . . . . . . . . . . . . . . . . . . . . 71
4.3 Test performance on synthetic example (left two) and solar irradiance inter-
polation (right two) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Test performance on clean energy dataset . . . . . . . . . . . . . . . . . . 72
4.5 Interpolations returned by VIP (top), variational dropout (middle), and exact
GP (bottom), respectively. SVGP visualization is omitted as it looks nearly
the same. Here grey dots: training data, red dots: test data, dark dots:
predictive means, light grey and dark grey areas: Confidence intervals
with 2 standard deviations of the training and test set, respectively. Note that
our GP/SVGP predictions reproduce [78]. . . . . . . . . . . . . . . . . . . 73
5.8 The posterior samples from VIPs with different numbers of basis functions.
As more basis functions are used, the posterior samples from VIP become
more and more noisy, finally converging to GP-like behaviour when 500
basis functions are used. Compared to the ground truth estimate from Figure
5.6, VIP clearly under-estimates the predictive uncertainties in-between the
training samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.9 The posterior samples from FVI with different numbers of basis functions.
FVI is still able to learn the piecewise linear behaviour from the prior as
more basis functions are used. As the number of basis functions is increased
to 500, FVI converges to a solution that is much closer to the ground truth
(compared with VIP), and is still able to exhibit non-Gaussian behaviours
from the prior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.10 A regression task on a synthetic dataset (red crosses) reproduced from
[70]. We plot the predictive mean and uncertainties for each algorithm. This
task is used to demonstrate the theoretical finding on the pathologies of
weight-space VI for single-layer BNNs: there is no setting of the variational
parameters that can model the in-between uncertainty between two data
clusters. Functional BNNs [330] also have this problem, since BNNs are
used as part of the model. On the contrary, our functional VI method can
produce sensible in-between uncertainties for out-of-distribution data. See
Appendix 5.C.2 for more details. . . . . . . . . . . . . . . . . . . . . . . . 133
5.11 CPU time comparison, FVI vs f-BNN on implicit priors. Although f-BNNs
are only trained for 100 epochs, their running time is still 100x slower than FVI. 134
5.12 F-BNN on structured implicit priors, trained with 10k epochs . . . . . . . . 135
5.13 Histograms of predictive entropies on CIFAR10/SVHN OOD detection.
Left: MFVI. Right: FVI. . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 Information reward estimated during the first 4 active variable selection steps
on a randomly chosen Boston Housing test data point. Model: PNP, strategy:
EDDI. Each row contains two plots regarding the same time step. Bar plots
on the left show the information reward estimation of each variable on the
y-axis. All unobserved variables start with green bars, and turn purple
once selected by the algorithm. Right: violin plot of the posterior density
estimations of remaining unobserved variables. . . . . . . . . . . . . . . . 160
7.5 Information curves of active variable selection, demonstrated on three UCI
datasets (based on PNP parameterization of Partial VAE). This displays
negative test RMSE (y axis, the lower the better) during the course of active
selection (x-axis). Error bars represent standard errors over 10 runs. . . . . 162
7.6 First four decision steps on Boston Housing test data. EDDI is “personalized”
compared to SING. Full names of the variables are listed in Appendix 7.B.2. . 164
7.7 Comparison of DRAL [174] and EDDI on the Boston Housing dataset. EDDI
outperforms DRAL significantly in test RMSE at every step. . . . . . . . . 164
7.8 Information curves of active variable selection on risk assessment task on
MIMIC III, produced with PNP setting. . . . . . . . . . . . . . . . . . . . 165
7.9 Information curves of active (grouped) variable selection on risk assessment
task on NHANES, produced with PNP setting. . . . . . . . . . . . . . . . 165
7.10 Random images generated using (a) naive zero imputing, (b) zero imputing
with mask, (c) PN and (d) PNP, respectively. . . . . . . . . . . . . . . . . 171
7.11 Information curves (based on RMSE) of active variable selection for the
three UCI datasets and the three approaches, i.e. (First row) PointNet (PN),
(Second row) Zero Imputing (ZI), and (Third row) Zero Imputing with
mask (ZI-m). Green: random strategy; Black: EDDI; Pink: Single best
ordering. This displays RMSE (y axis, the lower the better) during the
course of active selection (x-axis). . . . . . . . . . . . . . . . . . . . . . . 173
7.12 Information curves (based on test negative log-likelihood) of active variable
selection for the three UCI datasets and the three approaches, i.e. (First
row) PointNet (PN), (Second row) Zero Imputing (ZI), and (Third row)
Zero Imputing with mask (ZI-m). Green: random strategy; Black: EDDI;
Pink: Single best ordering. This displays negative test log likelihood (y
axis, the lower the better) during the course of active selection (x-axis). . . 174
7.13 Information curves of active variable selection for the three UCI datasets and
PNP-Partial VAE. Black: EDDI; Blue: Single best ordering. This displays
test RMSE (y axis, the lower the better) during the course of active selection
(x-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.16 Information curves of active variable selection on risk assessment task on
MIMIC III, produced from: (a) Zero Imputing (ZI), (b) PointNet (PN)
and (c) Zero Imputing with mask (ZI-m). Green: random strategy; Black:
EDDI; Pink: Single best ordering. This displays negative test log likelihood
(y axis, the lower the better) during the course of active selection (x-axis) . 178
7.14 Information reward estimated during the first 4 active variable selection steps
on a randomly chosen Boston Housing test data point. Model: PNP, strategy:
EDDI. Each row contains two plots regarding the same time step. Bar plots
on the left show the information reward estimation of each variable on the
y-axis. All unobserved variables start with green bars, and turn purple
once selected by the algorithm. Right: violin plot of the posterior density
estimations of remaining unobserved variables. . . . . . . . . . . . . . . . 183
7.15 Information reward estimated during the first 4 active variable selection
steps on a randomly chosen Boston Housing test data point. Models: PNP,
strategy: single ordering. Each row contains two plots regarding the same
time step. Bar plots on the left show the information reward estimation of
each variable on the y-axis. All unobserved variables start with green bars,
and turn purple once selected by the algorithm. Right: violin plot of the
posterior density estimations of remaining unobserved variables. . . . . . . 184
8.1 Exemplar missing data situations. (a): MCAR; (b): MAR; (c)-(i): MNAR. 187
8.2 Graphical representations of our GINA. . . . . . . . . . . . . . . . . . . . 195
8.3 Visualization of generated X2 and X3 from synthetic experiment. Row-
wise (A-C) plots for dataset A, B, and C, respectively; Column-wise (i-iv):
training set (only displays fully observed samples), PVAE samples, Not-
MIWAE samples, and GINA samples, respectively. Contour plot: kernel
density estimate of the ground truth density of the complete data. . . . . . 198
8.4 Visualization of imputed X2 and X3 from synthetic experiment. Row-wise
(A-C) plots for dataset A, B, and C, respectively; Column-wise: PVAE im-
puted samples, Not-MIWAE imputed samples, and GINA imputed samples,
respectively. Contour plot: kernel density estimate of the ground truth
density of the complete data. . . . . . . . . . . . . . . . . . . . . . . . . . 214
List of Tables
Chapter 1
Introduction

“Fear comes from uncertainty; we can eliminate the fear within us when we know ourselves better.”
Bruce Lee
These theoretical results strongly favor the Bayesian treatment of uncertainties and decision making in intelligent systems, i.e., the Bayesian EUM formulation. Next, we will describe Bayesian EUM, and how uncertainties play a role in Bayesian EUM in the context of machine learning.
• The space of states, Θ, with θ ∈ Θ being the state of the world, which is usually unobservable to the decision maker. We assume θ follows some probability distribution (called the prior) p(θ) over (Θ, E). Here, E is a σ-algebra on Θ, usually referred to as the space of events.
• The space of observable variables, V, the elements of which are denoted by v. These
represent the information available for decision making. We assume v is generated
according to the conditional probability function p(v|θ) of v given the state θ.
• The set of available actions, A; each element a(·) ∈ A is a function from Θ to C, where C is the space of consequences. That is, depending on the state of the world, an action will cause different consequences.
• The loss function L(c), which assigns a real number to each consequence c ∈ C. Since c is fully determined by the state θ ∈ Θ and the action a(·) ∈ A, we can also view the loss L as a function of θ and a. That is, we can rewrite the loss function as L(θ, a(θ)) (since a is a function of θ), or simply L(θ, a).
Then, the rational decision-making solution is obtained by solving the following (conditional) Bayesian EUM problem:

a⋆ = arg max_{a∈A} E_{p(θ|v)} [L(θ, a)], (1.1)
where the conditional distribution p(θθ |v) represents the decision maker’s belief on the state
of nature conditioned on observable information v. p(θθ |v) is given by Bayes rule,
p(θ|v) = p(θ) p(v|θ) / ∫_Θ p(θ) p(v|θ) dθ, (1.2)
which quantifies the uncertainty/risks of the state θ , given the currently observed information.
Common sources of uncertainty may include [74]:
• Noisy data: for example, uncertainty from measurement noise; many real-world datasets are inevitably corrupted by noise due to imperfections in measurement. In this case, the observable information v in the Bayesian EUM problem corresponds to a noisy version of θ. That is, v = θ̃, and the measurement noise is given by p(θ̃|θ), which is usually irreducible.
• Model uncertainty: given a certain set of observations, there may exist many models with different configurations of structures/parameters that give similar performance on training datasets, but behave differently in out-of-sample prediction. This typically corresponds to ill-posed tasks where the given dataset does not uniquely determine the solution. This is also known as model non-identifiability in engineering problems.
• Missing data: uncertainty due to the absence of certain entries in observable data records v. This is common in data from responses to questionnaires, where participants may refuse to answer certain questions.
able to distinguish different diseases. This means that we have to (equally) consider multiple
plausible explanations at the same time. To reduce such uncertainty, doctors will commonly
conduct more medical tests and/or ask more questions regarding symptoms.
Aleatoric and epistemic uncertainty The noisy data example above is a case of aleatoric
uncertainty, which accounts for variability in the outcome due to inherent randomness. The
other two examples (model uncertainty and missing data) are cases of epistemic uncertainty,
also known as systematic uncertainty, which accounts for uncertainties due to ignorance, i.e.,
lack of knowledge. In this thesis, we will focus on quantifying and reducing both epistemic
and aleatoric uncertainty.
Remark (Generality of Bayesian EUM). Note that the setting of the Bayesian EUM problem
(1.1) is quite general: common machine learning problems such as prediction, classifi-
cation, statistical hypothesis testing, statistical parameter estimation, Bayesian inference,
experimental design, etc. can all be described as special cases of Equation (1.1):
which corresponds to the mode of the posterior predictive distribution p(y|x∗ , D).
• Approximate inference. In this case, the action a(·) is a distribution that assigns a probability density value to each θ, and is represented by the approximate distribution q(θ). The loss can be defined as L(θ, a) = log q(θ), and the Bayesian EUM problem is then given by

q⋆ = arg max_q E_{p(θ|v)} [log q(θ)].
See [254] for more details and more examples of Bayesian EUM problem.
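To make this abstraction concrete, the following minimal sketch works through a hypothetical two-state diagnosis problem in Python. The states, likelihood, observation, and utility values are all invented for illustration; the posterior follows Equation (1.2), and the action is chosen by expected-utility maximization as in Equation (1.1).

```python
import numpy as np

# States of the world: theta in {cold_sore, skin_cancer} (hypothetical numbers).
prior = np.array([0.95, 0.05])            # p(theta)
likelihood = np.array([[0.70, 0.30],      # p(v | theta = cold_sore)
                       [0.20, 0.80]])     # p(v | theta = skin_cancer)
v = 1                                     # observed information: a positive test

# Bayes rule (Equation 1.2): posterior is proportional to prior times likelihood.
posterior = prior * likelihood[:, v]
posterior /= posterior.sum()

# L(theta, a) for actions a in {treat, refer}: rows index states, columns actions.
utility = np.array([[ 1.0, -0.2],         # theta = cold_sore
                    [-5.0,  0.8]])        # theta = skin_cancer

# Bayesian EUM: choose the action maximizing posterior expected utility.
expected_utility = posterior @ utility
best_action = ["treat", "refer"][int(np.argmax(expected_utility))]
print(posterior, expected_utility, best_action)
```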
Takeaway So far, we have seen the following: 1) the theoretical results mentioned in Section 1.1 show that Bayesian EUM is the consequence of rationality, which could be the common principle behind natural and machine intelligent systems; 2) being able to compute p(θ|v) is key in the Bayesian EUM problem, which includes many important tasks in machine learning and statistical inference as special cases. This motivates us to develop effective techniques to: 1) quantify model uncertainty and missing data uncertainty (i.e., evaluate p(θ|v)) efficiently; and 2) use the uncertainty information to correctly guide our decision-making process. This thesis focuses on addressing these two questions in specific scenarios.
• ii) Some models involve the usage of so-called “implicit distributions” [179, 198], which are probability measures assigned implicitly by the specification of a process that generates samples from them. One of the most well-known implicit distributions is the generator of a generative adversarial net (GAN) [88], which transforms isotropic noise into high-dimensional data and can thus be used to specify flexible distributions for p(θ). For such models, the evaluation of p(θ) is intractable, which adds additional difficulties when doing Bayesian inference.
• iii) In many models, such as those that use neural networks as components, it is hard to interpret the implications of the choice of prior. That is, it is unclear what effect the prior p(θ) over the weights θ will have on the resulting predictive distribution.
All the challenging problems listed above present significant difficulties for performing inference in the parameter space. This motivates us to propose new methodologies to address these issues. We will provide a more in-depth analysis of those challenges in Chapter 3.
• i) Unsupervised learning under missing data. That is, how to specify/learn the dis-
tribution of the complete data p(x), given only partially observed data points with
missing values? Furthermore, can this be done in the large data/large model regime?
This problem is often referred to as unsupervised learning under missing values.
• ii) Efficient missing data imputation. There are many possible partitions of the complete data into missing/observed subsets, U and O. For D observable variables, there exist 2^D different combinations for the observed subset O (see the sketch after this list). Therefore, there are 2^D different posterior distributions of the form p(x_U | x_O) that may need to be computed, each of them requiring exact/approximate Bayesian inference. How can we efficiently address this significant computational challenge?
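The sketch below makes this combinatorial blow-up explicit by enumerating all observed-subset patterns for a toy number of variables (D = 4 is an arbitrary choice for illustration):

```python
from itertools import product

D = 4  # number of observable variables (toy value)
# Each binary mask marks which variables are observed; every mask corresponds
# to a distinct conditional p(x_U | x_O) that may need to be inferred.
masks = list(product([0, 1], repeat=D))
print(len(masks))   # 16 == 2**D
print(masks[:3])    # (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```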
We will provide a more detailed analysis of challenges for missing data uncertainty
quantification in Chapter 6.
Remark (example of missing data uncertainty, cont.) When a patient experiencing a lip sore shows up at the hospital, a doctor usually knows little about the patient’s current status. Therefore, he/she is uncertain of the cause of such symptoms. However, an experienced doctor might actively acquire information to reduce such uncertainty. He/she would first evaluate the current situation and investigate the possible scenarios, and then ask questions accordingly. He/she might ask whether the patient has a mouth ulcer and whether he/she is experiencing a high temperature. Based on the answers, the doctor can re-evaluate the likelihood of each possible outcome. For example, if the answers are positive and negative, respectively, then the doctor might conclude that the patient most likely has a cold sore. Meanwhile, it is equally important to acknowledge that other high-risk possibilities, such as skin cancer, are not excluded, at least not without further medical tests.
The above scenario highlights one of the most important applications of Bayesian approaches, i.e., how to answer what does our model know? or, equivalently, how can we know if the model does not know? Bayesian approaches provide a principled answer: when there is not enough information to make predictions/decisions, the estimated uncertainty level (either the model uncertainty p(θ|v) and/or the missing data uncertainty p(x_U|x_O)) should be quite high. This indicates that the model does not quite know what it is doing. Then, we must either refuse to make decisions (and hand over to human experts), or proceed to collect more information (acquire and add more variables to x_O), until we are significantly more certain. This inspires us to use Bayesian approaches to automate the human expert’s ability to collect high-value information.
This thesis is organized into two themes: theme A addresses Challenge I, and theme B addresses Challenges II and III; they explore topics in supervised learning and unsupervised learning, respectively.
Contributions This chapter contains one of the first works demonstrating the idea of performing approximate inference for modern probabilistic models in function space. Our key contribution is twofold. First, we introduce a flexible class of stochastic process priors, namely the variational implicit processes, for Bayesian modelling. Second, we propose a new inference method for this type of prior, based on Gaussian process (GP) approximations in function space.
Outline In this chapter, we first review the basics of Gaussian processes (GPs)
in Bayesian machine learning. We argue that one of the key ideas of GPs is
that they directly specify prior and posterior distributions over functions, which
could be useful to address some of the aforementioned challenges in model un-
certainty estimation. We address the key question of how to extend such function
space inference ideas to Bayesian parametric models such as Bayesian neural
networks. Then, we show how to perform efficient function-space inference
using approximate Bayesian inference techniques. We develop the variational
implicit process (VIP) as a solution. Similar to Gaussian processes (GPs), in implicit processes (IPs) an implicit multivariate prior is placed over any finite collection of random variables. Based on Gaussian process approximations, a novel and efficient approximate inference algorithm for IPs is then derived, based on a generalized version of the wake-sleep algorithm. We finally perform experiments to demonstrate the effectiveness of VIPs on a number of tasks, on which VIPs return better uncertainty estimates and superior performance compared with other existing inference methods.
• Chapter 5: Functional variational inference.
Outline In this chapter, we begin by reviewing the three basic types of missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We argue that model identifiability is a main obstacle for unsupervised learning under MNAR, and is overlooked by many modern deep generative models. To this end, we provide a theoretical analysis of identifiability for generative models under different MNAR assumptions. More specifically, we propose a set of sufficient conditions under which the ground truth parameters can be uniquely identified via optimizing the partial ELBO proposed in Chapter 7. We also demonstrate how the assumptions can be slightly relaxed under model mis-specification. Based on our theoretical results, we propose a practical algorithm based on identifiable variational auto-encoders, which enables us to apply flexible deep generative models in a principled way, even in the presence of MNAR data. This chapter ends with an empirical evaluation of the proposed model across different tasks.
• Chapter 6: discusses the backgrounds of missing data uncertainty and the challenges
of performing unsupervised learning under missing data using generative models.
• Chapter 8: extends the work in Chapter 7 to more general missing not at random
(MNAR) assumptions and studies the model identifiability of deep generative models
under MNAR.
• Chapter 9: summarizes the thesis and suggests directions for future research.
• Chao Ma, Yingzhen Li, and José Miguel Hernández-Lobato, “Variational implicit
processes”, in International Conference on Machine Learning (ICML), 2019.
• Chao Ma, and Cheng Zhang, “Identifiable Generative Models for Missing Not at
Random Data Imputation”, in Neural Information Processing Systems (NeurIPS),
2021.
Indirect publications
• Ignacio Peis, Chao Ma, and José Miguel Hernández-Lobato. “Missing Data Imputa-
tion and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo.”,
in Neural Information Processing Systems (NeurIPS), 2022.
• Weijie He, Xiaohao Mao, Chao Ma, Yu Hang, José Miguel Hernández-Lobato,
and Ting Cheng, “BSODA: A Bipartite Scalable Framework for Online Disease
Diagnosis.”, Proceedings of the ACM Web Conference, 2022.
• Chao Ma, Sebastian Tschiatschek, Yingzhen Li, Richard Turner, José Miguel Hernández-
Lobato, and Cheng Zhang, “HM-VAEs: a deep generative model for real-valued data
with heterogeneous marginals”, in Symposium on Advances in Approximate Bayesian
Inference, 2020.
In all of the papers described above, the author (Chao Ma) was responsible for research question identification, the major research contributions, theoretical derivations and proofs, model designs, and experiments, under the supervision of other co-authors, with the following exceptions:
• The paper titled “EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE”, in which the research question was originally identified by Sebastian Nowozin and Cheng Zhang, and the main technical details, derivations, designs, and implementations were done by the author (Chao Ma) during an internship at Microsoft Research.
• The papers titled “BSODA: A Bipartite Scalable Framework for Online Disease Diagnosis.” and “Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo”, in which I was only partly involved in the initial design of the algorithms, the discussion of the experimental results, and a small fraction of the paper writing. Most of the implementation, experiments, and paper writing were conducted by Weijie He and Ignacio Peis.
Chapter 2
Inference and Models in Bayesian Machine Learning
“The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned, is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem is will at best be approximate.”
John W. Tukey, “The Future of Data Analysis”, 1962
This chapter reviews the principles of Bayesian machine learning, modern approximate inference algorithms, and several important models where Bayesian inference will be performed, including Bayesian neural networks, Gaussian processes, variational autoencoders, and Helmholtz machines.
Remark (Why probability: a brief review). The discussion of the philosophy of statistics, i.e., how we justify and interpret probabilities, is beyond the scope of this thesis. However, we would like to point out that justifications for using probabilities to represent uncertainties have been extensively discussed in the literature. Here, we briefly review some of these important theoretical results. These results, while having their technical limitations, still provide invaluable insights and greatly strengthen the argument regarding the inevitability of probabilities.
One of the first justifications can be found in the theory of subjective probability of Ramsey [265] and De Finetti [53], namely the principle of coherence (of bets), which also relates to decision making. The principle of coherence states that, unless the bettor’s assignment of beliefs to random events satisfies the rules of probability, he/she will accept bets that return a definite loss. Such a collection of bets is also called a Dutch book, or, in the terminology of economics, an arbitrage.
Alternatively, as described in Chapter 1, Savage [296] independently took an axiomatic approach to inference under uncertainty, firmly based in the context of decision making. Under certain axioms of an idealized rational agent (i.e., if the decision maker has rational preferences among all possible decisions), his/her assignment of uncertainty must satisfy the rules of probability.
Finally, another important axiomatic approach is due to Cox [45] and Jaynes [137], which extends logical reasoning under Boolean algebra to logical inference under uncertainty. Cox’s theorem states that any system representing the strengths of belief (plausibility) of propositions will be inconsistent with true-false logic unless it satisfies the rules of probability. That is, logical reasoning under uncertainty must be implemented by statistical inference using probability.
p(θ|D) = p(θ) p(D|θ) / ∫_Θ p(θ) p(D|θ) dθ. (2.1)
Similarly, we can perform inference on an unknown observable variable. Given a test input x∗, the Bayesian posterior predictive distribution for the corresponding output, y∗, is given by

p(y∗|x∗, D) = ∫_Θ p(y∗|x∗, θ) p(θ|D) dθ, (2.2)
which quantifies the uncertainty on the test output y∗ , after observing the data D. Note that
both Equation (2.1) and Equation (2.2) require the marginalization of θ , which produces the
so-called marginal likelihood (or model evidence), given by

p(D) = ∫_Θ p(θ) p(D|θ) dθ.
Intuitively, p(D) quantifies the average probability that a parameter randomly sampled from p(θ) generates D. Since the parameters are integrated out, a model selection criterion based on p(D) naturally guards against over-fitting and controls the model complexity. To understand this effect, factorize p(D) in a sequential way: p(D) = p(y1) p(y2|y1) p(y3|y1, y2) ... p(yN|y1:N−1) (with inputs omitted). If the model is too simple, it will fit each of those distributions poorly. If it is too complex, then it will over-fit the “early samples”, and predict the remaining samples poorly [235]. This is referred to as the Bayesian Occam’s razor effect [205]. Therefore, the quantity p(D) is often used to perform model comparison and model selection across different models.
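The Bayesian Occam’s razor effect can be seen numerically in a one-dimensional toy problem. The sketch below (model, data, and priors all invented for illustration) computes p(D) by grid integration for a simple prior and an overly flexible one:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=20)      # D: observations with unit noise
grid = np.linspace(-40.0, 40.0, 20001)    # grid over the scalar parameter theta
dx = grid[1] - grid[0]
loglik = norm.logpdf(data[:, None], loc=grid, scale=1.0).sum(axis=0)  # log p(D|theta)

for prior_std in (1.0, 10.0):             # a simple prior vs. an overly flexible one
    log_joint = loglik + norm.logpdf(grid, loc=0.0, scale=prior_std)
    m = log_joint.max()                   # log-sum-exp for numerical stability
    log_evidence = m + np.log(np.sum(np.exp(log_joint - m)) * dx)
    print(prior_std, log_evidence)
# The wide prior spreads its mass over implausible values of theta and attains
# a lower p(D): a numerical instance of the Bayesian Occam's razor effect.
```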
Unfortunately, in the setting of many modern machine learning problems, the integration problem ∫_Θ p(θ) p(D|θ) dθ is often computationally intractable [176], except for a few special cases. This is often due to: 1) the integration problem usually involving a high-dimensional parameter space that is impractical for numerical integration methods; and 2) the functional form of the likelihood function, p(y|x, θ), often being very complicated. To confront such challenges, we often resort to approximate inference techniques [176], which will be introduced in the next section.
Variational inference (VI) [140] converts the above inference problem into an optimization problem by first proposing a class of approximate posterior distributions qλ(θ), parameterized by variational parameters λ, and then minimizing with respect to λ a certain divergence measure between the approximate posterior qλ(θ) and p(θ|D). One common parameterization method is the so-called mean-field approximation [294]. For example, qλ(θ) can be parameterized by a fully-factorized Gaussian, qλ(θ) = ∏_{1≤d≤|θ|} N(θd; md, σd²), where |θ| denotes the dimensionality of θ, θd is the d-th component of θ, and N(θd; md, σd²) is the Gaussian density of θd with mean md and variance σd². The variational parameters λ are then given by the collection of mean and standard deviation parameters λ = {(md, σd) | 1 ≤ d ≤ |θ|}.
Variational inference usually minimizes the (exclusive) Kullback–Leibler divergence [163] DKL[·||·] between the approximate posterior qλ(θ) and the true posterior p(θ|D). It is defined as

DKL[qλ(θ)||p(θ|D)] = ∫ qλ(θ) log [qλ(θ) / p(θ|D)] dθ. (2.4)
DKL[qλ(θ)||p(θ|D)] has two nice properties that make it a suitable objective function for VI. First, DKL[qλ(θ)||p(θ|D)] is always non-negative; and second, DKL[qλ(θ)||p(θ|D)] = 0 if and only if qλ(θ) = p(θ|D), that is, when the variational distribution is the perfect approximation to the ground truth.
Definition 2.1 (KL-divergence). Given two probability distributions Q and P with probability measures Q and P on (Ω, B), respectively, the KL-divergence is defined as

DKL[Q||P] = sup_G ∑_i Q(Gi) log [Q(Gi) / P(Gi)], (2.5)

where G = ∪_i Gi is a finite measurable partition of Ω. In other words, the KL-divergence between two probability measures is the supremum of the relative entropies obtained over all possible (finite measurable) partitions of Ω.
Proposition 2.1. The KL-divergence defined in Definition 2.1 satisfies, in particular,

DKL[Q||P] = ∫_Ω log (dQ/dP) dQ,

where dQ/dP denotes the Radon–Nikodym derivative of Q w.r.t. P. This gives the alternative measure-theoretic definition of the KL-divergence used in the literature [213].
However, we cannot directly compute DKL[qλ(θ)||p(θ|D)], since the true posterior p(θ|D) is unknown. To get around this, notice that the marginal likelihood log p(D) can be rewritten in terms of DKL[qλ(θ)||p(θ|D)] and the evidence lower bound (ELBO) L(q):

log p(D) = L(q) + DKL[qλ(θ)||p(θ|D)], (2.7)

where L(q) is given by
Figure 2.1 The relationships among log p(D), L(q), and DKL[qλ(θ)||p(θ|D)] in variational inference.
L(q) = E_{qλ(θ)} [log (p(D, θ) / qλ(θ))] = E_{qλ(θ)} [log p(D|θ)] − DKL[qλ(θ)||p(θ)]. (2.8)
Intuitively, the first term, E_{qλ(θ)}[log p(D|θ)], known as the likelihood term of the ELBO, quantifies the model’s expected predictive performance on the training data D. The second term, DKL[qλ(θ)||p(θ)], serves as a regularizer, which encourages qλ(θ) to be close to the prior p(θ). Finally, we can optimize L(q) by performing gradient ascent using the gradient ∇λ L(q).
In the supervised learning setting with D = {(xn, yn)}n=1:N, the ELBO becomes

L(q) = ∑_{n=1}^N E_{qλ(θ)} [log p(yn|xn, θ)] − DKL[qλ(θ)||p(θ)]. (2.10)
For large datasets, the sum over N data points can be approximated by sub-sampling a mini-batch K of K data points, giving the noisy ELBO estimate

L̂(q) = (N/K) ∑_{k∈K} E_{qλ(θ)} [log p(yk|xk, θ)] − DKL[qλ(θ)||p(θ)], (2.11)

and we can then perform stochastic gradient ascent on λ with step size rt at iteration t:

λ_{t+1} = λ_t + rt ∇λ L̂(q). (2.12)

This will converge to a local optimum λ⋆ of L(q) if the learning rate schedule {rt} satisfies the Robbins–Monro conditions [276]:
∑_{t=1}^∞ rt = ∞, ∑_{t=1}^∞ rt² < ∞ (2.13)

(for example, the schedule rt = r0/t satisfies both conditions).
Finally, to deal with the analytical intractability of E_{qλ(θ)}[log p(yk|xk, θ)], the following Monte Carlo approximation is often used:

E_{qλ(θ)} [log p(yk|xk, θ)] ≈ (1/M) ∑_{m=1}^M log p(yk|xk, θm), θm ∼ qλ(θ). (2.14)
This allows us to estimate the expectations E_{qλ(θ)}[log p(yk|xk, θ)] without the need for analytic solutions. By combining the noisy ELBO (2.11) with the MC approximation (2.14), variational inference can be easily applied to complicated models and large datasets, which forms an algorithmic foundation for modern Bayesian deep learning.
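As a concrete illustration, the sketch below estimates the noisy ELBO of Equation (2.11), using the Monte Carlo approximation of Equation (2.14) for the likelihood term and the analytic Gaussian KL for the regularizer. The model (a scalar linear-Gaussian regression), the data, and all hyper-parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y_n = theta * x_n + eps, eps ~ N(0, 1), with prior theta ~ N(0, 1).
N = 1000
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)

def noisy_elbo(m, log_s, batch=32, n_mc=8):
    """Noisy ELBO (Equation 2.11) for q(theta) = N(m, s^2), with the Monte
    Carlo approximation (Equation 2.14) for the likelihood term."""
    s = np.exp(log_s)
    idx = rng.choice(N, size=batch, replace=False)   # mini-batch K
    theta = m + s * rng.normal(size=n_mc)            # theta_m ~ q(theta)
    # (N/K) * sum_k (1/M) * sum_m log p(y_k | x_k, theta_m), Gaussian likelihood
    loglik = -0.5 * ((y[idx][:, None] - x[idx][:, None] * theta) ** 2
                     + np.log(2 * np.pi)).mean(axis=1).sum() * (N / batch)
    # Analytic KL[N(m, s^2) || N(0, 1)]
    kl = 0.5 * (m ** 2 + s ** 2 - 1.0) - log_s
    return loglik - kl

print(noisy_elbo(m=0.0, log_s=0.0))
```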
One remaining difficulty is the computation of the gradient ∇λ E_{qλ(θ)}[log p(yk|xk, θ)]. Note that the gradient is taken with respect to the parameters λ of the probability measure qλ(θ) appearing in the expectation operator E_{qλ(θ)}[·]. Therefore, we cannot naively exchange the order of the derivative operator ∇λ and the expectation operator E_{qλ(θ)}[·]. To properly compute ∇λ E_{qλ(θ)}[log p(yk|xk, θ)], several techniques can be adopted:
1. The REINFORCE gradient estimator [366, 73], also known as the score function estimator, is derived as follows:

∇λ E_{qλ(θ)}[log p(yk|xk, θ)] = ∫_Θ ∇λ [qλ(θ) log p(yk|xk, θ)] dθ
 = ∫_Θ qλ(θ) ∇λ log qλ(θ) log p(yk|xk, θ) dθ
 = E_{qλ(θ)}[∇λ log qλ(θ) log p(yk|xk, θ)]
 ≈ (1/M) ∑_{m=1}^M ∇λ log qλ(θm) log p(yk|xk, θm), θm ∼ qλ(θ),

where ∇λ log qλ(θ) is the score function.
Note that the above derivation does not require the gradient of log p(yk|xk, θ). Therefore, the REINFORCE estimator can even be applied when log p(yk|xk, θ) is not differentiable w.r.t. θ. (A numerical sketch comparing both estimators follows this list.)
2. The path-wise estimator, also known as the reparameterization trick [231, 149]. Suppose qλ(θ) can be reparameterized into the following sampling procedure:

θ = hλ(ε), ε ∼ p(ε).

For example, a Gaussian qλ(θ) = N(θ; µ, σ²) with λ = {µ, σ} can be reparameterized as θ = µ + σε, ε ∼ N(ε; 0, 1). The gradient can then be estimated as
∇λ E_{qλ(θ)}[log p(yk|xk, θ)] = ∇λ E_{p(ε)}[log p(yk|xk, hλ(ε))]
 = E_{p(ε)}[∇λ log p(yk|xk, hλ(ε))]
 ≈ (1/M) ∑_{m=1}^M ∇λ log p(yk|xk, hλ(εm)), εm ∼ p(ε).
The path-wise estimator does require computing the gradient ∇λ log p(yk|xk, hλ(εm)). However, when the path-wise gradient estimator can be applied, it tends to have lower variance than the REINFORCE estimator [74, 73, 272], especially when 1) θ is high-dimensional; and 2) log p(yk|xk, θ) is smooth, with derivatives of small magnitude. Regarding the variance of the two estimators, [74] has shown the following result:
Proposition 2.2 (Variance analysis). Let qλ(θ) = N(µ, σ²), λ = {µ, σ}, and let f(θ) be a real-valued function. Assume Var_{qλ(θ)}((θ − µ) f(θ)) < ∞, Var_{qλ(θ)}(f′(θ)) < ∞, E_{qλ(θ)}[|(θ − µ) f′(θ) + f(θ)|] < ∞, and E_{qλ(θ)}[|f″(θ)|] < ∞.
The α-divergence is a generic family of divergences that includes as special cases the (exclusive) KL-divergence (α = 0, corresponding to VI), the inclusive KL-divergence (α = 1, corresponding to the EP objective), and the Hellinger distance (α = 0.5), among others. Thus, BB-α naturally unifies VI [140] and expectation propagation (EP) [223, 178]. As α → +∞, Dα[p||q] tends to encourage mode-covering behaviour (i.e., q will tend to cover all local modes of p); on the contrary, as α → −∞, Dα[p||q] encourages mode-seeking behaviour, and q will tend to place more mass on the area where p has the largest probability [225].
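This behaviour can be checked numerically. The sketch below evaluates Dα[p||q] = (1 − ∫ p(θ)^α q(θ)^{1−α} dθ)/(α(1 − α)) on a grid, for a toy bimodal p and two Gaussian candidates for q (all densities invented for illustration):

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-6.0, 6.0, 2401)
dx = grid[1] - grid[0]
p = 0.5 * norm.pdf(grid, -2.0, 0.5) + 0.5 * norm.pdf(grid, 2.0, 0.5)  # bimodal target

def d_alpha(q, alpha):
    # Grid approximation of D_alpha[p || q].
    return (1.0 - np.sum(p ** alpha * q ** (1.0 - alpha)) * dx) / (alpha * (1.0 - alpha))

q_mode = norm.pdf(grid, 2.0, 0.5)    # concentrates on a single mode of p
q_cover = norm.pdf(grid, 0.0, 2.5)   # broad: covers the mass of both modes
for alpha in (-4.0, 0.5, 4.0):
    print(alpha, d_alpha(q_mode, alpha), d_alpha(q_cover, alpha))
# Large positive alpha heavily penalizes q_mode for missing mass where p has mass;
# large negative alpha instead favours the single-mode q_mode.
```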
Before introducing further details regarding BB-α, we first briefly discuss the power expectation propagation (power EP) algorithm [224], since BB-α is largely inspired by power EP. Unlike VI (which optimizes the KL divergence globally), power EP parameterizes qλ(θ) via a set of local approximations,

qλ(θ) ∝ (1/Z) p(θ) ∏_n f̃n(θ),

where each f̃n(θ) ∝ exp(λn^T φ(θ)) is an exponential-family factor that captures the individual effect of p(yn|xn, θ).
Moreover, power EP optimizes the α-divergence Dα[p||q] locally using message passing. If convergent, it converges to a fixed point of the following energy function LPEP(λ0, {λn}):

LPEP(λ0, {λn}) = log Z(λ0) + (N/α − 1) log Z(λq) − (1/α) ∑_{n=1}^N log ∫ p(yn|xn, θ)^α exp((λq − αλn)^T φ(θ)) dθ, (2.17)
where λq = λ0 + ∑_{n=1}^N λn is the natural parameter of qλ(θ).
Remark (Local approximations via power EP). We briefly describe here how iterative local message passing is done in power EP. In each iteration, power EP picks a factor (say f̃n) to refine. This is done by minimizing the α-divergence Dα[qλ(θ) p(yn|xn, θ)/f̃n(θ) || qλ(θ)], whose gradient with respect to λq is

∇_{λq} Dα[qλ(θ) p(yn|xn, θ)/f̃n(θ) || qλ(θ)]
 = ∇_{λq} { (1/(α(1−α))) ( 1 − ∫_Θ [qλ(θ) p(yn|xn, θ)/f̃n(θ)]^α qλ(θ)^{1−α} dθ ) }
 = −(1/α) ∫_Θ [qλ(θ) p(yn|xn, θ)/f̃n(θ)]^α qλ(θ)^{1−α} ∇_{λq} log qλ(θ) dθ
 ∝ E_{qλ(θ)}[φ(θ)] − E_{q̃n(θ)}[φ(θ)],

where q̃n(θ) ∝ qλ(θ) p(yn|xn, θ)^α / f̃n(θ)^α is the so-called tilted distribution. Setting this gradient to zero yields the moment-matching update

E_{qλ(θ)}[φ(θ)] ← E_{q̃n(θ)}[φ(θ)].
Note that power EP in general does not have any convergence guarantees. However,
when convergent, it converges to a fixed point of LPEP .
Now we are ready to describe BB-α. BB-α is hugely inspired by power EP, but takes a different approach towards energy optimization. BB-α directly optimizes LPEP with tied factors f̃n = f̃, to avoid prohibitively expensive local factor updates and storage across the whole dataset. This means λn = λ for all n, and λq = λ0 + Nλ. Therefore, instead of parameterizing each factor, we can directly parameterize qλ(θ) and replace all the local factors in the power-EP energy function by f̃(θ) ∝ (qλ(θ)/p(θ))^{1/N}. After re-arranging terms, this gives the BB-α energy:
Lα(q) = −(1/α) ∑_n log E_{qλ} [ ( fn(θ) (p(θ)/qλ(θ))^{1/N} )^α ], (2.18)
which can be further approximated by the following when the dataset is large [177]:

Lα(q) = DKL[q||p] − (1/α) ∑_n log Eq [p(yn|xn, θ)^α]. (2.19)
Empirically, BB-α can often achieve better uncertainty estimation than VI, and it has been applied successfully in different scenarios [177, Depeweg et al.]. Moreover, from the perspective of model evidence approximation, it has been suggested [180] that the BB-α energy Lα(q) forms a better estimate of the log marginal likelihood log p(D) than the evidence lower bound L(q). Therefore, in Chapter 3 of this thesis, the BB-α energy is used for both Bayesian inference and model evidence approximation.
Implicit models provide exceptionally powerful tools for parameterizing and learning probabilistic distributions. One of the most well-known implicit distributions in machine learning is the generator of a generative adversarial net (GAN) [88, 11], where gλ(·): R^|ε| → R^|θ| is a neural network (namely, the generator) that maps the noise ε to higher-dimensional observations. GANs have demonstrated their expressiveness and flexibility in complicated tasks such as image generation [262], protein modeling [270], lattice simulation in high-energy physics [256], weak lensing convergence maps in cosmology [238], and Fermi–Hubbard model simulation in high-temperature superconductivity [40].
Definition 2.2 (Pushforward measure). Given a probability space (Ω, B, P), a measurable mapping G from (Ω, B, P) to another measurable space (S, M) induces a probability measure G∗, called the pushforward of P, defined by

G∗(A) = P(G^{−1}(A)), ∀A ∈ M. (2.22)
The GAN example above is a special case of Equation (2.22), obtained by considering Ω = R^|ε|, S = R^|θ|, and G a measurable function defined by a neural network.
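In practice, sampling from such a pushforward (implicit) distribution simply means drawing noise from P and applying the map G, as in the minimal sketch below (the map G is a hypothetical example). Crucially, the density of the resulting samples is not available in closed form, which is exactly the difficulty discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(eps):
    # A hypothetical measurable map from R^2 noise to R (a tiny "generator").
    return np.tanh(eps[:, 0]) + 0.1 * eps[:, 1] ** 2

eps = rng.normal(size=(10_000, 2))   # eps ~ P = N(0, I) on Omega = R^2
theta = G(eps)                       # theta ~ G*P: easy to sample, no tractable density
print(theta.mean(), theta.std())
```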
Although implicit distributions are very powerful, variational inference using such posterior approximators is quite difficult. This is because the KL-divergence term DKL[q||p] of the ELBO (2.8, 2.29) requires evaluating the probability density function of q, which is intractable when q is defined by an implicit distribution. Therefore, several implicit variational inference methods have been proposed based on additional approximations for evaluating DKL[q||p]. One common technique is the density ratio estimator [328, 125, 339, 179]. To derive this technique, notice that DKL[q||p] can be written as
DKL[q||p] = E_{qλ(θ)} [log (qλ(θ)/p(θ))] = E_{qλ(θ)} [log U(θ)], (2.23)

where U(θ) = qλ(θ)/p(θ) is the density ratio between qλ(θ) and p(θ). U(θ) can be estimated via a GAN-based idea, by training a binary classifier D(θ) that distinguishes samples of q from samples of p:

D(θ) := P(θ is sampled from q | θ) = 1 / (1 + exp{−log U(θ)}). (2.24)
This classifier can be trained by solving

max_U E_{qλ(θ)} [log D(θ)] + E_{p(θ)} [log(1 − D(θ))]. (2.25)
The optimal solution U⋆ of (2.25) recovers the exact density ratio,

U⋆(θ) = qλ(θ)/p(θ). (2.26)
Once U⋆ has been obtained, the implicit-VI ELBO can be estimated via

L(q) = ∑_{n=1}^N E_{qλ(θ)} [log p(yn|xn, θ)] − E_{qλ(θ)} [log U⋆(θ)]. (2.27)
2.3 Models
Many recent advances in Bayesian machine learning started from the developments of tools
and techniques of approximate Bayesian inference, which have enabled the application of
Bayesian principles to complicated models. In this section, we introduce some of these
important models in Bayesian machine learning. Specifically, we will introduce Bayesian
neural networks, Gaussian processes, and deep generative models, where elements of modern
deep learning are combined with Bayesian modeling (i.e., Bayesian deep learning). We will
also discuss the advantages and disadvantages behind those ideas, and thus motivate some fresh ideas that will be developed later in the thesis.
where w = {(w1, b1), ..., (wL, bL)} is the set of neural network weights (and biases), R(w) is the regularization term weighted by β (a common choice being the Frobenius norm R(w) = ||w||²_F), and ς_{l−1}(·) is the non-linear activation function of the l-th layer. The optimization problem min_w l(w, D) is often solved by stochastic optimization methods.
A natural idea is then to place a prior distribution over the neural network weights, which will “Bayesianize” the DNNs and give us Bayesian neural networks [205, 243]. This combines the best of both worlds: on one hand, modern approximate inference methods enable deep learning models to take advantage of Bayesian methods and represent uncertainties in their predictions; on the other hand, the introduction of deep learning elements harmonizes well with Bayesian approaches, since it allows one to use much more flexible probabilistic models.
Definition 2.3 (Bayesian neural networks (BNN)). Let y = g(x, w) be a neural network, where x is the input, y is the output, and w denotes the weights. To build a Bayesian neural network (BNN), we place a prior p(w) over w. This prior models the epistemic uncertainty over the parameters, which can be explained away by observing more data. Furthermore, we add observational noise ε ∼ N(ε; 0, σ²) to the output. This models the aleatoric uncertainty, which accounts for the intrinsic noise in the observations (i.e., data uncertainty). Then, a BNN is given by

log p(y|x) = log E_{p(w)} [p(y|x, w)] = log ∫ N(y; g(x, w), σ²) p(w) dw. (2.28)
Then, we can perform Bayesian inference with the BNN model: given a dataset D = {xi, yi}_{i=1}^n, the goal of Bayesian inference is to compute the posterior p(w|D) ∝ p(D|w) p(w). Since this is intractable, we often resort to approximate inference methods. In principle, all scalable approximate inference algorithms, such as scalable VI, α-divergence minimization, and (stochastic) expectation propagation, can be applied. In particular, we may apply scalable variational inference by introducing a variational distribution q(w) that approximates the posterior p(w|D). q(w) is trained by maximizing (a noisy estimate of) the evidence lower bound (ELBO):

L(q) = ∑_{i=1}^n E_{q(w)} [log p(yi|xi, w)] − DKL[q(w)||p(w)]. (2.29)
Once an optimal q⋆ has been found, we can approximate the Bayesian posterior predictive distribution at a test input x∗ as follows:

p(y∗|x∗, D) ≈ (1/M) ∑_{m=1}^M N(y∗; g(x∗, wm), σ²), wm ∼ q⋆(w), 1 ≤ m ≤ M. (2.30)
Remark (A brief history of VI for BNNs). Here we give a brief review of how VI methods for BNNs were developed in the literature. Variational inference over Bayesian neural network weights was first introduced by [117], which considered variational inference from an information-theoretic perspective (minimum description length). This idea was further developed by [17] for the case of non-factorized variational approximations. More recently, [91] proposed a stochastic version of variational inference and derived one of the first scalable learning algorithms for Bayesian neural networks. This was later refined and extended by Bayes-by-Backprop (BBB) [29], which applied the reparameterization trick (path-wise gradient estimator) of [150] to obtain an unbiased estimate of the ELBO (and its gradients). This also allows the specification of non-Gaussian priors, which further boosted the performance of BNNs to a level comparable to other practical deep learning methods (such as dropout) at that time. Variational Bernoulli dropout (VDO) [75, 122] instead interprets dropout [323] as performing variational inference for BNNs. It also applies the path-wise gradient estimator of [150], but uses a mixture of Gaussians as q(w), such that columns of the weight matrices are randomly pruned to zero. This also introduces weight correlations into q(w), and halves the number of variational parameters. BBB and VDO are still among the most popular BNN algorithms to date.
Modern Laplace approximations. The Laplace approximation [56, 204] is perhaps one of the first approximate inference methods for Bayesian neural networks; it approximates the BNN posterior by a multivariate Gaussian centered at its MAP mode. However, the need to calculate second-order derivatives (i.e., the Hessian matrix) made this vanilla approach computationally infeasible. Later, generalized Gauss–Newton (GGN) methods [301, 210], as well as scalable block-diagonal/Kronecker-factored approximations [211, 32], were introduced for practical second-order optimization of deep neural networks. These approximation techniques have resulted in modern Laplace approximations for Bayesian neural networks [275, 71]. More recently, it has been argued that the GGN approximation can be viewed as a linearized version of the Bayesian neural network, which popularized another branch of works that approximate the Bayesian posterior predictive distribution by that of the linearized model [71, 145, 127].
systems, all wrapped up in a single, exact closed-form solution to the posterior inference
problem.
We briefly introduce GPs for regression. Consider again the scenario of having a training set v = D := {(xn, yn)}n=1:N, where X = {x1, x2, ..., xN} is the set of inputs and y = {y1, y2, ..., yN} the corresponding outputs. A Gaussian process model assumes that yn is generated according to the following procedure: first, a function f(·) is drawn from a Gaussian process GP(M(·), K(·, ·)) (to be defined later); then, for each input xn, the corresponding yn is drawn according to

yn = f(xn) + εn, εn ∼ N(0, σ²), n = 1, ..., N.
A Gaussian Process is then a nonparametric distribution defined over the space of functions,
as defined next.
Now, given a set of observational data {(xn, yn)}_{n=1}^N, we can perform probabilistic inference and assign posterior probabilities to all plausible functions that might have generated the data. In the regression setting, given a new test input x∗, we are interested in the posterior distribution over f∗. Fortunately, this posterior distribution admits a closed-form solution, f∗ ∼ N(µ∗, Σ∗):

µ∗ = K_{x∗f} (K_{ff} + σ²I)^{−1} y,
Σ∗ = K_{x∗x∗} − K_{x∗f} (K_{ff} + σ²I)^{−1} K_{fx∗}.

In our notation, (y)n = yn, (K_{ff})_{nm} = K(xn, xm), (K_{x∗f})n = K(x∗, xn), K_{fx∗} = K_{x∗f}^T, and K_{x∗x∗} = K(x∗, x∗). Although the Gaussian process regression framework is theoretically very elegant, in practice its computational burden is prohibitive for large datasets, since the matrix inversion (K_{ff} + σ²I)^{−1} takes O(N³) time via Cholesky decomposition. Once this matrix inversion is done, predictions at test time can be made at a cost of O(N) for the posterior mean µ∗ and O(N²) for the posterior variance Σ∗, respectively.
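The closed-form posterior above translates directly into a few lines of linear algebra. The sketch below (an RBF kernel and toy data, invented for illustration) computes µ∗ and Σ∗ through a Cholesky factorization, matching the complexities just discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Toy data: y_n = sin(x_n) + eps_n.
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
Xs = np.linspace(-4, 4, 100)[:, None]
sigma2 = 0.1 ** 2

Kff = rbf(X, X) + sigma2 * np.eye(len(X))
L = np.linalg.cholesky(Kff)                 # the O(N^3) step
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
Ksf = rbf(Xs, X)
mu_star = Ksf @ alpha                       # posterior mean: O(N) per test point
V = np.linalg.solve(L, Ksf.T)
Sigma_star = rbf(Xs, Xs) - V.T @ V          # posterior covariance: O(N^2) per point
print(mu_star[:3], np.diag(Sigma_star)[:3])
```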
Despite the success and popularity of GPs (and other Bayesian non-parametric methods) over the past decades, their O(N³) computation and O(N²) storage complexities make their application to large-scale datasets impractical. Therefore, one often resorts to sophisticated approximation methods, e.g., [304, 261, 315, 335, 109, 38, 37, 287, 46, 343].

Another critical issue is the representational power of GP kernels. It has been argued that the local kernels commonly used for nonlinear regression cannot obtain hierarchical representations of high-dimensional data [23], which limits the usefulness of Bayesian nonparametric models in complicated tasks. A number of solutions have been proposed, including deep GPs [49, 47, 36], the design of expressive kernels [346, 65, 337], and hybrid models that feed features from deep neural nets into a GP [116, 367]. However, the first two approaches still struggle to model complex high-dimensional data such as text and images; and in the third approach, the advantages of fully Bayesian approaches are not fully exploited.
Definition 2.6 (Latent variable models). A latent variable model is a generative model where pθ(xn) is defined by assigning a local latent variable zn to each data instance xn. In other words, given an unlabelled dataset D = {xn}_{1≤n≤N}, we have

log pθ(D) = ∑_n log ∫ pθ(xn, zn) dzn. (2.33)

Without loss of generality, in this thesis we assume all generative models are latent variable models.
For example, a classical choice is the linear Gaussian model,

p(zn) = N(zn; 0, I), pθ(xn|zn) = N(xn; Λzn, σ²I),

where Λ is a linear transformation matrix, σ is the standard deviation of the Gaussian noise, and the learnable model parameters are given by θ = {Λ, σ}.

Alternatively, another useful way to specify p(xn, zn) is through so-called energy-based models,

pθ(x, z) = e^{−Eθ(x,z)} / Zθ.

A famous example is the restricted Boltzmann machine (RBM) [115], where Eθ(x, z) is given by

Eθ(x, z) = x^T W z + b_x^T x + b_z^T z.
One central task for unsupervised learning with generative models is learning, i.e., performing maximum likelihood estimation of θ:

θ⋆ = arg max_θ ∑_n log pθ(xn). (2.35)

Note that since we only have access to {xn}_{1≤n≤N}, and the latent variables {zn}_{1≤n≤N} are unobserved, the latter need to be marginalized out during maximum likelihood learning:

θ⋆ = arg max_θ ∑_n log pθ(xn) = arg max_θ ∑_n log ∫ pθ(xn|zn) pθ(zn) dzn. (2.36)

Since this marginal likelihood is intractable in general, we again resort to variational inference: introducing an approximate posterior qλn(zn) for each data instance, we obtain the evidence lower bound (ELBO)

∑_n log pθ(xn) ≥ L({λn}_{1≤n≤N}, θ) = ∑_n E_{qλn(zn)} [log (pθ(xn, zn)/qλn(zn))]. (2.37)
Remark (Global vs. local latent variables). Note that the ELBO in Equation (2.37) is different from that of a BNN in Equation (2.29). In a BNN, all data instances share the same latent variable, i.e., the neural network weights w. We call w a global latent variable. Hence, for a dataset of N data instances, the BNN bound in Equation (2.29) has exactly one approximate factor (i.e., q_λ(w)) and one KL term. In a latent variable model, however, each data instance x_n is assigned its own latent variable z_n. This creates N different approximate factors q_{λ_n}(z_n), as well as N KL terms, in the ELBO (2.37).
The optimization is often done by recursively alternating between the following two steps (variational EM, or VEM): an E step, which fixes θ and maximizes the ELBO L with respect to the variational parameters {λ_n}_{1≤n≤N}; and an M step, which fixes {λ_n}_{1≤n≤N} and maximizes L with respect to the model parameters θ.
When exact inference is possible, i.e., q_λ(z_n) = p_θ(z_n|x_n) for all 1 ≤ n ≤ N, the above VEM procedure never decreases the likelihood, since

$$\mathcal{L}(\{\lambda_n^{\text{old}}\}_{1\le n\le N}, \theta^{\text{old}}) \le \mathcal{L}(\{\lambda_n^{\text{new}}\}_{1\le n\le N}, \theta^{\text{old}}) = \sum_n \log p_{\theta^{\text{old}}}(x_n) \le \mathcal{L}(\{\lambda_n^{\text{new}}\}_{1\le n\le N}, \theta^{\text{new}}) \le \sum_n \log p_{\theta^{\text{new}}}(x_n).$$
Unfortunately, exact inference is intractable in most cases, and the approximate inference used in the E step will break the condition q_λ(z_n) = p_θ(z_n|x_n). Therefore, in practice, the likelihood under VEM is not guaranteed to increase monotonically. However, under mild conditions, the VEM procedure is still convergent and will converge to a local optimum of L.
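To illustrate the E/M alternation and the monotonicity guarantee in a case where exact inference is available, consider the following toy linear-Gaussian model (a minimal sketch of ours, assuming p(z_n) = N(0, 1) and p_θ(x_n|z_n) = N(θ z_n, 1), so that both steps have closed forms):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
z_true = rng.normal(size=500)
x = theta_true * z_true + rng.normal(size=500)   # observed data only

theta = 0.5                                      # initial model parameter
for _ in range(50):
    # E step (exact): p(z_n | x_n) = N(theta * x_n / (theta^2 + 1), 1 / (theta^2 + 1))
    post_var = 1.0 / (theta ** 2 + 1.0)
    post_mean = theta * x * post_var
    # M step: maximize sum_n E_q[log p(x_n | z_n)] w.r.t. theta in closed form
    theta = (x * post_mean).sum() / (post_mean ** 2 + post_var).sum()
print(theta)  # the marginal likelihood never decreases across iterations
```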
Note that variational inference is not the only candidate for the E step. An alternative is Monte Carlo inference: we can draw samples from p(z_n|x_n) using MCMC methods [7, 236, 244] and estimate the gradient via

$$\nabla_\theta \log p_\theta(x_n) = \mathbb{E}_{p_\theta(z_n|x_n)}\left[\nabla_\theta \log p_\theta(x_n, z_n)\right] \approx \frac{1}{S}\sum_{s=1}^S \nabla_\theta \log p_\theta(x_n, z_n^{(s)}), \qquad z_n^{(s)} \sim p_\theta(z_n|x_n).$$
A prominent example is the deep generative model, which specifies p(z_n) = N(z_n; 0, I) and p_θ(x_n|z_n) = N(x_n; g_θ(z_n), σ²I), where g_θ(·) is a deep neural network parameterized by θ that transforms z_n into the mean parameter of the Gaussian distribution p_θ(x_n|z_n). The above deep generative model can also be written in terms of the following sampling process:

$$z_n \sim \mathcal{N}(0, I), \qquad x_n = g_\theta(z_n) + \varepsilon_n, \qquad \varepsilon_n \sim \mathcal{N}(0, \sigma^2 I),$$
which takes the form of an implicit distribution, introduced in Section 2.2.4; such distributions are known to excel at representing highly expressive probability distributions. The distribution p_θ(x_n|z_n) is often referred to as the decoder in the context of autoencoders [150], or the generator in the context of GANs [88].
Just like for any generative model, learning and inference for deep generative models, as defined in Equations (2.39) and (2.40), can be performed by optimizing the evidence lower bound. To make this scalable, the per-instance variational factors q_{λ_n}(z_n) are replaced by a single amortized inference network q_λ(z_n|x_n).
In other words, q_λ(z_n|x_n) tries to predict (the sufficient statistics of) the approximate posterior distribution q_{λ_n}(z_n) by taking the observation x_n as input. q_λ(z_n|x_n) is often referred to as the encoder, or the recognition/inference network. An important way to parameterize q_λ(z_n|x_n) is via neural networks:

$$q_\lambda(z_n|x_n) = \mathcal{N}\big(z_n;\, \mu_\lambda(x_n),\, \mathrm{diag}(\sigma^2_\lambda(x_n))\big),$$

where both µ_λ(·) and σ_λ(·) are deep neural networks that map the observation x_n to the mean and standard deviation of the distribution q_λ(z_n|x_n).
Now, with the amortized approximation q_{λ_n}(z_n) ≈ q_λ(z_n|x_n), the amortized variational lower bound becomes:

$$\sum_n \log p_\theta(x_n) \ge \mathcal{L}(\lambda, \theta) = \sum_n \mathbb{E}_{q_\lambda(z_n|x_n)}\left[\log \frac{p_\theta(x_n, z_n)}{q_\lambda(z_n|x_n)}\right]. \tag{2.43}$$
Finally, instead of the two-step recursive procedure of variational EM, we can now apply the advanced techniques of scalable variational inference (Section 2.2.2) to the amortized objective (2.43), and optimize both θ and λ simultaneously with stochastic gradients. This recovers the variational auto-encoder of [150].
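The following is a minimal NumPy sketch of a single-datapoint Monte Carlo estimate of the amortized bound (2.43), with toy linear encoder/decoder networks (our illustration only; in practice both networks are deep and θ, λ are trained jointly with automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 2                                   # toy data / latent dimensions
We = 0.1 * rng.normal(size=(H, D))            # encoder weights (lambda)
Wd = 0.1 * rng.normal(size=(D, H))            # decoder weights (theta)

def elbo_one_sample(x):
    """Single-sample Monte Carlo estimate of the per-datapoint term of (2.43),
    using the reparameterization z = mu + sigma * eps."""
    mu_z = We @ x                             # encoder mean mu_lambda(x)
    log_sig_z = np.zeros(H)                   # encoder log-std (held fixed here)
    z = mu_z + np.exp(log_sig_z) * rng.normal(size=H)
    mu_x = Wd @ z                             # decoder mean g_theta(z)
    recon = -0.5 * np.sum((x - mu_x) ** 2 + np.log(2 * np.pi))   # log N(x; mu_x, I)
    # KL[N(mu_z, sig_z^2) || N(0, I)] in closed form
    kl = 0.5 * np.sum(np.exp(2 * log_sig_z) + mu_z ** 2 - 1.0 - 2 * log_sig_z)
    return recon - kl

print(elbo_one_sample(rng.normal(size=D)))
```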
Amortized inference certainly sacrifices representational power due to the parameter-tying approximation λ_n ≈ λ. However, it brings several major advantages for generative modeling. The most direct impact is the drastically reduced memory cost, making this new formulation highly scalable to large datasets. It also allows fast bottom-up inference on new data, without worrying about the mixing issues of MCMC or the additional cost of optimization steps in variational EM. Moreover, the parameterization q_λ(z_n|x_n) decouples the local structure (represented by x_n) from the global structure (represented by λ, shared across all data instances) of q [176]. This potentially helps the optimization of both the variational parameters λ and the model parameters θ.
An alternative to optimizing a single ELBO is the wake-sleep algorithm [114], which alternates between the following two phases:

• The wake phase: in this phase we optimize the model parameters θ by maximizing the amortized ELBO of Equation (2.43) w.r.t. θ only.
• The sleep phase: Recall that in normal VI/variational EM, we train the recognition network q_λ by optimizing

$$\min_\lambda \; \mathbb{E}_{p_{\text{data}}(x)} D_{KL}[q_\lambda(z|x) \,\|\, p_\theta(z|x)],$$

where p_data(x) is the ground-truth data distribution that generated the training set D. On the contrary, the sleep phase of the wake-sleep algorithm optimizes the following reverse KL divergence:

$$\min_\lambda \; \mathbb{E}_{p_\theta(x)} D_{KL}[p_\theta(z|x) \,\|\, q_\lambda(z|x)].$$
Note that the outer expectation is taken using the model distribution pθ (x) instead of
the data distribution pdata (x). This quantity can be optimized using gradient descent,
which computes the gradients as below:
$$\nabla_\lambda \mathbb{E}_{p_\theta(x)} D_{KL}[p_\theta(z|x) \| q_\lambda(z|x)] = \nabla_\lambda \mathbb{E}_{p_\theta(x,z)}[\log p_\theta(z|x) - \log q_\lambda(z|x)] \tag{2.46}$$
$$= \mathbb{E}_{p_\theta(x,z)}[-\nabla_\lambda \log q_\lambda(z|x)]. \tag{2.47}$$
Therefore, the intuition behind the sleep phase is to perform maximum likelihood learning of the recognition model q_λ(z|x) using “dreamed” samples obtained from the model p_θ(x, z) (a small sketch follows after this list). This avoids differentiating through the probability measure of the expectation operator E_{p_θ(x,z)}, which makes the gradient estimation more well-behaved. However, it comes at the cost of breaking the unified single objective, i.e., the evidence lower bound L.
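As a concrete sketch of a sleep-phase λ update (our toy example: a linear-Gaussian generative model, for which the maximum likelihood fit of a linear-Gaussian recognition model on dreamed data reduces to least squares):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, S = 5, 2, 2000
Wd = rng.normal(size=(D, H))                  # decoder of the generative model

# "Dream": ancestral samples (z, x) ~ p_theta(x, z)
z = rng.normal(size=(S, H))
x = z @ Wd.T + rng.normal(size=(S, D))        # x | z ~ N(Wd z, I)

# Sleep-phase lambda update: maximum likelihood fit of a linear-Gaussian
# recognition model q_lambda(z | x) = N(A^T x, Sigma) on the dreamed pairs,
# which reduces to least-squares regression of z on x.
A, *_ = np.linalg.lstsq(x, z, rcond=None)     # A is D x H
resid = z - x @ A
Sigma = resid.T @ resid / S
print(A.T, Sigma)
```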
Remark (Sleep phase λ updates and wake phase λ updates). The sleep phase update described above is usually referred to as the sleep phase λ update [31]. An alternative update rule is the so-called wake phase λ update, which optimizes

$$\mathbb{E}_{p_{\text{data}}(x)} D_{KL}[p_\theta(z|x) \| q_\lambda(z|x)] = \mathbb{E}_{p_{\text{data}}(x)\, p_\theta(z|x)}[\log p_\theta(z|x) - \log q_\lambda(z|x)] \tag{2.48}$$
$$\approx \mathbb{E}_{p_{\text{data}}(x)\, q_\lambda(z|x)}[\log p_\theta(z|x) - \log q_\lambda(z|x)]. \tag{2.49}$$

That is, it performs maximum likelihood learning of the recognition model q_λ(z|x) using samples from the data distribution, which is sometimes more advantageous than the sleep phase λ update. Since p_θ(z|x) is unknown, the expectation over it is replaced by the biased approximation q_λ(z|x) (which calls for further bias-reduction techniques such as importance sampling).
2.4 Conclusion
In this chapter, we gave a systematic review of the basic techniques for uncertainty quan-
tification in machine learning, including Bayesian modeling, approximate inference, and
important models for both supervised and unsupervised learning. Some of these techniques
will be further applied, developed, and compared in the next chapters (3, 4, and 5). Through this process, we hope to give readers a better idea of existing advancements and challenges in Bayesian machine learning and Bayesian deep learning, most of which center around the scalability, flexibility, and accuracy of approximate inference methods. In the rest of the thesis, we will present new contributions along those dimensions.
Part A
In Part A (Chapters 3, 4, and 5) of the thesis, we will first focus on supervised learning problems and try to address Challenge I identified in Chapter 1, i.e., obtaining efficient and accurate model uncertainty in supervised learning problems. Estimating model uncertainty properly is a crucial task: given a set of observations, there might exist a number of potential models that fit the data equally well. We are therefore uncertain about which predictive model to finally choose, which will affect the decisions we make based on such models. As introduced in Chapter 2, this type of uncertainty can in principle be quantified by performing Bayesian inference over model parameters.
In this chapter, we will first take Bayesian neural networks as an example and review some of the difficulties of performing Bayesian inference in parameter space. Undoubtedly, model unidentifiability/overparameterization is one of the major obstacles to improving the quality of approximate inference. We will thus argue for performing inference in function space, as this is equivalent to performing inference on minimal sufficient parameters, hence resolving the problems of model non-identifiability and posterior inconsistency (see Figure 3.1 for a preview).
Figure 3.1 Chapter 3 preview: parameter space inference versus function space inference.
Left: parameter space VI minimizes KL divergence over the parameters θ = {θ1 , θ2 },
which suffers from overparameterization and model unidentifiability. Right: on the contrary,
function space inference directly performs Bayesian inference in the space of functions, and minimizes the KL divergence over functions. This is equivalent to performing inference on the minimal sufficient parameters, which form an identifiable reparameterization of the regression model, and resolves the posterior inconsistency pathologies of parameter space inference.
• Expressiveness of the variational family. When performing VI, it has been observed that mean-field BNNs have limited expressiveness and tend to underestimate in-between uncertainty [70, 68]. While it is generally possible to introduce non-factorized or non-Gaussian approximations (see Section 2.3.2), such posterior approximations are much more difficult to design for BNNs due to their high number of parameters.
• Specification of priors. Neural network weights usually do not have scientific meaning, which makes it difficult to specify meaningful priors that induce favorable predictive behaviors [69]. Worse still, the choice of prior has a substantial impact on Bayesian neural networks [313, 361, 72], especially for out-of-distribution detection [313].
• Cold posterior effect. It has been empirically observed that BNN posteriors tend to
achieve better performance when the posterior over weights is sharpened/tempered with a temperature T < 1 [361, 359, 94]:

$$\log p(w|\mathcal{D})^{\frac{1}{T}} \propto \frac{1}{T}\log p(\mathcal{D}|w) + \frac{1}{T}\log p(w). \tag{3.1}$$

The cold posterior effect could be caused by a number of issues, for instance inaccurate inference [132], model misspecification (especially of priors [72, 361]), data curation [3], data augmentation [132], etc. We note that a similar effect has been discovered not only in BNNs, but for other model classes as well [94].
Although the challenges listed above are quite prevalent in Bayesian neural network
applications, we note that they are not tied to BNNs only. Many of these problems are partly caused by a more general property of probabilistic models, namely model non-identifiability [156, 278], which is commonly observed for both parametric and non-parametric models. We will discuss identifiability and related theoretical results in the next sections, and show that performing inference in function space resolves the problems of non-identifiability and posterior inconsistency.
Remark (Identifiability). The formal definition of identifiability has been presented and generalized in different contexts, e.g., [278, 255, 6]. Here, we stick to the widely used definition based on observational equivalence [278]:

Definition 3.2 (Identifiability). The model P is said to be identifiable (on Θ) if every pair of observationally equivalent parameters θ₁, θ₂ ∈ Θ satisfies θ₁ = θ₂.
a For instance, in Bayesian analysis we often assume X to be a Polish space, with B a Borel σ-algebra, which ensures that P_θ(·) is well defined, i.e., a regular conditional probability distribution.
b Similarly, we will also assume Θ to be a Polish space with some Borel σ-algebra, which ensures that the posterior is likewise well defined.
The reason we bring up the concept of identifiability is that model identifiability relates to one of the most important asymptotic properties of statistical inference: posterior consistency. That is, whether the posterior distribution will robustly contract to the true parameter of the data distribution under perfect information, regardless of the choice of prior belief. In the language of probability theory, this can be formally described as follows:¹ ²

Definition 3.3 (Posterior consistency). Assume the data D_n = {x_1, ..., x_n} are drawn i.i.d. from P_{θ⋆} for some θ⋆ ∈ Θ. The posterior P(θ|D_n) is said to be consistent at θ⋆ if, for every neighbourhood U of θ⋆, P(θ ∈ U|D_n) → 1, P_{θ⋆}-a.s., as n → ∞.
In other words, Definition 3.3 formalizes the idea that “perfect knowledge should be
able to override prior beliefs asymptotically” [153]. This property provides an important
theoretical justification for Bayesian approaches to practical problems, such as regression
and classification. Due to the crucial importance of the posterior consistency of P(θ|D_n), its sufficient conditions have been studied broadly in the statistics literature, and a wide spectrum of posterior consistency theorems have been derived. Informally speaking, posterior consistency theorems usually take the following form: if the model P is identifiable (or satisfies a suitable strengthening of identifiability) and certain regularity conditions hold, then the posterior P(θ|D_n) is consistent.

1 Depending on the context, there are different definitions of posterior consistency [345, 168]. Here we only present one of the most widely adopted ones.
2 This assumption can be further removed. If we do so, the definition of consistency becomes: P(θ|D_n) is consistent if, for every neighbourhood U of P_{θ⋆} in P, P(θ ∈ U|D_n) → 1, P_{θ⋆}-a.s. as n → ∞.
A number of consistency theorems of this form have been derived under different scenarios, for example the Bernstein-von Mises (BvM) theorems [164, 345] for parametric models, Doob's theorem [63] and Schwartz's theorem [302] for nonparametric models, and consistency results under misspecification (i.e., P_{θ⋆} ∉ P) [25, 152, 151]. The formal statements of those theorems are quite technical and beyond the scope of this thesis. However, we note that model identifiability assumptions often appear as a core part of the sufficient conditions. In Doob's theorem, identifiability is directly assumed; in Schwartz's theorem and the BvM theorems, the identifiability condition is replaced by a stronger condition called testability [302].
When the identifiability assumption is violated, the corresponding posterior consistency no longer holds. In other words, the posterior distribution will be substantially affected by the prior distribution, which makes Bayesian inference problematic in this case. An interesting observation is that, even when the model parameter is non-identifiable, posterior consistency may still be achieved on a reparameterized version of the original parameters, called the minimal sufficient parameters [16]. The concept of a minimal sufficient parameter is defined in a similar way to sufficient statistics:
Definition 3.4 (Minimal sufficient parameters [249]). Given a model p_θ(x), a quantity φ is called a sufficient parameter if i) φ = Λ(θ) for some function Λ, and ii) x ⊥ θ | φ. Furthermore, φ is said to be a minimal sufficient parameter if it is a sufficient parameter and can be expressed as a function of any other sufficient parameter.
Remark (Non-uniqueness and existence). Note that the minimal sufficient parameter of a model is not unique: it is only determined up to a one-to-one transformation. Moreover, a minimal sufficient parameter always exists, since the likelihood function itself constitutes a minimal sufficient parameter of the model.
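As a simple illustration (a toy example of our own, not taken from the cited works), consider an overparameterized Gaussian mean model:

\[
  p_\theta(x) = \mathcal{N}\!\left(x;\ \theta_1 + \theta_2,\ 1\right),
  \qquad \theta = (\theta_1, \theta_2) \in \Theta = \mathbb{R}^2 .
\]
% The likelihood depends on \theta only through \phi = \Lambda(\theta) = \theta_1 + \theta_2,
% so x \perp \theta \mid \phi and \phi is a sufficient parameter; since any sufficient
% parameter determines the likelihood and hence determines \phi, \phi is also minimal.
% The model is non-identifiable on \Theta (all pairs with equal sums are observationally
% equivalent), but becomes identifiable after reparameterization by \phi.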
One can show that both the likelihood function P_θ(x) and the posterior P(θ|x) depend on θ only through its minimal sufficient parameters, and that minimal sufficient parameters are always identifiable [16, 13, 249]. Therefore, a more intuitive way to think about minimal sufficient parameters is as an identifiable reparameterization of the model.
The effectiveness of minimal sufficient parameters suggests a new direction for approximate Bayesian inference: when the regression model is non-identifiable in the original parameter space Θ, can we find its minimal sufficient parameter φ = Λ(θ) and perform inference on the space Φ = {Λ(θ) | θ ∈ Θ} instead (Figure 3.1)? In fact, for regression problems, this is equivalent to performing inference in function space, as will be discussed in the next section.
Remark (Posterior correlations on θ). Apart from posterior consistency, Theorem 3.2 also provides another perspective on the challenges of performing approximate inference on θ ∈ Θ. Since the posterior on φ = Λ(θ) asymptotically approaches a point mass, the constraint Λ(θ) ≈ φ⋆ will approximately hold, which might introduce stronger pairwise correlations between different components of θ as n increases. This might cause mean-field VI to under-estimate the posterior variance on θ [344].
Remark (Bayesian or frequentist?). The setting of Definition 3.3 and Theorems 3.1 and 3.2 is different from the usual Bayesian inference problems we have seen. On the one hand, we have assumed the existence of a ground-truth parameter value θ⋆ and treat the data samples x_n as i.i.d. random variables, which corresponds to a frequentist setting. On the other hand, we perform parameter estimation in a Bayesian way, i.e., via the posterior P(θ|D_n). Therefore, the setting of posterior consistency in Definition 3.3 is in fact a mixture of the frequentist and Bayesian paradigms.
Consider a BNN regression model with likelihood p(D_n|w) = ∏_n N(y_n; g(x_n, w), σ²), where g(·, w) denotes the network output function with weights w. This is unfortunately non-identifiable except for a few special cases [258], according to Definition 3.2, which may cause a number of issues. However, we may reparameterize the likelihood function using the predictive mean function f(·) := g(·, w):

$$p(\mathcal{D}_n|f) = \prod_n \mathcal{N}\big(y_n;\, f(x_n),\, \sigma^2\big).$$
Now, according to Definition 3.4, the predictive function f(·) is a minimal sufficient parameter of the BNN (this can be seen by noticing that the likelihood function p(D_n|f) is itself a minimal sufficient parameter, and that it corresponds to f(·) one-to-one). Therefore, we can expect that the resulting function space posterior,

$$p(f|\mathcal{D}_n) = \frac{p(\mathcal{D}_n|f)\, p(f)}{p(\mathcal{D}_n)}, \tag{3.4}$$
will converge to the true function asymptotically under certain conditions. Indeed, in the context of Bayesian neural networks, for example, such function-space posterior convergence has been proved [172] under certain priors.³ This motivates us to perform inference directly in function space, which gives us a number of advantages. In particular:
• Function space inference directly performs Bayesian inference in the space of minimal sufficient parameters, which forms an identifiable reparameterization of the regression model. Therefore, if done properly, it will help us bypass the problems of parameter space inference mentioned in Section 3.1, in particular unidentifiability, over-parameterization, symmetries, and (asymptotic) sensitivity to weight-space prior beliefs.
• When the model parameters do not have specific scientific meaning (as in BNNs), we may directly design flexible and/or meaningful variational families for q(f), without first having to derive variational distributions in weight space. This also helps us bypass the limited expressiveness of MFVI described in Section 3.1.
3 Technically speaking, these BNN posterior convergence results are proved in the sense of Hellinger neighbourhoods (over P) of the predictive joint density functions p(y_n|f(x_n))p(x_n) (with p(x_n) fixed as p(x_n) ∝ 1), which are also a minimal sufficient parameter (and therefore equivalent to the predictive function f).
Of course, introducing function space inference also brings new challenges, which will be discussed in further detail in Chapters 4 and 5. In short, the research presented in Part A is all about making function space inference practical:

• In Chapter 4, we will first propose a method called the variational implicit process, which approximates p(f|D_n) via Gaussian process (GP) approximations and learns p(f) via a wake-sleep procedure;
1. Modeling. Place a GP prior p(f) = GP(M(·), K(·, ·)) over the unknown function, specified by a mean function and a covariance function (kernel).

2. Inference. Given a collection of training data D, the exact posterior p(f|D) is then given by closed-form expressions.
To a certain extent, the concept of function space inference considered in this chapter can be seen as an algorithmic abstraction of the above GP inference procedure. Instead of constructing a specific infinite-dimensional p(f), in function space inference we are given a predictive model f_θ(x) parameterized by some parameter θ (which could be finite dimensional). We treat these models as if they were infinite-dimensional objects, and compute the posterior in function space, p(f|D). Gaussian processes themselves do not immediately lead to such an algorithmic abstraction: they assume a specific form of function space prior (a Gaussian one), and their inference method (based on closed-form expressions) cannot be directly applied to other priors. Therefore, to find such algorithmic abstractions, we have two potential options:
I The first option is model-driven. That is, we extend existing Bayesian nonparametric priors such as GPs to a class of more general and more flexible priors. Ideally, this class should cover many interesting Bayesian models as special cases. We then need to develop a specific method for optimizing variational approximations under this new prior.
II The other option is algorithm-driven, which is much more difficult. Instead of considering a specific class of functional priors, we take an existing parameter-space approximate inference method, for example variational inference, and extend it to its function space counterpart. This should yield a general-purpose function space approximate inference method.
In this chapter, we will explore the first route (and leave the second to Chapter 5). Our key idea is to draw inspiration from recent advances in implicit models. As introduced in Section 2.2.4 of Chapter 2, probabilistic models with implicit distributions as core components have recently attracted enormous interest in both the deep learning and the approximate Bayesian inference communities. In contrast to prescribed probabilistic models [61] that assign explicit densities to possible outcomes of the model, implicit models implicitly assign probability measures by specifying the data generating process. One of the most well-known implicit distributions is the generator of generative adversarial nets (GANs) [88, 11], which transforms isotropic noise into high dimensional data using neural networks. In the approximate inference context, implicit distributions have also been used as flexible approximate posterior distributions [271, 191, 340, 182].
This chapter explores the extension of implicit models to Bayesian modeling of random functions. Similar to the construction of Gaussian processes (GPs), we develop the implicit process (IP), which assigns implicit distributions to any finite collection of random variables. IPs can therefore be much more flexible than GPs when complicated models such as neural networks are used for the implicit distributions. With an IP as the prior, we can directly perform (variational) posterior inference over functions in a non-parametric fashion. This is beneficial for obtaining well-calibrated uncertainty estimates, as with GPs [36]. It also avoids typical issues of inference in parameter space, such as the symmetric modes in the posterior distribution of Bayesian neural network weights. Function-space inference for IPs is achieved by our proposed variational implicit process (VIP) algorithm, which addresses the intractability issues of implicit distributions.
Concretely, the contributions of this chapter are threefold:

• We formalize implicit stochastic process priors over functions and prove their well-definedness in both finite- and infinite-dimensional cases. By allowing the usage of IPs with rich structures as priors (e.g., data simulators and Bayesian LSTMs), our approach provides a unified and powerful Bayesian inference framework for these important but challenging deep models.
• We derive a novel and efficient variational inference framework that gives a closed-form approximation to the IP posterior. It does not rely on, e.g., the density ratio/gradient estimators from the implicit variational inference literature, which can be inaccurate in high dimensions. Our inference method is computationally cheap and allows scalable hyperparameter learning in IPs.
• We conduct extensive comparisons between IPs trained with the proposed inference
method, and GPs/BNNs/Bayesian LSTMs trained with existing variational approaches.
Our method consistently outperforms other methods and achieves state-of-the-art
results on a large-scale Bayesian LSTM inference task.
This chapter is arranged as follows. In Section 4.1, we first introduce the notion of stochastic processes (i.e., distributions over functions) and how to construct them via the Kolmogorov extension theorem. Then, in Section 4.2, we utilize these tools to generalize Gaussian processes to implicit processes, providing concrete examples and deriving theoretical results regarding their well-definedness. In Section 4.3, we introduce an approximate inference method for implicit processes, and in Section 4.3.3, we discuss computational complexities and scalable methods for predictive inference. We then evaluate the proposed method experimentally in Section 4.4 and, finally, review related work from different areas in Section 4.5.
Definition 4.1 (Stochastic processes). Suppose that we are given a probability space (Ω, F, P) and a non-empty set T. Then any collection of random variables {v_x : ω ∈ Ω ↦ v_x(ω) ∈ Y | x ∈ T} is called a stochastic process, and T is called the index set.
Remark (Random variables and induced measures). Here we take the chance to briefly review the measure-theoretic definition of random variables. A random variable v on a probability space (Ω, F, P) is a measurable function that maps Ω to another measurable space (Y, B). It therefore defines an induced probability measure on Y, denoted by P_v, given by

$$P_v(B) := P(v^{-1}(B)), \qquad B \in \mathcal{B}. \tag{4.1}$$

Therefore, whenever we talk about a random variable on (Ω, F, P), we can also interpret it as an (induced) measure on (Y, B).
There are three complementary ways to interpret a stochastic process:

1. by Definition 4.1, it is a collection of random variables {v_x} indexed by x ∈ T;

2. for each fixed ω ∈ Ω, it gives a sample path, i.e., a deterministic function on the index set,

$$x \in T \mapsto v_x(\omega) \in \mathcal{Y}; \tag{4.2}$$

3. and finally, we can also interpret a stochastic process as the induced probability measure (by the Y^T-valued random variable) on the measurable space Y^T, i.e., a distribution over random functions.

The second and third interpretations give us two alternative definitions of stochastic processes:
Definition 4.3 (Stochastic processes, ii). Suppose we are given a probability space (Ω, F, P) and a non-empty set T. Then a stochastic process is a measurable function from Ω to a measurable space Y^T, equipped with a suitable σ-algebra denoted by B^T.ᵃ

Definition 4.4 (Stochastic processes, iii). Suppose we are given a measurable space (Y, B) (usually we assume Y = R and B the corresponding Borel σ-algebra) and a non-empty set T. Then a stochastic process is a probability measure over Y^T, with a suitable σ-algebra, denoted by B^T.

a Technically, this is in fact the cylinder σ-algebra on Y^T, given in Theorem 4.1 [129].
Theorem 4.1 (Kolmogorov extension theorem). Without loss of generality, suppose that for any finite subset {x₁, x₂, ..., x_n} ⊂ T we have a corresponding finite-dimensional probability measure p_{x₁,x₂,...,x_n} (called a marginal distribution) on the measurable space Y^n := R^n. If these measures satisfy the following consistency conditions for all finite subsets {x₁, x₂, ..., x_n} ⊂ T and integers n:

$$p_{x_1,\dots,x_n}(B_1 \times \dots \times B_n) = p_{x_{\pi(1)},\dots,x_{\pi(n)}}(B_{\pi(1)} \times \dots \times B_{\pi(n)}), \quad \forall B_1, \dots, B_n \in \mathcal{B}(\mathbb{R}) \text{ and all permutations } \pi; \tag{4.3}$$

$$p_{x_1,\dots,x_n}(B_1 \times \dots \times B_n) = p_{x_1,\dots,x_n,x_{n+1}}(B_1 \times \dots \times B_n \times \mathbb{R}); \tag{4.4}$$

then there exists a unique stochastic process p⋆(f) on R^T (under the so-called cylinder σ-algebra, denoted by B^T) such that

$$p^\star(f_{x_1} \in B_1, f_{x_2} \in B_2, \dots, f_{x_n} \in B_n) = p_{x_1,\dots,x_n}(B_1 \times B_2 \times \dots \times B_n) \tag{4.5}$$

for all integers n, all finite subsets {x₁, x₂, ..., x_n} ⊂ T, and all measurable sets B₁, ..., B_n ∈ B(R).
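For instance (our worked check, using standard Gaussian identities), the finite-dimensional marginals of a GP satisfy condition (4.4), since marginalizing a joint Gaussian over its last coordinate simply deletes the corresponding mean entry and kernel row/column:

\[
  p_{x_1,\dots,x_{n+1}} = \mathcal{N}\!\left(
    \begin{bmatrix} \mathbf{m} \\ m_{n+1} \end{bmatrix},
    \begin{bmatrix} \mathbf{K} & \mathbf{k} \\ \mathbf{k}^\top & k_{n+1,n+1} \end{bmatrix}
  \right)
  \;\Longrightarrow\;
  p_{x_1,\dots,x_{n+1}}(B_1 \times \dots \times B_n \times \mathbb{R})
  = \mathcal{N}(\mathbf{m}, \mathbf{K})(B_1 \times \dots \times B_n)
  = p_{x_1,\dots,x_n}(B_1 \times \dots \times B_n),
\]
% with m_i = M(x_i) and K_{ij} = K(x_i, x_j); permutation consistency (4.3) holds
% similarly because the mean and kernel entries are permuted along with the inputs.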
Being able to define a stochastic process through its marginal distributions p_{x₁,x₂,...,x_n} allows us to define flexible function space priors p(f) by combining the ideas of implicit distributions discussed in Section 2.2.4.
Definition 4.5 (Noiseless implicit stochastic processes). An implicit stochastic process (IP) is a collection of random variables f(·) such that any finite collection f = (f(x₁), ..., f(x_N))^⊤ has a joint distribution implicitly defined by the following generative process:

$$z \sim p(z), \qquad f(x_n) = g_\theta(x_n, z), \qquad n = 1, \dots, N. \tag{4.6}$$

A function distributed according to the above IP is denoted as f(·) ∼ IP(g_θ(·, ·), p_z).
Note that z ∼ p(z) could be infinite dimensional (such as samples from a Gaussian
Process). Definition 4.5 is validated by the following propositions.
Proposition 4.1 (Finite dimension case). Let z be a finite dimensional vector. Then there
exists a unique stochastic process on index set T = X, such that any finite collection of
random variables has distribution implicitly defined by (4.6).
Figure 4.1 Examples of IPs: (a) neural samplers; (b) warped GPs; (c) Bayesian neural networks; (d) Bayesian RNNs.
Proposition 4.2 (Infinite-dimensional case). Let z(·) ∼ SP(0, C) be a centered continuous stochastic process on L²(R^d) with covariance function C(·, ·). Then the operator

$$g(x, z) = O_K(z)(x) := h\!\left(\sum_{l=0}^{M} \int K_l(x, x')\, z(x')\, dx'\right), \qquad 0 < M < +\infty,$$

defines a stochastic process if K_l ∈ L²(R^d × R^d), h is a Borel-measurable, bijective function on R, and there exists 0 ≤ A < +∞ such that |h(x)| ≤ A|x| for all x ∈ R.
Proposition 4.1 is proved in appendix 4.A.1 using the Kolmogorov extension theorem.
Proposition 4.2 considers random functions as the latent input z(·), and introduces a specific
form of the transformation/operator g, so that the resulting collection of variables f (·) is
still a valid stochastic process (see Appendix 4.A.2 for a proof). Note that this operator can be applied recursively to build highly non-linear operators over functions [97, 365, 327, 169, 84]. These two propositions indicate that IPs form a rich class of priors over functions. Indeed, we visualize some examples of IPs in Figure 4.1 and discuss them below:
Example 4.1 (Data simulators). Simulators, e.g. physics engines and climate models, are
omnipresent in science and engineering. These models encode laws of physics in gθ (·, ·),
use z ∼ p(z) to explain the remaining randomness, and evaluate the function at input
locations x: f(x) = g_θ(x, z). We define the neural sampler as a specific instance of this class, in which g_θ(·, ·) is a neural network with weights θ, i.e., g_θ(·, ·) = NN_θ(·, ·), and p(z) = Uniform([−a, a]^d); a small sketch follows below.
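Below is a minimal NumPy sketch of such a neural sampler (our illustration; the architecture, sizes, and a = 1 are assumptions). Each draw of z induces one function f(·) = g_θ(·, z), so repeating the draw yields samples from the induced prior over functions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_h = 1, 3, 20
W1 = rng.normal(size=(d_x + d_z, d_h))        # network weights theta
W2 = rng.normal(size=d_h) / np.sqrt(d_h)

def g_theta(x, z):
    """One-hidden-layer neural sampler g_theta(x, z): maps an input location x
    and a latent draw z to the function value f(x)."""
    h = np.tanh(np.concatenate([x, z]) @ W1)
    return h @ W2

xs = np.linspace(-2.0, 2.0, 100)
samples = []
for _ in range(5):
    z = rng.uniform(-1.0, 1.0, size=d_z)      # z ~ Uniform([-a, a]^d) with a = 1
    samples.append([g_theta(np.array([x]), z) for x in xs])  # one function draw
```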
Example 4.2 (Warped Gaussian processes). Warped Gaussian processes [316] are another interesting example of IPs. Let z(·) ∼ p(z) be a sample from a GP prior, and define g_θ(x, z) = h(z(x)), where h(·) is a one-dimensional monotonic function.
Example 4.3 (Bayesian neural network). In a Bayesian neural network, the synaptic weights W with prior p(W) play the role of z in (4.6). A function is sampled by drawing W ∼ p(W) and then setting f(x) = g_θ(x, W) = NN_W(x) for all x ∈ X. In this case, θ could be the hyper-parameters of the prior p(W) to be tuned.
Example 4.4 (Bayesian RNN). Similar to Example 4.3, a Bayesian recurrent neural network
(RNN) can be defined by considering its weights as random variables, and taking as function
evaluation an output value generated by the RNN after processing the last symbol of an
input sequence.
Adding Gaussian noise to an IP draw defines the observation model

$$f(\cdot) \sim \mathcal{IP}(g_\theta(\cdot,\cdot), p_z), \qquad y_n = f(x_n) + \varepsilon_n, \quad \varepsilon_n \sim \mathcal{N}(0, \sigma^2). \tag{4.7}$$

Equation (4.7) defines an implicit model p(y, f|x), which is intractable in most cases. Note that it is common to add Gaussian noise ε to an implicit model, e.g., see the noise smoothing trick used in GANs [318, 290]. Given an observed dataset D = {X, y} and a set of test inputs X∗, Bayesian predictive inference computes the predictive distribution p(y∗|X∗, X, y, θ), which itself requires integrating over the posterior p(f|X, y, θ). Besides prediction, we also want to learn the model parameters θ and σ by maximizing the marginal likelihood

$$\log p(\mathbf{y}|X, \theta) = \log \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X, \theta)\, d\mathbf{f},$$

with f = f(X) the evaluation of f on the points in X. Unfortunately, both the prior p(f|X, θ) and the posterior p(f|X, y, θ) are intractable, as the implicit process does not allow point-wise density evaluation, let alone the marginalization tasks. To address this, we must resort to approximate inference.
We propose a generalization of the wake-sleep algorithm [114] to handle both intractabil-
ities. This method returns (i) an approximate posterior distribution q( f |X, y) which is later
used for predictive inference, and (ii) an approximation to the marginal likelihood p(y|X, θ )
for hyper-parameter optimization. We use the posterior of a GP to approximate the pos-
terior of the IP, i.e. q( f |X, y) = qGP ( f |X, y), since GP is one of the few existing tractable
distributions over functions. A high-level summary of our algorithm is the following:
• Sleep phase: sample function values f and noisy outputs y as indicated in (4.7). This
dreamed data is then used as the maximum-likelihood (ML) target to fit a GP. This is
equivalent to minimizing DKL [p(y, f|X, θ )||qGP (y, f|X)] for any possible X.
• Wake phase: The optimal GP posterior approximation qGP (f|X, y) obtained in the
sleep phase is used to construct a variational approximation to log p(y|X, θ ), which is
then optimized with respect to θ .
Our approach has two key advantages. First, the algorithm has no explicit sleep phase
computation, since the sleep phase optimization has an analytic solution that can be directly
plugged into the wake-phase objective. Second, the proposed wake phase update is highly
scalable, as it is equivalent to a Bayesian linear regression task with random features sampled
from the implicit process. With our wake-sleep algorithm, the evaluation of the implicit
prior density is no longer an obstacle for approximate inference. We call this inference
framework the variational implicit process (VIP). In the following sections we give specific
details on both the wake and sleep phases.
Below we also write the optimal solution as q⋆_GP(f|X, θ) = q_GP(f|X, M⋆, K⋆) to make the dependency on the prior parameters θ explicit.¹ In practice, the mean and covariance functions are estimated by Monte Carlo, which amounts to maximum likelihood training (MLE) of the GP on dreamed data from the IP. Assume S functions are drawn from the IP: f_s^θ(·) ∼ IP(g_θ(·,·), p_z), s = 1, ..., S. The optimum of U(M, K) is then estimated by the MLE solution:
$$M^\star_{\text{MLE}}(x) = \frac{1}{S}\sum_s f_s^\theta(x), \tag{4.10}$$
$$K^\star_{\text{MLE}}(x_1, x_2) = \frac{1}{S}\sum_s \Delta_s(x_1)\Delta_s(x_2), \qquad \Delta_s(x) = f_s^\theta(x) - M^\star_{\text{MLE}}(x). \tag{4.11}$$
To reduce computational cost, the number of dreamed samples S is often kept small. Therefore, we perform maximum a posteriori (MAP) estimation instead of MLE, by placing an inverse Wishart process prior [306] IWP(ν, Ψ) over the GP covariance function K (Appendix 4.A.3); a small sketch of the moment-matching step follows below.
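A minimal NumPy sketch of this moment-matching step (our illustration; the random-walk samples merely stand in for dreamed IP functions):

```python
import numpy as np

def moment_match(F):
    """Given F[s, n] = f_s^theta(x_n) for S dreamed functions at N inputs, return
    the MLE estimates (4.10)-(4.11): the empirical mean vector and the empirical
    (rank <= S) kernel matrix."""
    S = F.shape[0]
    m = F.mean(axis=0)                 # M*(x_n)
    Delta = F - m                      # Delta_s(x_n) = f_s(x_n) - M*(x_n)
    K = Delta.T @ Delta / S            # K*(x_i, x_j)
    return m, K, Delta

rng = np.random.default_rng(0)
F = np.cumsum(rng.normal(size=(20, 50)), axis=1)   # stand-in for S = 20 IP draws
m, K, Delta = moment_match(F)
```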
The original sleep phase algorithm in [114] also finds a posterior approximation by minimizing (4.9). However, the original approach would define the q distribution as q(y, f|X) = p(y|X, θ)q_GP(f|y, X), which builds a recognition model that can be directly transferred for later inference. By contrast, we define q(y, f|X) = p(y|f)q_GP(f|X), which corresponds to an approximation of the IP prior. In other words, we approximate an intractable generative model with another generative model that has a GP prior; later, the resulting GP posterior q⋆_GP(f|X, y) is employed as the variational distribution. Importantly, we never explicitly perform the sleep phase updates, i.e., the optimization of U(M, K), as an analytic solution is readily available, which can save a significant amount of computation.
1 This allows us to compute gradients w.r.t. θ through M⋆ and K⋆ using the reparameterization trick (by the definition of an IP, f(x) = g_θ(x, z)) during the wake phase in Section 4.3.2.
Another interesting observation is that the sleep phase objective U(M, K) also provides an upper bound on the KL divergence between the posterior distributions, J(M, K) := E_{p(y|X,θ)} D_KL[p(f|X, y, θ) ∥ q_GP(f|X, y)]. One can show that U upper-bounds J using the non-negativity and chain rule of the KL divergence:

$$U(M, K) = D_{KL}[p(\mathbf{y}, \mathbf{f}|X, \theta) \| q_{GP}(\mathbf{y}, \mathbf{f}|X)] = D_{KL}[p(\mathbf{y}|X, \theta) \| q_{GP}(\mathbf{y}|X)] + \mathbb{E}_{p(\mathbf{y}|X,\theta)}\, D_{KL}[p(\mathbf{f}|X, \mathbf{y}, \theta) \| q_{GP}(\mathbf{f}|X, \mathbf{y})] \ge J(M, K).$$

Therefore, J is also decreased when the mean and covariance functions are optimized during the sleep phase. This bounding property justifies U(M, K) as an appropriate variational objective for posterior approximation.
This again demonstrates the key advantage of the proposed sleep phase update via generative model matching. It is also a sensible objective for predictive inference, as the GP returned by wake-sleep will be used for making predictions.

As in GP regression, optimizing log q⋆_GP(y|X, θ) can be computationally expensive for large datasets. Sparse GP approximation techniques [315, 335, 109, 38] are applicable here, but we leave them to future work and instead consider an alternative approach related to random feature approximations of GPs [263, 78, 75, 15, 166].
Note that log q⋆_GP(y|X, θ) can be approximated by the log marginal likelihood of a Bayesian linear regression model with S randomly sampled dreamed functions as features and a standard Gaussian prior on the coefficients a:

$$\mu(x_n, a, \theta) = M^\star(x_n) + \frac{1}{\sqrt{S}}\sum_s \Delta_s(x_n)\, a_s, \qquad \Delta_s(x_n) = f_s^\theta(x_n) - M^\star(x_n), \qquad p(a) = \mathcal{N}(a; 0, I). \tag{4.15}$$
For scalable inference, we follow Li and Gal [177] and approximate (4.14) by the α-energy (see Section 2.2.3), with q_λ(a) = N(a; µ, Σ) and a mini-batch K ⊂ {1, ..., N} of size K:

$$\log q^\star_{GP}(\mathbf{y}|X, \theta) \approx \mathcal{L}^{GP}_\alpha(\theta, \lambda) = \frac{N}{\alpha K}\sum_{k \in \mathcal{K}} \log \mathbb{E}_{q_\lambda(a)}\left[q^\star(y_k|x_k, a, \theta)^\alpha\right] - D_{KL}[q_\lambda(a) \| p(a)]. \tag{4.16}$$
Recall that the optimal variational GP approximation has mean and covariance functions defined by (4.10) and (4.11), respectively, which means K_ff has rank S. Therefore, predictive inference requires both function evaluations and a matrix inversion, costing O(C(L + N)S + NS² + S³) time. This complexity can be further reduced: note that the computational cost is dominated by (K_ff + σ²I)^{-1}. Denote the Cholesky decomposition of the kernel matrix as K_ff = BB^⊤. It is straightforward to show that in the Bayesian linear regression problem (4.15) the exact posterior of a is q(a|X, y) = N(a; µ, Σ), with

$$\mu = \frac{1}{\sigma^2}\Sigma B^\top (\mathbf{y} - \mathbf{m}), \qquad \sigma^2 \Sigma^{-1} = B^\top B + \sigma^2 I.$$

Therefore, the parameters of the GP predictive distribution in (4.17) reduce to:
$$\mathbf{m}_* = M^\star(X_*) + \boldsymbol{\phi}_*^\top \mu, \qquad \Sigma_* = \boldsymbol{\phi}_*^\top \Sigma\, \boldsymbol{\phi}_*, \tag{4.18}$$

with the elements of φ∗ given by (φ∗)_s = Δ_s(x∗)/√S. This reduces the prediction cost to O(CLS + S³), which is on par with, e.g., conventional predictive inference techniques for Bayesian neural networks, which also cost O(CLS). In practice, we use the mean and covariance matrix of q(a) to compute the predictive distribution (a sketch follows below). Alternatively, one can directly sample a ∼ q(a) and compute the corresponding f∗ from (4.15), which is also an O(CLS + S³) inference approach but has higher variance.
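The following sketch assembles the pieces above (our illustration; it parameterizes the regression directly with the feature matrix Φ, for which K_ff = ΦΦ^⊤ also holds, rather than with the Cholesky factor B, which is equivalent here):

```python
import numpy as np

def vip_predict(F_train, F_test, y, noise_var):
    """Exact posterior over the S regression coefficients a in (4.15), and the
    resulting predictive mean/covariance (4.18). F_train[s, n] = f_s(x_n) on
    training inputs; F_test[s, l] on test inputs."""
    S = F_train.shape[0]
    m, m_star = F_train.mean(0), F_test.mean(0)
    Phi = (F_train - m).T / np.sqrt(S)            # N x S feature matrix
    Phi_star = (F_test - m_star).T / np.sqrt(S)   # L x S
    # q(a | X, y) = N(mu, Sigma), with sigma^2 Sigma^{-1} = Phi^T Phi + sigma^2 I
    Sigma = noise_var * np.linalg.inv(Phi.T @ Phi + noise_var * np.eye(S))  # O(S^3)
    mu = Sigma @ Phi.T @ (y - m) / noise_var
    mean_star = m_star + Phi_star @ mu            # predictive mean, (4.18)
    cov_star = Phi_star @ Sigma @ Phi_star.T      # predictive covariance
    return mean_star, cov_star

rng = np.random.default_rng(0)
F = np.cumsum(rng.normal(size=(20, 60)), axis=1)  # 20 dreamed functions, 60 inputs
y = F[:, :50].mean(0) + 0.1 * rng.normal(size=50)
mean_star, cov_star = vip_predict(F[:, :50], F[:, 50:], y, noise_var=0.01)
```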
4.4 Experiments
In this section, we test the capabilities of VIPs on various tasks, including time series interpolation, Bayesian NN/LSTM inference, and approximate Bayesian computation (ABC) with simulators. When the VIP is applied to a Bayesian NN/LSTM (Examples 4.3-4.4), the prior parameters over each weight are tuned individually. We use S = 20 for VIP unless noted otherwise. We focus on comparing VIPs as an inference method against other Bayesian approaches, with detailed experimental settings presented in Appendix 4.D.
Figure 4.2 First row: predictions returned by VIP (left), VDO (middle), and an exact GP with RBF + periodic kernel (right), respectively. Dark grey dots: noisy observations; dark line: clean ground-truth function; dark grey line: predictive means; grey shaded area: confidence intervals with 2 standard deviations. Second row: corresponding predictive uncertainties.
The inputs are centered, and the targets are standardized. We use the same settings as in Section 4.4.1, except that we run Adam with learning rate 0.001 for 5,000 iterations. Note that the GP/SVGP predictions are reproduced directly from [78].
Predictive interpolations are shown in Figure 4.5. We see that VIP and VDO give similar interpolation behaviors. However, VDO overall under-estimates uncertainty compared with VIP, especially in the interval [−100, 200]. VDO also incorrectly estimates the mean function around x = −150, where the ground truth is constant. On the contrary, VIP recovers the correct mean estimate around this interval with high confidence. The GP methods recover the exact mean of the training data with high confidence, but return poor estimates of the predictive means for interpolation. Quantitatively, the right two plots in Figure 4.3 show that VIP achieves the best NLL/RMSE performance, again indicating that it returns high-quality uncertainties and accurate mean predictions.
Figure 4.3 Test performance on the synthetic example (left two) and solar irradiance interpolation (right two).

Figure 4.4 Test performance on the clean energy dataset.
[335]), exact GP, and functional BNNs (fBNN)²; the results for fBNN are quoted from Sun et al. [330]. All neural networks have two hidden layers of size 10 and are trained for 1,000 epochs (except for fBNNs, where the cited results use 2,000 epochs). The observational noise variance for VIP and VDO is tuned on a validation set, as detailed in Appendix 4.D. The α value for both VIP and alpha-variational inference is fixed to 0.5, as suggested in [113]. The experiments are repeated 10 times on all datasets except Protein, for which we report results averaged over 5 runs.
Results are shown in Tables 4.1 and 4.2, with the best performances boldfaced. Note that our method is not directly comparable to the exact (full) GP and fBNN in the last two columns: they are only trained on the smaller datasets, since they require computing the exact GP likelihood, and fBNNs are trained for more epochs. Therefore, they are not included in the overall ranking shown in the last row of the tables. VIP methods consistently outperform the other methods, obtaining the best test NLL on 7 datasets and the best test RMSE on 8 of the 9 datasets. In addition, VIP-BNN obtains the best ranking among the 6 methods. Note also that VIP marginally outperforms exact GPs and fBNNs (4 of 5 in NLL), even though the comparison is tilted in their favor. Finally, it is encouraging to see that, despite its general form, VIP-NS achieves the second-best average ranking in RMSE, outperforming many specifically designed BNN algorithms.
2 fBNN is a recent inference method designed for BNNs, where functional priors (GPs) are used to regularize BNN training [330].
Figure 4.5 Interpolations returned by VIP (top), variational dropout (middle), and exact GP (bottom), respectively. The SVGP visualization is omitted as it looks nearly identical. Grey dots: training data; red dots: test data; dark dots: predictive means; light grey and dark grey areas: confidence intervals with 2 standard deviations on the training and test sets, respectively. Note that our GP/SVGP predictions reproduce [78].
We use a VIP with a prior defined by a Bayesian LSTM (200 hidden units) and α = 0.5.
We replicate the experimental settings in Bui et al. [36], Hernández-Lobato et al. [113],
except that our method directly takes raw sequential molecule structure data as input. We
compare our approach with a deep GP trained with expectation propagation [DGP, 36], variational dropout for LSTMs [VDO-LSTM, 76], alpha-variational inference LSTM [α-LSTM, 177], BB-α on BNNs [113], VI on BNNs [29], and FITC GP [315]. Results for the latter four methods are quoted from Hernández-Lobato et al. [113] and Bui et al. [36]. The results in Figure 4.4 show that VIP significantly outperforms the other baselines and achieves state-of-the-art results in test likelihood and RMSE.
$$\dot{x} = \theta_1 xy - \theta_2 x, \qquad \dot{y} = \theta_3 y - \theta_4 xy,$$
where x is the population of the predator and y is the population of the prey. The L-V model is therefore an implicit model: it allows simulation of data but not evaluation of the model density. We follow the setup of [253] to select the ground-truth parameters of the L-V model so that it exhibits oscillatory behavior, which makes posterior inference difficult. The L-V model is then simulated for 25 time units with a step size of 0.05, resulting in 500 training observations. The prediction task is to extrapolate the simulation to the [25, 30] time interval.
We consider (approximate) posterior inference using two types of approaches: regression-
based methods (VIP-BNN, VDO-BNN and SVGP), and ABC methods (MCMC-ABC [207]
and SMC-ABC [20, 30]). ABC methods first perform posterior inference in the parameter
space, then use the L-V simulator with posterior parameter samples for prediction. By
contrast, regression-based methods treat this task as an ordinary regression problem, where
VDO-BNN fits an approximate posterior to the NN weights, and VIP-BNN/SVGP perform
predictive inference directly in function space. Results are shown in Table 4.3, where VIP-
BNN outperforms others by a large margin in both test NLL and RMSE. More importantly,
VIP is the only regression-based method that outperforms ABC methods, demonstrating its
flexibility in modeling implicit systems.
In high-stakes applications such as health care, uncertainty quantification for neural networks has become increasingly important. Although decent progress has been made on Bayesian neural networks (BNNs) [56, 117, 17, 243, 91, 29, 111, 177], uncertainty in deep learning remains an open challenge.
Research on the GP-BNN correspondence has been extensively explored in order to improve the understanding of both worlds [242, 243, 365, 104, 75, 173, 214]. Notably, Neal [242] and Gal and Ghahramani [75] showed that a one-layer BNN with non-linearity σ(·) and a mean-field Gaussian prior is approximately equivalent to a GP with kernel function

$$K(x_1, x_2) = \mathbb{E}_{p(w)p(b)}\left[\sigma(w^\top x_1 + b)\, \sigma(w^\top x_2 + b)\right].$$

Later, Lee et al. [173] and Matthews et al. [214] showed that a deep BNN is approximately equivalent to a GP with a compositional kernel [42, 108, 50, 257] that mimics the deep network. These approaches allow us to construct expressive kernels for GPs [159] or, conversely, to exploit exact Bayesian inference in GPs to perform exact Bayesian prediction with BNNs [173]. The above kernel is compared with Equation (4.11) in Appendix 4.C.
Alternative schemes have also been investigated to exploit deep structures in GP model design. These include: (1) deep GPs [49, 36], where compositions of GP priors represent priors over compositional functions; (2) the search for and design of kernels for accurate and efficient learning [346, 65, 337, 21, 293]; and (3) deep kernel learning, which uses deep neural network features as inputs to a GP [116, 367, 4, 34, 130]. Frustratingly, the first two approaches still struggle to model high-dimensional structured data such as text and images, and the third approach is only Bayesian w.r.t. the last output layer.
The intention of our work is neither to understand BNNs as GPs nor to use deep learning to aid GP design. Instead, we directly treat a BNN as an instance of an implicit process (IP), and the GP is used as a variational distribution to assist predictive inference. This approximation requires neither the assumptions of the GP-BNN correspondence literature [173, 214] nor the conditions of the compositional kernel literature. Therefore, the VIP approach retains some of the benefits of Bayesian nonparametric approaches and avoids issues of weight-space inference such as symmetric posterior modes.
To a certain extent, the approach of Flam-Shepherd et al. [69] resembles an inverse of VIP: it encodes properties of GP priors into BNN weight priors, which are then used to regularize BNN inference. This idea is further investigated in concurrent work on functional BNNs [330], where GP priors are directly used to regularize BNN training through gradient estimators [309].
The concurrent work on neural processes [80] resembles the neural sampler, a special case of IPs. However, it performs inference in z space using the variational auto-encoder approach [149, 273], which is not applicable to other IPs such as BNNs. By contrast, the proposed VIP approach applies to any IP and performs inference in function space. In the experiments, we also showed improved accuracy of the VIP approach on neural samplers over many existing Bayesian approaches.
4.6 Conclusions
We presented a variational approach for learning and Bayesian inference in function space, based on implicit process priors. It provides a powerful framework that combines the rich flexibility of implicit models with the well-calibrated uncertainty estimates of (parametric/nonparametric) Bayesian models. As an example, with BNNs as the implicit process prior, our approach outperformed many existing GP/BNN methods and achieved significantly improved results on molecule regression data. Many directions remain to be explored: better posterior approximation methods beyond GP prior matching in function space; classification models with implicit process priors; and implicit process latent variable models, derived in a similar fashion to Gaussian process latent variable models. A promising application direction would be investigating novel inference methods for models equipped with other implicit process priors, e.g., data simulators in astrophysics, ecology, and climate science.
For any finite collection of random variables y_{1:n} = {y_1, ..., y_n}, ∀n, denote the induced distribution by p_{1:n}(y_{1:n}). Note that p_{1:n}(y_{1:n}) can be represented as E_{p(z)}[∏_{i=1}^n N(y_i; g(x_i, z), σ²)]. Therefore, for any m < n, we have

$$\int p_{1:n}(y_{1:n})\, dy_{m+1:n} = \int\!\!\int \prod_{i=1}^n \mathcal{N}(y_i; g(x_i, z), \sigma^2)\, p(z)\, dz\, dy_{m+1:n} = \int\!\!\int \prod_{i=1}^n \mathcal{N}(y_i; g(x_i, z), \sigma^2)\, p(z)\, dy_{m+1:n}\, dz = \int \prod_{i=1}^m \mathcal{N}(y_i; g(x_i, z), \sigma^2)\, p(z)\, dz = p_{1:m}(y_{1:m}).$$
Note that swapping the order of integration relies on the integral being finite, which holds when the prior p(z) is proper. Therefore, the marginal consistency condition of the Kolmogorov extension theorem is satisfied. Similarly, the permutation consistency condition can be verified as follows: for a permutation π(1:n) = {π(1), ..., π(n)},
$$p_{\pi(1:n)}(y_{\pi(1:n)}) = \int \prod_{i=1}^n \mathcal{N}(y_{\pi(i)}; g(x_{\pi(i)}, z), \sigma^2)\, p(z)\, dz = \int \prod_{i=1}^n \mathcal{N}(y_i; g(x_i, z), \sigma^2)\, p(z)\, dz = p_{1:n}(y_{1:n}).$$
Therefore, by the Kolmogorov extension theorem, there exists a unique stochastic process whose finite marginals are distributed exactly according to Definition 4.5.
Proof. Since L²(R^d) is closed under finite summation, without loss of generality we consider the case M = 1, where O_K(z)(x) = h(∫K(x, x′)z(x′)dx′). According to the Karhunen-Loève (K-L) expansion theorem [194], the stochastic process z can be expanded as the stochastic infinite series

$$z(x) = \sum_i^\infty Z_i \phi_i(x), \qquad \sum_i^\infty \lambda_i < +\infty,$$

where the Z_i are zero-mean, uncorrelated random variables with variances λ_i. Here {φ_i}_{i=1}^∞ is an orthonormal basis of L²(R^d) whose elements are also eigenfunctions of the operator O_C defined by O_C(z)(x) = ∫C(x, x′)z(x′)dx′; the variance λ_i of Z_i is the eigenvalue corresponding to φ_i(x).
Applying the linear operator

$$O_K(z)(x) = \int K(x, x')\, z(x')\, dx'$$

to this expansion gives

$$O_K(z)(x) = \sum_i^\infty Z_i \int K(x, x')\, \phi_i(x')\, dx', \tag{4.A.1}$$

where the exchange of summation and integration is guaranteed by Fubini's theorem. Therefore, the functions {∫K(x, x′)φ_i(x′)dx′}_{i=1}^∞ form a new basis of L²(R^d). To show that the stochastic series (4.A.1) converges:
$$\Big\| \sum_i^\infty Z_i \int K(x, x')\, \phi_i(x')\, dx' \Big\|_{L^2}^2 \le \|O_K\|^2 \Big\| \sum_i^\infty Z_i \phi_i \Big\|_{L^2}^2 = \|O_K\|^2 \sum_i^\infty \|Z_i\|_2^2.$$
This norm is well defined since O_K is a bounded operator (K ∈ L²(R^d × R^d)). The last equality follows from the orthonormality of {φ_i}. The condition Σ_i^∞ λ_i < ∞ further guarantees that Σ_i^∞ ‖Z_i‖² converges almost surely. Therefore, the random series (4.A.1) converges in L²(R^d) a.s.
Finally, we consider the nonlinear mapping h(·). With h(·) a Borel-measurable function satisfying the condition that there exists 0 ≤ A < +∞ such that |h(x)| ≤ A|x| for all x ∈ R, it follows that h ∘ O_K(z) ∈ L²(R^d). In summary, g(x, z) = (h ∘ O_K(z))(x) defines a well-defined stochastic process on L²(R^d).
Despite its simple form, the operator g = h ∘ O_K(z) is in fact a building block for many flexible transformations over functions [97, 365, 327, 169, 84]. Recently, Guss [97]
proposed so-called Deep Function Machines (DFMs), which possess universal approximation ability for nonlinear operators:

Definition 4.6 (Deep function machines [97]). A deep function machine g = O_DFM(z, S) is a computational skeleton S indexed by I with the following properties:
U(M, K) = D_KL[p(f, y|X, θ) ∥ q_GP(f, y|X, M(·), K(·, ·))]. Since we use q(y|f) = p(y|f), this reduces U(M, K) to D_KL[p(f|X, θ) ∥ q_GP(f|X, M, K)]. To obtain the optimal solution w.r.t. U(M, K), it suffices to draw S fantasy functions (each sample being a random function f_s(·)) from the prior distribution p(f|X, θ) and perform moment matching, which gives exactly the MLE solution, i.e., the empirical mean and covariance
functions:

$$M^\star_{\text{MLE}}(x) = \frac{1}{S}\sum_s f_s(x), \tag{4.A.2}$$
$$K^\star_{\text{MLE}}(x_1, x_2) = \frac{1}{S}\sum_s \Delta_s(x_1)\Delta_s(x_2), \tag{4.A.3}$$
$$\Delta_s(x) = f_s(x) - M^\star_{\text{MLE}}(x). \tag{4.A.4}$$
$$K^\star_{\text{MAP}}(x_1, x_2) = \frac{1}{\nu + S + N + 1}\Big\{\sum_s \Delta_s(x_1)\Delta_s(x_2) + \Psi(x_1, x_2)\Big\}. \tag{4.A.5}$$
Here N is the number of data points in the training set X at which M(·) and K(·, ·) are evaluated. Alternatively, one could use the posterior mean (PM) estimator, which minimizes the posterior expected squared loss:
$$K^\star_{\text{PM}}(x_1, x_2) = \frac{1}{\nu + S - N - 1}\Big\{\sum_s \Delta_s(x_1)\Delta_s(x_2) + \Psi(x_1, x_2)\Big\}. \tag{4.A.6}$$
In our implementation, we choose the K_PM estimator with ν = N and Ψ(x₁, x₂) = ψδ(x₁, x₂). The hyperparameter ψ is selected by fast grid search, using the same procedure as for the noise variance parameter, as detailed in Appendix 4.D.
Therefore, by the non-negativity of the KL divergence, we have J(M, K) ≤ U(M, K). Since we select q(y|f) = p(y|f), the optimal solution of U(M, K) also minimizes D_KL[p(y|X, θ) ∥ q_GP(y|X)] + J(M, K). Therefore, not only is the upper bound U optimized in the sleep phase, but the gap is also decreased when the mean and covariance functions are optimized.
With a prior p(θ) and an approximate posterior q(θ) over the IP parameters, the objective for learning becomes E_{q(θ)}[L_α^GP(θ, λ)] − D_KL[q(θ) ∥ p(θ)]. Compared with the approximate MLE method, the only extra term that needs to be estimated is −D_KL[q(θ) ∥ p(θ)]. Note that introducing q(θ) will double the number of parameters. In the case of a Bayesian NN as an IP, where θ contains the means and variances of the weight priors, a simple Gaussian q(θ) will need two sets of mean and variance variational parameters (i.e., posterior means of means, posterior variances of means, posterior means of variances, and posterior variances of variances). Therefore, to keep the representation compact, we choose q(θ) to be a Dirac delta function δ(θ_q), which results in an empirical Bayes solution.
An alternative approach is the following: instead of explicitly specifying the form and hyperparameters of p(θ), we can note the standard variational lower bound

$$\log q_{GP}(\mathbf{y}|X) \ge \mathbb{E}_{q(\theta)}[\log q_{GP}(\mathbf{y}|X, \theta)] - D_{KL}[q(\theta) \| p(\theta)].$$

Therefore, we can use −log q_GP(y|X, θ_q) as the regularization term instead, which penalizes parameter configurations whose full marginal log-likelihood (as opposed to the diagonal likelihood in the original BB-α energy, (1/α)Σ_n^N log E_{q(z)q(θ)}[q_GP(y_n|x_n, z, θ)^α]) is too high, especially the contribution from non-diagonal covariances. We refer to this as likelihood regularization. In practice, −log q_GP(y|X, θ_q) is estimated on each mini-batch.
This is an example of a KL divergence in function space (i.e., on the output f). Generally speaking, we may assume that p(f) = ∫_W p(f|W)p(W)dW and q(f) = ∫_W p(f|W)q(W)dW, where q(W) is a weight-space variational approximation. That is to say, both stochastic processes p and q can be generated from a finite-dimensional weight-space representation W. This can be seen as a one-step Markov chain with previous state s_t = W, new state s_{t+1} = f, and probability transition function r(s_{t+1}|s_t) = p(f|W). Then, by applying the second law of thermodynamics for Markov chains (Cover and Thomas [44], i.e., the data processing inequality for KL divergences), we have

$$D_{KL}[q(\mathbf{f}) \| p(\mathbf{f})] \le D_{KL}[q(W) \| p(W)].$$

This shows that the KL divergence in function space gives a tighter bound than the KL divergence in weight space, which is one of the merits of function space inference.
Here σ(·) is a non-linear activation function, w is a vector of length D, b is a scalar bias, and p(w), p(b) are the corresponding prior distributions. Gal and Ghahramani [75] considered approximating this GP with a one-hidden-layer BNN ŷ(·) = BNN(·, θ), with θ collecting the weights and bias vectors of the network. Denote the weight matrix of the first layer by W ∈ R^{D×K}, i.e., the network has K hidden units, and its kth column by w_k; similarly, the bias vector is b = (b₁, ..., b_K). We further assume that the prior distributions of the first-layer parameters factorize as p(W) = ∏_{k=1}^K p(w_k) and p(b) = ∏_{k=1}^K p(b_k), and use a mean-field Gaussian prior for the output layer. This BNN then constructs an approximation to the GP kernel as:
\tilde{K}_{\mathrm{VDO}}(x_1, x_2) = \frac{1}{K}\sum_k \sigma(w_k^\top x_1 + b_k)\,\sigma(w_k^\top x_2 + b_k), \qquad w_k \sim p(w),\; b_k \sim p(b).
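A minimal Monte Carlo sketch of this kernel approximation is given below; the ReLU activation and the standard normal priors for w_k and b_k are illustrative assumptions, not the exact choices of [75].

```python
import numpy as np

def k_vdo(x1, x2, K=5000, rng=None):
    """Monte Carlo estimate of the single-hidden-layer kernel
    K_VDO(x1, x2) = (1/K) sum_k sigma(w_k.x1 + b_k) * sigma(w_k.x2 + b_k),
    with assumed priors w_k ~ N(0, I) and b_k ~ N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    D = len(x1)
    W = rng.normal(size=(K, D))           # w_k ~ p(w)
    b = rng.normal(size=K)                # b_k ~ p(b)
    relu = lambda a: np.maximum(a, 0.0)   # sigma(.) = ReLU here
    return np.mean(relu(W @ x1 + b) * relu(W @ x2 + b))

x1, x2 = np.array([0.3, -1.2]), np.array([0.5, 0.7])
print(k_vdo(x1, x2))
```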
• Variational dropout (VDO) for BNNs: similar to Gal and Ghahramani [75], we fix the length-scale parameter to 0.5 · l² = 10e−6. Since the network size is relatively small, the dropout probability is set to 0.005 or 0.0005. We use 2000 forward passes to evaluate the posterior predictive likelihood.
• α-dropout inference for BNNs: as suggested by Li and Gal [177], we fix α = 0.5, which often gives high-quality uncertainty estimates, possibly because it achieves a balance between reducing training error and improving predictive likelihood. We use K = 10 for MC sampling.
• Variational sparse GPs and exact GPs: we implement the GP-related algorithms using GPflow [215]. Variational sparse GPs use 50 inducing points. Both GP models use the RBF kernel.
• Noise variance grid search for VIPs (VIP-BNN and VIP-NS), VDOs, and α-dropout: we start with a random noise variance value, run optimization on the model parameters, and then perform a coarse grid search over the noise variance on a validation set. Finally, we retrain the model on the entire training set using the selected noise variance value (a minimal sketch of this procedure is given below). This coordinate-ascent-like procedure does not require training the model multiple times as in Bayesian optimization, and therefore speeds up the learning process. The same procedure is used to search for the optimal hyperparameter ψ of the inverse-Wishart process of VIPs.
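The following sketch illustrates this coordinate-ascent-like procedure; train_fn and val_nll_fn are hypothetical placeholders for the model-specific training and validation routines.

```python
import numpy as np

def fit_noise_variance(train_fn, val_nll_fn, grid=np.logspace(-3, 0, 8)):
    """Coordinate-ascent-like search sketched above:
    (1) optimize model parameters once with an initial noise variance,
    (2) grid-search the noise variance on a validation set,
    (3) retrain on the entire training set with the selected value."""
    params = train_fn(noise_var=float(grid[0]))            # step (1)
    val_nlls = [val_nll_fn(params, nv) for nv in grid]     # step (2)
    best_nv = float(grid[int(np.argmin(val_nlls))])
    return train_fn(noise_var=best_nv, full_data=True), best_nv  # step (3)

# Toy stand-ins: validation NLL of a zero-mean Gaussian with variance nv.
toy_train = lambda noise_var, full_data=False: {"noise_var": noise_var}
toy_nll = lambda params, nv: np.log(nv) + 0.5 / nv  # minimized near nv = 0.5
model, best = fit_noise_variance(toy_train, toy_nll)
print(best)
```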
created for each training data point, which is prohibitive in terms of memory. Therefore, in our implementation, we enforce all training data to share K sample paths. This approximation is accurate since we use a small dropout rate of 0.005.
In Chapter 4, we described two different approaches to function space inference: the model-driven approach and the algorithm-driven approach. The model-driven approach
starts from an existing example of Bayesian nonparametric priors (in our case, the Gaussian
process), extends it to a more flexible class of priors, and then develops the corresponding
approximate inference algorithms. On the other hand, the algorithm-driven approach starts
from an existing inference method for parameter-space, and develops its function-space
counterpart.
So far we have presented our first approach to function space inference (VIP), which is based on GP posterior approximations under implicit process priors. Roughly speaking, in the VIP method the true posterior of an IP, p(f|D), is approximated by an approximate posterior of the following form (denoted by q_VIP(f|D)):
where the basis functions {φ_s}_{s=1}^S are random samples drawn directly from p(f).
Despite having demonstrated empirical advantages over weight-space inference methods, VIPs still suffer from a number of issues, detailed in the following remarks.
Remark (Limitations of VIPs). VIP clearly has a few limitations that need to be addressed.
To start with, its approximate posterior, qVIP ( f |D) resembles a GP approximation to p( f |D),
whereas the true posterior in function space might be arbitrarily complex. Therefore, GP
approximations might not be able to capture non-GP behaviors.
Also, VIP requires the implicit prior p(f) to be reparameterizable, that is: 1), p(f) should take the form f(x) = g_θ(x, z), z ∼ p(z); 2), the function g_θ(·, ·) must be differentiable; and 3), the functional form of g_θ(·, ·) must be fully known in advance. These assumptions may limit its applicability to more complicated priors such as structured implicit priors [330]. Lastly, the wake-sleep procedure of VIP does not optimize a coherent function-space objective function (such as the ELBO usually used in parameter-space VI).
This chapter is organized as follows. In Section 5.1, we formalize the framework of functional variational inference and explain the basic concepts: functional KL divergences between stochastic processes, the functional ELBO, and functional Bayesian neural networks. We also introduce the theoretical pathologies of functional KL divergence minimization. In Section 5.2, we propose a better-behaved functional divergence called the grid-functional KL divergence, and use it as the objective function for function-space inference. We derive a number of theoretical results and showcase how the corresponding ELBO can be calculated. In Section 5.3, we introduce the definition of Stochastic Process Generators (SPGs), and derive efficient estimators of the functional ELBO based on SPGs (Section 5.4). Finally, in Section 5.7, we apply FVI to several tasks and showcase its experimental performance.
Note that DKL[q(f)||p(f|D)] is the functional KL divergence between the stochastic processes q(f) and p(f|D). Unfortunately, neither measure q(f) nor p(f|D) has a convenient density form¹. Therefore, DKL[q(f)||p(f|D)] can only be defined via the following measure-theoretic definition.
1 Since there is no “useful” infinite-dimensional Lebesgue measure [213].
Proposition 2.1 shows that when Q is absolutely continuous w.r.t. P (i.e., Q ≪ P), DKL[Q||P] can be expressed as

D_{KL}[Q||P] = \int_{\Omega} \log \frac{dQ}{dP}\, dQ, \qquad (5.1.3)

where dQ/dP denotes the Radon–Nikodym derivative of Q w.r.t. P.
In practice, the definition given by Equation (5.1.3) is not very convenient to work with. Fortunately, as shown by [330], DKL[q(f)||p(f)] can be expressed in terms of finite-dimensional densities:

D_{KL}[q(f)||p(f|\mathcal{D})] = \sup_{n \in \mathbb{Z}^+,\, X_n \in \mathcal{T}^n} D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})], \qquad (5.1.4)

where X_n denotes a set of n measure points {x_k}_{1≤k≤n} in the domain/index set of f(·), which can be treated as an element of the product space T^n, and f^{X_n} denotes the vector of function values evaluated on X_n. Since f^{X_n} is a finite-dimensional vector, the density functions q(f^{X_n}) and p(f^{X_n}|D) exist under mild conditions, and DKL[q(f^{X_n})||p(f^{X_n}|D)] can be computed using those densities. Here, we have slightly abused notation and use p(·) to denote both the probability measure p(f) itself and the corresponding density functions p(f^{X_n}) over its finite-dimensional function values f^{X_n}. In other words, the KL divergence between stochastic processes is the supremum of the relative entropies obtained on all possible measure points in T^{Z^+} (Figure 5.1).
Functional ELBO. Similar to parameter-space VI, minimizing the functional KL divergence (5.1.4) is equivalent to maximizing the evidence lower bound (ELBO) in function space:

\mathcal{L}_q^{\mathrm{functional}} := \mathbb{E}_{q(f)}[\log p_{\pi}(\mathcal{D}|f)] - D_{KL}[q(f)||p(f)], \qquad (5.1.5)

where DKL[q(f)||p(f)] is the functional KL divergence between q(f) and p(f). We call L_q^{functional} the functional ELBO. In the context of machine learning, the above formulation of functional ELBO maximization was first used in the concurrent work on functional Bayesian neural networks (f-BNNs) [330]. To a certain extent, f-BNN is the reverse of our VIP method: instead of using GPs as posterior approximations, it uses GPs as priors in function space and performs VI using BNNs as the approximate posterior. That is, f-BNNs optimize the functional ELBO where p_GP(f) is a GP prior and q_BNN(f) is a BNN. The gradients of DKL[q_BNN(f)||p_GP(f)] are estimated via the Stein gradient estimator [181].
For implicit processes, both ∇_{f^{X_n}} log q_λ(f^{X_n}) and ∇_{f^{X_n}} log p(f^{X_n}) can be intractable. To solve this problem, the Spectral Stein Gradient Estimator (SSGE) can estimate ∇_{f^{X_n}} log q_λ(f^{X_n}) (or ∇_{f^{X_n}} log p(f^{X_n})) using only samples from q_λ(f^{X_n}) (or p(f^{X_n})). Given a kernel function K(f^{X_n}, f'^{X_n}) satisfying

\int_{\mathbf{f}^{X_n} \in \mathbb{R}^n} \nabla_{\mathbf{f}^{X_n}} \big[K(\mathbf{f}^{X_n}, \mathbf{f}'^{X_n})\, p(\mathbf{f}^{X_n})\big] \, d\mathbf{f}^{X_n} = \mathbf{0}, \qquad \forall \mathbf{f}'^{X_n} \in \mathbb{R}^n, \qquad (5.1.8)
While f-BNN is one of the first methods to highlight the importance of function space inference, it suffers from the limitations of the SSGE approach to functional ELBOs.
• Second, even if we assume that the functional KL divergence is finite, it still raises a computational issue. In Equation (5.1.4), computing the functional KL divergence requires finding the supremum over all possible measure point locations and sizes. This is unfortunately very difficult, and has been found to be prone to overfitting [330].
• Another issue is that (spectral) Stein gradient estimators have been shown to be less efficient for high-dimensional distributions [379, 85]. When estimating ∇_{f^{X_n}} log q_λ(f^{X_n}), the dimensionality of f^{X_n} could be very large (depending on the optimal n and X_n that maximize Equation 5.1.4). Technically, the size of the optimal X_n should be larger than the size of D, which makes SSGE not scalable to large-data settings (as the computational cost of SSGE scales linearly with the dimensionality of f^{X_n}).
(a) functional BNN  (b) Mean-field VI BNN  (c) HMC BNN  (d) Ours (FVI on BNN)
Figure 5.2 A regression task on a synthetic dataset (red crosses) from [70]. We plot the predictive mean and uncertainties for each algorithm. This task is used to demonstrate the theoretical pathologies of weight-space VI for single-layer BNNs: there is no setting of the variational parameters that can model the in-between uncertainty between the two data clusters. Functional BNNs [330] also have this problem, since mean-field BNNs are used as part of the model. On the contrary, our FVI method can produce sensible uncertainty estimates. See Appendix 5.C.2 for more details.
These limitations motivate us to propose new objective functions for function-space VI, as well as more scalable gradient estimation methods, which will be introduced in the next section.
both n and X_n. We specify a probability distribution {p_n}_{1≤n<∞} over n ≥ 1, assigning low probability values p_n to larger n. This way, we can trim down the contributions from the KL terms DKL[q(f^{X_n})||p(f^{X_n}|D)] that correspond to large n, and hopefully arrive at a finite expectation.
Following this idea, we give the formal definition of the grid-functional KL divergence:

D_{\mathrm{grid}}[q(f)||p(f|\mathcal{D})] := \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})], \qquad (5.2.1)

where X_n is a set of n measurement points {x_k}_{1≤k≤n} sampled from T according to some sampling distribution c.
Here, c is defined on the product space T^{Z^+}, and the number of sampled measure points n is also random. One may recognize that [330] proposed a similar objective as an approximation to Equation (5.1.4), in which the number of measure points n is a fixed constant instead of a random variable. Note that D_grid[q(f)||p(f|D)] is not an approximation to DKL[q(f)||p(f|D)]: it is a valid functional divergence in its own right, and we propose to use it as a first principle for function-space VI. Indeed, we can prove that D_grid[q(f)||p(f|D)] is a valid divergence under mild conditions:
Proposition 5.1. Suppose c has full support on T^{Z^+}. Then, D_grid[q(f)||p(f|D)] satisfies the following conditions: i), D_grid[q(f)||p(f|D)] ≥ 0; ii), D_grid[q(f)||p(f|D)] = 0 if and only if q(f) = p(f|D).
In other words, if c has full support on T^{Z^+} (we will give an example of such a c later), then D_grid[q(f)||p(f|D)] is a valid divergence in function space. Therefore, we can use D_grid as an alternative objective for function-space inference. Now, the question becomes: is D_grid[q(f)||p(f|D)] better behaved than DKL[q(f)||p(f|D)]? Indeed it is. In fact, we can show that in certain scenarios, D_grid avoids some of the issues of the original functional KL divergence (see Appendix 5.A.2 for details):
Proposition 5.2. Let p(f) and q(f) be two distributions over random functions. Assume that p(f) is parameterized by the following sampling process:
This result shows that the grid-functional KL divergence allows us to perform VI between p and q even if they have different parametric forms. Moreover, in Appendix 5.A.3, Corollary 5.1, we show that if one of the distributions is replaced by a Gaussian process (with a certain kernel function), then under some additional assumptions the grid-functional KL is still finite. In those cases, the original functional KL is no longer finite. This validates our choice of the grid-functional KL divergence. To use D_grid[q(f)||p(f|D)] for VI, we further derive a new ELBO based on D_grid (Appendix 5.A.1):
Then we have:

\mathcal{L}_q^{\mathrm{grid}} = \mathbb{E}_{q(f)}[\log p(\mathcal{D}|f)] - D_{\mathrm{grid}}[q(f)||p(f)], \qquad (5.2.3)

and \log p(\mathcal{D}) \geq \mathcal{L}_q^{\mathrm{grid}} \geq \mathcal{L}_q^{\mathrm{functional}}.
Proposition 5.3 shows that L_q^{grid} is a valid variational objective: it is a lower bound on log p(D), and it also upper-bounds L_q^{functional}. For the rest of the chapter, we will discuss how to perform functional VI based on L_q^{grid}. We will focus on how to propose an expressive variational family q(f), and how to efficiently estimate D_grid[q(f)||p(f)].
Remark (Choice of c). One example of c that satisfies the requirements of Propositions 5.1, 5.2, and 5.3 takes the following form (which will be used throughout the chapter):

(n - |\mathcal{D}|) \sim \mathrm{Geom}(p), \quad x_k \sim U(\mathcal{T}),\; \forall 1 \leq k \leq n - |\mathcal{D}|, \quad X_n := X_{\mathcal{D}} \cup \{x_k\}_{1 \leq k \leq n - |\mathcal{D}|}, \qquad (5.2.4)

where we first sample n from a geometric distribution, such that (n − |D|) ∼ Geom(p) with parameter p (see Appendix 5.A.2 for more discussion). Then, (n − |D|) out-of-distribution (OOD) measure points are sampled independently from a uniform distribution on T.
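A minimal sketch of sampling X_n from this choice of c is given below, assuming a box-shaped index set T = [low, high]^d; note that NumPy's geometric distribution is supported on {1, 2, ...}, so we shift it to obtain a count on {0, 1, ...} (the Geom(p) convention here is our assumption).

```python
import numpy as np

def sample_measure_points(X_D, p=0.5, low=0.0, high=1.0, rng=None):
    """Draw X_n ~ c as in Eq. (5.2.4): keep all training inputs X_D and
    append (n - |D|) ~ Geom(p) out-of-distribution points, each drawn
    uniformly from the (assumed box-shaped) index set T = [low, high]^d."""
    rng = rng or np.random.default_rng(0)
    n_ood = rng.geometric(p) - 1   # shift NumPy's {1,2,...} support to {0,1,...}
    ood = rng.uniform(low, high, size=(n_ood, X_D.shape[1]))
    return np.concatenate([X_D, ood], axis=0)

X_D = np.random.default_rng(1).uniform(0, 1, size=(16, 2))
print(sample_measure_points(X_D).shape)
```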
where {a_i}_{i≥1} is a set of zero-mean random variables (which can be non-Gaussian), and {φ_i}_{i=1}^∞ forms a set of orthonormal basis functions of L²(R^d)². Furthermore, we can show that {φ_i}_{i=1}^∞ are exactly the eigenfunctions of the operator O_C(f)(x) = \int C(x, x') f(x')\, dx', where C(·, ·) is the covariance function of the process.
2 Assuming it is L²(R^d)-integrable.
(a) Illustration of the Karhunen–Loève expansion theorem.
where {φ_s(·, w_s)}_{1≤s≤S} is a set of real-valued functions parameterized by parameters {w_s}_{1≤s≤S}, and ν is a white noise process that models additive aleatoric uncertainty (as well as the approximation error). For the practical parameterization of φ_s(·, w_s), since the covariance function of p(f|D) is unknown, we may let each φ_s : R^d → R be an individual flexible deep neural network with weights w_s. In order to compensate for the finite truncation approximation, we assume that these weights w_s are learnable via optimization (introduced later). The only missing part is the parameterization of the non-Gaussian variables a = (a_1, ..., a_S). This is done by specifying an implicit distribution
(Section 2.2.4) on a:

q(\mathbf{a}) = \int_{\mathbf{h}} p_{\theta}(\mathbf{a}|\mathbf{h})\, q_{\eta}(\mathbf{h})\, d\mathbf{h}, \qquad (5.3.3)
where h is the latent variable of q(a), the conditional distribution p_θ(a|h) is parameterized by θ, and q_η(h) is some distribution over the latent space. One can immediately recognize that q(a) is nothing more than a variational auto-encoder (VAE) [150], by noticing that p_θ(a|h) is just the decoder and q_η(h) is the prior on the latent space. We can finally give the definition of our SPG variational family q_SPG(f|q_η(h)) (Figure 5.4b):
f = \sum_s a_s \phi_s(\cdot, w_s) + \nu, \qquad \mathbf{a} \sim \int_{\mathbf{h}} p_{\theta}(\mathbf{a}|\mathbf{h})\, q_{\eta}(\mathbf{h})\, d\mathbf{h}, \qquad \nu \sim \mathcal{GP}(\nu; 0, \delta(\cdot,\cdot)\sigma_{\nu}^2). \qquad (5.3.4)
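The sampling process in Equation (5.3.4) can be sketched as follows; the tanh random-feature bases and the linear-Gaussian decoder are illustrative stand-ins for the neural networks φ_s and p_θ(a|h).

```python
import numpy as np

def sample_spg(x, basis_params, decoder, mu_eta, sigma_eta,
               noise_var=0.1, rng=None):
    """One function sample from an SPG (Eq. 5.3.4), as a sketch:
    h ~ q_eta(h), a ~ p_theta(a|h) via `decoder`,
    f(x) = sum_s a_s * phi_s(x; w_s) + white noise."""
    rng = rng or np.random.default_rng(0)
    h = mu_eta + sigma_eta * rng.normal(size=mu_eta.shape)   # h ~ q_eta(h)
    a = decoder(h, rng)                                      # a ~ p_theta(a|h)
    # phi_s: tanh random-feature bases standing in for small neural nets
    Phi = np.stack([np.tanh(x @ w + b) for (w, b) in basis_params], axis=1)
    return Phi @ a + np.sqrt(noise_var) * rng.normal(size=len(x))

rng = np.random.default_rng(0)
S, H, d, n = 10, 4, 1, 50
x = np.linspace(-2, 2, n).reshape(n, d)
basis = [(rng.normal(size=(d,)), rng.normal()) for _ in range(S)]
W_dec = rng.normal(size=(S, H))
decoder = lambda h, r: W_dec @ h + 0.1 * r.normal(size=S)    # Gaussian decoder
f = sample_spg(x, basis, decoder, np.zeros(H), np.ones(H))
print(f.shape)  # (50,)
```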
Intuitively, qSPG serves as a generator for stochastic processes. qSPG ( f |·) maps any given
qη (h) to a stochastic process qSPG ( f |qη (h)), hence the name. Regarding the expressiveness
of SPGs, we have the following result:
where MMD is the maximum mean discrepancy, and F is the MMD function class, defined to be a unit ball in an RKHS with a universal kernel [324] k(·, ·) as its reproducing kernel.
Next, we will discuss how {w_s}, θ, and q_η(h) can be estimated, and how we can use SPGs to estimate the function-space KL divergence in Equation 5.1.4.
Remark (Relation to VIP). SPGs can be seen as a non-Gaussian extension of the variational approximation used in VIP. Recall that in VIP, the variational family q_VIP(f) is defined by the following sampling process:
where {φ_s}_{s=1}^S are random paths sampled from the prior process p(f). In SPGs, we essentially remove the Gaussian assumption on a by specifying the non-Gaussian implicit model q(a). Furthermore, SPGs remove the constraint that {φ_s}_{s=1}^S need to be sampled from p(f).
where p0 (h) is a fixed standard normal distribution. We denote the above process by
p̃SPG ( f |p0 (h)). We can train p̃SPG ( f ) on f1 , f2 , ..., fm , ..., fM , by optimizing the aggregated
ELBO on p̃_SPG(f):

\max_{\{w_s\},\theta,\lambda} \mathbb{E}_{X^O} \sum_m \log \tilde{p}_{\mathrm{SPG}}(\mathbf{f}_m^{X^O}) \geq \max_{\{w_s\},\theta,\lambda} \mathbb{E}_{X^O} \sum_m \mathbb{E}_{\tilde{q}_{\lambda}(\mathbf{h}|\mathbf{f}_m^{X^O})} \log \frac{\tilde{p}_{\mathrm{SPG}}(\mathbf{f}_m^{X^O}|\mathbf{h})\, p_0(\mathbf{h})}{\tilde{q}_{\lambda}(\mathbf{h}|\mathbf{f}_m^{X^O})}, \qquad (5.4.2)

where X^O is a set of |O| ≤ |D| measure points independently sampled from the training set X_D = {x_i}_{i=1}^N, f_m^{X^O} are the function values of f_m evaluated on X^O, and q̃_λ(h|f_m^{X^O}) is an encoder network that approximates the true posterior p̃_SPG(h|f_m^{X^O}).
Product of Experts (PoE) encoder. Since the size of X^O might vary each time it is sampled, we would naively need to set up 2^N inference nets, one for each possible subset. To overcome this issue, we adopt the Product of Experts encoder [369], a simple and flexible approach for this scenario, given by:
\tilde{q}_{\lambda}(\mathbf{h}|\mathbf{f}_m^{X^O}) \propto p_0(\mathbf{h}) \prod_{i=1}^{|O|} \tilde{q}_{\lambda}(\mathbf{h}|f_m(x_i), x_i), \qquad (5.4.3)
where q̃_λ(h|f_m(x_i), x_i) is an inference network representing the expert associated with the i-th measurement point. Now we can use a single encoder q̃_λ(h|f_m^{X^O}) to handle all possible inputs f_m^{X^O}. In practice, we let q̃_λ(h|f_m(x_i), x_i) be a Gaussian expert that maps [f_m(x_i), x_i] to a factorized Gaussian in latent space. Since the product of Gaussian experts is still Gaussian, q̃_λ(h|f_m^{X^O}) is a Gaussian distribution whose statistics can be computed analytically.
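Since all experts are factorized Gaussians, the PoE statistics reduce to summing precisions and precision-weighted means. A minimal sketch, assuming a standard normal prior p_0(h):

```python
import numpy as np

def poe_gaussian(mus, sigmas2, prior_mu=0.0, prior_sigma2=1.0):
    """Combine per-point Gaussian experts q(h | f(x_i), x_i) with the
    prior p_0(h), as in Eq. (5.4.3): precisions add; the mean is the
    precision-weighted average. mus, sigmas2: arrays of shape (|O|, H)."""
    prec = 1.0 / prior_sigma2 + np.sum(1.0 / sigmas2, axis=0)
    mean = (prior_mu / prior_sigma2 + np.sum(mus / sigmas2, axis=0)) / prec
    return mean, 1.0 / prec

# Three experts over a 2-dimensional latent h:
mus = np.array([[0.5, -1.0], [0.2, -0.8], [0.9, -1.2]])
sig2 = np.array([[0.5, 0.2], [1.0, 0.3], [0.25, 0.4]])
mean, var = poe_gaussian(mus, sig2)
print(mean, var)
```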
Estimating the grid-functional KL-divergence given Xn . In order to estimate the grid-
functional KL divergence between qSPG ( f ) and p( f ), we first discuss how this divergence
can be estimated on measurement points Xn , i.e., DKL [q(fXn )||p(fXn )] where fXn is the vector
of function values evaluated on Xn . We then discuss how this can be used to estimate the
grid-functional divergence in Equation 5.2.1. To begin with, as in Section 5.3, our variational
family is given by

f = \sum_s a_s \phi_s(\cdot, w_s) + \nu, \qquad \mathbf{a} \sim q(\mathbf{a}) = \int_{\mathbf{h}} p_{\theta}(\mathbf{a}|\mathbf{h})\, q_{\eta}(\mathbf{h})\, d\mathbf{h}, \qquad \nu \sim \mathcal{GP}(\nu; 0, \sigma_{\nu}^2). \qquad (5.4.4)
We denote the above variational family by q_SPG(f|q_η(h)). The key ingredient of our estimation method is that we force q_SPG(f|q_η(h)) and p̃_SPG(f|p_0(h)) to share the same basis functions (i.e., weights {w_s}) and decoder parameters θ. That is, once optimal {w_s} and θ are obtained by fitting p̃_SPG(f) to p(f), they are frozen and reused in q_SPG(f). This makes sense since, according to our definition in Section 5.1, q_SPG(f) and p̃_SPG(f) share the same measurable space (R^T, B_{R^T}).
Therefore, the only difference between q_SPG(f) and p̃_SPG(f) is the choice of prior distribution on h, namely q_η(h) and p_0(h), respectively. Given this property, we can compute the KL divergence between q_SPG(f) and p̃_SPG(f) given measurement points X_n (Appendix 5.A.5):
Proposition 5.5 (KL divergence on measurement points between SPGs). Let q_SPG(f) and p̃_SPG(f) be the SPGs defined in Equations 5.4.1 and 5.4.4. Then we have

D_{KL}[q_{\mathrm{SPG}}(\mathbf{f}^{X_n}) || \tilde{p}_{\mathrm{SPG}}(\mathbf{f}^{X_n})] = \mathbb{E}_{f \sim q_{\mathrm{SPG}}(f)} \log Z(\mathbf{f}^{X_n}), \qquad (5.4.5)
Note that Z(fXn ) is intractable to compute due to the intractability of the posterior
p̃SPG (h|fXn ). Fortunately, this is already approximated by the PoE inference net q̃λ (h|fXn )
given by Equation 5.4.3:
Z(\mathbf{f}^{X_n}) \approx \tilde{Z}(\mathbf{f}^{X_n}) := \int_{\mathbf{h}} \tilde{q}_{\lambda}(\mathbf{h}|\mathbf{f}^{X_n}) \frac{q_{\eta}(\mathbf{h})}{p_0(\mathbf{h})}\, d\mathbf{h}. \qquad (5.4.6)
Since q_η(h), p_0(h), and q̃_λ(h|f^{X_n}) are all Gaussian, Z̃(f^{X_n}) can be computed analytically. Note also that, thanks to the VAE-like structure of SPGs, all the calculations are performed in the latent space, whose dimensionality is much lower than that of f^X. With the additional help of the analytic solution for Z̃(f^X), the estimation of (5.4.6) is very efficient and scalable.
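For intuition, here is a per-dimension sketch of the analytic computation of log Z̃ in Equation (5.4.6), assuming all three factors are diagonal Gaussians and p_0 = N(0, I); the function name is our own.

```python
import numpy as np

def log_Z_tilde(mu_l, s2_l, mu_e, s2_e):
    """Analytic log of Eq. (5.4.6) for diagonal Gaussians:
    integrand q_lambda(h|f) * q_eta(h) / p_0(h), with p_0 = N(0, 1).
    Requires the combined precision 1/s2_l + 1/s2_e - 1 > 0."""
    a = 1.0 / s2_l + 1.0 / s2_e - 1.0   # combined precision per dimension
    b = mu_l / s2_l + mu_e / s2_e       # combined (mean * precision)
    return 0.5 * np.sum(-np.log(s2_l) - np.log(s2_e) - np.log(a)
                        - mu_l**2 / s2_l - mu_e**2 / s2_e + b**2 / a)

# Sanity check: if both factors equal p_0, the integral is 1 (log = 0).
print(log_Z_tilde(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3)))  # 0.0
print(log_Z_tilde(np.array([0.3]), np.array([0.4]),
                  np.array([-0.1]), np.array([0.5])))
```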
\log p(\mathcal{D}) \geq \sum_i^{|\mathcal{D}|} \mathbb{E}_{q(f)}[\log p_{\pi}(y_i|f(x_i))] - \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n})]
\approx \sum_i^{|\mathcal{D}|} \mathbb{E}_{q(f)}[\log p_{\pi}(y_i|f(x_i))] - \mathbb{E}_{n, X_n \sim c}\, \mathbb{E}_{f \sim q_{\mathrm{SPG}}(f)} \log \tilde{Z}(\mathbf{f}^{X_n}). \qquad (5.5.1)
To make Equation 5.5.1 scalable to large data, we can apply mini-batch sampling to the likelihood term \sum_i^{|\mathcal{D}|} \mathbb{E}_{q(f)}[\log p(y_i|f(x_i))]. Then the only bottleneck of Equation 5.5.1 is that the input X_n to the inference net q̃_λ (used in Z̃(f^{X_n})) can be very high dimensional, due to the condition X_D ⊂ X_n required by Proposition 5.3. Fortunately, we can derive the following mini-batch estimator for E_{n,X_n∼c} E_{f∼q_SPG(f)} log Z̃(f^{X_n}) (Appendix 5.A.6 and 5.A.7):
Proposition 5.6. E_{n,X_n∼c} E_{f∼q_SPG(f)} log Z̃(f^{X_n}) can be estimated by the mini-batch estimator

J_K := \frac{1}{2} \sum_{i=1}^{H} \mathbb{E}_{f \sim q_{\mathrm{SPG}}(f)} \Big[ \log \sigma_{\eta_i}^{-2} + \log \hat{\sigma}_{\lambda_i}^{-2} - \log\big(\sigma_{\eta_i}^{-2} + \hat{\sigma}_{\lambda_i}^{-2} - 1\big) - \hat{\mu}_{\lambda_i}^2 \hat{\sigma}_{\lambda_i}^{-2} - \mu_{\eta_i}^2 \sigma_{\eta_i}^{-2} + \big(\sigma_{\eta_i}^{-2}\mu_{\eta_i} + \hat{\sigma}_{\lambda_i}^{-2}\hat{\mu}_{\lambda_i}\big)^2 \big(\sigma_{\eta_i}^{-2} + \hat{\sigma}_{\lambda_i}^{-2} - 1\big)^{-1} \Big], \qquad (5.5.2)

where H is the dimensionality of h, N(h; μ_{η_i}, σ²_{η_i}) = q_η(h_i), N(h; μ_{λ_i}, σ²_{λ_i}) = q̃_λ(h_i|f^X), and σ̂_{λ_i}^{-2} and μ̂_{λ_i} are the mini-batch approximators of σ_{λ_i}^{-2} and μ_{λ_i}, respectively:

\hat{\sigma}_{\lambda_i}^{-2} := \frac{|\mathcal{D}|}{K} \sum_{k \in \mathcal{K}} \sigma_{h_i|f^{x_k}}^{-2} + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2},

\frac{\hat{\mu}_{\lambda_i}}{\hat{\sigma}_{\lambda_i}^{2}} := \frac{|\mathcal{D}|}{K} \sum_{k \in \mathcal{K}} \sigma_{h_i|f^{x_k}}^{-2} \mu_{h_i|f^{x_k}} + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2} \mu_{h_i|f^{x_l}},

where K is a mini-batch of size K sampled from {1, ..., |D|}, {x_l} ⊂ X_n \ X_D is the set of OOD samples drawn from T using c in Eq. 5.2.4, and μ_{h_i|f^{x_k}} and σ²_{h_i|f^{x_k}} are the mean and variance parameters returned by q̃_λ(h_i|f(x_k)).
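A sketch of these mini-batch approximations, given per-point Gaussian encoder outputs; the p_0 prior term from the PoE encoder is omitted here, matching the displayed formulas (if included, it would add 1 to each precision).

```python
import numpy as np

def minibatch_poe_stats(mu_b, s2_b, mu_o, s2_o, N):
    """Mini-batch approximations of Proposition 5.6 (a sketch):
    precisions from a size-K batch of training points are rescaled by
    |D|/K = N/K, while OOD measure points enter without rescaling.
    mu_b, s2_b: (K, H) encoder outputs on the training batch;
    mu_o, s2_o: (M, H) encoder outputs on OOD measure points."""
    K = mu_b.shape[0]
    prec = (N / K) * np.sum(1.0 / s2_b, axis=0) + np.sum(1.0 / s2_o, axis=0)
    mean = ((N / K) * np.sum(mu_b / s2_b, axis=0)
            + np.sum(mu_o / s2_o, axis=0)) / prec
    return mean, 1.0 / prec  # (mu_hat_lambda, sigma_hat_lambda^2)

rng = np.random.default_rng(0)
mu, v = minibatch_poe_stats(rng.normal(size=(8, 3)),
                            rng.uniform(0.5, 2, (8, 3)),
                            rng.normal(size=(4, 3)),
                            rng.uniform(0.5, 2, (4, 3)), N=100)
print(mu.shape, v.shape)
```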
The estimator in Eq. 5.5.2 is biased (but consistent). To remove the bias, we propose to debias Eq. 5.5.2 using the Russian Roulette estimator [141]:
Proposition 5.7 (Russian Roulette estimator). Let R be a random integer drawn from a distribution P(N) with support over the integers larger than K, and let x_0 be a random location sampled from T. Then E_{n,X_n∼c} E_{f∼q_SPG(f)} log Z̃(f^{X_n}) can be estimated by

\mathbb{E}\Big[ J_K + \sum_{k=K}^{R} \frac{\Delta_k}{P(N \geq k)} \Big],

where Δ_k = J_{k+1} − J_k, and the expectation is taken over R, n, X_n, and all mini-batches used by each J_k term.
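The following toy sketch illustrates the debiasing mechanism on a scalar sequence J(k) with a known limit; it is not the full FVI estimator, and the choice (R − K) ∼ Geom(p) on {0, 1, ...} is the convention assumed here.

```python
import numpy as np

def russian_roulette(J, K=2, p=0.5, rng=None):
    """Debiased estimate of lim_k J(k) (sketch of the construction above):
    draw R with (R - K) ~ Geom(p) on {0, 1, ...}, then return
    J(K) + sum_{k=K}^{R} (J(k+1) - J(k)) / P(N >= k)."""
    rng = rng or np.random.default_rng(0)
    R = K + rng.geometric(p) - 1         # shift NumPy's Geom to {0,1,...}
    est = J(K)
    for k in range(K, R + 1):
        survival = (1.0 - p) ** (k - K)  # P(N >= k) under this convention
        est += (J(k + 1) - J(k)) / survival
    return est

# Toy check: J(k) = 1 - 2^{-k} has limit 1; RR estimates recover it on average.
J = lambda k: 1.0 - 2.0 ** (-k)
ests = [russian_roulette(J, rng=np.random.default_rng(s)) for s in range(5000)]
print(np.mean(ests))  # close to 1.0
```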
This enables us to also perform mini-batch sampling over the measurement points when performing FVI. Our final optimization objective is

\hat{\mathcal{L}}_{\mathrm{FVI}} := \frac{|\mathcal{D}|}{I} \sum_{i \in \mathcal{I}} \mathbb{E}_{q(f)}[\log p(y_i|f(x_i))] - J_K - \sum_{k=K}^{R} \frac{\Delta_k}{P(N \geq k)}, \qquad (5.5.3)

where I is a mini-batch of size I for the likelihood terms, and R is an integer sampled from P(N), set such that (R − K) ∼ Geom(0.5). Finally, the full algorithm is sketched in Algorithm 2. We call the proposed method Functional Variational Inference (FVI).
Remark (Scalability). Our method is empirically much faster than f-BNN (Appendix 5.C.3). When estimating D_grid[q(f)||p(f)], our method scales as O(M_q), where M_q is the number of samples drawn from q_SPG(f) and used in J_K. In practice, we use M_q = 1. On the contrary, the SSGE estimator used in f-BNN scales as O(M_q³ + M_q²|f^{X_n}|), where |f^{X_n}| is the dimensionality of f^{X_n}. Usually in SSGE, a much larger value of M_q needs to be used (e.g., M_q = 100). Also, as analyzed in Section 5.1, SSGE scales linearly with the dimensionality of f^{X_n} and does not allow mini-batch estimation. These factors limit the applicability of SSGE to large-scale function-space inference. In Appendix 5.C.3, we provide further results on the run-time performance of these methods.
Function space priors. Another line of work directly defines distributions over functions by combining stochastic processes with neural networks. For example, neural processes (NPs) [80] and their variants [147, 90] focus on meta-learning scenarios and propose to use set encoders to model all possible posterior distributions of the form {p(f|C) | C ⊂ D}, where C is the set of so-called “context points”. This can be inefficient for large datasets, since it requires feeding all data points to the set encoder, which scales linearly with the dataset size. More importantly, NPs still use an ELBO defined in parameter space rather than function space for learning and inference. On the contrary, our method focuses on functional VI for supervised learning scenarios and does not need to model all possible conditionals. When computing the predictive distribution, we only need to evaluate q_η(h), which is a simple Gaussian distribution (no set encoders involved).
Figure 5.6 Implicit function prior and posterior samples from ground truth, FVI, VIP, and
f-BNN, respectively. The first row corresponds to a piecewise constant prior, and the
second row corresponds to a piecewise linear prior. The leftmost column shows 5 prior
samples. From the second column to the rightmost column we show posterior samples
generated by ground truth (returned by SIR), FVI, VIP and f-BNN, respectively. Red dots
denote the training data. We plot 10 posterior samples in black lines and show predictive
uncertainty as grey shaded areas.
models that are reliable under uncertainty. To reduce the prohibitive computation cost of
exact/deep GPs, various VI methods [335, 109, 213, 292] have been studied. These methods
share a similar principle with our work, that is, to minimize the functional divergence
between the posterior and variational processes. Nevertheless, the GP components of the
functional prior play a critical role in this line of work, which makes them less applicable to
general non-GP based priors.
5.7 Experiments
In this section, we evaluate the performance of FVI using a number of tasks, including inter-
polation with structured implicit priors, multivariate regression with BNN priors, contextual
bandits, and image classification. We mainly compare FVI with other weight-space and
function-space Bayesian inference methods using the same priors. For more implementation
details, please refer to Appendix 5.B. Additional experiments can be found in Appendix 5.C.
We consider two structured implicit priors: 1), piecewise constant random functions, and 2), piecewise linear random functions. Please refer to Appendix 5.B.2 for details. For each prior, we first sample a random function from the prior; then, 100 observed data points are sampled as D, half from [0, 0.2] and the other half from [0.8, 1]. Finally, we ask the algorithms to perform inference under the prior, i.e., to produce samples from p(f|D).
We compare the performance of FVI with the ground truth, f-BNN, and VIPs. The ground-truth posterior samples are generated by sampling importance re-sampling (SIR). f-BNNs are based on the code kindly open-sourced by [330]. As we found the training time required by f-BNN to be prohibitive, we trained f-BNN for only 100 epochs for fairness. For VIPs (Gaussian approximations), we use an empirical covariance kernel, which is estimated from random function samples of the implicit priors. For FVI, implementation details can be found in Appendix 5.B.2.
Results are displayed in Figure 5.6. FVI can successfully generate samples that mimic
the piecewise constant/linear behaviors. The posterior uncertainty returned by FVI is also
close to the ground truth estimates. On the other hand, f-BNNs severely under-fit the data
and provide very poor in-between uncertainties. Note that, although f-BNNs are only trained
for 100 epochs, their running time is still 100x higher than that of FVI (Appendix 5.C.3).
VIP performs better than f-BNNs, but fails to mimic the behaviour of the priors: because the prior function samples violate VIP's Gaussian assumption, the correlation level between points is lower than expected, resulting in very noisy VIP posterior samples that are hard to interpret.
Table 5.2 Contextual bandits performance comparison. Results are relative to the cumulative regret of the worst algorithm on each dataset. Numbers after the algorithm names are the network sizes. The best methods are boldfaced, and the second-best methods are highlighted in brown.
MEAN R ANK M USHROOM S TATLOG C OVERTYPE J ESTER A DULT W HEEL C ENSUS
FVI 2 × 50 2.11 16.46 ± 2.04 7.95 ± 2.92 49.59 ± 1.61 68.59 ± 6.87 90.33 ± 0.86 41.44 ± 9.28 51.77 ± 3.06
U NIFORM 10.45 100.0 ± 0.00 99.85 ± 0.36 99.49 ± 0.62 100.0 ± 0.00 99.60 ± 0.53 94.04 ± 11.9 99.30 ± 0.55
RMS 5.68 17.74 ± 7.65 10.36 ± 2.51 69.72 ± 7.23 75.07 ± 5.50 97.65 ± 1.48 70.39 ± 19.7 94.55 ± 3.60
D ROPOUT 2 × 50 5.54 19.84 ± 6.46 15.53 ± 4.50 67.72 ± 2.32 75.04 ± 4.66 97.44 ± 0.98 59.40 ± 10.8 86.60 ± 0.52
BBB 2 × 50 4.88 23.18 ± 5.90 30.90 ± 3.29 63.91 ± 1.96 72.93 ± 5.69 95.49 ± 2.03 56.38 ± 11.3 70.68 ± 2.32
BBB 1 × 50 8.22 15.52 ± 4.40 80.25 ± 18.6 94.80 ± 4.84 83.30 ± 5.26 99.24 ± 0.66 58.12 ± 18.0 99.46 ± 0.37
N EURAL L INEAR 6.94 19.04 ± 2.96 21.22 ± 1.98 75.34 ± 1.00 86.86 ± 3.61 97.93 ± 1.37 37.41 ± 8.86 83.75 ± 1.44
B OOT RMS 4.51 17.11 ± 5.99 9.47 ± 2.03 63.27 ± 1.35 74.66 ± 3.87 96.11 ± 1.02 63.15 ± 25.9 90.47 ± 3.40
PARAM N OISE 5.94 17.76 ± 4.14 20.95 ± 3.07 78.08 ± 5.66 76.95 ± 5.84 96.23 ± 1.81 41.26 ± 6.48 96.34 ± 4.56
BBα 2 × 50 9.45 68.45 ± 6.05 95.22 ± 4.88 98.60 ± 1.45 94.29 ± 2.69 98.72 ± 1.28 80.50 ± 7.96 97.94 ± 2.01
FBNN 2 × 50 3.17 16.55 ± 2.41 10.01 ± 1.39 50.10 ± 5.70 70.82 ± 3.27 90.72 ± 3.18 77.70 ± 21.2 51.22 ± 2.55
Bayes-by-Backprop [29], variational dropout [75], and variational alpha dropout [177] (α = 0.5). We also compare with three function-space BNN inference methods: VIP-BNNs, VIP-Neural processes [198], and f-BNNs. Finally, we include comparisons to function-space particle optimization [357] in Appendix 5.C.7 for reference purposes. All inference methods are based on the same BNN priors whenever applicable. For the experimental settings, we follow [198]. Each dataset was randomly split into train (90%) and test (10%) sets; this was repeated 10 times and the results were averaged.
Results are shown in Table 5.1. Overall, FVI consistently outperforms other VI-based
inference methods for BNNs and achieves the best result in 7 datasets (out of 9). FVI also
outperforms f-BNNs (in 5 datasets out of 6), despite the fact that they are more expensive to
train. Note that exact GPs and f-BNNs are not directly comparable to other methods, since
i), they perform inference over different priors; and ii), they are much more expensive as
they require the evaluation of the exact GP likelihood. Thus, their results are only available
for smaller datasets, and are not included for ranking.
Table 5.3 Image classification and OOD detection performance. Accuracy, negative log-
likelihood (NLL) and area-under-the-curve (AUC) of OOD detection are reported. Our
method outperforms all baselines in terms of classification accuracy and OOD-AUC, and
performs competitively on NLL for CIFAR10. Results for MAP, KFAC and Ritter et al. are
obtained from [127].
FMNIST CIFAR10
M ODEL ACCURACY NLL OOD-AUC ACCURACY NLL OOD-AUC
FVI 91.60±0.14 0.254±0.05 0.956±0.06 77.69 ±0.64 0.675±0.03 0.883±0.04
MFVI 91.20±0.10 0.343±0.01 0.782±0.02 76.40±0.52 1.372±0.02 0.589±0.01
MAP 91.39±0.11 0.258±0.00 0.864±0.00 77.41±0.06 0.690±0.00 0.809±0.01
KFAC-L APLACE 84.42±0.12 0.942±0.01 0.945±0.00 72.49±0.20 1.274±0.01 0.548±0.01
R ITTER ET AL . 91.20±0.07 0.265±0.00 0.947±0.00 77.38±0.06 0.661±0.00 0.796±0.00
5.8 Conclusion
In this chapter, we took an algorithm-driven approach to function space inference and proposed Functional Variational Inference (FVI). It optimizes a grid-based functional divergence, which can be estimated based on our proposed SPG model. We demonstrated that FVI works well with implicit priors, scales to high-dimensional data, and provides reliable uncertainty estimates. Possible directions for future work include developing grid-functional KL estimation methods that do not require surrogate models, and improving the theoretical understanding of function-space VI.
where X_n are the so-called measurement points, f^{X_n} is the vector of function values evaluated on X_n, and DKL[q(f^{X_n})||p(f^{X_n})] is the KL divergence over random vectors typically used in the machine learning community.

\arg\min_{q(f)} D_{KL}[q(f)||p(f|\mathcal{D})] = \arg\min_{q(f)} \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})],
Let us first consider the left-hand side, \arg\min_{q(f)} D_{KL}[q(f)||p(f|\mathcal{D})]. When it reaches the optimum, we have a unique solution, q^\star_L(f) = p(f|\mathcal{D}). According to Equation 5.A.1, we have:

\arg\min_{q(f)} \sup_{n, X_n} D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = \arg\min_{q(f)} D_{KL}[q(f)||p(f|\mathcal{D})] = q^\star_L(f).

Since

\mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] \leq \sup_{n, X_n} D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})],

at q^\star_L(f) we have

0 \leq \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q^\star_L(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] \leq \sup_{n, X_n} D_{KL}[q^\star_L(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = 0.

Therefore, \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q^\star_L(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = 0.

On the other hand, assume that \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q^\star_R(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] reaches its optimum 0 at some optimal solution q^\star_R(f). Since D_{KL}[q^\star_R(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] is non-negative and c has full support, we have D_{KL}[q^\star_R(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = 0 for all possible X_n \subset \mathrm{supp}(c) = \mathcal{T}^{\mathbb{Z}^+}. Therefore, we have

D_{KL}[q^\star_R(f)||p(f|\mathcal{D})] = \sup_{n, X_n} D_{KL}[q^\star_R(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = 0,

and hence q^\star_R(f) = p(f|\mathcal{D}) = q^\star_L(f). That is,

\arg\min_{q(f)} D_{KL}[q(f)||p(f|\mathcal{D})] = \arg\min_{q(f)} \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})] = p(f|\mathcal{D}).
Proposition 5.3. Let n, X_n ∼ c be a set of random measure points such that X_n always contains X_D. Define:

\mathcal{L}_q^{\mathrm{grid}} := \log p(\mathcal{D}) - D_{\mathrm{grid}}[q(f)||p(f|\mathcal{D})]. \qquad (5.A.2)

Then we have:

\mathcal{L}_q^{\mathrm{grid}} = \mathbb{E}_{q(f)}[\log p(\mathcal{D}|f)] - D_{\mathrm{grid}}[q(f)||p(f)], \qquad (5.A.3)

and \log p(\mathcal{D}) \geq \mathcal{L}_q^{\mathrm{grid}} \geq \mathcal{L}_q^{\mathrm{functional}}.
\mathcal{L}_q^{\mathrm{grid}}
= \log p(\mathcal{D}) - \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})]
= \mathbb{E}_{n, X_n \sim c}\{\log p(\mathcal{D}) - D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n}|\mathcal{D})]\}
= \mathbb{E}_{n, X_n \sim c}\Big\{\log p(\mathcal{D}) - \mathbb{E}_q\Big[\log \frac{q(\mathbf{f}^{X_n})}{p(\mathbf{f}^{X_n}|\mathcal{D})}\Big]\Big\}
= \mathbb{E}_{n, X_n \sim c}\Big\{\log p(\mathcal{D}) - \mathbb{E}_q\Big[\log \frac{q(\mathbf{f}^{X_n}) p(\mathcal{D})}{p(\mathbf{f}^{X_n}, \mathcal{D})}\Big]\Big\}
= \mathbb{E}_{n, X_n \sim c}\{\mathbb{E}_q[-\log q(\mathbf{f}^{X_n}) + \log p(\mathbf{f}^{X_n}, \mathcal{D})]\}
= \mathbb{E}_{n, X_n \sim c}\{\mathbb{E}_{q(\mathbf{f}^{\mathcal{D}})} \log p(\mathcal{D}|\mathbf{f}^{\mathcal{D}}) - D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n})]\}
= \mathbb{E}_{q(f)}[\log p_{\pi}(\mathcal{D}|f)] - \mathbb{E}_{n, X_n \sim c}\, D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n})].
Proof: Let X_n denote a set of n measure points {x_k}_{1≤k≤n} in T^n. Also, let the sampling distribution c have the following form:

n \sim p(n), \qquad x_k \sim U(\mathcal{T}), \; \forall 1 \leq k \leq n.

That is, c first samples a positive integer n from the distribution p(n), and then draws n samples from T independently and uniformly. Let us consider D_grid[q(f)||p(f)]. According to the definition,
The first inequality is due to the data-processing inequality. The quantities q̄ and p in the second inequality are defined as

\bar{q} = \sup_{\mathbf{h}^{X_n} \in \mathcal{B}^n \subset \mathbb{R}^n} q(\mathbf{h}^{X_n}) > 0.

Note that both q̄ and p are strictly greater than 0, due to the fact that B^n is the support of q(h^{X_n}), and p(h|x; Θ) > 0 for all h ∈ B. Next, notice that
\bar{q} = \sup_{\mathbf{h}^{X_n} \in \mathcal{B}^n} q(\mathbf{h}^{X_n}) = \sup_{\mathbf{h}^{X_n} \in \mathcal{B}^n} \int_{\Gamma} \prod_{1 \leq k \leq n} q(h_k|x_k; \Gamma)\, q(\Gamma)\, d\Gamma \leq (q^\star)^n, \qquad q^\star > 0,

where we have used q⋆ to denote \sup_{h_k \in \mathcal{B}} \sup_{\Gamma \in V} \sup_{x_k \in \mathcal{T}} q(h_k|x_k; \Gamma). The second equality is given by the definition of q(h; Γ); the first inequality is due to the expectation being replaced by \sup_{\Gamma \in V} \sup_{X_n \in \mathcal{T}^n}, together with the fact that q(Γ) has compact support; and in the last inequality q⋆ > 0, since otherwise q(h_k|x_k; Γ) ≡ 0, which contradicts the fact that q(h^{x_k}) = \int_{\Gamma} q(h_k|x_k; \Gamma)\, q(\Gamma)\, d\Gamma has compact support. Similarly, we also have:
\underline{p} = \inf_{\mathbf{h}^{X_n} \in \mathcal{B}^n} p(\mathbf{h}^{X_n}) = \inf_{\mathbf{h}^{X_n} \in \mathcal{B}^n} \int_{\Theta} \prod_{1 \leq k \leq n} p(h_k|x_k; \Theta)\, p(\Theta)\, d\Theta \geq (p^\star)^n > 0,

where we have used p⋆ to denote \inf_{h_k \in \mathcal{B}} \inf_{\Theta} \inf_{x_k \in \mathcal{T}} p(h_k|x_k; \Theta). It then follows that

D_{KL}[q(\mathbf{h}^{X_n})||p(\mathbf{h}^{X_n})] \leq n\,[\log q^\star - \log p^\star].
Remark (grid-functional KL with BNN priors). We note that the proof applies to BNN priors. Assume p(h_i|x_i; Θ = w) = N(h_i; g_w(x_i), ς²), where g_w(·) is a Bayesian neural network parameterized by w, and p(w) is a suitable prior on the weights, such as a factorized Gaussian. In this case, it is trivial to verify that, given any compact set B, p(h|x; Θ) > 0 holds for all h ∈ B, x ∈ T, and Θ ∈ R^I; hence the assumptions in Proposition 5.2 are satisfied.
Corollary 5.1. Let p(f) and q(f) be two distributions over random functions. Assume that q(f) is parameterized by the following sampling process:
and p(f) is parameterized by a zero-mean Gaussian process with kernel function K(·, ·). Assume further that: i), q(f) satisfies the assumptions of Proposition 5.2; ii), K(·, ·) is a stationary kernel, i.e., K(x_1, x_2) = Φ(∥x_1 − x_2∥) for some function Φ (e.g., a radial basis function); and iii), the smallest eigenvalue of K_{X_n,X_n}, denoted λ_n, decays as O(n^{−γ}) for some constant γ > 1 (see the literature on eigenvalue distributions and lower bounds for the smallest eigenvalue of kernel matrices, and on norm estimates for inverse matrices; e.g., [358, 14, 360, 18, 297, 239]).

Then, there exists a sampling distribution c such that: 1), c has full support on T^{Z^+}; and 2), D_grid[q(f)||p(f)] is finite.
Proof. We can apply most of the proof of Proposition 5.2. In our case, the key ingredient is to derive a lower bound for

\underline{p} = \inf_{\mathbf{h}^{X_n} \in \mathcal{A}^n \subset \mathbb{R}^n} p(\mathbf{h}^{X_n}),

where

\log p(\mathbf{h}^{X_n}) = -\frac{(\mathbf{h}^{X_n})^\top \mathbf{K}_{X_n,X_n}^{-1} \mathbf{h}^{X_n}}{2} - \frac{n}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{K}_{X_n,X_n}|.
Without loss of generality, assume that ∥h^{X_n}∥² ≤ A for some constant A. Then we have

(\mathbf{h}^{X_n})^\top \mathbf{K}_{X_n,X_n}^{-1} \mathbf{h}^{X_n} \leq \frac{1}{\lambda_n} \|\mathbf{h}^{X_n}\|^2 \leq \frac{A}{\lambda_n},

where λ_n denotes the smallest eigenvalue of K_{X_n,X_n} (equivalently, 1/λ_n is the largest eigenvalue of K_{X_n,X_n}^{-1}).
Notice also that

\log |\mathbf{K}_{X_n,X_n}| \leq n \log \Big(\frac{1}{n} \mathrm{Tr}(\mathbf{K}_{X_n,X_n})\Big) = n \log \Phi(0).
Therefore, we can write

\log \underline{p} \geq -\frac{n}{2}\big(\log 2\pi + \log \Phi(0)\big) - \frac{A}{2\lambda_n}.
Since λ_n decays as O(n^{−γ}) for some constant γ > 1, by running the same argument as in the proof of Proposition 5.2, \sum_{n=1}^{\infty} p(n)\, \mathbb{E}_{X_n \sim U(\mathcal{T}^n)} D_{KL}[q(\mathbf{f}^{X_n})||p(\mathbf{f}^{X_n})] is absolutely convergent if \lim_{n \to \infty} p(n+1)/p(n) < 1.
where MMD is the maximum mean discrepancy between p and q, and F is the MMD function class, defined to be a unit ball in a reproducing kernel Hilbert space (RKHS) with a universal kernel [324] k(·, ·) as its reproducing kernel.

f(x) = \lim_{N \to \infty} L_N, \qquad L_N := \sum_i^N Z_i\, \phi_i(x), \qquad \sum_i^{\infty} \lambda_i < +\infty.
“Metrization” means that for any sequence of measures P_1, P_2, ..., P_n, ... ∈ P, we have

P_n \xrightarrow{w} P \iff \lim_{n \to \infty} \mathrm{MMD}(P_n, P; \mathcal{F}) = 0.
The above triangle inequality holds since k is universal [93]. Hence, to prove our theorem, it suffices to show that there exists a sequence of SPGs q_{SPG,1}, ..., q_{SPG,n′}, ... such that \lim_{n' \to \infty} \mathrm{MMD}(q_{SPG,n'}, p_{L_n}; \mathcal{F}) = 0 for all n ∈ Z^+ and x ∈ T. To prove this, let us fix n for now and consider the random coefficients {Z_i}_{i=1}^n of L_n. Based on the results of [48], there exists a sequence of Gaussian VAEs q_{VAE,1}({Z_i}_{i=1}^n), ..., q_{VAE,n″}({Z_i}_{i=1}^n), ... of latent size n, such that

q_{VAE,n''}(\{Z_i\}_{i=1}^n) \xrightarrow{w} p(\{Z_i\}_{i=1}^n).

Based on our definition in Section 5.3, q_{SPG,n′} is indeed an SPG. Since the linear summation over φ_i with weights {Z_i}_{i=1}^n is a continuous mapping, we also have

q_{SPG,n'} \xrightarrow{w} p_{L_n}, \qquad \forall n \in \mathbb{Z}^+, \; x \in \mathcal{T},
by the continuous mapping theorem. Again, from the MMD metrization, we have \lim_{n' \to \infty} \mathrm{MMD}(q_{SPG,n'}, p_{L_n}; \mathcal{F}) = 0. To finally prove our theorem, consider an arbitrary error ε. Then there exists L_n such that MMD(p_{L_n}, p; F) < ε/2. Next, given this particular L_n, there exists n′ such that
where the first equality directly follows from the chain rule of the KL divergence, and the second equality follows from the fact that q_SPG(h|f^{X_n}) ∝ q_SPG(h) p̃_SPG(f^{X_n}|h) and p̃_SPG(h|f^{X_n}) ∝ p_0(h) p̃_SPG(f^{X_n}|h).
Proposition 5.6 (restated). E_{n,X_n∼c} E_{f∼q_SPG(f)} log Z̃(f^{X_n}) can be estimated by the mini-batch estimator:

J_K := \frac{1}{2} \sum_{i=1}^{H} \mathbb{E}_{f \sim q_{\mathrm{SPG}}(f)} \Big[ \log \sigma_{\eta_i}^{-2} + \log \hat{\sigma}_{\lambda_i}^{-2} - \log\big(\sigma_{\eta_i}^{-2} + \hat{\sigma}_{\lambda_i}^{-2} - 1\big) - \hat{\mu}_{\lambda_i}^2 \hat{\sigma}_{\lambda_i}^{-2} - \mu_{\eta_i}^2 \sigma_{\eta_i}^{-2} + \big(\sigma_{\eta_i}^{-2}\mu_{\eta_i} + \hat{\sigma}_{\lambda_i}^{-2}\hat{\mu}_{\lambda_i}\big)^2 \big(\sigma_{\eta_i}^{-2} + \hat{\sigma}_{\lambda_i}^{-2} - 1\big)^{-1} \Big],

where H is the dimensionality of h, N(h; μ_{η_i}, σ²_{η_i}) = q_η(h_i), N(h; μ_{λ_i}, σ²_{λ_i}) = q̃_λ(h_i|f^{X_n}), and σ̂_{λ_i}^{-2} and μ̂_{λ_i} are the mini-batch approximators of σ_{λ_i}^{-2} and μ_{λ_i}, respectively:

\hat{\sigma}_{\lambda_i}^{-2} := \frac{|\mathcal{D}|}{K} \sum_{k \in \mathcal{K}} \sigma_{h_i|f^{x_k}}^{-2} + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2},

\frac{\hat{\mu}_{\lambda_i}}{\hat{\sigma}_{\lambda_i}^{2}} := \frac{|\mathcal{D}|}{K} \sum_{k \in \mathcal{K}} \sigma_{h_i|f^{x_k}}^{-2} \mu_{h_i|f^{x_k}} + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2} \mu_{h_i|f^{x_l}},

where K is a mini-batch of size K sampled from {1, ..., |D|}, {x_l} ⊂ X_n \ X_D is the set of OOD samples drawn from T using c in Eq. 5.2.4, and μ_{h_i|f^{x_k}} and σ²_{h_i|f^{x_k}} are the mean and variance parameters returned by q̃_λ(h_i|f(x_k)).
Proof. To derive the mini-batch estimator, we first compute the expression for Z̃(f^{X_n}). Since q̃_λ(h|f^{X_n}) is a product-of-Gaussians encoder, its mean and covariance can be computed as:

\mathbf{\Sigma}_{\lambda}^{-1} = \sum_{x \in X_n} \mathbf{\Sigma}_{\mathbf{h}|f^x}^{-1}, \qquad \boldsymbol{\mu}_{\lambda} = \mathbf{\Sigma}_{\lambda} \sum_{x \in X_n} \mathbf{\Sigma}_{\mathbf{h}|f^x}^{-1} \boldsymbol{\mu}_{\mathbf{h}|f^x}.
Let Σ_η and μ_η denote the covariance and mean of q_η(h). By our assumptions, Σ_η is also a diagonal matrix with (Σ_η)_{ii} = σ²_{η_i}. Since q̃_λ(h|f^{X_n}) q_η(h)/p_0(h) is a product of three Gaussian terms, its log normalization constant log Z̃ can be computed using, for example, the results of Appendix A.2 of [110]:
where σ²_{η_i}, σ²_{λ_i}, μ_{η_i}, μ_{λ_i} are the i-th elements of diag(Σ_η), diag(Σ_λ), μ_η, μ_λ, respectively. To efficiently estimate σ_{λ_i}^{-2} = \sum_{x \in X_n} \sigma_{h_i|f^x}^{-2} and μ_{λ_i} = \sigma_{\lambda_i}^2 \sum_{x \in X_n} \sigma_{h_i|f^x}^{-2} \mu_{h_i|f^x}, we can uniformly sample a mini-batch X_K of size K from X_D, and then compute the following noisy mini-batch estimate:

\mu_{\lambda_i} \sigma_{\lambda_i}^{-2} = \sum_{x \in X_n} \sigma_{h_i|f^x}^{-2} \mu_{h_i|f^x} = N\, \mathbb{E}_{x \in X_{\mathcal{D}}}\big[\sigma_{h_i|f^x}^{-2} \mu_{h_i|f^x}\big] + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2} \mu_{h_i|f^{x_l}}
\approx \frac{N}{K} \sum_{k \in \mathcal{K}} \sigma_{h_i|f^{x_k}}^{-2} \mu_{h_i|f^{x_k}} + \sum_{x_l \in X_n \setminus X_{\mathcal{D}}} \sigma_{h_i|f^{x_l}}^{-2} \mu_{h_i|f^{x_l}},

and analogously for σ_{λ_i}^{-2}.
We denote the estimators of σ_{λ_i}^{-2} and μ_{λ_i} by σ̂_{λ_i}^{-2} and μ̂_{λ_i}, respectively. Then, applying these noisy estimates to E_{n,X_n∼c} E_{f∼q_SPG(f)} log Z̃(f^{X_n}) gives the result. The symbol ≈ in the last line means that the estimator is consistent, by the multivariate continuous mapping theorem.
where ∆k = Jk+1 − Jk , and the expectation E is taken over R, n, Xn , and all mini-batches
used by each Jk terms.
Proof. By definition, we have \lim_{k \to \infty} \mathbb{E} J_k = \mathbb{E} J_N = \mathbb{E}_{n, X_n \sim c}\, \mathbb{E}_{f \sim q_{\mathrm{SPG}}(f)} \log \tilde{Z}(\mathbf{f}^{X_n}). Clearly, \mathbb{E}\big[\sum_{k=0}^{R} \Delta_k / P(N \geq k)\big] constructs a Russian Roulette estimator [141]. Based on Lemma 3 of [41], in order to prove our result we only have to show that \mathbb{E} \sum_{k=0}^{\infty} |\Delta_k| < \infty. In fact, since the data distribution is assumed to be an empirical distribution, we have

\sum_{k=0}^{\infty} |\Delta_k| = \sum_{k=0}^{\infty} |J_{k+1} - J_k| = \sum_{k=0}^{N-1} |J_{k+1} - J_k| < \infty,

which holds for all possible mini-batches used by each Δ_k. The second equality is based on the fact that J_{k+1} = J_k = log Z̃(f^{X_n}) for all k ≥ N. Therefore, we have \mathbb{E} \sum_{k=0}^{\infty} |\Delta_k| = \sum_{k=0}^{N-1} \mathbb{E}|\Delta_k| < \infty.
the train/test sets are predefined, we only ran experiments with 5 different random seeds (for initialization). For the interpolation-with-implicit-priors experiment, see Appendix 5.B.2 for details.
where we first sample n from a geometric distribution, such that (n − |D|) ∼ Geom(p) with
parameter p. Here, we use the parameter p = 0.5. Then, (n − |D|) out of distribution (OOD)
measure points are sampled independently from a uniform distribution on T .
Choice of prior processes p(f). Note that since FVI is an inference method rather than a new model, we assume that FVI and most of the baselines use the same priors whenever applicable. For example, in the structured-prior interpolation tasks, both FVI and f-BNN use the same piecewise implicit prior. In the multivariate regression and image classification tasks, all algorithms use the same BNN prior with the same structure, so we can isolate the differences caused by the inference algorithms.
Structure of SPGs. Unless specified otherwise, we use 10 basis functions for our SPGs, and each basis function is a two-layer neural network mapping from R^d to R (structure: input-100-100-output). Note that these neural network parameters are not part of the variational parameters, since they are frozen after we have finished distilling p(f) into p̃_SPG(f). To further reduce the number of free parameters, the parameters of the first two layers of all basis functions can be shared (this is applied only to larger-scale experiments such as image classification). The encoder q̃_λ(h|f) for p̃_SPG(f) is also a two-layer neural network (input-500-200-latent statistics), whose parameters are likewise fixed after distilling p(f). The decoders also have two hidden layers (latent variables-50-100-output). The latent dimension differs across tasks so that the comparison between baselines is fair; this is detailed later. For the stationary GP white noise process used in SPGs, we assume an isotropic noise level of σ_ν² = 0.1.
Optimization. Unless otherwise specified, we use the Adam optimizer with learning rate lr = 0.001. We use a slightly larger learning rate in the contextual bandit experiment, since the learning rate for each baseline is tuned over [0.001, 0.05], as specified in the experiment section. When training p̃_SPG(f), we use 5k epochs unless otherwise noted. For the inference phase, where q_SPG(f) is optimized to maximize the functional ELBO, the number of iterations is determined by the other baselines; for example, in the contextual bandit experiments all baselines are trained for 100 epochs, and so is FVI. In terms of batch size, unless specified otherwise, we use 100. This batch size is also used for the MC estimation of the likelihood term in Equation 5.5.3 and for training p̃_SPG in Equation 5.4.2.
random function samples and optimize p̃SPG ( f ) for 5k epochs. For inference, the variational
parameters are trained for 1k epochs.
UCI Multivariate regression For this experiment, we follow [198]. The functional prior
is a fully connected ReLU BNN with two hidden layers (input-10-10-output). We train FVI
for 1k epochs. We use 10 basis functions for FVI and a batch size of 100. The latent size is
set to 100.
Gaussian processes in UCI regression. On the UCI datasets, variational sparse GPs and exact GPs are implemented using GPflow. VSGPs use 50 inducing points. Both variations of GP models use the RBF kernel.
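For reference, a minimal sketch of this baseline configuration, assuming GPflow 2.x (exact API names may differ across versions), with toy data standing in for the UCI splits:

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 4))
Y = np.sin(X.sum(axis=1, keepdims=True)) + 0.1 * rng.normal(size=(500, 1))

Z = X[:50].copy()  # 50 inducing points, as in the experiments above
model = gpflow.models.SVGP(
    kernel=gpflow.kernels.RBF(),              # RBF kernel, as in the text
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=Z,
    num_data=X.shape[0],
)
gpflow.optimizers.Scipy().minimize(
    model.training_loss_closure((X, Y)), model.trainable_variables
)
```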
Contextual bandits. We use settings similar to [330]: batch size = 32, training epochs = 100, training frequency = 50, and contextual points = 2000. We use ReLU BNNs as functional priors, with two hidden layers of 50 units each. For FVI, we use 100 basis functions (with shared weights up to the last layer) and a learning rate of 0.005. For details of the algorithms listed in Table 5.2, readers may refer to [274]. Here we briefly explain the abbreviations. FVI: functional variational inference; FBNN: functional Bayesian neural networks; Uniform: uniform sampling; RMS: trains a neural network and acts greedily, using RMSprop; Boot RMS: bootstrapped RMS; Neural Linear: Bayesian linear regression over deep NN features; ParamNoise: a regular DNN, but with an isotropic Gaussian perturbation added to the NN weights when making decisions; Dropout: variational dropout BNNs; BBB: Bayes-by-Backprop BNNs; BB α: black-box alpha divergence minimization.
Image classification and OOD detection. For all models in this experiment, the CNN structure is the same as in [127], i.e., the 3 convolutional layers plus 3 fully connected layers of the DeepOBS benchmark [300]. Similar to [127], we apply a standard isotropic Gaussian prior to all weight parameters. We use Adam with a learning rate of 0.001 and a
(a)–(h) VIP posterior samples (piecewise linear prior) with 5, 10, 20, 50, 100, 150, 200, and 500 basis functions.
Figure 5.8 Posterior samples from VIPs with different numbers of basis functions. As more basis functions are used, the posterior samples from VIP become noisier, finally converging to GP-like behaviour when 500 basis functions are used. Compared to the ground-truth estimate in Figure 5.6, VIP clearly under-estimates the predictive uncertainty in between the training samples.
batch size of 100, and run the training procedure for 100 epochs. For FVI, we use 100 basis functions in the SPGs on both datasets. Note that each basis function is a three-layer convolutional network that maps from R^d to R^10. To significantly reduce memory usage, the parameters of the convolutional layers of all basis functions are shared.
(a)–(h) FVI posterior samples (piecewise linear prior) with 5, 10, 20, 50, 100, 150, 200, and 500 basis functions.
Figure 5.9 Posterior samples from FVI with different numbers of basis functions. FVI is still able to learn the piecewise linear behaviour of the prior as more basis functions are used. As the number of basis functions is increased to 500, FVI converges to a solution that is much closer to the ground truth (compared with VIP), and is still able to exhibit the non-Gaussian behaviours of the prior.
Since the prior is not reparameterizable, once the basis functions for FVI and VIP are sampled, they are frozen (in contrast, when the prior is reparameterizable, both FVI and VIP can optimize the basis functions, and the number of basis functions required is therefore much smaller than in this experiment).
From Figures 5.8 and 5.9, we first observe that as the number of basis functions increases, the predictive uncertainty of both FVI and VIP also increases, plateauing at around 200 basis functions. However, as more basis functions are used, the posterior samples from VIP become noisier, finally converging to GP-like behavior when 500 basis functions are used. Compared to the ground-truth estimate in Figure 5.6, VIP under-estimates the predictive uncertainty in between the training samples. This is because the piecewise linear behavior of the function samples violates the Gaussian assumption of VIP, such that the correlation level between points is lower than expected. On the other hand, FVI is still able to learn the piecewise linear behavior of the prior as more basis functions are used. As the number of basis functions is increased to 500, FVI converges to a solution that is much closer to the ground truth (compared with VIP) and is still able to exhibit the non-Gaussian behaviors of the prior. We can conclude that the advantage of FVI over VIP does not vanish as the number of basis functions increases; on the contrary, the difference between FVI and VIP becomes even more distinct and recognizable.
(a) functional BNN  (b) MFVI BNN  (c) GP  (d) HMC  (e) Ours
Figure 5.10 A regression task on a synthetic dataset (red crosses) reproduced from [70]. We plot the predictive mean and uncertainties for each algorithm. This task is used to demonstrate the theoretical finding on the pathologies of weight-space VI for single-layer BNNs: there is no setting of the variational parameters that can model the in-between uncertainty between the two data clusters. Functional BNNs [330] also have this problem, since BNNs are used as part of the model. On the contrary, our functional VI method can produce sensible in-between uncertainties for out-of-distribution data. See Appendix 5.C.2 for more details.
Proposition 5.7 (Limitations of single-hidden-layer BNNs [70]). Consider any single-hidden-layer fully-connected ReLU NN f : R^D → R. Let x_d denote the d-th element of the input vector x. Suppose we have a fully factorised Gaussian distribution over the weights and biases in the network. Consider any points p, q, r ∈ R^D such that r lies on the line segment pq, and either:

1. pq contains 0 and r is closer to 0 than both p and q; or

2. pq is orthogonal to and intersects the plane x_d = 0, and r is closer to the plane x_d = 0 than both p and q.
That is, weight-space inference for a single-hidden-layer variational BNN (using mean-field VI) fails to represent the in-between uncertainty and becomes over-confident on out-of-distribution data. In this experiment, the training data are sampled as follows: the 2-D input locations are generated by sampling 100 points, 50 from each of two separate Gaussian clusters. The cluster on the left of Figure 5.2 is centred around (−1, −1), and the other cluster is centred around (1, 1). Both have isotropic Gaussian noise with zero mean and variance 0.01. The outputs (y) are −1
and 1 for the left and right clusters, respectively. We further add Gaussian observation noise of variance 0.1 to the outputs (a sketch of this data-generation procedure is given below). To test whether the baselines can learn the in-between uncertainty between the clusters, we use a fully connected ReLU BNN with a single hidden layer (50 units). FVI also uses this prior as its functional prior, with 50 basis functions and 50 latent dimensions. The settings of MFVI are determined according to [70]; the settings of f-BNN are determined similarly.
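For completeness, a sketch of the data-generation procedure described above (function and variable names are our own):

```python
import numpy as np

def make_two_cluster_data(n_per=50, rng=None):
    """Generate the synthetic in-between-uncertainty dataset described
    above: two 2-D Gaussian clusters centred at (-1,-1) and (1,1) with
    input variance 0.01, targets -1 and +1, output noise variance 0.1."""
    rng = rng or np.random.default_rng(0)
    x_left = rng.normal(loc=[-1.0, -1.0], scale=0.1, size=(n_per, 2))
    x_right = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(n_per, 2))
    X = np.concatenate([x_left, x_right], axis=0)
    y = np.concatenate([-np.ones(n_per), np.ones(n_per)])
    y = y + np.sqrt(0.1) * rng.normal(size=2 * n_per)  # observation noise
    return X, y

X, y = make_two_cluster_data()
print(X.shape, y.shape)  # (100, 2), (100,)
```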
In the figure, the λ axis parameterizes the 1-dimensional straight line embedded in the 2-D plane that connects (−3, −3) and (3, 3); a point with λ-coordinate λ has actual 2-D coordinates (λ, λ). The results in Figure 5.2 show that both the mean-field variational BNN and the functional BNN suffer from the limitations of single-hidden-layer BNNs. On the contrary, FVI produces sensible in-between uncertainty that is similar to that of GPs and HMC. For GPs, we use the infinite-width BNN kernel, following [70].
(a) CPU time (s), piecewise constant prior  (b) CPU time (s), piecewise linear prior
Figure 5.11 CPU time comparison of FVI vs. f-BNN on implicit priors. Although f-BNNs are only trained for 100 epochs, their running time is still 100× that of FVI.
dimensional features as input (including dummy binary variables for categorical variables). The output has 9 different classes (actions). The CPU time consumed by each algorithm on Census is listed as follows:

Table 5.4 CPU time comparison when running the contextual bandit on the Census dataset.

Based on Table 5.4, FVI is nearly 500 times faster than f-BNN. The run time of FVI is similar to that of Bayes-by-Backprop, indicating that FVI is very efficient and scalable.
(a) F-BNN, piecewise constant prior  (b) F-BNN, piecewise linear prior
Figure 5.12 F-BNN on structured implicit priors, trained with 10k epochs
In Experiment 5.7.1, we ran f-BNNs for only 100 epochs due to their computational cost. Here, we provide improved results for fully-trained f-BNNs after 10k epochs. Note that this epoch number is much larger than the FVI setting (5k), since we found that after 5k epochs the f-BNN posteriors do not seem to improve over the results in Experiment 5.7.1. As shown in Figure 5.12, after 10k epochs the posterior uncertainty estimates of f-BNNs become much closer to the ground-truth estimates (Figure 5.6), compared with the 100-epoch version. However, this comes at the cost of significantly increased computation time. Moreover, f-BNNs still provide less convincing posterior samples in terms of mimicking the piecewise constant/linear behaviours.
Note that f-SVGD is not included in our main experiments in Table 5.1, since it is a particle-optimization-based inference method. Likewise, the GP is not included there since it is not a BNN-based model. For GPs, we used a variational sparse GP with 50 inducing points and an RBF kernel. The additional results in Table 5.5 show that FVI performs best on 6 out of 9 datasets. Moreover, in terms of NLL, FVI outperforms f-SVGD on 6 out of 9 datasets and outperforms the GP on 8 out of 9 datasets.
Table 5.6 Larger-scale regression experiments: average test negative log-likelihood
Dataset N FVI f-BNNs BBB FVI shallow
GPU 241600 2.93±0.03 2.97±0.02 2.99±0.01 3.10±0.04
Protein 45730 2.82±0.01 2.72±0.01 2.72±0.01 2.85±0.00
Naval 11934 -7.42±0.01 -7.24±0.01 -6.96±0.01 -7.38±0.04
From the perspective of uncertainty, the latent variable zn (or, more precisely, the posterior pθ(zn|xn)) quantifies the data uncertainty of each data point xn (assuming, e.g., that θ is a sufficient statistic and that the model complexity is correct). This type of uncertainty is categorized as aleatoric uncertainty, since the uncertainty about zn cannot be reduced by collecting more data. This is because each data instance xn is assigned its own latent variable zn, and pθ(zn|xn, xm) = pθ(zn|xn) for all m ≠ n.
Using the methods (variational EM/amortized VI/wake-sleep) introduced in Chapter
2, learning and inference of log pθ (D) can be performed efficiently. However, all these
methods are built upon an important assumption: the data set D = {xn }1≤n≤N must be
fully observed. Unfortunately, in many real-world applications, this assumption does not
hold: when we collect datasets, they often contain missing data. This can be caused by
human errors, physical constraints, non-response, etc. Since we are uncertain regarding the
missing entries in our dataset, the failure to account for this uncertainty may compromise
the performance of machine learning models, as well as downstream tasks based on these
models. Therefore, it is important to be able to perform learning and inference under missing
data and quantify the uncertainties caused by missing data.
Following Rubin [279], missing data mechanisms are commonly classified into three categories:
1. If p(r|x) = p(r), the data is missing completely at random (MCAR). That is, the missingness is independent of the data;
2. If p(r|x) = p(r|xo), the data is missing at random (MAR). That is, the cause of missingness r is observed;
3. Otherwise, the data is missing not at random (MNAR). That is, the cause of missingness is unobserved.
MCAR and MAR are the most commonly used assumptions in practice due to their technical simplicity: under MCAR and MAR, we can ignore the missing mechanism and focus only on p(x) [279]. However, MCAR and MAR do not hold in many real-world applications where, by contrast, MNAR mechanisms are more common. In MNAR scenarios, we must explicitly model the joint distribution p(x, r). In Chapter 8, we will discuss these assumptions in more detail.
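To make the three mechanisms concrete, the following sketch (illustrative Python/numpy code; the data, mask variables and missingness probabilities are our own construction, not part of any experiment in this thesis) generates masks under each assumption and shows that deletion-based estimation is only unbiased under MCAR:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x0 = rng.normal(size=n)                    # always-observed feature
x1 = 0.8 * x0 + 0.6 * rng.normal(size=n)   # feature subject to missingness; true mean 0
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

r_mcar = rng.random(n) > 0.3               # MCAR: independent of the data
r_mar = rng.random(n) > sigmoid(2 * x0)    # MAR: depends only on the observed x0
r_mnar = rng.random(n) > sigmoid(2 * x1)   # MNAR: depends on the missing value itself

# Deletion-based estimates of E[x1] are unbiased only under MCAR.
for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    print(name, round(float(x1[r].mean()), 3))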
Methods for handling missing data have been extensively studied in the past few decades. These methods can be roughly classified into two categories: complete case analysis (CCA)-based methods and imputation-based methods. CCA-based methods, such as listwise deletion [5] and pairwise deletion [209], directly delete data instances that contain missing entries and
only keep complete instances for subsequent data analysis. Listwise/pairwise deletion methods are known to be unbiased under MCAR, but biased under MAR/MNAR. By contrast, imputation-based methods try to replace missing values with imputed/predicted values. One popular imputation technique is single imputation, where only a single set of imputed values is produced for each data instance. Standard single-imputation techniques include mean/zero imputation, regression-based imputation [5], and non-parametric methods [143, 325].
Unfortunately, single imputation methods only return point estimates of the missing
values. Hence, they cannot quantify the missing data uncertainty. As opposed to single
imputation, multiple imputation (MI) methods [280, 281, 119, 237] such as MICE [362] are
simulation-based methods that return multiple imputation values for subsequent statistical
analysis. Unlike single imputation, the standard errors of estimated parameters produced
with MI are known to be unbiased [282]. Apart from MI, there exist other methods such as
full information maximum likelihood [9, 67] and inverse probability weighting [277, 120],
which can be directly applied to MAR without introducing additional bias. However, these
methods assume a MAR missing data mechanism, and cannot be directly applied to MNAR
without introducing bias.
1. Learning. How can we efficiently estimate the parameters of the complete data model log pθ(x) (or log pθ(x, r) under MNAR), given only D_O = {x_O^(n)}_{1≤n≤N}? Furthermore, can this be done in the large data/large model regime?
2. Inference. How can we quantify the missing data uncertainty? That is, given a partially observed xO, how can we compute pθ(z|xO)? Equivalently, how can we perform missing data imputation, pθ(xU|xO)? There are many possible partitions of the complete data x into xO and xU: for d observable variables, there exist 2^d different combinations of the observed subset xO. Therefore, there are 2^d different
posterior distributions of the form pθ(xU|xO), and performing Bayesian inference for each combination separately presents a significant computational challenge. How can we address this challenge efficiently?
3. Decision making. The aleatoric uncertainty (represented by pθ(z|x)) cannot be fully eliminated by collecting more data points. However, the missing data uncertainty (represented by pθ(z|xO)) is still partially reducible, in the sense that if we are able to actively observe more features of the same data point, the posterior pθ(z|xO) will eventually approach the complete-data posterior pθ(z|x). As a matter of fact, in many applications it is possible to acquire additional information (sometimes at a cost). For example, when assessing the health status of a patient, we may decide to take additional measurements, such as diagnostic tests or imaging scans, before making a final assessment.
Therefore, this brings up an interesting decision-making problem: suppose that in a given prediction task we are interested in some task-specific variables xφ ⊂ xU (which we call the target variables). Can we then optimally choose which feature xi ∈ xU \ xφ to observe next, so that it provides the most informative knowledge about the target xφ? Or, equivalently, so that the missing data uncertainty about xφ, represented by pθ(xφ|xO), is maximally reduced? If these questions can be answered, we expect that the decision quality on xφ can be improved (evaluated by metrics such as likelihood and/or accuracy).
As analyzed in Chapter 1 Section 1.3, all these research questions are important not only for the technical advancement of generative models, but also for building practical systems that replicate human experts' decision-making behaviors. In Chapter 7, we will first work under the MCAR and MAR assumptions, and present our original contributions to these research questions in the context of unsupervised learning and active information acquisition.
Figure 6.1 Model non-identifiability under MNAR will introduce additional biases when
performing missing data imputation
Apparently, at least one of them must be biased. Thus, when dealing with MNAR missing mechanisms, we must take model identifiability into account and investigate sufficient conditions for identifiability under missing data. This research question will be studied in Chapter 8.
Remark (Non-identifiability under MCAR). Model non-identifiability will not cause the aforementioned biases under the MCAR assumption. When the data is MCAR, we can recover the distribution of the complete data pθ(x) by using listwise deletion: we delete all data points that contain missing values, and only fit the model to complete data instances. This is equivalent to estimating the conditional distribution pθ(xO|r = 1), since under MCAR

pθ(x) = pθ(x, r = 1) / p(r = 1) = pθ(xO | r = 1),

which is exactly the same as the listwise deletion estimate, pθ(xO|r = 1). Similarly, any marginal distribution pθ(xO) can be estimated by

pθ(xO) = pθ(xO, rO = 1, rU = 0) / p(rO = 1, rU = 0) = pθ(xO | rO = 1, rU = 0).
Hence, the imputation distribution can be computed as

pθ(xU|xO) = pθ(x) / pθ(xO),

with both terms estimated as above. Without the MCAR assumption, however, we would instead need to write pθ(x) = pθ(x, r = 1) / p(r = 1|x). This time, the numerator, pθ(x, r = 1), can still be estimated by listwise deletion. However, the denominator p(r = 1|x) cannot always be estimated from observational data: it requires additional assumptions on the recoverability of p(r = 1|x) [233]. Here, recoverability means that there exists a functional g such that p(r = 1|x) = g(p(xO, r)). Without this guarantee, we cannot recover the complete data distribution pθ(x).
• Chapter 7: building upon deep generative models and probabilistic modeling, Chapter 7 presents a practical framework for learning, inference, and high-value information acquisition under MAR (missing at random) missing values. This framework is referred to as EDDI (Efficient Dynamic Discovery of high-value Information).
• Chapter 8: this chapter leans more towards theoretical aspects and revisits the assumptions of the approach used in Chapter 7. It extends the work of EDDI to the more general missing not at random (MNAR) setting, and studies the model identifiability of deep generative models under MNAR.
Chapter 7
Efficient Dynamic Discovery of High-Value Information with Partial VAE
Human experts are not only good at evaluating the level of uncertainty in certain decision-making problems, but also at actively collecting the new information that is most useful for reducing those uncertainties. Imagine a person walking into a hospital with a broken arm.
The first question from healthcare personnel would likely be “How did you break your arm?”
instead of “Do you have a cold?”, because the answer reveals the most relevant information
for this patient’s treatment. However, automating this human expertise of information
acquisition is difficult. In applications such as online questionnaires, for example, most
existing online questionnaire systems either present exhaustive questions [174, 310] or use
extremely time-consuming human labeling work to manually build a decision tree to reduce
the number of questions [375]. This wastes the valuable time of experts or users (patients).
An automated solution for the personalized dynamic acquisition of information has great
potential to save much of this time in many real-life applications.
What are the technical challenges in building an intelligent information acquisition system? If we carefully analyze the previous healthcare example, we can break the doctor's thinking process down into three steps. First, he/she evaluates the current situation of the patient, with uncertainty in mind; second, given the current situation, he/she investigates the possible scenarios and outcomes; and finally, based on those possible scenarios, he/she asks the questions that are most relevant and impactful. If we translate this thinking process into machine learning terms, we can first identify that missing data uncertainty is a key issue: at any point in time we only observe a small subset of the patient's symptoms or medical test results, yet have to reason about the possible causes of their symptoms. We are thus uncertain regarding the missing parts of the dataset and will need
an accurate probabilistic model that can quantify the missing data uncertainty and perform inference given a variable subset of observed answers. Another key problem is deciding what to ask next: can we optimally choose which variable to observe next, so that we reduce the missing data uncertainty and improve the decision quality of relevant tasks? This requires assessing the value of each possible question or measurement, the exact computation of which is intractable. Moreover, compared to traditional active learning methods, here we need to actively select individual features rather than data instances; therefore, many existing methods are not applicable. In addition, these traditional methods are often not scalable to the large volumes of data available in many practical cases [305, 174].
In this chapter, we propose the EDDI (Efficient Dynamic Discovery of high-value Information) framework as a scalable unsupervised learning and information acquisition system under missing data. We assume that only a partially observed version of the dataset is available for analysis, and that information acquisition is always associated with some cost. Given a specific decision task, such as estimating a customer's experience or assessing population health status, we can utilize the framework to dynamically decide which piece of information to acquire next. The EDDI framework is very general, and the information can come in any form, such as answers in question-answering tasks or lab test results in medical diagnosis. Our contributions are:
1. A new partial amortized inference method for generative modeling under par-
tially observed data (Section 7.2.1). We extend the variational autoencoder
(VAE) [150, 272], to account for partial observations. The resulting method,
which we call the Partial VAE, is inspired by the set formulation of the data
[260, 374]. The Partial VAE, as a probabilistic framework in the presence of
missing data, is highly scalable, and serves as the base for the EDDI framework.
Note that Partial VAE itself is widely applicable and can be used on its own as a
non-linear probabilistic framework for missing-data imputation.
2. An information-theoretic acquisition function with a novel efficient approxima-
tion, yielding a novel variable-wise active learning method (Section 7.2.2).
Based on the partial VAE, we actively select the unobserved variable which
contributes most to the decision task, such as customer surveys and health assess-
ments, evaluated using the mutual information. This acquisition function does
not have an analytical solution, and we derive a novel efficient approximation.
7.2 Methodology
In this section, we first present the Partial VAE, which models and performs inference on partial observations. We then complete the EDDI framework by presenting our new acquisition function and its estimation method.
VAE and amortized inference. A VAE defines a generative model in which the data x is generated from latent variables z: pθ(x, z) = ∏i pθ(xi|z)p(z). The data generation distribution, pθ(x|z), is realized by a deep neural network. To approximate the posterior over the latent variables, pθ(z|x), VAEs use amortized variational inference: an encoder, which is another neural network taking the data x as input, produces a variational approximation of the posterior, q(z|x; φ). As in traditional variational inference, the VAE is trained by maximizing an evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence between q(z|x; φ) and pθ(z|x).
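For reference, the ELBO mentioned here takes the standard form

L(θ, φ; x) = E_{q(z|x;φ)}[log pθ(x|z)] − DKL(q(z|x; φ) || p(z)) = log pθ(x) − DKL(q(z|x; φ) || pθ(z|x)),

which makes explicit that maximizing the ELBO with respect to φ minimizes the KL divergence between the approximate and the true posterior.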
VAEs are not directly applicable when data points have an arbitrary subset of entries missing. Consider the situation where the variables are divided into observed variables xO and unobserved variables xU. In this setting, we would like to efficiently and accurately infer p(z|xO) and p(xU|xO). One main challenge is that there are many possible partitions {U, O}, and the size of the observed set may vary. Therefore, the classic approach of training a VAE with the variational bound and an amortized inference network is not applicable. We propose to extend amortized inference to handle partial observations.
Partial VAE
This implies that, given z, the observed variables xO are conditionally independent of xU. Therefore,

pθ(xU|xO, z) = pθ(xU|z), (7.2.2)

and inference about xU reduces to inference about z. Hence, the key object of interest in this setting is p(z|xO), i.e., the posterior over the latent variables z given the observed variables xO; once we can infer z, computing xU is straightforward. To approximate p(z|xO), we introduce a variational inference network q(z|xO) and define a partial variational lower bound, or partial ELBO:

L_partial(xO) = E_{z∼q(z|xO)}[log p(xO|z)] + E_{z∼q(z|xO)}[log p(z) − log q(z|xO)]. (7.2.3)
This bound, L_partial, depends only on the observed variables xO, whose dimensionality may vary across data points. We thus call the inference net q(z|xO) the partial inference net. Specifying q(z|xO) requires distributions for every partition {O, U} of I. Given a set of partially observed data, D_O = {x_O^(n)}_{1≤n≤N}, we can further write the partial ELBO w.r.t. this particular dataset:
log p(D_O) ≥ ∑_n E_{z^(n) ∼ q(z^(n)|x_O^(n))} [log p(x_O^(n)|z^(n))] + ∑_n E_{z^(n) ∼ q(z^(n)|x_O^(n))} [log p(z^(n)) − log q(z^(n)|x_O^(n))]. (7.2.4)
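A Monte Carlo estimate of this bound for a mini-batch can be sketched as follows (illustrative PyTorch code assuming a unit-variance Gaussian likelihood; encoder and decoder are placeholder names for the partial inference net, which returns the mean and log-variance of a diagonal Gaussian q(z|xO), and for the generative net):

import torch

def partial_elbo(x, mask, encoder, decoder):
    # x, mask: (batch, n_vars); mask_d = 1 if entry d is observed, else 0.
    mu, logvar = encoder(x, mask)                            # q(z | x_O)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
    recon = decoder(z)                                       # means of p(x | z)
    # log p(x_O | z): only observed entries contribute (up to an additive constant).
    log_lik = (-0.5 * (x - recon) ** 2 * mask).sum(dim=-1)
    # KL( q(z|x_O) || N(0, I) ), closed form for diagonal Gaussians.
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=-1)
    return (log_lik - kl).mean()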
Remark (Assumptions on missing mechanism). Here, we have implicitly assumed that the
missing mechanism is MCAR or MAR. That is, the missing mask r is independent of xU . In
the classic paper by Rubin [279], it is argued that when the missing data is MCAR or MAR,
we can ignore the missing mechanism and just perform maximum likelihood learning by
maximizing log p(xO) = log ∫_{xU} p(xO, xU) dxU. The reasoning behind this is as follows:

arg max_θ ∑_n log pθ(x_O^(n))
= arg max_θ ∑_n log pθ(x_O^(n)) p(r^(n)|x_O^(n))
= arg max_θ ∑_n log ∫_{x_U^(n)} pθ(x_O^(n), x_U^(n)) p(r^(n)|x^(n)) dx_U^(n)
= arg max_θ ∑_n log pθ(x_O^(n), r^(n)).
Therefore, we can ignore the missing mask r, as well as the associated missing mechanism p(r|x), when performing learning and inference.
Here, sd carries the information of the d-th observed variable, and |O| is the number of observed variables. In particular, sd contains information about the identity of the input, ed, and the corresponding input value, xd. There are many ways to define the identity variable ed. Naively, it could be the coordinates of observed pixels for images, or a one-hot embedding of the question number in a questionnaire. Depending on the problem setting, it can be beneficial to learn e as an embedding of the identity of the variable, either with or without a naive encoding as input. In this work, we treat e as an unknown embedding to be optimized during training.
There are also different ways to construct sd. A common choice is concatenation, sd = [ed, xd], which is often used in computer vision applications [260]; this architecture is illustrated in Figure 7.1a. We refer to this setting as the PointNet (PN) specification of the Partial VAE. However, the construction of sd can be more flexible. As an alternative, we propose to construct sd = ed ∗ xd using element-wise multiplication, shown in Figure 7.1b. We show that this formulation generalizes the naive zero-imputing (ZI) VAE [240] (cf. Appendix 7.C.1). We refer to the multiplication setting as the PointNet Plus (PNP) specification of the Partial VAE.
We then use a neural network h(·) to map the input sd to R^K, where K is the latent space size. The key to the PNP/PN structure is the permutation-invariant aggregation operation g(·), such as max-pooling or summation. In this way, the mapping c(xO) is invariant to permutations of the elements of xO, and xO can have arbitrary length. Finally, the fixed-size code c(xO) is fed into an ordinary neural network that transforms the code into the statistics of a multivariate Gaussian distribution approximating p(z|xO). The procedure is illustrated in Figure 7.2. As discussed before, given p(z|xO), we can estimate p(xU|z).
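As an illustration, the sketch below gives a minimal PyTorch implementation of such a partial inference net in its PNP form (sd = ed ∗ xd, with summation as g). Layer sizes and names are our own choices for illustration, not the exact architectures used in the experiments; the module matches the encoder(x, mask) interface assumed in the earlier ELBO sketch.

import torch
import torch.nn as nn

class PartialEncoder(nn.Module):
    """Permutation-invariant inference net q(z | x_O) (PNP variant, illustrative)."""

    def __init__(self, n_vars, embed_dim=20, feat_dim=500, latent_dim=20):
        super().__init__()
        self.e = nn.Parameter(torch.randn(n_vars, embed_dim))   # learned IDs e_d
        self.h = nn.Sequential(nn.Linear(embed_dim, feat_dim), nn.ReLU())
        self.out = nn.Linear(feat_dim, 2 * latent_dim)          # Gaussian statistics

    def forward(self, x, mask):
        # x, mask: (batch, n_vars); mask_d = 1 if x_d is observed.
        s = self.e.unsqueeze(0) * x.unsqueeze(-1)     # s_d = e_d * x_d
        feats = self.h(s) * mask.unsqueeze(-1)        # drop unobserved entries
        c = feats.sum(dim=1)                          # g: sum over the observed set
        mu, logvar = self.out(c).chunk(2, dim=-1)     # statistics of q(z | x_O)
        return mu, logvar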
In this chapter, we mainly consider the case where a subset of the variables represents the statistics of interest, xφ. Sampling xi ∼ p(xi|xo) is approximated by xi ∼ p̂(xi|xo), where p̂(xi|xo) is obtained from the Partial VAE: we first sample z ∼ q(z|xo), and then xi ∼ p(xi|z). The same applies to p(xi, xφ|xo), which appears in Equation (7.2.8).
is intractable, since both p(xφ|xi, xo) and p(xφ|xo) are intractable. For high-dimensional xφ, entropy estimation can be difficult, and the entropy term ∫_{xφ} p(xφ|xi, xo) log p(xφ|xi, xo) dxφ depends on i and hence cannot be ignored. In the following, we show how to approximate this expression.
Note that analytic solutions of KL divergences are available for specific variational families q(z|xO) (such as the Gaussian distributions commonly used in VAEs). Instead of calculating the information reward in x space, we have shown that one can calculate it in the latent space z instead.
Note that Equation (7.2.8) is exact. Additionally, we use the Partial VAE approximations p(z|xφ, xi, xo) ≈ q(z|xφ, xi, xo), p(z|xo) ≈ q(z|xo), and p(z|xi, xo) ≈ q(z|xi, xo). This leads to the final approximation of the information reward:

R̂(i, xo) = E_{xi ∼ p̂(xi|xo)} [DKL(q(z|xi, xo) || q(z|xo))] − E_{xφ,xi ∼ p̂(xφ,xi|xo)} [DKL(q(z|xφ, xi, xo) || q(z|xφ, xo))]. (7.2.9)

With this approximation, the divergence between q(z|xi, xo) and q(z|xo) can often be computed analytically in the Partial VAE setting, for example under Gaussian parameterizations. The only Monte Carlo sampling required is a single set of samples xφ, xi ∼ p̂(xφ, xi|xo), which can be shared across the different KL terms in Equation (7.2.9).
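The following sketch estimates this reward for one candidate variable i (illustrative PyTorch code; encoder is the partial inference net from the sketch above, decoder is the generative net returning imputed means, and phi is the index set of the target variables — all names are our own):

import torch

def gauss_kl(mu_q, lv_q, mu_p, lv_p):
    # KL( N(mu_q, e^{lv_q}) || N(mu_p, e^{lv_p}) ) for diagonal Gaussians.
    return 0.5 * ((lv_p - lv_q) + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1.0).sum(-1)

def info_reward(i, phi, x, mask, encoder, decoder, n_samples=50):
    """Monte Carlo estimate of R_hat(i, x_o) in Eq. (7.2.9)."""
    mu_o, lv_o = encoder(x, mask)                         # q(z | x_o)
    rewards = []
    for _ in range(n_samples):
        z = mu_o + torch.exp(0.5 * lv_o) * torch.randn_like(mu_o)
        x_imp = torch.where(mask.bool(), x, decoder(z))   # x_i, x_phi ~ p_hat(. | x_o)
        mask_i = mask.clone(); mask_i[:, i] = 1.0             # pretend x_i is observed
        mask_phi = mask.clone(); mask_phi[:, phi] = 1.0       # pretend x_phi is observed
        mask_iphi = mask_i.clone(); mask_iphi[:, phi] = 1.0   # both x_i and x_phi observed
        mu_i, lv_i = encoder(x_imp, mask_i)
        mu_p, lv_p = encoder(x_imp, mask_phi)
        mu_ip, lv_ip = encoder(x_imp, mask_iphi)
        # One shared imputation sample feeds both analytic KL terms of Eq. (7.2.9).
        rewards.append(gauss_kl(mu_i, lv_i, mu_o, lv_o)
                       - gauss_kl(mu_ip, lv_ip, mu_p, lv_p))
    return torch.stack(rewards).mean(dim=0)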
Remark (Stop criterion based on missing data uncertainty). When actively acquiring new variables, it is sometimes useful to apply a stop criterion that terminates the acquisition once certain conditions are met. Although we do not explicitly apply any stop criterion in this thesis, we would like to point out that the EDDI framework naturally provides a principled one. The intuition is that, when there is not enough information to predict xφ, the estimated level of missing data uncertainty in p(xU|xO) (or p(xφ|xO)) should be high. This indicates that we need to continue acquiring more variables into xO until we are significantly more certain. This uncertainty-based stop criterion is further developed and investigated in our recent preliminary work [106], where we show that such a criterion significantly improves the efficiency of the EDDI algorithm in the context of symptom-based self-diagnosis.
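As a simple illustration of this idea (our own sketch, not the exact rule used in [106]), one could stop once the Monte Carlo spread of the imputed target falls below a chosen threshold:

import torch

def should_stop(x, mask, phi, encoder, decoder, n_samples=50, threshold=0.05):
    # Estimate the uncertainty of p_hat(x_phi | x_O) by the spread of its samples.
    mu, logvar = encoder(x, mask)
    draws = []
    for _ in range(n_samples):
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        draws.append(decoder(z)[:, phi])          # samples of x_phi given x_O
    # Stop acquiring once the average predictive standard deviation is small.
    return torch.stack(draws).std(dim=0).mean() < threshold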
All unobserved variables start with green bars, and turn purple once selected by the algorithm. The right plot of each row is the corresponding violin plot of the posterior density estimates of the remaining unobserved variables; the rightmost variable in each violin plot corresponds to the target variable. At each step, EDDI estimates the information reward of each unobserved variable (green) given the observed variables. Then, the variable with the highest information reward is acquired (becomes purple), and the posterior densities of the remaining unobserved variables (especially the target variable) shrink. At the end of the fourth step, the Partial VAE is already quite confident in its prediction of the target variable.
7.3 Experiments
Here we evaluate the proposed EDDI framework. We first assess the Partial VAE component of EDDI alone on an image inpainting task, both qualitatively and quantitatively (Section 7.3.1). We compare our two proposed PN-based Partial VAEs with the zero-imputing (ZI) VAE [240]. Additionally, we modify the ZI VAE to take as input the mask matrix indicating which variables are currently observed; we name this method ZI-m VAE. We then demonstrate the performance of the entire EDDI framework on datasets from the UCI repository (Section 7.3.2), as well as in two real-life application scenarios: risk assessment in intensive care (Section 7.3.3) and public health assessment with a national health survey (Section 7.3.4). We compare the performance of EDDI, using four different Partial VAE settings, with two baseline information acquisition strategies. The first baseline is the random active feature selection strategy (denoted RAND), which randomly picks the next variable to observe; RAND reflects the strategy used in many real-world applications, such as online surveys. The second baseline is the single best strategy (denoted SING), which finds a single fixed global optimal order in which to select variables; this order is then applied to all data points. SING uses the objective function in Equation (7.2.9) to find the optimal ordering by averaging over all the test data.
Figure 7.3 Image inpainting example with MNIST dataset using Partial VAE with four
settings.
Figure 7.4 Information reward estimated during the first 4 active variable selection steps on a randomly chosen Boston Housing test data point. Model: PNP; strategy: EDDI. Each row contains two plots regarding the same time step. Bar plots on the left show the information reward estimate of each variable on the y-axis; all unobserved variables start with green bars, and turn purple once selected by the algorithm. Right: violin plot of the posterior density estimates of the remaining unobserved variables.
Inpainting Random Missing Pixels. We use the MNIST dataset [171] and remove pixels randomly for this task. The same settings are used for all methods (see Appendix 7.B.1 for details). During training, we remove a random portion (uniformly sampled between 0% and 70%) of the pixels. We then impute missing pixels on a partially observed test set (constructed by removing 70% of the pixels uniformly at random). The performance of pixel imputation is evaluated by the test ELBO on the missing pixels. The first two rows in Table 7.1 show training and test ELBOs for all algorithms on this partially observed dataset. Additionally, we show an ordinary VAE (VAE-full) trained on the fully observed dataset as an ideal reference. Among all Partial VAE methods, the PNP approach performs best.
Table 7.1 Comparing models trained on partially observed MNIST. VAE-full is an ideal
reference.
7.B.2) [59]. We report the results of EDDI with all four different specifications of the Partial VAE (ZI, ZI-m, PN, PNP).
All Partial VAEs are first trained on partially observed UCI datasets, where a random portion of the variables is removed. We then actively select variables for each test point, starting from the empty observation set xo = ∅. In all UCI datasets, we randomly sample 10% of the data as the test set. All experiments are repeated ten times.
Taking the PNP-based setting as an example, Figure 7.5 shows the test RMSE on xφ at each variable selection step for three different datasets, where xφ is defined by the UCI task. We call this curve the information curve (IC). We see that EDDI can obtain information efficiently: it achieves the same test RMSE with less than half of the variables. The single optimal ordering also improves upon random ordering; however, it is less efficient than EDDI, since EDDI performs active learning for each data instance, which is “personalized”. Figure 7.6 shows an example of the decision processes using EDDI and SING. The first step of EDDI overlaps largely with SING; from the second step onwards, EDDI makes “personalized” decisions.
We also present the average performance across all datasets and settings. The area under the information curve (AUIC) can then be used to compare performance across models and strategies; a smaller AUIC value indicates better performance. However, because different datasets have different scales of RMSE and different numbers of variables (and hence of steps), it is not fair to average AUIC across datasets to compare overall performance. We thus define the average ranking of AUIC, which compares the 12 methods (indexed by i) across these datasets as

r_i = (1 / ∑_j N_j) ∑_{j=1}^{6} ∑_{k=1}^{N_j} r_{ijk}, i = 1, ..., 12.

These 12 methods are the cross-combinations of the four Partial VAE models with the three variable selection strategies; r_i is the final ranking of the i-th combination, r_{ijk} is the ranking of the i-th combination (based on AUIC value) on the k-th test data point of the j-th UCI dataset, and N_j is the size of the j-th UCI dataset. This gives us ∑_j N_j different rankings of the 12 methods. Finally, we compute mean and standard error statistics based on these rankings. Table 7.2 summarizes the average ranking results. We provide an additional statistical significance test (the Wilcoxon signed-rank test for paired data) in Appendix 7.B.2. Based on these experimental results, we see that EDDI outperforms the other variable selection orders in all Partial VAE settings. Among the different Partial VAE settings, PNP/PN-based settings perform better than ZI-based settings.
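For concreteness, the ranking computation can be sketched as follows (illustrative numpy/scipy code; the array layout is our own assumption):

import numpy as np
from scipy.stats import rankdata

def average_ranking(curves):
    """curves: per-dataset arrays of shape (n_methods, N_j, n_steps_j) holding
    the RMSE information curves of every method on every test point."""
    ranks = []
    for c in curves:
        auic = np.trapz(c, axis=-1)           # area under each information curve
        ranks.append(rankdata(auic, axis=0))  # rank the methods per test point
    # Average each method's rank over all test points of all datasets.
    return np.concatenate(ranks, axis=1).mean(axis=1)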
Method Time
DRAL 2747.16
EDDI 2.64
In our task, we focus on variable selection, which corresponds to medical instrument selection. We thus further process the time-series variables into static variables by temporal averaging.
Figure 7.8 shows the information curves (based on Bernoulli likelihoods) of the different strategies, using the PNP-based Partial VAE as an example (more results in Appendix 7.B.3). Table 7.4 shows the average ranking of AUIC under different settings. In this application, EDDI significantly outperforms the other variable selection strategies in all Partial VAE settings, and the PNP-based setting performs best.
Figure 7.8 Information curves of active variable selection on the risk assessment task on MIMIC III, produced with the PNP setting.
Figure 7.9 Information curves of active (grouped) variable selection on the risk assessment task on NHANES, produced with the PNP setting.

Table 7.4 Average ranking on AUIC, MIMIC III
Method  EDDI         Random       Single best
ZI      8.83 (0.01)  7.97 (0.02)  9.83 (0.01)
ZI-m    4.91 (0.01)  7.00 (0.01)  5.91 (0.01)
PN      4.96 (0.01)  6.62 (0.01)  5.96 (0.01)
PNP     4.39 (0.01)  6.18 (0.01)  5.39 (0.01)

Table 7.5 Average ranking on AUIC, NHANES
Method  EDDI         Random       Single best
ZI      6.00 (0.10)  8.45 (0.09)  6.51 (0.09)
ZI-m    8.06 (0.09)  8.67 (0.09)  8.68 (0.07)
PN      5.28 (0.10)  5.57 (0.10)  5.46 (0.09)
PNP     4.80 (0.10)  5.30 (0.10)  5.17 (0.10)
Here we perform variable selection at the group level: at each step, the algorithm selects one group to observe. This is more challenging than the experiments in the previous sections, since it requires the generative model to simulate a whole group of unobserved variables in Equation (7.2.9) at once. When evaluating the test RMSE on the target variable of interest, we treat the variables in each group equally. For a fair comparison, the calculation of the area under the information curve (AUIC) is weighted by the size of the group chosen by the algorithm; specifically, the AUIC is calculated after spline interpolation. The information curves in Figure 7.9, together with the AUIC statistics in Table 7.5, show that EDDI outperforms the other baselines. In addition, this experiment shows that EDDI is capable of performing active selection over a large pool of grouped variables in order to estimate a high-dimensional target.
Traditional methods without amortization. Prediction-based methods have shown advantages for missing value imputation [298]. Efficient matrix-factorization-based methods have recently been applied [144, 133, 289], where the observations are assumed to decompose as the product of low-dimensional matrices. In particular, probabilistic frameworks with various distributional assumptions [289, 28] have been used for missing value imputation [373, 99], and also for recommender systems, where unlabeled items are predicted [326, 350, 89].
Probabilistic matrix factorization has also been used in the active variable selection framework called the dimensionality reduction active learning model (DRAL) [174]. These traditional methods suffer from limited model capacity, since they are typically linear. Additionally, they do not scale to large volumes of data, and are thus usually not applicable in real-world applications. For example, Lewenberg et al. [174] tested the performance of their method on a single user, due to the heavy computational cost of traditional inference methods for probabilistic matrix factorization.
Utilizing Amortized Inference. Amortized inference [150, 272, 376] has significantly improved the scalability of deep generative latent variable models. In the case of partially observed data, amortized inference is of particular interest due to the speed requirements of many real-life applications. Wu et al. [368] use amortized inference during training, where the training dataset is assumed to be fully observed; at test time, traditional non-scalable inference is used to infer missing data entries from the partially observed dataset using the pre-trained model. This method is restrictive, since it is not scalable at test time, and the fully-observed training set assumption does not hold in many applications. Nazabal et al. [241] use zero imputation (ZI) for amortized inference on both training and test sets with missing data entries. ZI is a generic and straightforward method that first fills the missing data with zeros, and then feeds the imputed data to the inference network. The drawback of ZI is that it introduces bias when the data are not missing completely at random, which leads to a poorly fit model; we also observe artifacts when using it for the image inpainting task. Independent of our work, Garnelo et al. [79]
Active Feature Acquisition (AFA). Active sequential feature selection is in great need, especially in cost-sensitive applications. Many methods have thus been developed, resulting in the class of methodologies called active feature acquisition (AFA) [217, 286, 333, 124]. For instance, Melville et al. [217] and Saar-Tsechansky et al. [286] designed objectives to select any feature from any instance so as to minimize the cost of achieving high accuracy. The proposed framework is very general. However, the problem setting of AFA methods is different from our active variable selection problem: AFA aims to optimally select a training set that results in the best classifier (model), while assuming that the test data are fully observed. By contrast, our framework aims to identify and acquire high-value information sequentially for each test instance.
7.5 Conclusion
In this chapter, we presented EDDI, a novel and efficient framework for unsupervised learning and dynamic active variable selection under missing data. Within the EDDI framework, we proposed the Partial VAE, which performs amortized inference to handle missing data. The Partial VAE alone can be used as a computationally efficient non-linear probabilistic imputation method. Based on it, we designed a variable-wise acquisition function for EDDI and derived a corresponding approximation method. EDDI has demonstrated its effectiveness on active variable selection tasks across multiple real-world applications. In the future, we would like to extend the EDDI framework to handle more complicated scenarios, such as data missing not at random, time series, and cold-start situations.
R(i, xo) = E_{xi∼p(xi|xo)} [DKL(p(xφ|xi, xo) || p(xφ|xo))],

where p(xφ|xi, xo) = ∫_z p(xφ|z) q(z|xi, xo), p(xφ|xo) = ∫_z p(xφ|z) q(z|xo), and q(z|xo) are the approximate conditional distributions given by the Partial VAE. We now consider the problem of directly approximating R(i, xo).
Applying the chain rule of the KL divergence, and then applying it again to DKL(p(xφ, z|xi, xo) || p(xφ, z|xo)), together with the Partial VAE approximations

p(z|xφ, xi, xo) ≈ q(z|xφ, xi, xo), p(z|xi, xo) ≈ q(z|xi, xo), p(z|xo) ≈ q(z|xo),

we have

R(i, xo) ≈ E_{xi∼p(xi|xo)} [DKL(q(z|xi, xo) || q(z|xo))] − E_{xi∼p(xi|xo)} E_{xφ∼p(xφ|xi,xo)} DKL(q(z|xφ, xi, xo) || q(z|xφ, xo))
= E_{xi∼p(xi|xo)} [DKL(q(z|xi, xo) || q(z|xo))] − E_{xφ,xi∼p(xφ,xi|xo)} DKL(q(z|xφ, xi, xo) || q(z|xφ, xo)) = R̂(i, xo).
This new objective tries to maximize the shift of belief about the latent variables z induced by observing xi, while penalizing the information that cannot be absorbed by xφ (via the penalty term DKL(q(z|xφ, xi, xo) || q(z|xφ, xo))). Moreover, it is more computationally efficient, since one set of samples xφ, xi ∼ p(xφ, xi|xo) can be shared across the different terms, and the KL divergence between common encoder parameterizations (such as Gaussians and normalizing flows) can be computed exactly, without the need for approximate integrals. Note also that, under this approximation, sampling xi ∼ p(xi|xo) is approximated by xi ∼ p̂(xi|xo), where p̂(xi|xo) is defined by the following process in the Partial VAE: first sample z ∼ q(z|xo), and then xi ∼ p(xi|z). The same applies to p(xi, xφ|xo).
For our MNIST experiment, we randomly draw 10% of the whole dataset as the test set. The Partial VAE models (ZI, ZI-m, PN and PNP) share the same architecture with 20-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 20-200-500-500-D fully connected neural network with ReLU activations (where D is the data dimension, D = 784). The inference nets (encoders) share the same structure, D-500-500-200-40, mapping the observed data to the distributional parameters of the latent space. For the PN-based parameterizations, we use a 500-dimensional feature mapping h parameterized by a single-layer neural network, and 20-dimensional ID vectors ei (see Section 7.2.1) for each variable. We choose the symmetric operator g to be basic summation.
During training, we apply Adam optimization [148] with the default hyperparameter settings, a learning rate of 0.001, and a batch size of 100. We generate the partially observed MNIST dataset by introducing artificial missingness at random during training: we first draw a missing-rate parameter from a uniform distribution U(0, 0.7) and randomly choose variables to be unobserved; this step is repeated at each iteration. We train our models for 3K iterations.
Figure 7.10 Random images generated using (a) naive zero imputing, (b) zero imputing
with mask, (c) PN and (d) PNP, respectively.
All data are normalized and then scaled between 0 and 1. For each of the 10 repetitions in total, we randomly draw 10% of the data as the test set. The Partial VAE models (ZI, ZI-m, PN and PNP) share the same architecture with 10-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 10-50-100-D neural network with ReLU activations (where D is the data dimension). The inference nets (encoders) share the same structure, D-100-50-20, mapping the observed data to the distributional parameters of the latent space. For the PN-based parameterizations, we further use a 20-dimensional feature mapping h parameterized by a single-layer neural network, and 10-dimensional ID vectors ei (see Section 7.2.1) for each variable. We choose the symmetric operator g to be basic summation.
As in the image inpainting experiment, we apply Adam optimization during training with the default hyperparameter settings and a batch size of 100, and inject random missingness as before. We trained our models for 3K iterations.
During active learning, we draw 50 samples to estimate the expectation under xφ, xi ∼ p(xφ, xi|xo) in Equation (7.2.8). In addition to information curves based on test RMSE, we also provide information curves based on test negative log-likelihood in Appendix 7.B.2. Note that this test NLL of the target variable is also estimated using 50 samples of xφ ∼ p(xφ|xo): we approximately compute the (expected) log predictive likelihood through log p(xφ|xo) ≈ log (1/M) ∑_{m=1}^{M} p(xφ|zm), where zm ∼ q(z|xo).
In this section, we perform the Wilcoxon signed-rank significance test on the AUIC (RMSE-based) performance of the different methods, to support our results in Table 7.2. Since Table 7.2 suggests that EDDI-PNP-Partial VAE is the best algorithm overall, we take EDDI-PNP-Partial VAE as the reference and perform Wilcoxon tests between EDDI-PNP-Partial VAE and all other 15 settings, to see whether the improvement is significant. Table 7.6 displays the corresponding p-value for each test. In all tests, the improvements of EDDI-PNP-Partial VAE are significant (compared with the standard α = 0.05 cutoff). This provides strong evidence confirming our results in Table 7.2.
Table 7.6 p-values of the Wilcoxon signed-rank test of EDDI-PNP vs. the other settings, on 6 UCI datasets, using AUIC (RMSE-based) as the evaluation metric.
Here we present additional plots of the RMSE information curves during active learning. Figure 7.11 presents the results for the Boston Housing, Energy, and Wine datasets, for the three approaches, i.e., PN, ZI, and masked ZI (ZI-m).
Figure 7.11 Information curves (based on RMSE) of active variable selection for the three UCI datasets and the three approaches, i.e., (first row) PointNet (PN), (second row) Zero Imputing (ZI), and (third row) Zero Imputing with mask (ZI-m). Green: random strategy; black: EDDI; pink: single best ordering. The plots display RMSE (y-axis, the lower the better) during the course of active selection (x-axis).
Negative test log-likelihood plots of PN, ZI and ZI-m on UCI datasets
Here we present additional plots of the negative test log-likelihood curves during active variable selection. Figure 7.12 presents the results for the Boston Housing, Energy, and Wine datasets, for the three approaches, i.e., PN, ZI, and masked ZI.
Figure 7.12 Information curves (based on test negative log-likelihood) of active variable selection for the three UCI datasets and the three approaches, i.e., (first row) PointNet (PN), (second row) Zero Imputing (ZI), and (third row) Zero Imputing with mask (ZI-m). Green: random strategy; black: EDDI; pink: single best ordering. The plots display negative test log-likelihood (y-axis, the lower the better) during the course of active selection (x-axis).
Here we present additional results for a new baseline, LASSO-based feature selection. This baseline is not presented in the main text, since LASSO is designed for a different problem setting: it requires fully observed data, and only works for regression problems with one-dimensional outputs; both the MIMIC III and NHANES tasks fail these requirements. Additionally, LASSO aims to select a global set of features that obtains the best performance, rather than selecting the most informative feature given partially observed information, and thus cannot be used in a sequential setting. We construct the LASSO feature selection baseline as follows: we first apply LASSO regression on the training set (which is fully observed in these UCI datasets) and select the features (denoted by A) that correspond to non-zero coefficients. Then, at test time, the LASSO strategy observes the features from A one by one, in random order; when all variables selected by LASSO have been picked, we stop the feature selection process. Once LASSO has completed feature selection, we use the corresponding Partial VAE (ZI, ZI-m, PN, PNP) to make predictions, for fairness.
Figure 7.13 presents the results for the Boston Housing, Energy, and Wine datasets as examples; full results on all UCI datasets are presented in Table 7.7. Note that in Table 7.7, the Wilcoxon signed-rank test is performed between the EDDI and LASSO strategies for each Partial VAE model, respectively. The results indicate that EDDI significantly outperforms LASSO in all circumstances. This is despite the fact that EDDI is a greedy sequential variable selection method built upon partially observed data, while the LASSO baseline makes use of information from fully observed data and selects its set of variables in a non-greedy, global manner, which is often unrealistic in many practical application settings.
Figure 7.13 Information curves of active variable selection for the three UCI datasets with PNP-Partial VAE. Black: EDDI; blue: single best ordering. The plots display test RMSE (y-axis, the lower the better) during the course of active selection (x-axis).
Table 7.7 Average rankings of AUIC (RMSE-based), and p-values of the Wilcoxon signed-rank test that EDDI outperforms LASSO (on 6 UCI datasets).
The decision process facilitated by the active selection of variables (within the EDDI framework) is illustrated in Figure 7.14 and Figure 7.15 for the Boston Housing dataset, using the PNP approach with the EDDI and single-best-ordering strategies, respectively.
For completeness, we provide details regarding the abbreviations of the variables used in the Boston dataset that appear in both figures.
PRD - proportion of residential land zoned for lots over 25,000 sq.ft.
7.B.3 MIMIC-III
Here we provide additional results of our approach on the MIMIC-III dataset.
For our active learning experiments on the MIMIC III dataset, we chose the variable of interest xφ to be the binary mortality indicator of the dataset. All data (except the binary mortality indicator) are normalized and then scaled between 0 and 1. We transformed the categorical variables into real-valued ones using the dictionary derived from [139], which makes use of the actual medical implications of each possible value. The binary mortality indicator is treated as a Bernoulli variable, and a Bernoulli likelihood function is applied. For each repetition (of 5 in total), we randomly draw 10% of the whole dataset as the test set. The Partial VAE models (ZI, ZI-m, PN and PNP) share the same architecture with 10-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 10-50-100-D neural network with ReLU activations (where D is the data dimension). The inference nets (encoders) share the same structure, D-100-50-20, mapping the observed data to the distributional parameters of the latent space. Additionally, for the PN-based parameterizations, we further use a 20-dimensional feature mapping h parameterized by a single-layer neural network, and 10-dimensional ID vectors ei (see Section 7.2.1) for each variable. We choose the symmetric operator g to be basic summation.
Adam optimization and random missingness are applied as in the previous experiments. We trained our models for 3K iterations. During active learning, we draw 50 samples to estimate the expectation under xφ, xi ∼ p(xφ, xi|xo) in Equation (7.2.8). Loss functions (RMSE and negative log-likelihood) of the target variable are also estimated using samples of xφ ∼ p(xφ|xo), through p(xφ|xo) ≈ (1/M) ∑_{m=1}^{M} p(xφ|zm), where zm ∼ q(z|xo).
Figure 7.16 shows the information curves (based on Bernoulli negative test likelihood) of active variable selection on the risk assessment task for MIMIC-III, as produced by the three approaches, i.e., ZI, PN, and masked ZI.
Figure 7.16 Information curves of active variable selection on the risk assessment task on MIMIC III, produced by: (a) Zero Imputing (ZI), (b) PointNet (PN), and (c) Zero Imputing with mask (ZI-m). Green: random strategy; black: EDDI; pink: single best ordering. The plots display negative test log-likelihood (y-axis, the lower the better) during the course of active selection (x-axis).
7.B.4 NHANES
Preprocessing and model details
For our active learning experiments on the NHANES dataset, we chose the variables of interest xφ to be the lab test result section of the dataset. All data are normalized and scaled between 0 and 1. Categorical variables are transformed into real-valued variables using the coding that comes with the dataset, which makes use of the actual ordering of the variables in the questionnaire. Then, for each repetition (of 5 in total), we randomly draw 8000 data points as the training set and 100 data points as the test set. All Partial VAE models (ZI, ZI-m, PN and PNP) use Gaussian likelihoods, with a diagonal Gaussian inference model (encoder). The Partial VAE models share the same architecture with 20-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 20-50-100-D neural network. The inference nets (encoders) share the same structure, D-100-50-20, mapping the observed data to the distributional parameters of the latent space. Additionally, for the PN-based parameterizations, we further use a 20-dimensional feature mapping h parameterized by a single-layer neural network, and 100-dimensional ID vectors ei (see Section 7.2.1) for each variable. We choose the symmetric operator g to be basic summation.
Adam optimization and random missingness are applied as in the previous experiments. We trained all models for 1K iterations. During active learning, 10 samples were drawn to estimate the expectation in Equation (7.2.9); losses (RMSE) of the target variable are also estimated using 10 samples.
Zero imputation with inference net. In ZI, the natural parameter λ (e.g., the Gaussian parameters in a variational autoencoder) is approximated using the following neural network:

f(x) := ∑_{l=1}^{L} w_l^(1) σ(w_l^(0) x^T),

where L is the number of hidden units and x is the input image, with xi the value of the i-th pixel. To deal with partially observed data x = xo ∪ xu, ZI simply sets all xu to zero, and uses the full inference model f(x) to perform approximate inference.
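In code, the ZI trick amounts to a single masking operation before an ordinary encoder (illustrative sketch; encoder_full is a placeholder for the fully connected inference net f):

def zi_encode(x, mask, encoder_full):
    # Zero imputation: fill unobserved entries with 0, then run the full encoder.
    # Note that an observed value of exactly 0 is indistinguishable from a
    # missing one, which is the source of the bias discussed in this appendix.
    return encoder_full(x * mask)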
PN parameterization. In the PN approach, the code is computed as c(xO) = g(h(s1), h(s2), ..., h(s|O|)), where si = [xi, ei], ei is the I-dimensional embedding/ID/location vector of the i-th pixel, g(·) is a symmetric operation such as max-pooling or summation, and h(·) is a nonlinear feature mapping from R^{I+1} to R^K (we will always refer to h as the feature map). In the current version of the Partial VAE implementation, where a Gaussian approximation is used, we set K = 2H, with H the dimension of the latent variables. We set g to be element-wise summation, i.e., a mapping from R^{K×|O|} to R^K defined by:

g(h(s1), h(s2), ..., h(s|O|)) := ∑_{k=1}^{I} θ_k σ(∑_{i∈O} h_k(si)),

where h_k(·) is the k-th output feature of h(·). The above PN parameterization is also permutation invariant; setting L = I, θ_l = w_l^(1), (w_l^(0))_i = (e_i)_l, the resulting PN model is equivalent to the ZI neural network.
Generalizing ZI from the PN perspective. In the ZI approach, the missing values are replaced with zeros. However, this ad-hoc approach does not distinguish missing values from actually observed zero values. In practice, being able to distinguish between these two is crucial for improving uncertainty estimation during partial inference. On the other hand, we have found that the PN-based Partial VAE experiences difficulties in training. To alleviate both issues, we propose a generalization of the ZI approach that follows the PN perspective. One of the advantages of PN is that it sets the feature maps of the unobserved variables to zero, instead of the related weights. As discussed before, these two approaches are equivalent only if the factors are linear. More generally, we can parameterize the PN by:

h^(1)(si) := ei ∗ xi,
h^(2)(h_i^(1)) := NN1(h_i^(1)),
g(h(s1), h(s2), ..., h(s|O|)) := NN2(σ(∑_{i∈O} h^(2)(h_i^(1)))),

where NN1 is a mapping from R^I to R^K defined by a neural network, and NN2 is a mapping from R^K to R^{2H} defined by another neural network.
• Using the reversed information reward E_{xi∼p(xi|xo)}[DKL(p(xφ|xo) || p(xφ|xo, xi))] and then applying the ELBO (KL divergence) ⇒ this does not make sense mathematically, since it results in an upper-bound approximation of the (reversed) information objective, which is the wrong direction.
• Ranganath's bound [268] for estimating the entropy ⇒ gives an upper bound of the objective, again the wrong direction.
• All of the above methods also require samples from the latent space (and therefore a second level of approximation).
Note that, based on MacKay's relationship between entropy and KL-divergence reduction, we have

E_{xi∼p(xi|xo)} DKL(p(z|xi, xo) || p(z|xo)) = H(p(z|xo)) − E_{xi∼p(xi|xo)} H(p(z|xi, xo)).

Similarly, we have

E_{xi∼p(xi|xo)} E_{xφ∼p(xφ|xi,xo)} DKL(p(z|xφ, xi, xo) || p(z|xφ, xo))
= E_{xφ∼p(xφ|xo)} E_{xi∼p(xi|xφ,xo)} DKL(p(z|xφ, xi, xo) || p(z|xφ, xo))
= E_{xφ∼p(xφ|xo)} [E_{xi∼p(xi|xφ,xo)} H(p(z|xφ, xi, xo)) − H(p(z|xφ, xo))]
= E_{xi∼p(xi|xo)} E_{xφ∼p(xφ|xi,xo)} H(p(z|xφ, xi, xo)) − E_{xφ∼p(xφ|xo)} H(p(z|xφ, xo)),

where MacKay's result is applied to E_{xi∼p(xi|xφ,xo)} DKL(p(z|xφ, xi, xo) || p(z|xφ, xo)).
Putting everything together, we have
Figure 7.14 Information reward estimated during the first 4 active variable selection steps on a randomly chosen Boston Housing test data point. Model: PNP; strategy: EDDI. Each row contains two plots regarding the same time step. Bar plots on the left show the information reward estimate of each variable on the y-axis; all unobserved variables start with green bars, and turn purple once selected by the algorithm. Right: violin plot of the posterior density estimates of the remaining unobserved variables.
Figure 7.15 Information reward estimated during the first 4 active variable selection steps on a randomly chosen Boston Housing test data point. Model: PNP; strategy: single ordering. Each row contains two plots regarding the same time step. Bar plots on the left show the information reward estimate of each variable on the y-axis; all unobserved variables start with green bars, and turn purple once selected by the algorithm. Right: violin plot of the posterior density estimates of the remaining unobserved variables.
Chapter 8
Identifiable Generative Models Under Missing Not at Random Data
So far, our discussion has been based on the assumption that the missing data follow a MAR (missing at random) mechanism. However, real-world datasets often have missing values associated with complex generative processes, where the cause of the missingness may not be fully observed. This is known as missing not at random (MNAR) data. Although many methods in the literature have considered the MNAR scenario, their models' identifiability under MNAR is generally not guaranteed; that is, the model parameters cannot be uniquely determined even with an infinite number of data samples. Therefore, as discussed in Chapter 6 Section 6.3, the lack of model identifiability may introduce additional biases in missing data imputation. This issue is especially overlooked by many modern deep generative models. In this chapter, we fill this gap by systematically analyzing the identifiability of generative models under MNAR. Furthermore, we propose a practical deep generative model which provides identifiability guarantees under mild assumptions, for a wide range of MNAR mechanisms.
8.1 Introduction
Missing data is an obstacle in many data analysis problems, and may seriously compromise the performance of machine learning models, as well as downstream tasks based on these models. Being able to successfully recover/impute missing data in an unbiased way is key to understanding the structure of real-world data. This requires us to identify the
Figure 8.1 Exemplar missing data situations. (a): MCAR; (b): MAR; (c)-(i): MNAR.
• Based on our analysis, we propose a practical model based on VAEs (Section 8.4), named GINA (deep generative imputation model for missing not at random). This enables us to apply flexible deep generative models in a principled way, even in the presence of MNAR data.
8.2 Background
Next, we introduce the necessary notation for missing data, and set up a concrete problem setting.
Basic Notation In this chapter, we will use a notation system that is slightly different from Chapter 7, which allows us to denote different quantities more precisely. Similar to the notation introduced by [128, 279], let X be the complete set of variables in the system of interest; we call these the observable variables. Let I = {1, ..., D} be the index set of all observable variables, i.e., X = {X_i | i ∈ I}. Let X_O denote the set of actually observed variables, where O ⊂ I is an index set such that X_O ⊂ X; we call O the observable pattern. Similarly, X_U denotes the set of missing/unobserved variables, so that $X = X_O \cup X_U$. Additionally, we use R to denote the missing mask indicator variable, such that R_i = 1 indicates that X_i is observed, and R_i = 0 indicates otherwise. We call a probability distribution p(X) on X the reference distribution, that is, the distribution we would have observed if no missing mechanism were present; and we call the conditional distribution p(R|X) the missing mechanism, which decides the probability of each X_i being missing. Then, we can define the marginal distribution of the partially observed variables, which is given by $\log p(X_O, R) = \log \int p(X_O, X_U, R)\, dX_U$. Finally, we will use lowercase vectors to denote realized values of the corresponding random variables. For example, $(x_O, r) \sim p(X_O, R)$ is a realization/sample of X_O and R, and the dimensionality of $x_O$ may vary between realizations.
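To make the notation concrete, the following is a minimal sketch (in Python/NumPy; the variable names are illustrative and not part of the chapter's formal setup) of how a partially observed realization $(x_O, r)$ can be represented:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 5                                   # number of observable variables, |I|
x = rng.normal(size=D)                  # a complete sample from the reference distribution p(X)
r = rng.random(D) < 0.7                 # missing mask: r[i] = 1 means X_i is observed

observable_pattern = np.flatnonzero(r)  # the index set O
x_observed = x[observable_pattern]      # the realized x_O (its length varies with r)

print(observable_pattern, x_observed)
```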
Problem setting Suppose that we have a ground truth data generating process, denoted by $p_D(X_O, R)$, from which we can obtain (partially observed) samples $(x_O, r) \sim p_D(X_O, R)$. We also have a model to be optimized, denoted by $p_{\theta,\psi}(X_O, X_U, R)$, where θ is the parameter of the reference distribution $p_\theta(X)$, and ψ the parameter of the missing mechanism $p_\psi(R|X)$. Our goal can then be described as follows:
• To establish the identifiability of the model $p_{\theta,\psi}(X_O, R)$. That is, we wish to uniquely and correctly identify $\hat{\theta}$ such that $p_{\hat{\theta}}(X) = p_D(X)$, given an infinite amount of partially observed data samples from the ground truth, $(x_O, r) \sim p_D(X_O, R)$.
• Then, given the identified parameter, we will be able to perform missing data imputation using $p_{\hat{\theta}}(X_U|X_O)$. If our parameter estimate is unbiased, then our imputation is also unbiased, that is, $p_{\hat{\theta}}(X_U|X_O) = p_D(X_U|X_O)$ for all possible configurations of $X_O$.
When the missing data mechanism is ignorable (e.g., under MCAR or MAR), maximum likelihood (ML) learning of the reference distribution can be performed on the observed data alone, since

$$\arg\max_\theta\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_\theta(X_O = x_O) = \arg\max_\theta\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_\theta(X_O = x_O, R = r),$$

where $\log p(X_O) = \log \int p(X_O, X_U)\, dX_U$. In practice, ML learning on $X_O$ can be done by the EM algorithm [55, 190]. However, when missing data is MNAR, the above argument does not hold, and the missing data mechanism cannot be ignored during learning. Consider the representative graphical model example in Figure 8.1 (d), which has appeared in many contexts of machine learning. In this graphical model, X is the cause of R, and the connections between X and R are fully connected, i.e., each single node in R is caused by the entire set X. All nodes in R are conditionally independent of each other given X.
Clearly, this is an example of a data generating process with an MNAR mechanism. In this case, Rubin proposed to jointly optimize both the reference distribution $p_\theta(X)$ and the missing data mechanism $p_\psi(R|X)$ by maximizing

$$\arg\max_{\theta,\psi}\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_{\theta,\psi}(X_O = x_O, R = r). \qquad (8.2.1)$$

This factorization is referred to as selection modeling [128, 190]. There are multiple challenges if we want to use Eq. 8.2.1 to obtain a practical model that provides unbiased imputation. First, we need the model assumptions to be consistent with the real-world data generating process, $p_D(X_O, R)$; given the wide range of possible MNAR scenarios, designing a sufficiently general model is a challenge. Second, the model needs to be identifiable, so that the underlying process can be learned, which in turn enables unbiased imputation.
Deep latent variable models such as VAEs define the reference distribution via

$$\log p_\theta(X) = \log \int_Z p_\theta(X|Z)\, p(Z)\, dZ, \qquad (8.2.2)$$

where Z is a latent variable with prior p(Z), and $p_\theta(X|Z)$ is given by $p_\theta(X|Z) = \mathcal{N}(f_\theta(Z), \sigma)$, with $f_\theta(\cdot)$ being a neural network parameterized by θ. Generally, VAEs do not have identifiability guarantees w.r.t. θ [146]. Nevertheless, inspired by the identifiability of nonlinear ICA, [146] shows that the identifiability of VAEs can be established up to a permutation equivalence under mild assumptions, if the unconditional prior p(Z) of the VAE is replaced by the following conditionally factorial exponential family prior,

$$p_{T,\zeta}(Z|U) \propto \prod_{i=1}^{M} Q(Z_i) \exp\Big[\sum_{j=1}^{K} T_{i,j}(Z_i)\,\zeta_{i,j}(U)\Big], \qquad (8.2.3)$$

where U is some additional observation (called the auxiliary variable), $Q(Z_i)$ is some base measure, $T_i(Z_i) = (T_{i,1}, ..., T_{i,K})$ are the sufficient statistics, and $\zeta_i(U) = (\zeta_{i,1}, ..., \zeta_{i,K})$ the corresponding natural parameters. Then, the new VAE model given by

$$\log p_\theta(X|U) = \log \int_Z p_\theta(X|Z)\, p_{T,\zeta}(Z|U)\, dZ \qquad (8.2.4)$$

is identifiable (Theorems 1 and 2 of [146], see Appendix 8.G). We call the model (8.2.4) the identifiable VAE. Unfortunately, these identifiability results for VAEs only hold when all variables are fully observed; thus, they cannot be immediately applied to address the challenges of dealing with MNAR data stated in Section 8.2.2. Next, we will analyze the identifiability of generative models under general MNAR settings (Section 8.3), and propose a practical method that can be used under MNAR (Section 8.4).
In other words, the “correct” model parameter θ∗ can be identified via maximum likelihood learning (under complete data), and the ML solution is unbiased.
8.3 Establishing model identifiability under MNAR
We consider latent variable models of the form

$$p_{\theta,\psi}(X, R) = \int_Z p(Z) \prod_{d \in I} p_{\theta_d}(X_d|Z)\; p_\psi(R|X, Z)\, dZ. \qquad (8.3.2)$$

Here, $(\theta, \psi) \in \Omega$ are learnable parameters that belong to some parameter space $\Omega = \Omega_\theta \times \Omega_\psi$. Each $\theta_d$ parameterizes the conditional distribution $p_{\theta_d}(X_d|Z)$ that connects $X_d$ and Z. Assume that the ground truth parameter of $p_D$ belongs to the model parameter space, $(\theta^*, \psi^*) \in \Omega$.
Given such a model, our goal is to correctly identify the ground truth parameter setting given partially observed samples from $p_D(X_O, X_U, R)$. That is, let

$$(\hat{\theta}, \hat{\psi}) = \arg\max_{(\theta,\psi)\in\Omega}\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_{\theta,\psi}(X_O = x_O, R = r);$$

we would like to achieve $\hat{\theta} = \theta^*$. In order to achieve this, we must make additional assumptions.
Assumption A2. Subset identifiability: There exists a partition¹ of I, denoted by $\mathcal{A}_I = \{C_s\}_{1\le s\le S}$, such that for all $C_s \in \mathcal{A}_I$, $p_\theta(X_{C_s})$ is identifiable on the subset of parameters $\{\theta_d \mid d \in C_s\}$.
This assumption formalizes the idea of divide and conquer: we partition the whole index set into several smaller subsets $\{C_s\}_{1\le s\le S}$, and each marginal reference distribution $p_\theta(X_{C_s})$ is only responsible for the identifiability of a subset of parameters.
Assumption A3. There exists a collection of observable patterns, denoted by $\bar{\mathcal{A}}_I := \{C'_l\}_{1\le l\le L}$, such that: 1) $\bar{\mathcal{A}}_I$ is a cover¹ of I; 2) $p_D(X, R_{C'_l} = 1, R_{I\setminus C'_l})$ (hence any of its marginal distributions) is positive for all X and $1 \le l \le L$; and 3) for every index $c \in C'_l$, there exists $C_s \in \mathcal{A}_I$ defined in A2 such that $c \in C_s \subset C'_l$.
This assumption concerns the strict positivity of the ground truth data generating process, $p_D(X_O, X_U, R)$. Instead of assuming that complete case data are available as in [233], here we only assume that we have some observations with $p_D(X, R_O = 1, R_U = 0) > 0$ for $O \in \bar{\mathcal{A}}_I$, on which $p_\theta(X_O)$ is identifiable.
To summarize: A1 ensures that our model has the same graphical representation/parametric form as the ground truth; A2 ensures that the marginal $p_\theta(X_O) = \int p_\theta(X_O, X_U)\, dX_U$ is identifiable at least for a collection of observable patterns that forms a partition of I; and A3 ensures that $p_D(X_O, X_U, R)$ is positive for certain important patterns (i.e., those on which $p_\theta(X_O)$ is identifiable). Given these assumptions, we have the following proposition (see Appendix 8.C for the proof):
Proposition 8.1 (Sufficient conditions for identifiability under MNAR). Let $p_{\theta,\psi}(X_O, X_U, R)$ be a model on the observable variables X and missing pattern R, and let $p_D(X_O, X_U, R)$ be the ground truth distribution. Assume that they satisfy Data setting D1 and Assumptions A1, A2, and A3. Let

$$\theta_{ML} = \arg\max_{(\theta,\psi)\in\Omega}\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_{\theta,\psi}(X_O = x_O, R = r)$$

be the set of ML solutions of Equation 8.2.1. Then, we have $\theta_{ML} = \{\theta^*\} \times \theta_\psi$. That is, the ground truth model parameter $\theta^*$ can be uniquely identified via (partial) maximum likelihood learning.

¹It can be an arbitrary partition in the set-theoretic sense.
Missing value imputation as inference Given a model $p_\theta(X_O, X_U)$, the missing data imputation problem can then be formulated as the Bayesian inference problem $p_\theta(X_U|X_O) \propto p_\theta(X_U, X_O)$. If the assumptions of Proposition 8.1 are satisfied, we can correctly identify the ground truth reference model parameter, $\theta^*$. Therefore, imputed values sampled from the posterior $p_{\theta^*}(X_U|X_O)$ will be unbiased, and can be used for downstream decision making tasks.
Remark: Note that Proposition 8.1 can be extended to the case where model identifiability is defined via equivalence classes [146, 321]. See Appendix 8.F for details.
Assume that the inverse mapping $\Phi^{-1}$ exists. Then, trivially, if $p_{\theta,\psi}(X_O, R)$ is identifiable with respect to θ and ψ, the induced model $\tilde{p}_{\tau,\gamma}(X_O, R)$ should also be identifiable with respect to τ and γ:
Proposition 8.2. Let $\Omega \subset \mathbb{R}^I$ be the parameter domain of the model $p_{\theta,\psi}(X_O, X_U, R)$. Assume that the mapping $\Phi : (\theta, \psi) \in \Omega \subset \mathbb{R}^I \mapsto (\tau, \gamma) \in \Xi \subset \mathbb{R}^J$ is one-to-one on Ω (equivalently, the inverse mapping $\Phi^{-1} : \Xi \mapsto \Omega$ is injective, and Ω is its image set). Consider the induced distribution with parameter space Ξ, defined as $\tilde{p}_{\tau,\gamma}(X_O, R) := p_{\Phi^{-1}(\tau,\gamma)}(X_O, R)$. Then, $\tilde{p}$ is identifiable w.r.t. $(\tau, \gamma)$ if $p_{\theta,\psi}(X_O, R)$ is identifiable w.r.t. θ and ψ.
Proposition 8.2 shows that if two distributions $p_{\theta,\psi}(X_O, R)$ and $\tilde{p}_{\tau,\gamma}(X_O, R)$ are related by a mapping Φ with nice properties, then identifiability translates between them. This already covers many scenarios of data-model mismatch. For example, consider the case where the ground truth data generating process satisfies the following assumption:
Data setting D2. Suppose the ground truth $p_D(X_O, X_U, R)$ satisfies: all X are generated by shared latent confounders Z (as in D1), and R cannot be the cause of any other variables, as in [233, 342]. Typical examples are given by any of the cases in Figure 8.1 (excluding (j), where $R_1$ is the cause of $R_2$). Furthermore, the ground truth data generating process has the parametric form $p_D(X_O, X_U, R) = \tilde{p}_{\tau^*,\gamma^*}(X_O, X_U, R)$, where $\Xi = \Xi_\tau \times \Xi_\gamma$ denotes its parameter space.
Then, for such a ground truth data generating process, we can show that there always exists a model of the form of Equation 8.3.2 whose relationship to the ground truth is described by some mapping Φ:
Lemma 8.1. Suppose the ground truth data generating process $\tilde{p}_{\tau^*,\gamma^*}(X_O, X_U, R)$ satisfies setting D2. Then, there exists a model $p_{\theta,\psi}(X_O, X_U, R)$ such that: 1) $p_{\theta,\psi}(X_O, X_U, R)$ can be written in the form of Equation 8.3.2 (i.e., Assumption A1); and 2) there exists a mapping Φ as described in Proposition 8.2, such that $\tilde{p}_{\tau,\gamma}(X_O, R) = p_{\Phi^{-1}(\tau,\gamma)}(X_O, R)$ for all $(\tau, \gamma) \in \Xi$.
Proposition 8.3 (Sufficient conditions for identifiability under MNAR and data-model mismatch). Let $p_{\theta,\psi}(X_O, X_U, R)$ be a model on the observable variables X and missing pattern R, and let $p_D(X_O, X_U, R)$ be the ground truth distribution. Assume that they satisfy Data setting D2 and Assumptions A2, A3, and A4. Let

$$\theta_{ML} = \arg\max_{(\theta,\psi)\in\Omega}\; \mathbb{E}_{(x_O,r)\sim p_D(X,R)} \log p_{\theta,\psi}(X_O = x_O, R = r)$$

be the set of ML solutions of Equation 8.2.1. Then, we have $\theta_{ML} = \{\Phi_\theta^{-1}(\tau^*)\} \times \theta_\psi$. Namely, the ground truth model parameter $\tau^*$ of $p_D$ can be uniquely identified (as $\Phi_\theta(\theta^*)$) via ML learning.
Remark: practical implications. Proposition 8.3 allows us to handle cases where the parameterizations of the ground truth data generating process and the model distribution are related through a set of mappings, $\{\Phi_O\}$. In general, the graphical structure of $p_D(X_O, X_U, R)$ can be any of the cases in Figure 8.1 excluding (j). In all those cases, we are still able to use a model of the form of Equation 8.3.2 (Fig. 8.1 (h)) to perform ML learning, provided that our model is flexible enough (Assumption A4). This greatly improves the applicability of our identifiability results, and allows us to build a practical algorithm based on Equation 8.3.2 that handles many practical MNAR cases.
8.4 GINA: A Practical Imputation Algorithm for MNAR
In Appendix 8.G, we show that GINA fulfills the required assumptions of Propositions 8.1 and 8.3. Thus, we can use GINA to identify the ground truth data generating process, and perform missing value imputation under MNAR.
Learning and imputation In practice, the joint likelihood in Equation 8.2.1 is intractable. Similar to the approach proposed in [128], we introduce a variational inference network, $q_\lambda(Z|X_O)$, which enables us to derive an importance weighted lower bound on $\log p_{\theta,\psi}(X_O, R)$:

$$\log p_{\theta,\psi}(X_O, R) \ge \mathcal{L}_K(\theta, \psi, \lambda; X_O, R) := \mathbb{E}_{z_1,...,z_K \sim q_\lambda(Z|X_O),\; x_U^k \sim p_\theta(X_U|z_k)}\, \log \frac{1}{K}\sum_k w_k,$$

where $w_k = p_\theta(X_O|z_k)\, p_\psi(R|X_O, x_U^k)\, p(z_k)\,/\,q_\lambda(z_k|X_O)$ are the importance weights. Given the learned parameters $\theta^*, \psi^*, \lambda^*$, we can impute missing data by solving the approximate inference problem

$$p_\theta(X_U|X_O) = \int_Z p_\theta(X_U|Z)\, p_\theta(Z|X_O)\, dZ \approx \int_Z p_\theta(X_U|Z)\, q_\lambda(Z|X_O)\, dZ.$$
One issue that is often ignored by many MNAR methods is model identifiability. Identifiability under MNAR has been discussed for certain cases [220, 219, 221, 352, 332, 321]. For example, [352] proposed the instrumental variable approach to help identification under MNAR. [219] investigated the identifiability of normal and normal mixture models, and showed that identifiability of parametric models is highly non-trivial under MNAR. [220] studied conditions for nonparametric identification using the shadow variable technique. Despite the resemblance of shadow variables to the auxiliary variables in our approach, [219, 220] mainly consider the supervised learning (multivariate regression) scenario. [233, 232, 311] also discussed a similar topic based on a graphical and causal approach in a non-parametric setting. Although the notion of recoverability has been extensively discussed, these methods do not directly lead to practical imputation algorithms in a scalable setting. By contrast, our work takes a different approach: we handle MNAR in a parametric setting by addressing learning and inference in latent variable models, and we sidestep the computational burden with the help of recent advances in deep generative models for scalable imputation.
There has been a growing interest in applying deep generative models to missing data imputation. In [200, 197, 241], scalable methods for training VAEs under MAR have been proposed. Similar methods have also been advocated in the context of importance weighted VAEs, multiple imputation [212], and heterogeneous tabular data imputation [241, 201, 199]. Generative adversarial networks (GANs) have also been applied to MCAR data [372, 175]. More recently, deep generative models under MNAR have been studied [128, 82, 87], where different approaches such as selection models [279, 107] and pattern-set mixture models [189] have been combined with partial variational inference for training VAEs. However, without additional assumptions, model identifiability remains unclear in these approaches, and the posterior distribution of missing data conditioned on observed data might be biased.
8.6 Experiments
We study the empirical performance of the proposed algorithm of Section 8.4 on both synthetic data (Section 8.6.1) and two real-world datasets, on music recommendation (Section 8.6.2) and personalized education (Section 8.6.3). The experimental setting details can be found in Appendix 8.B.
This experiment showed the clear advantage of our method under different MNAR situations.
This further indicates that our proposed GINA predicts the unobserved answers with the desired behavior.
8.7 Conclusion
In this chapter, we provided an analysis of identifiability for generative models under MNAR, and studied sufficient conditions for identifiability under different scenarios. We provided sufficient conditions under which the model parameters can be uniquely identified via joint maximum likelihood learning on $X_O$ and R; therefore, the learned model can be used to perform unbiased missing data imputation. We proposed a practical algorithm based on VAEs, which enables us to apply flexible generative models that are able to handle missing data in a principled way. The main limitation of our proposed practical algorithm is the need for auxiliary variables (meta-features), which is inherited from identifiable VAE models [146]; in practice, these may not always be available. For future work, we will investigate how to address this limitation, and how to extend our approach to more complicated scenarios.
As opposed to our model, the latent priors p(Z) for both PVAE and Not-MIWAE are parameterized by a standard normal distribution; hence, no auxiliary variables are used. Also, note that the graphical model of Not-MIWAE is described by Fig. 8.1 (a), and it does not handle scenarios where the ground truth data distribution follows other graphs, such as Fig. 8.1 (g). Finally, the inference model q(Z|X) for the underlying VAEs is set to a diagonal Gaussian distribution whose mean and variance are parameterized by neural nets as in standard VAEs [149] (with missing values replaced by zeros [241, 128, 212]), or by the permutation invariant set function proposed in [200]. See Appendix 8.B for more implementation details for each task.
Network structure and training We use a 5-dimensional latent space with a fully factorized standard normal prior. The decoder $p_\theta(X|Z)$ uses a 5-10-D structure, where D = 3 in our case. For the inference net, we use zero imputation [200] with a 2D-10-10-5 structure, which maps the concatenation of the observed data (with missing entries filled by zeros) and the mask variable R to the distributional parameters of the latent space. For the factorized prior p(Z|V) of the i-VAE component of GINA, we used a linear network with one auxiliary input (set to be the fully observed dimension, $X_1$). The missing model $p_\psi(R|X)$ for GINA and i-NotMIWAE is a single-layer neural network with 10 hidden units. All neural networks use Tanh activations (except for the output layer, where no activation function is used). All baselines use the importance weighted VAE objective with 5 importance samples. The observational noise for continuous variables is fixed to log σ = −2. All methods are trained with the Adam optimizer, with batch size 100 and learning rate 0.001, for 20k epochs.
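For concreteness, here is a minimal sketch of the networks described above (PyTorch); treat it as an illustration of the stated layer sizes rather than the exact experiment code, and note that the encoder and prior output 2Z values since both a mean and a log-variance are required:

```python
import torch.nn as nn

D, H, Z = 3, 10, 5  # data dim, hidden width, latent dim, as described above

# Decoder p_theta(X|Z): 5-10-D, Tanh hidden activation, linear output.
decoder = nn.Sequential(nn.Linear(Z, H), nn.Tanh(), nn.Linear(H, D))

# Inference net (zero imputation): input is [x with zeros at missing entries, mask r],
# i.e., 2D-dimensional; output parameterizes the diagonal Gaussian q(Z|X_O).
encoder = nn.Sequential(
    nn.Linear(2 * D, H), nn.Tanh(),
    nn.Linear(H, H), nn.Tanh(),
    nn.Linear(H, 2 * Z),
)

# Missing model p_psi(R|X): a single hidden layer with 10 units, D Bernoulli logits.
miss_net = nn.Sequential(nn.Linear(D, H), nn.Tanh(), nn.Linear(H, D))

# Conditional prior p(Z|V) of the i-VAE component: linear in the auxiliary input X_1.
prior_net = nn.Linear(1, 2 * Z)
```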
Proof (of Proposition 8.1): We first show, by contradiction, that $p(X_{C'_l}, R)$ is partially identifiable on $\{\theta_d\}_{d\in C'_l}$ for every $C'_l \in \bar{\mathcal{A}}_I$. Suppose instead that there exist parameters $(\theta^1, \psi^1)$ and $(\theta^2, \psi^2)$ and an index $c \in C'_l$ for some l, such that $\theta^1_c \ne \theta^2_c$ and $p_{\theta^1,\psi^1}(X_{C'_l}, R) = p_{\theta^2,\psi^2}(X_{C'_l}, R)$; that is, $p(X_{C'_l}, R)$ is not identifiable on $\{\theta_d\}_{d\in C'_l}$.
According to Assumption A3, there exists $C_s \in \mathcal{A}_I$ such that $c \in C_s \subset C'_l$. Then, consider the marginal
$$p_\theta(X_{C_s}) = \int_{Z,\,R,\,X_{\setminus C_s}} \prod_{d\in C_s} p_{\theta_d}(X_d|Z)\, p_\psi(R|X, Z)\, p(Z)\, dZ = p_{\theta_{C_s}}(X_{C_s}).$$

Since $p_{\theta^1,\psi^1}(X_{C'_l}, R) = p_{\theta^2,\psi^2}(X_{C'_l}, R)$, we have $p_{\theta^1_{C_s}}(X_{C_s}) = p_{\theta^2_{C_s}}(X_{C_s})$ (the joint uniquely determines its marginals). However, this contradicts Assumption A2, which states that $p_{\theta_{C_s}}(X_{C_s})$ is identifiable: this identifiability assumption implies that we should have $p_{\theta^1_{C_s}}(X_{C_s}) \ne p_{\theta^2_{C_s}}(X_{C_s})$. Therefore, by contradiction, $p(X_{C'_l}, R)$ is partially identifiable on $\{\theta_d\}_{d\in C'_l}$ for all $C'_l \in \bar{\mathcal{A}}_I$.
Then, we proceed to prove that the ground truth parameter $\theta^*$ can be uniquely identified via ML learning. Based on Assumption A1, at the optimal ML solution,

$$p_{\theta_{ML},\psi_{ML}}(X_O, R) = p_D(X_O, R)$$

holds for all $(\theta_{ML}, \psi_{ML}) \in \theta_{ML}$ and all $O \subset I$ that satisfy $p(X_O, R) > 0$. Note also that

$$p_{\theta_{ML},\psi_{ML}}(X_O, R) = \int_{Z,\,X_{I\setminus O}} \prod_d p_{\theta_d^{ML}}(X_d|Z)\, p_{\psi_{ML}}(R|X)\, p(Z)\, dZ,$$

which depends on both $\theta_O$ and ψ. Since we have already shown that $p_{\theta,\psi}(X_{C'_l}, R)$ is partially identifiable on $\{\theta_d\}_{d\in C'_l}$ for all $C'_l \in \bar{\mathcal{A}}_I$, according to Assumption A3, at the optimal solution we have that

$$\{\theta_d = \theta^*_d\}_{d\in C'_l}$$

holds for all $C'_l \in \bar{\mathcal{A}}_I$. Since we have assumed in Assumption A3 that $\bigcup_{C'_l \in \bar{\mathcal{A}}_I} C'_l = I$ (i.e., $\bar{\mathcal{A}}_I$ is a cover of I), this guarantees that

$$\theta_d^{ML} = \theta^*_d$$
for all d. In other words, we are able to uniquely identify $\theta^*$ from the observed data; therefore $\theta_{ML} = \{\theta^*\} \times \theta_\psi$.
The proof of Proposition 8.2 follows from a short chain of equalities, in which the key step uses the fact that $\Phi^{-1}$ is injective and that $p_{\theta,\psi}(X_O, R)$ is identifiable with respect to θ and ψ.
Proof (of Lemma 8.1):²

²We mainly consider the case where all variables are continuous. Discrete variables will complicate the discussion, but will not change the conclusion.

Case 1 (connections among X): Suppose the ground truth data generating process $p_D(X, R) = \tilde{p}_{\tau^*,\gamma^*}(X_O, X_U, R)$ is given by Figure 8.1 (i). That is,

$$p_D(X, R) = \tilde{p}_\gamma(R|X) \int_Z \prod_i \tilde{p}_{\tau_i}\big(X_i \mid Z,\, \mathrm{pa}(X_i) \cap X\big)\, p(Z)\, dZ.$$

Without loss of generality, assume that the conditional distributions $\tilde{p}_{\tau_i}(X_i \mid Z, \mathrm{pa}(X_i) \cap X)$ take the reparameterized form

$$\tilde{p}_{\tau_i}\big(X_i \mid Z, \mathrm{pa}(X_i) \cap X\big) = \int_{\varepsilon_i} \delta\big(X_i - f_i^{\tau_i}(\varepsilon_i,\, \mathrm{pa}(X_i) \cap X,\, Z)\big)\, p(\varepsilon_i)\, d\varepsilon_i.$$

Therefore, we have

$$
\begin{aligned}
\tilde{p}_\tau(X) &= \int_Z \prod_i \tilde{p}_{\tau_i}\big(X_i \mid Z, \mathrm{pa}(X_i) \cap X\big)\, p(Z)\, dZ \\
&= \int_Z \prod_{\{i \,\mid\, N(X_i) \cap X \ne \emptyset\}} \Big[ \int_{\varepsilon_i} \delta\big(X_i - f_i^{\tau_i}(\varepsilon_i, \mathrm{pa}(X_i) \cap X, Z)\big)\, p(\varepsilon_i)\, d\varepsilon_i \Big] \prod_{\{j \,\mid\, N(X_j) \cap X = \emptyset\}} p(X_j|Z)\; p(Z)\, dZ \\
&= \int_{Z,\, \{\varepsilon_i \mid N(X_i) \cap X \ne \emptyset\}} \prod_{\{i \,\mid\, N(X_i) \cap X \ne \emptyset\}} \delta\big(X_i - f_i^{\tau_i}(\varepsilon_i, \mathrm{pa}(X_i) \cap X, Z)\big)\, p(\varepsilon_i) \prod_{\{j \,\mid\, N(X_j) \cap X = \emptyset\}} p(X_j|Z)\; p(Z)\, dZ \\
&= \int_{Z,\, \{\varepsilon_i \mid N(X_i) \cap X \ne \emptyset\}} \prod_{\{i \,\mid\, N(X_i) \cap X \ne \emptyset\}} \delta\big(X_i - g_i(\varepsilon_i,\, \mathrm{anc}_\varepsilon(i),\, Z)\big)\, p(\varepsilon_i) \prod_{\{j \,\mid\, N(X_j) \cap X = \emptyset\}} p(X_j|Z)\; p(Z)\, dZ,
\end{aligned}
$$

where $N(X_i)$ denotes the neighbours of $X_i$ within X, the last step recursively substitutes the delta functions into one another so that each $X_i$ is written as a function $g_i$ of noise variables and Z only, and $\mathrm{anc}_\varepsilon(i) := \{\varepsilon_k \mid X_k \in \mathrm{anc}(X_i),\ 1 \le k \le D\}$ collects the noise variables of the ancestors of $X_i$ among X. In other words, the reparameterized model
has a new aggregated latent space, $\{Z, \{\varepsilon_i \mid 1 \le i \le D\}\}$. That is, for each $X_i$ that has a non-empty neighbourhood within X, a new latent (noise) variable is created. With this new latent space, the connections among X are decoupled, and the new graphical structure of p(X, R) corresponds to Figure 8.1 (h).
The mapping Φ that connects $\tilde{p}_\tau(X, R)$ and p(X, R) can now be defined as the identity mapping, since no new parameters are introduced or removed when reparameterizing $\tilde{p}_\tau(X, R)$ into p(X, R). Hence, the two requirements of Lemma 8.1 are fulfilled.
Case 2 (subgraph): Next, consider the case where the ground truth data generating process $p_D(X, R) = \tilde{p}_{\tau^*,\gamma^*}(X_O, X_U, R)$ is given by one of Figure 8.1 (a)-(g); that is, it is a subgraph of Figure 8.1 (h). Without loss of generality, assume that $\tilde{p}_{\gamma_i}(R_i = 1 \mid \mathrm{pa}(R_i)) = \mathrm{logit}^{-1}\big(f_{\gamma_i}(\mathrm{pa}(R_i))\big)$ and $\mathrm{pa}(R_i) \subsetneq \{X, Z\}$; in other words, certain connections from $\{X, Z\}$ to $R_i$ are missing. Consider the model distribution parameterized by $p(R_i = 1 \mid X, Z) = \mathrm{logit}^{-1}\big(f_{\gamma_i}(\mathrm{pa}(R_i)) + g_{\theta_i}(\{X, Z\} \setminus \mathrm{pa}(R_i))\big)$, satisfying $g_{\theta_i = 0}(\cdot) \equiv 0$. The mapping $\Phi^{-1}$ is then given by $\Phi^{-1}(\gamma_i) := (\gamma_i, \theta_i = 0)$. Clearly, $\Phi^{-1}$ is injective, hence satisfying the requirement of Proposition 8.2.
Proof (of Proposition 8.3): First, it is not hard to show that $p_{\theta,\psi}(X_{C'_l}, R)$ is partially identifiable on $\{\theta_d\}_{d\in C'_l}$ for all $C'_l \in \bar{\mathcal{A}}_I$; this was shown in the proof of Proposition 8.1, so we do not repeat it here.
Next, given Data setting D2 and Assumption A4, define $(\theta^*, \psi^*) := \Phi^{-1}(\tau^*, \gamma^*)$. Then, at the optimal ML solution,
$$p_{\theta_{ML},\psi_{ML}}(X_O, R) = p_{\Phi^{-1}(\tau^*,\gamma^*)}(X_O, R)$$

holds for all $(\theta_{ML}, \psi_{ML}) \in \theta_{ML}$ and all $O \subset I$ that satisfy $p(X_O, X_U, R_O = 1, R_U = 0) > 0$.
Since $p_{\theta,\psi}(X_{C'_l}, R)$ is partially identifiable on $\{\theta_d\}_{d\in C'_l}$ for all $C'_l \in \bar{\mathcal{A}}_I$, and, according to Assumption A3, $p_D(X_O, X_U, R_{C'_l} = 1, R_{I\setminus C'_l} = 0) > 0$, we have that

$$\{\theta_d = \Phi_\theta^{-1}(\tau^*, \gamma^*)_d\}_{d\in C'_l}$$

holds for all $C'_l \in \bar{\mathcal{A}}_I$. Since $\bar{\mathcal{A}}_I$ is a cover of I, this guarantees that

$$\theta_d^{ML} = \Phi_\theta^{-1}(\tau^*, \gamma^*)_d$$

for all d. In other words, we are able to uniquely identify $\theta^*$ from the observed data; therefore

$$\theta_{ML} = \{\Phi_\theta^{-1}(\tau^*, \gamma^*)\} \times \theta_\psi.$$
Finally, according to Assumption A4 and the proof of Lemma 8.1, Φ decouples as $(\Phi_\theta(\theta), \Phi_\psi(\psi))$. Therefore, we can write $\theta_{ML} = \{\Phi_\theta^{-1}(\tau^*)\} \times \theta_\psi$. That is, the ground truth model parameter $\tau^*$ of $p_D$ can be uniquely identified (as $\Phi_\theta(\theta^*)$) via ML learning.
Clearly, Definition 8.2.1 is a special case of Definition 8.F.1, where ∼ is given by the equality operator, =. When the discussion is based on identifiability under an equivalence relation, all the arguments of Propositions 8.1, 8.2, and 8.3 still hold, with the statements of the results adjusted accordingly. For example, in Proposition 8.1, instead of “the ground truth model parameter θ∗ can be uniquely identified”, we now have “the ground truth model parameter θ∗ can be uniquely identified up to an equivalence relation, ∼”.
Theorem 8.1. Assume we sample data from the model given by $p(X, Z|V) = p_\varepsilon(X - f(Z))\, p_{T,\zeta}(Z|V)$, where f is a multivariate function $f : \mathbb{R}^H \mapsto \mathbb{R}^D$, and $p_{T,\zeta}(Z|V)$ is parameterized by an exponential family of the form

$$p_{T,\zeta}(Z|V) \propto \prod_{i=1}^{M} Q(Z_i) \exp\Big[\sum_{j=1}^{K} T_{i,j}(Z_i)\,\zeta_{i,j}(V)\Big],$$

where $Q(Z_i)$ is some base measure, M is the dimensionality of the latent variable Z, $T_i(Z_i) = (T_{i,1}, ..., T_{i,K})$ are the sufficient statistics, and $\zeta_i(V) = (\zeta_{i,1}, ..., \zeta_{i,K})$ are the corresponding natural parameters, depending on V. Assume the following hold:
1. The set $\{X \in \mathcal{X} \mid \varphi_\varepsilon(X) = 0\}$ has measure zero, where $\varphi_\varepsilon$ is the characteristic function of $p_\varepsilon$;
2. The mixing function f is injective;
3. The sufficient statistics $T_{i,j}$ are differentiable a.e., and $(T_{i,j})_{1 \le j \le K}$ are linearly independent on any subset of $\mathcal{X}$ of measure greater than zero;
4. There exist MK + 1 distinct points $V^0, ..., V^{MK}$, such that the matrix $L = \big(\zeta(V^1) - \zeta(V^0), ..., \zeta(V^{MK}) - \zeta(V^0)\big)$ of size MK × MK is invertible.
Then, the parameters $(f, T, \zeta)$ are $\sim_A$-identifiable, where $\sim_A$ is the equivalence relation defined in Appendix 8.F.
Note that under additional mild assumptions, the matrix A in the $\sim_A$ equivalence relation can be further reduced to a permutation matrix. That is, the model parameters can be identified such that the latent variables differ only up to a permutation, an ambiguity that is inconsequential in many applications. We refer to [146] for more discussion of permutation equivalence.
So far, Theorem 8.1 has only discussed the identifiability of p(X) on the full set of variables, $X = X_O \cup X_U$. However, in Assumption A2, we need the reference model to be (partially) identifiable on each element $C_s$ of a partition $\mathcal{A}_I$, i.e., on $p_\theta(X_{C_s})$. Naturally, we need additional assumptions on the injective function f, as stated below:
Assumption A5. There exists an integer $D_0$ such that $f_O : \mathbb{R}^H \mapsto \mathbb{R}^{|O|}$ is injective for all O with $|O| \ge D_0$. Here, $f_O$ denotes the entries of the output of f that correspond to the index set O.
Remark. Note that under Assumption A5, Assumption A3 of Section 8.3 becomes more intuitive: it means that in order to uniquely recover the ground truth parameters, our training data must contain training examples with at least $D_0$ observed features. This is different from some previous works (e.g., [233]), where complete case data must be available.
Finally, given these new assumptions, it is easy to show the following:
Corollary 8.1 (Local identifiability). Assume that $p(X, Z|V) = p_\varepsilon(X - f(Z))\, p_{T,\zeta}(Z|V)$ is the model parameterized according to Theorem 8.1, and that the assumptions of Theorem 8.1 hold for p(X|V). Additionally, assume that f satisfies Assumption A5. Then, $p(X_O|V)$ is $\sim_A$-identifiable on $(f_O, T, \zeta)$ for all O satisfying $|O| \ge D_0$.
Proof: It is easy to see that assumptions 1, 3, and 4 of Theorem 8.1 automatically hold for $p(X_O|V)$, and $f_O$ is injective according to Assumption A5. Hence, $p(X_O|V)$ satisfies all the assumptions of Theorem 8.1, and is $\sim_A$-identifiable on $(f_O, T, \zeta)$ for all O that satisfy $|O| \ge D_0$.
Remark. In practice, Assumption A5 is often satisfied; for example, it can be satisfied by an f parameterized as an MLP composite function.
8.I Additional results
Recall that in active variable selection, the next variable to acquire is chosen by maximizing the information reward

$$i^* = \arg\max_{i\in U}\, R(i \mid X_O) := \mathbb{E}_{X_i \sim p(X_i|X_O)}\, D_{KL}\big(p(X_\phi \mid X_i, X_O)\,\|\,p(X_\phi \mid X_O)\big).$$

In the Eedi dataset, as we do not have a specific target variable of interest, the target is defined as $X_\phi = X_U$. In this case, $X_\phi$ can be very high-dimensional, and direct estimation of $D_{KL}\big(p(X_\phi|X_i, X_O)\,\|\,p(X_\phi|X_O)\big)$ can be inefficient. In [200], a fast approximation has been proposed in which all calculations happen in the latent space of the model; this allows us to make use of the learned inference net to efficiently estimate $R(i \mid X_O)$.
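To give a flavour of why working in latent space helps: when the inference net outputs diagonal Gaussians, the KL divergence between two such posteriors has a cheap closed form. Below is a minimal sketch (NumPy) illustrating this closed-form KL only, not the exact estimator of [200]:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) )."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Example: posterior q(z | x_i, x_O) vs. q(z | x_O) from an inference net.
mu_a, var_a = np.array([0.1, -0.3]), np.array([0.5, 0.8])
mu_b, var_b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(kl_diag_gaussians(mu_a, var_a, mu_b, var_b))
```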
Figure 8.4 Visualization of imputed X2 and X3 from the synthetic experiment. Rows (A-C): plots for datasets A, B, and C, respectively. Columns: PVAE imputed samples, Not-MIWAE imputed samples, and GINA imputed samples, respectively. Contour plot: kernel density estimate of the ground truth density of the complete data.
Chapter 9
Conclusion and Future Work
Table 9.1 Overview of the two research themes, Theme A and Theme B.
Recall that in Chapter 1, both themes were motivated from the perspective of Bayesian approaches to uncertainty. Here, we describe another hidden driving force that connects all the chapters in this thesis, from the perspective of (Bayesian) deep learning. Machine learning, and deep learning in particular, has been the driving force behind modern AI research, due to its unmatched flexibility and scalability. Despite this empirical success, it has been argued that deep learning methods often fail to produce reliable uncertainty estimates for their predictions, which might jeopardize their performance in critical real-life decision-making tasks. Following the recent development of (approximate) Bayesian inference techniques, there has been a resurgence of interest in combining Bayesian techniques with deep learning methods. This has resulted in Bayesian deep learning (BDL) tools that can tell their users when the algorithms are “making a random guess”.
Both research themes (Table 9.1) presented in this thesis were largely inspired by work in Bayesian deep learning, especially that of two of my lab’s alumni, Yarin Gal [74] and Yingzhen Li [176]. Both of their theses have been widely recognized as essential to the field of modern Bayesian deep learning, as well as to approximate inference. In some sense, the works presented in this thesis can also be viewed as the “reverse” of the typical Bayesian deep learning paradigm. In Bayesian deep learning, scalable approximate inference algorithms are usually developed and applied for specific deep learning models, such as deep neural networks for regression and classification. By contrast, we utilized ideas from (Bayesian) deep learning to propose new directions and develop new approaches for Bayesian approximate inference. Examples of this paradigm can be found throughout the thesis, for instance:
many places of the thesis, for instance:
• In Part A, the research question of performing inference in the function space is largely
motivated by analyzing the pathologies of model non-identifiability in neural networks.
The fact that the deep learning literature cares more about prediction functions than
the specific neural network weights motivates us to perform inference in the space of
minimal sufficient parameters, i.e., the function space.
• In the work on variational implicit processes (Chapter 4), the concept of implicit distributions developed in deep learning (GANs) was applied to create new stochastic process priors. The flexibility of neural networks helps us extend existing Bayesian non-parametric priors (GPs) to a class of more general and flexible priors, namely implicit processes. Furthermore, we developed a wake-sleep approximate inference procedure for implicit processes, a method generalized from Helmholtz machines in the deep unsupervised learning literature.
• In the work on functional variational inference (Chapter 5), we further generalized the idea proposed in Chapter 4 into a more general method that performs non-Gaussian approximate inference directly in function space.
• Part B is solely based on the idea of using deep generative models to quantify missing data uncertainty, perform multiple imputation, and acquire new information. The empirical success of EDDI (Chapter 7) also benefits greatly from many innovations in the deep learning literature. First, the neural network decoder used by the Partial VAE allows us to represent expressive distributions for accurate density estimation. Second, the proposed partial amortization method is based on the point-net structures used in point-cloud modeling, which enables us to handle inference queries for all $2^D$ possible combinations of missing patterns. Last but not least, the efficient information reward estimation method is only possible thanks to the latent representations provided by the encoder-decoder structure of VAEs.
• Finally, the introduction of deep neural networks to missing data imputation does introduce certain pathologies due to model non-identifiability. Therefore, in Chapter 8, this problem is further studied by analyzing sufficient conditions for the identifiability of generative models under the MNAR assumption.
In this thesis, we have proposed a number of new directions and new approaches in Bayesian inference, by introducing ideas from the deep learning literature into approximate inference and Bayesian machine learning. All these new advances would not have been possible without the developments of deep learning methods over the past decades. The new directions discussed in this thesis also open up leads for future work, as detailed in the next sections.
The function space KL divergence proposed in Chapter 5 is constructive: we take the KL divergence on finite measure points, and marginalize out the measure points w.r.t. n and $X_n$. This gives us a valid divergence in function space. The same approach can be used to define function space variants of other divergences, such as α-divergences [113], f-divergences [351], χ-divergences [60], etc. It would be interesting to see how different divergence measures impact the behavior of the resulting inference algorithms.
The stochastic estimators used in our function space inference methods involve several sources of randomness: sampling of the measure points, MC sampling from the posterior process, as well as debiasing techniques based on Russian roulette estimators. It is therefore important to carry out further theoretical analysis of the variance and asymptotic convergence properties of such estimators, as well as of related variance reduction techniques for function space inference. This topic is ignored by much of the function space inference literature and is worth further investigation. Finally, it would also be helpful to investigate the posterior consistency and contraction rates of function space inference methods (assuming all approximations are accurate), and compare them with the corresponding contraction rates of weight-space VI [27].
In Part B, the parameters θ of the deep generative models are learned as point estimates, which ignores the epistemic uncertainty over the underlying neural networks. It has been argued [86] that such epistemic uncertainty in generative models is crucial for handling the ice-start problem, where there is little or no training data at the beginning. Therefore, it is desirable to also place a prior on the generative model parameters θ, and perform approximate inference over both the model parameters θ and the latent variables z. Unfortunately, this would add heavy computational burdens to VAE-like amortized inference methods: imagine up to millions of distributions over deep neural net weights, each having complex interactions with the latent variables {z}. When applying amortized inference in this case, the correlations between θ and z can be difficult to model using inference networks. In the preliminary work of [86], these interactions are ignored by applying a mean-field approximation. However, such approximations will be quite limited, especially when applied to more complex models (such as hierarchical generative models). Therefore, one possible future direction is to investigate how to perform learning and inference for such hierarchical Bayesian generative models under missing data. A potentially promising direction is to combine (Hamiltonian) MCMC methods and variational inference for specific generative models, in order to capture the correlations between θ and z.
9.3.3 Deep generative models for partially observed, mixed-type tabular data
In the deep learning literature, deep generative models such as VAEs are typically applied to
standard homogeneous datasets in which each data dimension has a similar type and similar
statistical properties (e.g., consider spatial or temporal correlations found in images and
videos). However, many real-world datasets are tabular datasets, which are heterogeneous
and contain variables with different types. For instance, in healthcare applications, a patient
record may contain demographic information such as nationality (which is of categorical
type), age (which is ordinal), and height (which is continuous). In our work in Part B,
we have demonstrated that deep generative models can also be successfully applied to
partially observed tabular datasets, which opens up new possibilities that broaden the range
of applications where deep generative models can be deployed. However, our treatment
of mixed-type data is quite ad hoc: we simply convert all variables into continuous variables and apply Partial VAEs with Gaussian likelihoods to the processed data. A more
principled way of learning from tabular data is to treat each variable type correctly and apply a suitable likelihood function for each type (e.g., Gaussian likelihoods for real-valued variables and Bernoulli likelihoods for binary variables). In our preliminary work in [201], we argue that naively applying VAEs to such mixed-type heterogeneous
data can lead to unsatisfying results. The reason for this is that the contribution that each
likelihood makes to the training objective can be very different, leading to challenging
optimization problems in which some data dimensions may be poorly-modeled in favor of
others. In the same paper, we proposed a preliminary two-stage solution to this problem,
by first learning a homogeneous representation of each heterogeneous variable, and then learning a Partial VAE over those homogeneous representations. Nevertheless, it remains an open question how to develop extensions of deep generative models that properly handle partially observed, mixed-type tabular datasets.
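To make the type-correct likelihood idea concrete, here is a minimal sketch (PyTorch) of a per-column log-likelihood for one tabular row; the column typing and the decoder outputs are illustrative assumptions, not the method of [201]:

```python
import torch
from torch.distributions import Normal, Bernoulli

def mixed_loglik(x, recon, col_types):
    """Sum per-column log-likelihoods, using a type-appropriate
    likelihood for each column of a tabular data row.

    x:         observed row, shape (D,)
    recon:     decoder outputs, shape (D,) (mean for continuous, logit for binary)
    col_types: list of 'real' or 'binary' per column
    """
    total = torch.tensor(0.0)
    for d, t in enumerate(col_types):
        if t == "real":
            total = total + Normal(recon[d], 1.0).log_prob(x[d])
        elif t == "binary":
            total = total + Bernoulli(logits=recon[d]).log_prob(x[d])
    return total

x = torch.tensor([1.7, 0.0, 1.0])      # e.g., height and two binary indicators
recon = torch.tensor([1.5, -2.0, 3.0])
print(mixed_loglik(x, recon, ["real", "binary", "binary"]))
```

A two-stage approach, as in [201], can be seen as avoiding the imbalance between such per-type likelihood terms by first mapping each variable into a homogeneous space.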
9.4 Interplay between the techniques developed in Theme A and Theme B
Another interesting direction would be to investigate how function space inference methods impact the model identifiability and asymptotic convergence properties of the deep generative models under MNAR from Theme B. Finally, models such as implicit processes, or Bayesian regression models trained by FVI, can also readily be combined with the Partial VAE to perform active feature acquisition. This can be done via the factorization $p(x_\phi, x \setminus x_\phi, z) = p(x \setminus x_\phi, z)\, p(x_\phi \mid z, x \setminus x_\phi)$, where the term $p(x_\phi \mid z, x \setminus x_\phi)$ can be modelled by a Bayesian regression model. By monitoring the uncertainty levels of both z and $x_\phi$, we can also establish an optimal stopping criterion for active information acquisition.
Bibliography
[30] Bonassi, F. V., West, M., et al. (2015). Sequential Monte Carlo with adaptive weights
for approximate Bayesian computation. Bayesian Analysis, 10(1):171–187.
[31] Bornschein, J. and Bengio, Y. (2014). Reweighted wake-sleep. arXiv:1406.2751.
[32] Botev, A., Ritter, H., and Barber, D. (2017). Practical Gauss-Newton optimisation
for deep learning. In International Conference on Machine Learning, pages 557–565.
PMLR.
[33] Bottou, L. et al. (1998). Online learning and stochastic approximations. On-line
learning in neural networks, 17(9):142.
[34] Bradshaw, J., Matthews, A. G. d. G., and Ghahramani, Z. (2017). Adversarial examples,
uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks.
arXiv:1707.02476.
[35] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan,
A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901.
[36] Bui, T., Hernández-Lobato, D., Hernandez-Lobato, J., Li, Y., and Turner, R. (2016a).
Deep Gaussian processes for regression using approximate expectation propagation. In
International Conference on Machine Learning, pages 1472–1481.
[37] Bui, T. D. and Turner, R. E. (2014). Tree-structured Gaussian process approximations.
In Advances in Neural Information Processing Systems, pages 2213–2221.
[38] Bui, T. D., Yan, J., and Turner, R. E. (2016b). A unifying framework for sparse
Gaussian process approximation using power expectation propagation. arXiv:1605.07066.
[39] Burt, D. R., Ober, S. W., Garriga-Alonso, A., and van der Wilk, M. (2020). Under-
standing variational inference in function-space. arXiv preprint arXiv:2011.09421.
[40] Casert, C., Mills, K., Vieijra, T., Ryckebusch, J., and Tamblyn, I. (2020). Optical
lattice experiments at unobserved conditions and scales through generative adversarial
deep learning. arXiv preprint arXiv:2002.07055.
[41] Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. (2019). Residual
flows for invertible generative modeling. In Advances in Neural Information Processing
Systems, pages 9916–9926.
[42] Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Advances in
Neural Information Processing Systems, pages 342–350.
[43] Collobert, R. and Weston, J. (2008). A unified architecture for natural language
processing: Deep neural networks with multitask learning. In Proceedings of the 25th
International Conference on Machine Learning, pages 160–167. ACM.
[44] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley &
Sons.
[71] Foong, A. Y., Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2019). ‘In-between’ uncertainty in Bayesian neural networks. arXiv preprint arXiv:1906.11537.
[72] Fortuin, V., Garriga-Alonso, A., Wenzel, F., Rätsch, G., Turner, R., van der Wilk,
M., and Aitchison, L. (2021). Bayesian neural network priors revisited. arXiv preprint
arXiv:2102.06571.
[73] Fu, M. C. (2006). Gradient estimation. Handbooks in operations research and
management science, 13:575–616.
[74] Gal, Y. (2016). Uncertainty in deep learning. PhD thesis, University of Cambridge.
[75] Gal, Y. and Ghahramani, Z. (2016a). Dropout as a Bayesian approximation: Repre-
senting model uncertainty in deep learning. In International Conference on Machine
Learning, pages 1050–1059.
[76] Gal, Y. and Ghahramani, Z. (2016b). A theoretically grounded application of dropout
in recurrent neural networks. In Advances in Neural Information Processing Systems,
pages 1019–1027.
[77] Gal, Y., Hron, J., and Kendall, A. (2017). Concrete dropout. Advances in neural
information processing systems, 30.
[78] Gal, Y. and Turner, R. (2015). Improving the Gaussian process sparse spectrum ap-
proximation by representing uncertainty in frequency inputs. In International Conference
on Machine Learning, pages 655–664.
[79] Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M.,
Teh, Y. W., Rezende, D., and Eslami, S. A. (2018a). Conditional neural processes. In
International Conference on Machine Learning, pages 1704–1713. PMLR.
[80] Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and
Teh, Y. W. (2018b). Neural processes. arXiv preprint arXiv:1807.01622.
[81] Gershman, S. J., Hoffman, M. D., and Blei, D. M. (2012). Nonparametric variational
inference. In Proceedings of the 29th International Coference on Machine Learning,
pages 235–242.
[82] Ghalebikesabi, S., Cornish, R., Holmes, C., and Kelly, L. (2021). Deep generative miss-
ingness pattern-set mixture models. In International Conference on Artificial Intelligence
and Statistics, pages 3727–3735. PMLR.
[83] Gilboa, I. (2009). Theory of decision under uncertainty, volume 45. Cambridge
university press.
[84] Globerson, A. and Livni, R. (2016). Learning infinite-layer networks: beyond the kernel trick. arXiv preprint arXiv:1606.05316.
[85] Gong, W., Li, Y., and Hernández-Lobato, J. M. (2021a). Sliced kernelized Stein
discrepancy. In International Conference on Learning Representations.
[86] Gong, W., Tschiatschek, S., Nowozin, S., Turner, R. E., Hernández-Lobato, J. M., and
Zhang, C. (2019). Icebreaker: Element-wise efficient information acquisition with a
Bayesian deep latent Gaussian model.
[87] Gong, Y., Hajimirsadeghi, H., He, J., Durand, T., and Mori, G. (2021b). Variational
selective autoencoder: Learning from partially-observed heterogeneous data. In Interna-
tional Conference on Artificial Intelligence and Statistics, pages 2377–2385. PMLR.
[88] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural
Information Processing Systems, pages 2672–2680.
[89] Gopalan, P. K., Charlin, L., and Blei, D. (2014). Content-based recommendations with
poisson factorization. In Advances in Neural Information Processing Systems, pages
3176–3184.
[90] Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner,
R. E. (2020). Convolutional conditional neural processes. In International Conference
on Learning Representations.
[91] Graves, A. (2011). Practical variational inference for neural networks. In Advances in
Neural Information Processing Systems, pages 2348–2356.
[92] Gray, R. M. (2011). Entropy and information theory. Springer Science & Business
Media.
[93] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A
kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
[94] Grünwald, P. and Van Ommen, T. (2017). Inconsistency of Bayesian inference for
misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–
1103.
[95] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern
neural networks. In International Conference on Machine Learning, pages 1321–1330.
PMLR.
[96] Guo, F., Wang, X., Fan, K., Broderick, T., and Dunson, D. B. (2016). Boosting
variational inference. arXiv preprint arXiv:1611.05559.
[97] Guss, W. H. (2016). Deep function machines: Generalized neural networks for
topological layer expression. arXiv preprint arXiv:1612.04799.
[98] Hachmann, J., Olivares-Amaya, R., Jinich, A., Appleton, A. L., Blood-Forsythe, M. A.,
Seress, L. R., Roman-Salgado, C., Trepte, K., Atahan-Evrenk, S., Er, S., et al. (2014).
Lead candidates for high-performance organic photovoltaics from high-throughput quan-
tum chemistry–the harvard clean energy project. Energy & Environmental Science,
7(2):698–704.
[99] Hamesse, C., Ackermann, P., Kjellström, H., and Zhang, C. (2018). Simultaneous
measurement imputation and outcome prediction for achilles tendon rupture rehabilitation.
In ICML/IJCAI Joint Workshop on Artificial Intelligence in Health.
[100] Han, S., Liao, X., Dunson, D., and Carin, L. (2016). Variational Gaussian copula
inference. In Artificial Intelligence and Statistics, pages 829–838. PMLR.
[101] Harsanyi, J. C. (1978). Bayesian decision theory and utilitarian ethics. The American
Economic Review, 68(2):223–228.
[102] Harsanyi, J. C. (1979). Bayesian decision theory, rule utilitarianism, and arrow’s
impossibility theorem. Theory and Decision, 11(3):289–317.
[103] Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G., and Galstyan, A.
(2019). Multitask learning and benchmarking with clinical time series data. Scientific
data, 6(1):1–18.
[104] Hazan, T. and Jaakkola, T. (2015). Steps toward deep kernel methods from infinite
neural networks. arXiv:1508.05133.
[105] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778.
[106] He, W., Mao, X., Ma, C., Huang, Y., Hernàndez-Lobato, J. M., and Chen, T. (2022).
BSODA: A bipartite scalable framework for online disease diagnosis. In Proceedings of
the ACM Web Conference 2022, pages 2511–2521.
[107] Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica:
Journal of the econometric society, pages 153–161.
[108] Heinemann, U., Livni, R., Eban, E., Elidan, G., and Globerson, A. (2016). Improper
deep kernels. In Artificial Intelligence and Statistics, pages 1159–1167.
[109] Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data.
arXiv:1309.6835.
[110] Hernández-Lobato, J. M. (2010). Balancing flexibility and robustness in machine
learning: semi-parametric methods and sparse linear models.
[111] Hernández-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for
scalable learning of Bayesian neural networks. In International Conference on Machine
Learning, pages 1861–1869.
[112] Hernández-Lobato, J. M., Houlsby, N., and Ghahramani, Z. (2014). Probabilistic
matrix factorization with non-random missing data. In International Conference on
Machine Learning, pages 1512–1520. PMLR.
[113] Hernández-Lobato, J. M., Li, Y., Rowland, M., Hernández-Lobato, D., Bui, T., and
Turner, R. E. (2016). Black-box α-divergence minimization.
[114] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158.
[115] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for
deep belief nets. Neural computation, 18(7):1527–1554.
[116] Hinton, G. E. and Salakhutdinov, R. R. (2008). Using deep belief nets to learn
covariance kernels for Gaussian processes. In Advances in Neural Information Processing
Systems, pages 1249–1256.
[117] Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by
minimizing the description length of the weights. In Proceedings of the Sixth Annual
Conference on Computational Learning Theory, pages 5–13. ACM.
[118] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational
inference. The Journal of Machine Learning Research, 14(1):1303–1347.
[119] Horton, N. J. and Lipsitz, S. R. (2001). Multiple imputation in practice: comparison
of software packages for regression models with missing variables. The American
Statistician, 55(3):244–254.
[120] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without
replacement from a finite universe. Journal of the American statistical Association,
47(260):663–685.
[121] Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active
learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
[122] Hron, J., Matthews, A., and Ghahramani, Z. (2018). Variational Bayesian dropout:
pitfalls and fixes. In International Conference on Machine Learning, pages 2019–2028.
PMLR.
[123] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE.
[124] Huang, S.-J., Xu, M., Xie, M.-K., Sugiyama, M., Niu, G., and Chen, S. (2018). Active
feature acquisition with supervised matrix completion. arXiv preprint arXiv:1802.05380.
[125] Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint
arXiv:1702.08235.
[126] Ibrahim, J. G., Lipsitz, S. R., and Chen, M.-H. (1999). Missing covariates in general-
ized linear models when the missing data mechanism is non-ignorable. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 61(1):173–190.
[127] Immer, A., Korzepa, M., and Bauer, M. (2021). Improving predictions of Bayesian
neural nets via local linearization. In International Conference on Artificial Intelligence
and Statistics, pages 703–711. PMLR.
[128] Ipsen, N. B., Mattei, P.-A., and Frellsen, J. (2021). not-MIWAE: Deep generative modelling with missing not at random data. In International Conference on Learning Representations.
[129] Itô, K. (1984). An Introduction to Probability Theory. Cambridge University Press.
[130] Iwata, T. and Ghahramani, Z. (2017). Improving output uncertainty estimation and
generalization in deep learning via neural network Gaussian processes. arXiv:1707.05922.
[131] Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson,
A. G. (2020). Subspace inference for Bayesian deep learning. In Uncertainty in Artificial
Intelligence, pages 1169–1179. PMLR.
[132] Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. G. (2021). What are
Bayesian neural network posteriors really like? In International conference on machine
learning, pages 4629–4640. PMLR.
[133] Jain, P., Meka, R., and Dhillon, I. S. (2010). Guaranteed rank minimization via
singular value projection. In Advances in Neural Information Processing Systems.
[134] Jakobsen, J. C., Gluud, C., Wetterslev, J., and Winkel, P. (2017). When and how
should multiple imputation be used for handling missing data in randomised clinical
trials–a practical guide with flowcharts. BMC medical research methodology, 17(1):1–10.
[135] Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel-
Softmax. In International Conference on Learning Representations.
[136] Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender
systems: an introduction. Cambridge University Press.
[137] Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge university
press.
[138] Jeffreys, H. (1998). The theory of probability. OUP Oxford.
[139] Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody,
B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). Mimic-iii, a freely accessible
critical care database. Scientific Data, 3:160035.
[140] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction
to variational methods for graphical models. Machine learning, 37(2):183–233.
[141] Kahn, H. (1955). Use of different Monte Carlo sampling techniques.
[142] Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models.
In EMNLP, volume 3, page 413.
[143] Keerin, P., Kurutach, W., and Boongoen, T. (2012). Cluster-based KNN missing
value imputation for dna microarray data. In 2012 IEEE International Conference on
Systems, Man, and Cybernetics (SMC), pages 445–450. IEEE.
[144] Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy
entries. Journal of Machine Learning Research.
[145] Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. (2019). Approximate
inference turns deep networks into Gaussian processes. Advances in neural information
processing systems, 32.
[146] Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. (2020). Variational
autoencoders and nonlinear ICA: A unifying framework. In International Conference on
Artificial Intelligence and Statistics, pages 2207–2217. PMLR.
[147] Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals,
O., and Teh, Y. W. (2019). Attentive neural processes. In International Conference on
Learning Representations.
[148] Kingma, D. P. and Ba, J. L. (2015). Adam: a method for stochastic optimization. In
International Conference on Learning Representations, pages 1–13.
[149] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes.
arXiv:1312.6114.
[150] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Interna-
tional Conference on Learning Representation.
[151] Kleijn, B. J. and van der Vaart, A. W. (2012). The Bernstein-von-Mises theorem
under misspecification. Electronic Journal of Statistics, 6:354–381.
[152] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-
dimensional Bayesian statistics. The Annals of Statistics, 34(2).
[153] Knapik, B. (2013). Bayesian Asymptotics: Inverse Problems and Irregular Mod-
els. PhD thesis, Vrije Universiteit Amsterdam. Naam instelling promotie: VU Vrije
Universiteit Naam instelling onderzoek: VU Vrije Universiteit.
[154] Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and
techniques. MIT press.
[155] Kolmogorov, A. (1950). Foundations of the theory of probability. Chelsea Publishing
Company.
[156] Koopmans, T. C. and Reiersol, O. (1950). The identification of structural characteris-
tics. The Annals of Mathematical Statistics, 21(2):165–181.
[157] Körding, K. P. and Wolpert, D. M. (2004). Bayesian integration in sensorimotor
learning. Nature, 427(6971):244–247.
[158] Körding, K. P. and Wolpert, D. M. (2006). Bayesian decision theory in sensorimotor
control. Trends in cognitive sciences, 10(7):319–326.
[159] Krauth, K., Bonilla, E. V., Cutajar, K., and Filippone, M. (2016). AutoGP: Exploring
the capabilities and limitations of Gaussian process models. arXiv:1610.05392.
[160] Kristiadi, A., Hein, M., and Hennig, P. (2020). Being Bayesian, even just a bit, fixes
overconfidence in ReLU networks. In International Conference on Machine Learning,
pages 5436–5446. PMLR.
[161] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from
tiny images.
[162] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems, pages 1097–1105.
[163] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The annals
of mathematical statistics, 22(1):79–86.
[164] Laplace, P. S. (1820). Théorie analytique des probabilités. Courcier.
[165] Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of
high dimensional data. In Advances in Neural Information Processing Systems, pages
329–336.
[166] Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal,
A. R. (2010). Sparse spectrum Gaussian process regression. Journal of Machine Learning
Research, 11(Jun):1865–1881.
[167] Le, Q. V. (2013). Building high-level features using large scale unsupervised learning.
In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing,
pages 8595–8598. IEEE.
[168] Le Cam, L. (2012). Asymptotic methods in statistical decision theory. Springer
Science & Business Media.
[169] Le Roux, N. and Bengio, Y. (2007). Continuous neural networks. In Artificial
Intelligence and Statistics, pages 404–411.
[170] Lean, J., Beer, J., and Bradley, R. (1995). Reconstruction of solar irradiance since
1610: Implications for climate change. Geophysical Research Letters, 22(23):3195–3198.
[171] LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[172] Lee, H. K. (2000). Consistency of posterior distributions for neural networks. Neural
Networks, 13(6):629–642.
[173] Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y.
(2018). Deep neural networks as Gaussian processes. In International Conference on
Learning Representations.
[174] Lewenberg, Y., Bachrach, Y., Paquet, U., and Rosenschein, J. S. (2017). Knowing
what to ask: A Bayesian active learning approach to the surveying problem. In AAAI,
pages 1396–1402.
[175] Li, S. C.-X., Jiang, B., and Marlin, B. (2019). MisGAN: Learning from incomplete
data with generative adversarial networks. In International Conference on Learning
Representations.
[176] Li, Y. (2018). Approximate inference: New visions. PhD thesis, University of
Cambridge.
[177] Li, Y. and Gal, Y. (2017). Dropout inference in Bayesian neural networks with
alpha-divergences. In International conference on machine learning, pages 2052–2061.
PMLR.
[178] Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2015). Stochastic expectation
propagation. In Advances in Neural Information Processing Systems, pages 2323–2331.
[179] Li, Y. and Liu, Q. (2016). Wild variational approximations.
[180] Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. In Advances
in Neural Information Processing Systems, pages 1073–1081.
[181] Li, Y. and Turner, R. E. (2018). Gradient estimators for implicit models. In Interna-
tional Conference on Learning Representations.
[182] Li, Y., Turner, R. E., and Liu, Q. (2017). Approximate inference with amortised
MCMC. arXiv:1702.08343.
[183] Liang, D., Charlin, L., and Blei, D. M. Causal inference for recommendation.
[184] Liang, D., Charlin, L., McInerney, J., and Blei, D. M. (2016). Modeling user exposure
in recommendation. In Proceedings of the 25th international conference on World Wide
Web, pages 951–961.
[185] Lichman, M. et al. (2013). UCI machine learning repository.
[186] Lindley, D. V. (1956). On a measure of the information provided by an experiment.
The Annals of Mathematical Statistics, pages 986–1005.
[187] Ling, G., Yang, H., Lyu, M. R., and King, I. (2012). Response aware model-based
collaborative filtering. In Proceedings of the Twenty-Eighth Conference on Uncertainty
in Artificial Intelligence, pages 501–510.
[188] Little, R. J. and Rubin, D. B. (1987). Statistical analysis with missing data. John
Wiley & Sons.
[189] Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal
of the American Statistical Association, 88(421):125–134.
[190] Little, R. J. and Rubin, D. B. (2019). Statistical analysis with missing data, volume
793. John Wiley & Sons.
[191] Liu, Q. and Feng, Y. (2016). Two methods for wild variational inference.
arXiv:1612.00081.
[192] Liu, Q., Lee, J. D., and Jordan, M. I. (2016). A kernelized Stein discrepancy for
goodness-of-fit tests. In Proceedings of the International Conference on Machine Learn-
ing (ICML).
[193] Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose
Bayesian inference algorithm. In Advances in Neural Information Processing Systems,
pages 2370–2378.
[194] Loève, M. (1977). Probability theory I–II. Springer.
[195] Louizos, C. and Welling, M. (2016). Structured and efficient variational deep learning
with matrix Gaussian posteriors. In International Conference on Machine Learning,
pages 1708–1716. PMLR.
[196] Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational
Bayesian neural networks. In International Conference on Machine Learning, pages
2218–2227. PMLR.
[197] Ma, C., Gong, W., Hernández-Lobato, J. M., Koenigstein, N., Nowozin, S., and
Zhang, C. (2018). Partial VAE for hybrid recommender system.
[198] Ma, C., Li, Y., and Hernández-Lobato, J. M. (2019a). Variational implicit processes.
In International Conference on Machine Learning, pages 4222–4233. PMLR.
[199] Ma, C., Tschiatschek, S., Li, Y., Turner, R., Hernandez-Lobato, J. M., and Zhang,
C. (2020a). HM-VAEs: a deep generative model for real-valued data with heterogeneous
marginals. In Symposium on Advances in Approximate Bayesian Inference, pages 1–8.
PMLR.
[200] Ma, C., Tschiatschek, S., Palla, K., Hernandez-Lobato, J. M., Nowozin, S., and Zhang,
C. (2019b). EDDI: Efficient dynamic discovery of high-value information with partial
VAE. In International Conference on Machine Learning, pages 4234–4243. PMLR.
[201] Ma, C., Tschiatschek, S., Turner, R., Hernández-Lobato, J. M., and Zhang, C. (2020b).
VAEM: a deep generative model for heterogeneous mixed type data. Advances in Neural
Information Processing Systems, 33:11237–11247.
[202] Ma, W. and Chen, G. H. (2019). Missing not at random in matrix completion:
The effectiveness of estimating missingness probabilities under a low nuclear norm
assumption. Advances in Neural Information Processing Systems, 32.
[203] MacKay, D. J. (1992a). Information-based objective functions for active data selection.
Neural computation, 4(4):590–604.
[204] MacKay, D. J. (1992b). A practical Bayesian framework for backpropagation net-
works. Neural computation, 4(3):448–472.
[205] MacKay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis,
California Institute of Technology.
[206] Maddox, W. J., Benton, G., and Wilson, A. G. (2020). Rethinking parameter counting
in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139.
[207] Marjoram, P., Molitor, J., Plagnol, V., and Tavaré, S. (2003). Markov Chain
Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences,
100(26):15324–15328.
[208] Marlin, B. M. and Zemel, R. S. (2009). Collaborative prediction and ranking with
non-random missing data. In Proceedings of the third ACM conference on Recommender
systems, pages 5–12.
[209] Marsh, H. W. (1998). Pairwise deletion for missing data in structural equation models:
Nonpositive definite matrices, parameter estimates, goodness of fit, and adjusted sample
sizes. Structural Equation Modeling: A Multidisciplinary Journal, 5(1):22–36.
[210] Martens, J. (2020). New insights and perspectives on the natural gradient method.
The Journal of Machine Learning Research, 21(1):5776–5851.
[211] Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-
factored approximate curvature. In International conference on machine learning, pages
2408–2417. PMLR.
[212] Mattei, P.-A. and Frellsen, J. (2019). MIWAE: Deep generative modelling and
imputation of incomplete data sets. In International Conference on Machine Learning,
pages 4413–4423. PMLR.
[213] Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse
variational methods and the Kullback-Leibler divergence between stochastic processes.
Journal of Machine Learning Research, 51:231–239.
[214] Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z.
(2018). Gaussian process behaviour in wide deep neural networks. arXiv:1804.11271.
[215] Matthews, A. G. d. G., Van Der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A.,
León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process
library using TensorFlow. The Journal of Machine Learning Research, 18(1):1299–1304.
[216] McCallum, A. K. and Nigam, K. (1998). Employing EM and pool-based active
learning for text classification. In International Conference on Machine Learning, pages
359–367. Citeseer.
[217] Melville, P., Saar-Tsechansky, M., Provost, F., and Mooney, R. (2004). Active feature-
value acquisition for classifier induction. In International Conference on Data Mining,
pages 483–486. IEEE.
[218] Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes:
Unifying variational autoencoders and generative adversarial networks. In International
Conference on Machine Learning, pages 2391–2400. PMLR.
[219] Miao, W., Ding, P., and Geng, Z. (2016). Identifiability of normal and normal mixture
models with nonignorable missing data. Journal of the American Statistical Association,
111(516):1673–1683.
[220] Miao, W., Liu, L., Tchetgen, E. T., and Geng, Z. (2015). Identification, doubly robust
estimation, and semiparametric efficiency theory of nonignorable missing data with a
shadow variable. arXiv preprint arXiv:1509.02556.
[221] Miao, W. and Tchetgen, E. T. (2018). Identification and inference with nonignorable
missing covariate data. Statistica Sinica, 28(4):2049.
[222] Miller, A. C., Foti, N. J., and Adams, R. P. (2017). Variational boosting: Iteratively
refining posterior approximations. In International Conference on Machine Learning,
pages 2420–2429. PMLR.
[223] Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference.
In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence,
pages 362–369. Morgan Kaufmann Publishers Inc.
[224] Minka, T. P. (2004). Power EP. Technical report.
[225] Minka, T. P. (2005). Divergence measures and message passing. Technical report,
Microsoft Research, Cambridge.
[226] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief
networks. In International Conference on Machine Learning, pages 1791–1799. PMLR.
[227] Mnih, A. and Rezende, D. (2016). Variational inference for Monte Carlo objectives.
In International Conference on Machine Learning, pages 2188–2196. PMLR.
[228] Mnih, A. and Salakhutdinov, R. R. (2007). Probabilistic matrix factorization. Ad-
vances in neural information processing systems, 20:1257–1264.
[229] Moens, V., Ren, H., Maraval, A., Tutunov, R., Wang, J., and Ammar, H. (2021).
Efficient semi-implicit variational inference. arXiv preprint arXiv:2101.06070.
[230] Mohamed, A.-r., Dahl, G. E., and Hinton, G. (2012). Acoustic modeling using
deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing,
20(1):14–22.
[231] Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). Monte Carlo gradient
estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62.
[232] Mohan, K. and Pearl, J. (2014). Graphical models for recovering probabilistic and
causal queries from missing data. Technical report.
[233] Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing
data. Advances in neural information processing systems, 26:1277–1285.
[234] Molchanov, D., Kharitonov, V., Sobolev, A., and Vetrov, D. (2019). Doubly semi-
implicit variational inference. In The 22nd International Conference on Artificial Intelli-
gence and Statistics, pages 2593–2602. PMLR.
[251] Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshmi-
narayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating
predictive uncertainty under dataset shift. Advances in neural information processing
systems, 32.
[252] Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference
with stochastic search. In Proceedings of the 29th International Coference on Machine
Learning, pages 1363–1370.
[253] Papamakarios, G. and Murray, I. (2016). Fast ε-free inference of simulation mod-
els with Bayesian conditional density estimation. In Advances in Neural Information
Processing Systems, pages 1028–1036.
[254] Parmigiani, G. and Inoue, L. (2009). Decision theory: principles and approaches,
volume 812. John Wiley & Sons.
[255] Paulino, C. D. M. and de Bragança Pereira, C. A. (1994). On identifiability of
parametric statistical models. Journal of the Italian Statistical Society, 3(1):125–151.
[256] Pawlowski, J. M. and Urban, J. M. (2020). Reducing autocorrelation times in lattice
simulations with generative adversarial networks. Machine Learning: Science and
Technology, 1(4):045011.
[257] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponen-
tial expressivity in deep neural networks through transient chaos. In Advances in Neural
Information Processing Systems, pages 3360–3368.
[258] Pourzanjani, A. A., Jiang, R. M., and Petzold, L. R. (2017). Improving the identifia-
bility of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep
Learning, volume 4, page 29.
[259] Pyzer-Knapp, E. O., Li, K., and Aspuru-Guzik, A. (2015). Learning from the Harvard
Clean Energy Project: The use of neural networks to accelerate materials discovery.
Advanced Functional Materials, 25(41):6495–6502.
[260] Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). PointNet: Deep learning on point
sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 652–660.
[261] Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse
approximate Gaussian process regression. Journal of Machine Learning Research,
6(Dec):1939–1959.
[262] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning
with deep convolutional generative adversarial networks. arXiv:1511.06434.
[263] Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines.
In Advances in neural information processing systems, pages 1177–1184.
[264] Ramesh, A. and LeCun, Y. (2018). Backpropagation for implicit spectral densities.
arXiv preprint arXiv:1806.00499.
[280] Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespon-
dents in sample surveys. Journal of the American Statistical Association, 72(359):538–
543.
[281] Rubin, D. B. (1988). An overview of multiple imputation. In Proceedings of the
survey research methods section of the American statistical association, pages 79–84.
Citeseer.
[282] Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys, volume 81.
John Wiley & Sons.
[283] Rudin, W. (1987). Real and Complex Analysis, 3rd Ed. McGraw-Hill, Inc., USA.
[284] Rudner, T. G., Chen, Z., and Gal, Y. (2020). Rethinking function-space variational
inference in Bayesian neural networks. In Third Symposium on Advances in Approximate
Bayesian Inference.
[285] Rudner, T. G. J., Chen, Z., Teh, Y. W., and Gal, Y. (2021). Rethinking Function-Space
Variational Inference in Bayesian Neural Networks. In Third Symposium on Advances in
Approximate Bayesian Inference.
[286] Saar-Tsechansky, M., Melville, P., and Provost, F. (2009). Active feature-value
acquisition. Management Science, 55(4):664–684.
[287] Saatçi, Y. (2012). Scalable inference for structured Gaussian process models. PhD
thesis, University of Cambridge.
[288] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS,
volume 1, page 3.
[289] Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization
using Markov Chain Monte Carlo. In International conference on Machine learning,
pages 880–887. ACM.
[290] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X.
(2016). Improved techniques for training GANs. In Advances in Neural Information
Processing Systems, pages 2234–2242.
[291] Salimans, T., Kingma, D., and Welling, M. (2015). Markov Chain Monte Carlo
and variational inference: Bridging the gap. In International Conference on Machine
Learning, pages 1218–1226. PMLR.
[292] Salimbeni, H. and Deisenroth, M. P. (2017). Doubly stochastic variational inference
for deep Gaussian processes. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 4591–4602.
[293] Samo, Y.-L. K. and Roberts, S. (2015). String Gaussian process kernels. arXiv
preprint arXiv:1506.02239.
[294] Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid
belief networks. Journal of artificial intelligence research, 4:61–76.
[311] Shpitser, I., Mohan, K., and Pearl, J. (2015). Missing data as a causal and probabilistic
problem. Technical report.
[312] Shrive, F. M., Stuart, H., Quan, H., and Ghali, W. A. (2006). Dealing with missing
data in a multi-question depression scale: a comparison of imputation methods. BMC
medical research methodology, 6(1):1–10.
[313] Silvestro, D. and Andermann, T. (2020). Prior choice affects ability of Bayesian
neural networks to identify unknowns. arXiv preprint arXiv:2005.04987.
[314] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for
large-scale image recognition. arXiv:1409.1556.
[315] Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-
inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.
[316] Snelson, E., Ghahramani, Z., and Rasmussen, C. E. (2004). Warped Gaussian
processes. In Advances in neural information processing systems, pages 337–344.
[317] Sobolev, A. and Vetrov, D. P. (2019). Importance weighted hierarchical variational
inference. Advances in Neural Information Processing Systems, 32.
[318] Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. (2017). Amortised
MAP inference for image super-resolution. In International Conference on Learning
Representations.
[319] Song, J., Zhao, S., and Ermon, S. (2017). A-NICE-MC: Adversarial training for MCMC.
Advances in Neural Information Processing Systems, 30.
[320] Sportisse, A., Boyer, C., and Josse, J. (2020a). Imputation and low-rank estimation
with missing not at random data. Statistics and Computing, 30(6):1629–1643.
[321] Sportisse, A., Boyer, C., and Josse, J. (2020b). Estimation and imputation in
probabilistic principal component analysis with missing not at random data. Advances in
Neural Information Processing Systems, 33.
[322] Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R.
(2010). Hilbert space embeddings and metrics on probability measures. The Journal of
Machine Learning Research, 11:1517–1561.
[323] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research, 15(1):1929–1958.
[324] Steinwart, I. (2001). On the influence of the kernel on the consistency of support
vector machines. Journal of machine learning research, 2(Nov):67–93.
[325] Stekhoven, D. J. and Bühlmann, P. (2012). MissForest—non-parametric missing value
imputation for mixed-type data. Bioinformatics, 28(1):112–118.
[326] Stern, D., Herbrich, R., and Graepel, T. (2009). Matchbox: Large scale Bayesian
recommendations. In International World Wide Web Conference.
[327] Stinchcombe, M. B. (1999). Neural network approximation of continuous functionals
and continuous functions on compactifications. Neural Networks, 12(3):467–477.
[328] Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density ratio estimation in
machine learning. Cambridge University Press.
[329] Sun, S., Chen, C., and Carin, L. (2017). Learning structured weight uncertainty in
Bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292.
PMLR.
[330] Sun, S., Zhang, G., Shi, J., and Grosse, R. (2018). Functional variational Bayesian
neural networks.
[331] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning
with neural networks. In Advances in Neural Information Processing Systems, pages
3104–3112.
[332] Tang, G., Little, R. J., and Raghunathan, T. E. (2003). Analysis of multivariate
missing data with nonignorable nonresponse. Biometrika, 90(4):747–764.
[333] Thahir, M., Sharma, T., and Ganapathiraju, M. K. (2012). An efficient heuristic
method for active feature acquisition and its application to protein-protein interaction
prediction. In BMC proceedings, volume 6, page S2. BioMed Central.
[334] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
[335] Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian
processes. In International Conference on Artificial Intelligence and Statistics, pages
567–574.
[336] Titsias, M. K. and Ruiz, F. (2019). Unbiased implicit variational inference. In The
22nd International Conference on Artificial Intelligence and Statistics, pages 167–176.
PMLR.
[337] Tobar, F., Bui, T. D., and Turner, R. E. (2015). Learning stationary time series using
Gaussian processes with nonparametric kernels. In Advances in Neural Information
Processing Systems, pages 3501–3509.
[338] Tran, D., Blei, D., and Airoldi, E. M. (2015). Copula variational inference. In
Advances in Neural Information Processing Systems, pages 3564–3572.
[339] Tran, D., Ranganath, R., and Blei, D. (2017a). Hierarchical implicit models and
likelihood-free variational inference. Advances in Neural Information Processing Systems,
30.
[340] Tran, D., Ranganath, R., and Blei, D. M. (2017b). Deep and hierarchical implicit
models. arXiv:1702.08896.
[341] Trippe, B. and Turner, R. (2018). Overpruning in variational Bayesian neural networks.
arXiv preprint arXiv:1801.06230.
[342] Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., and Zhang, K. (2019).
Causal discovery in the presence of missing data. In The 22nd International Conference
on Artificial Intelligence and Statistics, pages 1762–1770. PMLR.
[343] Turner, R. E. and Sahani, M. (2010). Statistical inference for single-and multi-
band probabilistic amplitude demodulation. In Acoustics Speech and Signal Processing
(ICASSP), 2010 IEEE International Conference on, pages 5466–5469. IEEE.
[344] Turner, R. E. and Sahani, M. (2011). Two problems with variational expectation
maximisation for time-series models.
[345] Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university
press.
[346] van der Wilk, M., Rasmussen, C. E., and Hensman, J. (2017). Convolutional Gaussian
processes. In Advances in Neural Information Processing Systems, pages 2845–2854.
[347] Von Neumann, J. and Morgenstern, O. (1944). Theory of games and economic
behavior. Princeton university press.
[348] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families,
and variational inference. Now Publishers Inc.
[349] Wald, A. (1950). Statistical decision functions.
[350] Wang, C. and Blei, D. M. (2011). Collaborative topic modeling for recommending
scientific articles. In International Conference on Knowledge Discovery and Data Mining,
pages 448–456. ACM.
[351] Wang, D., Liu, H., and Liu, Q. (2018a). Variational inference with tail-adaptive
f-divergence. Advances in Neural Information Processing Systems, 31.
[352] Wang, S., Shao, J., and Kim, J. K. (2014). An instrumental variable approach for
identification and estimation with nonignorable nonresponse. Statistica Sinica, pages
1097–1116.
[353] Wang, X., Zhang, R., Sun, Y., and Qi, J. (2019a). Doubly robust joint learning for
recommendation on data missing not at random. In International Conference on Machine
Learning, pages 6638–6647. PMLR.
[354] Wang, Y. and Blei, D. M. (2019). The blessings of multiple causes.
[355] Wang, Y., Liang, D., Charlin, L., and Blei, D. M. (2018b). The deconfounded
recommender: A causal inference approach to recommendation. arXiv preprint
arXiv:1808.06581.
[356] Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernández-Lobato, J. M.,
Turner, R. E., Baraniuk, R. G., Barton, C., Jones, S. P., Woodhead, S., and Zhang, C.
(2020). Diagnostic questions: The NeurIPS 2020 education challenge. arXiv preprint
arXiv:2007.12061.
[357] Wang, Z., Ren, T., Zhu, J., and Zhang, B. (2019b). Function space particle op-
timization for Bayesian neural networks. In International Conference on Learning
Representations.
[358] Wathen, A. J. and Zhu, S. (2015). On spectral distribution of kernel matrices related
to radial basis functions. Numerical Algorithms, 70(4):709–726.
[359] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin
dynamics. In Proceedings of the 28th International Conference on Machine Learning
(ICML-11), pages 681–688.
[360] Wendland, H. (2004). Scattered data approximation, volume 17. Cambridge univer-
sity press.
[361] Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J.,
Salimans, T., Jenatton, R., and Nowozin, S. (2020). How good is the Bayes posterior in
deep neural networks really? In International Conference on Machine Learning, pages
10248–10259. PMLR.
[362] White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained
equations: issues and guidance for practice. Statistics in medicine, 30(4):377–399.
[363] Wiegerinck, D. B. W. (1999). Tractable variational structures for approximating graph-
ical models. In Advances in Neural Information Processing Systems 11: Proceedings of
the 1998 Conference, volume 11, page 183. MIT Press.
[364] Wiegerinck, W. (2013). Variational approximations between mean field theory and
the junction tree algorithm. arXiv preprint arXiv:1301.3901.
[365] Williams, C. K. (1997). Computing with infinite networks. In Advances in neural
information processing systems, pages 295–301.
[366] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connec-
tionist reinforcement learning. Machine learning, 8(3):229–256.
[367] Wilson, A. G., Hu, Z., Salakhutdinov, R. R., and Xing, E. P. (2016). Stochastic
variational deep kernel learning. In Advances in Neural Information Processing Systems,
pages 2586–2594.
[368] Wu, G., Domke, J., and Sanner, S. (2018). Conditional inference in pre-trained
variational autoencoders via cross-coding. arXiv preprint arXiv:1805.07785.
[369] Wu, M. and Goodman, N. (2018). Multimodal generative models for scalable weakly-
supervised learning. In Advances in Neural Information Processing Systems, pages
5575–5585.
[370] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset
for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
[371] Yin, M. and Zhou, M. (2018). Semi-implicit variational inference. In International
Conference on Machine Learning, pages 5660–5669. PMLR.
[372] Yoon, J., Jordon, J., and Schaar, M. (2018). GAIN: Missing data imputation using
generative adversarial nets. In International Conference on Machine Learning, pages
5689–5698. PMLR.
[373] Yu, H.-F., Rao, N., and Dhillon, I. S. (2016). Temporal regularized matrix factor-
ization for high-dimensional time series prediction. In Advances in Neural Information
Processing Systems, pages 847–855.
[374] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola,
A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems, pages
3394–3404.
[375] Zakim, D., Braun, N., Fritz, P., and Alscher, M. D. (2008). Underutilization of
information and knowledge in everyday medical practice: Evaluation of a computer-
based solution. BMC Medical Informatics and Decision Making, 8(1):50.
[376] Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. (2018). Advances in variational
inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008–
2026.
[377] Zhang, R., Li, Y., De Sa, C., Devlin, S., and Zhang, C. (2021). Meta-learning diver-
gences for variational inference. In International Conference on Artificial Intelligence
and Statistics, pages 4024–4032. PMLR.
[378] Zheng, Z. and Padmanabhan, B. (2002). On active learning for data acquisition. In
International Conference on Data Mining, pages 562–569. IEEE.
[379] Zhou, Y., Shi, J., and Zhu, J. (2020). Nonparametric score estimators. In International
Conference on Machine Learning, pages 11513–11522. PMLR.
[380] Zhu, H. and Rohwer, R. (1995). Information geometric measurements of generalisa-
tion.