Notes
Yingzhen Li
(Lecture notes for the ProbAI 2022 BNN introduction lecture)
Assuming we have p(θ|D) at hand, then at prediction time, given a new test input x^*, we use the
following Bayesian predictive distribution to predict the corresponding output:
p(y^*|x^*, D) = ∫ p(y^*|x^*, θ) p(θ|D) dθ.  (5)
Unfortunately we don’t know how to directly compute p(θ|D) nor p(y^*|x^*, D). This is where
approximate Bayesian inference comes in; many of the existing approaches solve the problem
in the following 3 steps:
1. Design an approximate posterior: design a distribution family Q such that for each q(θ) ∈ Q,
we can compute its density given any θ, and q(θ) is easy to sample from;
2. Fit the approximate posterior: find the best q distribution in Q so that q(θ) ≈ p(θ|D) well
according to some criteria;
3. Approximate predictive inference with Monte Carlo: approximate p(y^*|x^*, D) by replacing the
exact posterior p(θ|D) with q(θ) and estimating the integral with Monte Carlo:
p(y^*|x^*, D) ≈ ∫ p(y^*|x^*, θ) q(θ) dθ ≈ (1/K) Σ_{k=1}^K p(y^*|x^*, θ_k),   θ_k ∼ q(θ).  (6)
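As a concrete illustration of step 3, here is a minimal Python sketch of the Monte Carlo predictive estimate in Eq. (6). The functions sample_q (drawing θ_k ∼ q(θ)) and predict_proba (returning the vector of class probabilities p(y^*|x^*, θ)) are hypothetical stand-ins for whatever approximate posterior and network are being used.

    import numpy as np

    def mc_predictive(sample_q, predict_proba, x_star, K=100):
        # Eq. (6): p(y*|x*, D) ~ (1/K) sum_k p(y*|x*, theta_k), theta_k ~ q(theta)
        probs = [predict_proba(sample_q(), x_star) for _ in range(K)]
        return np.mean(probs, axis=0)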
It remains to discuss how to find an approximation q(θ) ≈ p(θ|D), for which we will mainly
discuss the variational inference (VI) approach below.
2 Bayes-by-backprop: Mean-field VI for BNNs
2.1 Foundations of variational inference (VI)
As discussed above, we need to find an approximation q(θ) ≈ p(θ|D). A natural way to do so is to
minimise some divergence measure that tells the difference between q(θ) and p(θ|D). In particular,
variational inference (VI) uses the KL divergence and finds the best approximate posterior within a
distribution family Q (e.g., all Gaussians):
q^*(θ) = arg min_{q ∈ Q} KL[q(θ)||p(θ|D)],   KL[q(θ)||p(θ|D)] = E_q[log q(θ) − log p(θ|D)].  (7)
However, we cannot directly compute this KL as we don’t know how to compute p(θ|D) in the first
place. Fortunately we can re-write the KL into an equivalent objective: notice that by taking the
logarithm on both sides of Bayes’ rule:
log p(θ|D) = log [p(D|θ)p(θ) / p(D)] = log p(D|θ) + log p(θ) − log p(D),  (8)
which means (notice that log p(D) is a constant w.r.t. q and θ):

KL[q(θ)||p(θ|D)] = log p(D) − E_q[log p(D|θ)] + KL[q(θ)||p(θ)].

Therefore minimising this KL is equivalent to maximising the evidence lower bound (ELBO)

ELBO(q, D) = E_q[log p(D|θ)] − KL[q(θ)||p(θ)],

which strives to balance the data fitting quality against the complexity of q. For BNNs, Bayes-by-backprop [Blundell et al., 2015] uses a fully factorised (mean-field) approximate posterior:
q(θ) = ∏_{l=1}^L q(W^l) q(b^l),   q(W^l) = ∏_{ij} q(W^l_{ij}),   q(b^l) = ∏_i q(b^l_i).  (12)
Each factor is a Gaussian, e.g., q(W^l_{ij}) = N(W^l_{ij}; M^l_{ij}, V^l_{ij}) and q(b^l_i) = N(b^l_i; m^l_i, v^l_i).
Therefore the variational parameters to be optimised are the mean parameters {M^l_{ij}, m^l_i} and the
variance parameters {V^l_{ij}, v^l_i} of the q distribution. Note that as variance is positive, one cannot directly
optimise w.r.t. {V^l_{ij}, v^l_i} without constraints. Instead, in practice we often define e.g., log V^l_{ij} as a
free parameter to be optimised. Collecting all the parameters into vector and matrix forms, we have
µ = {M^l, m^l}_{l=1}^L and σ^2 = {V^l, v^l}_{l=1}^L, and the ELBO objective can be written as

ELBO(q, D) = E_q[log p(D|θ)] − Σ_{l=1}^L (KL[q(W^l)||p(W^l)] + KL[q(b^l)||p(b^l)]),
where KL[q(W^l)||p(W^l)] and KL[q(b^l)||p(b^l)] are KL divergences between two factorised Gaussians,
which have analytic solutions.
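For instance, for a fully factorised Gaussian q(θ) = N(µ, diag(σ^2)) and an isotropic Gaussian prior p(θ) = N(0, σ_prior^2 I), the closed form is KL[q||p] = Σ_i [log(σ_prior/σ_i) + (σ_i^2 + µ_i^2)/(2σ_prior^2) − 1/2]. A minimal sketch in code, parameterising the variance through log V as discussed above (the unit prior variance default is an assumption):

    import numpy as np

    def kl_factorised_gaussian(mu, log_var, prior_var=1.0):
        # KL[ N(mu, diag(exp(log_var))) || N(0, prior_var * I) ], summed over all entries
        var = np.exp(log_var)
        return np.sum(0.5 * (np.log(prior_var) - log_var)
                      + (var + mu ** 2) / (2.0 * prior_var) - 0.5)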
For the data-fit term E_q[log p(D|θ)], the expectation remains intractable since p(D|θ) is defined
using neural networks (i.e., a non-linear transform of θ). Also, for optimisation we need to be able
to compute the gradient of this data-fit term w.r.t. the mean and variance parameters of q(θ).
This problem is solved using Monte Carlo sampling and the reparameterisation trick [Kingma and
Welling, 2013]. In detail, for a Gaussian distribution q(θ) = N(θ; µ, diag(σ^2)), a sample θ_k ∼ q
can be computed as

θ_k = µ + σ ⊙ ϵ_k,   ϵ_k ∼ N(0, I),

with ⊙ denoting the element-wise product. This means we can estimate the data-fit term with
Monte Carlo as follows, where µ = {M^l, m^l}_{l=1}^L and σ^2 = {V^l, v^l}_{l=1}^L:
E_q[log p(D|θ)] ≈ (1/K) Σ_{k=1}^K log p(D|µ + σ ⊙ ϵ_k),   ϵ_k ∼ N(0, I).  (17)
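A small numpy sketch of this reparameterised estimator (Eq. 17); here mu and log_var are arrays holding the variational means and log-variances, and log_lik is a hypothetical function returning log p(D|θ) for a given parameter vector θ:

    import numpy as np

    def mc_data_fit(mu, log_var, log_lik, K=10, seed=0):
        # Eq. (17): E_q[log p(D|theta)] ~ (1/K) sum_k log p(D | mu + sigma * eps_k)
        rng = np.random.default_rng(seed)
        sigma = np.exp(0.5 * log_var)
        estimates = []
        for _ in range(K):
            eps = rng.standard_normal(mu.shape)   # eps_k ~ N(0, I)
            theta = mu + sigma * eps              # reparameterised sample theta_k ~ q
            estimates.append(log_lik(theta))
        return np.mean(estimates)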
Lastly, when the dataset D contains many datapoints, one might want to run mini-batch training,
i.e., stochastic gradient descent (SGD). This is possible for VI: notice that under the i.i.d. data
setting,
log p(D|θ) = Σ_{n=1}^N log p(y_n|x_n, θ) = N E_{(x,y)∼D}[log p(y|x, θ)].  (18)
This means we can estimate E_{(x,y)∼D}[log p(y|x, θ)] using a mini-batch of the data: assume that
(x_1, y_1), ..., (x_M, y_M) ∼ D:
E_{(x,y)∼D}[log p(y|x, θ)] ≈ (1/M) Σ_{m=1}^M log p(y_m|x_m, θ).  (19)
Combining all the sampling estimates together, we can compute the ELBO objective as follows, using
a single (K = 1) Monte Carlo sample for θ:
ELBO_β(q, D) ≈ (N/M) Σ_{m=1}^M log p(y_m|x_m, µ + σ ⊙ ϵ) − β KL[q(θ)||p(θ)],   ϵ ∼ N(0, I).  (20)
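Putting the pieces together, below is a minimal sketch of this single-sample mini-batch estimate of ELBO_β (Eq. 20). The per-datapoint log-likelihood log_lik(theta, x, y) = log p(y|x, θ) is a hypothetical stand-in, and the KL term reuses the closed-form expression for factorised Gaussians given earlier:

    import numpy as np

    def elbo_beta_estimate(mu, log_var, x_batch, y_batch, log_lik, N,
                           beta=1.0, prior_var=1.0, seed=0):
        rng = np.random.default_rng(seed)
        eps = rng.standard_normal(mu.shape)                   # eps ~ N(0, I)
        theta = mu + np.exp(0.5 * log_var) * eps              # single reparameterised sample (K = 1)
        M = len(x_batch)
        data_fit = (N / M) * sum(log_lik(theta, x, y) for x, y in zip(x_batch, y_batch))
        var = np.exp(log_var)                                 # analytic KL[q(theta)||p(theta)]
        kl = np.sum(0.5 * (np.log(prior_var) - log_var) + (var + mu ** 2) / (2.0 * prior_var) - 0.5)
        return data_fit - beta * kl                           # Eq. (20)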
But at the same time, it requires many more variational parameters if we were to use full-covariance
Gaussians for every layer. For example, a hidden layer with both input and output dimensions equal to
50 would need 50 × 50 = 2500 parameters for parameterising the mean, but Σ_{i=1}^{50×50} i = 3,126,250
parameters for parameterising the (symmetric) full covariance matrix! Therefore, when selecting
the q distribution family, one needs to also consider the computational & memory costs of such an
approximation.
Below we will discuss two more “economical” solutions:
1. The so-called “last-layer BNN” approach: only apply full-covariance Gaussian posterior ap-
proximation to the last layer of the network, and use MLE/MAP solutions for the previous
layers.
2. Monte Carlo dropout (MC-dropout): adding dropout layers to the network, and at test time,
running multiple forward passes with dropout.
In other words, we use deterministic network layers for all but the last layer, and for the last layer
we use a Gaussian approximation with a full-rank covariance matrix. The corresponding ELBO
objective then becomes

ELBO(q, D) = E_{q(θ^L)}[log p(D|θ_{1:L−1}, θ^L)] − KL[q(θ^L)||p(θ^L)],   q(θ^L) = N(θ^L; µ^L, Σ^L),   θ^L = {W^L, b^L},

which in practice means we only sample the last layer’s weights and compute the KL regulariser for
the last layer.
In regression tasks with Gaussian likelihood, this approach can be viewed as performing Bayesian
linear regression, but with non-linear input features computed by the previous neural network layers.
This means that, given fixed parameters θ_{1:L−1} = {M^l, m^l}_{l=1}^{L−1} for all the previous layers, the variational
parameters {µ^L, Σ^L} can be analytically calculated if we use the full batch for training. In practice we
may still prefer optimising the ELBO to find the optimal Gaussian posterior approximation for the
last layer, which can leverage SGD-based optimisation methods, and this approach can be applied
to other cases such as classification (thus more general).
with the variational parameters for the layer being {M^l, m^l}, M^l ∈ R^{d_out × d_in}, m^l ∈ R^{d_out}. This means there
are two equivalent ways to compute E_q[log p(y|x, θ)] with Monte Carlo: for the forward pass of a
layer, the following computations are equivalent:
The KL[q||p] regulariser for MC-dropout is ill-defined for a Gaussian prior p(θ). In practice this
is replaced by an ℓ2 regulariser, i.e., (1 − π)/(2σ_prior^2) ||M^l||_2^2 for the weight variational parameter M^l (and
similarly for m^l). The intuition is the following: the q distribution used in MC-dropout is the
limiting distribution of the following mixture of Gaussian distributions:
q(θ^l_i) = lim_{η→0} [π N(θ^l_i; 0, ηI) + (1 − π) N(θ^l_i; µ^l_i, ηI)],   θ^l_i = {W^l_i, b^l_i},   µ^l_i = {M^l_i, m^l_i}.  (24)
This permits a valid KL divergence KL[q(θ^l_i)||p(θ^l_i)] when using a Gaussian prior and η > 0, which includes
a term that approximately equals (1 − π)/(2σ_prior^2) ||M^l||_2^2 for the weight variational parameter M^l (and
similarly for m^l). The other terms in that KL regulariser depend on η and diverge to infinity
as η → 0; in MC-dropout those terms are dropped. In other words, the q distribution in
MC-dropout is improper, in the sense that approximating the posterior of a continuous variable
with a “mixture of delta measures” results in an infinite KL, which is undesirable. But in practice MC-dropout
can still provide quick posterior approximations (especially in function space), and it has been shown to
provide useful uncertainty information in downstream tasks.
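As an illustration, here is a minimal PyTorch sketch of MC-dropout prediction for a hypothetical small classifier: dropout is kept active at test time, and the softmax outputs of K stochastic forward passes are averaged.

    import torch
    import torch.nn as nn

    # Hypothetical classifier with a dropout layer after the hidden units.
    model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(50, 3))

    def mc_dropout_predict(model, x, K=50):
        model.train()  # keep dropout stochastic at test time (MC-dropout)
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(K)])
        return probs.mean(dim=0)  # Monte Carlo estimate of p(y*|x*, D)

    p_star = mc_dropout_predict(model, torch.randn(1, 10))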
Suppose you want to find x^* = arg max_{x∈X} f_0(x), but you don’t actually know the analytic form of
f_0(x). Rather, you only have access to it as a “black-box”: you provide an input x to this “black-box”,
and it returns a (noisy version of the) output y = f_0(x) + ϵ.
Bayesian optimisation (BO) [Snoek et al., 2012] is a class of methods that tackle this challenge.
To motivate BO, let’s imagine you already have a set of datapoints D = {(x_n, y_n)}_{n=1}^N collected
by sending the queries x_1, ..., x_N to the above-mentioned “black-box”. Then we can fit a surrogate
function f_θ(x) to the data. In the large data limit (N → +∞), with a flexible enough model, we expect f_θ(x)
to be very close, if not identical, to the ground-truth function f_0(x). In that case we can solve the
optimisation problem by finding x^* = arg max_{x∈X} f_θ(x) instead, which is tractable.
However, in practice we don’t have such a big dataset to train the surrogate model. This
is especially the case if the “black-box” corresponds to an expensive experiment (e.g., training a
Transformer network where x represents the hyper-parameter settings). In such a scenario f_θ(x) will
be quite different from f_0(x) at most of the unseen input locations. But at the end of the day, we
are only interested in the maximum of f_0(x) rather than the value of f_0(x) at all possible input
locations. This means training the surrogate model with very large datasets is unnecessary, and it
is possible to use some smart algorithms that can find the optimum of f_0 without excessive queries
to the “black-box”.
The key idea of BO is to optimise f_0 with “help” from the surrogate f_θ, by taking the uncertainty
of model fitting into account. It aims to find the optimum of f_0 with the least number of queries to the
expensive “black-box”. Specifically, there are 3 ingredients of a BO method:
1. Acquisition function: use the current estimated surrogate function fθ (x) and its uncertainty
estimate to compute an acquisition function a(x);
2. Query the “black-box”: find the next input x_* to query by maximising the acquisition function:
x_* = arg max_x a(x), and query the corresponding output value y_* = f_0(x_*) + ϵ_*;
3. Surrogate model update: given new queried result (x∗ , y∗ ) and historical data D, update
D ← D ∪ {(x∗ , y∗ )}, and use this new dataset D to update the surrogate model and its
uncertainty estimate.
At the beginning, since the model is uncertain, a good acquisition function will take uncertainty
into account and encourage “exploration”, i.e., querying inputs at different locations. As we collect
more data, with proper Bayesian posterior updates and assuming the family of f_θ is flexible enough,
the surrogate model f_θ will become closer and closer to the ground-truth function f_0 and the un-
certainty will be reduced. So at some point the model will become certain about its output, and
a good acquisition function will also enable “exploitation” at this stage, seeking the optimum of
the surrogate function f_θ so as to solve the original optimisation problem.
A popular choice is the upper confidence bound (UCB) acquisition function [Srinivas et al., 2010],
a(x) = m(x) + βσ(x), where m(x) and σ(x) are the surrogate’s predictive mean and standard deviation at x,
and the query procedure will pick the next query input as x_* = arg max_x m(x) + βσ(x).
Initially, σ(x) can be quite large for many regions, meaning that UCB will mainly explore. As we
collect more data, σ(x) will decrease around the regions that have been searched, and this allows
the algorithm to
1. ignore some searched regions where the model confidently thinks their function value is small;
2. exploit some other searched regions where the model confidently believes the optimum might
be there;
3. explore some other promising regions that have not been searched before.
Here β is a hyper-parameter specified by the user, to achieve a desired balance between
exploration (σ(x)) and exploitation (m(x)). When β = 0, we fully trust the surrogate model
and exploit it. When β is large, we allow the query process to focus on regions that have large
σ(x) values (where the model is most uncertain). For optimal BO, this β coefficient should
decrease over time; for simplicity, in the demo we only consider a fixed value of β.
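Below is a minimal sketch of such a UCB-based BO loop over a finite set of candidate inputs. The surrogate object (with update and predict methods returning the predictive mean m(x) and standard deviation σ(x)) and the black_box function are hypothetical stand-ins, not part of any particular library.

    import numpy as np

    def bo_ucb(surrogate, black_box, candidates, beta=2.0, num_queries=20):
        xs, ys = [], []
        for _ in range(num_queries):
            mean, std = surrogate.predict(candidates)   # m(x), sigma(x) over the candidate grid
            acq = mean + beta * std                     # UCB acquisition a(x)
            x_next = candidates[int(np.argmax(acq))]    # next query input
            y_next = black_box(x_next)                  # expensive, noisy evaluation of f_0
            xs.append(x_next)
            ys.append(y_next)
            surrogate.update(x_next, y_next)            # refit the surrogate with the new datapoint
        return xs[int(np.argmax(ys))]                   # best input found so far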
5.1 Uncertainty measures
We consider two types of uncertainty, using coin flipping as a running example:
• Epistemic uncertainty: also named model uncertainty, this is the uncertainty due to lack of
knowledge, and thus can be reduced by collecting more data. For example, by flipping a coin
multiple times, we become more and more certain about whether the coin is fair or bent;
• Aleatoric uncertainty: also named data uncertainty, this is the uncertainty regarding the
stochasticity of an individual experimental outcome, which is non-reducible. For example, even if
we are 100% sure that the coin is fair, we are still unsure about whether the next coin
flip will come up heads or tails.
These two types of uncertainty, when summed, give the total uncertainty, i.e.,

total uncertainty = epistemic uncertainty + aleatoric uncertainty.

A standard way to quantify the uncertainty of a categorical distribution p = (p_1, ..., p_C) is its (Shannon)
entropy, H[p] = − Σ_{c=1}^C p_c log p_c. We can show that H[p] is maximised when p_c = 1/C, i.e., each category has equal probability, and
in this case we are very uncertain about the sampling outcome. On the other hand, H[p] = 0 when
p_c = 1 for a particular c ∈ {1, ..., C}, meaning that the sampling outcome will be c 100% of the
time (thus certain). In other words, higher entropy means we are more uncertain, and vice versa.
For BNNs applied to classification tasks, we can also compute uncertainty based on entropy-
related measures. Recall that the Bayesian predictive distribution is
p(y^*|x^*, D) = ∫ p(y^*|x^*, θ) p(θ|D) dθ.  (29)
In the classification case p(y^*|x^*, D) is a categorical distribution, which means we can compute its entropy
to measure the total uncertainty:
H[y^*|x^*, D] = H[p(y^*|x^*, D)] = − Σ_{c=1}^C p(y^* = c|x^*, D) log p(y^* = c|x^*, D).  (30)
But also notice that for each sample θ ∼ p(θ|D), p(y^*|x^*, θ) is also a categorical distribution, for
which we can again compute the entropy. This enables us to compute the conditional entropy to
measure aleatoric uncertainty:
E_{p(θ|D)}[H[y^*|x^*, θ]] = E_{p(θ|D)}[H[p(y^*|x^*, θ)]] = E_{p(θ|D)}[− Σ_{c=1}^C p(y^* = c|x^*, θ) log p(y^* = c|x^*, θ)].  (31)
The intuition for conditional entropy being a measure for aleatoric uncertainty is as follows. When
the posterior p(θ|D) becomes a delta mass, there is no epistemic uncertainty as we are now certain
about the weight parameters. Even so, p(y ∗ |x∗ , θ) for θ evaluated at the posterior mode can still
have non-zero probability for multiple categories, which is especially the case when label noise exists
(thus having non-zero aleatoric uncertainty). In this case the conditional entropy is greater than
zero and can be used as a measure for aleatoric uncertainty.
As for epistemic uncertainty, from the relationship between total, epistemic and aleatoric uncer-
tainty, we can compute it as “epistemic uncertainty = total uncertainty − aleatoric uncertainty”. It
turns out that this is related to the mutual information between y^* and θ:

I[y^*; θ|x^*, D] = H[y^*|x^*, D] − E_{p(θ|D)}[H[y^*|x^*, θ]].  (32)
One can show that this mutual information can be re-written as

I[y^*; θ|x^*, D] = E_{p(y^*|x^*, D)}[KL[p(θ|D, x^*, y^*)||p(θ|D)]],

which tells us, in (the model’s) expectation, how much the posterior will be updated given a new
observation (x^*, y^*). Notice that to reduce the epistemic uncertainty of a BNN, one can supply new
datapoints at input locations where the network is uncertain about its outputs. As the reduction
of epistemic uncertainty is achieved by making the posterior more concentrated, a higher
mutual information typically means that the network expects a larger reduction in epistemic uncertainty
if we supply (x^*, y^*) as a new observation, which also indicates that the network currently
has high epistemic uncertainty about the output value at x^*.
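Given Monte Carlo samples of the per-θ predictive probabilities (e.g., drawn from q(θ) as a stand-in for the exact posterior), the three quantities in Eqs. (30)-(32) can be estimated as in the following sketch:

    import numpy as np

    def uncertainty_decomposition(probs, eps=1e-12):
        # probs has shape (K, C): row k is p(y*|x*, theta_k) for theta_k ~ q(theta)
        p_bar = probs.mean(axis=0)                                          # approx. p(y*|x*, D)
        total = -np.sum(p_bar * np.log(p_bar + eps))                        # Eq. (30): total uncertainty
        aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))   # Eq. (31): aleatoric uncertainty
        epistemic = total - aleatoric                                       # Eq. (32): mutual information
        return total, aleatoric, epistemic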
The above discussed uncertainty measures have been used not only for OOD detection but also
for sequential decision making, including acquisition function design in active learning and Bayesian
optimisation. Interested readers are referred to e.g., [Houlsby et al., 2011; Hernández-Lobato et al.,
2014; Gal et al., 2017] for further reading.
One can show that this approach also performs variational inference, and the training objectives of
these networks can be combined to provide a valid lower bound on the marginal log-likelihood
log p(D). To see this, consider the following “augmented prior”:

p(θ, s) = p(θ) p(s),   p(s) = 1/S for s ∈ {1, ..., S}.

One can show that this “augmented model” p(D|θ)p(θ)p(s) preserves the marginal likelihood:
Σ_{s=1}^S ∫ p(D|θ)p(θ)p(s) dθ = ∫ p(D|θ)p(θ) dθ = p(D).  (36)
This also allows us to define an ELBO based on this “augmented model” p(D|θ)p(θ)p(s) and an
approximate posterior q(θ, s). In particular, if we define q(θ, s) = q(s) q_s(θ) with q(s) = p(s) = 1/S,
the resulting variational lower bound is

L = (1/S) Σ_{s=1}^S ELBO(q_s, D) ≤ log p(D).
Since the variational parameters (e.g., the mean & log variance of the mean-field Gaussian) for different
q_s(θ) are independent of each other, instead of training all S BNNs together with the
above variational lower bound L, one can simply train each of them independently using the ELBO
objective ELBO(q_s, D).
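A minimal sketch of this independent-training view: each member s is trained on its own ELBO(q_s, D) via a hypothetical train_member routine, and the members’ predictive distributions are averaged at test time.

    import numpy as np

    def ensemble_predict(train_member, predict_proba, data, x_star, S=5):
        members = [train_member(data, seed=s) for s in range(S)]   # S independent training runs
        probs = [predict_proba(q_s, x_star) for q_s in members]    # p(y*|x*, .) for each member
        return np.mean(probs, axis=0)                              # averaged predictive distribution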
1 https://izmailovpavel.github.io/neurips_bdl_competition/
References
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural
network. In International Conference on Machine Learning. PMLR.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model
uncertainty in deep learning. In International Conference on Machine Learning. PMLR.
Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep bayesian active learning with image data. In
International Conference on Machine Learning, pages 1183–1192. PMLR.
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search
for efficient global optimization of black-box functions. Advances in neural information processing
systems, 27.
Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for
classification and preference learning. arXiv preprint arXiv:1112.5745.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive un-
certainty estimation using deep ensembles. Advances in neural information processing systems,
30.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine
learning algorithms. In Advances in neural information processing systems.
Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the
bandit setting: No regret and experimental design. In Proceedings of the International Conference
on Machine Learning, 2010.