Notes
Yingzhen Li
(Lecture notes for the ProbAI 2022 BNN introduction lecture)
Assuming we have p(θ|D) at hand, then at prediction time, given a new test input x^*, we use the
following Bayesian predictive distribution to predict the corresponding output:
p(y^*|x^*, D) = ∫ p(y^*|x^*, θ) p(θ|D) dθ.  (5)
Unfortunately we don’t know how to directly compute p(θ|D) nor p(y^*|x^*, D). This is where
approximate Bayesian inference comes in; many of the existing approaches solve the problem
in the following 3 steps:
1. Design an approximate posterior: design a distribution family Q such that for each q(θ) ∈ Q,
we can compute its density given any θ, and q(θ) is easy to sample from;
2. Fit the approximate posterior: find the best q distribution in Q so that q(θ) ≈ p(θ|D) well
according to some criteria;
3. Approximate predictive inference with Monte Carlo: approximate p(y^*|x^*, D) by replacing the
exact posterior p(θ|D) with q(θ) and estimating the integral with Monte Carlo:
p(y^*|x^*, D) ≈ ∫ p(y^*|x^*, θ) q(θ) dθ ≈ (1/K) Σ_{k=1}^K p(y^*|x^*, θ_k),   θ_k ∼ q(θ).  (6)
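As a concrete illustration of step 3, here is a minimal Python sketch of the Monte Carlo predictive estimate in Eq. (6). The functions sample_q (drawing θ_k ∼ q(θ)) and predict_proba (returning the vector of class probabilities p(y^*|x^*, θ)) are hypothetical stand-ins for whatever approximate posterior and network are being used.

    import numpy as np

    def mc_predictive(sample_q, predict_proba, x_star, K=100):
        # Eq. (6): p(y*|x*, D) ~ (1/K) sum_k p(y*|x*, theta_k), theta_k ~ q(theta)
        probs = [predict_proba(sample_q(), x_star) for _ in range(K)]
        return np.mean(probs, axis=0)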
It remains to discuss how to find an approximation q(θ) ≈ p(θ|D), for which we will mainly
discuss the variational inference (VI) approach below.
2 Bayes-by-backprop: Mean-field VI for BNNs
2.1 Foundations of variational inference (VI)
As discussed above, we need to find an approximation q(θ) ≈ p(θ|D). A natural way to do so is to
minimise some divergence measure that tells the difference between q(θ) and p(θ|D). In particular,
variational inference (VI) uses the KL divergence and finds the best approximate posterior within a
distribution family Q (e.g., all Gaussians):
q^*(θ) = arg min_{q ∈ Q} KL[q(θ)||p(θ|D)],   KL[q(θ)||p(θ|D)] = E_q[log q(θ) − log p(θ|D)].  (7)
However, we cannot directly compute this KL as we don’t know how to compute p(θ|D) in the first
place. Fortunately we can re-write the KL into an equivalent objective: notice that by taking the
logarithm on both sides of Bayes’ rule:
log p(θ|D) = log [p(D|θ)p(θ) / p(D)] = log p(D|θ) + log p(θ) − log p(D),  (8)
which means (notice that log p(D) is a constant w.r.t. q and θ):

KL[q(θ)||p(θ|D)] = log p(D) − E_q[log p(D|θ)] + KL[q(θ)||p(θ)].

Therefore minimising this KL is equivalent to maximising the evidence lower bound (ELBO)

ELBO(q, D) = E_q[log p(D|θ)] − KL[q(θ)||p(θ)],

which strives to balance the data fitting quality against the complexity of q. For BNNs, Bayes-by-backprop [Blundell et al., 2015] uses a fully factorised (mean-field) approximate posterior:
q(θ) = ∏_{l=1}^L q(W^l) q(b^l),   q(W^l) = ∏_{ij} q(W^l_{ij}),   q(b^l) = ∏_i q(b^l_i).  (12)
Each factor is a Gaussian, e.g., q(W^l_{ij}) = N(W^l_{ij}; M^l_{ij}, V^l_{ij}) and q(b^l_i) = N(b^l_i; m^l_i, v^l_i).
Therefore the variational parameters to be optimised are the mean parameters {M^l_{ij}, m^l_i} and the
variance parameters {V^l_{ij}, v^l_i} of the q distribution. Note that as variance is positive, one cannot directly
optimise w.r.t. {V^l_{ij}, v^l_i} without constraints. Instead, in practice we often define e.g., log V^l_{ij} as a
free parameter to be optimised. Collecting all the parameters into vector and matrix forms, we have
µ = {M^l, m^l}_{l=1}^L and σ^2 = {V^l, v^l}_{l=1}^L, and the ELBO objective can be written as

ELBO(q, D) = E_q[log p(D|θ)] − Σ_{l=1}^L (KL[q(W^l)||p(W^l)] + KL[q(b^l)||p(b^l)]),
where KL[q(W^l)||p(W^l)] and KL[q(b^l)||p(b^l)] are KL divergences between two factorised Gaussians,
which have analytic solutions.
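For instance, for a fully factorised Gaussian q(θ) = N(µ, diag(σ^2)) and an isotropic Gaussian prior p(θ) = N(0, σ_prior^2 I), the closed form is KL[q||p] = Σ_i [log(σ_prior/σ_i) + (σ_i^2 + µ_i^2)/(2σ_prior^2) − 1/2]. A minimal sketch in code, parameterising the variance through log V as discussed above (the unit prior variance default is an assumption):

    import numpy as np

    def kl_factorised_gaussian(mu, log_var, prior_var=1.0):
        # KL[ N(mu, diag(exp(log_var))) || N(0, prior_var * I) ], summed over all entries
        var = np.exp(log_var)
        return np.sum(0.5 * (np.log(prior_var) - log_var)
                      + (var + mu ** 2) / (2.0 * prior_var) - 0.5)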
For the data-fit term E_q[log p(D|θ)], the expectation remains intractable since p(D|θ) is defined
using neural networks (i.e., a non-linear transform of θ). Also, for optimisation we need to be able
to compute the gradient of this data-fit term w.r.t. the mean and variance parameters of q(θ).
This problem is solved using Monte Carlo sampling and the reparameterisation trick [Kingma and
Welling, 2013]. In detail, for a Gaussian distribution q(θ) = N(θ; µ, diag(σ^2)), a sample θ_k ∼ q
can be computed as

θ_k = µ + σ ⊙ ϵ_k,   ϵ_k ∼ N(0, I),

with ⊙ denoting the element-wise product. This means we can estimate the data-fit term with
Monte Carlo as follows, where µ = {M^l, m^l}_{l=1}^L and σ^2 = {V^l, v^l}_{l=1}^L:
E_q[log p(D|θ)] ≈ (1/K) Σ_{k=1}^K log p(D|µ + σ ⊙ ϵ_k),   ϵ_k ∼ N(0, I).  (17)
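A small numpy sketch of this reparameterised estimator (Eq. 17); here mu and log_var are arrays holding the variational means and log-variances, and log_lik is a hypothetical function returning log p(D|θ) for a given parameter vector θ:

    import numpy as np

    def mc_data_fit(mu, log_var, log_lik, K=10, seed=0):
        # Eq. (17): E_q[log p(D|theta)] ~ (1/K) sum_k log p(D | mu + sigma * eps_k)
        rng = np.random.default_rng(seed)
        sigma = np.exp(0.5 * log_var)
        estimates = []
        for _ in range(K):
            eps = rng.standard_normal(mu.shape)   # eps_k ~ N(0, I)
            theta = mu + sigma * eps              # reparameterised sample theta_k ~ q
            estimates.append(log_lik(theta))
        return np.mean(estimates)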
Lastly, when the dataset D contains many datapoints, one might want to run mini-batch training,
i.e., stochastic gradient descent (SGD). This is possible for VI: notice that under the i.i.d. data
setting,
log p(D|θ) = Σ_{n=1}^N log p(y_n|x_n, θ) = N E_{(x,y)∼D}[log p(y|x, θ)].  (18)
This means we can estimate E_{(x,y)∼D}[log p(y|x, θ)] using a mini-batch of the data: assume that
(x_1, y_1), ..., (x_M, y_M) ∼ D:
E_{(x,y)∼D}[log p(y|x, θ)] ≈ (1/M) Σ_{m=1}^M log p(y_m|x_m, θ).  (19)
Combining all the sampling estimates together, we can compute the ELBO objective as follows, using
a single (K = 1) Monte Carlo sample for θ:
ELBO_β(q, D) ≈ (N/M) Σ_{m=1}^M log p(y_m|x_m, µ + σ ⊙ ϵ) − β KL[q(θ)||p(θ)],   ϵ ∼ N(0, I).  (20)
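Putting the pieces together, below is a minimal sketch of this single-sample mini-batch estimate of ELBO_β (Eq. 20). The per-datapoint log-likelihood log_lik(theta, x, y) = log p(y|x, θ) is a hypothetical stand-in, and the KL term reuses the closed-form expression for factorised Gaussians given earlier:

    import numpy as np

    def elbo_beta_estimate(mu, log_var, x_batch, y_batch, log_lik, N,
                           beta=1.0, prior_var=1.0, seed=0):
        rng = np.random.default_rng(seed)
        eps = rng.standard_normal(mu.shape)                   # eps ~ N(0, I)
        theta = mu + np.exp(0.5 * log_var) * eps              # single reparameterised sample (K = 1)
        M = len(x_batch)
        data_fit = (N / M) * sum(log_lik(theta, x, y) for x, y in zip(x_batch, y_batch))
        var = np.exp(log_var)                                 # analytic KL[q(theta)||p(theta)]
        kl = np.sum(0.5 * (np.log(prior_var) - log_var) + (var + mu ** 2) / (2.0 * prior_var) - 0.5)
        return data_fit - beta * kl                           # Eq. (20)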
But at the same time, it requires many more variational parameters if we were to use full-covariance
Gaussians for every layer. For example, a hidden layer with both input and output dimensions equal to
50 would need 50 × 50 = 2500 parameters for parameterising the mean, but Σ_{i=1}^{50×50} i = 3,126,250
parameters for parameterising the (symmetric) full covariance matrix! Therefore, when selecting
the q distribution family, one needs to also consider the computational & memory costs of such an
approximation.
Below we will discuss two more “economical” solutions:
1. The so-called “last-layer BNN” approach: only apply full-covariance Gaussian posterior ap-
proximation to the last layer of the network, and use MLE/MAP solutions for the previous
layers.
2. Monte Carlo dropout (MC-dropout): adding dropout layers to the network, and at test time,
running multiple forward passes with dropout.
In other words, we use deterministic network layers for all but the last layer, and for the last layer
we use a Gaussian approximation with a full-rank covariance matrix. The corresponding ELBO
objective then becomes

ELBO(q, D) = E_{q(θ^L)}[log p(D|θ_{1:L−1}, θ^L)] − KL[q(θ^L)||p(θ^L)],   q(θ^L) = N(θ^L; µ^L, Σ^L),   θ^L = {W^L, b^L},

which in practice means we only sample the last layer’s weights and compute the KL regulariser for
the last layer.
In regression tasks with Gaussian likelihood, this approach can be viewed as performing Bayesian
linear regression, but with non-linear input features computed by the previous neural network layers.
This means that, given fixed parameters θ_{1:L−1} = {M^l, m^l}_{l=1}^{L−1} for all the previous layers, the variational
parameters {µ^L, Σ^L} can be analytically calculated if we use the full batch for training. In practice we
may still prefer optimising the ELBO to find the optimal Gaussian posterior approximation for the
last layer, which can leverage SGD-based optimisation methods, and this approach can be applied
to other cases such as classification (thus more general).
with the variational parameters for the layer being {M^l, m^l}, M^l ∈ R^{d_out × d_in}, m^l ∈ R^{d_out}. This means there
are two equivalent ways to compute E_q[log p(y|x, θ)] with Monte Carlo: for the forward pass of a
layer, the following computations are equivalent:
The KL[q||p] regulariser for MC-dropout is ill-defined for a Gaussian prior p(θ). In practice this
is replaced by an ℓ2 regulariser, i.e., (1 − π)/(2σ_prior^2) ||M^l||_2^2 for the weight variational parameter M^l (and
similarly for m^l). The intuition is the following: the q distribution used in MC-dropout is the
limiting distribution of the following mixture of Gaussian distributions:
q(θ^l_i) = lim_{η→0} [π N(θ^l_i; 0, ηI) + (1 − π) N(θ^l_i; µ^l_i, ηI)],   θ^l_i = {W^l_i, b^l_i},   µ^l_i = {M^l_i, m^l_i}.  (24)
This permits a valid KL divergence KL[q(θ^l_i)||p(θ^l_i)] when using a Gaussian prior and η > 0, which includes
a term that approximately equals (1 − π)/(2σ_prior^2) ||M^l||_2^2 for the weight variational parameter M^l (and
similarly for m^l). The other terms in that KL regulariser depend on η and diverge to infinity
as η → 0; in MC-dropout those terms are dropped. In other words, the q distribution in
MC-dropout is improper, in the sense that approximating the posterior of a continuous variable
with a “mixture of delta measures” results in an infinite KL, which is undesirable. But in practice MC-dropout
can still provide quick posterior approximations (especially in function space), and it has been shown to
provide useful uncertainty information in downstream tasks.
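As an illustration, here is a minimal PyTorch sketch of MC-dropout prediction for a hypothetical small classifier: dropout is kept active at test time, and the softmax outputs of K stochastic forward passes are averaged.

    import torch
    import torch.nn as nn

    # Hypothetical classifier with a dropout layer after the hidden units.
    model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(50, 3))

    def mc_dropout_predict(model, x, K=50):
        model.train()  # keep dropout stochastic at test time (MC-dropout)
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(K)])
        return probs.mean(dim=0)  # Monte Carlo estimate of p(y*|x*, D)

    p_star = mc_dropout_predict(model, torch.randn(1, 10))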
Suppose you want to find x^* = arg max_{x∈X} f_0(x), but you don’t actually know the analytic form of
f_0(x). Rather, you only have access to it as a “black-box”: you provide an input x to this “black-box”,
and it returns a (noisy version of the) output y = f_0(x) + ϵ.
Bayesian optimisation (BO) [Snoek et al., 2012] is a class of methods that tackle this challenge.
To motivate BO, let’s imagine you already have a set of datapoints D = {(x_n, y_n)}_{n=1}^N collected
by sending the queries x_1, ..., x_N to the above-mentioned “black-box”. Then we can fit a surrogate
function f_θ(x) to the data. In the large data limit (N → +∞), with a flexible enough model, we expect f_θ(x)
to be very close, if not identical, to the ground-truth function f_0(x). In that case we can solve the
optimisation problem by finding x^* = arg max_{x∈X} f_θ(x) instead, which is tractable.
However, in practice we don’t have such a big dataset to train the surrogate model. This
is especially the case if the “black-box” corresponds to an expensive experiment (e.g., training a
Transformer network where x represents the hyper-parameter settings). In such a scenario f_θ(x) will
be quite different from f_0(x) at most of the unseen input locations. But at the end of the day, we
are only interested in the maximum of f_0(x) rather than the value of f_0(x) at all possible input
locations. This means training the surrogate model with very large datasets is unnecessary, and it
is possible to use some smart algorithms that can find the optimum of f_0 without excessive queries
to the “black-box”.
The key idea of BO is to optimise f_0 with “help” from the surrogate f_θ, by taking the uncertainty
of model fitting into account. It aims to find the optimum of f_0 with the least number of queries to the
expensive “black-box”. Specifically, there are 3 ingredients of a BO method:
1. Acquisition function: use the current estimated surrogate function fθ (x) and its uncertainty
estimate to compute an acquisition function a(x);
2. Query the “black-box”: find the next input x_* to query by maximising the acquisition function:
x_* = arg max_x a(x), and query the corresponding output value y_* = f_0(x_*) + ϵ_*;
3. Surrogate model update: given new queried result (x∗ , y∗ ) and historical data D, update
D ← D ∪ {(x∗ , y∗ )}, and use this new dataset D to update the surrogate model and its
uncertainty estimate.
At the beginning, since the model is uncertain, a good acquisition function will take uncertainty
into account and encourage “exploration”, i.e., querying inputs at different locations. As we collect
more data, with proper Bayesian posterior updates and assuming the family of f_θ is flexible enough,
the surrogate model f_θ will become closer and closer to the ground-truth function f_0 and the un-
certainty will be reduced. So at some point the model will become certain about its output, and
a good acquisition function will also enable “exploitation” at this stage, seeking the optimum of
the surrogate function f_θ so as to solve the original optimisation problem.
A popular choice is the upper confidence bound (UCB) acquisition function [Srinivas et al., 2010],
a(x) = m(x) + βσ(x), where m(x) and σ(x) are the surrogate’s predictive mean and standard deviation at x,
and the query procedure will pick the next query input as x_* = arg max_x m(x) + βσ(x).
Initially, σ(x) can be quite large for many regions, meaning that UCB will mainly explore. As we
collect more data, σ(x) will decrease around the regions that have been searched, and this allows
the algorithm to
1. ignore some searched regions where the model confidently thinks their function value is small;
2. exploit some other searched regions where the model confidently believes the optimum might
be there;
3. explore some other promising regions that have not been searched before.
Here β is a hyper-parameter specified by the user, to achieve a desired balance between
exploration (σ(x)) and exploitation (m(x)). When β = 0, we fully trust the surrogate model
and exploit it. When β is large, we allow the query process to focus on regions that have large
σ(x) values (where the model is most uncertain). For optimal BO, this β coefficient should
decrease over time; for simplicity, in the demo we only consider a fixed value of β.
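Below is a minimal sketch of such a UCB-based BO loop over a finite set of candidate inputs. The surrogate object (with update and predict methods returning the predictive mean m(x) and standard deviation σ(x)) and the black_box function are hypothetical stand-ins, not part of any particular library.

    import numpy as np

    def bo_ucb(surrogate, black_box, candidates, beta=2.0, num_queries=20):
        xs, ys = [], []
        for _ in range(num_queries):
            mean, std = surrogate.predict(candidates)   # m(x), sigma(x) over the candidate grid
            acq = mean + beta * std                     # UCB acquisition a(x)
            x_next = candidates[int(np.argmax(acq))]    # next query input
            y_next = black_box(x_next)                  # expensive, noisy evaluation of f_0
            xs.append(x_next)
            ys.append(y_next)
            surrogate.update(x_next, y_next)            # refit the surrogate with the new datapoint
        return xs[int(np.argmax(ys))]                   # best input found so far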
5.1 Uncertainty measures
We consider two types of uncertainty, using coin flipping as a running example:
• Epistemic uncertainty: also named model uncertainty, this is the uncertainty due to lack of
knowledge, and thus can be reduced by collecting more data. For example, by flipping a coin
multiple times, we become more and more certain about whether the coin is fair or bent;
• Aleatoric uncertainty: also named data uncertainty, this is the uncertainty regarding the
stochasticity of an individual experimental outcome, which is non-reducible. For example, even if
we are 100% sure that the coin is fair, we are still unsure about whether the next coin
flip will come up heads or tails.
These two types of uncertainty, when summed, give the total uncertainty, i.e.,

total uncertainty = epistemic uncertainty + aleatoric uncertainty.

A standard way to quantify the uncertainty of a categorical distribution p = (p_1, ..., p_C) is its (Shannon)
entropy, H[p] = − Σ_{c=1}^C p_c log p_c. We can show that H[p] is maximised when p_c = 1/C, i.e., each category has equal probability, and
in this case we are very uncertain about the sampling outcome. On the other hand, H[p] = 0 when
p_c = 1 for a particular c ∈ {1, ..., C}, meaning that the sampling outcome will be c 100% of the
time (thus certain). In other words, higher entropy means we are more uncertain, and vice versa.
For BNNs applied to classification tasks, we can also compute uncertainty based on entropy-
related measures. Recall that the Bayesian predictive distribution is
p(y^*|x^*, D) = ∫ p(y^*|x^*, θ) p(θ|D) dθ.  (29)
In the classification case p(y^*|x^*, D) is a categorical distribution, which means we can compute its entropy
to measure the total uncertainty:
H[y^*|x^*, D] = H[p(y^*|x^*, D)] = − Σ_{c=1}^C p(y^* = c|x^*, D) log p(y^* = c|x^*, D).  (30)
But also notice that for each sample θ ∼ p(θ|D), p(y^*|x^*, θ) is also a categorical distribution, for
which we can again compute the entropy. This enables us to compute the conditional entropy to
measure aleatoric uncertainty:
E_{p(θ|D)}[H[y^*|x^*, θ]] = E_{p(θ|D)}[H[p(y^*|x^*, θ)]] = E_{p(θ|D)}[− Σ_{c=1}^C p(y^* = c|x^*, θ) log p(y^* = c|x^*, θ)].  (31)
The intuition for conditional entropy being a measure for aleatoric uncertainty is as follows. When
the posterior p(θ|D) becomes a delta mass, there is no epistemic uncertainty as we are now certain
about the weight parameters. Even so, p(y ∗ |x∗ , θ) for θ evaluated at the posterior mode can still
have non-zero probability for multiple categories, which is especially the case when label noise exists
(thus having non-zero aleatoric uncertainty). In this case the conditional entropy is greater than
zero and can be used as a measure for aleatoric uncertainty.
As for epistemic uncertainty, from the relationship between total, epistemic and aleatoric uncer-
tainty, we can compute it as “epistemic uncertainty = total uncertainty − aleatoric uncertainty”. It
turns out that this is related to the mutual information between y^* and θ:

I[y^*; θ|x^*, D] = H[y^*|x^*, D] − E_{p(θ|D)}[H[y^*|x^*, θ]].  (32)
One can show that this mutual information can be re-written as

I[y^*; θ|x^*, D] = E_{p(y^*|x^*, D)}[KL[p(θ|D, x^*, y^*)||p(θ|D)]],

which tells us, in (the model’s) expectation, how much the posterior will be updated given a new
observation (x^*, y^*). Notice that to reduce the epistemic uncertainty of a BNN, one can supply new
datapoints at input locations where the network is uncertain about its outputs. As the reduction
of epistemic uncertainty is achieved by making the posterior more concentrated, a higher
mutual information typically means that the network expects a larger reduction in epistemic uncertainty
if we supply (x^*, y^*) as a new observation, which also indicates that the network currently
has high epistemic uncertainty about the output value at x^*.
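Given Monte Carlo samples of the per-θ predictive probabilities (e.g., drawn from q(θ) as a stand-in for the exact posterior), the three quantities in Eqs. (30)-(32) can be estimated as in the following sketch:

    import numpy as np

    def uncertainty_decomposition(probs, eps=1e-12):
        # probs has shape (K, C): row k is p(y*|x*, theta_k) for theta_k ~ q(theta)
        p_bar = probs.mean(axis=0)                                          # approx. p(y*|x*, D)
        total = -np.sum(p_bar * np.log(p_bar + eps))                        # Eq. (30): total uncertainty
        aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))   # Eq. (31): aleatoric uncertainty
        epistemic = total - aleatoric                                       # Eq. (32): mutual information
        return total, aleatoric, epistemic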
The above discussed uncertainty measures have been used not only for OOD detection but also
for sequential decision making, including acquisition function design in active learning and Bayesian
optimisation. Interested readers are referred to e.g., [Houlsby et al., 2011; Hernández-Lobato et al.,
2014; Gal et al., 2017] for further reading.
One can show that this approach also performs variational inference, and the training objectives of
these networks can be combined to provide a valid lower bound on the marginal log-likelihood
log p(D). To see this, consider the following “augmented prior”:

p(θ, s) = p(θ) p(s),   p(s) = 1/S for s ∈ {1, ..., S}.

One can show that this “augmented model” p(D|θ)p(θ)p(s) preserves the marginal likelihood:
Σ_{s=1}^S ∫ p(D|θ)p(θ)p(s) dθ = ∫ p(D|θ)p(θ) dθ = p(D).  (36)
This also allows us to define an ELBO based on this “augmented model” p(D|θ)p(θ)p(s) and an
approximate posterior q(θ, s). In particular, if we define q(θ, s) = q(s) q_s(θ) with q(s) = p(s) = 1/S,
the resulting variational lower bound is

L = (1/S) Σ_{s=1}^S ELBO(q_s, D) ≤ log p(D).
Since the variational parameters (e.g., the mean & log variance of the mean-field Gaussian) for different
q_s(θ) are independent of each other, instead of training all S BNNs together with the
above variational lower bound L, one can simply train each of them independently using the ELBO
objective ELBO(q_s, D).
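A minimal sketch of this independent-training view: each member s is trained on its own ELBO(q_s, D) via a hypothetical train_member routine, and the members’ predictive distributions are averaged at test time.

    import numpy as np

    def ensemble_predict(train_member, predict_proba, data, x_star, S=5):
        members = [train_member(data, seed=s) for s in range(S)]   # S independent training runs
        probs = [predict_proba(q_s, x_star) for q_s in members]    # p(y*|x*, .) for each member
        return np.mean(probs, axis=0)                              # averaged predictive distribution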
1 https://izmailovpavel.github.io/neurips_bdl_competition/
References
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural
network. In International Conference on Machine Learning. PMLR.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model
uncertainty in deep learning. In International Conference on Machine Learning. PMLR.
Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep bayesian active learning with image data. In
International Conference on Machine Learning, pages 1183–1192. PMLR.
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search
for efficient global optimization of black-box functions. Advances in neural information processing
systems, 27.
Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for
classification and preference learning. arXiv preprint arXiv:1112.5745.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive un-
certainty estimation using deep ensembles. Advances in neural information processing systems,
30.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine
learning algorithms. In Advances in neural information processing systems.
Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the
bandit setting: No regret and experimental design. In Proceedings of the International Conference
on Machine Learning, 2010.