Variational Inference with Rényi Divergence
Presenter: D1 (first-year doctoral student)
¤ Posted to arXiv on 2/6
¤ Yingzhen Li and Richard E. Turner
¤ University of Cambridge
¤ First author Li is a D3 (third-year doctoral student); also authored "Stochastic Expectation Propagation" (NIPS)
¤ Generalises variational inference using the Rényi divergence
¤ Unifies the VAE and the importance weighted AE [Burda et al., 2015] in one framework
¤ Proofs are given in the Appendix
Background: variational inference (PRML)
¤ The standard derivation (cf. PRML, ch. 10): the log marginal likelihood decomposes into a lower bound plus a KL term,

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p),$$

¤ so maximising the lower bound $\mathcal{L}(q)$ is equivalent to minimising the KL divergence from $q$ to the posterior.
Related work
¤ Scaling up and black-boxing variational inference:
¤ SVI [Hoffman et al., 2013]
¤ SEP [Li et al., 2015]
¤ Black-box variational inference [Ranganath et al., 2014]
¤ Black-box alpha (BB-α) [Hernandez-Lobato et al., 2015]
¤ Tighter bounds for deep generative models:
¤ Importance weighted AE (IWAE) [Burda et al., 2015]: a tighter lower bound than the VAE's (ICLR 2016)
Variational inference
¤ Goal: approximate the intractable posterior $p(\theta|\mathcal{D})$
¤ with a tractable distribution $q(\theta)$
¤ by minimising the KL divergence $\mathrm{KL}[q\|p(\theta|\mathcal{D})]$, or equivalently maximising a lower bound on $\log p(\mathcal{D})$:
… principle literature [Grünwald, 2007].

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^N$ from a probabilistic model $p(x|\theta)$ parametrised by a random variable $\theta$ that is drawn from a prior $p_0(\theta)$. Bayesian inference involves computing the posterior distribution of the parameters given the data,

$$p(\theta|\mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta)\prod_{n=1}^N p(x_n|\theta)}{p(\mathcal{D})},$$

where $p(\mathcal{D}) = \int p_0(\theta)\prod_{n=1}^N p(x_n|\theta)\,d\theta$ is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation $q(\theta)$ to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family $\mathcal{Q}$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta)\,\|\,p(\theta|\mathcal{D})]. \tag{8}$$

However the KL divergence in (8) is also intractable, mainly because of the difficult term $p(\mathcal{D})$. Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}), \tag{9}$$

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; \mathcal{D})$ is defined by

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta)\,\|\,p(\theta|\mathcal{D})] = \mathbb{E}_q\!\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right]. \tag{10}$$
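Everything in identity (10) is tractable for a conjugate toy model, which makes it easy to sanity-check numerically. The following is a minimal sketch (mine, not from the paper; it assumes NumPy and SciPy) for a 1-D Gaussian model $\theta \sim \mathcal{N}(0,1)$, $x_n|\theta \sim \mathcal{N}(\theta,1)$, where posterior and evidence are available in closed form:

```python
# Numerical check of log p(D) = L_VI(q; D) + KL[q || p(theta|D)]
# for a conjugate 1-D Gaussian model (toy example, not from the paper).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Model: theta ~ N(0, 1), x_n | theta ~ N(theta, 1).
x = rng.normal(1.5, 1.0, size=20)
N = len(x)
post_var = 1.0 / (1.0 + N)        # exact posterior is Gaussian
post_mu = post_var * x.sum()

# A deliberately mis-matched approximation q(theta).
q_mu, q_sd = 0.5, 0.8
theta = rng.normal(q_mu, q_sd, size=50_000)   # theta_k ~ q

log_joint = (stats.norm(0, 1).logpdf(theta)
             + stats.norm(theta[:, None], 1).logpdf(x).sum(axis=1))
log_q = stats.norm(q_mu, q_sd).logpdf(theta)
log_post = stats.norm(post_mu, np.sqrt(post_var)).logpdf(theta)

elbo = np.mean(log_joint - log_q)   # E_q[log p(theta,D)/q(theta)]
kl = np.mean(log_q - log_post)      # KL[q || p(theta|D)]

# Exact evidence: x ~ N(0, I + 11^T) after marginalising theta.
exact = stats.multivariate_normal(np.zeros(N), np.eye(N) + 1.0).logpdf(x)
print(elbo + kl, "=", exact)        # equal for any q, up to MC noise
```

Because $\log p(\theta,\mathcal{D}) - \log p(\theta|\mathcal{D}) = \log p(\mathcal{D})$ holds pointwise, the sum elbo + kl matches the evidence for every choice of $q$; only the individual terms depend on $q$.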
VAE
¤ Variational auto-encoder [Kingma et al., 2014]
¤ A deep generative model whose variational approximation is parametrised by a recognition network
¤ Hierarchical latent variables $h^{(1)}, \dots, h^{(L)}$
¤ Trained by jointly optimising the generative model $p$ and the approximate posterior $q$:
Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

$$p(x) = \sum_{h^{(1)} \dots h^{(L)}} p(h^{(L)})\,p(h^{(L-1)}|h^{(L)})\cdots p(x|h^{(1)}). \tag{14}$$

Here we drop the parameters $\theta$ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of $\log p(x)$ requires marginalisation of all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

$$\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right], \tag{15}$$

where $h$ collects all the hidden variables $h^{(1)}, \dots, h^{(L)}$ and the approximate posterior $q(h|x)$ is defined as

$$q(h|x) = q(h^{(1)}|x)\,q(h^{(2)}|h^{(1)})\cdots q(h^{(L)}|h^{(L-1)}). \tag{16}$$

In variational EM, optimisation for $q$ and $p$ are alternated to guarantee convergence. However the core idea of VAE is to jointly optimise $p$ and $q$, which instead has no guarantee of increasing the MLE objective function in each iteration. Indeed jointly the method is biased [Turner and Sahani, 2011]. This explores the possibility that alternative surrogate functions might return estimates that are tighter bounds. So the VR bound is considered in this context:

$$\mathcal{L}_\alpha(q; x) = \frac{1}{1-\alpha}\log \mathbb{E}_q\!\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]. \tag{17}$$
$$p(x|\theta) = \sum_{h^1,\dots,h^L} p(h^L|\theta)\,p(h^{L-1}|h^L, \theta)\cdots p(x|h^1, \theta). \tag{1}$$

Here, $\theta$ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \dots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on $\theta$ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell|h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell|h^{\ell+1})$. Note that $L$ denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model $q(h|x)$ is defined in terms of an analogous factorization:

$$q(h|x) = q(h^1|x)\,q(h^2|h^1)\cdots q(h^L|h^{L-1}), \tag{2}$$

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell|h^{\ell+1})$ and $q(h^\ell|h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x|h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's Inequality:

$$\log p(x) = \log \mathbb{E}_{q(h|x)}\!\left[\frac{p(x, h)}{q(h|x)}\right] \geq \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right] = \mathcal{L}(x). \tag{3}$$

Since $\mathcal{L}(x) = \log p(x) - D_{KL}(q(h|x)\,\|\,p(h|x))$, the training procedure is forced to trade off increasing the log-likelihood against decreasing the KL divergence from the true posterior.
VAE: reparameterization trick
¤ Gradients of the bound are estimated with the reparameterization trick
¤ Samples from the recognition model are written as deterministic functions of the input and auxiliary noise
¤ The gradient operator can then be pushed inside the expectation:
… introduced a reparameterization of the recognition distribution in terms of fixed distributions, such that the samples from the recognition model are deterministic functions of the inputs and auxiliary variables. While they presented the reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians, as used in this work. (The general reparameterization trick can be used with our method as well.) The recognition distribution $q(h^\ell|h^{\ell-1}, \theta)$ always takes the form of a Gaussian whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable $\epsilon^\ell \sim \mathcal{N}(0, I)$, and then applying the deterministic mapping

$$h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2}\,\epsilon^\ell + \mu(h^{\ell-1}, \theta). \tag{4}$$

The joint recognition distribution $q(h|x, \theta)$ over all latent variables can be expressed in terms of a deterministic mapping $h(\epsilon, x, \theta)$, with $\epsilon = (\epsilon^1, \dots, \epsilon^L)$, by applying Eqn. 4 for each layer in sequence. Since the distribution of $\epsilon$ does not depend on $\theta$, we can reformulate the gradient of the bound $\mathcal{L}(x)$ from Eqn. 3 by pushing the gradient operator inside the expectation:

$$\nabla_\theta \log \mathbb{E}_{h \sim q(h|x,\theta)}\!\left[\frac{p(x, h|\theta)}{q(h|x, \theta)}\right] = \nabla_\theta \mathbb{E}_{\epsilon^1,\dots,\epsilon^L \sim \mathcal{N}(0,I)}\!\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right] \tag{5}$$

$$= \mathbb{E}_{\epsilon^1,\dots,\epsilon^L \sim \mathcal{N}(0,I)}\!\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]. \tag{6}$$

Assuming the mapping $h$ is represented as a deterministic feed-forward neural network, for a fixed $\epsilon$, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating $k$ samples of $\epsilon$ and applying the Monte Carlo estimator

$$\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta) \tag{7}$$

with $w(x, h, \theta) = p(x, h|\theta)/q(h|x, \theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$. We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the log-likelihood gradients with respect to the latent variables.
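In the simplest case, this machinery reduces to a few lines. Below is a toy sketch (assuming NumPy; the quadratic $f$ and 1-D Gaussian are illustrative choices of mine, not from the paper) of the estimator in Eqns. 6-7 for $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$, whose exact value is $2\mu$:

```python
# Reparameterization-trick gradient (Eqns. 4-7) for a 1-D toy problem:
# z = mu + sigma * eps with eps ~ N(0, 1), f(z) = z^2, so that
# d/dmu E[f(z)] = d/dmu (mu^2 + sigma^2) = 2*mu exactly.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, k = 1.3, 0.7, 10_000

eps = rng.standard_normal(k)     # auxiliary noise; its law ignores mu
z = mu + sigma * eps             # deterministic mapping h(eps, theta)
grad = (2.0 * z).mean()          # E[df/dz * dz/dmu], with dz/dmu = 1

print(grad, "vs exact", 2 * mu)  # unbiased Monte Carlo estimate
```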
VAE: the SGVB estimator
¤ Practical estimator of the VAE lower bound [Kingma and Welling, 2014]
¤ Reparameterise $z \sim q_\phi(z|x)$ and estimate the bound by Monte Carlo
¤ When the KL term of the VAE bound is analytic (e.g. Gaussian $q$ and prior), only the reconstruction term needs sampling:
In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form $q_\phi(z|x)$, but please note that the technique can be applied to the case $q_\phi(z)$, i.e. where we do not condition on $x$, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.

Under certain mild conditions outlined in section 2.4 for a chosen approximate posterior $q_\phi(z|x)$ we can reparameterize the random variable $\tilde{z} \sim q_\phi(z|x)$ using a differentiable transformation $g_\phi(\epsilon, x)$ of an (auxiliary) noise variable $\epsilon$:

$$\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \tag{4}$$

See section 2.4 for general strategies for choosing such an appropriate distribution $p(\epsilon)$ and function $g_\phi(\epsilon, x)$. We can now form Monte Carlo estimates of expectations of some function $f(z)$ w.r.t. $q_\phi(z|x)$ as follows:

$$\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[f(g_\phi(\epsilon, x^{(i)}))\right] \simeq \frac{1}{L}\sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \tag{5}$$

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator $\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$:

$$\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L}\sum_{l=1}^{L}\left[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})\right], \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}),\ \epsilon^{(l)} \sim p(\epsilon) \tag{6}$$

Algorithm 1 (minibatch AEVB, as fragmentarily shown in the excerpt):
repeat
  $X^M \leftarrow$ random minibatch of $M$ datapoints; $\epsilon \leftarrow$ random samples from $p(\epsilon)$
  $g \leftarrow \nabla_{\theta,\phi}\tilde{\mathcal{L}}^M(\theta, \phi; X^M, \epsilon)$ (gradients of minibatch estimator (8))
  $\theta, \phi \leftarrow$ update parameters using gradients $g$ (e.g. SGD or Adagrad [DHS10])
until convergence of parameters $(\theta, \phi)$
return $\theta, \phi$

Often, the KL-divergence $D_{KL}(q_\phi(z|x^{(i)})\,\|\,p_\theta(z))$ of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error $\mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)]$ requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing $\phi$, encouraging the approximate posterior to be close to the prior $p_\theta(z)$. This yields a second version of the SGVB estimator $\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$, corresponding to eq. (3), which typically has less variance than the generic estimator:

$$\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)})\,\|\,p_\theta(z)) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x^{(i)}|z^{(i,l)}), \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}),\ \epsilon^{(l)} \sim p(\epsilon) \tag{7}$$

Given multiple datapoints from a dataset $X$ with $N$ datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M}\sum_{i=1}^{M}\tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \tag{8}$$

where the minibatch $X^M = \{x^{(i)}\}_{i=1}^M$ is a randomly drawn sample of $M$ datapoints from the full dataset $X$ with $N$ datapoints. In our experiments we found that the number of samples $L$ per datapoint can be set to 1 as long as the minibatch size $M$ was large enough, e.g. $M = 100$. Derivatives $\nabla_{\theta,\phi}\tilde{\mathcal{L}}(\theta; X^M)$ can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function $g_\phi(\cdot)$ is chosen such that it maps a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$ to a sample from the approximate posterior.
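As a concrete sketch of estimator $\tilde{\mathcal{L}}^B$ in eq. (7), under my own assumptions (diagonal-Gaussian encoder, standard normal prior; `decoder_logpdf` is a hypothetical stand-in for $\log p_\theta(x|z)$):

```python
# Sketch of the SGVB estimator ~L^B (eq. (7)): analytic Gaussian KL plus
# a sampled reconstruction term. Not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def sgvb_b(x, enc_mu, enc_logvar, decoder_logpdf, L=1):
    # Analytic KL[q_phi(z|x) || N(0, I)] for a diagonal Gaussian
    # (Kingma & Welling, appendix B).
    kl = -0.5 * np.sum(1 + enc_logvar - enc_mu**2 - np.exp(enc_logvar))
    # Reparameterized samples z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal((L, enc_mu.size))
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps
    recon = np.mean([decoder_logpdf(x, z_l) for z_l in z])
    return recon - kl   # lower bound on log p(x)
```

Using the analytic KL (eq. (7)) instead of sampling it (eq. (6)) is exactly the variance reduction the text describes.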
Importance weighted AE (IWAE)
¤ Same architecture as the VAE: a generative network plus a recognition network
¤ Trained on a different, tighter lower bound: the k-sample importance weighted estimate of the log-likelihood
¤ The recognition network only needs to put some of its samples in the high-posterior region
¤ k = 1 recovers the VAE bound
¤ The bound is non-decreasing in k and converges to log p(x):
… the true posterior distribution must be approximately factorial and predictable with a feed-forward neural network. The VAE criterion may be too strict; a recognition network which places only a small fraction (20%, say) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

The IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on $\log p(x)$. In particular, we use the following lower bound, corresponding to the k-sample importance weighting estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1,\dots,h_k \sim q(h|x)}\!\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x, h_i)}{q(h_i|x)}\right].$$

Here $h_1, \dots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote $w_i = p(x, h_i)/q(h_i|x)$.

This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the fact that the average importance weights are an unbiased estimator of $p(x)$:

$$\mathcal{L}_k = \mathbb{E}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] \leq \log \mathbb{E}\!\left[\frac{1}{k}\sum_{i=1}^{k} w_i\right] = \log p(x).$$

Theorem 1. For all $k$, the lower bounds satisfy $\log p(x) \geq \mathcal{L}_{k+1} \geq \mathcal{L}_k$. Moreover, if $p(h, x)/q(h|x)$ is bounded, then $\mathcal{L}_k$ approaches $\log p(x)$ as $k$ goes to infinity. (Proof: see Appendix A.)
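A sketch of the k-sample bound $\mathcal{L}_k(x)$ (assuming NumPy/SciPy; `sample_q`, `log_joint` and `log_q` are hypothetical callables standing in for the recognition and generative networks):

```python
# Sketch of the IWAE bound L_k(x): average of importance weights
# w_i = p(x, h_i)/q(h_i|x) inside the log, computed stably in log-space.
import numpy as np
from scipy.special import logsumexp

def iwae_bound(x, sample_q, log_joint, log_q, k=50):
    h = [sample_q(x) for _ in range(k)]            # h_i ~ q(h|x), i.i.d.
    log_w = np.array([log_joint(x, hi) - log_q(hi, x) for hi in h])
    return logsumexp(log_w) - np.log(k)            # log (1/k) sum_i w_i

# k = 1 recovers the standard VAE bound; Theorem 1 says the bound is
# non-decreasing in k and approaches log p(x) when p/q is bounded.
```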
Rényi's α-divergence
¤ A divergence family between two distributions p and q
¤ Parametrised by α: defined for α > 0, α ≠ 1 (other values by continuity)
¤ α → 1 recovers the KL divergence
¤ α = 1/2 is related to the (squared) Hellinger distance:
Rényi's α-divergence is defined on two distributions $p$ and $q$ on a random variable $\theta \in \Theta$:

$$D_\alpha[p\|q] = \frac{1}{\alpha - 1}\log \int p(\theta)^\alpha\, q(\theta)^{1-\alpha}\, d\theta. \tag{1}$$

For $\alpha > 1$ the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When $\alpha \to 1$ it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

$$D_1[p\|q] = \lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(\theta)\log\frac{p(\theta)}{q(\theta)}\,d\theta = \mathrm{KL}[p\|q].$$

Similar to $\alpha = 1$, for values $\alpha = 0, +\infty$ the Rényi divergence is defined by continuity in $\alpha$:

$$D_0[p\|q] = -\log \int_{p(\theta) > 0} q(\theta)\,d\theta, \qquad D_{+\infty}[p\|q] = \log \max_{\theta \in \Theta}\frac{p(\theta)}{q(\theta)}.$$

Another special case is $\alpha = \frac{1}{2}$, where the corresponding Rényi divergence is a function of the squared Hellinger distance $\mathrm{Hel}^2[p\|q] = \frac{1}{2}\int(\sqrt{p(\theta)} - \sqrt{q(\theta)})^2\,d\theta$:

$$D_{\frac{1}{2}}[p\|q] = -2\log(1 - \mathrm{Hel}^2[p\|q]).$$

In [van Erven and Harremoës, 2014] the definition (1) is also extended to negative $\alpha$ values, although it is non-positive and is thus no longer a valid divergence measure. The proposed method also makes use of these negative values.
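The special cases are easy to verify numerically for 1-D Gaussians; below is a small sketch of mine (assuming SciPy) checking the α → 1 limit against KL and the α = 1/2 Hellinger identity:

```python
# Numerical sketch of the Renyi alpha-divergence (eq. (1)) between two
# 1-D Gaussians; checks alpha -> 1 vs. KL and the alpha = 1/2 identity.
import numpy as np
from scipy import stats
from scipy.integrate import quad

p, q = stats.norm(0.0, 1.0), stats.norm(1.0, 1.5)

def renyi(alpha):
    integrand = lambda t: p.pdf(t)**alpha * q.pdf(t)**(1 - alpha)
    val, _ = quad(integrand, -20, 20)
    return np.log(val) / (alpha - 1)

kl, _ = quad(lambda t: p.pdf(t) * (p.logpdf(t) - q.logpdf(t)), -20, 20)
hel2, _ = quad(lambda t: 0.5 * (np.sqrt(p.pdf(t)) - np.sqrt(q.pdf(t)))**2,
               -20, 20)

print(renyi(0.999), "~", kl)                   # D_alpha -> KL as alpha -> 1
print(renyi(0.5), "~", -2 * np.log(1 - hel2))  # D_{1/2} identity
```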
The variational Rényi (VR) bound
¤ Approximate $p(\theta|\mathcal{D})$ with $q(\theta)$, replacing the KL divergence of VI by Rényi's α-divergence
¤ The same evidence-minus-divergence decomposition applies
¤ For α ≠ 1 the objective can be rewritten without the intractable $p(\mathcal{D})$:
Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Can traditional variational free-energy approaches be generalised to the Rényi case? Consider approximating the posterior $p(\theta|\mathcal{D})$ by minimizing Rényi's α-divergence for some selected $\alpha \geq 0$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\,\|\,p(\theta|\mathcal{D})]. \tag{11}$$

It is easy to verify the alternative optimization problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}}\left\{\log p(\mathcal{D}) - D_\alpha[q(\theta)\,\|\,p(\theta|\mathcal{D})]\right\}. \tag{12}$$

When $\alpha \neq 1$, the objective can be rewritten as

$$\log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \int q(\theta)^\alpha\, p(\theta|\mathcal{D})^{1-\alpha}\, d\theta = \log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\,p(\mathcal{D})}\right)^{1-\alpha}\right] = \frac{1}{1-\alpha}\log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}). \tag{13}$$

We name this new objective the variational Rényi bound (VR). Importantly the following theorem …
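The excerpt cuts off before the theorem, but the key continuity property, that α → 1 recovers the ELBO, follows from a short limit (my derivation, not quoted from the paper):

$$\lim_{\alpha \to 1}\mathcal{L}_\alpha(q;\mathcal{D}) = \lim_{t \to 0}\frac{1}{t}\log \mathbb{E}_q\!\left[e^{\,t\log\frac{p(\theta,\mathcal{D})}{q(\theta)}}\right] = \mathbb{E}_q\!\left[\log\frac{p(\theta,\mathcal{D})}{q(\theta)}\right] = \mathcal{L}_{VI}(q;\mathcal{D}),$$

where $t = 1-\alpha$ and the middle step uses the cumulant expansion $\log \mathbb{E}[e^{tX}] = t\,\mathbb{E}[X] + O(t^2)$.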
Optimising the VR bound
¤ The VR bound contains an expectation inside a logarithm, so it is approximated with Monte Carlo
¤ Unlike the ELBO estimate, the Monte Carlo estimate is biased
¤ The bias can be characterised and bounded (Theorem 2 below):
The VR Bound Optimisation Framework

Variational free-energy methods sidestep intractabilities in a class of intractable models. Recent work uses further approximations based on Monte Carlo to expand the set of models that can be handled, so that they can be deployed on the same model class as Monte Carlo variational methods, but which have limited scope if Monte Carlo methods are not resorted to. This section develops a scalable optimisation framework for the VR bound by extending the recent advances of traditional VI. Black-box methods are discussed to enable its application to arbitrary finite α settings.

Monte Carlo Estimation of the VR Bound

We propose a simple Monte Carlo method that uses finite samples $\theta_k \sim q(\theta)$, $k = 1, \dots, K$ to approximate $\mathcal{L}_\alpha$:

$$\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1-\alpha}\log \frac{1}{K}\sum_{k=1}^{K}\left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right].$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over $q(\theta)$ sits inside the logarithm. However we can bound the bias by the following theorems proved in the supplementary.

Theorem 2. $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})]$ as a function of $\alpha$ and $K$ is: 1) non-decreasing in $K$ for fixed $\alpha \leq 1$, and the limiting result is $\mathcal{L}_\alpha$ for $K \to +\infty$ if $|p/q|$ is bounded; 2) continuous and non-increasing in $\alpha$ with fixed $K$ on $(-\infty, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}$.

Corollary 1. For $K < +\infty$, there exists $\alpha_K < 0$ such that for all $\alpha \leq \alpha_K$, $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})] \leq \log p(\mathcal{D})$. Furthermore $\alpha_K$ is non-decreasing in $K$, with $\lim_{K \to 1} \alpha_K = -\infty$ and $\lim_{K \to +\infty} \alpha_K = 0$.

Figure 2: (a) An illustration for the bounding properties of sampling approximations to the VR bounds. Here $\alpha_2 < 0 < \alpha_1 < 1$ and $1 < K_1 < K_2 < +\infty$. (b) The bias of sampling estimate of (negative) alpha divergence. In this example p, q are 2-D Gaussian distributions with identity covariance matrix, where the only difference is $\mu_p = [0, 0]$ and $\mu_q = [1, 1]$. Best viewed in colour.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower-bound or upper-bound of the log-likelihood $\log p(\mathcal{D})$ when $\alpha > 0$ or $\alpha < 0$, respectively (red lines). However for $\alpha \leq 1$ the sampling approximation $\hat{\mathcal{L}}_{\alpha,K}$ in expectation under-estimates the exact VR bound $\mathcal{L}_\alpha$ (blue dashed lines), where the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values ($\alpha_2 < 0$) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising $\hat{\mathcal{L}}_{\alpha_1,K_1}$ and $\hat{\mathcal{L}}_{\alpha_2,K_1}$, respectively). We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte Carlo approximations.
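A minimal sketch of the estimator $\hat{\mathcal{L}}_{\alpha,K}$ (assuming NumPy/SciPy), working in log-space for stability; α = 0 reproduces the IWAE objective and α → 1 falls back to the plain ELBO average:

```python
# Sketch of the Monte Carlo VR-bound estimate \hat{L}_{alpha,K}.
import numpy as np
from scipy.special import logsumexp

def vr_bound_hat(log_w, alpha):
    """log_w: array of K values log[p(theta_k, D)/q(theta_k)], theta_k ~ q."""
    K = len(log_w)
    if np.isclose(alpha, 1.0):              # limit case: plain ELBO average
        return log_w.mean()
    s = (1.0 - alpha) * log_w               # (1-alpha) * log w_k
    return (logsumexp(s) - np.log(K)) / (1.0 - alpha)

# Theorem 2: for alpha <= 1 this estimate in expectation under-estimates
# L_alpha; the bias shrinks as K grows, and negative alpha can offset it.
```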
Sampling approximation of the VR bound
¤ Exact bound vs. sampling approximation (Figure 2(a))
¤ For α ≤ 1 the sampling estimate in expectation under-estimates the exact bound
¤ Increasing the number of samples K reduces the bias
¤ With finite K, setting α = 0 gives exactly the IWAE objective
¤ Negative α can compensate for the finite-sample bias
VR-max
¤ Optimisation uses the reparameterization trick
¤ The gradient is an importance-weighted average of per-sample gradients
¤ α = 1 recovers the VAE (ELBO) gradient
¤ As α → −∞, all the weight concentrates on a single sample:
¤ the one with the largest importance weight
¤ VR-max: back-propagate only that sample
… to reduce the clutter of notations. Now we apply the reparameterization trick to the VR bound:

$$\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1-\alpha}\log \mathbb{E}_\epsilon\!\left[\left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right]. \tag{19}$$

Then the gradient of the VR bound w.r.t. $\phi$ is

$$\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_\epsilon\!\left[w_\alpha(\epsilon; \phi, \mathcal{D})\,\nabla_\phi \log \frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right], \tag{20}$$

where $w_\alpha(\epsilon; \phi, \mathcal{D}) \propto \left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}$ denotes the normalised importance weight. For finite samples $\epsilon_k \sim p(\epsilon)$, $k = 1, \dots, K$ the gradient is approximated by

$$\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; \mathcal{D}) = \frac{1}{K}\sum_{k=1}^{K}\left[\hat{w}_{\alpha,k}\,\nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}\right], \tag{21}$$

with $\hat{w}_{\alpha,k}$ short-hand for $\hat{w}_\alpha(\epsilon_k; \phi, \mathcal{D})$, the normalised importance weight with finite samples. One can show that it recovers the stochastic gradients of $\mathcal{L}_{VI}$ by setting $\alpha = 1$ in (21):

$$\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K}\sum_{k=1}^{K}\nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \tag{22}$$

which means the resulting algorithm unifies the computation for all finite $\alpha$ settings. To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample $\epsilon_j$, chosen with probability proportional to its importance weight; this motivates the sampling step below.

Algorithm 1: one gradient step for VR-α/VR-max
1: sample $\epsilon_1, \dots, \epsilon_K \sim p(\epsilon)$
2: for all $k = 1, \dots, K$, and $n \in S$ the current minibatch, compute the unnormalised log-weight:
   $\log \hat{w}(\epsilon_k; x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k)|x_n)$
3: choose the sample $\epsilon_{j_n}$ to back-propagate:
   if $|\alpha| < \infty$: $j_n \sim p_k$ where $p_k \propto \hat{w}(\epsilon_k; x_n)^{1-\alpha}$
   if $\alpha = -\infty$: $j_n = \arg\max_k \log \hat{w}(\epsilon_k; x_n)$
4: return the gradients $\frac{1}{|S|}\sum_{n \in S}\nabla_\phi \log \hat{w}(\epsilon_{j_n}; x_n)$ to the optimiser
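Step 3 of Algorithm 1 is the only part that differs across α; a sketch (assuming NumPy) of the sample-selection rule:

```python
# Sketch of Algorithm 1, step 3: pick the single sample epsilon_{j_n}
# whose gradient gets back-propagated. Finite alpha samples j_n with
# probability proportional to w^{1-alpha}; alpha = -inf (VR-max) takes
# the arg-max weight.
import numpy as np

def choose_backprop_sample(log_w, alpha, rng=np.random.default_rng()):
    """log_w: unnormalised log-weights log w(eps_k; x_n), k = 1..K."""
    if np.isneginf(alpha):                  # VR-max
        return int(np.argmax(log_w))
    s = (1.0 - alpha) * log_w
    p = np.exp(s - s.max())                 # stable normalisation
    p /= p.sum()                            # p_k proportional to w_k^{1-alpha}
    return int(rng.choice(len(log_w), p=p))
```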
Stochastic approximation for large-scale learning
¤ The appendix of [Li et al., 2015] discussed stochastic EP for approximating the VR bound optimisation
¤ Here: a stochastic approximation that enables minibatch training and applies directly to the VR bound
¤ Sample a minibatch of M datapoints and replace the full-data likelihood by a "subset average likelihood"
¤ Increasing M reduces the bias; black-box alpha (BB-α) can be seen as a related stochastic approximation of the VR bound
… or VAEs. Note that VR-max does not compute the minimum description length principle (MDL), since MDL approximates the true posterior by minimising the exact VR bound $\mathcal{L}_{-\infty}$ that upper-bounds the exact log-likelihood function.

Stochastic Approximation for Large-scale Learning

So far we discussed the VR bounds computed on the whole dataset $\mathcal{D}$. However for large datasets full batch learning will be very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way to approximating the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method to enable minibatch training, which directly applies to the VR bound.

Using the notation $f_n(\theta) = p(x_n|\theta)$ and defining the "average likelihood" $\bar{f}_\mathcal{D}(\theta) = [\prod_{n=1}^N f_n(\theta)]^{\frac{1}{N}}$, the joint distribution can be rewritten as $p(\theta, \mathcal{D}) = p_0(\theta)\,\bar{f}_\mathcal{D}(\theta)^N$. Now we sample $M$ datapoints $S = \{x_{n_1}, \dots, x_{n_M}\} \sim \mathcal{D}$ and define the corresponding "subset average likelihood" $\bar{f}_S(\theta) = [\prod_{m=1}^M f_{n_m}(\theta)]^{\frac{1}{M}}$. When $M = 1$ we also write $\bar{f}_S(\theta) = f_n(\theta)$ for $S = \{x_n\}$. Then we approximate the VR bound (13) by replacing $\bar{f}_\mathcal{D}(\theta)$ with $\bar{f}_S(\theta)$:

$$\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1-\alpha}\log \int q(\theta)^\alpha\left[p_0(\theta)\,\bar{f}_S(\theta)^N\right]^{1-\alpha} d\theta = \frac{1}{1-\alpha}\log \mathbb{E}_q\!\left[\left(\frac{p_0(\theta)\,\bar{f}_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right]. \tag{23}$$

This returns a stochastic estimate of the evidence lower-bound when $\alpha \to 1$. For other $\alpha \neq 1$ settings, increasing the size of the minibatch $M = |S|$ reduces the bias of approximation. This is guaranteed by the following theorem proved in the supplementary.

Theorem 3. If the approximate distribution $q(\theta)$ is Gaussian $\mathcal{N}(\mu, \Sigma)$, and the likelihood functions have an exponential family form $p(x|\theta) = \exp[\langle\theta, \Phi(x)\rangle - A(\theta)]$, then for $\alpha \leq 1$ the stochastic approximation is bounded by …
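A sketch of the minibatch surrogate in eq. (23) (assuming NumPy; `log_prior` and `log_lik` are hypothetical callables): the subset average likelihood $\bar{f}_S$, raised to the power $N$, simply rescales the minibatch log-likelihood:

```python
# Sketch of the minibatch rescaled log-joint used in eq. (23).
import numpy as np

def log_joint_minibatch(theta, log_prior, log_lik, minibatch, N):
    """log p_0(theta) + N * mean_n log p(x_n|theta) over the minibatch S."""
    avg_ll = np.mean([log_lik(x, theta) for x in minibatch])
    return log_prior(theta) + N * avg_ll   # log [p_0(theta) * f_S(theta)^N]

# Feeding these log-joints into vr_bound_hat above yields L~_alpha(q; S);
# the bias vanishes as alpha -> 1 and shrinks as the minibatch size M grows.
```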
Experiments: variational auto-encoders
¤ Train deep generative models with the VR bound and compare objectives
¤ Three datasets: Frey Face, Caltech 101 Silhouettes, MNIST
¤ VAE (α = 1)
¤ IWAE (α = 0)
¤ VR-max (α = −∞)
¤ Test log-likelihood estimated with α = 0, K = 5000
¤ VR-max is almost indistinguishable from IWAE
¤ VR-max is much cheaper: only one backward pass per datapoint
¤ Training on MNIST: VR-max 25hr29min vs. IWAE 61hr16min
The method was implemented upon the publicly available code. Note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight.

Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The first is the relatively small Frey Face dataset, while the other two are larger. The network consists of L = 1 or 2 stochastic layers with deterministic layers in between; the network architecture is detailed in the supplementary. For MNIST the settings from [Burda et al., 2015] were used, including the number of epochs; for the other two datasets the settings follow the VI setting. The IWAE experiments were reproduced because the results included in [Burda et al., 2015] mismatch those obtained with the public code.

Test log-likelihoods (Table 1) are computed as $\log p(x) \approx \hat{\mathcal{L}}_{\alpha,K}(q; x)$ with $\alpha = 0.0$. Samples from the VR-max trained models (Figure 3) are almost indistinguishable from IWAE's on all three datasets. VR-max also requires much less time to run than IWAE with a full backward pass: on a Tesla C2075 GPU, training on MNIST took 25hr29min for VR-max versus 61hr16min for IWAE. A single-backward-pass version of IWAE was also implemented; its best result is -85.02, which is slightly worse, supporting the argument in Section 4.1 that negative α can be advantageous when computational resources are limited.

The α value corresponding to the tightest VR bound becomes more negative as the mismatch between q and the true posterior increases, which is the case when q is fitted to approximate the typically multimodal exact posterior.
Figure 3: Sampled images from the VR-max trained auto-encoders. (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST.

Table 1: Average test log-likelihood. Results for VAE on MNIST are collected from [Burda et al., 2015]. IWAE results are reproduced using the publicly available implementation.

Dataset                   L   K    VAE       IWAE      VR-max
Frey Face                 1   5    1322.96   1380.30   1377.40
  (± std. err.)                    ±10.03    ±4.60     ±4.59
Caltech 101 Silhouettes   1   5    -119.69   -117.89   -118.01
                          1   50   -119.61   -117.21   -117.10
MNIST                     1   5    -86.47    -85.41    -85.42
                          1   50   -86.35    -84.80    -84.81
                          2   5    -85.01    -83.92    -84.04
                          2   50   -84.78    -83.12    -83.44
Generated samples
¤ Figure 3 (above): sampled images from the VR-max trained auto-encoders for Frey Face, Caltech 101 Silhouettes, and MNIST
¤ VR-max samples are visually comparable to IWAE's
¤ Quantitative results: see Table 1 above
Experiments: Bayesian neural network regression
¤ Regression on UCI datasets with Bayesian neural networks trained via the VR bound
¤ Baselines: VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
¤ BB-α=BO* results are shown for reference (not directly comparable)
Table 2: Average test log-likelihood. BB-α=BO results are not directly comparable and are available only for small datasets.

Dataset    VI             PBP            BB-α=BO*       VR-0.5         VR-0.0         VR-max
Boston     -2.903±0.071   -2.574±0.089   -2.549±0.019   -2.457±0.066   -2.468±0.071   -2.469±0.072
Concrete   -3.391±0.017   -3.161±0.019   -3.104±0.015   -3.094±0.016   -3.076±0.018   -3.092±0.018
Energy     -2.391±0.029   -2.042±0.019   -0.945±0.012   -1.401±0.029   -1.418±0.020   -1.389±0.018
Wine       -0.980±0.013   -0.968±0.014   -0.949±0.009   -0.948±0.011   -0.952±0.012   -0.949±0.012
Yacht      -3.439±0.163   -1.634±0.016   -1.102±0.039   -1.816±0.011   -1.829±0.014   -1.817±0.013
Protein    -2.992±0.006   -2.973±0.003   NA±NA          -2.923±0.006   -2.911±0.005   -2.938±0.005
Year       -3.622±NA      -3.603±NA      NA±NA          -3.545±NA      -3.550±NA      -3.542±NA
Table 3: Average test error. BB-α=BO results are not directly comparable and are available only for small datasets.

Dataset    VI             PBP            BB-α=BO*       VR-0.5         VR-0.0         VR-max
Boston     4.320±0.291    3.104±0.180    3.160±0.109    2.853±0.154    2.852±0.169    2.837±0.181
Concrete   7.128±0.123    5.667±0.093    5.374±0.074    5.343±0.102    5.237±0.114    5.280±0.104
Energy     2.646±0.081    1.804±0.048    0.600±0.018    0.807±0.059    0.883±0.050    0.791±0.041
Wine       0.646±0.008    0.635±0.007    0.632±0.005    0.640±0.009    0.638±0.008    0.639±0.009
Yacht      6.887±0.674    1.015±0.054    0.902±0.051    1.111±0.082    1.239±0.109    1.117±0.085
Protein    4.842±0.003    4.732±0.013    NA±NA          4.505±0.033    4.436±0.030    4.574±0.023
Year       9.034±NA       8.879±NA       NA±NA          8.942±NA       9.133±NA       8.949±NA
Conclusion
¤ A new family of variational bounds based on the Rényi divergence
¤ VI/VB, EP, BB-α, VAE, IWAE, and VR-max are connected as special cases of the framework (different α and approximations)
¤ Negative α can improve finite-sample approximation accuracy when the sampling budget is limited
More Related Content

What's hot (20)

PDF
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
MLHEP 2015: Introductory Lecture #1
arogozhnikov
 
PPT
Chapter 24 aoa
Hanif Durad
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
Delayed acceptance for Metropolis-Hastings algorithms
Christian Robert
 
PDF
DS-MLR: Scaling Multinomial Logistic Regression via Hybrid Parallelism
Parameswaran Raman
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
Unbiased Bayes for Big Data
Christian Robert
 
PDF
06 recurrent neural_networks
Andres Mendez-Vazquez
 
PDF
Intro to Classification: Logistic Regression & SVM
NYC Predictive Analytics
 
PDF
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
PDF
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
26 Machine Learning Unsupervised Fuzzy C-Means
Andres Mendez-Vazquez
 
PDF
18.1 combining models
Andres Mendez-Vazquez
 
PPT
Chapter 26 aoa
Hanif Durad
 
PPT
Chapter 25 aoa
Hanif Durad
 
PPT
Chapter 23 aoa
Hanif Durad
 
PDF
[DL輪読会]Generative Models of Visually Grounded Imagination
Deep Learning JP
 
PPT
Q-Metrics in Theory And Practice
guest3550292
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
MLHEP 2015: Introductory Lecture #1
arogozhnikov
 
Chapter 24 aoa
Hanif Durad
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Delayed acceptance for Metropolis-Hastings algorithms
Christian Robert
 
DS-MLR: Scaling Multinomial Logistic Regression via Hybrid Parallelism
Parameswaran Raman
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Unbiased Bayes for Big Data
Christian Robert
 
06 recurrent neural_networks
Andres Mendez-Vazquez
 
Intro to Classification: Logistic Regression & SVM
NYC Predictive Analytics
 
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
The Statistical and Applied Mathematical Sciences Institute
 
26 Machine Learning Unsupervised Fuzzy C-Means
Andres Mendez-Vazquez
 
18.1 combining models
Andres Mendez-Vazquez
 
Chapter 26 aoa
Hanif Durad
 
Chapter 25 aoa
Hanif Durad
 
Chapter 23 aoa
Hanif Durad
 
[DL輪読会]Generative Models of Visually Grounded Imagination
Deep Learning JP
 
Q-Metrics in Theory And Practice
guest3550292
 

Viewers also liked (20)

PDF
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
Masahiro Suzuki
 
PDF
深層生成モデルを用いたマルチモーダル学習
Masahiro Suzuki
 
PDF
(DL輪読)Matching Networks for One Shot Learning
Masahiro Suzuki
 
PDF
(DL hacks輪読) Difference Target Propagation
Masahiro Suzuki
 
PDF
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
Masahiro Suzuki
 
PDF
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
Masahiro Suzuki
 
PDF
(DL Hacks輪読) How transferable are features in deep neural networks?
Masahiro Suzuki
 
PDF
(DL hacks輪読) Deep Kalman Filters
Masahiro Suzuki
 
PDF
[Dl輪読会]dl hacks輪読
Deep Learning JP
 
PDF
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
Ohsawa Goodfellow
 
PDF
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
Kaoru Nasuno
 
PDF
Deep learning勉強会20121214ochi
Ohsawa Goodfellow
 
PPTX
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
kurotaki_weblab
 
ODP
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Yukiyoshi Sasao
 
PPTX
論文輪読資料「Gated Feedback Recurrent Neural Networks」
kurotaki_weblab
 
PDF
Deep Learning を実装する
Shuhei Iitsuka
 
PPTX
JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7
Kotaro Nakayama
 
PDF
Deep Learning 勉強会 (Chapter 7-12)
Ohsawa Goodfellow
 
PDF
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Ohsawa Goodfellow
 
PDF
生成モデルの Deep Learning
Seiya Tokui
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
Masahiro Suzuki
 
深層生成モデルを用いたマルチモーダル学習
Masahiro Suzuki
 
(DL輪読)Matching Networks for One Shot Learning
Masahiro Suzuki
 
(DL hacks輪読) Difference Target Propagation
Masahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
Masahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
Masahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
Masahiro Suzuki
 
(DL hacks輪読) Deep Kalman Filters
Masahiro Suzuki
 
[Dl輪読会]dl hacks輪読
Deep Learning JP
 
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
Ohsawa Goodfellow
 
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
Kaoru Nasuno
 
Deep learning勉強会20121214ochi
Ohsawa Goodfellow
 
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
kurotaki_weblab
 
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Yukiyoshi Sasao
 
論文輪読資料「Gated Feedback Recurrent Neural Networks」
kurotaki_weblab
 
Deep Learning を実装する
Shuhei Iitsuka
 
JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7
Kotaro Nakayama
 
Deep Learning 勉強会 (Chapter 7-12)
Ohsawa Goodfellow
 
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Ohsawa Goodfellow
 
生成モデルの Deep Learning
Seiya Tokui
 

Similar to (DL hacks輪読) Variational Inference with Rényi Divergence (20)

PDF
Introduction to modern Variational Inference.
Tomasz Kusmierczyk
 
PDF
Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18
Olga Zinkevych
 
PDF
is anyone_interest_in_auto-encoding_variational-bayes
NAVER Engineering
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
Deep VI with_beta_likelihood
Natan Katz
 
PPTX
PRML Chapter 10
Sunwoo Kim
 
PDF
Explicit Density Models
Sangwoo Mo
 
PDF
"Automatic Variational Inference in Stan" NIPS2015_yomi2016-01-20
Yuta Kashino
 
PPT
Variational Inference
Tushar Tank
 
PDF
Jonathan Ronen - Variational Autoencoders tutorial
Jonathan Ronen
 
PDF
Monte carlo dropout and variational bound
天乐 杨
 
PPTX
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
changedaeoh
 
PDF
Bayesian Deep Learning
RayKim51
 
PDF
Introduction to Variational Auto Encoder
vaidehimadaan041
 
PDF
Variational Autoencoder from scratch.pdf
namnguynhi30
 
PDF
010_20160216_Variational Gaussian Process
Ha Phuong
 
PDF
Zahedi
formater1
 
PPTX
Variational Auto Encoder and the Math Behind
Varun Reddy
 
PDF
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
Deep Learning JP
 
PDF
An introduction to deep learning
Van Thanh
 
Introduction to modern Variational Inference.
Tomasz Kusmierczyk
 
Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18
Olga Zinkevych
 
is anyone_interest_in_auto-encoding_variational-bayes
NAVER Engineering
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Deep VI with_beta_likelihood
Natan Katz
 
PRML Chapter 10
Sunwoo Kim
 
Explicit Density Models
Sangwoo Mo
 
"Automatic Variational Inference in Stan" NIPS2015_yomi2016-01-20
Yuta Kashino
 
Variational Inference
Tushar Tank
 
Jonathan Ronen - Variational Autoencoders tutorial
Jonathan Ronen
 
Monte carlo dropout and variational bound
天乐 杨
 
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
changedaeoh
 
Bayesian Deep Learning
RayKim51
 
Introduction to Variational Auto Encoder
vaidehimadaan041
 
Variational Autoencoder from scratch.pdf
namnguynhi30
 
010_20160216_Variational Gaussian Process
Ha Phuong
 
Zahedi
formater1
 
Variational Auto Encoder and the Math Behind
Varun Reddy
 
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
Deep Learning JP
 
An introduction to deep learning
Van Thanh
 

More from Masahiro Suzuki (7)

PDF
深層生成モデルと世界モデル(2020/11/20版)
Masahiro Suzuki
 
PDF
確率的推論と行動選択
Masahiro Suzuki
 
PDF
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
Masahiro Suzuki
 
PDF
深層生成モデルと世界モデル
Masahiro Suzuki
 
PDF
「世界モデル」と関連研究について
Masahiro Suzuki
 
PDF
GAN(と強化学習との関係)
Masahiro Suzuki
 
PDF
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
Masahiro Suzuki
 
深層生成モデルと世界モデル(2020/11/20版)
Masahiro Suzuki
 
確率的推論と行動選択
Masahiro Suzuki
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
Masahiro Suzuki
 
深層生成モデルと世界モデル
Masahiro Suzuki
 
「世界モデル」と関連研究について
Masahiro Suzuki
 
GAN(と強化学習との関係)
Masahiro Suzuki
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
Masahiro Suzuki
 

Recently uploaded (20)

PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PPTX
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Digital Circuits, important subject in CS
contactparinay1
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 

(DL hacks輪読) Variational Inference with Rényi Divergence

  • 1. Variational Inference with Rényi Divergence D1
  • 2. ¤ arXiv 2 6 ¤ Yingzhen Li Richard E. Turner ¤ University of Cambridge ¤ Li D3 “Stochastic Expectation Propagation” NIPS ¤ Rényi ¤ VAE importance weighted AE[Burda et al., 2015] ¤ Appendix ¤
  • 3. ¤ PRML ¤ ¤ ¤ ¤ KL ln #(%) = ℒ % + *+(,||%)
  • 4. ¤ ¤ SVI [Hoffmann et al, 2013] ¤ SEP [Li et al., 2015] ¤ black-box ¤ [Ranganath et al., 2014] ¤ black-box alpha BB-α [Hernandez-Labato et al., 2015] ¤ ¤ Importance weighted AE (IWAE)[Burda et al., 2015] ¤ VAE ICLR2016
  • 5. ¤ ¤ #(.|/) ¤ ,(.) ¤ KL # / ¤ principle literature [Gr¨unwald, 2007]. 2.2 Variational Inference Next we review the variational inference algorithm [Jordan et al. perspective, using posterior approximation as a running examp i.i.d. samples D = {xn}N n=1 from a probabilistic model p(x|✓) pa is drawn from a prior p0(✓). Bayesian inference involves comp parameters given the data, p(✓|D) = p(✓, D) p(D) = p0(✓) QN n=1 p p(D) 3 ciple literature [Gr¨unwald, 2007]. Variational Inference we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisat pective, using posterior approximation as a running example. Consider observing a dataset of samples D = {xn}N n=1 from a probabilistic model p(x|✓) parametrised by a random variable ✓ th awn from a prior p0(✓). Bayesian inference involves computing the posterior distribution of t meters given the data, p(✓|D) = p(✓, D) p(D) = p0(✓) QN n=1 p(xn|✓) p(D) , 3 (D) = R p0(✓) QN n=1 p(xn|✓)d✓ is often called marginal likelihood or model evidence. For l models, including Bayesian neural networks, the true posterior is typically intractable. nference introduces an approximation q(✓) to the true posterior, which is obtained by minim divergence in some tractable distribution family Q: q(✓) = arg min q2Q KL[q(✓)||p(✓|D)]. r the KL divergence in (8) is also intractable, mainly because of the di cult term p(D). Varia e sidesteps this di culty by considering an equivalent optimisation problem: q(✓) = arg max q2Q LV I (q; D), he variational lower-bound or evidence lower-bound (ELBO) LV I (q; D) is defined by LV I (q; D) = log p(D) KL[q(✓)||p(✓|D)]  re p(D) = R p0(✓) QN n=1 p(xn|✓)d✓ is often called marginal likelihood or model evidence. For m werful models, including Bayesian neural networks, the true posterior is typically intractable. Va al inference introduces an approximation q(✓) to the true posterior, which is obtained by minimi KL divergence in some tractable distribution family Q: q(✓) = arg min q2Q KL[q(✓)||p(✓|D)]. wever the KL divergence in (8) is also intractable, mainly because of the di cult term p(D). Variati rence sidesteps this di culty by considering an equivalent optimisation problem: q(✓) = arg max q2Q LV I(q; D), re the variational lower-bound or evidence lower-bound (ELBO) LV I(q; D) is defined by LV I(q; D) = log p(D) KL[q(✓)||p(✓|D)]  p(✓, D) n called marginal likelihood or model evidence. For many etworks, the true posterior is typically intractable. Varia- (✓) to the true posterior, which is obtained by minimising on family Q: min 2Q KL[q(✓)||p(✓|D)]. (8) able, mainly because of the di cult term p(D). Variational g an equivalent optimisation problem: arg max q2Q LV I(q; D), (9) lower-bound (ELBO) LV I(q; D) is defined by og p(D) KL[q(✓)||p(✓|D)]  p(✓, D) (10) where p(D) = R p0(✓) QN n=1 p(xn|✓)d✓ is often called marginal likelihood powerful models, including Bayesian neural networks, the true posterior is tional inference introduces an approximation q(✓) to the true posterior, wh the KL divergence in some tractable distribution family Q: q(✓) = arg min q2Q KL[q(✓)||p(✓|D)]. However the KL divergence in (8) is also intractable, mainly because of the d inference sidesteps this di culty by considering an equivalent optimisation q(✓) = arg max q2Q LV I(q; D), where the variational lower-bound or evidence lower-bound (ELBO) LV I(q LV I(q; D) = log p(D) KL[q(✓)||p(✓|D)] = Eq  log p(✓, D) q(✓) .
  • 6. VAE
¤ Variational auto-encoder [Kingma and Welling, 2014; Rezende et al., 2014]
¤ Deep generative model with hierarchical latent variables \(h^{(1)},\dots,h^{(L)}\)
¤ Parametrises the variational approximation with a recognition network
Quoted from Section 5 of the paper ("Variational Auto-encoder with Rényi Divergence"): the generative model is specified as a hierarchical latent variable model,
\[ p(x) = \sum_{h^{(1)}\dots h^{(L)}} p(h^{(L)})\,p(h^{(L-1)}|h^{(L)})\cdots p(x|h^{(1)}). \quad (14) \]
The parameters \(\theta\) are dropped from the notation, but they will be learned using approximate maximum likelihood. For these models the exact computation of \(\log p(x)\) requires marginalising all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating
\[ \log p(x) \approx \mathcal{L}_{VI}(q;x) = \mathbb{E}_{q(h|x)}\!\left[\log\frac{p(x,h)}{q(h|x)}\right], \quad (15) \]
where \(h\) collects all the hidden variables \(h^{(1)},\dots,h^{(L)}\) and the approximate posterior is defined as
\[ q(h|x) = q(h^{(1)}|x)\,q(h^{(2)}|h^{(1)})\cdots q(h^{(L)}|h^{(L-1)}). \quad (16) \]
In variational EM, optimisation over \(q\) and \(p\) is alternated to guarantee convergence. The core idea of the VAE, however, is to optimise \(p\) and \(q\) jointly, which has no guarantee of increasing the MLE objective in each iteration; indeed, jointly the method is biased [Turner and Sahani, 2011]. This opens the possibility that alternative surrogate functions might return estimates that are tighter bounds, so the VR bound is considered in this context:
\[ \mathcal{L}_\alpha(q;x) = \frac{1}{1-\alpha}\log \mathbb{E}_{q(h|x)}\!\left[\left(\frac{p(x,h)}{q(h|x)}\right)^{1-\alpha}\right]. \quad (17) \]
Quoted from the IWAE paper [Burda et al., 2015] (under review at ICLR 2016), describing the same model class:
\[ p(x|\theta) = \sum_{h^1,\dots,h^L} p(h^L|\theta)\,p(h^{L-1}|h^L,\theta)\cdots p(x|h^1,\theta). \quad (1) \]
Here \(\theta\) is a vector of parameters of the variational autoencoder, and \(h = \{h^1,\dots,h^L\}\) denotes the stochastic hidden units, or latent variables; the dependence on \(\theta\) is often suppressed for clarity, and for convenience \(h^0 = x\). Each of the terms \(p(h^\ell|h^{\ell+1})\) may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network; however, it is assumed that sampling and probability evaluation are tractable for each \(p(h^\ell|h^{\ell+1})\). Note that \(L\) denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. The recognition model \(q(h|x)\) is defined in terms of an analogous factorization,
\[ q(h|x) = q(h^1|x)\,q(h^2|h^1)\cdots q(h^L|h^{L-1}), \quad (2) \]
where sampling and probability evaluation are tractable for each of the terms in the product. Following Kingma and Welling (2014), the prior \(p(h^L)\) is fixed to be a zero-mean, unit-variance Gaussian; in general, each of the conditionals \(p(h^\ell|h^{\ell+1})\) and \(q(h^\ell|h^{\ell-1})\) is a Gaussian with diagonal covariance, whose mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, \(p(x|h^1)\) is also defined to be such a Gaussian; for binary observations, it is a Bernoulli distribution whose mean parameters are computed by a neural network. The VAE is trained to maximise a variational lower bound on the log-likelihood, as derived from Jensen's inequality:
\[ \log p(x) = \log \mathbb{E}_{q(h|x)}\!\left[\frac{p(x,h)}{q(h|x)}\right] \ge \mathbb{E}_{q(h|x)}\!\left[\log\frac{p(x,h)}{q(h|x)}\right] = \mathcal{L}(x). \quad (3) \]
Since \(\mathcal{L}(x) = \log p(x) - D_{KL}(q(h|x)\|p(h|x))\), the training procedure is forced to trade off the data fit against the KL divergence of the approximate posterior from the true posterior.
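A toy sketch of the layer-by-layer sampling in (14)-(16) and the one-sample Jensen bound (3), with \(L = 2\); the affine "networks" below are hypothetical stand-ins for the feed-forward nets, so only the equation structure is from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_sample(mu, log_var):
    """Reparameterised sample from N(mu, diag(exp(log_var)))."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def gauss_logpdf(v, mu, log_var):
    """Log-density of a diagonal Gaussian."""
    return -0.5 * np.sum((v - mu) ** 2 / np.exp(log_var)
                         + log_var + np.log(2 * np.pi))

# toy affine maps standing in for the neural networks (hypothetical)
def q1(x):   return 0.5 * x[:2], np.zeros(2)      # q(h1 | x)
def q2(h1):  return 0.5 * h1[:1], np.zeros(1)     # q(h2 | h1)
def p1(h2):  return np.tile(h2, 2), np.zeros(2)   # p(h1 | h2)
def px(h1):  return np.tile(h1, 2), np.zeros(4)   # p(x | h1)

x = rng.standard_normal(4)
h1 = gauss_sample(*q1(x))                         # layer-by-layer sampling, eq. (2)
h2 = gauss_sample(*q2(h1))
log_q = gauss_logpdf(h1, *q1(x)) + gauss_logpdf(h2, *q2(h1))
log_p = (gauss_logpdf(h2, np.zeros(1), np.zeros(1))   # p(h2) = N(0, I)
         + gauss_logpdf(h1, *p1(h2)) + gauss_logpdf(x, *px(h1)))
print("one-sample estimate of L(x) in eq. (3):", log_p - log_q)
```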
  • 7. VAE: the reparameterization trick
¤ Write samples from the recognition model as deterministic functions of the inputs and auxiliary noise
¤ Push the gradient operator inside the expectation
¤ Estimate the gradient with Monte Carlo samples and backpropagation
Quoted from the IWAE paper: the recognition distribution is reparameterised in terms of fixed base distributions, such that the samples from the recognition model are deterministic functions of the inputs and auxiliary variables. While the reparameterization was presented for general distributions, for convenience the special case of Gaussians is discussed here (the general trick applies as well). The distribution \(q(h^\ell|h^{\ell-1},\theta)\) always takes the form of a Gaussian whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can alternatively be expressed by first sampling an auxiliary variable \(\epsilon^\ell \sim \mathcal{N}(0,I)\) and then applying the deterministic mapping
\[ h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1},\theta)^{1/2}\,\epsilon^\ell + \mu(h^{\ell-1},\theta). \quad (4) \]
The joint recognition distribution \(q(h|x,\theta)\) over all latent variables can be expressed in terms of a deterministic mapping \(h(\epsilon,x,\theta)\), with \(\epsilon = (\epsilon^1,\dots,\epsilon^L)\), by applying Eqn. 4 for each layer in sequence. Since the distribution of \(\epsilon\) does not depend on \(\theta\), the gradient of the bound \(\mathcal{L}(x)\) from Eqn. 3 can be reformulated by pushing the gradient operator inside the expectation:
\[ \nabla_\theta \log \mathbb{E}_{h\sim q(h|x,\theta)}\!\left[\frac{p(x,h|\theta)}{q(h|x,\theta)}\right] = \nabla_\theta \mathbb{E}_{\epsilon^1,\dots,\epsilon^L\sim\mathcal{N}(0,I)}\!\left[\log\frac{p(x,h(\epsilon,x,\theta)|\theta)}{q(h(\epsilon,x,\theta)|x,\theta)}\right] \quad (5) \]
\[ = \mathbb{E}_{\epsilon^1,\dots,\epsilon^L\sim\mathcal{N}(0,I)}\!\left[\nabla_\theta \log\frac{p(x,h(\epsilon,x,\theta)|\theta)}{q(h(\epsilon,x,\theta)|x,\theta)}\right]. \quad (6) \]
Assuming the mapping \(h\) is represented as a deterministic feed-forward neural network, for a fixed \(\epsilon\) the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating \(k\) samples of \(\epsilon\) and applying the Monte Carlo estimator
\[ \frac{1}{k}\sum_{i=1}^{k} \nabla_\theta \log w\big(x, h(\epsilon_i,x,\theta), \theta\big) \quad (7) \]
with \(w(x,h,\theta) = p(x,h|\theta)/q(h|x,\theta)\). This is an unbiased estimate of \(\nabla_\theta\mathcal{L}(x)\). The VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the log-likelihood gradients with respect to the latent variables.
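The pathwise gradient (5)-(7) can be checked numerically on a one-layer toy model. The sketch below is illustrative only: a finite difference through fixed noise stands in for backpropagation, and normalising constants (which do not affect the gradient) are dropped:

```python
import numpy as np

rng = np.random.default_rng(1)
x = 1.5
eps = rng.standard_normal(100_000)    # auxiliary noise, fixed as in eq. (4)

def log_w(theta):
    """Toy log w = log p(x,h) - log q(h|x), constants dropped, with
    h = theta + eps, p(h)=N(0,1), p(x|h)=N(h,1), q(h|x)=N(theta,1)."""
    h = theta + eps                    # deterministic mapping h(eps, x, theta)
    return -0.5 * (h ** 2 + (x - h) ** 2) + 0.5 * (h - theta) ** 2

# eq. (7): pathwise Monte Carlo gradient; holding eps fixed makes the
# finite difference pass through the deterministic mapping. Analytically
# the gradient of L(x) here is x - 2*theta = 0.9 at theta = 0.3.
d = 1e-5
print(np.mean(log_w(0.3 + d) - log_w(0.3 - d)) / (2 * d))
```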
  • 8. VAE: the SGVB estimators
¤ Reparameterise \(q_\phi(z|x)\) and estimate the VAE bound by Monte Carlo
¤ Two estimators: a generic one, and one whose KL term is computed analytically
¤ The analytic-KL version typically has lower variance
Quoted from the VAE paper [Kingma and Welling, 2014]: a practical estimator of the lower bound and its derivatives w.r.t. the parameters is introduced, assuming an approximate posterior of the form \(q_\phi(z|x)\) (the technique also applies to \(q_\phi(z)\), i.e. the case where we do not condition on \(x\); the fully Bayesian variant is given in the appendix). Under certain mild conditions, for a chosen approximate posterior \(q_\phi(z|x)\) the random variable \(\tilde z \sim q_\phi(z|x)\) can be reparameterised using a differentiable transformation \(g_\phi(\epsilon,x)\) of an (auxiliary) noise variable \(\epsilon\):
\[ \tilde z = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon). \quad (4) \]
Monte Carlo estimates of expectations of some function \(f(z)\) w.r.t. \(q_\phi(z|x)\) then follow:
\[ \mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[f\big(g_\phi(\epsilon, x^{(i)})\big)\right] \simeq \frac{1}{L}\sum_{l=1}^{L} f\big(g_\phi(\epsilon^{(l)}, x^{(i)})\big), \quad \epsilon^{(l)}\sim p(\epsilon). \quad (5) \]
Applying this to the variational lower bound yields the generic Stochastic Gradient Variational Bayes (SGVB) estimator \(\tilde{\mathcal{L}}^A(\theta,\phi;x^{(i)}) \simeq \mathcal{L}(\theta,\phi;x^{(i)})\):
\[ \tilde{\mathcal{L}}^A(\theta,\phi;x^{(i)}) = \frac{1}{L}\sum_{l=1}^{L}\Big[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})\Big], \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}),\ \epsilon^{(l)}\sim p(\epsilon). \quad (6) \]
Often the KL divergence \(D_{KL}(q_\phi(z|x^{(i)})\|p_\theta(z))\) can be integrated analytically (see the paper's appendix B), such that only the expected reconstruction error \(\mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)]\) requires estimation by sampling. The KL term can then be interpreted as regularising \(\phi\), encouraging the approximate posterior to be close to the prior \(p_\theta(z)\). This yields a second version of the SGVB estimator, which typically has less variance than the generic estimator:
\[ \tilde{\mathcal{L}}^B(\theta,\phi;x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)})\|p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x^{(i)}|z^{(i,l)}). \quad (7) \]
Given a dataset \(X\) with \(N\) datapoints, an estimator of the marginal likelihood lower bound of the full dataset is constructed from minibatches:
\[ \mathcal{L}(\theta,\phi;X) \simeq \tilde{\mathcal{L}}^M(\theta,\phi;X^M) = \frac{N}{M}\sum_{i=1}^{M}\tilde{\mathcal{L}}(\theta,\phi;x^{(i)}), \quad (8) \]
where the minibatch \(X^M = \{x^{(i)}\}_{i=1}^M\) is a randomly drawn sample of \(M\) datapoints from the full dataset. In the experiments the number of samples \(L\) per datapoint could be set to 1 as long as the minibatch size \(M\) was large enough, e.g. \(M = 100\). The gradients \(\nabla_{\theta,\phi}\tilde{\mathcal{L}}\) can then be used with stochastic optimisation methods such as SGD or Adagrad [DHS10]. A connection with auto-encoders becomes clear in (7): the first term (the KL divergence of the approximate posterior from the prior) acts as a regulariser, while the second term is an expected negative reconstruction error.
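The analytic KL term in estimator (7) has a well-known closed form for a diagonal Gaussian posterior and a standard normal prior (derived in the VAE paper's appendix B). A sketch, where `log_lik` is a hypothetical \(z \mapsto \log p_\theta(x|z)\):

```python
import numpy as np

def kl_gauss_std_normal(mu, log_var):
    """Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    the regulariser in the SGVB estimator (7)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def sgvb_b(mu, log_var, log_lik, rng):
    """One-sample (L = 1) version of estimator (7):
    analytic KL plus a sampled reconstruction term."""
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    return -kl_gauss_std_normal(mu, log_var) + log_lik(z)
```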
  • 9. Importance weighted AE (IWAE)
¤ Same architecture as the VAE: a generative network and a recognition network
¤ Trained on a \(k\)-sample importance weighted lower bound on \(\log p(x)\)
¤ \(k = 1\) recovers the VAE objective
¤ Larger \(k\) gives a tighter bound
Quoted from the IWAE paper: the VAE criterion may be too strict; a recognition network that places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind the importance weighted autoencoder (IWAE). The IWAE uses the same architecture as the VAE, with both a generative network and a recognition network; the difference is that it is trained to maximise a different lower bound on \(\log p(x)\), corresponding to the \(k\)-sample importance weighted estimate of the log-likelihood:
\[ \mathcal{L}_k(x) = \mathbb{E}_{h_1,\dots,h_k\sim q(h|x)}\!\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_i)}{q(h_i|x)}\right], \]
where \(h_1,\dots,h_k\) are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalised importance weights \(w_i = p(x,h_i)/q(h_i|x)\). This is a lower bound on the marginal log-likelihood, as follows from Jensen's inequality and the fact that the average importance weight is an unbiased estimator of \(p(x)\):
\[ \mathcal{L}_k = \mathbb{E}\!\left[\log\frac{1}{k}\sum_{i=1}^k w_i\right] \le \log \mathbb{E}\!\left[\frac{1}{k}\sum_{i=1}^k w_i\right] = \log p(x). \]
Theorem 1: for all \(k\), the lower bounds satisfy \(\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k\); moreover, if \(p(h,x)/q(h|x)\) is bounded, then \(\mathcal{L}_k\) approaches \(\log p(x)\) as \(k\) goes to infinity (proof in Appendix A of the paper).
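A minimal sketch of one draw of \(\mathcal{L}_k(x)\) from \(k\) log-weights, using log-sum-exp for numerical stability (the log-weights themselves are assumed computed elsewhere):

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound_sample(log_w):
    """One draw of the k-sample bound L_k(x) = log (1/k) sum_i w_i, from
    log-weights log_w[i] = log p(x, h_i) - log q(h_i | x)."""
    k = len(log_w)
    return logsumexp(log_w) - np.log(k)

# k = 1 reduces to the single-sample ELBO estimate log w_1.
```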
  • 10. Rényi's α-divergence
¤ A divergence between \(p\) and \(q\) indexed by \(\alpha\)
¤ Defined for \(\alpha > 0,\ \alpha \neq 1\) (and extended by continuity)
¤ \(\alpha \to 1\) recovers the KL divergence
¤ \(\alpha = \tfrac{1}{2}\) relates to the Hellinger distance
Quoted from the paper: for two distributions \(p\) and \(q\) on a random variable \(\theta\in\Theta\),
\[ D_\alpha[p\|q] = \frac{1}{\alpha-1}\log\int p(\theta)^\alpha q(\theta)^{1-\alpha}\,d\theta. \quad (1) \]
For \(\alpha > 1\) the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When \(\alpha \to 1\) it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:
\[ D_1[p\|q] = \lim_{\alpha\to 1} D_\alpha[p\|q] = \int p(\theta)\log\frac{p(\theta)}{q(\theta)}\,d\theta = \mathrm{KL}[p\|q]. \]
Similar to \(\alpha = 1\), for the values \(\alpha = 0, +\infty\) the Rényi divergence is defined by continuity in \(\alpha\):
\[ D_0[p\|q] = -\log\int_{p(\theta)>0} q(\theta)\,d\theta, \qquad D_{+\infty}[p\|q] = \log\max_{\theta\in\Theta}\frac{p(\theta)}{q(\theta)}. \]
Another special case is \(\alpha = \tfrac{1}{2}\), where the corresponding Rényi divergence is a function of the squared Hellinger distance \(\mathrm{Hel}^2[p\|q] = \tfrac{1}{2}\int\big(\sqrt{p(\theta)}-\sqrt{q(\theta)}\big)^2 d\theta\):
\[ D_{\frac{1}{2}}[p\|q] = -2\log\big(1 - \mathrm{Hel}^2[p\|q]\big). \]
In [Van Erven and Harremoës, 2014] the definition (1) is also extended to negative \(\alpha\) values, where it is non-positive and thus no longer a valid divergence measure; the proposed framework later makes use of this extension.
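As a numerical sanity check (not from the paper), definition (1) can be evaluated by simple quadrature; for unit-variance Gaussians \(D_\alpha = \alpha(\mu_p-\mu_q)^2/2\), so the example below should print roughly 0.25:

```python
import numpy as np

def renyi_alpha(p_pdf, q_pdf, alpha, grid):
    """Numerical evaluation of D_alpha[p||q] from eq. (1) on a 1-D grid."""
    dx = grid[1] - grid[0]
    integral = np.sum(p_pdf(grid) ** alpha * q_pdf(grid) ** (1 - alpha)) * dx
    return np.log(integral) / (alpha - 1)

def gauss(mu):
    # unit-variance Gaussian density, a convenient test case
    return lambda t: np.exp(-0.5 * (t - mu) ** 2) / np.sqrt(2 * np.pi)

grid = np.linspace(-10.0, 10.0, 20001)
print(renyi_alpha(gauss(0.0), gauss(1.0), alpha=0.5, grid=grid))  # ~0.25
```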
  • 11. The variational Rényi (VR) bound
¤ Approximate \(p(\theta|\mathcal{D})\) with \(q(\theta)\), replacing the KL by a Rényi divergence
¤ Minimise Rényi's α-divergence for some selected \(\alpha\)
¤ For \(\alpha \neq 1\) this yields a new objective
Quoted from the paper: recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Rényi case? Consider approximating the exact posterior \(p(\theta|\mathcal{D})\) by minimising Rényi's α-divergence for some selected \(\alpha \ge 0\):
\[ q(\theta) = \arg\min_{q\in\mathcal{Q}} D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]. \]
It is straightforward to verify the alternative optimisation problem
\[ q(\theta) = \arg\max_{q\in\mathcal{Q}} \Big\{\log p(\mathcal{D}) - D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]\Big\}. \]
For \(\alpha \neq 1\), the objective can be rewritten as
\[ \log p(\mathcal{D}) - \frac{1}{\alpha-1}\log\int q(\theta)^\alpha p(\theta|\mathcal{D})^{1-\alpha}\,d\theta = \log p(\mathcal{D}) - \frac{1}{\alpha-1}\log\mathbb{E}_q\!\left[\left(\frac{p(\theta,\mathcal{D})}{q(\theta)\,p(\mathcal{D})}\right)^{1-\alpha}\right] = \frac{1}{1-\alpha}\log\mathbb{E}_q\!\left[\left(\frac{p(\theta,\mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q;\mathcal{D}). \]
This new objective is named the variational Rényi (VR) bound.
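A one-line derivation, not spelled out on the slide, of why the VR bound recovers the ELBO as \(\alpha \to 1\):

```latex
% Write r(theta) = log p(theta, D) / q(theta) and expand to first order in (1 - alpha):
\mathcal{L}_\alpha(q;\mathcal{D})
  = \frac{1}{1-\alpha}\log \mathbb{E}_q\!\left[e^{(1-\alpha)\,r(\theta)}\right]
  = \frac{1}{1-\alpha}\log\!\Big(1 + (1-\alpha)\,\mathbb{E}_q[r(\theta)] + O\big((1-\alpha)^2\big)\Big)
  \xrightarrow{\ \alpha\to 1\ } \mathbb{E}_q[r(\theta)] = \mathcal{L}_{VI}(q;\mathcal{D}).
```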
  • 12. Monte Carlo estimation of the VR bound
¤ Approximate the VR bound with finite samples from \(q\)
¤ The estimate is biased, but the bias can be characterised
Quoted from the paper (VR bound optimisation framework): variational free-energy methods sidestep intractabilities in a class of intractable models, and recent work uses Monte Carlo approximations to expand the set of models that can be handled. This section develops a scalable optimisation framework for the VR bound by extending the recent advances of traditional VI; black-box methods are discussed to enable applications to arbitrary finite \(\alpha\) settings. A simple Monte Carlo method is proposed that uses finite samples \(\theta_k \sim q(\theta)\), \(k = 1,\dots,K\), to approximate \(\mathcal{L}_\alpha\):
\[ \hat{\mathcal{L}}_{\alpha,K}(q;\mathcal{D}) = \frac{1}{1-\alpha}\log\frac{1}{K}\sum_{k=1}^{K}\left[\left(\frac{p(\theta_k,\mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right]. \]
Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over \(q(\theta)\) sits inside the logarithm. However the bias can be bounded by the following results, proved in the supplementary material.
Theorem 2: \(\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q;\mathcal{D})]\), as a function of \(\alpha\) and \(K\), is: 1) non-decreasing in \(K\) for fixed \(\alpha\), with limit \(\mathcal{L}_\alpha\) as \(K \to +\infty\) if \(|p/q|\) is bounded; 2) continuous and non-increasing in \(\alpha\) on \((-\infty, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}\).
Corollary 1: for \(K < +\infty\) there exists \(\alpha_K < 0\) such that for all \(\alpha \le \alpha_K\), \(\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q;\mathcal{D})] \le \log p(\mathcal{D})\). Furthermore \(\alpha_K\) is non-decreasing in \(K\), with \(\lim_{K\to 1}\alpha_K = -\infty\) and \(\lim_{K\to+\infty}\alpha_K = 0\).
[Figure 2: (a) an illustration of the bounding properties of sampling approximations to the VR bounds, with \(\alpha_2 < 0 < \alpha_1 < 1\) and \(1 < K_1 < K_2 < +\infty\); (b) the bias of the sampling estimate of the (negative) alpha-divergence, where \(p, q\) are 2-D Gaussians with identity covariance and means \(\mu_p = [0,0]\), \(\mu_q = [1,1]\).]
By definition, the exact VR bound is a lower bound or upper bound of \(\log p(\mathcal{D})\) when \(\alpha > 0\) or \(\alpha < 0\), respectively. However, for \(\alpha \le 1\) the sampling approximation \(\hat{\mathcal{L}}_{\alpha,K}\) in expectation under-estimates the exact VR bound \(\mathcal{L}_\alpha\), and the approximation quality can be improved by using more samples. Thus for finite samples, negative \(\alpha\) values can be used to improve the accuracy of the approximation; the theoretical results are evaluated empirically in Figure 2(b) by computing the exact and Monte Carlo estimates of the divergence.
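A sketch of \(\hat{\mathcal{L}}_{\alpha,K}\) computed in log-space for numerical stability (the log-weights are assumed supplied); \(\alpha = 0\) gives the IWAE-style average and \(\alpha \to 1\) falls back to the plain ELBO average:

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_hat(log_w, alpha):
    """\hat{L}_{alpha,K} from K log-weights
    log_w[k] = log p(theta_k, D) - log q(theta_k)."""
    K = len(log_w)
    if np.isclose(alpha, 1.0):              # alpha -> 1: plain ELBO estimate
        return np.mean(log_w)
    return (logsumexp((1.0 - alpha) * log_w) - np.log(K)) / (1.0 - alpha)
```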
  • 13. VR bound: exact vs. sampling approximation
[Figure 2(a): bounding properties of the sampling approximations to the VR bounds, with \(\alpha_2 < 0 < \alpha_1 < 1\) and \(1 < K_1 < K_2 < +\infty\).]
¤ For \(\alpha \le 1\) the sampling approximation \(\hat{\mathcal{L}}_{\alpha,K}\) under-estimates the exact VR bound in expectation; increasing \(K\) improves the approximation
  • 14. VR bound and the IWAE
¤ With finite \(K\), setting \(\alpha = 0\) in the sampled VR bound \(\hat{\mathcal{L}}_{\alpha,K}\) recovers the IWAE objective
¤ For \(\alpha \le 1\) the sampled bound in expectation sits at or below the exact VR bound \(\mathcal{L}_\alpha\)
  • 15. VR-max
¤ Apply the reparameterization trick to the VR bound
¤ Unified gradient computation for all finite \(\alpha\)
¤ \(\alpha = 1\) recovers the VAE gradient
¤ \(\alpha \to -\infty\): only the sample with the largest importance weight is back-propagated
¤ This is the VR-max algorithm
Quoted from the paper (with \(g_\phi\) the reparameterisation mapping; notation abbreviated to reduce clutter): applying the reparameterization trick to the VR bound gives
\[ \mathcal{L}_\alpha(q_\phi;\mathcal{D}) = \frac{1}{1-\alpha}\log\mathbb{E}_\epsilon\!\left[\left(\frac{p(g_\phi,\mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right]. \quad (19) \]
Then the gradient of the VR bound w.r.t. \(\phi\) is
\[ \nabla_\phi\mathcal{L}_\alpha(q_\phi;\mathcal{D}) = \mathbb{E}_\epsilon\!\left[w_\alpha(\epsilon;\phi,\mathcal{D})\,\nabla_\phi\log\frac{p(g_\phi,\mathcal{D})}{q(g_\phi)}\right], \quad (20) \]
where \(w_\alpha(\epsilon;\phi,\mathcal{D}) \propto \left(\frac{p(g_\phi,\mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\) denotes the normalised importance weight. For finite samples \(\epsilon_k\sim p(\epsilon)\), \(k = 1,\dots,K\), the gradient is approximated by
\[ \nabla_\phi\hat{\mathcal{L}}_{\alpha,K}(q_\phi;\mathcal{D}) = \sum_{k=1}^{K}\left[\hat w_{\alpha,k}\,\nabla_\phi\log\frac{p(g_\phi(\epsilon_k),\mathcal{D})}{q(g_\phi(\epsilon_k))}\right], \quad (21) \]
with \(\hat w_{\alpha,k}\) short-hand for \(\hat w_\alpha(\epsilon_k;\phi,\mathcal{D})\), the normalised importance weight with finite samples. One can show that this recovers the stochastic gradients of \(\mathcal{L}_{VI}\) by setting \(\alpha = 1\) in (21), since the normalised weights then become uniform:
\[ \nabla_\phi\mathcal{L}_{VI}(q_\phi;\mathcal{D}) \approx \frac{1}{K}\sum_{k=1}^{K}\nabla_\phi\log\frac{p(g_\phi(\epsilon_k),\mathcal{D})}{q(g_\phi(\epsilon_k))}, \quad (22) \]
which means the resulting algorithm unifies the computation for all finite \(\alpha\) settings. To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample \(\epsilon_j\), chosen with probability proportional to its importance weight.
Algorithm 1 (one gradient step for VR-α / VR-max):
1: sample \(\epsilon_1,\dots,\epsilon_K \sim p(\epsilon)\)
2: for all \(k = 1,\dots,K\) and \(n \in S\), the current minibatch, compute the unnormalised weight
\[ \log\hat w(\epsilon_k;x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k)|x_n) \]
3: choose the sample \(\epsilon_{j_n}\) to back-propagate:
   if \(|\alpha| < \infty\): \(j_n \sim p_k\), where \(p_k \propto \hat w(\epsilon_k;x_n)^{1-\alpha}\)
   if \(\alpha = -\infty\): \(j_n = \arg\max_k \log\hat w(\epsilon_k;x_n)\)
4: return the gradients \(\frac{1}{|S|}\sum_{n\in S}\nabla_\phi\log\hat w(\epsilon_{j_n};x_n)\) to the optimiser
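A sketch of the sample-selection rule in step 3 of Algorithm 1; in a real implementation the returned index would select which sample's \(\log\hat w\) is back-propagated by the autodiff framework:

```python
import numpy as np

def choose_backprop_sample(log_w, alpha, rng):
    """Step 3 of Algorithm 1: pick the sample whose log-weight is
    back-propagated. log_w[k] = log p(g(eps_k), x_n) - log q(g(eps_k)|x_n)."""
    if np.isneginf(alpha):                       # VR-max: alpha -> -infinity
        return int(np.argmax(log_w))
    logits = (1.0 - alpha) * np.asarray(log_w)   # p_k proportional to w_k^(1-alpha)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(log_w), p=p))
```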
  • 16. Stochastic approximation for large-scale learning
¤ [Li et al., 2015] discussed stochastic EP as a way of approximating the VR bound optimisation
¤ Here, a stochastic approximation enables minibatch training of the VR bound directly
¤ Increasing the minibatch size \(M\) reduces the bias of the approximation
¤ Black-box alpha (BB-α) is related to this VR framework
Quoted from the paper: note that VR-max does not implement the minimum description length principle (MDL), since MDL approximates the true posterior by optimising the exact VR bound \(\mathcal{L}_{-\infty}\), which upper-bounds the exact log-likelihood function. So far the VR bounds have been computed on the whole dataset \(\mathcal{D}\); for large datasets, full-batch learning is very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way of approximating the VR bound optimisation for large-scale learning; here another stochastic approximation method is proposed to enable minibatch training, which directly applies to the VR bound. Using the notation \(f_n(\theta) = p(x_n|\theta)\) and defining the "average likelihood" \(\bar f_{\mathcal{D}}(\theta) = \big[\prod_{n=1}^N f_n(\theta)\big]^{\frac{1}{N}}\), the joint distribution can be rewritten as \(p(\theta,\mathcal{D}) = p_0(\theta)\bar f_{\mathcal{D}}(\theta)^N\). Now sample \(M\) datapoints \(S = \{x_{n_1},\dots,x_{n_M}\} \sim \mathcal{D}\) and define the corresponding "subset average likelihood" \(\bar f_S(\theta) = \big[\prod_{m=1}^M f_{n_m}(\theta)\big]^{\frac{1}{M}}\); when \(M = 1\) we also write \(\bar f_S(\theta) = f_n(\theta)\) for \(S = \{x_n\}\). The VR bound (13) is then approximated by replacing \(\bar f_{\mathcal{D}}(\theta)\) with \(\bar f_S(\theta)\):
\[ \tilde{\mathcal{L}}_\alpha(q;S) = \frac{1}{1-\alpha}\log\int q(\theta)^\alpha\big[p_0(\theta)\bar f_S(\theta)^N\big]^{1-\alpha} d\theta = \frac{1}{1-\alpha}\log\mathbb{E}_q\!\left[\left(\frac{p_0(\theta)\bar f_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right]. \quad (23) \]
This returns a stochastic estimate of the evidence lower bound when \(\alpha \to 1\). For other \(\alpha \neq 1\) settings, increasing the minibatch size \(M = |S|\) reduces the bias of the approximation. This is guaranteed by Theorem 3, proved in the supplementary: if the approximate distribution \(q(\theta)\) is Gaussian \(\mathcal{N}(\mu,\Sigma)\) and the likelihood functions have an exponential-family form \(p(x|\theta) = \exp[\langle\theta,\Phi(x)\rangle - A(\theta)]\), then for \(\alpha \le 1\) the stochastic approximation is bounded.
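A sketch of the minibatch bound (23) with \(K\) Monte Carlo samples from \(q\); `log_prior`, `log_lik` and `log_q` are assumed user-supplied callables, not part of the paper:

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_minibatch(log_prior, log_lik, log_q, thetas, batch, N, alpha):
    """Sketch of the stochastic VR bound (23): the full-data likelihood is
    replaced by the subset average likelihood fbar_S(theta)^N."""
    log_w = np.array([
        log_prior(t)
        + N * np.mean([log_lik(t, x) for x in batch])  # N * log fbar_S(theta)
        - log_q(t)
        for t in thetas])                              # theta_k ~ q
    K = len(log_w)
    if np.isclose(alpha, 1.0):                         # alpha -> 1: ELBO estimate
        return np.mean(log_w)
    return (logsumexp((1.0 - alpha) * log_w) - np.log(K)) / (1.0 - alpha)
```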
  • 17. VR
  • 19. Experiment 1: auto-encoders
¤ Three datasets: Frey Face, Caltech 101 Silhouettes, MNIST
¤ Compared objectives: VAE (\(\alpha = 1\)), IWAE (\(\alpha = 0\)), VR-max (\(\alpha = -\infty\))
¤ Test log-likelihood estimated with \(\alpha = 0\), \(K = 5000\)
¤ VR-max is almost indistinguishable from the IWAE
¤ VR-max is faster: training on MNIST took 25hr29min vs. 61hr16min for the IWAE
Quoted from the paper: the method was implemented on top of the publicly available IWAE code; note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight. Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The experiments were repeated on the relatively small Frey Face dataset, while the other two were run once. The deep generative model consists of \(L = 1\) or \(2\) stochastic layers with deterministic layers in between; the network architecture is detailed in the supplementary. For MNIST the settings from [Burda et al., 2015] were used, including the number of epochs; for the other two datasets the settings follow the VI literature. The experiments for the IWAE were reproduced because the results included in [Burda et al., 2015] mismatch those of the available implementation. Test log-likelihoods in Table 1 are computed as \(\log p(x) \approx \hat{\mathcal{L}}_{\alpha,K}(q;x)\) with \(\alpha = 0.0\) and \(K = 5000\), and Figure 3 presents samples from the VR-max trained models, which are almost indistinguishable from those of the IWAE on all three datasets. VR-max requires much less training time than the IWAE with a full backward pass: on a Tesla C2075 GPU, trained on MNIST (\(L = 2\), \(K = 50\)), VR-max took 25hr29min and the IWAE 61hr16min. A single-backward-pass version of the IWAE was also implemented; its best result was -85.02, slightly worse, supporting the argument in Section 4.1 that negative \(\alpha\) can be helpful when computational resources are limited. The \(\alpha\) value corresponding to the tightest VR bound changes as the mismatch between \(q\) and the true posterior increases, which is the case when \(q\) is fitted to approximate a typically multimodal exact posterior.
[Figure 3: sampled images from the VR-max trained auto-encoders: (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST. Table 1 with the test log-likelihoods is reproduced on the next slide.]
  • 20. Experiment 1: results
Table 1: Average test log-likelihood. Results for the VAE on MNIST are collected from [Burda et al., 2015]; IWAE results are reproduced using the publicly available implementation.

| Dataset | L | K | VAE | IWAE | VR-max |
|---|---|---|---|---|---|
| Frey Face (± std. err.) | 1 | 5 | 1322.96 ±10.03 | 1380.30 ±4.60 | 1377.40 ±4.59 |
| Caltech 101 Silhouettes | 1 | 5 | -119.69 | -117.89 | -118.01 |
| | 1 | 50 | -119.61 | -117.21 | -117.10 |
| MNIST | 1 | 5 | -86.47 | -85.41 | -85.42 |
| | 1 | 50 | -86.35 | -84.80 | -84.81 |
| | 2 | 5 | -85.01 | -83.92 | -84.04 |
| | 2 | 50 | -84.78 | -83.12 | -83.44 |
  • 21. Discussion of Experiment 1
¤ VR-max matches the IWAE while only back-propagating a single sample
¤ On the small Frey Face dataset the IWAE is slightly better
¤ The best-performing \(\alpha\) depends on the dataset
  • 22. Experiment 2: Bayesian neural network regression
¤ Regression on UCI datasets
¤ Baselines: VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
¤ BB-α=BO results are not directly comparable and are available only for small datasets

Table 2: Average test log-likelihood. BB-α=BO results are not directly comparable and are available only for small datasets.

| Dataset | VI | PBP | BB-α=BO* | VR-0.5 | VR-0.0 | VR-max |
|---|---|---|---|---|---|---|
| Boston | -2.903±0.071 | -2.574±0.089 | -2.549±0.019 | -2.457±0.066 | -2.468±0.071 | -2.469±0.072 |
| Concrete | -3.391±0.017 | -3.161±0.019 | -3.104±0.015 | -3.094±0.016 | -3.076±0.018 | -3.092±0.018 |
| Energy | -2.391±0.029 | -2.042±0.019 | -0.945±0.012 | -1.401±0.029 | -1.418±0.020 | -1.389±0.018 |
| Wine | -0.980±0.013 | -0.968±0.014 | -0.949±0.009 | -0.948±0.011 | -0.952±0.012 | -0.949±0.012 |
| Yacht | -3.439±0.163 | -1.634±0.016 | -1.102±0.039 | -1.816±0.011 | -1.829±0.014 | -1.817±0.013 |
| Protein | -2.992±0.006 | -2.973±0.003 | NA | -2.923±0.006 | -2.911±0.005 | -2.938±0.005 |
| Year | -3.622 | -3.603 | NA | -3.545 | -3.550 | -3.542 |

Table 3: Average test error. BB-α=BO results are not directly comparable and are available only for small datasets.

| Dataset | VI | PBP | BB-α=BO* | VR-0.5 | VR-0.0 | VR-max |
|---|---|---|---|---|---|---|
| Boston | 4.320±0.291 | 3.104±0.180 | 3.160±0.109 | 2.853±0.154 | 2.852±0.169 | 2.837±0.181 |
| Concrete | 7.128±0.123 | 5.667±0.093 | 5.374±0.074 | 5.343±0.102 | 5.237±0.114 | 5.280±0.104 |
| Energy | 2.646±0.081 | 1.804±0.048 | 0.600±0.018 | 0.807±0.059 | 0.883±0.050 | 0.791±0.041 |
| Wine | 0.646±0.008 | 0.635±0.007 | 0.632±0.005 | 0.640±0.009 | 0.638±0.008 | 0.639±0.009 |
| Yacht | 6.887±0.674 | 1.015±0.054 | 0.902±0.051 | 1.111±0.082 | 1.239±0.109 | 1.117±0.085 |
| Protein | 4.842±0.003 | 4.732±0.013 | NA | 4.505±0.033 | 4.436±0.030 | 4.574±0.023 |
| Year | 9.034 | 8.879 | NA | 8.942 | 9.133 | 8.949 |
  • 23. Summary
¤ Variational inference is generalised through the Rényi divergence
¤ VI/VB, EP, BB-α, VAE, IWAE and VR-max are unified as special cases of the VR bound framework
¤ The choice of \(\alpha\) controls the trade-off between bound tightness and the quality of the sampling approximation