A Hybrid Convolutional Variational Autoencoder for Text Generation
more control over the KL term, which is crucial for training a VAE model. Given the difficulty of generating long sequences in a fully feed-forward manner, we augment our network with an RNN language model layer. To the best of our knowledge, this paper is the first work that successfully applies deconvolutions in the decoder of a latent variable generative model of natural text. We empirically verify that our model is easier to train than its fully recurrent alternative, which, in our experiments, fails to converge on longer texts. To better understand why training VAEs for text is difficult, we carry out detailed experiments, discuss optimization difficulties, and propose effective ways to address them. Finally, we demonstrate that sampling from our model yields realistic texts.

2 Related Work

So far, the majority of neural generative models of text are built on the autoregressive assumption (Larochelle and Murray, 2011; van den Oord et al., 2016). Such models assume that the current data element can be accurately predicted given a sufficient history of elements generated thus far. Conventional RNN-based language models fall into this category and currently dominate the language modeling and generation problem in NLP. Neural architectures based on recurrent (Józefowicz et al., 2016; Zoph and Le, 2016; Ha et al., 2016) or convolutional decoders (Kalchbrenner et al., 2016; Dauphin et al., 2016) provide an effective solution to this problem.

A recent work by Bowman et al. (2016) tackles the language generation problem within the VAE framework (Kingma and Welling, 2013; Rezende et al., 2014). The authors demonstrate that with some care it is possible to successfully learn a latent variable generative model of text. Although their model is slightly outperformed by a traditional LSTM (Hochreiter and Schmidhuber, 1997) language model, it achieves a similar effect as in computer vision, where one can (i) sample realistic sentences by feeding randomly generated latent vectors through the decoder and (ii) linearly interpolate between two points in the latent space. Miao et al. (2015) apply the VAE to bag-of-words representations of documents and to the answer selection problem, achieving good results on both tasks. Yang et al. (2017) discuss a VAE consisting of an RNN encoder and a CNN decoder whose receptive field is limited, and demonstrate that this allows for better control of the KL and reconstruction terms. Hu et al. (2017) build a VAE for text generation and design a cost function that encourages interpretability of the latent variables. Zhang et al. (2016), Serban et al. (2016) and Zhao et al. (2017) apply the VAE to sequence-to-sequence problems, improving over deterministic alternatives. Chen et al. (2016) propose a hybrid model combining autoregressive convolutional layers with the VAE. Based on a Bits-Back coding argument (Hinton and van Camp, 1993), they conclude that when the decoder is powerful enough, the best strategy for the encoder is to make the posterior distribution equivalent to the prior. While they experiment on images, this argument is very relevant to textual data. A recent work by Bousquet et al. (2017) approaches VAEs and GANs from the optimal transport point of view. The authors show that the commonly known blurriness of samples from VAEs trained on image data is a necessary property of the model. While the implications of their argument for models combining latent variables and autoregressive layers trained on non-image data are still unclear, it supports the hypothesis of Chen et al. (2016) that the difficulty of training a hybrid model is not caused by a simple optimization difficulty but rather may be a more principled issue.

Various techniques have been used to improve the training of VAE models, whose total cost represents a trade-off between the reconstruction cost and the KL term: KL-term annealing and input dropout (Bowman et al., 2016; Sønderby et al., 2016), imposing structured sparsity on latent variables (Yeung et al., 2016), and more expressive formulations of the posterior distribution (Rezende and Mohamed, 2015; Kingma et al., 2016). Mescheder et al. (2017) follow the same motivation and combine GANs and VAEs, allowing a model to use arbitrarily complex formulations of both the prior and posterior distributions. In Section 3.4 we propose another efficient technique to control the trade-off between the KL and reconstruction terms.
3 Model

In this section we first briefly explain the VAE framework of Kingma and Welling (2013), then describe our hybrid architecture, where the feed-forward part is composed of a fully convolutional encoder and a decoder that combines deconvolutional layers and a conventional RNN. Finally, we discuss optimization recipes that help the VAE respect the latent variables, which is critical for training a model with a meaningful latent space and for being able to sample realistic sentences.

3.1 Variational Autoencoder

The VAE is a recently introduced latent variable generative model which combines variational inference with deep learning. It modifies the conventional autoencoder framework in two key ways. Firstly, the deterministic internal representation z (provided by the encoder) of an input x is replaced with a posterior distribution q(z|x). Inputs are then reconstructed by sampling z from this posterior and passing it through a decoder. To make sampling easy, the posterior distribution is usually parametrized by a Gaussian with its mean and variance predicted by the encoder. Secondly, to ensure that we can sample from any point of the latent space and still generate valid and diverse outputs, the posterior q(z|x) is regularized with its KL divergence from a prior distribution p(z). The prior is typically chosen to be a Gaussian with zero mean and unit variance, such that the KL term between posterior and prior can be computed in closed form (Kingma and Welling, 2013). The total VAE cost is composed of the reconstruction term, i.e., the negative log-likelihood of the data, and the KL regularizer:

J_vae = KL(q(z|x) || p(z)) − E_{q(z|x)}[log p(x|z)]    (1)

Kingma and Welling (2013) show that the loss function from Eq (1) can be derived from a probabilistic model perspective and that it is an upper bound on the true negative log-likelihood of the data.

One can view a VAE as a traditional autoencoder with some restrictions imposed on the internal representation space. Specifically, using a sample from q(z|x) to reconstruct the input instead of a deterministic z forces the model to map an input to a region of the space rather than to a single point. The most straightforward way to achieve a good reconstruction error in this case is to predict a very sharp probability distribution effectively corresponding to a single point in the latent space (Raiko et al., 2014). The additional KL term in Eq (1) prevents this behavior and forces the model to find a solution with, on the one hand, low reconstruction error and, on the other, predicted posterior distributions close to the prior. Thus, the decoder part of the VAE is capable of reconstructing a sensible data sample from every point in the latent space that has non-zero probability under the prior. This allows for straightforward generation of novel samples and linear operations on the latent codes. Bowman et al. (2016) demonstrate that this does not work in the fully deterministic autoencoder framework. In addition to regularizing the latent space, the KL term indicates how much information the VAE stores in the latent vector.

Figure 1: LSTM VAE model of Bowman et al. (2016).

Bowman et al. (2016) propose a VAE model for text generation where both the encoder and decoder are LSTM networks (Figure 1). We will refer to this model as LSTM VAE in the remainder of the paper. The authors show that adapting VAEs to text generation is more challenging, as the decoder tends to ignore the latent vector (the KL term is close to zero) and falls back to a language model. Two training tricks are required to mitigate this issue: (i) KL-term annealing, where its weight in Eq (1) gradually increases from 0 to 1 during training; and (ii) applying dropout to the inputs of the decoder to limit its expressiveness and thereby force the model to rely more on the latent variables. We will discuss these tricks in more detail in Section 3.4. Next we describe the deconvolutional layer, which is the core element of the decoder in our VAE model.
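As a concrete illustration of Eq (1), the sketch below computes the closed-form Gaussian KL term and the reconstruction term, and draws z with the reparameterization trick. This is a minimal PyTorch sketch under our own naming conventions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def sample_z(mu, logvar):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through the sampling step.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_cost(logits, targets, mu, logvar):
    """Eq (1): J_vae = KL(q(z|x) || N(0, I)) - E_q[log p(x|z)].

    logits:  (batch, seq_len, vocab) unnormalized decoder outputs
    targets: (batch, seq_len) integer character ids
    mu, logvar: (batch, latent_dim) parameters of the Gaussian posterior
    """
    # Reconstruction term: negative log-likelihood of the characters.
    rec = F.cross_entropy(logits.transpose(1, 2), targets, reduction='sum')
    # Closed-form KL between N(mu, sigma^2) and the standard Gaussian prior.
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl, rec, kl
```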
3.2 Deconvolutional Networks

A deconvolutional layer (also referred to as a transposed convolution (Gulrajani et al., 2016) or a fractionally strided convolution (Radford et al., 2015)) performs spatial up-sampling of its inputs and is an integral part of latent variable generative models of images (Radford et al., 2015; Gulrajani et al., 2016) and semantic segmentation algorithms (Noh et al., 2015). Its goal is to perform an "inverse" convolution operation and increase the spatial size of the input while decreasing the number of feature maps. This operation can be viewed as a backward pass of a convolutional layer and can be implemented by simply switching the forward and backward passes of the convolution operation. In the context of generative modeling based on global representations, deconvolutions are typically used as follows: the global representation is first linearly mapped to another representation with small spatial resolution and a large number of feature maps. A stack of deconvolutional layers is then applied to this representation, each layer progressively increasing the spatial resolution and decreasing the number of feature channels. The output of the last layer is an image or, in our case, a text fragment. A notable example of such a model is the deep network of Radford et al. (2015) trained with an adversarial objective. Our model uses a similar approach but is instead trained with the VAE objective.
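The sketch below illustrates this recipe for text: a latent vector is linearly mapped to a short sequence with many channels, and a stack of 1-d transposed convolutions progressively doubles the temporal resolution while shrinking the channel count. The layer sizes, strides, and module names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    """Expand a latent vector into a (batch, vocab, length) feature map."""

    def __init__(self, latent_dim=64, vocab=128, base_len=4, channels=(512, 256, 128)):
        super().__init__()
        self.base_len = base_len
        self.channels = channels
        # Map z to a small "spatial" resolution with many feature maps.
        self.fc = nn.Linear(latent_dim, channels[0] * base_len)
        layers = []
        for c_in, c_out in zip(channels, channels[1:] + (vocab,)):
            # Each transposed convolution doubles the temporal length
            # and decreases the number of feature channels.
            layers += [nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.BatchNorm1d(c_out), nn.ReLU()]
        self.deconv = nn.Sequential(*layers[:-2])  # no norm/ReLU after the last layer

    def forward(self, z):
        h = self.fc(z).view(z.size(0), self.channels[0], self.base_len)
        return self.deconv(h)  # (batch, vocab, base_len * 2**len(channels))

# A latent vector of size 64 expands to a 32-step sequence of per-character features.
feats = DeconvDecoder()(torch.randn(8, 64))
```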
There are two primary motivations for choosing deconvolutional layers instead of the dominantly used recurrent ones: firstly, such layers have extremely efficient GPU implementations due to their fully parallel structure. Secondly, feed-forward architectures are typically easier to optimize than their recurrent counterparts, as the number of back-propagation steps is constant and potentially much smaller than in RNNs. Both points become significant as the length of the generated text increases. Next, we describe our VAE architecture, which blends deconvolutional and RNN layers in the decoder to allow for better control over the KL term.

3.3 Hybrid Convolutional-Recurrent VAE

Our model is composed of two relatively independent modules. The first component is a standard VAE where the encoder and decoder modules are parametrized by convolutional and deconvolutional layers, respectively (see Figure 2(a)). This architecture is attractive for its computational efficiency and simplicity of training.

Figure 2(b): Hybrid model with LSTM decoder.

The other component is a recurrent language model consuming activations from the deconvolutional decoder concatenated with the previous output characters. We consider two flavors of recurrent functions: a conventional LSTM network (Figure 2(b)) and a stack of masked convolutions, also known as the ByteNet decoder from Kalchbrenner et al. (2016) (Figure 2(c)). The primary reason for having a recurrent component in the decoder is to capture dependencies between elements of the text sequences – a hard task for a fully feed-forward architecture. Indeed, the conditional distribution P(x|z) = P(x_1, ..., x_n|z) of generated sentences cannot be richly represented with a feed-forward network. Instead, it factorizes as P(x_1, ..., x_n|z) = ∏_i P(x_i|z), where the components are independent of each other and are conditioned only on z. To minimize the reconstruction cost, the model is forced to encode every detail of a text fragment. A recurrent language model instead models the full joint distribution of output sequences without having to make independence assumptions: P(x_1, ..., x_n|z) = ∏_i P(x_i|x_{i−1}, ..., x_1, z). Thus, adding a recurrent layer on top of our fully feed-forward encoder-decoder architecture relieves it from encoding every aspect of a text fragment into the latent vector and allows it to instead focus on more high-level semantic and stylistic features.
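A minimal sketch of this composition is shown below: at every step the LSTM receives the deconvolutional activation for that position concatenated with the embedding of the previous output character, so both the latent information and the history are available. The module names, sizes, and the use of id 0 as a start symbol are our own illustrative choices, not details taken from the released code.

```python
import torch
import torch.nn as nn

class HybridLSTMDecoder(nn.Module):
    """Recurrent LM on top of deconvolutional activations (teacher forcing)."""

    def __init__(self, vocab=128, feat_dim=128, emb_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.lstm = nn.LSTM(feat_dim + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, deconv_feats, targets):
        # deconv_feats: (batch, feat_dim, seq_len) from the deconvolutional stack
        # targets:      (batch, seq_len) character ids of the text being reconstructed
        feats = deconv_feats.transpose(1, 2)             # (batch, seq_len, feat_dim)
        # Shift targets right so step t only sees characters < t
        # (id 0 is assumed to be a start/padding symbol).
        prev = torch.cat([torch.zeros_like(targets[:, :1]), targets[:, :-1]], dim=1)
        inp = torch.cat([feats, self.embed(prev)], dim=-1)
        h, _ = self.lstm(inp)
        return self.out(h)                               # (batch, seq_len, vocab)
```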
Note that the feed-forward part of our model is different from the existing fully convolutional approaches of Dauphin et al. (2016) and Kalchbrenner et al. (2016) in two respects: firstly, while being fully parallelizable during training, these models still require predictions from previous time steps during inference and thus behave as a variant of recurrent networks. In contrast, the expansion of the z vector is fully parallel in our model (except for the recurrent component). Secondly, our model down- and up-samples a text fragment during processing, while the existing fully convolutional decoders do not. Preserving spatial resolution can be beneficial to the overall result, but comes at a higher computational cost. Lastly, we note that our model imposes an upper bound on the size of text samples it is able to generate. While it is possible to model short texts by adding special padding characters at the end of a sample, generating texts longer than a certain threshold is not possible by design. This is not an unavoidable restriction, since the model can be extended to generate variable-sized text fragments by using, for example, variable-sized latent codes. These extensions, however, are out of the scope of this work.

3.4 Optimization Difficulties

As discussed above, KL term annealing and input dropout help the model achieve solutions with a non-zero KL term. KL term annealing can be viewed as a gradual transition from a conventional deterministic autoencoder to a full VAE. In this work we use linear annealing from 0 to 1. We have experimented with other schedules but did not find them to have a significant impact on the final result. As long as the KL term weight starts to grow sufficiently slowly, the exact shape and speed of its growth does not seem to affect the overall result. We have found the following heuristic to work well: we first run a model with the KL weight fixed to 0 to find the number of iterations it needs to converge. We then configure the annealing schedule to start after the unregularized model has converged and to last for no less than 20% of that amount.
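This heuristic can be wrapped into a simple schedule function, sketched below; `converge_iters` stands for the empirically measured number of iterations the unregularized (KL weight 0) run needed to converge, and the example numbers are placeholders.

```python
def kl_weight(step, converge_iters, anneal_frac=0.2):
    """Linear KL-term annealing from 0 to 1.

    The weight stays at 0 until the unregularized model would have
    converged, then grows linearly over `anneal_frac` of that many
    iterations (20% by default, following the heuristic above).
    """
    start = converge_iters
    duration = max(1, int(anneal_frac * converge_iters))
    if step < start:
        return 0.0
    return min(1.0, (step - start) / duration)

# Example: the unregularized run converged after 10k iterations.
weights = [kl_weight(s, 10_000) for s in (0, 10_000, 11_000, 12_000, 13_000)]
# -> [0.0, 0.0, 0.5, 1.0, 1.0]
```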
While helping to regularize the latent vector, input dropout tends to slow down convergence. We propose an alternative technique to encourage the model to compress information into the latent vector: in addition to the reconstruction cost computed on the outputs of the recurrent language model, we also add an auxiliary reconstruction term computed from the activations of the last deconvolutional layer:

J_aux = −E_{q(z|x)}[log p(x|z)] = −E_{q(z|x)}[ Σ_t log p(x_t|z) ]    (2)

Since at this layer the model does not have access to previous output elements, it needs to rely on the z vector to produce a meaningful reconstruction. The final cost minimized by our model is:

J_hybrid = J_vae + α J_aux    (3)

where α is a hyperparameter, J_aux is the intermediate reconstruction term and J_vae is the bound from Eq (1). Expanding the two terms from Eq (3) gives:

J_hybrid = KL(q(z|x) || p(z)) − E_{q(z|x)}[log p(x|z)] − α E_{q(z|x)}[ Σ_t log p(x_t|z) ]    (4)

Optimizing the auxiliary term forces the latent vector to carry information about the text, which in turn improves the main reconstruction term. We are thus able to encode information in the latent vector without hurting the expressiveness of the decoder.
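Putting Eqs (1)–(3) together, a sketch of the full training objective looks as follows; `aux_logits` are per-character predictions read directly off the last deconvolutional layer, `lm_logits` are the outputs of the recurrent language model, and the Gaussian KL is the same closed form as in Section 3.1. Again, this is an illustrative sketch rather than the released code.

```python
import torch
import torch.nn.functional as F

def hybrid_cost(lm_logits, aux_logits, targets, mu, logvar, alpha=0.2, kl_w=1.0):
    """Eq (4): KL + main reconstruction + alpha * auxiliary reconstruction.

    lm_logits:  (batch, seq_len, vocab) outputs of the recurrent LM
    aux_logits: (batch, vocab, seq_len) per-character outputs of the
                deconvolutional decoder, which sees only z
    kl_w:       KL annealing weight from the schedule sketched above
    """
    rec = F.cross_entropy(lm_logits.transpose(1, 2), targets, reduction='sum')
    aux = F.cross_entropy(aux_logits, targets, reduction='sum')
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_w * kl + alpha * aux
```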
One can view the objective function in Eq (4) as a joint objective for two VAEs: one purely feed-forward, as in Figure 2(a), and the other combining feed-forward and recurrent parts, as in Figures 2(b) and 2(c), that partially share parameters. Since the feed-forward VAE is incapable of producing reasonable reconstructions without making use of the latent vector, the full architecture also gains access to the latent vector through the shared parameters. We note that this trick comes at the cost of a worse result on the density estimation task, since part of the parameters of the full model are trained to optimize an objective that does not capture all the dependencies that exist in the textual data. However, the gap between a purely deterministic LM and our model is small and easily controllable by the α hyperparameter. We refer the reader to Figure 4 for quantitative results regarding the effect of α on the performance of our model on the LM task.

4 Experiments

We use KL term annealing and input dropout when training the LSTM VAE models from Bowman et al. (2016), and KL term annealing and the regularized objective function from Eq (3) when training our models. All models were trained with the Adam optimization algorithm (Kingma and Ba, 2014) with a decaying learning rate. We use Layer Normalization (Ba et al., 2016) in LSTM layers and Batch Normalization (Ioffe and Szegedy, 2015) in convolutional and deconvolutional layers. To make our results easy to reproduce, we have released the source code of all our experiments at https://github.com/stas-semeniuta/textvae.
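As a rough illustration of this training setup (not the released configuration; the learning rate and decay factor below are placeholder values), the optimizer and decay schedule could be wired up as follows.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64))  # stand-in for the full encoder/decoder stack
# Adam with a decaying learning rate; the concrete values are assumptions.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)

for epoch in range(20):
    # ... one pass over the training data, using the annealed KL weight ...
    scheduler.step()
```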
Data. Our first task is character-level language generation performed on the standard Penn Treebank dataset (Marcus et al., 1993). One of the goals is to test the ability of the models to successfully learn representations of long sequences. For training, fixed-size data samples are selected from random positions in the standard training and validation sets.

4.1 Comparison with LSTM VAE

Historyless decoding. We start with an experiment where the decoder is forced to ignore the history and has to rely fully on the latent vector. By conditioning the decoder only on the latent vector z, we can directly compare the expressiveness of the two models. For the LSTM VAE model, historyless decoding is achieved by applying dropout to the input elements with a dropout rate equal to 1, so that its decoder is conditioned only on the z vector and, implicitly, on the number of tokens generated so far. We compare it to our fully feed-forward model without the recurrent layer in the decoder (Figure 2(a)). Both networks are parametrized to have a comparable number of parameters.
To test how well both models can cope with the stochasticity of the latent vectors, we minimize only the reconstruction term from Eq (1). This is equivalent to a pure autoencoder setting with a stochastic internal representation and no regularization of the latent space. This experiment corresponds to the initial stage of training with KL term annealing, when its weight is set to 0. We pursue two goals with this experiment: firstly, we investigate how the two alternative encoders behave at the beginning of training and establish a lower bound on the quality of the reconstructions. Secondly, we attempt to put the Bits-Back coding argument of Chen et al. (2016) in context. The authors assume the encoder to be powerful enough to produce a good representation of the data. One interpretation of this argument applied to textual data is that factorizing the joint probability as p(x) = ∏_t p(x_t|x_<t) provides the model with a sufficiently powerful decoder that does not need the latent variables. However, our experimental results suggest that an LSTM encoder may not be a sufficiently expressive encoder for VAEs on textual data, potentially making the argument inapplicable.

The results are presented in Figure 3. Note that when the length of input samples reaches 30 characters, the historyless LSTM autoencoder fails to fit the data well, while the convolutional architecture converges almost instantaneously. The results appear even worse for LSTMs on sequences of 50 characters. To make sure that this effect is not caused by optimization difficulties, i.e., exploding gradients (Pascanu et al., 2013), we have searched over learning rates, gradient clipping thresholds and sizes of LSTM layers, but were only able to get results comparable to those shown in Figure 3. Note that the LSTM networks make use of Layer Normalization (Ba et al., 2016), which has been shown to make training of such networks easier. These results suggest that our model is easier to train than the LSTM-based model, especially for modeling longer pieces of text. Additionally, our model is computationally faster by a factor of roughly two, since we run only one recurrent network per sample and the time complexity of the convolutional part is negligible in comparison.

Figure 3: Training curves of the LSTM autoencoder and our model on samples of different length, in bits-per-character. Solid and dashed lines show training and validation curves, respectively. Note that the model exhibits little to no overfitting, since the validation curve follows the training one almost perfectly.
Decoding with history. We now move to the case where the decoder is conditioned on both the latent vector and previous output elements. In these experiments we pursue two goals: firstly, we verify whether the results obtained in the historyless decoding task also generalize to the less restricted case. Secondly, we study how well the models cope with the stochasticity introduced by the latent variables. Note that we do not attempt to improve the state-of-the-art result on the language modeling task but instead focus on providing an apples-to-apples comparison of the two models, focusing on how effective the encoder is at producing a meaningful latent vector. However, we note that our model performs fairly well on the LM task: it is only slightly worse than a purely deterministic language model trained in the same environment, and is comparable to the one of Bowman et al. (2016) in this regard.

Figure 4: The full cost (solid lines) and its KL component (dashed lines), in bits-per-character, of our Hybrid model trained with α = 0.2 vs. the LSTM-based VAE trained with 0.2 and 0.5 input dropout, measured on the validation partition.

We fix the input dropout rates at 0.2 and 0.5 for the LSTM VAE and use the auxiliary reconstruction loss (Section 3.4) with a weight of 0.2 in our Hybrid model. The bits-per-character scores on differently sized text samples are presented in Figure 4. As discussed in Section 3.1, the KL term value indicates how much information the network stores in the latent vector. We observe that the amount of information stored in the latent vector by our model and the LSTM VAE is comparable when we train on short samples, and largely depends on the hyper-parameters α and p. When the length of a text fragment increases, the LSTM VAE puts less information into the latent vector (i.e., the KL component is small) and, for texts longer than 48 characters, the KL term drops to almost zero, while for our model the ratio between the KL and reconstruction terms stays roughly constant. This suggests that our model is better at encoding latent representations of long texts, since the amount of information in the latent vector does not decrease as the length of a text fragment grows. In contrast, there is a steady decline of the KL term of the LSTM VAE model. This result is consistent with our findings from the historyless decoding experiment. Note that in both of these experiments the LSTM VAE model fails to produce meaningful latent vectors with inputs over 50 characters long. This further suggests that our Hybrid model encodes long texts into the latent vector more effectively.
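All costs in Figures 3–6 are reported in bits-per-character; converting a summed negative log-likelihood in nats to this unit is a one-line rescaling, sketched below for reference.

```python
import math

def bits_per_character(nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) to bits-per-character."""
    return nll_nats / (num_chars * math.log(2))

# e.g. a total NLL of 1040 nats over 1000 characters is ~1.5 bits per character
print(bits_per_character(1040.0, 1000))  # ~1.50
```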
4.2 Controlling the KL term

We study the effect of various training techniques that help control the KL term, which is crucial for training a generative VAE model.
Aux cost weight. First, we provide a detailed view of how the optimization tricks discussed in Section 3.4 affect the performance of our Hybrid model. Figure 5 presents results of our model trained with different values of α from Eq (3). Note that the inclusion of the auxiliary reconstruction loss slightly harms the bound on the likelihood of the data, but helps the model to rely more on the latent vector as α grows. A similar effect on the model's bound was observed by Bowman et al. (2016): increased input dropout rates force their model to put more information into the z vector, but at the cost of increased final loss values. This is the trade-off that allows for sampling outputs in the VAE framework. Note that our model can find a solution with non-trivial latent vectors when trained with the full VAE loss, provided that the α hyper-parameter is large enough. Combining it with KL term annealing helps to find non-zero KL term solutions at smaller α values.

Figure 5: The full cost (solid line) and the KL component (dashed line) of our Hybrid model with LSTM decoder trained with various α, with and without KL term weight annealing, measured on the validation partition.

Receptive field. The goal of this experiment is to study the relationship between the KL term values and the expressiveness of the decoder. Without KL term annealing and input dropout, the RNN decoder in the LSTM VAE tends to completely ignore the information stored in the latent vector and essentially falls back to an RNN language model. To have full control over the receptive field size of the recurrent component in our decoder, we experiment with masked convolutions (Figure 2(c)), similar to the decoder in the ByteNet model of Kalchbrenner et al. (2016). We fix the size of the convolutional kernels to 2 and do not use dilated convolutions or skip connections as in the original ByteNet.

The resulting receptive field size of the recurrent layer in our decoder is equal to N + 1 characters, where N is the number of convolutional layers. We vary the number of layers to find the number of preceding characters that our model can consume without collapsing the KL term to zero.
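A sketch of such a masked (causal) convolutional stack is given below: with a kernel size of 2 and no dilation, each of the N layers extends the receptive field by one position, giving the N + 1 characters stated above. The layer width is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MaskedConvStack(nn.Module):
    """Stack of causal 1-d convolutions with kernel size 2 (no dilation)."""

    def __init__(self, num_layers, channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2) for _ in range(num_layers)])

    def forward(self, x):
        # x: (batch, channels, seq_len); left-pad by 1 before each layer so that
        # position t never sees positions > t (the "masked" part).
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (1, 0))))
        return x

# Receptive field grows by one character per layer: N layers -> N + 1 characters.
for n in (1, 2, 3, 9):
    print(n, "layers -> receptive field of", n + 1, "characters")
```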
The results of these experiments are presented in Figure 6. Interestingly, with a receptive field size larger than 3 and without the auxiliary reconstruction term from Eq (3) (α = 0), the KL term collapses to zero and the model falls back to a pure language model. This suggests that the training signal received from the previous characters is much stronger than that from the input to be reconstructed. Using the auxiliary reconstruction term, however, helps to find solutions with a non-zero KL term component irrespective of the receptive field size. Note that increasing the value of α results in larger values of the KL component. This is consistent with the results obtained with the LSTM decoder in Figure 5.

Figure 6: The full cost (solid line) and the KL component (dashed line) of our Hybrid model with ByteNet decoder trained with various numbers of convolutional layers (for α = 0.0, 0.2, 0.5), measured on the validation partition.
Table 1: Random sample tweets generated by the LSTM VAE (top) and our Hybrid model (bottom).

@userid @userid @userid @userid @userid ...
I want to see you so much @userid #FollowMeCam ...
@userid @userid @userid @userid @userid ...
Why do I start the day today?
@userid thanks for the follow back

no matter what I'm doing with my friends they are so cute
@userid Hello How are you doing
I wanna go to the UK tomorrow!! #feelinggood #selfie #instago
@userid @userid I'll come to the same time and it was a good day too xx

Table 2: Breakdown into KL and reconstruction terms for character-level tweet generation; p refers to the input dropout rate.

                   Rec    KL
LSTM, p = 0.2      67.4   1.0
LSTM, p = 0.5      77.1   2.1
LSTM, p = 0.8      93.7   3.8
Hybrid, α = 0.2    58.5  12.5
References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.

Philip Bachman. 2016. An architecture for deep, hierarchical generative models. In NIPS, pages 4826–4834.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. 2017. From optimal transport to generative modeling: the vegan cookbook. CoRR, abs/1705.07642.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL, pages 10–21.

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Variational lossy autoencoder. CoRR, abs/1611.02731.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. CoRR, abs/1612.08083.

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.

Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vázquez, and Aaron C. Courville. 2016. PixelVAE: A latent variable model for natural images. CoRR, abs/1611.05013.

David Ha, Andrew M. Dai, and Quoc V. Le. 2016. Hypernetworks. CoRR, abs/1609.09106.

Geoffrey E. Hinton and Drew van Camp. 1993. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory (COLT 1993), pages 5–13.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, pages 1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Controllable text generation. CoRR, abs/1703.00955.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456.

Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. CoRR, abs/1602.02410.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Diederik P. Kingma, Tim Salimans, and Max Welling. 2016. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. CoRR, abs/1312.6114.

Hugo Larochelle and Iain Murray. 2011. The neural autoregressive distribution estimator. In AISTATS, pages 29–37.

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. 2015. Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. 2017. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. CoRR, abs/1701.04722.

Yishu Miao, Lei Yu, and Phil Blunsom. 2015. Neural variational inference for text processing. CoRR, abs/1511.06038.

Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. CoRR, abs/1505.04366.

Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. CoRR, abs/1601.06759.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML, pages 1310–1318.

Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434.

Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. 2014. Techniques for learning binary stochastic feedforward neural networks. CoRR, abs/1406.2989.

Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In ICML, pages 1530–1538.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. CoRR, abs/1703.10960.

Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578.