0% found this document useful (0 votes)
3 views

A Hybrid Convolutional Variational Autoencoder for Text Generation

This paper presents a novel hybrid architecture for variational autoencoders (VAEs) aimed at improving text generation by combining convolutional and recurrent components. The proposed model addresses challenges faced by traditional RNN-based VAEs, such as the collapse of latent variables, by utilizing deconvolutional layers in the decoder and enhancing control over generated text attributes. Empirical results demonstrate that this hybrid approach yields faster training, better handling of long sequences, and realistic text generation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

A Hybrid Convolutional Variational Autoencoder for Text Generation

This paper presents a novel hybrid architecture for variational autoencoders (VAEs) aimed at improving text generation by combining convolutional and recurrent components. The proposed model addresses challenges faced by traditional RNN-based VAEs, such as the collapse of latent variables, by utilizing deconvolutional layers in the decoder and enhancing control over generated text attributes. Empirical results demonstrate that this hybrid approach yields faster training, better handling of long sequences, and realistic text generation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

A Hybrid Convolutional Variational Autoencoder for Text Generation

Stanislau Semeniuta1 Aliaksei Severyn2 Erhardt Barth1


1
Universität zu Lübeck, Institut für Neuro- und Bioinformatik
{stas,barth}@inb.uni-luebeck.de
2
Google Research Europe
[email protected]

Abstract generative models is to be able to generate realis-


tic examples as if they were drawn from the input
In this paper we explore the effect of archi- data distribution by simply feeding noise vectors
tectural choices on learning a variational through the decoder. Additionally, the latent rep-
autoencoder (VAE) for text generation. In resentations obtained by applying the encoder to
contrast to the previously introduced VAE input examples give a fine-grained control over the
model for text where both the encoder generation process that is harder to achieve with
and decoder are RNNs, we propose a more conventional autoregressive models. Similar
novel hybrid architecture that blends fully to compelling examples from image generation,
feed-forward convolutional and deconvo- where it is possible to condition generated human
lutional components with a recurrent lan- faces on various attributes such as hair, skin color
guage model. Our architecture exhibits and style (Yan et al., 2015; Larsen et al., 2015), in
several attractive properties such as faster text generation it should be possible to also control
run time and convergence, ability to bet- various attributes of the generated sentences, such
ter handle long sequences and, more im- as, for example, sentiment or writing style.
portantly, it helps to avoid the issue of the While training VAE-based models seems to
VAE collapsing to a deterministic model. pose little difficulty when applied to the tasks of
generating natural images (Bachman, 2016; Gul-
1 Introduction
rajani et al., 2016) and speech (Fraccaro et al.,
Generative models of texts are currently at the 2016), their application to natural text generation
cornerstone of natural language understanding en- requires additional care (Bowman et al., 2016;
abling recent breakthroughs in machine transla- Miao et al., 2015). As discussed by Bowman et al.
tion (Bahdanau et al., 2014; Wu et al., 2016), dia- (2016), the core difficulty of training VAE models
logue modelling (Serban et al., 2016), abstractive is the collapse of the latent loss (represented by the
summarization (Rush et al., 2015), etc. KL divergence term) to zero. In this case the gen-
Currently, RNN-based generative models hold erator tends to completely ignore latent represen-
state-of-the-art results in both unconditional tations and reduces to a standard language model.
(Józefowicz et al., 2016; Ha et al., 2016) and con- This is largely due to the high modeling power of
ditional (Vinyals et al., 2014) text generation. At the RNN-based decoders which with sufficiently
a high level, these models represent a class of au- small history can achieve low reconstruction er-
toregressive models that work by generating out- rors while not relying on the latent vector provided
puts sequentially one step at a time where the next by the encoder.
predicted element is conditioned on the history of In this paper, we propose a novel VAE model for
elements generated thus far. texts that is more effective at forcing the decoder
Variational autoencoders (VAE), recently intro- to make use of latent vectors. Contrary to existing
duced by (Kingma and Welling, 2013; Rezende work, where both encoder and decoder layers are
et al., 2014), offer a different approach to genera- LSTMs, the core of our model is a feed-forward
tive modeling by integrating stochastic latent vari- architecture composed of one-dimensional convo-
ables into the conventional autoencoder architec- lutional and deconvolutional (Zeiler et al., 2010)
ture. The primary purpose of learning VAE-based layers. This choice of architecture helps to gain

627
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637
Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics
more control over the KL term, which is crucial coder so that the decoder’s receptive field is lim-
for training a VAE model. Given the difficulty of ited. They demonstrate that this allows for a better
generating long sequences in a fully feed-forward control of KL and reconstruction terms. Hu et al.
manner, we augment our network with an RNN (2017) build a VAE for text generation and de-
language model layer. To the best of our knowl- sign a cost function that encourages interpretabil-
edge, this paper is the first work that successfully ity of the latent variables. Zhang et al. (2016),
applies deconvolutions in the decoder of a latent Serban et al. (2016) and Zhao et al. (2017) ap-
variable generative model of natural text. We em- ply VAE to sequence-to-sequence problems, im-
pirically verify that our model is easier to train proving over deterministic alternatives. Chen et al.
than its fully recurrent alternative, which, in our (2016) propose a hybrid model combining autore-
experiments, fails to converge on longer texts. To gressive convolutional layers with the VAE. The
better understand why training VAEs for texts is authors make an argument based on the Bit-Back
difficult we carry out detailed experiments, dis- coding (Hinton and van Camp, 1993) that when
cuss optimization difficulties, and propose effec- the decoder is powerful enough the best thing for
tive ways to address them. Finally, we demon- the encoder to do is to make the posterior distri-
strate that sampling from our model yields realistic bution equivalent to the prior. While they exper-
texts. iment on images, this argument is very relevant
to the textual data. A recent work by Bousquet
2 Related Work et al. (2017) approaches VAEs and GANs from
the optimal transport point of view. The authors
So far, the majority of neural generative mod- show that commonly known blurriness of sam-
els of text are built on the autoregressive as- ples from VAEs trained on image data are a nec-
sumption (Larochelle and Murray, 2011; van den essary property of the model. While the implica-
Oord et al., 2016). Such models assume that the tions of their argument to models combining la-
current data element can be accurately predicted tent variables and autoregressive layers trained on
given sufficient history of elements generated thus non-image data are still unclear, the argument sup-
far. The conventional RNN based language mod- ports the hypothesis of Chen et al. (2016) that dif-
els fall into this category and currently dominate ficulty of training a hybrid model is not caused by
the language modeling and generation problem a simple optimization difficulty but rather may be
in NLP. Neural architectures based on recurrent a more principled issue.
(Józefowicz et al., 2016; Zoph and Le, 2016; Ha Various techniques to improve training of VAE
et al., 2016) or convolutional decoders (Kalch- models where the total cost represents a trade-off
brenner et al., 2016; Dauphin et al., 2016) provide between the reconstruction cost and KL term have
an effective solution to this problem. been used so far: KL-term annealing and input
A recent work by Bowman et al. (2016) tack- dropout (Bowman et al., 2016; Sønderby et al.,
les language generation problem within the VAE 2016), imposing structured sparsity on latent vari-
framework (Kingma and Welling, 2013; Rezende ables (Yeung et al., 2016) and more expressive for-
et al., 2014). The authors demonstrate that with mulations of the posterior distribution (Rezende
some care it is possible to successfully learn a la- and Mohamed, 2015; Kingma et al., 2016). A
tent variable generative model of text. Although work by (Mescheder et al., 2017) follows the same
their model is slightly outperformed by a tradi- motivation and combines GANs and VAEs allow-
tional LSTM (Hochreiter and Schmidhuber, 1997) ing a model to use arbitrary complex formulations
language model, their model achieves a similar ef- of both prior and posterior distributions. In Sec-
fect as in computer vision where one can (i) sam- tion 3.4 we propose another efficient technique to
ple realistic sentences by feeding randomly gen- control the trade-off between KL and reconstruc-
erated novel latent vectors through the decoder tion terms.
and (ii) linearly interpolate between two points in
the latent space. Miao et al. (2015) apply VAE 3 Model
to bag-of-words representations of documents and
the answer selection problem achieving good re- In this section we first briefly explain the VAE
sults on both tasks. Yang et al. (2017) discuss framework of Kingma and Welling (2013), then
a VAE consisting of RNN encoder and CNN de- describe our hybrid architecture where the feed-

628
map an input to a region of the space rather than
to a single point. The most straight-forward way to
achieve a good reconstruction error in this case is
to predict a very sharp probability distribution ef-
fectively corresponding to a single point in the la-
Figure 1: LSTM VAE model of (Bowman et al., tent space (Raiko et al., 2014). The additional KL
2016) term in Eq (1) prevents this behavior and forces the
model to find a solution with, on one hand, low re-
forward part is composed of a fully convolutional
construction error and, on the other, predicted pos-
encoder and a decoder that combines deconvolu-
terior distributions close to the prior. Thus, the de-
tional layers and a conventional RNN. Finally, we
coder part of the VAE is capable of reconstructing
discuss optimization recipes that help VAE to re-
a sensible data sample from every point in the la-
spect latent variables, which is critical training a
tent space that has non-zero probability under the
model with a meaningful latent space and being
prior. This allows for straightforward generation
able to sample realistic sentences.
of novel samples and linear operations on the la-
3.1 Variational Autoencoder tent codes. Bowman et al. (2016) demonstrate
that this does not work in the fully deterministic
The VAE is a recently introduced latent vari-
Autoencoder framework . In addition to regulariz-
able generative model, which combines varia-
ing the latent space, KL term indicates how much
tional inference with deep learning. It modifies the
information the VAE stores in the latent vector.
conventional autoencoder framework in two key
ways. Firstly, a deterministic internal representa- Bowman et al. (2016) propose a VAE model for
tion z (provided by the encoder) of an input x is re- text generation where both encoder and decoder
placed with a posterior distribution q(z|x). Inputs are LSTM networks (Figure 1). We will refer to
are then reconstructed by sampling z from this this model as LSTM VAE in the remainder of the
posterior and passing them through a decoder. To paper. The authors show that adapting VAEs to
make sampling easy, the posterior distribution is text generation is more challenging as the decoder
usually parametrized by a Gaussian with its mean tends to ignore the latent vector (KL term is close
and variance predicted by the encoder. Secondly, to zero) and falls back to a language model. Two
to ensure that we can sample from any point of training tricks are required to mitigate this issue:
the latent space and still generate valid and diverse (i) KL-term annealing where its weight in Eq (1)
outputs, the posterior q(z|x) is regularized with gradually increases from 0 to 1 during the training;
its KL divergence from a prior distribution p(z). and (ii) applying dropout to the inputs of the de-
The prior is typically chosen to be also a Gaussian coder to limit its expressiveness and thereby forc-
with zero mean and unit variance, such that the KL ing the model to rely more on the latent variables.
term between posterior and prior can be computed We will discuss these tricks in more detail in Sec-
in closed form (Kingma and Welling, 2013). The tion 3.4. Next we describe a deconvolutional layer,
total VAE cost is composed of the reconstruction which is the core element of the decoder in our
term, i.e., negative log-likelihood of the data, and VAE model.
the KL regularizer:
3.2 Deconvolutional Networks
Jvae = KL(q(z|x)||p(z))
(1) A deconvolutional layer (also referred to as trans-
−Eq(z|x) [log p(x|z)]
posed convolutions (Gulrajani et al., 2016) and
Kingma and Welling (2013) show that the loss fractionally strided convolutions (Radford et al.,
function from Eq (1) can be derived from the 2015)) performs spatial up-sampling of its inputs
probabilistic model perspective and it is an upper and is an integral part of latent variable genera-
bound on the true negative likelihood of the data. tive models of images (Radford et al., 2015; Gulra-
One can view a VAE as a traditional Autoen- jani et al., 2016) and semantic segmentation algo-
coder with some restrictions imposed on the in- rithms (Noh et al., 2015). Its goal is to perform an
ternal representation space. Specifically, using a “inverse” convolution operation and increase spa-
sample from the q(z|x) to reconstruct the input tial size of the input while decreasing the number
instead of a deterministic z, forces the model to of feature maps. This operation can be viewed as

629
(b) Hybrid model with LSTM
decoder

(c) Hybrid model with ByteNet


(a) Fully feed-forward component of our VAE model decoder

Figure 2: Illustrations of our proposed models.

a backward pass of a convolutional layer and can erated text increases. Next, we describe our VAE
be implemented by simply switching the forward architecture that blends deconvolutional and RNN
and backward passes of the convolution operation. layers in the decoder to allow for better control
In the context of generative modeling based on over the KL-term.
global representations, the deconvolutions are typ-
ically used as follows: the global representation 3.3 Hybrid Convolutional-Recurrent VAE
is first linearly mapped to another representation Our model is composed of two relatively inde-
with small spatial resolution and large number of pendent modules. The first component is a stan-
feature maps. A stack of deconvolutional layers dard VAE where the encoder and decoder modules
is then applied to this representation, each layer are parametrized by convolutional and deconvolu-
progressively increasing spatial resolution and de- tional layers respectively (see Figure 2(a)). This
creasing the amount of feature channels. The out- architecture is attractive for its computational effi-
put of the last layer is an image or, in our case, ciency and simplicity of training.
a text fragment. A notable example of such a The other component is a recurrent language
model is the deep network of (Radford et al., 2015) model consuming activations from the deconvo-
trained with adversarial objective. Our model uses lutional decoder concatenated with the previous
a similar approach but is instead trained with the output characters. We consider two flavors of re-
VAE objective. current functions: a conventional LSTM network
There are two primary motivations for choos- (Figure 2(b)) and a stack of masked convolutions
ing deconvolutional layers instead of the dom- also known as the ByteNet decoder from Kalch-
inantly used recurrent ones: firstly, such lay- brenner et al. (2016) (Figure 2(c)). The primary
ers have extremely efficient GPU implementations reason for having a recurrent component in the
due to their fully parallel structure. Secondly, decoder is to capture dependencies between ele-
feed-forward architectures are typically easier to ments of the text sequences – a hard task for a
optimize than their recurrent counterparts, as the fully feed-forward architecture. Indeed, the condi-
number of back-propagation steps is constant and tional distribution P (x|z) = P (x1 , . . . , xn |z) of
potentially much smaller than in RNNs. Both generated sentences cannot be richly represented
points become significant as the length of the gen- with a feed-forward network. Instead, it factor-

630
Q
izes as: P (x1 , . . . , xn |z) = i P (xi |z) where achieve solutions with non-zero KL term.
components are independent of each other and are KL term annealing can be viewed as a grad-
conditioned only on z. To minimize the recon- ual transition from conventional deterministic Au-
struction cost the model is forced to encode ev- toencoder to a full VAE. In this work we use linear
ery detail of a text fragment. A recurrent lan- annealing from 0 to 1. We have experimented with
guage model instead models the full joint distribu- other schedules but did not find them to have a sig-
tion of output sequences without having to make nificant impact on the final result. As long as the
independence assumptions P (x1 , . . . , xn |z) =
Q KL term weight starts to grow sufficiently slowly,
i P (xi |xi−1 , . . . , x1 , z). Thus, adding a re- the exact shape and speed of its growth does not
current layer on top of our fully feed-forward seem to affect the overall result. We have found
encoder-decoder architecture relieves it from en- the following heuristic to work well: we first run a
coding every aspect of a text fragment into the la- model with KL weight fixed to 0 to find the num-
tent vector and allows it to instead focus on more ber of iterations it needs to converge. We then con-
high-level semantic and stylistic features. figure the annealing schedule to start after the un-
Note that the feed-forward part of our model regularized model has converged and last for no
is different from the existing fully convolutional less than 20% of that amount.
approaches of Dauphin et al. (2016) and Kalch- While helping to regularize the latent vector, in-
brenner et al. (2016) in two respects: firstly, while put dropout tends to slow down convergence. We
being fully parallelizable during training, these propose an alternative technique to encourage the
models still require predictions from previous time model to compress information into the latent vec-
steps during inference and thus behave as a vari- tor: in addition to the reconstruction cost com-
ant of recurrent networks. In contrast, expansion puted on the outputs of the recurrent language
of the z vector is fully parallel in our model (ex- model, we also add an auxiliary reconstruction
cept for the recurrent component). Secondly, our term computed from the activations of the last de-
model down- and up-samples a text fragment dur- convolutional layer:
ing processing while the existing fully convolu-
tional decoders do not. Preserving spatial reso-
Jaux = −Eq(z|x) [log p(x|z)]
lution can be beneficial to the overall result, but X
= −Eq(z|x) [ log p(xt |z)]. (2)
comes at a higher computational cost. Lastly, we
note that our model imposes an upper bound on the t
size of text samples it is able to generate. While it Since at this layer the model does not have ac-
is possible to model short texts by adding special cess to previous output elements it needs to rely on
padding characters at the end of a sample, generat- the z vector to produce a meaningful reconstruc-
ing texts longer than certain thresholds is not pos- tion. The final cost minimized by our model is:
sible by design. This is not an unavoidable restric-
tion, since the model can be extended to generate Jhybrid = Jvae + αJaux (3)
variable sized text fragments by, for example, vari-
able sized latent codes. These extensions however where α is a hyperparameter, Jaux is the interme-
are out of scope of this work. diate reconstruction term and Jvae is the bound
from Eq (1). Expanding the two terms from Eq (3)
3.4 Optimization Difficulties gives:

The addition of the recurrent component results in Jhybrid = KL(q(z|x)||p(z))


optimization difficulties that are similar to those X
−Eq(z|x) [ log p(xt |z, x<t )]
described by Bowman et al. (2016). In most cases t (4)
the model converges to a solution with a vanish- X
−αEq(z|x) [ log p(xt |z)].
ingly small KL term, thus effectively falling back
t
to a conventional language model. Bowman et al.
(2016) have proposed to use input dropout and KL The objective function from Eq (4) puts a mild
term annealing to encourage their model to encode constraint on the latent vector to produce features
meaningful representations into the z vector. We useful for historyless reconstruction. Since the
found that these techniques also help our model to autoregressive part reuses these features, it also

631
improves the main reconstruction term. We are history and has to rely fully on the latent vec-
thus able to encode information in the latent vector tor. By conditioning the decoder only on the la-
without hurting expressiveness of the decoder. tent vector z we can directly compare the expres-
One can view the objective function in Eq 4 as siveness of the compared models. For the LSTM
a joint objective for two VAEs: one only feed- VAE model historyless decoding is achieved by
forward, as in Figure 2(a), and the other combin- using the dropout on the input elements with the
ing feed-forward and recurrent parts, as in Fig- dropout rate equal to 1 so that its decoder is only
ures 2(b) and 2(c), that partially share parameters. conditioned on the z vector and, implicitly, on the
Since the feed-forward VAE is incapable of pro- number tokens generated so far. We compare it to
ducing reasonable reconstructions without making our fully-feedforward model without the recurrent
use of the latent vector, the full architecture also layer in the decoder (Figure 2(a)). Both networks
gains access to the latent vector through shared are parametrized to have comparable number of
parameters. We note that this trick comes at a parameters.
cost of worse result on the density estimation task,
To test how well both models can cope with the
since part of the parameters of the full model are
stochasticity of the latent vectors, we minimize
trained to optimize an objective that does not cap-
only the reconstruction term from Eq. (1). This
ture all the dependencies that exist in the textual
is equivalent to a pure autoencoder setting with
data. However, the gap between purely determin-
stochastic internal representation and no regular-
istic LM and our model is small and easily control-
ization of the latent space. This experiment cor-
lable by the α hyperparameter. We refer the reader
responds to an initial stage of training with KL
to Figure 4 for quantitative results regarding the
term annealing when its weight is set to 0. We
effect of α on the performance of our model on
pursue two goals with this experiment: firstly, we
the LM task.
investigate how do the two alternative encoders
4 Experiments behave in the beginning of training and establish
a lower bound on the quality of the reconstruc-
We use KL term annealing and input dropout when tions. Secondly, we attempt to put the Bit Back
training the LSTM VAE models from Bowman coding argument from Chen et al. (2016) in con-
et al. (2016) and KL term annealing and regular- text. The authors assume the encoder to be power-
ized objective function from Eq (3) when train- ful enough to produce a good representation of the
ing our models. All models were trained with the data. One interpretation of this argument applied
Adam optimization algorithm (Kingma and Ba, to textual data isQthat factorizing the joint proba-
2014) with decaying learning rate. We use Layer bility as p(x) = t p(xt |x<t ) provides the model
Normalization (Ba et al., 2016) in LSTM lay- with a sufficiently powerful decoder that does not
ers and Batch Normalization (Ioffe and Szegedy, need the latent variables. However, our experi-
2015) in convolutional and deconvolutional layers. mental results suggest that LSTM encoder may not
To make our results easy to reproduce we have re- be a sufficiently expressive encoder for VAEs for
leased the source code of all our experiments1 . textual data, potentially making the argument in-
Data. Our first task is character-level language applicable.
generation performed on the standard Penn Tree- The results are presented in Figure 3. Note that
bank dataset (Marcus et al., 1993). One of the when the length of input samples reaches 30 char-
goals is to test the ability of the models to success- acters, the historyless LSTM autoencoder fails to
fully learn the representations of long sequences. fit the data well, while the convolutional architec-
For training, fixed-size data samples are selected ture converges almost instantaneously. The results
from random positions in the standard training and appear even worse for LSTMs on sequences of 50
validation sets. characters. To make sure that this effect is not
caused by optimization difficulties, i.e. exploding
4.1 Comparison with LSTM VAE
gradients (Pascanu et al., 2013), we have searched
Historyless decoding. We start with an exper- over learning rates, gradient clipping thresholds
iment where the decoder is forced to ignore the and sizes of LSTM layers but were only able to
1
https://github.com/stas-semeniuta/ get results comparable to those shown in Figure 3.
textvae Note that LSTM networks make use of Layer Nor-

632
6 6 6 6
Bits-per-character

Bits-per-character

Bits-per-character

Bits-per-character
5 convvae 5 convvae 5 convvae 5 convvae
4 lstmvae 4 lstmvae 4 lstmvae 4 lstmvae
3 3 3 3
2 2 2 2
1 1 1 1
0 0 0 0
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Iteration Iteration Iteration Iteration

(a) 10 characters (b) 20 characters (c) 30 characters (d) 50 characters

Figure 3: Training curves of LSTM autoencoder and our model on samples of different length. Solid
and dashed lines show training and validation curves respectively. Note that the model exhibits little to
no overfitting since the validation curve follows the training one almost perfectly.

3.0 proach capable of generating long and diverse se-


hybrid, α =0.2 quences. We experiment on the task to obtain a
2.5 lstmvae, p =0.2
detailed picture of how are our model and LSTM
lstmvae, p =0.5
VAE affected by various choices and compare the
Bits­per­character

2.0
two models, focusing on how effective is the en-
1.5
coder at producing meaningful latent vector. How-
1.0 ever, we note that our model performs fairly well
on the LM task and is only slightly worse than
0.5
purely deterministic Language Model, trained in
0.0 the same environment, and is comparable to the
12 24 36 48 60
Text fragment size, characters one of Bowman et al. (2016) in this regard.
We fix input dropout rates at 0.2 and 0.5 for
Figure 4: The full cost (solid lines) and its LSTM VAE and use auxiliary reconstruction loss
KL component (dashed lines) in bits-per-character (Section 3.4) with 0.2 weight in our Hybrid model.
of our Hybrid model trained with 0.2 α hyper- The bits-per-character scores on differently sized
parameter vs. LSTM based VAE trained with 0.2 text samples are presented in Figure 4. As dis-
and 0.5 input dropout, measured on validation par- cussed in Section 3.1, the KL term value indi-
tition. cates how much information the network stores
in the latent vector. We observe that the amount
malization (Ba et al., 2016) which has been shown of information stored in the latent vector by our
to make training of such networks easier. These model and the LSTM VAE is comparable when
results suggest that our model is easier to train than we train on short samples and largely depends on
the LSTM-based model, especially for modeling hyper-parameters α and p. When the length of a
longer pieces of text. Additionally, our model is text fragment increases, LSTM VAE is able to put
computationally faster by a factor of roughly two, less information into the latent vector (i.e., the KL
since we run only one recurrent network per sam- component is small) and for texts longer than 48
ple and time complexity of the convolutional part characters, the KL term drops to almost zero while
is negligible in comparison. for our model the ratio between KL and recon-
struction terms stays roughly constant. This sug-
Decoding with history. We now move to a case gests that our model is better at encoding latent
where the decoder is conditioned on both the la- representations of long texts since the amount of
tent vector and previous output elements. In these information in the latent vector does not decrease
experiments we pursue two goals: firstly, we ver- as the length of a text fragment grows. In con-
ify whether the results obtained on the historyless trast, there is a steady decline of the KL term of the
decoding task also generalize to a less restricted LSTM VAE model. This result is consistent with
case. Secondly, we study how well the models our findings from the historyless decoding exper-
cope with stochasticity introduced by the latent iment. Note that in both of these experiments the
variables. Note that we do not attempt to improve LSTM VAE model fails to produce meaningful la-
the state-of-the-art result on the Language Mod- tent vectors with inputs over 50 characters long.
eling task but instead focus on providing an ap- This further suggests that our Hybrid model en-

633
3.0 2.5
with annealing α =0.0
2.5 without annealing α =0.2
2.0 α =0.5
Bits­per­character

Bits­per­character
2.0
1.5
1.5
1.0
1.0

0.5 0.5

0.0 0.0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1 2 3 4 5 6 7 8 9 10
α Number of convolutional layers

Figure 5: The full cost (solid line) and the KL Figure 6: The full cost (solid line) and the KL
component (dashed line) of our Hybrid model with component (dashed line) of our Hybrid model with
LSTM decoder trained with various α, with and ByteNet decoder trained with various number of
without KL term weight annealing, measured on convolutional layers, measured on the validation
the validation partition. partition

codes long texts better than the LSTM VAE.

4.2 Controlling the KL term tially falls back to an RNN language model. To
have a full control over the receptive field size of
We study the effect of various training techniques
the recurrent component in our decoder, we ex-
that help control the KL term which is crucial for
periment with masked convolutions (Figure 2(c)),
training a generative VAE model.
which is similar to the decoder in ByteNet model
Aux cost weight. First, we provide a detailed from Kalchbrenner et al. (2016). We fix the size
view of how optimization tricks discussed in Sec- of the convolutional kernels to 2 and do not use di-
tion 3.4 affect the performance of our Hybrid lated convolutions and skip connections as in the
model. Figure 5 presents results of our model original ByteNet.
trained with different values of α from Eq. (3).
Note that the inclusion of the auxiliary reconstruc- The resulting receptive field size of the recurrent
tion loss slightly harms the bound on the likeli- layer in our decoder is equal to N + 1 characters,
hood of the data but helps the model to rely more where N is the number of convolutional layers.
on the latent vector as α grows. A similar effect We vary the number of layers to find the amount of
on model’s bound was observed by Bowman et al. preceding characters that our model can consume
(2016): increased input dropout rates force their without collapsing the KL term to zero.
model to put more information into the z vector
but at the cost of increased final loss values. This Results of these experiments are presented in
is a trade-off that allows for sampling outputs in Figure 6. Interestingly, with the receptive field
the VAE framework. Note that our model can size larger than 3 and without the auxiliary re-
find a solution with non-trivial latent vectors when construction term from Eq. (3) (α = 0) the KL
trained with the full VAE loss provided that the term collapses to zero and the model falls back
α hyper-parameter is large enough. Combining it to a pure language model. This suggests that the
with KL term annealing helps to find non-zero KL training signal received from the previous charac-
term solutions at smaller α values. ters is much stronger than that from the input to be
reconstructed. Using the auxiliary reconstruction
Receptive field. The goal of this experiment is term, however, helps to find solutions with non-
to study the relationship between the KL term val- zero KL term component irrespective of receptive
ues and the expressiveness of the decoder. Without field size. Note that increasing the value of α re-
KL term annealing and input dropout, the RNN sults in stronger values of KL component. This
decoder in LSTM VAE tends to completely ignore is consistent with the results obtained with LSTM
information stored in the latent vector and essen- decoder in Figure 5.

634
@userid @userid @userid @userid @userid ... Rec KL
I want to see you so much @userid #FollowMeCam ... LSTM, p = 0.2 67.4 1.0
@userid @userid @userid @userid @userid ...
Why do I start the day today? LSTM, p = 0.5 77.1 2.1
@userid thanks for the follow back
LSTM, p = 0.8 93.7 3.8
no matter what I’m doing with my friends they are so cute Hybrid, α = 0.2 58.5 12.5
@userid Hello How are you doing
I wanna go to the UK tomorrow!! #feelinggood #selfie #instago
@userid @userid I’ll come to the same time and it was a good day too xx Table 2: Breakdown into KL and
reconstruction terms for char-level
Table 1: Random sample tweets generated by LSTM VAE tweet generation. p refers to input
(top) and our Hybrid model (bottom). dropout rate.

4.3 Generating Tweets 5 Conclusions


In this section we present qualitative results on the We have introduced a novel generative model of
task of generating tweets. natural texts based on the VAE framework. Its
core components are a convolutional encoder and
a deconvolutional decoder combined with a recur-
Data. We use 1M tweets2 to train our model
rent layer. We have shown that the feed-forward
and test it on a held out dataset of 10k samples.
part of our model architecture makes it easier to
We minimally preprocess tweets by only replac-
train a VAE and avoid the problem of KL-term col-
ing user ids and urls with “@userid” and “url”.
lapsing to zero, where the decoder falls back to a
standard language model thus inhibiting the sam-
Setup. We use 5 convolutional layers with the pling ability of VAE. Additionally, we propose an
ReLU non-linearity, kernel size 3 and stride 2 in efficient way to encourage the model to rely on
the encoder. The number of feature maps is [128, the latent vector by introducing an additional cost
256, 512, 512, 512] for each layer respectively. term in the training objective. We observe that it
The decoder is configured equivalently but with works well on long sequences which is hard to
the amount of feature maps decreasing in each achieve with purely RNN-based VAEs using the
consecutive layer. The top layer is an LSTM with previously proposed tricks such as KL-term an-
1000 units. We have not observed significant over- nealing and input dropout. Finally, we have ex-
fitting. The baseline LSTM VAE model contained tensively evaluated the trade-off between the KL-
two distinct LSTMs both with 1000 cells. The term and the reconstruction loss. In particular, we
models have comparable number of parameters: investigated the effect of the receptive field size on
10.5M for the LSTM VAE model and 10.8M for the ability of the model to respect the latent vector
our hybrid model. which is crucial for being able to generate realistic
and diverse samples. In future work we plan to ap-
Results. Both VAE models are trained on the ply our VAE model to semi-supervised NLP tasks
character-level generation. The breakdown of to- and experiment with conditioning generation on
tal cost into KL and reconstruction terms is given text attributes such as sentiment and writing style.
in Table 2. Note that while the total cost values
are comparable, our model puts more information Acknowledgments
into the latent vector, further supporting our obser- We thank Enrique Alfonseca, Katja Filippova,
vations from Section 4.1. This is reflected in the Sylvain Gelly and Jason Lee for their useful feed-
random samples from both models, presented in back while preparing this paper. This project
Table 1. We perform greedy decoding during gen- has received funding from the European Union’s
eration so any variation in samples is only due to Framework Programme for Research and Inno-
the latent vector. LSTM VAE produces very lim- vation HORIZON 2020 (2014-2020) under the
ited range of tweets and tends to repeat ”@userid” Marie Skodowska-Curie Agreement No. 641805.
sequence, while our model produces much more Stanislau Semeniuta thanks the support from Pat-
diverse samples. tern Recognition Company GmbH. We thank the
support of NVIDIA Corporation with the donation
2
a random sample collected using the Twitter API of the Titan X GPU used for this research.

635
References Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam
Shazeer, and Yonghui Wu. 2016. Exploring the lim-
Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. its of language modeling. CoRR, abs/1602.02410.
2016. Layer normalization. CoRR, abs/1607.06450.
Philip Bachman. 2016. An architecture for deep, Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan,
hierarchical generative models. In D. D. Lee, Aäron van den Oord, Alex Graves, and Koray
M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Gar- Kavukcuoglu. 2016. Neural machine translation in
nett, editors, NIPS, pages 4826–4834. linear time. CoRR, abs/1610.10099.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Diederik P. Kingma and Jimmy Ba. 2014. Adam:
Bengio. 2014. Neural machine translation by A method for stochastic optimization. CoRR,
jointly learning to align and translate. CoRR, abs/1412.6980.
abs/1409.0473.
Diederik P. Kingma, Tim Salimans, and Max Welling.
Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl- 2016. Improving variational inference with inverse
Johann Simon-Gabriel, and Bernhard Schoelkopf. autoregressive flow. CoRR, abs/1606.04934.
2017. From optimal transport to generative mod-
eling: the vegan cookbook. CoRR, abs/1705.07642. Diederik P. Kingma and Max Welling. 2013. Auto-
encoding variational bayes. CoRR, abs/1312.6114.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, An-
drew M. Dai, Rafal Józefowicz, and Samy Ben- Hugo Larochelle and Iain Murray. 2011. The neural
gio. 2016. Generating sentences from a continuous autoregressive distribution estimator. In AISTATS,
space. In CONLL, pages 10–21. pages 29–37.
Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Anders Boesen Lindbo Larsen, Søren Kaae Sønderby,
Duan, Prafulla Dhariwal, John Schulman, Ilya and Ole Winther. 2015. Autoencoding beyond
Sutskever, and Pieter Abbeel. 2016. Variational pixels using a learned similarity metric. CoRR,
lossy autoencoder. CoRR, abs/1611.02731. abs/1512.09300.
Yann N. Dauphin, Angela Fan, Michael Auli,
and David Grangier. 2016. Language model- Mitchell P. Marcus, Mary Ann Marcinkiewicz, and
ing with gated convolutional networks. CoRR, Beatrice Santorini. 1993. Building a large annotated
abs/1612.08083. corpus of english: The penn treebank. Computa-
tional Linguistics, 19(2):313–330.
Marco Fraccaro, Søren Kaae Sø nderby, Ulrich Paquet,
and Ole Winther. 2016. Sequential neural models Lars M. Mescheder, Sebastian Nowozin, and Andreas
with stochastic layers. In NIPS, pages 2199–2207. Geiger. 2017. Adversarial variational bayes: Unify-
ing variational autoencoders and generative adver-
Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, sarial networks. CoRR, abs/1701.04722.
Adrien Ali Taiga, Francesco Visin, David Vázquez,
and Aaron C. Courville. 2016. Pixelvae: A la- Yishu Miao, Lei Yu, and Phil Blunsom. 2015. Neu-
tent variable model for natural images. CoRR, ral variational inference for text processing. CoRR,
abs/1611.05013. abs/1511.06038.
David Ha, Andrew M. Dai, and Quoc V. Le. 2016. Hy- Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han.
pernetworks. CoRR, abs/1609.09106. 2015. Learning deconvolution network for semantic
segmentation. CoRR, abs/1505.04366.
Geoffrey E. Hinton and Drew van Camp. 1993. Keep-
ing the neural networks simple by minimizing the Aäron van den Oord, Nal Kalchbrenner, and Koray
description length of the weights. In Proceedings Kavukcuoglu. 2016. Pixel recurrent neural net-
of the Sixth Annual ACM Conference on Computa- works. CoRR, abs/1601.06759.
tional Learning Theory, COLT 1993, Santa Cruz,
CA, USA, July 26-28, 1993., pages 5–13. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long 2013. On the difficulty of training recurrent neural
short-term memory. Neural Computation, pages networks. In ICML, pages 1310–1318.
1735–1780.
Alec Radford, Luke Metz, and Soumith Chintala.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan 2015. Unsupervised representation learning with
Salakhutdinov, and Eric P. Xing. 2017. Controllable deep convolutional generative adversarial networks.
text generation. CoRR, abs/1703.00955. CoRR, abs/1511.06434.

Sergey Ioffe and Christian Szegedy. 2015. Batch nor- Tapani Raiko, Mathias Berglund, Guillaume Alain, and
malization: Accelerating deep network training by Laurent Dinh. 2014. Techniques for learning bi-
reducing internal covariate shift. In ICML, pages nary stochastic feedforward neural networks. CoRR,
448–456. abs/1406.2989.

636
Danilo Jimenez Rezende and Shakir Mohamed. 2015. Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi.
Variational inference with normalizing flows. In 2017. Learning discourse-level diversity for neural
Proceedings of the 32nd International Conference dialog models using conditional variational autoen-
on Machine Learning, ICML 2015, Lille, France, 6- coders. CoRR, abs/1703.10960.
11 July 2015, pages 1530–1538.
Barret Zoph and Quoc V. Le. 2016. Neural archi-
Danilo Jimenez Rezende, Shakir Mohamed, and Daan tecture search with reinforcement learning. CoRR,
Wierstra. 2014. Stochastic backpropagation and ap- abs/1611.01578.
proximate inference in deep generative models. In
ICML, pages 1278–1286.

Alexander M. Rush, Sumit Chopra, and Jason Weston.


2015. A neural attention model for abstractive sen-
tence summarization. CoRR, abs/1509.00685.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe,


Laurent Charlin, Joelle Pineau, Aaron C. Courville,
and Yoshua Bengio. 2016. A hierarchical latent
variable encoder-decoder model for generating di-
alogues. CoRR, abs/1605.06069.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe,


Søren Kaae Sønderby, and Ole Winther. 2016.
Ladder variational autoencoders. CoRR,
abs/1602.02282.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and


Dumitru Erhan. 2014. Show and tell: A neural im-
age caption generator. CoRR, abs/1411.4555.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.


Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason
Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144.

Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak


Lee. 2015. Attribute2image: Conditional im-
age generation from visual attributes. CoRR,
abs/1512.00570.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and


Taylor Berg-Kirkpatrick. 2017. Improved varia-
tional autoencoders for text modeling using dilated
convolutions. CoRR, abs/1702.08139.

Serena Yeung, Anitha Kannan, Yann Dauphin, and


Li Fei-Fei. 2016. Epitomic variational autoencoder.
In submission to ICLR 2017.

Matthew D. Zeiler, Dilip Krishnan, Graham W. Tay-


lor, and Robert Fergus. 2010. Deconvolutional net-
works. In CVPR, pages 2528–2535.

Biao Zhang, Deyi Xiong, and Jinsong Su. 2016.


Variational neural machine translation. CoRR,
abs/1605.07869.

637

You might also like