are deep convolutional networks (LeCun et al., 1990). Both start with a word embedding layer followed by alternating convolutions with Gated Linear Units (GLU) (Dauphin et al., 2017). The decoder is connected to the encoder through attention modules (Bahdanau et al., 2015) that perform a weighted sum of the encoder outputs. The weights are predicted from the current decoder states, allowing the decoder to emphasize the parts of the input document that are most relevant for generating the next token. We use multi-hop attention, i.e. attention is applied at each layer of the decoder.

In addition to attending over encoder states (Bahdanau et al., 2015), we also use intra-attention in the decoder to enable the model to refer back to previously generated words. This allows the decoder to keep track of its progress and reduces the generation of repeated information (Vaswani et al., 2017; Paulus et al., 2017). To combine encoder and decoder attention, we alternate between the two types of attention at every layer.

Much prior work on the CNN-Dailymail benchmark employed pointer networks to copy rare entities from the input (Nallapati et al., 2016), which introduces additional complexity to the model. Instead, we rely on sub-word tokenization and weight sharing, and we show that this simple approach is very effective. Specifically, we use byte-pair encoding (BPE) for tokenization, a proven strategy that has been shown to improve the generation of proper nouns in translation (Sennrich et al., 2016b). We share the representation of the tokens in the encoder and decoder embeddings and in the last decoder layer.
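For concreteness, below is a minimal PyTorch sketch of one such convolutional block, with a GLU activation and a residual connection. The dimensions are placeholders, this is not the fairseq implementation, and decoder blocks would additionally require causal padding.

import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    def __init__(self, dim=512, kernel_width=3):
        super().__init__()
        # The convolution doubles the channels; GLU halves them back
        # by gating one half with the sigmoid of the other.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_width,
                              padding=kernel_width // 2)

    def forward(self, x):          # x: (batch, dim, seq_len)
        return x + F.glu(self.conv(x), dim=1)   # residual connection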
2.2 Length-Constrained Summarization

Summarization allows a reader with limited time to quickly comprehend the essence of a document. Controlling summary length enables reading with different time budgets: a document might be summarized as a five-word headline, a single sentence, or a paragraph, each providing progressively more detail.

To let the user control length, we first quantize summary length into discrete bins, each representing a size range. Length bins are chosen so that they each contain roughly an equal number of training documents. We then expand the input vocabulary with special word types that indicate the length bin of the desired summary, which allows generation to be conditioned on this discrete length variable. For training, we prepend the input of our summarizer with a marker that indicates the length of the ground-truth summary. At test time, we control the length of generated text by prepending a particular length marker token. Our experiments (§5.2) provide quantitative and qualitative evidence that the model effectively uses this variable: output length is easily controlled by changing the length marker, and supplying ground-truth markers drastically improves summary quality. We compare our method to Kikuchi et al. (2016) and demonstrate that our straightforward length control strategy is more effective.
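As an illustration of this preprocessing step, here is a minimal Python sketch that bins training summary lengths into ten equally populated ranges and prepends the corresponding marker. The ten-bin count matches the length markers used in §5.2; the @len surface form is a hypothetical naming, since the paper does not specify the marker's spelling.

import numpy as np

def make_length_bins(train_summary_lengths, num_bins=10):
    # Interior percentile cut points give bins holding roughly
    # equal numbers of training summaries.
    quantiles = np.linspace(0, 100, num_bins + 1)[1:-1]
    return np.percentile(train_summary_lengths, quantiles)

def length_marker(summary_length, bin_edges):
    # Map a length to its bin id; '@len{i}' is a hypothetical token name.
    return f"@len{int(np.searchsorted(bin_edges, summary_length))}"

def prepend_marker(document_tokens, marker):
    # Training: use the ground-truth summary's marker.
    # Test time: prepend whichever marker matches the desired length.
    return [marker] + document_tokens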
2.3 Entity-Centric Summarization

The reader might be interested in a document to learn about specific entities, such as people or locations. For example, a sports fan reading about a recent game might want to focus the summary on the performance of their favorite player. To enable entity-centric summaries, we first anonymize entities by replacing all occurrences of a given entity in a document with the same token. For training, we also anonymize the corresponding reference summary. For a (document, summary) pair, each entity is replaced with a token from the set (@entity0, . . . , @entityN). This abstracts away the surface form, allowing our approach to scale to many entities and generalize to unseen ones.

We then express that an entity should be present in the generated summary by prepending the entity token to the input: prepending @entity3 expresses that the model should generate a summary where @entity3 is present. In effect, this instructs the model to focus on sentences that mention the marked entities. At training time, we prepend each document with markers referring to an entity from the ground-truth summary. To ensure the entity request is informative, we provide an entity that is present in the ground truth but not present in the summary generated by the baseline model. At test time, we may specify any entity marker that we wish the summary to contain. Our experiments (§5.2) evaluate the effect of prepending different markers to the input. We show that higher accuracy is achieved when we specify entities from the first few sentences of a document, or when we supply markers taken from the reference summary to illustrate specific user preferences. We extend this approach to multiple entity markers, experimenting with appending all ground-truth entities for training and providing all entities from Lead-3 at test time. We show that providing more entities improves summarization quality.
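The anonymization and marker mechanics can be sketched in a few lines of Python. This assumes the entity surface strings are already extracted (the CNN-Dailymail data ships with entity annotations) and is an illustration rather than the exact pipeline.

import re

def anonymize(document, summary, entities):
    # Longest-first replacement avoids clobbering nested names
    # (e.g. 'Potter' inside 'Harry Potter').
    mapping = {ent: f"@entity{i}" for i, ent in enumerate(entities)}
    for ent in sorted(entities, key=len, reverse=True):
        token = mapping[ent]
        document = re.sub(re.escape(ent), token, document)
        summary = re.sub(re.escape(ent), token, summary)
    return document, summary, mapping

def request_entity(document_tokens, entity_token):
    # e.g. request_entity(doc, "@entity3") asks for a summary
    # that mentions @entity3.
    return [entity_token] + document_tokens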
46
2.4 Source-Specific Summarization

Text sources such as newspapers and magazines often have specific style guidelines to provide a consistent experience, and readers are accustomed to the styles of their favorite sources. We therefore enable users to specify a preferred source style for a summary. Similar to length and entities, we introduce special marker tokens (@genSource0, . . . , @genSourceN) to express source desiderata. For training, we prepend the input with the marker corresponding to the ground-truth source. At inference, we control the style of the generated summary by prepending different markers. Our experiments (§4) evaluate whether providing the true source style produces summaries that are closer to the reference summary. We additionally provide examples of distinct summaries resulting from changing the source-style conditioning; the mechanism composes with the other controls, as sketched below.
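Because every control uses the same prepending mechanism, the three variables compose by concatenating their markers; a minimal sketch follows (the @len naming is again our hypothetical convention).

def add_control_markers(document_tokens, length_bin=None,
                        entity_tokens=(), source_id=None):
    # Prepend any subset of {length, entity, source} markers.
    markers = []
    if length_bin is not None:
        markers.append(f"@len{length_bin}")
    markers.extend(entity_tokens)        # e.g. ["@entity0", "@entity3"]
    if source_id is not None:
        markers.append(f"@genSource{source_id}")
    return markers + document_tokens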
2.5 Remainder Summarization

Beyond reading summaries of full documents, readers may want the flexibility of summarizing only certain portions of a document. For example, a reader who has read the first few paragraphs would want a summary of the remaining text that covers what they missed.

Training and evaluating remainder summarization requires specific data, namely a dataset of full documents with position markers separating the already-read portion from the remainder, along with the corresponding summaries. Such a dataset is not readily available and would be challenging to collect. To enable remainder summarization without such data, we align summaries to full documents. Our procedure matches each reference summary sentence to its best-matching document sentence based on ROUGE-L. For any position in the document, we remove from the full summary the sentences aligned before this point and treat the shortened summary as the summary of the remainder. In our experiments, we consider as read portions all article positions located midway between two alignment points, except for alignment points separated by fewer than 2 sentences. We consider the following methods (a sketch of the alignment procedure follows the list):

(1) Full summary baseline: the baseline model predicts a full summary, disregarding the separation of the read portion from the remainder.

(2) Post-inference alignment: a full summary is generated by the baseline model and then shortened with our alignment procedure. The decoded summary sentences that align to the remainder portion compose the summary of the remainder.

(3) Remainder only: the model is trained to map document remainders to remainder summaries on pre-aligned training data. This model is not given the read portion of the article.

(4) Read and remainder: the model receives both the read portion of the article and the remainder, separated by a special token, and is trained to predict the remainder summary. We distinguish the read and remainder parts of the article by using distinct sets of position embeddings.

We compare these methods in Section 4 and show the advantage of the model that receives both the user-read portion and the remainder of the document.
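A minimal sketch of the alignment step, assuming pre-split sentences and a caller-supplied rouge_l(hyp, ref) function that returns a ROUGE-L F-score (any standard ROUGE implementation provides one):

def align_summary(summary_sents, doc_sents, rouge_l):
    # Index of the best-matching document sentence for each
    # reference summary sentence, scored by ROUGE-L.
    return [max(range(len(doc_sents)),
                key=lambda j: rouge_l(sent, doc_sents[j]))
            for sent in summary_sents]

def remainder_summary(summary_sents, alignment, read_up_to):
    # Keep only summary sentences aligned at or after the
    # reader's current position in the document.
    return [sent for sent, j in zip(summary_sents, alignment)
            if j >= read_up_to]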
3 Related Work

3.1 Sequence-to-Sequence for Summarization

Automatic summarization has been an active research field for 60 years (Luhn, 1958). Extractive and abstractive methods have benefited from advances in natural language processing, pattern recognition, and machine learning (Nenkova et al., 2011). Recently, sequence-to-sequence neural networks (Sutskever et al., 2014) have been applied to abstractive summarization (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017) following their success in translation (Bahdanau et al., 2015; Luong et al., 2015b), parsing (Luong et al., 2015a) and image captioning (Vinyals et al., 2015b). Neural abstractive summarization has built upon advances from machine translation and related fields: attention (Bahdanau et al., 2015) enables generation to focus on parts of the source document, while pointers (Vinyals et al., 2015a) help abstractive summarization copy entities from the input (See et al., 2017; Paulus et al., 2017; Nallapati et al., 2016).

However, summarization also has distinct challenges. The generation of multi-sentence summaries differs from single-sentence translation: left-to-right decoders need to be aware of their previous generation at a larger time scale, otherwise models tend to produce repeated text. To address this impediment, See et al. (2017) introduce coverage modeling, Paulus et al. (2017) propose intra-decoder attention, and Suzuki and Nagata (2017) equip the decoder with an estimator of unigram frequency. Previous work has also explored learning objectives: Paulus et al. (2017) investigate replacing maximum likelihood training with Reinforcement Learning (RL) to optimize ROUGE, the most common automatic metric for assessing summarization. Combining both strategies is found to perform best in human evaluations, as training with RL alone often produces non-grammatical text.
Our work builds upon prior research: like Gehring et al. (2017), we rely on convolutional networks, which enable faster training. This contrasts with prior work using recurrent networks (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017). We borrow intra-attention from Paulus et al. (2017) and expand it to multi-hop intra-attention, inspired by the multi-hop source attention of Gehring et al. (2017). To facilitate copying input entities, we share the word representations between encoder and decoder (Paulus et al., 2017) and also rely on BPE tokenization (Sennrich et al., 2016b). This combination allows us to forgo an additional pointer mechanism, unlike Paulus et al. (2017), See et al. (2017) and Nallapati et al. (2016). Unlike Paulus et al. (2017), we did not explore training objectives and simply maximized the likelihood of the training summaries given the source document. Our model is amenable to RL, but this aspect is largely orthogonal to our main goal, i.e. controllable summarization.

3.2 Controllable Text Generation

Text generation is an established research area (McKeown, 1992). The field follows recent advances in generative models, such as the introduction of variational auto-encoders (Kingma and Welling, 2013) and adversarial networks (Goodfellow et al., 2014). This is exemplified by work on natural language generation (Bowman et al., 2016; Yu et al., 2017; Zhao et al., 2017; Rajeswar et al., 2017).

Building upon unconditioned generation, controllable generation is an emerging research field. Research in computer vision includes style transfer (Gatys et al., 2015) and controllable image generation (Lample et al., 2017). Text generation work has focused on controlling tense or sentiment with variational auto-encoders (Hu et al., 2017). Shen et al. (2017) rely on adversarial training for manipulating sentence sentiment, and Sennrich et al. (2016a) propose using side constraints for polite neural machine translation models. Takeno et al. (2017) extend side constraints to control further aspects of the translation output, such as length. Others have worked on style: for example, Ficler and Goldberg (2017) propose using a conditional language model to generate text with stylistic requirements, and Kobus et al. (2017) propose using tokens and additional features to translate text in different domains. Filippova (2017) proposes controlling length for generating answers in a question answering task. Kikuchi et al. (2016) explore length control for sentence compression using decoding-time restrictions and training-time length token embeddings.

Motivated by simplicity, our work relies on conditional language modeling and does not require adversarial training, latent variable models such as variational auto-encoders, or pointer networks. While latent variable models are popular for the generation of continuous outputs such as images, (conditional) language models are flexible enough to capture the multimodal nature of the data. We leave the assessment of how additional latent variables might improve upon our results to future work.
4 Experimental Setup

Dataset: We use the CNN-Dailymail dataset (Hermann et al., 2015; Nallapati et al., 2016). It consists of news articles along with multi-sentence summaries, with a total of 287k train, 13k valid and 11k test articles. On average, the articles are 758 tokens long and the summaries are 55 tokens long. Most of our experiments are performed with articles truncated at 400 tokens, as suggested by See et al. (2017). We evaluate on two versions of the data: the entity-anonymized version (Hermann et al., 2015; Nallapati et al., 2016; Paulus et al., 2017) and the full-text version (See et al., 2017). We use BPE with 30k types (Sennrich et al., 2016b) for most experiments. For non-BPE models, the input and output vocabularies have 47k and 21k word types respectively, corresponding to types with more than 20 training occurrences.

Further, we compare length control with Kikuchi et al. (2016) on the DUC-2004 single-sentence summarization task. We train on English Gigaword following the protocol of Rush et al. (2015). The data consist of 3.6 million (first sentence, headline) pairs from news articles. Following Kikuchi et al. (2016), we evaluate on the 500 documents of DUC-2004 task 1. We use a source and target vocabulary of 30k words.
Architecture, Training, and Generation: We implement models with the fairseq library (github.com/facebookresearch/fairseq). For CNN-Dailymail, our model has 8 layers in the encoder and decoder, each with kernel width 3. We use 512 hidden units for each layer, embeddings of size 340, and dropout 0.2. For DUC, we use 6 layers in the encoder and decoder with 256 hidden units.

Similar to Gehring et al. (2017), we train using Nesterov's accelerated gradient method (Sutskever et al., 2013) with gradient clipping 0.1 (Pascanu et al., 2013), momentum 0.99, and learning rate 0.2. We reduce the learning rate by an order of magnitude when the validation perplexity ceases to improve, and end training when the learning rate drops below 10^-5.
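This recipe maps directly onto standard tooling; below is a minimal PyTorch sketch of the schedule only, where train_epoch and validate are caller-supplied placeholders (the paper's actual implementation lives in fairseq).

import torch

def train_with_plateau_schedule(model, train_epoch, validate,
                                lr=0.2, min_lr=1e-5):
    # Nesterov SGD with momentum 0.99; train_epoch should clip
    # gradients to norm 0.1, e.g. with torch.nn.utils.clip_grad_norm_.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.99, nesterov=True)
    best_ppl = float("inf")
    while lr >= min_lr:                  # stop once lr drops below 10^-5
        train_epoch(model, optimizer)
        ppl = validate(model)
        if ppl >= best_ppl:              # validation plateau: lr /= 10
            lr /= 10
            for group in optimizer.param_groups:
                group["lr"] = lr
        best_ppl = min(best_ppl, ppl)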
Summaries are generated using beam search with beam size 5. To avoid repetition, we prevent the decoder from generating the same trigram more than once, following Paulus et al. (2017); a sketch of this constraint is given below.
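A minimal sketch of this trigram blocking, independent of any particular beam-search code: at each step, the decoder sets the scores of the banned tokens to negative infinity before choosing the next word.

def banned_continuations(hypothesis):
    # Return the tokens that, appended to `hypothesis`, would repeat
    # a trigram already present in it.
    if len(hypothesis) < 3:
        return set()
    seen = {}
    for i in range(len(hypothesis) - 2):
        prefix = tuple(hypothesis[i:i + 2])
        seen.setdefault(prefix, set()).add(hypothesis[i + 2])
    return seen.get(tuple(hypothesis[-2:]), set())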
Evaluation: On the CNN-Dailymail benchmark, our automatic evaluation reports F1-ROUGE scores for ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004). We compare to existing abstractive baselines (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017). We also compare with Lead-3, which selects the first three sentences of the article as the summary. Note that, although simple, this baseline is not outperformed by all models.

For human evaluation, we conduct a study using Amazon Mechanical Turk and the test set generation output of See et al. (2017). 500 articles from the test set were randomly selected, and each was evaluated by 5 raters. The raters were presented with the first 400 words of each news article and asked to select the summarization output they preferred.

For DUC-2004, we report recall ROUGE for ROUGE-1, ROUGE-2, and ROUGE-L at 30, 50, and 75 byte lengths, following Kikuchi et al. (2016).
5 Results

We evaluate the design choices of our model and the impact of manipulating the control variables. We analyze the performance of the remainder summarization task and demonstrate the advantage of modeling both the read and remainder portions of the document.

5.1 Convolutional Summarization

Table 1 details the effect of our design choices on our baseline. Adding a constraint to avoid repeated trigrams at generation time improves F1-ROUGE1 by +2.86. Adding intra-attention, which enables the model to examine past generations over long distances, improves the accuracy obtained with the trigram constraint by a further 0.51 F1-ROUGE1. The modest improvement is likely because the two features address a similar problem of avoiding repeated generations. Switching tokenization from words to BPE gives another +0.79 F1-ROUGE1. BPE improves the ability to copy proper nouns and rare inflections, both of which are difficult to model with word-based vocabularies. This agrees with translation results (Sennrich et al., 2016b). Lastly, we find that tuning the min/max length on the validation set and applying these constraints to the test set improves F1-ROUGE1 by 0.25.

Model                    ROUGE-1  ROUGE-2  ROUGE-L
fairseq                    33.32    12.64    30.57
+ trigram decoding         36.18    14.10    33.18
+ intra-attention          36.69    14.28    33.47
+ BPE                      37.48    15.12    34.16
+ tuning min/max len       37.73    15.03    34.49

Table 1: Baseline without control variables. Each row adds a feature on top of the features of the previous row.

5.2 Controllable Summarization

Our summarizer lets users control the length of the generated summary, the entities it focuses on, and the source style it imitates (see §2). We first evaluate the effect of providing the oracle reference variables at decoding time. This simulates a user setting their preferences to specific values. We then assess the effect of providing non-reference control variables.

Table 2 reports our results for each variable and their combined effect. All control variables improve summary quality, but length control has the most impact, followed by entity control and source style. Further, the advantages of each control variable accumulate to produce an even stronger summary: we obtain +2.2 F1-ROUGE1 when combining control variables.

Model                    ROUGE-1  ROUGE-2  ROUGE-L
baseline, no control       37.73    15.03    34.49
Length constraint          39.16    15.54    35.94
Entity centric             38.17    15.16    34.92
Source specific            37.68    15.16    34.40
Length+Entity+Source       39.61    15.83    36.48

Table 2: Summarization with oracle control variables to simulate user preferences.
Length control improves accuracy by 1.68 F1-ROUGE1 (Table 2). This improvement is due to two effects: length mismatch is heavily penalized by F1-ROUGE, and the baseline struggles to predict correct lengths. The latter reflects the large uncertainty in summary length, i.e. even humans have difficulty predicting the correct length.

Figure 1 reports the average summary length when decoding all examples in the test set using each of the 10 possible length markers. The model is shown to respect the length markers. Table 8 demonstrates the effect of the length marker on a specific example.

Entity control has less impact on ROUGE than length control, at +0.69 vs. +1.68 F1-ROUGE1 (Table 2). This is mainly because our summaries often already contain most entities from the ground truth without the need for additional instruction. Table 6 further analyzes entity control for 100 test documents. We decode repeatedly, requesting each entity from lead-3 in turn, and then repeat the experiment with each entity from the full article. We report how often the entity-centric model generates a summary that actually contains the requested entity. For lead-3 entities, the model mentions the requested entity 61% of the time, while for entities taken from anywhere in the input, it mentions the requested entity 34% of the time. In both settings, these rates are much higher than the baseline's. The model has difficulty generating summaries with entities that are unlikely to appear in the human references, e.g. unimportant entities far from the beginning of the article.

Source-style control is the least impactful control in terms of ROUGE: we report +0.2 F1-ROUGE1 in Table 2. Changing the source-style variable changes the summary, as shown in Table 8. Generally, we observe that generated summaries in the Dailymail style are more repetitive and slightly longer than the CNN-style summaries. This matches the differences between the two sources in the reference text. The impact of style requests might be greater with a richer set of styles; in future work, we plan to evaluate on datasets where more varied styles are available.

5.3 Summarization with Automatic Control

Our primary objective is to allow readers to control the attributes of generated summaries. However, we can also set the control variables automatically in the absence of reader desiderata. For length and source style, we set the variable to a constant value that maximizes ROUGE on the validation set. For entity control, we randomly sample an entity that appears in lead-3 and provide it as the entity of interest.

Table 3 reports results on the entity-anonymized version of the dataset, like Nallapati et al. (2016) and Paulus et al. (2017), and Table 4 reports results on the full-text data, like See et al. (2017). In both cases, our method is advantageous over the alternatives. Further, providing all of the entities at training time and only lead-3 entities at test time improves quality. On the original text, we report 40.38 F1-ROUGE1 as opposed to 39.53 for See et al. (2017). On the entity-anonymized text, we report 39.06 F1-ROUGE1 as opposed to 38.30 for the best maximum likelihood setting of Paulus et al. (2017). We hypothesize that providing all lead-3 entities encourages copying from lead-3. Our model does not outperform the reinforcement learning model of Paulus et al. (2017), which optimizes ROUGE directly. However, training objectives are orthogonal to our work on control variables, and we expect reinforcement learning to benefit our model equally.

Model                          ROUGE-1  ROUGE-2  ROUGE-L
Lead-3
  Nallapati et al. (2017)        39.2     15.7     35.5
Maximum Likelihood
  Nallapati et al. (2016)        35.46    13.30    32.65
  Paulus et al. (2017)           37.86    14.69    34.99
  Paulus et al. + intra-attn     38.30    14.81    35.49
  fairseq no control (ours)      37.48    15.12    34.16
  + fixed control                38.68    15.40    35.47
  + Lead-3 ent                   39.06    15.38    35.77
Reinforcement Learning
  Paulus et al. (2017)           39.87    15.82    36.90

Table 3: Fixed control variables on entity-anonymized text. Even with fixed variables, the controllable model improves ROUGE compared to maximum likelihood alternatives.

Model                          ROUGE-1  ROUGE-2  ROUGE-L
Lead-3                           40.34    17.70    36.57
Maximum Likelihood
  See et al. (2017)              39.53    17.28    36.38
  fairseq no control (ours)      38.23    16.68    34.77
  + fixed control                39.75    17.29    36.54
  + Lead-3 ent                   40.38    17.44    37.15

Table 4: Summarization with fixed control variables on original text. Even with a fixed setting, the controlled summarization model improves ROUGE.

Table 5 compares results on DUC-2004 to the best method presented by Kikuchi et al. (2016). We find that adding the length embedding improves the ROUGE-1 and ROUGE-L scores for the 30, 50, and 75 byte evaluations. Notably, ROUGE improves more for the shorter-text evaluations, likely because requesting a shorter output allows the model to plan its generation. Compared to Kikuchi et al. (2016), our results are stronger while our method is much simpler: Kikuchi et al. (2016) explore embedding the remaining length at each timestep during decoding and creating a separate memory cell to control length. In contrast, we simply provide the desired length as a special token and show that this simple approach is effective. Lastly, we note that length control has less effect on DUC-2004 than on CNN-Dailymail.
Model                                  30 byte              50 byte              75 byte
                                     R-1   R-2   R-L      R-1   R-2   R-L      R-1    R-2   R-L
LenInit(0,L) (Kikuchi et al., 2016) 14.31  3.27 13.19    20.87  6.16 19.00    25.87   8.27 23.24
Baseline without control            21.47  7.63 20.71    25.07  8.49 22.97    29.88  10.37 26.29
+ fixed length (ours)               21.81  7.51 21.05    25.39  8.38 23.37    30.00  10.27 26.43

Table 5: Recall ROUGE-1, ROUGE-2, and ROUGE-L on DUC-2004 at 30, 50, and 75 bytes.
a. Summary with Length Control

Requesting Length 2: @entity0 [Easter] is over for the wild rabbits of greater @entity2 [Sydney] as councils and parks prepare another attempt to kill them off with a deadly virus. It comes after over 30 government bodies scattered carrots laced with calicivirus.

Requesting Length 6: @entity0 [Easter] is over for the wild rabbits of greater @entity2 [Sydney] as councils and parks prepare another attempt to kill them off with a deadly virus. This year, because of really high summer rainfall - which led to great food availability - there has been a big surge in the rabbit population in @entity2 [Sydney].

Requesting Length 10: @entity0 [Easter] is over for the wild rabbits of greater @entity2 [Sydney] as councils and parks prepare another attempt to kill them off with strategically placed carrots that have been laced with a deadly virus. This year, because of really high summer rainfall - which led to great food availability - there has been a big surge in the rabbit population in @entity2 [Sydney]. It comes after over 30 government bodies scattered carrots laced with calicivirus around public areas in March.

d. Remainder Summary

Full Article: @entity4 [Harry Potter] star says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. @entity3 [Daniel Radcliffe]'s earnings from the first five @entity4 [Harry Potter] films have been held in a trust fund which he has not been able to touch.

After 8 sentences: He'll be able to gamble in a casino, buy a drink in a pub or see the horror film. @entity3 [Daniel Radcliffe]'s earnings from first five @entity4 [Harry Potter] films have been held in trust fund.

After 12 sentences: @entity3 [Daniel Radcliffe]'s earnings from first five @entity4 [Harry Potter] films have been held in trust fund.

Table 8: Summaries with various settings for user control variables and remainder summarization.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Dipanjan Das and André F. T. Martins. 2007. A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU 4:192–195.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017. INCOMA Ltd., Varna, Bulgaria, pages 372–378.

Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Fader networks: Manipulating images by sliding attributes. arXiv abs/1706.00409.

Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1990. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems (NIPS), pages 396–404.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Workshop on Text Summarization Branches Out.

H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development 2(2).
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Abigail See, Peter J Liu, and Christopher D Manning.
2017. Get to the point: Summarization with pointer-
generator networks. Annual Meeting of the Association
for Computational Linguistics (ACL) .
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016a. Controlling politeness in neural machine
translation via side constraints. In Proceedings of the
2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies. Association for Computa-
tional Linguistics, San Diego, California, pages 35–40.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016b. Neural machine translation of rare words with
subword units. Annual Meeting of the Association for
Computational Linguistics (ACL) .
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. arXiv abs/1705.09655.
Ilya Sutskever, James Martens, George E. Dahl, and
Geoffrey E. Hinton. 2013. On the importance of ini-
tialization and momentum in deep learning. In ICML.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.
Sequence to sequence learning with neural networks.
In Neural Information Processing Systems (NIPS).
Jun Suzuki and Masaaki Nagata. 2017. Cutting-off redundant repeating generations for neural abstractive summarization. In Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Shunsuke Takeno, Masaaki Nagata, and Kazuhide
Yamamoto. 2017. Controlling target features in
neural machine translation via prefix constraints. In
Proceedings of the 4th Workshop on Asian Translation
(WAT2017). Asian Federation of Natural Language
Processing, Taipei, Taiwan, pages 55–63.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.
2015a. Pointer networks. In Conference on Advances
in Neural Information Processing Systems (NIPS).
Oriol Vinyals, Alexander Toshev, Samy Bengio, and
Dumitru Erhan. 2015b. Show and tell: A neural image
caption generator. Conference on Computer Vision
and Pattern Recognition (CVPR) .
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017.
Seqgan: Sequence generative adversarial nets with
policy gradient. In AAAI.
Junbo Zhao, Y. Kim, K. Zhang, A. M. Rush, and Y. LeCun. 2017. Adversarially regularized autoencoders for generating discrete structures. arXiv abs/1706.04223.