Language Model Evaluation in Open-Ended Text Generation
by
An Nguyen
Student ID: 1098402
in the
School of Computing and Information Systems
Melbourne School of Engineering
THE UNIVERSITY OF MELBOURNE
August 2021
Declaration of Authorship
I confirm that:
This thesis does not incorporate without acknowledgement any material previously
submitted for a degree or diploma in any university; and that to the best of my
knowledge and belief it does not contain any material previously published or
written by another person where due reference is not made in the text;
Where necessary I have received clearance for this research from the University’s
Ethics Committee and have submitted all required data to the School;
The thesis is 12,714 words in length (excluding text in images, tables, bibliographies
and appendices).
Signed: An Nguyen
Date: 06/01/2021
Abstract
Although current state-of-the-art language models have achieved impressive results in
numerous natural language processing tasks, they still cannot solve the problem of
producing repetitive, dull and sometimes inconsistent text in open-ended text generation.
Studies often attribute this problem to the maximum likelihood training objective,
and propose alternative approaches that use stochastic decoding methods or alter
the training objective. However, there is still a lack of consistent evaluation metrics to
directly compare the efficacy of these solutions. In this work, we study different evaluation
metrics that have been proposed to evaluate the quality, diversity and consistency of
machine-generated text. From there, we propose a practical pipeline to evaluate language
models on the open-ended text generation task, and investigate how to improve the model's
performance in all dimensions by leveraging different auxiliary training objectives.
Acknowledgements
First and foremost, I would like to express my sincere appreciation to my supervisor, Dr.
Jey Han Lau, who has constantly given me thoughtful guidance and valuable feedback
throughout my thesis. I'm incredibly fortunate to have you as my supervisor - thank
you so much for all of the encouragement and support in the middle of this global
pandemic.
I'm eternally grateful to my girlfriend, Ngan, who has always been beside me and
supported me during the most difficult times.
Finally, I want to express my profound gratitude to my parents and sister, who have
always loved and supported me unconditionally.
Contents
Declaration of Authorship i
Abstract ii
Acknowledgements iii
1 Introduction 1
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Rethinking language models evaluation . . . . . . . . . . . . . . . 2
1.1.2 Stochastic Decoding or New Training Objectives? . . . . . . . . . . 3
1.1.3 Multi-task learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.4 Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background 6
2.1 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 n-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Feed-forward neural language model . . . . . . . . . . . . . . . . . 8
2.1.2.1 Feed-forward neural network . . . . . . . . . . . . . . . . 8
2.1.2.2 Feed-forward neural language model . . . . . . . . . . . . 9
2.1.2.3 Word embeddings . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Recurrent neural language model . . . . . . . . . . . . . . . . . . . 11
2.1.3.1 Recurrent neural network . . . . . . . . . . . . . . . . . . 11
2.1.3.2 Recurrent neural language model . . . . . . . . . . . . . . 12
2.1.3.3 Long short-term memory . . . . . . . . . . . . . . . . . . 12
2.1.3.4 Sequence-to-sequence . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Attention and Transformer . . . . . . . . . . . . . . . . . . . . . . 14
2.1.4.1 Attention mechanism . . . . . . . . . . . . . . . . . . . . 14
2.1.4.2 Transformer and self-attention . . . . . . . . . . . . . . . 15
4 Multi-task Learning 42
4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.3 Auxiliary Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.3.1 Sentence-level objective . . . . . . . . . . . . . . . . . . . 44
Next Sentence Prediction (NSP) . . . . . . . . . . . . . 45
Sentence Order Prediction (SOP) . . . . . . . . . . . . 45
4.1.3.2 Token-level objective . . . . . . . . . . . . . . . . . . . . 45
TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Part-of-speech Tags (POS) . . . . . . . . . . . . . . . . 45
Dependency Parsing (DP) . . . . . . . . . . . . . . . . 46
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Quality vs. Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusion 50
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A practical pipeline to evaluate language models on open-
ended text generation task . . . . . . . . . . . . 50
A direct comparison between unlikelihood training and stochas-
tic decoding methods . . . . . . . . . . . . . . . 51
An insight on how multi-task learning can lead to better
machine generation . . . . . . . . . . . . . . . . 51
5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Incorporate human evaluation to verify the correctness of
the evaluation metrics . . . . . . . . . . . . . . . 51
Further experiments on different ways to tweak language
models training objective . . . . . . . . . . . . . 52
Experiment with more auxiliary training objectives . . . . . 52
Experiment with larger language models . . . . . . . . . . . 52
Bibliography 53
List of Figures
List of Tables
3.1 Sequence repetition scores between different models and decoding meth-
ods. *Human text sequence repetition is computed using the training
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Selection accuracy between different models with MultiNLI and StoryCloze
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Selection accuracy when being trained with different domains . . . . . . . 41
Chapter 1
Introduction
Language modeling is the task of calculating the probability distribution over word
sequences, with the goal of generating more fluent text in a certain language, where
sequences with higher probability are considered more fluent. In recent years, neural
language modeling has become the powerhouse for an array of natural language processing
tasks, from machine translation (Bahdanau et al., 2014; Luong et al., 2015) and summarization
(Zhang et al., 2020) to story generation (Fan et al., 2018). The introduction of the Transformer
architecture (Vaswani et al., 2017) has allowed language models to be trained on even
more massive text datasets in significantly less time thanks to its ability to parallelize
computation. Today, large pre-trained language models like GPT-2 (Radford et al., 2019),
or the latest GPT-3 (Brown et al., 2020) with 175 billion parameters, have achieved state-
of-the-art results in numerous tasks in zero-shot and few-shot settings.
Despite their superiority in multiple NLP tasks, these language models still fall
short in open-ended text generation tasks such as story generation and dialogue modelling,
where the model is required to produce long continuations of text. Based on
empirical observations with standard deterministic decoding methods, the text is often
found to be dull and repetitive (Holtzman et al., 2019; Welleck et al., 2019a; Shao et al.,
2017; Fan et al., 2018), and sometimes logically inconsistent and factually incorrect despite
being fluent and coherent (Li et al., 2020; Welleck et al., 2019b; Hayashi et al.,
2020; Petroni et al., 2019).
There exist a number of solutions to this problem, such as using stochastic decoding
methods when generating continuations to keep the model from repeating itself (Fan et al.,
2018; Holtzman et al., 2019), or altering the training objective to penalize the model for
being repetitive (Welleck et al., 2019a; Bengio et al., 2015). However, to the best of our
knowledge, there has not been any work that quantitatively compares the performance
of these solutions, due to the lack of a consistent evaluation metric. Therefore, it is hard
to decide which solution is better than the others.
Traditionally, language models have been evaluated based on perplexity, which concerns
the probability of a sentence being produced by the model. We also have a number
of other metrics for different tasks, such as BLEU for machine translation (Papineni
et al., 2002) or ROUGE for text summarization (Lin, 2004), which essentially compare
the similarity between human and machine-generated text. This works when we want
to judge the quality of the generated text - we want our models to produce natural,
human-like and grammatically correct sentences.
However, with open-ended generation tasks such as storytelling or dialogue generation,
we want our model not only to produce high-quality text but also to be creative and
diverse. For example, given the prompt "Once upon a time", we expect our story
generation model to generate a diverse range of continuations rather than repeating the
most probable story again and again. This is where traditional metrics like perplexity
can be problematic, since they place heavy emphasis on the probability of individual words. In
terms of creativity or diversity, we would probably prefer seeing "she takes the yacht to
school" rather than "she takes the bus to school" in a story. However, these two sentences can
have significantly different probabilities because the word "yacht" is much less probable
than "bus". Thus, the first sentence might result in much higher perplexity, which can
be perceived as lower quality.
1.1 Objectives
Question: What is the best metric to use when evaluating language models
on the open-ended text generation task in each dimension: quality, diversity
and consistency?
In this work, we study different evaluation metrics that have been proposed to evaluate
quality, diversity and consistency of machine-generated text, and aim to find the best
metric to use in each dimension.
Although stochastic decoding methods significantly reduce the repetition issue thanks
to randomization (Holtzman et al., 2019; Fan et al., 2018), they do not solve the underlying
problem with maximum likelihood training. Because of this, altering the training
objective might sound like a more viable approach. Many works have been carried out
in this regard, such as scheduled sampling (Bengio et al., 2015), Generative Adversarial
Nets (Goodfellow et al., 2014; Yu et al., 2017; Guo et al., 2018), or most recently
unlikelihood training (Welleck et al., 2019a; Li et al., 2020).
However, since there has not been any direct comparison between the two techniques, it is
hard to decide which one is superior. We argue that if both lead to the same performance,
we would be better off using stochastic decoding methods, since they are cheaper in terms
of training. Using the new evaluation pipeline that we propose in this work,
we aim to find out which is the superior technique for language models in the
open-ended text generation task: stochastic decoding methods or tweaking the language
model training objective.
With the success of BERT in a plethora of different NLP tasks (Devlin et al., 2019),
there has been a surge of interest in multi-task learning for language models. Besides
the traditional maximum likelihood estimate objective, many different auxiliary training
objectives have been introduced, such as masked language modelling (Devlin et al., 2019;
Yang et al., 2019), next sentence prediction (Devlin et al., 2019), or word/sentence
order prediction (Wang et al., 2019). All of these additional objectives share a common
goal: to make the language model better at understanding the language. In
this work, we are curious to find out whether multi-task learning can help language models
become better at a specific task - open-ended text generation. We only focus on
auxiliary training objectives in an unsupervised setting, i.e. where training labels can
be obtained automatically.
1.2 Thesis Overview

1.2.1 Chapter 2
1.2.2 Chapter 3
1.2.3 Chapter 4
1.2.4 Chapter 5
In Chapter 5, we give a conclusion of our project and its contribution, and provide
several suggestions for future work.
Chapter 2
Background
In recent years, language models have become the powerhouse for an array of natural
language processing (NLP) tasks, from machine translation (Bahdanau et al., 2014;
Luong et al., 2015), summarization (Zhang et al., 2020), story generation (Fan et al.,
2018) to dialogue generation (Li et al., 2020; Zhang et al., 2018; Welleck et al., 2019b).
So what exactly is a language model?
To formalize, for a text that contains m tokens {w_1, ..., w_m}, a language model is used to
compute the probability of that text:

p(w_1, w_2, ..., w_m),  w_i ∈ V
where V is the vocabulary. The goal of computing this probability is to help produce
more fluent text: fluent texts should have much higher probability than odd ones. For
example, if we want to translate the following sentence from Vietnamese to English:
"Tôi thích xe màu cam và màu xanh", the literal word-by-word translation would be
"I like car orange and blue". The language model should give this translation a lower
probability than a natural translation like "I like orange and blue cars".
To generate text, a language model can produce the next word given a context by using
the conditional probability of the next word given the previous words:

p(w_{t+1} | w_1, ..., w_t)
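To make this concrete, the following minimal sketch shows how such a model could be used to generate text greedily, one word at a time; the next_word_distribution helper is hypothetical and stands in for any trained language model:

def generate(next_word_distribution, context, max_words=20, end_token="</s>"):
    # Greedy generation: repeatedly pick the most probable next word under
    # p(w | previous words). `next_word_distribution` is a hypothetical helper
    # that returns a dict mapping each word to its conditional probability.
    words = list(context)
    for _ in range(max_words):
        probs = next_word_distribution(words)
        next_word = max(probs, key=probs.get)
        if next_word == end_token:
            break
        words.append(next_word)
    return words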
In the rest of this section, we explore the history of language models and reveal how
we have arrived at the state-of-the-art models of today, which stand behind recent
remarkable achievements in multiple NLP tasks.
2.1.1 n-gram
To compute the probability of a sequence of text, we can factorize the joint probability
using the chain rule:

p(w_1, w_2, ..., w_m) = p(w_1) · p(w_2 | w_1) · p(w_3 | w_1, w_2) · ... · p(w_m | w_1, ..., w_{m−1})
This equation suggests that at every step, we have to calculate the probability of a word
given all of its predecessors. This can be done by simply counting the number of occurrences
of the combination with and without the last word over all available text. For
example, for the sentence "I like orange and blue cars", the probability of the word
"blue" is equal to:

p(blue | I like orange and) = count(I like orange and blue) / count(I like orange and)

In practice, counting over arbitrarily long histories is infeasible, so an n-gram language
model approximates the probability of each word using only its n − 1 preceding words:

p(w_1, w_2, ..., w_m) = \prod_{k=1}^{m} p(w_k | w_{k−n+1}, ..., w_{k−1})
For example, in a uni-gram (1-gram) language model, the probability of the sentence "I
like orange and blue cars" will be computed as:

p(I like orange and blue cars) = p(I) · p(like) · p(orange) · p(and) · p(blue) · p(cars)
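As an illustrative sketch (not the exact procedure used later in this thesis), the counting-based estimation of n-gram probabilities can be written in a few lines of Python; the toy corpus below is only for demonstration:

from collections import Counter

# Toy corpus; real n-gram models are estimated over much larger collections.
corpus = ["I like orange and blue cars".split(),
          "I like blue cars".split()]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    # p(word | prev) = count(prev word) / count(prev); zero for unseen pairs,
    # which is exactly the sparsity problem discussed below.
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("blue", "cars"))   # 1.0 in this toy corpus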
An obvious problem with n-gram language models is sparsity, i.e. yielding zero probability
for unknown word combinations. This is very likely to happen in practice, as
language is always evolving and new combinations are created every day. There exists a
number of solutions to this problem, such as back-off (Kneser and Ney, 1995) or smoothing
(Chen and Goodman, 1999). The general idea of these solutions is to give unknown
combinations a tiny amount of probability mass, so we do not encounter any zero
values in our calculations.
Another problem with n-gram language models is that they are very limited in modeling
long-range dependencies between words in a sentence. For example, the sentence "The
cat at the end of the street comes here every day" requires at least an 8-gram language
model in order to know that "comes" should be a singular verb because of the singular
noun "cat". Increasing n is not a viable solution, as n-gram language models can be
computationally expensive when n is large. In the last example, an 8-gram language
model with a vocabulary size of 100,000 words can have up to 10^40 possible sequences.
Lastly, n-gram language models are incapable of informing us about the linguistic and
semantic properties of the language, as everything in an n-gram language model is
just plain statistics. Given the sentence "a man is eating an apple", n-gram language
models have no way to recognize that it is actually very similar to the sentence "a
woman is eating an orange" in terms of semantics. This is because in n-gram language
models we are only looking at the surface forms of words; thus, in the view of an n-gram
language model, the difference between the words "man" and "woman" is the same as the
difference between "man" and "bread". Therefore, n-gram language models do not have
the ability to generalize their knowledge to sequences that they have not encountered
during training.
2.1.2 Feed-forward neural language model

2.1.2.1 Feed-forward neural network

A neural network is a collection of computing units whose goal is to learn a mapping
function between a set of inputs and desired outputs. For example, we might want to
build a neural network to predict whether an image is a picture of a cat or a dog. In
this case, the inputs can be the individual pixels of the image, and the outputs can
be the probability of this image being a picture of a dog or a cat (Figure 2.1). Each
computing unit has associated weights, which can be learned with the help of a loss
function that penalizes the network when it makes incorrect predictions. The end goal is
to learn a set of weights that maximize the likelihood of the training data.
The neural network in Figure 2.1 is called feed-forward because of its architecture: the
computation proceeds forward with no cycles. A neural network can have multiple hidden
layers between the input layer and the output layer, in which case it is often known as a
deep neural network.
2.1.2.2 Feed-forward neural language model

The first feed-forward neural language model was proposed by Bengio et al. (2003),
whose architecture is shown in Figure 2.2. Similar to an n-gram language model, the
feed-forward neural language model uses the n previous words as context to compute
the probability of the next word in the sequence. What differs here is that each context
word has a vector representation, which can be looked up in a table C. These vectors
are concatenated and passed to a hidden layer, followed by a final softmax layer to obtain
the probability distribution of the next word over the entire vocabulary.
This model has the same downside as n-gram language models, since it only has access
to the n previous words to predict the next word. However, feed-forward neural language
models do not require smoothing (technically, the softmax layer never produces zero values
due to the exponential operation), and they can generalize much better over similar
contexts. Recall the example from the last section, where the n-gram model cannot know
that the sentence "a man is eating an apple" is actually very similar to the sentence
"a woman is eating an orange". The feed-forward neural language model solves this
problem by having access to a dense vector representation for each word, such that
similar words are expected to have similar feature vectors. This way, when the model
updates its parameters following a specific word, the changes will be carried over to
similar words as well.
Figure 2.2: Classic feed-forward neural network language model (Bengio et al., 2003)
2.1.2.3 Word embeddings

Bengio et al. (2003)'s language model has laid the foundation for what we now know as
word embeddings - real-valued vectors used to represent words, typically ranging from tens
to a few hundred dimensions. There exists a number of pre-trained word embeddings,
among which the most notable ones are GloVe (Pennington et al., 2014) and word2vec
(Mikolov et al., 2013a,b). Compared to Bengio et al. (2003), GloVe and word2vec's
training objectives are much simpler, allowing them to be trained on a much larger
scale. Thus, using these pre-trained word embeddings directly on downstream tasks is
much more effective than training a word embedding layer from scratch, in terms of
both training speed and task performance (Wang et al., 2020).
Figure 2.3: Example of relations captured by word2vec word embeddings when being
projected to a low dimension space (Mikolov et al., 2013a,b). Image credit: Ruder
(2018).
2.1.3 Recurrent neural language model

2.1.3.1 Recurrent neural network

A recurrent neural network (RNN) is a special type of neural network that contains cycles
in its structure. It is typically used for sequential data, such as audio signals, time series
or language. An RNN processes the input sequence one element at a time, and maintains a
state vector (known as the hidden state) to store the information of previously processed
context. An example of an RNN is given in Figure 2.4.
We refer to a specific point in the sequence as a time step. At every time step t, the
input x_t is passed into the network. Using this input, along with the hidden state from
the last time step h_{t−1}, the model computes the current hidden state h_t. This process is
repeated until we hit the end of the sequence. Depending on the type of recurrent neural
network, the model can choose to yield an output y_t at every time step (one-to-one,
e.g. part-of-speech tagging), yield a single output y_T at the final step (many-to-one, e.g.
text classification), or produce a sequence of outputs {y_1, ..., y_{T′}} of a different length T′
(many-to-many, e.g. machine translation).
2.1.3.2 Recurrent neural language model

The use of RNNs in language modeling was first introduced by Mikolov et al. (2010).
Figure 2.5 gives an example of how we can train a recurrent neural language model.
With each sequence used for training, we can obtain the target labels by shifting the
whole sequence one step to the left. At every time step, we pass the word embedding
of the current word to the RNN layers, apply the softmax operation to obtain the
probability distribution of the next word, then calculate the loss for backpropagation
using the target word. RNN language models were shown to outperform the state-of-the-art
n-gram models at the time, even when the n-gram models were given much
more training data (Mikolov et al., 2010).
Figure 2.5: Example of RNN language model (Jurafsky and Martin, 2009)
Thanks to their recurrent structure, RNN language models do not rely on a fixed-length
context, but can process a sequence of arbitrary length. Every word has access to the
latest hidden state, which carries the information of all previous words in the
sequence. In practice, however, RNNs still struggle to capture long-range dependencies,
as information from early time steps can diminish at later ones, especially in long
sequences - a problem known as vanishing gradients. In addition, because of their sequential
nature, RNNs are much slower to train compared to feed-forward neural networks.
2.1.3.3 Long short-term memory

To solve the vanishing gradient problem, Hochreiter and Schmidhuber (1997) proposed
the use of the long short-term memory (LSTM) block to replace the traditional RNN
block, with the addition of memory cells to preserve gradients throughout the sequence,
and multiple gates to control access to these memory cells. A comparison between the
traditional RNN block and LSTM block is given in Figure 2.6.
Figure 2.6: Comparison between (a) a traditional RNN block and (b) an LSTM block
The three types of gates in the LSTM block are the input gate, forget gate and output
gate. At a high level, the forget gate determines how much of the current memory cell's
content should be forgotten, the input gate decides how much knowledge should be added
to the new memory cell, and the output gate controls how much of the memory cell's
content should be transferred to the next hidden state. For example, when processing
the word "he" in the sentence "The two cars that he loves", the forget gate can choose
to let go of the information about the plural noun "cars", and the input gate can store
the new information about the singular pronoun "he" in order to correctly predict the
singular verb "loves".
2.1.3.4 Sequence-to-sequence
In a sequence-to-sequence (seq2seq) architecture, an encoder network first compresses the
source sequence into a fixed-size vector, which is then used to start decoding the output
sequence using the decoder network. An example of this architecture is given in Figure 2.7.

This compression step might become a bottleneck, as the source sequence has to be
squeezed into a fixed-size vector before being passed to the decoder network. If we have a
long source sequence, it becomes difficult for the model to transfer all of the information of
the source sequence in the encoder network to the decoder network.
2.1.4 Attention and Transformer

2.1.4.1 Attention mechanism

Unlike the sequential process in seq2seq models, when translating a sentence from one
language to another, humans do not necessarily take just one single look at the source
sentence and immediately translate it to the target language. What we tend to do is
rather an iterative process, where we keep working back and forth between the source
sentence and the target sentence to find the information needed at a specific time step
and ignore the rest.
This example in language translation is exactly the motivation for the attention mechanism,
which was first proposed by Bahdanau et al. (2014). In this model, the attention
mechanism allows the decoder network to look at every hidden state from the encoder
network to find all the information it needs. In addition to the use of the last hidden
state s_{t−1} and the last output y_{t−1} to produce the next hidden state s_t in the decoder
network, Bahdanau et al. introduce a context vector c_t - a weighted summation of all
hidden states from the encoder network (Figure 2.8):

s_t = f(s_{t−1}, y_{t−1}, c_t)

c_t = \sum_{i=1}^{T_x} α_{ti} · h_i
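The weighted sum above is straightforward to compute. The sketch below illustrates it with NumPy; note that it scores encoder states with a simple dot product for brevity, whereas Bahdanau et al. (2014) use a small feed-forward network to compute the attention weights:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(decoder_state, encoder_states):
    # Score every encoder hidden state against the current decoder state
    # (a dot product here for brevity; Bahdanau et al. use a small MLP),
    # turn the scores into attention weights, and take the weighted sum.
    scores = encoder_states @ decoder_state      # shape (T_x,)
    alphas = softmax(scores)                     # attention weights alpha_ti
    return alphas @ encoder_states               # c_t = sum_i alpha_ti * h_i

encoder_states = np.random.randn(5, 8)           # T_x = 5 hidden states of size 8
decoder_state = np.random.randn(8)
print(context_vector(decoder_state, encoder_states).shape)   # (8,)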
Attention has been successfully applied to a plethora of NLP tasks, such as machine
translation (Bahdanau et al., 2014; Luong et al., 2015), text classification (Letarte et al.,
2018; Jain and Wallace, 2019), text summarization (Rush et al., 2015; Wu et al., 2018;
Zhang et al., 2020) and question answering (Sukhbaatar et al., 2015; Kim et al., 2017).
Despite the significant improvements in task performance, the slow computation time of
seq2seq models remains an unsolved problem, preventing them from being trained on a larger
scale.
2.1.4.2 Transformer and self-attention

In the seminal paper Attention is all you need, Vaswani et al. (2017) revolutionized
the NLP field with the Transformer block (Figure 2.9), which completely eliminates the
need for sequential processing in RNN and seq2seq models. The main innovation in the
Transformer block comes from the multi-head attention layer.
A word in a sentence can relate to other words in various ways, e.g. being a nominal
subject for a word while being a direct object to another. This is exactly what multi-head
attention can be used for: each head in the multi-head attention layer is a self-attention
layer with its own parameters, so it can learn about different relationships between
the source word and the others simultaneously. Figure 2.10b gives an example of this
phenomenon, where the word “it” attends to the phrase “the animal” in one head, but
to the word “tired” in the other.
Since Transformer models can take advantage of parallel computing resources, language
models can now be trained on a massive scale. State-of-the-art pre-trained language
models nowadays typically hold up to billions of parameters and are trained on terabytes
of unlabeled data (Brown et al., 2020). Figure 2.13 shows popular pre-trained language
models and their respective sizes. These language models continue to achieve state-
of-the-art performance across a plethora of NLP tasks, such as language generation,
machine translation or question answering (Radford et al., 2015, 2019; Peters et al.,
2018; Yang et al., 2019; Devlin et al., 2019; Brown et al., 2020).
Instead of training from scratch, practitioners can take these pre-trained language models
and retrain them on labeled datasets to adapt to specific downstream tasks and achieve
competitive results in significantly less time.
Figure 2.13: Pretrained language models and their sizes (Sanh et al., 2019)
The following section gives a brief overview of GPT-2 (Radford et al., 2019), which is
the model we fine-tune in all of our experiments in this project.
2.1.5.2 GPT-2
GPT-2 was trained on the WebText dataset, which contains 8 million web documents
from a variety of domains.
Thanks to its huge number of parameters and the diversity of content in its training dataset,
GPT-2 is able to achieve impressive performance on an array of language tasks without
any supervised training data, including machine translation, summarization, reading
comprehension and question answering. Fine-tuning GPT-2 models on specific tasks has
proved to be an effective approach, as they can achieve competitive results with
little training time and supervised signal (Ziegler et al., 2019; See et al., 2019).
As discussed, despite their superiority in multiple NLP tasks, language models still
fall short in open-ended text generation tasks. With traditional deterministic decoding
methods, machine-generated text is often found to be dull and repetitive (Holtzman
et al., 2019; Welleck et al., 2019a; Shao et al., 2017; Fan et al., 2018), and sometimes
inconsistent and factually incorrect despite being fluent and coherent (Li et al., 2020;
Welleck et al., 2019b; Hayashi et al., 2020; Petroni et al., 2019) - a problem known as
neural text degeneration. Examples of neural text degeneration with greedy decoding and
beam search can be seen in Table 2.1.
Many studies suspect that the root of this problem is the maximum likelihood
training objective (Holtzman et al., 2019; Welleck et al., 2019a). Multiple alternatives
have been suggested, which fall into one of the following two categories: using stochastic
decoding or tweaking the training objective.
All stochastic decoding methods share a common goal, which is to introduce a degree of
randomness to the generation process so that the model has less chance to repeat itself. In this
section, we give an overview of popular stochastic decoding methods in the literature,
with examples of their continuations in Table 2.1.
Sampling with temperature One can generate more diverse text simply by sampling
the next word from the learned softmax distribution. This is often done with the use
of a Boltzmann temperature parameter (Ackley et al., 1985) to control the randomness
of the sampling process, where zero temperature is equivalent to the argmax operation, and
infinite temperature corresponds to uniform sampling. However, choosing the right
greedy: The coronavirus is a virus that causes a variety of diseases, including pneumonia, hepatitis, and other respiratory diseases. The coronavirus is a virus that causes a variety of diseases, including pneumonia, hepatitis, and other respiratory diseases. The coronavirus is a virus that causes a variety of diseases, including pneumonia, hepatitis, and other respiratory diseases. The coron

beam search (b = 10): COVID-19 is the most common coronavirus in the United States. The coronavirus is the most common coronavirus in the United States. The coronavirus is the most common coronavirus in the United States. The coronavirus is the most common coronavirus in the United States. The coronavirus

temperature sampling (t = 0.7): This virus is transmitted from person to person, through the Internet. This infection is characterized by a wide variety of symptoms including fever, headache, fatigue, fatigue, nausea, vomiting, muscle aches, and pain in the chest. The symptoms of COVID-19 include: Decreased heart rate Harmful bowel movements Decreased appetite Increased

top-p (p = 0.9): The outbreak began in October 2008, with 15 cases being confirmed between November 2008 and June 2009. The cases included 11 women aged between 18 and 43 years old. Of these, 1 had died. There is still no evidence of the virus spreading into humans, but it is thought to be spread by people who are infected by infected persons, or infected in public.

top-k (k = 10): The coronavirus is present in about half of all patients and can cause a wide range of health problems including: eye and skin cancers, gastrointestinal problems, and liver and kidney diseases. It is the most widely distributed virus in the United States today, and is transmitted by a number of viruses. It is also the most prevalent and highly contagious virus, causing the death of

penalized sampling (θ = 1.2): The virus causes severe respiratory and cardiovascular problems, including death from pneumonia or heart failure in some people with the infection. : A new type of viral that can cause serious health issues such as cancer, hepatitis C, HIV/AIDS, diabetes mellitus, stroke. : An emerging class called "viruses" are known to be responsible for many diseases like AIDS, malaria, tuberculosis

greedy (with unlikelihood training): It is a common and deadly disease that has been linked to the deaths of hundreds of thousands of people. The first known case of COVID-19 was reported in the late 1960s. The first person to be infected with the virus was a woman named Mary Ann Crouch, who had been living in a house in the village of Crouch.

Table 2.1: Examples of continuations for the sentence "Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus." using the GPT-2 Small model with different decoding methods
temperature can be tricky, as there is a trade-off between the diversity and quality of the
generated text: lower temperatures generate less diverse but more predictable words, while
higher temperatures generate more diverse text but can produce more implausible words
(Caccia et al., 2018).
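A minimal sketch of temperature sampling over a vector of next-token logits (independent of any particular model) could look as follows:

import numpy as np

def sample_with_temperature(logits, temperature=0.7):
    # temperature -> 0 approaches the argmax; a very large temperature
    # approaches uniform sampling over the vocabulary.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))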
Top-k sampling Fan et al. (2018) introduced top-k sampling, which samples from
the list of the k words with the highest probability at each time step, effectively ignoring the
tail of the distribution. While top-k sampling has led to considerably higher quality
text compared to sampling with temperature, choosing the right k value is still arbitrary:
if one word in the list of k words makes up most of the distribution, we are still
likely to produce implausible words.
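A corresponding sketch of top-k sampling, again over raw next-token logits, is shown below:

import numpy as np

def sample_top_k(logits, k=10):
    # Keep only the k highest-scoring tokens, renormalize, then sample,
    # effectively ignoring the tail of the distribution.
    logits = np.asarray(logits, dtype=float)
    top_ids = np.argsort(logits)[-k:]
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()
    return int(np.random.choice(top_ids, p=probs))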
Nucleus (top-p) sampling To address the aforementioned problem with top-k sampling,
Holtzman et al. (2019) introduced Nucleus (also known as top-p) sampling, which
samples from the words whose probability mass makes up the top p portion of the total
distribution. Unlike top-k sampling, where only a fixed number of candidates are considered,
the number of candidates in Nucleus sampling changes dynamically according to the
distribution mass at each time step.
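The following sketch illustrates Nucleus sampling; the exact cutoff and tie-breaking behaviour may differ slightly from existing implementations:

import numpy as np

def sample_top_p(logits, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability reaches p
    # (the "nucleus"), renormalize, then sample from that set only.
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))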
Penalized sampling In some contexts, like question answering, sampling the next
tokens can lead to a wrong answer, since implausible tokens can still receive non-zero
probability mass. To account for this problem, Keskar et al. (2019) proposed penalized
sampling, which samples words in a near-greedy fashion but prevents repetition
by discounting the scores of previously generated tokens.
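One common way to implement this discounting (used, for example, in library implementations of the repetition penalty) divides positive logits and multiplies negative logits of previously generated tokens by θ; the sketch below follows that convention and may differ in detail from Keskar et al. (2019)'s exact formulation:

import numpy as np

def penalized_argmax(logits, generated_ids, theta=1.2):
    # Discount the scores of tokens that have already been generated, then
    # pick the next token near-greedily.
    logits = np.asarray(logits, dtype=float).copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= theta
        else:
            logits[token_id] *= theta
    return int(np.argmax(logits))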
Stochastic decoding methods have one downside: they do not solve the underlying problem
with maximum likelihood training. In this section, we examine different strategies
to alter the language model training objective to cope with the neural text degeneration
problem.
Entmax loss/sampling There exists a mismatch between training and testing conditions
in stochastic decoding methods, where the model generates text based on a truncated
softmax distribution but is evaluated based on the full softmax distribution (Martins
et al., 2020). The authors thus proposed to use an entmax loss function when
training instead of softmax, which transforms a vector of scores into a sparse probability
distribution.
Scheduled sampling When training a language model with the maximum likelihood
training objective, we have access to the ground-truth tokens at every time step; however,
at inference time, we have to rely on tokens generated by the model itself. Thus, the
model cannot recover if it makes a single mistake, since errors would cascade through
the whole generated sequence. To address this problem, Bengio et al. (2015) proposed
scheduled sampling, in which the model randomly selects between the ground-truth
token and the model-generated token as the label at training time, with the hope that it
can learn to correct its own mistakes at inference time. Despite scheduled sampling winning
the MSCOCO 2015 image captioning challenge, Huszár (2015) suspected that it
did not solve the underlying problem of maximum likelihood training and was
in fact an inconsistent training strategy. Goyal et al. (2017) revealed that scheduled
sampling could not distinguish between local errors and cascading errors, as the gradient
did not provide enough useful information.
Unlikelihood training Welleck et al. (2019a) propose unlikelihood training, which adds a
loss term that pushes down the probability of a set of negative candidates, such as previously
generated tokens (Welleck et al., 2019a) or a list of tokens that appear too often (Li et al., 2020).
In sequence-level unlikelihood training, the model is given a list of prefixes to generate text from,
then gets penalized for repeated n-grams in its own generations to account for the distribution
mismatch between training sequences and generated sequences (Welleck et al.,
2019a). To address consistency issues, Li et al. (2020) use unlikelihood training
on existing natural language inference datasets to penalize contradicting sentence pairs,
thus pushing down the probabilities of contradicting utterances.
There are several aspects that we want to take a closer look at when evaluating language
models on the open-ended generation task: quality, diversity, and consistency. In
this section, we study which metrics have been proposed in the literature and aim to
compare them.
2.3.1 Quality
Corpus-BLEU One way to judge the quality of a collection of machine-generated text S_gen is to measure its n-gram overlap with a human reference corpus S_ref, by averaging the BLEU score of every generated sample against the reference corpus:

Corpus-BLEU(S_gen, S_ref) = \frac{1}{|S_gen|} \sum_{S ∈ S_gen} BLEU(S, S_ref)
Note that a higher Corpus-BLEU score implies better generation quality, since the generated text has
more n-gram overlap with the human reference data. The downside of this evaluation
metric is its quadratic runtime complexity: for each sample we need to calculate a BLEU
score between that sample and the whole reference corpus.
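A sketch of this computation using NLTK's sentence-level BLEU is given below; the experiments in this thesis may use a different BLEU implementation and smoothing scheme:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def corpus_bleu_score(generated, references):
    # `generated` and `references` are lists of token lists. Each generated
    # sample is scored against the whole reference corpus (treated as a
    # multi-reference set), and the scores are averaged.
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu(references, hyp, smoothing_function=smooth)
              for hyp in generated]
    return sum(scores) / len(scores)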
Forward perplexity Because natural, high-quality and grammatically correct sentences
tend to have higher probabilities than gibberish, we can use the likelihood of a
sentence as a proxy for its quality. Zhao et al. (2018) propose the use of an RNN language
model that has been trained on real text data to compute the perplexity of a model's
samples, which the authors refer to as forward perplexity.
The reason for using a RNN language model here is to estimate the true distribution of
the entire language. This metric can help to measure the fluency of machine generated
text which is its quality in essence (Zhao et al., 2018; Cı́fka et al., 2018). To remove
the need for training a RNN language model on human data, one can leverage available
pre-trained language models which have been trained on massive dataset such as GPT-2
(Radford et al., 2019).
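For illustration, the sketch below computes a corpus-level perplexity of generated samples with an off-the-shelf GPT-2 Small model from HuggingFace; the exact aggregation used in the experiments later in this thesis may differ:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def forward_perplexity(samples):
    # Perplexity of the generated samples under a model trained on human
    # text (here an off-the-shelf GPT-2 Small stands in for that model).
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in samples:
            ids = tokenizer(text, return_tensors="pt").input_ids
            loss = model(ids, labels=ids).loss      # mean NLL per predicted token
            n = ids.size(1) - 1
            total_nll += loss.item() * n
            total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))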
Acceptability Another way to think about the quality of machine-generated text is
its acceptability - how natural it feels to native speakers of the language. Acceptability
can be influenced by context: sentences that sound strange when standing alone
can appear natural in specific contexts, while those which appear perfectly natural by themselves
may sound odd when surrounded by other sentences (Lau et al., 2020; Bizzoni
and Lappin, 2019; Bernardy et al., 2018). This is crucial in the open-ended text generation
task, since models are usually conditioned on specific contexts before being asked to
generate continuations.
Lau et al. (2020) propose the use of pre-trained language models to calculate a sentence's
probability as a proxy for its acceptability within a given context. According to the
authors' findings, using a BERT model (Devlin et al., 2019) with PenLP (Vaswani et al.,
2017) to normalize the sentence's probability produces acceptability scores that match
human intuition.
2.3.2 Diversity
Self-BLEU One way to think about diversity is how different the generated samples in a
collection are from each other. Using the same intuition as Corpus-BLEU, Zhu
et al. (2018) introduce the Self-BLEU score to assess the similarity between every document
and the rest of the generated collection. To formalize, Self-BLEU returns the mean
BLEU score of every sample from the set of machine-generated text S_gen against all
other samples in the same set S_gen:
Self-BLEU(S_gen) = \frac{1}{|S_gen|} \sum_{S ∈ S_gen} BLEU(S, S_gen \ {S})
A lower Self-BLEU score implies higher diversity of the collection, as the documents
in the collection are more different from each other. Similar to Corpus-BLEU, this
metric suffers from quadratic runtime complexity, which makes it intractable for
large collections of documents.
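A direct sketch of Self-BLEU, again using NLTK's sentence-level BLEU, is shown below; its quadratic cost is visible in the nested comparison:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generated):
    # Mean BLEU of each sample against every other sample in the same
    # collection; a lower score indicates a more diverse collection.
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(generated):
        others = generated[:i] + generated[i + 1:]
        scores.append(sentence_bleu(others, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)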
Reverse perplexity Zhao et al. (2018) also propose reverse perplexity, in which an RNN
language model is trained on the machine-generated collection and then used to compute
the perplexity of human-written text. Here, the RNN language model resembles the
distribution of the generated collection. Similar to how forward perplexity judges the
quality of the generated collection using human data, reverse perplexity measures the
quality of human data based on the generated collection. If the generated collection is
diverse enough to represent different writing styles or topics, the RNN language model
should perceive human text as natural and fluent, therefore giving low perplexity to
human data.
Sequence repetition While humans rarely repeat themselves when writing, machine-generated
text is often found to be repetitive, especially when produced by
deterministic decoding methods (Holtzman et al., 2019; Welleck et al., 2019a; Shao
et al., 2017; Fan et al., 2018). While not being diverse does not necessarily mean being
repetitive, being repetitive prevents the model from generating diverse continuations.
Therefore, using repetition as a metric can give us an idea of how diverse a document
collection is. Welleck et al. (2019a) use a metric called seq-rep-n to measure sequence
repetition by calculating the portion of duplicate n-grams in a generated sequence S:
seq-rep-n = 1.0 − \frac{|unique n-grams(S)|}{|n-grams(S)|}
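This metric is simple to compute; a minimal sketch over a list of tokens is:

def seq_rep_n(tokens, n=4):
    # Portion of duplicate n-grams in a generated sequence:
    # seq-rep-n = 1 - |unique n-grams| / |n-grams|
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)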
One of the biggest challenges for a model in open-ended text generation tasks is to
demonstrate its commonsense/logical reasoning ability. This is similar to the task of
natural language inference (NLI), in which the model is given pairs of sentences and must
decide whether the relationship between them is neutral, entailment or contradiction
(Fyodorov et al., 2000; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al.,
2005; MacCartney and Manning, 2009).
MultiNLI Williams et al. (2018) introduce the Multi-Genre NLI Corpus (MultiNLI),
which is made up of a variety of written and spoken materials, ranging from reports,
speeches, letters and conversations to non-fiction and fiction books. Due to its diversity
in writing styles, genres and topics, this dataset is suitable for many open-ended text
generation tasks.
Dialogue NLI Welleck et al. (2019c) introduce the Dialogue NLI dataset, which
contains labeled pairs of sentences constructed from the Persona-Chat dialogue dataset
(Zhang et al., 2018). This dataset is most suitable for the dialogue generation task.
It is clear that using only traditional metrics such as perplexity is not enough to evaluate
language models on the open-ended text generation task. Instead, we need to look at their
performance on all three dimensions: the quality, diversity and consistency of the generated
text.
In this chapter, we first present our experiment to compare different evaluation metrics
for each of the dimensions when evaluating language models on the open-ended text generation
task. We then decide on the best metric to use for each dimension, and
use those metrics to assess the two common techniques for solving neural text degeneration:
stochastic decoding methods and tweaking the training objective. In this thesis,
we explore unlikelihood training (Welleck et al., 2019a) as a representative of the
training objective strategy.
3.1 Setup
3.1.1 Dataset
We use a corpus constructed from the Harry Potter book series for all of our experiments.
The corpus contains 1,104,770 words in total (1,710,720 subwords), with 29,245 unique
words (50,257 unique subwords). We use the GPT-2 Tokenizer [1] to tokenize the text into
sequences of 200 subwords in length. We do a train/dev/test split with a ratio of 80/10/10,
which results in 6,220 sequences for training, 778 sequences for development and 778
sequences for testing.
3.1.2 Unlikelihood Training

The main goal of unlikelihood training is to push down the probability mass of negative
candidates. Given a sequence of tokens (x_1, ..., x_T) and a set of negative candidates
C^t = {c_1, ..., c_m} at time step t, Welleck et al. (2019a) define the unlikelihood loss at time
step t as:

L_UL(p_θ(· | x_{<t}), C^t) = − \sum_{c ∈ C^t} log(1 − p_θ(c | x_{<t}))

The unlikelihood loss can be used alongside the usual cross-entropy loss when training or
fine-tuning the language model, and it can be applied at two levels: token-level and
sequence-level.
Token-level loss For token-level loss, the list of negative candidates at each time
step is all of the tokens from previous time steps, so the model can avoid repeating
tokens that it has seen before (Welleck et al., 2019a).
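For illustration, a simplified (non-vectorized) sketch of the token-level unlikelihood loss is shown below; the official implementation of Welleck et al. (2019a) is vectorized and is combined with the usual cross-entropy loss:

import torch
import torch.nn.functional as F

def token_level_unlikelihood_loss(logits, targets):
    # logits: (T, V) next-token scores; targets: (T,) gold token ids.
    # At step t the negative candidates C^t are all previous target tokens
    # (excluding the current target), and we minimise
    # -sum_{c in C^t} log(1 - p(c | x_<t)).
    probs = F.softmax(logits, dim=-1)
    loss = logits.new_zeros(())
    for t in range(1, targets.size(0)):
        candidates = targets[:t].unique()
        candidates = candidates[candidates != targets[t]]
        if candidates.numel() == 0:
            continue
        one_minus_p = (1.0 - probs[t, candidates]).clamp(min=1e-8)
        loss = loss - torch.log(one_minus_p).sum()
    return loss / targets.size(0)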
3.1.3 Training
We use the pre-trained GPT-2 Small from HuggingFace [2] as our base model. We fine-tune
it on the Harry Potter books dataset using two different training methods: maximum
likelihood estimate (MLE) training and unlikelihood (UL) training (Welleck et al.,
2019a). With unlikelihood training, we follow what the authors have suggested: with
probability 0.5 we use the sequence-level loss, and otherwise the token-level loss. To compute
the sequence-level loss (with n-grams of size 4), we use the prefix of size 50 of each training
sequence in the current training batch and greedily decode continuations of length 100.

[1] https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer
[2] https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel
All models are trained for 4 epochs using the Adam optimizer with a batch size of 12 and
a learning rate of 0.001. To find out how sensitive unlikelihood training is to the number
of training epochs, we also train another model using unlikelihood training with only
1 epoch.
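The sketch below outlines this MLE fine-tuning setup with HuggingFace's GPT-2; train_sequences is a placeholder for the tokenized 200-subword training sequences described in Section 3.1.1, and details such as learning-rate scheduling and the unlikelihood loss term are omitted:

import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel

# `train_sequences` is a placeholder: a LongTensor of shape (num_sequences, 200)
# holding the tokenized training sequences described in Section 3.1.1.
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(train_sequences, batch_size=12, shuffle=True)

model.train()
for epoch in range(4):
    for batch in loader:
        loss = model(batch, labels=batch).loss   # standard cross-entropy (MLE) loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()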
We are also interested in what effect larger models have on text generation. Whenever
possible, we repeat the experiments with a GPT-2 Medium base model using a
similar training regime.
3.2 Results
In this section, we want to investigate how different the token distributions of human
text are from those of machine-generated text. Our hypothesis is that if a model can
produce text which has a similar distribution of tokens to human text, it would read
as natural and human-like. Below is a paragraph that we have taken from the Harry Potter
series:
Harry, who was shaking all over, thought for a moment that Dumbledore
might not be able to climb into the boat; he staggered a little as he attempted
it; all his efforts seemed to be going into maintaining the ring of protective
flame around them. Harry seized him and helped him back to his seat.
Once they were both safely jammed inside again, the boat began to move
back across the black water, away from the rock, still encircled by that ring
of fire, and it seemed that the Inferi swarming below them did not dare
resurface.
“Sir,” panted Harry, “sir, I forgot – about fire – they were coming at me and
I panicked –”
They reached the bank with a little bump and Harry leapt out, then turned
quickly to help Dumbledore. The moment that Dumbledore reached the
bank he let his wand hand fall
We then use the GPT-2 Small model as an oracle to calculate the probability of each
token in the paragraph (Figure 3.1). From the plot, we can see that the token probabilities
of human-written text do not follow a consistent pattern but move up and down
arbitrarily, which means that when writing, humans never try to find the most probable word
after every word. This explains why deterministic decoding methods such as greedy and
beam search fail to produce natural and human-like text.
We use the same prefix as the above paragraph (Harry, who was shaking all over)
and generate continuations of similar length (200 subwords) using our three models:
Maximum Likelihood Estimate GPT-2 (MLE GPT-2), Unlikelihood Training with 4
epochs (UL GPT-2 4 epochs), and Unlikelihood Training with 1 epoch (UL GPT-2 1
epoch). All continuations are generated using greedy decoding, except for the MLE
model, where we also experiment with top-p and top-k decoding, with the value of p set to
0.9 and the value of k set to 10. The machine-generated continuations are given below:
MLE GPT-2, greedy decoding: Harry, who was shaking all over, was
still standing there, looking at the ceiling.
“I’m sorry,” he said, “but I’m not going to tell you what happened. I’m not
going to tell you what happened. I’m not going to tell you what happened.
I’m not going to tell you what happened. I’m not going to tell you what
happened. I’m not going to tell you what happened. I’m not going to tell
you what happened. I’m not going to tell you what happened. I’m not going
to tell you what happened. I’m not going to tell you what happened. I’m not
going to tell you what happened. I’m not going to tell you what happened.
I’m not going to tell you what happened. I’m not going to tell you what
happened. I’m not going to tell you what happened. I’m not going to tell
you what happened. I’m not going to tell you what happened.
MLE GPT-2, top-p decoding, p = 0.9: Harry, who was shaking all over
and watching, said, “You don’t understand how to make a mistake –”
“I am not a wizard,” said Harry. “I do not believe that my mother’s magical
work was a good idea.
But she was, after all, a wizard....”
“That’s right, I doubt she was,” said Hermione, pulling a few strands of
parchment out of her sleeve and peering at them.
“And that’s a sign, then,” said Harry. “What are you, then?”
“You,” said Hermione, with a look of great horror. “I have a badge for my
mother...”
“I have a badge for my mother.” Harry, who was standing behind her, was
now smiling with his forehead on his robes.
“What does it say?”
“It tells you...”
The door of the tent opened and
MLE GPT-2, top-k decoding, k = 10: Harry, who was shaking all over
himself, and Harry looked up to see his father, Ron, who was still looking at
him from his seat.
“Well, you’ll see, I mean, it’s been a long time,” Ron said, looking down at
his hands. “But it’s not like I’m telling you anything, isn’ it, isn’ it? You
can’t say you were a bit scared, or you’re just not doing the right thing. You
can’t.”
Harry looked up again, but Ron had been right. The look in their eyes was
as much relief as it had been feeling. Harry thought he saw a bit of remorse
in the look in Ron’s face.
“You’re sorry,” said Hagrid, “but I’ve just got a feeling something will hap-
pen, Harry Potter. It’s not like we’ve got all these other people’s names on
it — you don’t know anything about the Ministry of Magic.”
“You’re not going to be able to do that,” said Harry, looking at the door,
which was open to the outside. Harry had to stop himself, because he was
still looking at the door with a very large smile on his face. He had a feeling
that he was going to be able to get away with it, though, because he had a
very large hand on the wand, and Harry was not sure whether he was going
to be able to see the door open again until the very end of the night.
Harry had to go back to the bathroom, and when he was done, he was looking
at the door again, and Harry felt very lucky. He had been able to see the
door open for a moment, and then he saw the door open again, and he was
very glad that he had managed to get inside.
“I’m sorry,” said Harry, “but I’m not going to tell you what to do next
time. I’m just going to tell you what I want to see Dumbledore do when he’s
here.” -Ron Weasley, who was now looking back at Dumbledore’s portrait,
and the Death Spell wandcrossing into the Forbidden Forest Firep. Trans-
lator’s Transmogrified Spellbag -“And now, Harry, you can tell the whole
world about the Patronuses Curse Curse Patron-Blood Lord’s friends, Ron,
Hermione, and Sirius face on Deathly Deathly High Inquisitor’s face-Ex-Ex-
Lionbeer. Patron Patron: Dumbledore Death Spellpike-GentorHistoryLionbus
AbridgedAndHarry’s parents, Ron, Galleons, and Neville.
We then calculate the token distribution of each paragraph using the same model which
generated it. For continuations that are generated using stochastic decoding, we adjust
the distribution to match the behavior of that decoding method: at each time step
we only consider the top 90% of the probability mass for top-p decoding and the top 10 tokens for
top-k decoding, then renormalize so that the total probabilities sum
up to 1. The plot of the distribution is given in Figure 3.2.
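A sketch of this procedure is given below; the nucleus truncation shown here is a simplification and may differ in boundary behaviour from the decoding implementations used to generate the text:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_probabilities(text, top_p=None):
    # Probability the oracle model assigns to each observed token. If `top_p`
    # is given, each step's distribution is truncated to the top-p nucleus and
    # renormalized, mirroring the behaviour of the decoding method.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]        # shape (T, V)
    out = []
    for t in range(1, ids.size(0)):
        dist = F.softmax(logits[t - 1], dim=-1)
        if top_p is not None:
            sorted_p, sorted_ids = dist.sort(descending=True)
            keep = sorted_p.cumsum(0) <= top_p
            keep[0] = True                                # always keep the top token
            mask = torch.zeros_like(dist)
            mask[sorted_ids[keep]] = 1.0
            dist = dist * mask
            dist = dist / dist.sum()
        out.append(dist[ids[t]].item())
    return out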
The MLE GPT-2 with greedy decoding (Figure 3.2a) exhibits a phenomenon called
model overconfidence (See et al., 2019; Jiang and de Rijke, 2018; Holtzman et al., 2019),
where the model becomes more confident over time and places high probability on a
small number of tokens. This explains why it keeps generating the sentence I'm not
going to tell you what happened again and again.
Figure 3.2: Token probabilities of the generated continuations: (a) MLE GPT-2, greedy decoding; (b) MLE GPT-2, top-p decoding, p = 0.9; (c) MLE GPT-2, top-k decoding, k = 10; (d) UL GPT-2 (1 epoch), greedy decoding; (e) UL GPT-2 (4 epochs), greedy decoding
Out of the five paragraphs, the two generated by MLE GPT-2 with top-p (Figure 3.2b)
and top-k (Figure 3.2c) decoding have the most similar distributions to the given human
text. Although the paragraphs do not really make sense in terms of commonsense and
logical reasoning, they still feel closer to natural human-written language than the one
generated by MLE GPT-2 with greedy decoding.
Compared to the other samples, the two paragraphs generated by the unlikelihood training
models have quite low probabilities overall (Figures 3.2d and 3.2e), especially the one
trained with 4 epochs, where most tokens have a probability value smaller than
0.2. With the UL GPT-2 (4 epochs) model, the generated paragraph stops making sense
after the first one hundred subwords and starts to produce only gibberish. The one being
In this section, our main objectives are (i) finding the best metrics to evaluate the quality
and diversity of machine-generated text and (ii) comparing the models based on their
quality/diversity trade-off.
Since there are a number of parameters to tweak, i.e. the choice of decoding method
or the choice of parameters in a stochastic decoding method, it can be hard to determine
which model is superior in text generation. Caccia et al. (2018) propose a method
called temperature sweep, which uses a fixed set of parameters across all models and then sees
which model has the best result overall. Following a similar approach, for each model
we generate continuations using three decoding algorithms: greedy, top-p and top-k
decoding. For the hyper-parameters p and k, we select a fixed set of values for p from the
range [0.2, 0.96] and a fixed set of values for k from the range [2, 400]. To generate
the model samples, we take the prefix of length 50 subwords of every sequence from the
training set and generate continuations of length 150 subwords using the different decoding
methods/parameters. In total, we have 57 model/parameter pairs, each generating
6,220 different samples.
We use all of the evaluation metrics discussed in Chapter 2. For Corpus-BLEU and
Self-BLEU, we only select 500 continuations from each sample set to calculate the scores,
due to their runtime complexity. For reverse perplexity, instead of an RNN language
model as the authors have suggested, we use a GPT-2 Small model to speed up the
training process. We fine-tune it on the 6,220 generated samples of each model/parameter
pair, then calculate the perplexity against our Harry Potter corpus test set.
For forward perplexity, we use an off-the-shelf GPT-2 Small model to calculate the
perplexity against the generated sample set of each model/parameter pair.
Figure 3.3: Scatter plots between Corpus-BLEU, Forward perplexity and BERT + PenLP: (a) BERT + PenLP vs. Corpus-BLEU; (b) BERT + PenLP vs. Forward perplexity; (c) Forward perplexity vs. Corpus-BLEU

Figure 3.3 shows the relationship between the three evaluation metrics for judging quality.
BERT + PenLP and Forward perplexity highly correlate with each other overall
(Figure 3.3b), and they both agree with Corpus-BLEU up until an inflection point where
we observe a reverse trend (Figures 3.3a and 3.3c). The points where the reverse trend
happens are either generated by (i) the greedy decoding method or (ii) stochastic decoding
methods with low values of p and k, which behave in a near-greedy manner. This is
similar to the likelihood trap phenomenon (Zhang et al., 2021), in which the authors observe
that human judgement is positively related to the log likelihood of the model up
to a certain point, after which it becomes negatively related. Based on this finding,
we argue that Corpus-BLEU is the best metric to evaluate the quality of generated text,
since it behaves more like human judgement and is able to detect mode collapse.
Figure 3.4 shows the relationship between the three evaluation metrics for evaluating diversity. Self-BLEU and Reverse perplexity generally correlate well with each other. seq-rep-4 is able to detect highly repetitive samples; however, it is harder to see how different the models' performance is, since the differences in magnitude are marginal. As for runtime cost, reverse perplexity is much slower to compute than the other metrics, as we have to train a language model on the generated samples. Therefore, we choose to use Self-BLEU for the rest of the experiments.
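For concreteness, the sketch below shows one way to compute Self-BLEU with NLTK, scoring each sample against all other samples as references; the smoothing function and tokenization choices are assumptions.

# Self-BLEU over a set of generated continuations (Zhu et al., 2018): higher
# Self-BLEU means the samples are more similar to each other, i.e. less diverse.
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, n=4):
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1          # smoothing choice is an assumption
    tokenized = [word_tokenize(s) for s in samples]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i + 1:]   # all other samples
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)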
Figure 3.4: Scatter plots between Self-BLEU, Reverse perplexity and seq-rep-4: (a) Self-BLEU vs. Reverse perplexity; (b) seq-rep-4 vs. Reverse perplexity
Using Corpus-BLEU and Self-BLEU as the two evaluation metrics for quality and diversity, we plot the quality/diversity trade-off between the models in Figure 3.5 and use a log function to fit the data points. Note that we use Negative Corpus-BLEU instead of Corpus-BLEU, so that lower is better for both metrics. Looking at the graph, it is unclear which model is best in the quality-diversity trade-off space, since they all lie on similar diagonal lines. The trade-off between quality and diversity is clear: increasing the degree of randomness (by increasing p in top-p decoding or increasing k in top-k decoding) improves diversity but worsens quality, and vice versa.
Even though the samples generated by MLE GPT-2 with greedy decoding are much more repetitive (with a seq-rep-4 of 0.584 and a Self-BLEU of 0.65), UL GPT-2 (1 epoch) with greedy decoding appears much worse in terms of diversity, with a Self-BLEU score of 0.78 and a seq-rep-4 of 0.004 (Table 3.1 and Figure 3.5). Taking a closer look at the generated samples, we found that no matter what context was given to the model, UL GPT-2 still produces texts that are rather similar. Below are several examples:
“But the task’s not till tonight!” said Harry, accidentally spilling scrambled
eggs down his front, afraid he had mistaken the time.
“But I’m not going to tell anyone. I’ve got to go and find the
Ministry of Magic, and I’m not going to be in the Ministry’s office
for the rest of the year. I’ve got a lot more to do with the Min-
istry’s Head of Magical Law Enforcement than I could have ever
expected.”
Harry looked around the room, and then back to the Ministry’s
office. He had not seen the Ministry’s office for a long time, but
he had to admit
“I’ll just tell you what, then. I’ll be back in a few minutes.”
The train was a long way from the station, and Harry was still not
sure whether he was going to be able to get back to the castle. He
had to wait until the next morning, when he had a chance to ask
the old man if he wanted to go back to Hogwarts.
“I’ll see you at the Ministry,” he said, and he walked back to the
“There’s no point, Harry.” Tonks had appeared out of nowhere, her mousy
hair wet with sweat. “You’re going to have to get back to Hogwarts,
you know that, if you’re going to be there, you’ll have to be in the
Ministry’s office. You’ve got to get your wand back, and I’ll be
there, and I’ll be there for you, Harry Potter, and you’ll have a
very special place in the school.”
Harry looked around the room, his eyes wide open, and saw the
Professor’s face. He was wearing a very
We can see that even though the contexts are different, the continuations always drift to the Ministry of Magic and getting back to Hogwarts. This shows that looking at repetition alone is not enough to judge the diversity of generated text: although unlikelihood training might solve the problem of generating repetitive text, it does not help the model produce more diverse text.
To see if bigger models are better in the quality/diversity trade-off space, we repeat the experiment using GPT-2 Medium as the base model. Figure 3.6 shows the comparison between Small and Medium, with a log function fitted to the data points. Overall, the Medium models only perform better than the Small models by a narrow margin, with the orange points lying slightly closer to the origin of the graph.
3.2.3 Consistency
As discussed in Section 2.3.3, we use selection accuracy (Li et al., 2020) as a metric to
evaluate a model’s ability to demonstrate commonsense reasoning. As we are focusing
on story generation, we choose to use the MultiNLI dataset (Williams et al., 2018) and
the StoryCloze dataset (Mostafazadeh et al., 2016) for this experiment.
Table 3.1: Sequence repetition scores between different models and decoding methods.
*Human text sequence repetition is computed using the training dataset
Figure 3.6: Quality vs. Diversity trade-off between GPT-2 Small and GPT-2 Medium
models
From the MultiNLI development set, we extract triples of sentences so that for every sentence A, we have a sentence B that contradicts it and a sentence C that entails it. We only consider a triple if the first sentence A ends with a punctuation mark, for ease of separation. In the end, we have 4,607 triples of sentences from the MultiNLI development set for our experiment. For every triple, we calculate the perplexity of the second sentence (B or C) with the first sentence A as context. Similar to Li et al. (2020), we assume the model selects the sentence with the lower perplexity, and we use that to calculate its selection accuracy. Note that this is an unsupervised setup, i.e. we do not fine-tune our models on the MultiNLI dataset; we only use them to compute the perplexity scores.
Table 3.2: Selection accuracy between different models on the MultiNLI and StoryCloze datasets
For StoryCloze, we use its development set, which consists of 1,570 stories. We concatenate the first four sentences of each story as the context, and ask the model to select between two different endings. Similar to above, we calculate the perplexity of the two endings given the context to determine the model's selection. Results are given in Table 3.2.
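A sketch of this perplexity-based selection procedure is given below; the whitespace handling and the way context tokens are masked out of the loss are implementation assumptions for the sketch.

# Unsupervised selection accuracy in the spirit of Li et al. (2020): the model
# "selects" the candidate continuation with the lower conditional perplexity.
import torch

def conditional_perplexity(model, tokenizer, context, candidate):
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cand_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cand_ids], dim=1)
    labels = ids.clone()
    labels[:, :ctx_ids.shape[1]] = -100      # score only the candidate tokens
    with torch.no_grad():
        loss = model(ids, labels=labels).loss
    return torch.exp(loss).item()

def selection_accuracy(model, tokenizer, triples):
    # triples: list of (context, entailing sentence, contradicting sentence)
    correct = 0
    for context, entailment, contradiction in triples:
        ppl_pos = conditional_perplexity(model, tokenizer, context, entailment)
        ppl_neg = conditional_perplexity(model, tokenizer, context, contradiction)
        correct += ppl_pos < ppl_neg
    return correct / len(triples)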
It is clear that fine-tuning a model on the Harry Potter corpus decreases its selection accuracy on both datasets; however, the drop is larger for MultiNLI than for StoryCloze. One possible explanation is that StoryCloze provides a much longer context, which may help the model find the correct ending regardless of the fine-tuning process.
Overall, the UL models perform worse than their MLE counterparts on both datasets. On the MultiNLI dataset, fine-tuning both the GPT-2 Small and Medium models with unlikelihood training for 4 epochs decreases their selection accuracy by around 10%, compared to only around 4% for maximum likelihood estimation (MLE) training. This suggests that using unlikelihood training might have a detrimental effect on a model's logical reasoning ability.
Among all of our trained Small models, MLE GPT-2 achieves the best result, with a selection accuracy of 0.61 on MultiNLI and 0.59 on StoryCloze. However, this is only slightly better than chance, which might explain why the text produced by the model in Section 3.3 is not consistent.
In all cases, the models based on GPT-2 Medium perform considerably better than those based on GPT-2 Small. This suggests that larger models can be superior to smaller models at language understanding.
Table 3.3: Selection accuracy when trained on different domains
Training the model on a narrow-domain dataset such as the Harry Potter series may have a detrimental effect on its selection accuracy, especially on the MultiNLI dataset, since it is made up of a variety of domains. To make sure that the narrow domain was not the cause of the drop in selection accuracy with unlikelihood training, we repeat the same experiment but train the models on the WikiText-2 dataset (Merity et al., 2016), which we consider a broader-domain dataset than the Harry Potter series. The WikiText-2 training set contains 600 Wikipedia articles, with 2,088,628 word tokens in total. From this training set, we randomly select 6,220 sequences of 200 subwords to match the size of the Harry Potter dataset. We train all of the models using the same environment and hyper-parameters as in the previous section.
The final results are given in Table 3.3. Overall, the new models trained on WikiText-2 achieve better selection accuracy than the old ones on both datasets. However, the trend is still the same: on the MultiNLI selection task, unlikelihood training leads to a lower selection accuracy than standard maximum likelihood estimation training. This suggests that the drop in selection accuracy when using unlikelihood training has little to do with domain difference.
An interesting thing to note is that when trained on the WikiText-2 dataset, more epochs of unlikelihood training actually lead to a slight increase in selection accuracy on both tasks. This further suggests that unlikelihood training is sensitive to the number of training epochs and rather difficult to train.
Chapter 4
Multi-task Learning
A language model is typically trained on a single task: predicting the next word given its context. However, instead of focusing on that one single task, we can train our machine learning models to solve different tasks simultaneously while still sharing the model's parameters across all tasks. This strategy is known as multi-task learning (Caruana, 1997), and it has been shown to help machine learning models generalize better on their original task, since they can learn valuable information from a variety of tasks.
Even though multi-task learning has been applied to language models from an early
stage in the NLP field (Collobert and Weston, 2008), it has grown even more popular
with the introduction of BERT (Devlin et al., 2019). BERT, or Bidirectional Encoder
Representations from Transformers, is a Transformer-based language model, with the
largest variant containing around 340 million parameters. As opposed to GPT-2 - a
unidirectional language model, where at every time step the model can only attend
to previous words to generate the hidden state for the current word - BERT is bidirec-
tional, meaning that each word can access both of its left (previous) and right (following)
context. This bidirectional architecture has been shown to produce a much deeper representation of the language context, which allows BERT to obtain new state-of-the-art
results on a variety of natural language processing tasks, including question answering
and natural language inference (Devlin et al., 2019).
BERT obtains its powerful bidirectional representation of words by using two novel
training objectives: masked language model and next sentence prediction (Figure 4.1).
With the masked language model objective, the authors randomly mask 15% of the
words in a training sequence, and ask BERT to correctly identify the masked word
(Figure 4.1a). With the next sentence prediction objective, the authors first extract
pairs of connected sentences from the training corpus, then randomly sample pairs of
unrelated sentences. They then give these pairs of sentences to BERT and ask it to
predict whether the first sentence is followed by the second sentence in each pair (Figure
4.1b).
One downside of BERT is that, since each word has access to both its left and right context, BERT loses the ability of a causal language model to generate continuations, where the model is only allowed to condition on the context it has seen so far. However, given the success BERT has had with multi-task learning, we believe multi-task learning has the potential to bring causal language models to another level in natural language understanding. In this chapter, we explore different strategies for applying multi-task learning when training causal language models.
4.1 Setup
4.1.1 Dataset
In this experiment, we use the same Harry Potter series from the previous chapter as
our training corpus.
4.1.2 Training
We train a new model by fine-tuning the GPT-2 model on the Harry Potter dataset using
the original next word prediction (MLE) task with one auxiliary training objective. We
end up with 5 different models, which we call NSP GPT-2 (Next Sentence Prediction),
SOP GPT-2 (Sentence Order Prediction), TFIDF GPT-2 (Term Frequency - Inverse
Document Frequency), POS GPT-2 (Part-of-speech Tag) and DP GPT-2 (Dependency
Parsing).
All models are trained for 4 epochs using the Adam optimizer with a batch size of 5-20 (depending on the memory demand of the training objective) and a learning rate of 0.001. In each training batch, we add the losses from all objectives together for back-propagation. The implementation details of each auxiliary objective are given in the following section.
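The sketch below illustrates this combined training step; the auxiliary loss callable and the batch variables are placeholders (assumptions), since each auxiliary objective computes its loss differently.

# Multi-task training step: the language modelling loss and one auxiliary loss
# are summed before back-propagation. `lm_batch`, `aux_batch` and
# `auxiliary_loss` stand in for whichever objective is being used.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(lm_batch, aux_batch, auxiliary_loss):
    optimizer.zero_grad()
    # Next word prediction (MLE) loss on the Harry Potter sequences.
    lm_loss = model(lm_batch, labels=lm_batch).loss
    # Auxiliary objective loss (NSP, SOP, TF-IDF, POS or DP).
    aux_loss = auxiliary_loss(model, aux_batch)
    loss = lm_loss + aux_loss          # losses are simply added together
    loss.backward()
    optimizer.step()
    return loss.item()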
To obtain the sentence data for these objectives, we use the nltk sentence segmenter (https://www.nltk.org/api/nltk.tokenize.html) to split the Harry Potter dataset into sentences. This results in 92,658 unique sentences for our experiment.
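A minimal sketch of this segmentation step, assuming the corpus sits in a single hypothetical harry_potter.txt file:

# Split the training corpus into sentences with NLTK's Punkt sentence tokenizer.
import nltk
nltk.download("punkt")                     # Punkt models are needed once
from nltk.tokenize import sent_tokenize

with open("harry_potter.txt", encoding="utf-8") as f:   # assumed file name
    sentences = sent_tokenize(f.read())
print(len(sentences))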
Next Sentence Prediction (NSP) Inspired by BERT (Devlin et al., 2019), in this task we present our model with two pairs of sentences - one consisting of two consecutive sentences (the positive pair) while the other contains two unrelated sentences (the negative pair) - and ask the model which one it prefers. The preference is calculated using the perplexity score (ppl) of each sentence pair. We use a margin ranking loss function to incentivize the model to assign lower perplexity to the positive sentence pair than to the negative pair.
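The exact formulation of the loss is not reproduced here; a standard hinge-style margin ranking loss over the two perplexity scores, with an assumed margin value, would look like the following sketch.

# Margin ranking loss on sentence-pair perplexities: the positive (consecutive)
# pair should score at least `margin` lower than the negative pair. Both inputs
# are tensors of perplexity values; margin=1.0 is an illustrative assumption.
import torch

def nsp_margin_ranking_loss(ppl_positive, ppl_negative, margin=1.0):
    # L = max(0, ppl_positive - ppl_negative + margin)
    return torch.clamp(ppl_positive - ppl_negative + margin, min=0.0)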
Using a number of starting sentences, we obtain the positive training set by simply taking the sentence following each starting sentence, and the negative training set by randomly sampling a sentence from the training corpus. To make training easier, we only select 6,220 pairs from each set, matching the number of training sequences for the original next word prediction task. Note that this is done when constructing the training data, so the same positive and negative sets are used in every training epoch.
TF-IDF In this task, we ask our model to predict the TF-IDF score of each token in the training dataset. To obtain the true TF-IDF scores as labels for training, we first split our training corpus into 243 documents, each consisting of 6,400 tokens, and then calculate the TF-IDF scores based on these documents. To produce a TF-IDF score, we add a linear layer on top of the original GPT-2 model and use a smooth L1 loss function for the regression task.
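A sketch of such a regression head is given below; instantiating a separate GPT2Model (rather than sharing the fine-tuned generation model's weights) and the hidden-size argument are simplifications for illustration.

# TF-IDF regression head: a single linear layer on top of the GPT-2 hidden
# states predicts one TF-IDF score per token, trained with a smooth L1 loss.
import torch
import torch.nn as nn
from transformers import GPT2Model

class TfidfHead(nn.Module):
    def __init__(self, hidden_size=768):          # 768 is GPT-2 Small's hidden size
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.regressor = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, tfidf_targets=None):
        hidden = self.gpt2(input_ids).last_hidden_state    # (batch, seq, hidden)
        scores = self.regressor(hidden).squeeze(-1)         # (batch, seq)
        if tfidf_targets is None:
            return scores
        return nn.functional.smooth_l1_loss(scores, tfidf_targets)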
Part-of-speech Tags (POS) In this task, we ask our model to predict the correct part-of-speech tag for each token in the training dataset. We obtain weak labels for training using the spacy part-of-speech tagger (https://spacy.io/usage/linguistic-features#pos-tagging). To make the spacy tokenizer compatible with the GPT-2 tokenizer, we use an approach similar to Devlin et al. (2019) to align the tokens produced by the two tokenizers, where we only assign the POS tag to the first sub-component of a word. For tokens for which we could not find a possible alignment, we assign a special label X (meaning Other). An example is given in Table 4.1.
We add a classification head on top of the original GPT-2 model, which uses the last
hidden state of each token to predict the POS tag. We use a cross-entropy loss function
for the classification task.
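A rough sketch of this alignment is shown below; tokenizing word by word, the whitespace handling and the en_core_web_sm pipeline are simplifying assumptions, and the real implementation may differ in detail.

# Align spacy POS tags to GPT-2 subwords: the first subword of each word gets
# the word's POS tag, remaining subwords get the special label "X".
import spacy
from transformers import GPT2Tokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def align_pos_labels(text):
    labels = []
    for token in nlp(text):
        # GPT-2 treats a leading space as part of the subword, so words after
        # the first one are tokenized together with their preceding space.
        word = token.text if token.idx == 0 else " " + token.text
        n_subwords = len(tokenizer.encode(word))
        labels.append(token.pos_)                  # tag on the first subword
        labels.extend(["X"] * (n_subwords - 1))    # "X" for the remaining pieces
    return labels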
Dependency Parsing (DP) In this task, we ask our model to correctly identify the relationship between certain pairs of tokens in the training dataset. Borrowing an example from Jurafsky and Martin (2009), given the sentence “I prefer the morning flight through Denver”, we would want our model to correctly predict all of the dependency labels in this sentence (Figure 4.2). Similar to the POS task, we extract all dependency pairs in the training sequences using the spacy dependency parser (https://spacy.io/usage/linguistic-features#dependency-parse), then use the same alignment strategy and cross-entropy loss function for the classification task. Note that we do not perform this classification for every pair of tokens, but only for the pairs obtained from the spacy dependency parser.
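For illustration, a sketch of how such dependency pairs could be extracted with spacy is given below; the en_core_web_sm pipeline and the (head, dependent, label) tuple format are assumptions made for the sketch.

# Extract (head, dependent, relation) triples with spacy's dependency parser;
# the relation label is the classification target for each extracted pair.
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_pairs(text):
    return [(tok.head.text, tok.text, tok.dep_) for tok in nlp(text)]

# For "I prefer the morning flight through Denver" this yields triples such as
# ("prefer", "I", "nsubj"), i.e. "I" is the nominal subject of "prefer".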
4.2 Results
4.2.1 Consistency
The selection accuracy scores on the MultiNLI and StoryCloze datasets for each model are given in Table 4.2. Overall, all models perform either similarly to or better than the MLE model, which was trained using only the original next word prediction task.
Table 4.2: Selection accuracy between different models on the MultiNLI and StoryCloze datasets
Among the five auxiliary training objectives in this experiment, the sentence order prediction task proves to be the most effective one: the SOP model achieves the best score out of all the models, and it comes close to the off-the-shelf GPT-2 selection accuracy score, even though it has been fine-tuned on a specific domain like the Harry Potter series. We also experiment with another model that combines the two most effective tasks according to the selection accuracy scores in Table 4.2 (TF-IDF and SOP); however, this does not further improve the model's performance.
Using Corpus-BLEU and Self-BLEU as the two evaluation metrics for quality and diversity, we plot the quality/diversity trade-off between the models in Figure 4.3 and use a log function to fit the data points. As in the experiment in Chapter 3, we use Negative Corpus-BLEU instead of Corpus-BLEU, so that lower is better for both metrics.
Even though NSP, SOP and TF-IDF + SOP are the models that achieve the highest selection accuracy scores on the commonsense reasoning task, they do slightly worse in the quality-diversity trade-off space according to the fitted lines in Figure 4.3, which suggests there might be a further trade-off between consistency and quality-diversity. For the other models, it is hard to decide which one is best in the quality-diversity trade-off space, since they all have similar performance. Curiously, even though the DP curve starts out very similar to the MLE curve, as quality dips it produces the best diversity scores (lower right corner, Figure 4.3).
Figure 4.3: Quality vs. Diversity trade-off between different learning objectives
We also give the sequence repetition scores for each model in Table 4.3. Overall, the scores are fairly consistent across all models, without any outliers, which suggests that adding these auxiliary training objectives does not affect the model in terms of n-gram repetition in its generation.
Overall, adding these auxiliary objectives seems to help improve logical consistency, while not harming the model in terms of the quality-diversity trade-off or sequence repetition. This suggests that there might be an incentive to include these auxiliary objectives when training or fine-tuning a language model.
Table 4.3: Sequence repetition scores between different learning objectives and decod-
ing methods. *Human text sequence repetition is computed using the training dataset
Chapter 5
Conclusion
5.1 Overview
We have conducted an experiment to search for the best metrics to use when evaluating language models on the open-ended text generation task. We have proposed an evaluation pipeline using these metrics and applied it to compare the performance of different stochastic decoding methods and unlikelihood training.
Finally, we have carried out experiments with multi-task learning to see if it can help language models get better at open-ended text generation. Using the evaluation pipeline above, we evaluated the efficacy of different auxiliary training objectives on the open-ended text generation task.
5.2 Contributions
When evaluating language models on the open-ended text generation task, we have found that Corpus-BLEU is the best metric to evaluate the quality of generated text due to
its similarity with human judgement. As for diversity, Self-BLEU appears to be the best metric to use thanks to its simplicity of calculation. To evaluate the consistency of the generated text, using selection accuracy on the MultiNLI dataset is good enough for most cases; for specific tasks such as story generation, other datasets can be considered (e.g. StoryCloze).
To the best of our knowledge, there has not been any work that quantitatively compares unlikelihood training with stochastic decoding methods. Using our proposed evaluation pipeline, we found that there was no clear difference between unlikelihood training and maximum likelihood estimation training with stochastic decoding methods in the quality-diversity trade-off space. However, unlikelihood training might have a negative effect on the ability of a language model to truly grasp the gist of the language, as it worsens the model's performance on the commonsense reasoning task.
We found that by adding certain auxiliary training objectives alongside the maximum likelihood estimation objective when fine-tuning GPT-2 models, the models achieved a much better score on the commonsense reasoning task, while still maintaining their performance in the quality-diversity trade-off space. This suggests that multi-task learning might help a language model to truly understand the language, which in turn leads to better generation.
Even though our experiments with evaluation metrics agree with Zhang et al. (2021)'s finding of the likelihood trap, further experiments with the help of human evaluation should be carried out to verify that the metrics work as expected. To the best of our knowledge, human judgement has so far only been used to evaluate the quality of generation, since it might be hard for a human to assess the diversity of machine-generated text (Hashimoto et al., 2019). Therefore, it might be worth investigating how we can leverage human evaluation to confirm the correctness of diversity metrics as well.
In this work, we only select unlikelihood training (Welleck et al., 2019a) as a representative of the training-objective strategy when comparing against stochastic decoding methods. For future work, it would be useful to look at other strategies for tweaking a language model's training objective as well, such as scheduled sampling (Bengio et al., 2015), GANs (Yu et al., 2017; Guo et al., 2018) and the entmax loss (Martins et al., 2020).
In this work, we only focused on syntactic supervision objectives. For future work, it might be worth carrying out experiments with training objectives that contain more semantics-oriented information, such as word senses or topic classification.
Due to the limitations of the available computing resources, we could only perform our experiments on GPT-2 Small and Medium models. If one has access to larger language models, it might be worth repeating the same experiments to see how much better larger language models get at open-ended text generation.
Bibliography
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learning algo-
rithm for Boltzmann machines. Cognitive science, 9(1):147–169.
Jay Alammar. 2018a. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer
Learning). [Online; accessed May 06, 2021].
Jay Alammar. 2018b. The Illustrated Transformer. [Online; accessed May 06, 2021].
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled Sam-
pling for Sequence Prediction with Recurrent Neural Networks. In Proceedings of the
28th International Conference on Neural Information Processing Systems - Volume 1,
NIPS’15, page 1171–1179, Cambridge, MA, USA. MIT Press.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural
probabilistic language model. The journal of machine learning research, 3:1137–1155.
Jean-Philippe Bernardy, Shalom Lappin, and Jey Han Lau. 2018. The Influence of
Context on Sentence Acceptability Judgements. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),
pages 456–461, Melbourne, Australia. Association for Computational Linguistics.
Yuri Bizzoni and Shalom Lappin. 2019. The Effect of Context on Metaphor Paraphrase
Aptness Judgments. In Proceedings of the 13th International Conference on Compu-
tational Semantics - Long Papers, pages 165–175, Gothenburg, Sweden. Association
for Computational Linguistics.
Johan Bos and Katja Markert. 2005. Recognising textual entailment with logical infer-
ence. In Proceedings of Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing, pages 628–635.
Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale GAN training for
high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Lau-
rent Charlin. 2018. Language GANs falling short. arXiv preprint arXiv:1811.02549.
Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu.
2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less
Forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 7870–7881, Online. Association for Computa-
tional Linguistics.
Stanley F Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques
for language modeling. Computer Speech & Language, 13(4):359–394.
Ondřej Cífka, Aliaksei Severyn, Enrique Alfonseca, and Katja Filippova. 2018. Eval all,
trust a few, do wrong to none: Comparing sentence generation models. arXiv preprint
arXiv:1804.07972.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language
processing: Deep neural networks with multitask learning. In Proceedings of the 25th
international conference on Machine learning, pages 160–167.
Cleo Condoravdi, Dick Crouch, Valeria De Paiva, Reinhard Stolle, and Daniel Bobrow.
2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-
NAACL 2003 workshop on Text meaning, pages 38–45.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising
textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–
190. Springer.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and
Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations,
data orders, and early stopping. arXiv preprint arXiv:2002.06305.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical Neural Story Genera-
tion. In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Associa-
tion for Computational Linguistics.
Yaroslav Fyodorov, Yoad Winter, and Nissim Francez. 2000. A natural logic inference
system. In Proceedings of the 2nd Workshop on Inference in Computational Semantics
(ICoS-2). Citeseer.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sher-
jil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680.
Kartik Goyal, Chris Dyer, and Taylor Berg-Kirkpatrick. 2017. Differentiable scheduled
sampling for credit assignment. arXiv preprint arXiv:1704.06970.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long
text generation via adversarial training with leaked information. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 32.
Tatsunori B Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and sta-
tistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.
Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2020. Latent relation
language models. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 34, pages 7911–7918.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
computation, 9(8):1735–1780.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious
case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for
Text Classification. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne,
Australia. Association for Computational Linguistics.
Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling,
likelihood, adversary? arXiv preprint arXiv:1511.05101.
Sarthak Jain and Byron C Wallace. 2019. Attention is not explanation. arXiv preprint
arXiv:1902.10186.
Shaojie Jiang and Maarten de Rijke. 2018. Why are Sequence-to-Sequence Models So
Dull? Understanding the Low-Diversity Problem of Chatbots. In Proceedings of the
2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented
Conversational AI, pages 81–86, Brussels, Belgium. Association for Computational
Linguistics.
Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd
Edition). Prentice-Hall, Inc., USA.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing
of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard
Socher. 2019. CTRL: A conditional transformer language model for controllable gen-
eration. arXiv preprint arXiv:1909.05858.
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured at-
tention networks. arXiv preprint arXiv:1702.00887.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language
modeling. In 1995 international conference on acoustics, speech, and signal processing,
volume 1, pages 181–184. IEEE.
Jey Han Lau, Carlos Armendariz, Shalom Lappin, Matthew Purver, and Chang Shu.
2020. How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in
Context. Transactions of the Association for Computational Linguistics, 8:296–310.
Gaël Letarte, Frédérik Paradis, Philippe Giguère, and François Laviolette. 2018. Im-
portance of self-attention for sentiment analysis. In Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages
267–275.
Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun
Cho, and Jason Weston. 2020. Don’t Say That! Making Inconsistent Dialogue Un-
likely with Unlikelihood Training. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages 4715–4728, Online. Association for
Computational Linguistics.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches
to Attention-based Neural Machine Translation. In Proceedings of the 2015 Confer-
ence on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon,
Portugal. Association for Computational Linguistics.
Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic.
In Proceedings of the Eight International Conference on Computational Semantics,
pages 140–156.
Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2020. Sparse Text
Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 4252–4273, Online. Association for Computa-
tional Linguistics.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer
sentinel mixture models. arXiv preprint arXiv:1609.07843.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur.
2010. Recurrent neural network based language model. In Eleventh annual conference
of the international speech communication association.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b.
Distributed representations of words and phrases and their compositionality. arXiv
preprint arXiv:1310.4546.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra,
Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A Corpus and Cloze
Evaluation for Deeper Understanding of Commonsense Stories. In Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 839–849, San Diego, California.
Association for Computational Linguistics.
Moin Nadeem, Tianxing He, Kyunghyun Cho, and James Glass. 2020. A Systematic
Characterization of Sampling Algorithms for Open-ended Language Generation. arXiv
preprint arXiv:2009.07243.
Christopher Olah. 2015. Understanding LSTM Networks. [Online; accessed May 06,
2021].
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method
for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA.
Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global
Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.
Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Ken-
ton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In
Proceedings of the 2018 Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational
Linguistics.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yux-
iang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases? In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computa-
tional Linguistics.
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation
learning with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Sebastian Ruder. 2018. A Review of the Neural History of Natural Language Processing.
[Online; accessed May 06, 2021].
Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model
for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT,
a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint
arXiv:1910.01108.
Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Man-
ning. 2019. Do Massively Pretrained Language Models Make Better Storytellers?
In Proceedings of the 23rd Conference on Computational Natural Language Learning
(CoNLL), pages 843–861, Hong Kong, China. Association for Computational Linguis-
tics.
Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray
Kurzweil. 2017. Generating High-Quality and Informative Conversation Responses
with Sequence-to-Sequence Models. In Proceedings of the 2017 Conference on Em-
pirical Methods in Natural Language Processing, pages 2210–2219, Copenhagen, Den-
mark. Association for Computational Linguistics.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End
Memory Networks. In Advances in Neural Information Processing Systems, volume 28.
Curran Associates, Inc.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with
neural networks. arXiv preprint arXiv:1409.3215.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in neural information processing systems, pages 5998–6008.
Shirui Wang, Wenan Zhou, and Chao Jiang. 2020. A survey of word embeddings based
on deep learning. Computing, 102(3):717–740.
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo
Si. 2019. StructBERT: Incorporating language structures into pre-training for deep
language understanding. arXiv preprint arXiv:1908.04577.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason
Weston. 2019a. Neural text generation with unlikelihood training. arXiv preprint
arXiv:1908.04319.
Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019b. Dialogue
Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 3731–3741, Florence, Italy. Association
for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Chal-
lenge Corpus for Sentence Understanding through Inference. In Proceedings of the
2018 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages
1112–1122. Association for Computational Linguistics.
Lijun Wu, Fei Tian, Li Zhao, Jianhuang Lai, and Tie-Yan Liu. 2018. Word attention for
sequence to sequence text understanding. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 32.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and
Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Un-
derstanding. In Advances in Neural Information Processing Systems, volume 32. Cur-
ran Associates, Inc.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative
adversarial nets with policy gradient. In Proceedings of the AAAI conference on
artificial intelligence, volume 31.
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang,
and Dimitris N Metaxas. 2017. Stackgan: Text to photo-realistic image synthesis with
stacked generative adversarial networks. In Proceedings of the IEEE international
conference on computer vision, pages 5907–5915.
Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. 2021. Trad-
ing Off Diversity and Quality in Natural Language Generation. In Proceedings of the
Workshop on Human Evaluation of NLP Systems (HumEval), pages 25–33, Online.
Association for Computational Linguistics.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-
training with extracted gap-sentences for abstractive summarization. In International
Conference on Machine Learning, pages 11328–11339. PMLR.
Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason
Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too? In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association
for Computational Linguistics.
Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. 2018. Adver-
sarially regularized autoencoders. In International conference on machine learning,
pages 5902–5911. PMLR.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu.
2018. Texygen: A benchmarking platform for text generation models. In The 41st
International ACM SIGIR Conference on Research & Development in Information
Retrieval, pages 1097–1100.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario
Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models
from human preferences. arXiv preprint arXiv:1909.08593.