Truncated_Doc_1
Truncated_Doc_1
by Generative Pre-Training
Abstract
1 Introduction
The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised
learning in natural language processing (NLP). Most deep learning methods require substantial
amounts of manually labeled data, which restricts their applicability in many domains that suffer
from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic
information from unlabeled data provide a valuable alternative to gathering more annotation, which
can be time-consuming and expensive. Further, even in cases where considerable supervision
is available, learning good representations in an unsupervised fashion can provide a significant
performance boost. The most compelling evidence for this so far has been the extensive use of pre-
trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].
Leveraging more than word-level information from unlabeled text, however, is challenging for two
main reasons. First, it is unclear what type of optimization objectives are most effective at learning
text representations that are useful for transfer. Recent research has looked at various objectives
such as language modeling [44], machine translation [38], and discourse coherence [22], with each
method outperforming the others on different tasks.1 Second, there is no consensus on the most
effective way to transfer these learned representations to the target task. Existing techniques involve
a combination of making task-specific changes to the model architecture [43, 44], using intricate
learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made
it difficult to develop effective semi-supervised learning approaches for language processing.
1
https://gluebenchmark.com/leaderboard
2 Related Work
Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised
learning for natural language. This paradigm has attracted significant interest, with applications to
tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used
unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a
supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using
word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a
variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information,
whereas we aim to capture higher-level semantics.
Recent approaches have investigated learning and utilizing more than word-level semantics from
unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled
corpus, have been used to encode text into suitable vector representations for various target tasks [28,
32, 1, 36, 22, 12, 56, 31].
2
pre-trained language or machine translation model as auxiliary features while training a supervised
model on the target task. This involves a substantial amount of new parameters for each separate
target task, whereas we require minimal changes to our model architecture during transfer.
3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens,
x1 , . . . , xm , along with a label y. The inputs are passed through our pre-trained model to obtain
the final transformer block’s activation hm l , which is then fed into an added linear output layer with
parameters Wy to predict y:
P (y|x1 , . . . , xm ) = softmax(hm l Wy ). (3)
This gives us the following objective to maximize:
X
L2 (C) = log P (y|x1 , . . . , xm ). (4)
(x,y)
We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], who also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):
L3 (C) = L2 (C) + λ ∗ L1 (C) (5)
Overall, the only extra parameters we require during fine-tuning are Wy , and embeddings for delimiter
tokens (described below in Section 3.3).