Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

OpenAI

Abstract

Natural language understanding comprises a wide range of diverse tasks such
as textual entailment, question answering, semantic similarity assessment, and
document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task. In contrast to previous approaches, we make use of task-aware input
transformations during fine-tuning to achieve effective transfer while requiring
minimal changes to the model architecture. We demonstrate the effectiveness of
our approach on a wide range of benchmarks for natural language understanding.
Our general task-agnostic model outperforms discriminatively trained models that
use architectures specifically crafted for each task, significantly improving upon the
state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute
improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on
question answering (RACE), and 1.5% on textual entailment (MultiNLI).

1 Introduction
The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised
learning in natural language processing (NLP). Most deep learning methods require substantial
amounts of manually labeled data, which restricts their applicability in many domains that suffer
from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic
information from unlabeled data provide a valuable alternative to gathering more annotation, which
can be time-consuming and expensive. Further, even in cases where considerable supervision
is available, learning good representations in an unsupervised fashion can provide a significant
performance boost. The most compelling evidence for this so far has been the extensive use of
pre-trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].
Leveraging more than word-level information from unlabeled text, however, is challenging for two
main reasons. First, it is unclear which optimization objectives are most effective at learning
text representations that are useful for transfer. Recent research has looked at various objectives
such as language modeling [44], machine translation [38], and discourse coherence [22], with each
method outperforming the others on different tasks.¹ Second, there is no consensus on the most
effective way to transfer these learned representations to the target task. Existing techniques involve
a combination of making task-specific changes to the model architecture [43, 44], using intricate
learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made
it difficult to develop effective semi-supervised learning approaches for language processing.
¹ https://gluebenchmark.com/leaderboard

Preprint. Work in progress.

In this paper, we explore a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal
representation that transfers with little adaptation to a wide range of tasks. We assume access to
a large corpus of unlabeled text and several datasets with manually annotated training examples
(target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled
corpus. We employ a two-stage training procedure. First, we use a language modeling objective on
the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt
these parameters to a target task using the corresponding supervised objective.
For our model architecture, we use the Transformer [62], which has been shown to perform strongly on
various tasks such as machine translation [62], document generation [34], and syntactic parsing [29].
This model choice provides us with a more structured memory for handling long-term dependencies in
text, compared to alternatives like recurrent networks, resulting in robust transfer performance across
diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style
approaches [52], which process structured text input as a single contiguous sequence of tokens. As
we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal
changes to the architecture of the pre-trained model.
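As a rough illustration of processing structured input as a single contiguous sequence of tokens, the sketch below flattens a premise/hypothesis pair into one list. The function name `to_single_sequence` and the marker strings `<s>`, `<$>`, and `<e>` are hypothetical placeholders chosen for this example; the text here only states that delimiter tokens are used, not how they are spelled.

```python
from typing import List, Sequence


def to_single_sequence(segments: Sequence[Sequence[str]],
                       start: str = "<s>",
                       delim: str = "<$>",
                       end: str = "<e>") -> List[str]:
    """Flatten ordered text segments into one contiguous token sequence.

    The start/delim/end marker strings are illustrative assumptions.
    """
    tokens = [start]
    for index, segment in enumerate(segments):
        if index > 0:
            tokens.append(delim)  # mark the boundary between two segments
        tokens.extend(segment)
    tokens.append(end)
    return tokens


premise = ["a", "man", "sleeps"]
hypothesis = ["someone", "rests"]
sequence = to_single_sequence([premise, hypothesis])
print(sequence)
```

Because the result is an ordinary token sequence, it can be fed to the pre-trained model without adding task-specific layers for each input structure.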
We evaluate our approach on four types of language understanding tasks – natural language inference,
question answering, semantic similarity, and text classification. Our general task-agnostic model
outperforms discriminatively trained models that employ architectures specifically crafted for each
task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance,
we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40],
5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on
the recently introduced GLUE multi-task benchmark [64]. We also analyze zero-shot behaviors
of the pre-trained model in four different settings and demonstrate that it acquires useful linguistic
knowledge for downstream tasks.

2 Related Work

Semi-supervised learning for NLP. Our work broadly falls under the category of semi-supervised
learning for natural language. This paradigm has attracted significant interest, with applications to
tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used
unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a
supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using
word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a
variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information,
whereas we aim to capture higher-level semantics.
Recent approaches have investigated learning and utilizing more than word-level semantics from
unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled
corpus, have been used to encode text into suitable vector representations for various target tasks [28,
32, 1, 36, 22, 12, 56, 31].

Unsupervised pre-training. Unsupervised pre-training is a special case of semi-supervised learning
where the goal is to find a good initialization point instead of modifying the supervised learning
objective. Early works explored the use of the technique in image classification [20, 49, 63] and
regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization
scheme, enabling better generalization in deep neural networks. In recent work, the method has
been used to help train deep neural networks on various tasks like image classification [69], speech
recognition [68], entity disambiguation [17] and machine translation [48].
The closest line of work to ours involves pre-training a neural network using a language modeling
objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and
Ruder [21] follow this method to improve text classification. However, although the pre-training
phase helps capture some linguistic information, their usage of LSTM models restricts their prediction
ability to a short range. In contrast, our choice of transformer networks allows us to capture
longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the
effectiveness of our model on a wider range of tasks including natural language inference, paraphrase
detection and story completion. Other approaches [43, 44, 38] use hidden representations from a
pre-trained language or machine translation model as auxiliary features while training a supervised
model on the target task. This involves a substantial amount of new parameters for each separate
target task, whereas we require minimal changes to our model architecture during transfer.

Auxiliary training objectives. Adding auxiliary unsupervised training objectives is an alternative
form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of
auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling
to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling
objective to their target task objective and demonstrated performance gains on sequence labeling
tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training
already learns several linguistic aspects relevant to target tasks.

3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.

3.1 Unsupervised pre-training

Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling
objective to maximize the following likelihood:
$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{1}$$
where k is the size of the context window, and the conditional probability P is modeled using a neural
network with parameters Θ. These parameters are trained using stochastic gradient descent [51].
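A minimal sketch of the objective in Eq. 1 follows. It assumes the conditional probability is supplied as a plain callable; `lm_log_likelihood`, `cond_prob`, and the toy `uniform_model` are hypothetical names standing in for the neural network with parameters Θ.

```python
import math
from typing import Callable, Sequence, Tuple


def lm_log_likelihood(tokens: Sequence[int],
                      k: int,
                      cond_prob: Callable[[int, Tuple[int, ...]], float]) -> float:
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the corpus (Eq. 1)."""
    total = 0.0
    for i, token in enumerate(tokens):
        # Context window: at most the k tokens that precede position i.
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(token, context))
    return total


def uniform_model(token: int, context: Tuple[int, ...]) -> float:
    """Toy stand-in for the network: uniform over a 4-token vocabulary."""
    return 1.0 / 4


corpus = [0, 1, 2, 3, 1]
print(lm_log_likelihood(corpus, k=2, cond_prob=uniform_model))
```

In practice the quantity would be maximized with stochastic gradient descent over minibatches rather than evaluated over the whole corpus at once.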
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is
a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the
input context tokens followed by position-wise feedforward layers to produce an output distribution
over target tokens:
$$\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \mathrm{softmax}(h_n W_e^T)
\end{aligned} \tag{2}$$
where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token
embedding matrix, and $W_p$ is the position embedding matrix.
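The data flow of Eq. 2 can be sketched in plain Python as below. This is only an outline under stated assumptions: `placeholder_block` is a hypothetical element-wise stand-in and does not implement the self-attention and feedforward layers of a real transformer block, and the sizes and random initialization are arbitrary.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def placeholder_block(h):
    """Hypothetical stand-in for transformer_block; NOT real self-attention."""
    return [[math.tanh(v) for v in row] for row in h]


def forward(context, W_e, W_p, n_layers):
    # h_0 = U W_e + W_p: selecting row W_e[token] equals a one-hot row times W_e.
    h = [[e + p for e, p in zip(W_e[token], W_p[pos])]
         for pos, token in enumerate(context)]
    # h_l = block(h_{l-1}) for l = 1..n
    for _ in range(n_layers):
        h = placeholder_block(h)
    # h_n W_e^T: score each hidden row against every token embedding (tied weights).
    logits = [[sum(a * b for a, b in zip(row, emb)) for emb in W_e] for row in h]
    return [softmax(row) for row in logits]


random.seed(0)
vocab_size, context_len, width = 5, 3, 4  # arbitrary toy sizes
W_e = [[random.gauss(0.0, 0.02) for _ in range(width)] for _ in range(vocab_size)]
W_p = [[random.gauss(0.0, 0.02) for _ in range(width)] for _ in range(context_len)]
probs = forward([1, 4, 2], W_e, W_p, n_layers=2)
print(len(probs), len(probs[0]))
```

Reusing $W_e$ for the output projection means the model needs no separate output matrix over the vocabulary.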

3.2 Supervised fine-tuning

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens,
$x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain
the final transformer block’s activation $h_l^m$, which is then fed into an added linear output layer with
parameters $W_y$ to predict $y$:
$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y). \tag{3}$$
This gives us the following objective to maximize:
$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m). \tag{4}$$
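Eqs. 3 and 4 can be illustrated with the short sketch below. It assumes the final block's activation for each example has already been computed and is passed in as a plain list; `class_probs`, `supervised_objective`, and the toy numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(h_last, W_y):
    """Eq. 3: softmax(h W_y), with h a d-vector and W_y a d x num_classes matrix."""
    logits = [sum(h * w for h, w in zip(h_last, column)) for column in zip(*W_y)]
    return softmax(logits)


def supervised_objective(examples, W_y):
    """Eq. 4: sum over (activation, label) pairs of log P(label | input)."""
    return sum(math.log(class_probs(h_last, W_y)[label])
               for h_last, label in examples)


# Toy data: two examples with 2-dimensional activations and two classes.
W_y = [[2.0, 0.0],
       [0.0, 0.0]]
examples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
print(supervised_objective(examples, W_y))
```

The only new parameters in this sketch are the entries of `W_y`; everything that produces the activation comes from the pre-trained model.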

We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], which also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \tag{5}$$
Overall, the only extra parameters we require during fine-tuning are Wy , and embeddings for delimiter
tokens (described below in Section 3.3).
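The combination in Eq. 5 amounts to a weighted sum of two log-likelihoods. The sketch below assumes both terms have already been computed on the labeled dataset; the function names and the numeric values, including the weight, are made up for illustration.

```python
def combined_objective(l2: float, l1: float, lam: float) -> float:
    """Eq. 5: supervised log-likelihood plus weighted language modeling term."""
    return l2 + lam * l1


def training_loss(l2: float, l1: float, lam: float) -> float:
    """Both terms are maximized, so a minimizer would use the negated sum."""
    return -combined_objective(l2, l1, lam)


l2_value = -0.7  # hypothetical supervised log-likelihood
l1_value = -3.2  # hypothetical language modeling log-likelihood on the same inputs
lam = 0.5        # illustrative weight only
print(combined_objective(l2_value, l1_value, lam))
print(training_loss(l2_value, l1_value, lam))
```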
