Lec14 Pretraining


Pretraining Language Models

Wei Xu
(many slides from Greg Durrett)
Pretraining / ELMo

Recall: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls

they dance at balls  /  they hit the balls


‣ Train a neural language model to predict the next word given the previous words in the sentence; use its internal representations as word vectors

Peters et al. (2018)


ELMo
‣ Char CNN over each word => LSTM language model predicts the next word
[Figure: the representation of "visited" comes from the LSTM hidden states (plus vectors from the backward LM); 4096-dim LSTMs with 512-dim projections; 2048 CNN filters projected down to 512-dim; a char CNN is applied to each input word: John visited Madagascar yesterday]


Peters et al. (2018)
How to apply ELMo?
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings: update the weights of your network but keep ELMo's parameters frozen
‣ Fine-tuning: backpropagate all the way into ELMo when training your model
[Figure: ELMo vectors for "they dance at balls" feed into some neural network, which produces task predictions (sentiment, etc.)]
Peters, Ruder, Smith (2019)
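As a rough illustration of the two regimes (not ELMo's actual API), here is a minimal PyTorch sketch in which a plain nn.LSTM stands in for the pretrained ELMo encoder; only the freeze_encoder flag differs between the two setups:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained contextual encoder (the real ELMo would be
# loaded from released weights, e.g., via AllenNLP).
pretrained_elmo = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)

class TaskModel(nn.Module):
    def __init__(self, encoder, num_classes, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            # Frozen embeddings: ELMo's parameters receive no gradient updates.
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, embedded_tokens):
        outputs, _ = self.encoder(embedded_tokens)   # contextual vectors
        return self.classifier(outputs.mean(dim=1))  # pool and classify

# freeze_encoder=True  -> frozen regime (only the task network trains)
# freeze_encoder=False -> fine-tuning regime (backprop into ELMo as well)
model = TaskModel(pretrained_elmo, num_classes=2, freeze_encoder=True)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

dummy_batch = torch.randn(2, 6, 512)   # (batch, seq_len, embedding dim)
logits = model(dummy_batch)            # gradients stop at the frozen encoder
```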
Results: Frozen ELMo

‣ Massive improvements across six benchmark datasets: question answering, natural language inference, semantic role labeling (discussed later in the course), coreference resolution, named entity recognition, and sentiment analysis
How to apply ELMo?

‣ How does frozen (❄) vs. fine-tuned (🔥) compare?

‣ Recommendations:

Peters, Ruder, Smith (2019)


Why is language modeling a good objective?
‣ "Impossible" problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
‣ LAMBADA dataset (Paperno et al., 2016): explicitly targets world knowledge and very challenging LM examples
‣ Coreference, Winograd schemas, and much more
Why is language modeling a good objective?

Zhang and Bowman (2018)


Why did this take time to catch on?
‣ An earlier version of ELMo by the same authors came out in 2017, but it was only evaluated on tagging tasks, and gains were 1% or less
‣ Required: training on lots of data, having the right architecture, and significant hyperparameter tuning
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc.
‣ Higher accuracy => ELMo is capturing that phenomenon better

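Concretely, a probe of this kind is usually just a small classifier trained on frozen per-token representations from one layer; the sketch below uses random placeholder tensors rather than real ELMo outputs and a hypothetical tag set size:

```python
import torch
import torch.nn as nn

# layer_reprs: frozen ELMo vectors from one layer, shape (num_tokens, 1024)
# pos_labels:  gold POS tag IDs for the same tokens, shape (num_tokens,)
layer_reprs = torch.randn(1000, 1024)        # placeholder data
pos_labels = torch.randint(0, 45, (1000,))   # e.g., 45 Penn Treebank tags

probe = nn.Linear(1024, 45)                  # only the probe trains; ELMo stays frozen
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    loss = loss_fn(probe(layer_reprs), pos_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

accuracy = (probe(layer_reprs).argmax(-1) == pos_labels).float().mean()
# Higher probe accuracy on a layer => that layer encodes more POS information.
```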
BERT
BERT
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT
came out October 2018
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
BERT
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
‣ ELMo representations look at each direction in isolation; BERT looks at them jointly
[Example: "A stunning ballet dancer, Copeland is one of the best performers to see live." One ELMo direction captures "ballet dancer", the other captures "performer"; BERT captures "ballet dancer/performer" jointly]
Devlin et al. (2019)
BERT
‣ How to learn a "deeply bidirectional" model? What happens if we just replace an LSTM with a transformer?
[Figure: ELMo (language modeling) predicts "visited Madag. yesterday ..." left-to-right over the input "John visited Madagascar yesterday"; the BERT column shows predictions at every position with bidirectional context over the same input]
‣ Transformer LMs have to be "one-sided" (only attend to previous tokens), which is not what we want
Masked Language Modeling
‣ How to prevent cheating? Next-word prediction fundamentally doesn't work for bidirectional models; instead, do masked language modeling
‣ BERT formula: take a chunk of text, predict 15% of the tokens (here the target is Madagascar)
‣ For 80% (of the 15%), replace the input token with [MASK]:  John visited [MASK] yesterday
‣ For 10%, replace with a random token:  John visited of yesterday
‣ For 10%, keep it the same:  John visited Madagascar yesterday


Devlin et al. (2019)
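A minimal sketch of this 15% / 80-10-10 corruption recipe (illustrative only; real preprocessing operates on wordpieces and computes the loss only at the selected positions):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (corrupted tokens, prediction targets) following the BERT recipe."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:          # select ~15% of tokens to predict
            targets.append(tok)
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs.append("[MASK]")
            elif r < 0.9:                        # 10%: replace with a random token
                inputs.append(random.choice(vocab))
            else:                                # 10%: keep the original token
                inputs.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)                 # not predicted; no loss here
    return inputs, targets

vocab = ["john", "visited", "madagascar", "yesterday", "of", "the"]
print(mask_tokens("john visited madagascar yesterday".split(), vocab))
```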
Next "Sentence" Prediction
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the "true" next one
‣ BERT objective: masked LM + next sentence prediction
[Figure: the input "[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna." passes through the transformer; the model predicts NotNext at [CLS] and Madagascar / enjoyed / like at the corrupted positions]
Devlin et al. (2019)
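A rough sketch of how such training pairs might be constructed (simplified; a real implementation works over wordpiece-tokenized, length-limited chunks and avoids sampling the true next chunk as the "random" one):

```python
import random

def make_nsp_example(chunks, i):
    """Build one [CLS] A [SEP] B example from a list of consecutive text chunks."""
    chunk_a = chunks[i]
    if random.random() < 0.5 and i + 1 < len(chunks):
        chunk_b, label = chunks[i + 1], "IsNext"           # true next chunk
    else:
        chunk_b, label = random.choice(chunks), "NotNext"  # random other chunk
    return f"[CLS] {chunk_a} [SEP] {chunk_b}", label

chunks = ["John visited Madagascar yesterday and really enjoyed it .",
          "He said the beaches were stunning .",
          "I like Madonna ."]
print(make_nsp_example(chunks, 0))
```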
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets pre-trained on a large corpus
Devlin et al. (2019)
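These sizes are easy to sanity-check if the HuggingFace transformers library and the standard bert-base-uncased / bert-large-uncased checkpoints are available (a sketch, not part of the original slides):

```python
from transformers import AutoModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: {cfg.num_hidden_layers} layers, {cfg.hidden_size}-dim, "
          f"{cfg.num_attention_heads} heads, {n_params / 1e6:.0f}M parameters")
# Expected output is roughly 110M and 340M parameters, matching the slide.
```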
What can BERT do?

‣ The [CLS] token is used to provide classification decisions
‣ Sentence pair tasks (entailment): feed both sentences into BERT
‣ BERT can also do tagging by predicting tags at each word piece
Devlin et al. (2019)
What can BERT do?
[Figure: the input "[CLS] A boy plays in the snow [SEP] A boy is outside" passes through the transformer; the model predicts Entails at [CLS]]
‣ How does BERT model this sentence-pair setting?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn't really cause this to happen
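For example, with the HuggingFace transformers library a sentence-pair classifier over the [CLS] representation looks roughly like the sketch below; the classification head is randomly initialized until fine-tuned, so the logits are meaningless before training:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # e.g., entailment / neutral / contradiction

# The tokenizer builds "[CLS] sentence A [SEP] sentence B [SEP]" automatically.
inputs = tokenizer("A boy plays in the snow", "A boy is outside",
                   return_tensors="pt")
logits = model(**inputs).logits          # classification decision from [CLS]
```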
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ Not an autoregressive model; you can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat
‣ Masked language models are intended to be used primarily for "analysis" tasks

Lewis et al. (2019)
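The "append a [MASK] and fill it in" trick can be sketched with a masked-LM head as below (HuggingFace transformers assumed); as the slide says, this is not what BERT is for, and the output is usually degenerate:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "John visited"
with torch.no_grad():
    for _ in range(5):                    # greedily append 5 tokens
        inputs = tokenizer(text + " " + tokenizer.mask_token, return_tensors="pt")
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        logits = model(**inputs).logits
        next_id = logits[0, mask_pos].argmax().item()
        text = text + " " + tokenizer.decode([next_id]).strip()
print(text)   # usually degenerate output: BERT was not trained to generate
```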


Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5
‣ Large changes to weights up here (particularly in the last layer to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean weights don't change much
‣ More complex "triangular learning rate" schemes exist
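In code, this recipe amounts to roughly the following (a sketch assuming HuggingFace transformers; the single toy batch stands in for a real DataLoader):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Small learning rate (2e-5 to 5e-5) and few epochs keep weight changes modest.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# A single toy batch stands in for a real dataset here.
batch = tokenizer(["the movie was great", "the movie was terrible"],
                  return_tensors="pt", padding=True)
batch["labels"] = torch.tensor([1, 0])

for epoch in range(3):                 # typically 1-3 epochs
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```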
Fine-tuning BERT

‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo

Peters, Ruder, Smith (2019)


Evaluation: GLUE

Wang et al. (2019)


Results

‣ Huge improvements over prior work (even compared to ELMo)

‣ Effective at "sentence pair" tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
Devlin et al. (2018)
RoBERTa
‣ "Robustly optimized BERT"
‣ 160GB of data instead of 16GB
‣ Dynamic masking: standard BERT uses the same MASK pattern for every epoch; RoBERTa recomputes it each epoch (sketched below)
‣ New training + more data = better performance


Liu et al. (2019)
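The difference is just where the corruption happens: once at preprocessing time (static) versus freshly every time an example is served (dynamic). A small self-contained sketch of the dynamic variant, with a simplified corrupt helper:

```python
import random

MASK = "[MASK]"
VOCAB = ["john", "visited", "madagascar", "yesterday", "of"]

def corrupt(tokens, mask_prob=0.15):
    """One freshly sampled 80/10/10 corruption of a token sequence."""
    out = []
    for tok in tokens:
        if random.random() < mask_prob:
            r = random.random()
            out.append(MASK if r < 0.8 else random.choice(VOCAB) if r < 0.9 else tok)
        else:
            out.append(tok)
    return out

class DynamicMaskingDataset:
    """Static masking would corrupt once up front; dynamic masking re-samples
    the corruption every time an example is drawn."""
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return corrupt(self.examples[idx])   # new mask pattern on every access
```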
GPT/GPT2/GPT3
OpenAI GPT/GPT2
‣ "ELMo with transformers" (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
‣ Because it's a language model, we can generate from it

Radford et al. (2019)
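For instance, with the HuggingFace transformers library and the publicly released gpt2 checkpoint, sampling a continuation looks roughly like this (a sketch; the prompt is arbitrary):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered"
inputs = tokenizer(prompt, return_tensors="pt")
# Autoregressive decoding: each new token is conditioned on everything before it.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```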


OpenAI GPT2

slide credit: OpenAI
GPT3

https://twitter.com/cocoweixu/status/1285727605568811011
Pre-Training Cost (with Google/AWS)
‣ BERT: Base $500, Large $7000

‣ Grover-MEGA: $25,000

‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)

‣ This is for a single pre-training run... developing new pre-training techniques may require many runs

‣ Fine-tuning these models can typically be done with a single GPU (but
may take 1-3 days for medium-sized datasets)

https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Pre-training Cost
And a lot more …

Analysis
What does BERT learn?

‣ Attention heads in the transformer learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.
Clark et al. (2019)
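Per-head attention maps of the kind Clark et al. analyze can be pulled out of a pretrained checkpoint directly; a sketch with HuggingFace transformers (the layer/head indices chosen here are arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("John visited Madagascar yesterday", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq).
attn = outputs.attentions[5][0, 3]        # layer 5, head 3 (arbitrary choice)
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
for i, tok in enumerate(tokens):
    print(tok, "-> attends most to:", tokens[attn[i].argmax().item()])
```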
What does BERT learn?

‣ Still way worse than what supervised systems can do, but interesting that this is learned organically

Clark et al. (2019)


Probing BERT
‣ Try to predict POS, etc. from each layer; learn mixing weights s_τ^(ℓ) over layers, so the representation of wordpiece i for task τ is h_i^τ = γ_τ Σ_ℓ s_τ^(ℓ) h_i^(ℓ)
‣ Plot shows the s weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT "rediscovers the classical NLP pipeline": first syntactic tasks, then semantic ones
Tenney et al. (2019)
Compressing BERT
‣ Remove 60+% of BERT's heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the layers and roughly 40% fewer parameters (via knowledge distillation)

Michel et al. (2019)
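Head removal can be tried directly via HuggingFace transformers' prune_heads, as sketched below; note that the choice of which heads to drop is a placeholder here, whereas Michel et al. select heads by estimated importance:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
before = sum(p.numel() for p in model.parameters())

# Placeholder choice: drop 6 of the 12 heads in every layer.
model.prune_heads({layer: [0, 1, 2, 3, 4, 5] for layer in range(12)})

after = sum(p.numel() for p in model.parameters())
print(f"parameters: {before / 1e6:.1f}M -> {after / 1e6:.1f}M")
```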


Open Questions
‣ BERT-based systems are state-of-the-art for nearly every major text analysis task
‣ These techniques are here to stay; unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come purely from academia
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though this cost should come down)
