Lec14 Pretraining


Pretraining Language Models

Wei Xu
(many slides from Greg Durrett)
Pretraining / ELMo

Recall: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls

they dance at balls  /  they hit the balls


‣ Train a neural language model to predict the next word given the previous words in the sentence; use its internal representations as word vectors

Peters et al. (2018)


ELMo
‣ Char CNN over each word => LSTM language model predicts the next word
[Figure: the representation of "visited" comes from the LSTM hidden states (plus vectors from the backward LM); 4096-dim LSTMs with 512-dim projections; 2048 CNN filters projected down to 512-dim; a char CNN is applied to each input word: John visited Madagascar yesterday]


Peters et al. (2018)
How to apply ELMo?
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings: update the weights of your network but keep ELMo's parameters frozen
‣ Fine-tuning: backpropagate all the way into ELMo when training your model
[Figure: ELMo vectors for "they dance at balls" feed into some neural network, which produces task predictions (sentiment, etc.)]
Peters, Ruder, Smith (2019)
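As a rough illustration of the two regimes (not ELMo's actual API), here is a minimal PyTorch sketch in which a plain nn.LSTM stands in for the pretrained ELMo encoder; only the freeze_encoder flag differs between the two setups:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained contextual encoder (the real ELMo would be
# loaded from released weights, e.g., via AllenNLP).
pretrained_elmo = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)

class TaskModel(nn.Module):
    def __init__(self, encoder, num_classes, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            # Frozen embeddings: ELMo's parameters receive no gradient updates.
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, embedded_tokens):
        outputs, _ = self.encoder(embedded_tokens)   # contextual vectors
        return self.classifier(outputs.mean(dim=1))  # pool and classify

# freeze_encoder=True  -> frozen regime (only the task network trains)
# freeze_encoder=False -> fine-tuning regime (backprop into ELMo as well)
model = TaskModel(pretrained_elmo, num_classes=2, freeze_encoder=True)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

dummy_batch = torch.randn(2, 6, 512)   # (batch, seq_len, embedding dim)
logits = model(dummy_batch)            # gradients stop at the frozen encoder
```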
Results: Frozen ELMo

‣ Massive improvements across six benchmark datasets: question answering, natural language inference, semantic role labeling (discussed later in the course), coreference resolution, named entity recognition, and sentiment analysis
How to apply ELMo?

‣ How does frozen (❄) vs. fine-tuned (🔥) compare?

‣ Recommendations:

Peters, Ruder, Smith (2019)


Why is language modeling a good objective?
‣ "Impossible" problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
‣ LAMBADA dataset (Paperno et al., 2016): explicitly targets world knowledge and very challenging LM examples
‣ Coreference, Winograd schemas, and much more
Why is language modeling a good objective?

Zhang and Bowman (2018)


Why did this take time to catch on?
‣ An earlier version of ELMo by the same authors came out in 2017, but it was only evaluated on tagging tasks, and gains were 1% or less
‣ Required: training on lots of data, having the right architecture, and significant hyperparameter tuning
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc.
‣ Higher accuracy => ELMo is capturing that phenomenon better

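Concretely, a probe of this kind is usually just a small classifier trained on frozen per-token representations from one layer; the sketch below uses random placeholder tensors rather than real ELMo outputs and a hypothetical tag set size:

```python
import torch
import torch.nn as nn

# layer_reprs: frozen ELMo vectors from one layer, shape (num_tokens, 1024)
# pos_labels:  gold POS tag IDs for the same tokens, shape (num_tokens,)
layer_reprs = torch.randn(1000, 1024)        # placeholder data
pos_labels = torch.randint(0, 45, (1000,))   # e.g., 45 Penn Treebank tags

probe = nn.Linear(1024, 45)                  # only the probe trains; ELMo stays frozen
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    loss = loss_fn(probe(layer_reprs), pos_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

accuracy = (probe(layer_reprs).argmax(-1) == pos_labels).float().mean()
# Higher probe accuracy on a layer => that layer encodes more POS information.
```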
BERT
BERT
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT
came out October 2018
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
BERT
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
‣ ELMo representations look at each direction in isolation; BERT looks at them jointly
[Example: "A stunning ballet dancer, Copeland is one of the best performers to see live." One ELMo direction captures "ballet dancer", the other captures "performer"; BERT captures "ballet dancer/performer" jointly]
Devlin et al. (2019)
BERT
‣ How to learn a "deeply bidirectional" model? What happens if we just replace an LSTM with a transformer?
[Figure: ELMo (language modeling) predicts "visited Madag. yesterday ..." left-to-right over the input "John visited Madagascar yesterday"; the BERT column shows predictions at every position with bidirectional context over the same input]
‣ Transformer LMs have to be "one-sided" (only attend to previous tokens), which is not what we want
Masked Language Modeling
‣ How to prevent cheating? Next-word prediction fundamentally doesn't work for bidirectional models; instead, do masked language modeling
‣ BERT formula: take a chunk of text, predict 15% of the tokens (here the target is Madagascar)
‣ For 80% (of the 15%), replace the input token with [MASK]:  John visited [MASK] yesterday
‣ For 10%, replace with a random token:  John visited of yesterday
‣ For 10%, keep it the same:  John visited Madagascar yesterday


Devlin et al. (2019)
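A minimal sketch of this 15% / 80-10-10 corruption recipe (illustrative only; real preprocessing operates on wordpieces and computes the loss only at the selected positions):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (corrupted tokens, prediction targets) following the BERT recipe."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:          # select ~15% of tokens to predict
            targets.append(tok)
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs.append("[MASK]")
            elif r < 0.9:                        # 10%: replace with a random token
                inputs.append(random.choice(vocab))
            else:                                # 10%: keep the original token
                inputs.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)                 # not predicted; no loss here
    return inputs, targets

vocab = ["john", "visited", "madagascar", "yesterday", "of", "the"]
print(mask_tokens("john visited madagascar yesterday".split(), vocab))
```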
Next "Sentence" Prediction
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the "true" next one
‣ BERT objective: masked LM + next sentence prediction
[Figure: the input "[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna." passes through the transformer; the model predicts NotNext at [CLS] and Madagascar / enjoyed / like at the corrupted positions]
Devlin et al. (2019)
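A rough sketch of how such training pairs might be constructed (simplified; a real implementation works over wordpiece-tokenized, length-limited chunks and avoids sampling the true next chunk as the "random" one):

```python
import random

def make_nsp_example(chunks, i):
    """Build one [CLS] A [SEP] B example from a list of consecutive text chunks."""
    chunk_a = chunks[i]
    if random.random() < 0.5 and i + 1 < len(chunks):
        chunk_b, label = chunks[i + 1], "IsNext"           # true next chunk
    else:
        chunk_b, label = random.choice(chunks), "NotNext"  # random other chunk
    return f"[CLS] {chunk_a} [SEP] {chunk_b}", label

chunks = ["John visited Madagascar yesterday and really enjoyed it .",
          "He said the beaches were stunning .",
          "I like Madonna ."]
print(make_nsp_example(chunks, 0))
```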
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets pre-trained on a large corpus
Devlin et al. (2019)
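These sizes are easy to sanity-check if the HuggingFace transformers library and the standard bert-base-uncased / bert-large-uncased checkpoints are available (a sketch, not part of the original slides):

```python
from transformers import AutoModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: {cfg.num_hidden_layers} layers, {cfg.hidden_size}-dim, "
          f"{cfg.num_attention_heads} heads, {n_params / 1e6:.0f}M parameters")
# Expected output is roughly 110M and 340M parameters, matching the slide.
```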
What can BERT do?

‣ The [CLS] token is used to provide classification decisions
‣ Sentence pair tasks (entailment): feed both sentences into BERT
‣ BERT can also do tagging by predicting tags at each word piece
Devlin et al. (2019)
What can BERT do?
[Figure: the input "[CLS] A boy plays in the snow [SEP] A boy is outside" passes through the transformer; the model predicts Entails at [CLS]]
‣ How does BERT model this sentence-pair setting?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn't really cause this to happen
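For example, with the HuggingFace transformers library a sentence-pair classifier over the [CLS] representation looks roughly like the sketch below; the classification head is randomly initialized until fine-tuned, so the logits are meaningless before training:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # e.g., entailment / neutral / contradiction

# The tokenizer builds "[CLS] sentence A [SEP] sentence B [SEP]" automatically.
inputs = tokenizer("A boy plays in the snow", "A boy is outside",
                   return_tensors="pt")
logits = model(**inputs).logits          # classification decision from [CLS]
```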
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ Not an autoregressive model; you can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat
‣ Masked language models are intended to be used primarily for "analysis" tasks

Lewis et al. (2019)
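The "append a [MASK] and fill it in" trick can be sketched with a masked-LM head as below (HuggingFace transformers assumed); as the slide says, this is not what BERT is for, and the output is usually degenerate:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "John visited"
with torch.no_grad():
    for _ in range(5):                    # greedily append 5 tokens
        inputs = tokenizer(text + " " + tokenizer.mask_token, return_tensors="pt")
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        logits = model(**inputs).logits
        next_id = logits[0, mask_pos].argmax().item()
        text = text + " " + tokenizer.decode([next_id]).strip()
print(text)   # usually degenerate output: BERT was not trained to generate
```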


Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5
‣ Large changes to weights up here (particularly in the last layer to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean weights don't change much
‣ More complex "triangular learning rate" schemes exist
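In code, this recipe amounts to roughly the following (a sketch assuming HuggingFace transformers; the single toy batch stands in for a real DataLoader):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Small learning rate (2e-5 to 5e-5) and few epochs keep weight changes modest.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# A single toy batch stands in for a real dataset here.
batch = tokenizer(["the movie was great", "the movie was terrible"],
                  return_tensors="pt", padding=True)
batch["labels"] = torch.tensor([1, 0])

for epoch in range(3):                 # typically 1-3 epochs
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```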
Fine-tuning BERT

‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo

Peters, Ruder, Smith (2019)


Evaluation: GLUE

Wang et al. (2019)


Results

‣ Huge improvements over prior work (even compared to ELMo)

‣ Effective at "sentence pair" tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
Devlin et al. (2018)
RoBERTa
‣ "Robustly optimized BERT"
‣ 160GB of data instead of 16GB
‣ Dynamic masking: standard BERT uses the same MASK pattern for every epoch; RoBERTa recomputes it each epoch (sketched below)
‣ New training + more data = better performance


Liu et al. (2019)
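The difference is just where the corruption happens: once at preprocessing time (static) versus freshly every time an example is served (dynamic). A small self-contained sketch of the dynamic variant, with a simplified corrupt helper:

```python
import random

MASK = "[MASK]"
VOCAB = ["john", "visited", "madagascar", "yesterday", "of"]

def corrupt(tokens, mask_prob=0.15):
    """One freshly sampled 80/10/10 corruption of a token sequence."""
    out = []
    for tok in tokens:
        if random.random() < mask_prob:
            r = random.random()
            out.append(MASK if r < 0.8 else random.choice(VOCAB) if r < 0.9 else tok)
        else:
            out.append(tok)
    return out

class DynamicMaskingDataset:
    """Static masking would corrupt once up front; dynamic masking re-samples
    the corruption every time an example is drawn."""
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return corrupt(self.examples[idx])   # new mask pattern on every access
```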
GPT/GPT2/GPT3
OpenAI GPT/GPT2
‣ "ELMo with transformers" (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
‣ Because it's a language model, we can generate from it

Radford et al. (2019)
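For instance, with the HuggingFace transformers library and the publicly released gpt2 checkpoint, sampling a continuation looks roughly like this (a sketch; the prompt is arbitrary):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered"
inputs = tokenizer(prompt, return_tensors="pt")
# Autoregressive decoding: each new token is conditioned on everything before it.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```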


OpenAI GPT2

slide credit: OpenAI
GPT3

https://twitter.com/cocoweixu/status/1285727605568811011
Pre-Training Cost (with Google/AWS)
‣ BERT: Base $500, Large $7000

‣ Grover-MEGA: $25,000

‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)

‣ This is for a single pre-training run... developing new pre-training techniques may require many runs

‣ Fine-tuning these models can typically be done with a single GPU (but
may take 1-3 days for medium-sized datasets)

https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Pre-training Cost
And a lot more …

Analysis
What does BERT learn?

‣ Attention heads in the transformer learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.
Clark et al. (2019)
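Per-head attention maps of the kind Clark et al. analyze can be pulled out of a pretrained checkpoint directly; a sketch with HuggingFace transformers (the layer/head indices chosen here are arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("John visited Madagascar yesterday", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq).
attn = outputs.attentions[5][0, 3]        # layer 5, head 3 (arbitrary choice)
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
for i, tok in enumerate(tokens):
    print(tok, "-> attends most to:", tokens[attn[i].argmax().item()])
```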
What does BERT learn?

‣ Still way worse than what supervised systems can do, but interesting that this is learned organically

Clark et al. (2019)


Probing BERT
‣ Try to predict POS, etc. from each layer; learn mixing weights s_τ^(ℓ) over layers, so the representation of wordpiece i for task τ is h_i^τ = γ_τ Σ_ℓ s_τ^(ℓ) h_i^(ℓ)
‣ Plot shows the s weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT "rediscovers the classical NLP pipeline": first syntactic tasks, then semantic ones
Tenney et al. (2019)
Compressing BERT
‣ Remove 60+% of BERT's heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the layers and roughly 40% fewer parameters (via knowledge distillation)

Michel et al. (2019)
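Head removal can be tried directly via HuggingFace transformers' prune_heads, as sketched below; note that the choice of which heads to drop is a placeholder here, whereas Michel et al. select heads by estimated importance:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
before = sum(p.numel() for p in model.parameters())

# Placeholder choice: drop 6 of the 12 heads in every layer.
model.prune_heads({layer: [0, 1, 2, 3, 4, 5] for layer in range(12)})

after = sum(p.numel() for p in model.parameters())
print(f"parameters: {before / 1e6:.1f}M -> {after / 1e6:.1f}M")
```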


Open Questions
‣ BERT-based systems are state-of-the-art for nearly every major text analysis task
‣ These techniques are here to stay; unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come purely from academia
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though this cost should come down)
