Lec14 Pretraining
Wei Xu
(many slides from Greg Durrett)
Pretraining / ELMo
Recall: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls
‣ Recommendations:
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc. (see the probe sketch below)
‣ Higher accuracy => ELMo is capturing that thing more nicely
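As an illustration (not from the slides), such a probe is typically just a small classifier trained on frozen vectors from one layer; the probe architecture, tag set size, and tensor shapes below are assumptions for a minimal PyTorch sketch.

```python
# Minimal probing sketch: a linear classifier on frozen contextual vectors
# from a single ELMo layer, predicting POS tags (names and sizes assumed).
import torch
import torch.nn as nn

HIDDEN_DIM = 1024   # dimension of one ELMo layer (assumption)
NUM_TAGS = 45       # e.g., a Penn Treebank-style POS tag set (assumption)

probe = nn.Linear(HIDDEN_DIM, NUM_TAGS)              # deliberately simple probe
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(layer_reprs, gold_tags):
    """layer_reprs: [num_tokens, HIDDEN_DIM] frozen vectors from one layer;
    gold_tags: [num_tokens] integer POS labels."""
    logits = probe(layer_reprs.detach())             # detach: ELMo itself is never updated
    loss = loss_fn(logits, gold_tags)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Higher held-out accuracy for a probe trained on layer k is then read as layer k encoding more of that information.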
BERT
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, and BERT came out in October 2018
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> Masked LM objective instead of standard LM (sketch below)
‣ Fine-tune instead of freeze at test time
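As a rough sketch of what the masked LM objective means in practice (illustrative Python, not the actual BERT training code): corrupt a random 15% of the input tokens and compute the LM loss only at those positions, using the 80/10/10 mask/random/keep split from Devlin et al. (2019).

```python
# Sketch of BERT-style input corruption for the masked LM objective.
# token_ids is a 1-D torch.LongTensor; mask_token_id and vocab_size come from
# whatever tokenizer is used (both assumed here).
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Return (inputs, labels); labels are -100 at positions that are not predicted."""
    inputs = token_ids.clone()
    labels = token_ids.clone()
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = mask_token_id
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # else 10%: keep the original token but still predict it
        else:
            labels[i] = -100                 # loss is computed only at corrupted positions
    return inputs, labels
```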
BERT
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
‣ ELMo representations look at each direction in isolation; BERT looks at them jointly (see the attention-mask sketch below)
[Figure: for the sentence “A stunning ballet dancer, Copeland is one of the best performers to see live.”, ELMo builds its representation of “performer” from each direction separately, while BERT’s jointly contextualized representation reflects “ballet dancer/performer”.]
Devlin et al. (2019)
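One way to see the “in isolation vs. jointly” contrast is in the attention mask a transformer uses; a toy sketch (illustrative, not from the slides):

```python
# Toy sketch: a left-to-right LM restricts self-attention with a causal mask,
# whereas BERT lets every position attend to every other position and instead
# corrupts the inputs with [MASK] tokens.
import torch

seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # LM: position i attends only to positions <= i
bert_mask = torch.ones(seq_len, seq_len).bool()                 # BERT: every position attends everywhere
```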
BERT
‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?
[Figure: ELMo-style language modeling vs. BERT. Left: a stack of transformer layers trained as a standard LM predicts “visited Madag. yesterday …”. Right: BERT reads the full corrupted input “[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna.” through stacked transformer layers and predicts the original tokens at the corrupted positions.]
Devlin et al. (2019)
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets pre-trained on a large corpus (a config sketch follows below)
Devlin et al. (2019)
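The numbers above can be written down as a configuration; a sketch using HuggingFace-style names (an assumption; the slide only lists layers, dimensions, heads, and vocabulary size):

```python
# BERT Base hyperparameters expressed as a HuggingFace-style config (assumed API).
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,             # ~30k wordpieces
    hidden_size=768,              # per-wordpiece dimension (Base)
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,  # learned positional embeddings
    type_vocab_size=2,            # segment (sentence A / sentence B) embeddings
)
model = BertModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```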
What can BERT do?
[Figure: stacked transformer layers over the input sequence]
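As one concrete example (a sketch assuming the HuggingFace transformers library, which the slides do not reference): attach a classification head on top of the [CLS] representation and fine-tune the whole model on a labeled dataset.

```python
# Sketch: single-sentence classification with a pretrained BERT ([CLS] head).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # new, randomly initialized head on [CLS]

batch = tokenizer(
    ["A stunning ballet dancer, Copeland is one of the best performers to see live."],
    return_tensors="pt", padding=True, truncation=True)
logits = model(**batch).logits           # shape [batch_size, num_labels]
# During fine-tuning, all of BERT's weights are updated, not just the new head.
```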
slide credit: OpenAI
GPT3
https://twitter.com/cocoweixu/status/1285727605568811011
Pre-Training Cost (with Google/AWS)
‣ BERT: Base $500, Large $7000
‣ Grover-MEGA: $25,000
‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Pre-training Cost
And a lot more …
Analysis
What does BERT learn?
‣ Still way worse than what supervised systems can do, but interesting that this is learned organically
[Figure label: representation of wordpiece i for task τ, a learned mix over BERT’s layers]
‣ Plot shows s weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
Tenney et al. (2019)
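The s weights in the plot come from a learned, per-task scalar mix over the layers (in the style of ELMo); a sketch with illustrative names:

```python
# Sketch of a per-task scalar mix: the representation of wordpiece i for task τ
# is a softmax-weighted sum of that wordpiece's vectors from every layer.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.raw_weights = nn.Parameter(torch.zeros(num_layers))  # softmaxed into the s weights
        self.gamma = nn.Parameter(torch.ones(()))                 # overall scale

    def forward(self, layer_reprs):
        # layer_reprs: [num_layers, num_tokens, hidden_dim]
        s = torch.softmax(self.raw_weights, dim=0)     # how much each layer contributes
        mixed = (s[:, None, None] * layer_reprs).sum(dim=0)
        return self.gamma * mixed
```

Plotting the learned s weights per task is what lets Tenney et al. read off which layers matter for which tasks.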
Compressing BERT
‣ Remove 60+% of BERT’s heads with minimal drop in performance
‣ These techniques are here to stay, unclear what form will win out
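A sketch of one way to do this on a pretrained model (assuming the HuggingFace prune_heads API; the specific heads below are arbitrary placeholders, whereas in practice heads are chosen by importance scores):

```python
# Sketch: removing attention heads from a pretrained BERT, then re-checking
# downstream accuracy to see how much (if any) performance is lost.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.prune_heads({0: [0, 1, 2, 3], 5: [2, 4, 6], 11: [7]})  # {layer index: [head indices]}
# After pruning, fine-tune or evaluate on the target task to measure the drop.
```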