
Contextual Word Representations with BERT and Other Pre-trained Language Models
Jacob Devlin
Google AI Language
History and Background
Pre-training in NLP
● Word embeddings are the basis of deep learning for NLP
○ e.g., king → [-0.5, -0.9, 1.4, …], queen → [-0.6, -0.8, -0.2, …]
● Word embeddings (word2vec, GloVe) are often pre-trained on a text corpus from co-occurrence statistics
○ [Diagram: inner product of "king" with the context "the king wore a crown", and of "queen" with "the queen wore a crown"]

Contextual Representations
● Problem: Word embeddings are applied in a context-free manner
○ "open a bank account" / "on the river bank" → bank = [0.3, 0.2, -0.8, …] in both contexts
● Solution: Train contextual representations on a text corpus
○ "open a bank account" → bank = [0.9, -0.2, 1.6, …]
○ "on the river bank" → bank = [-1.9, -0.4, 0.1, …]

History of Contextual Representations
● Semi-Supervised Sequence Learning, Google, 2015
○ [Diagram: train an LSTM language model ("<s> open a" → "open a bank"), then fine-tune it on a classification task ("very funny movie" → POSITIVE)]

History of Contextual Representations
● ELMo: Deep Contextual Word Embeddings, AI2 & University of Washington, 2017
○ [Diagram: train separate left-to-right and right-to-left LSTM LMs, then apply their states as "pre-trained embeddings" inside an existing model architecture]
History of Contextual Representations
● Improving Language Understanding by Generative Pre-Training, OpenAI, 2018
○ [Diagram: train a deep (12-layer) Transformer LM ("<s> open a" → "open a bank"), then fine-tune it on a classification task (→ POSITIVE)]
Model Architecture
Transformer encoder
● Multi-headed self-attention
○ Models context
● Feed-forward layers
○ Compute non-linear hierarchical features
● Layer norm and residuals
○ Make training deep networks healthy
● Positional embeddings
○ Allow the model to learn relative positioning
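A minimal sketch of one such encoder block, assuming PyTorch (the hyperparameter defaults mirror BERT-Base, but the class and variable names are illustrative, not BERT's released code):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden=768, heads=12, ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):  # x: [batch, seq_len, hidden]
        # Multi-headed self-attention models context; residual + layer norm keep deep training stable.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feed-forward computes non-linear features.
        return self.norm2(x + self.drop(self.ff(x)))
```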
Model Architecture
● Empirical advantages of Transformer vs. LSTM:
1. Self-attention == no locality bias
● Long-distance context has “equal opportunity”
2. Single multiplication per layer == efficiency on TPU
● Effective batch size is number of words, not sequences
[Diagram: each layer maps states X_0_0 … X_0_3 to X_1_0 … X_1_3 via a weight matrix W; the Transformer does this in one big matrix multiplication over all positions, while the LSTM does it sequentially, one position at a time]
BERT
Problem with Previous Methods
● Problem: Language models only use left context or
right context, but language understanding is
bidirectional.
● Why are LMs unidirectional?
● Reason 1: Directionality is needed to generate a
well-formed probability distribution.
○ We don’t care about this.
● Reason 2: Words can “see themselves” in a
bidirectional encoder.
Unidirectional vs. Bidirectional Models
● Unidirectional context: build the representation incrementally
● Bidirectional context: words can "see themselves"
[Diagram: two 2-layer models over "<s> open a" predicting "open a bank", illustrating left-only attention vs. full bidirectional attention]

Masked LM
● Solution: Mask out k% of the input words, and then predict the masked words
○ We always use k = 15%
○ Example: "the man went to the [MASK] to buy a [MASK] of milk" → predict "store" and "gallon"
● Too little masking: Too expensive to train
● Too much masking: Not enough context
Masked LM
● Problem: The [MASK] token is never seen at fine-tuning
● Solution: Still pick 15% of the words to predict, but don't replace them with [MASK] 100% of the time. Instead:
● 80% of the time, replace with [MASK]
went to the store → went to the [MASK]
● 10% of the time, replace with a random word
went to the store → went to the running
● 10% of the time, keep the same word
went to the store → went to the store
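A sketch of this 80/10/10 corruption rule in plain Python (the token strings and the small `vocab` list are placeholders):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Return corrupted tokens plus (position, original_token) prediction targets."""
    tokens, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:            # pick ~15% of positions to predict
            targets.append((i, tok))
            r = random.random()
            if r < 0.8:                            # 80%: replace with [MASK]
                tokens[i] = "[MASK]"
            elif r < 0.9:                          # 10%: replace with a random word
                tokens[i] = random.choice(vocab)
            # remaining 10%: keep the original word unchanged
    return tokens, targets

corrupted, targets = mask_tokens(
    "the man went to the store to buy a gallon of milk".split(),
    vocab=["running", "crown", "bank"],
)
```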
Next Sentence Prediction
● To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence
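A sketch of how such sentence pairs might be constructed, assuming the corpus is stored as lists of sentences per document (the 50/50 split and the IsNext/NotNext labels follow the description above; the helper name and data layout are illustrative):

```python
import random

def make_nsp_example(doc, docs):
    """doc: one document as a list of sentences (>= 2); docs: the whole corpus."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"                 # the actual next sentence
    else:
        other = random.choice(docs)                          # simplified: may pick the same doc
        sent_b, label = random.choice(other), "NotNext"      # a random sentence
    return sent_a, sent_b, label
```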
Input Representation
● Use a 30,000-token WordPiece vocabulary on input.
● Each token is the sum of three embeddings (token, segment, and position).
● Encoding the sentence pair as a single packed sequence is much more efficient.
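A sketch of that embedding sum, assuming PyTorch (sizes follow the 30,000-WordPiece vocabulary above and BERT-Base's 768 hidden units; the class name is illustrative):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30_000, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)    # WordPiece token embedding
        self.seg = nn.Embedding(num_segments, hidden)  # sentence A vs. sentence B
        self.pos = nn.Embedding(max_len, hidden)       # learned position embedding

    def forward(self, token_ids, segment_ids):         # both: [batch, seq_len]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The real model also applies layer norm and dropout to this sum.
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```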
Model Details
● Data: Wikipedia (2.5B words) + BookCorpus (800M words)
● Batch Size: 131,072 words (1024 sequences * 128 length or 256 sequences * 512 length)
● Training Time: 1M steps (~40 epochs)
● Optimizer: AdamW, 1e-4 learning rate, linear decay
● BERT-Base: 12-layer, 768-hidden, 12-head
● BERT-Large: 24-layer, 1024-hidden, 16-head
● Trained on 4x4 or 8x8 TPU slice for 4 days
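The same setup collected as a plain-Python configuration sketch (values copied from the bullets above; this is not the released configuration file):

```python
# Pre-training configuration as listed on this slide (a sketch, not the released config).
BERT_BASE  = dict(num_layers=12, hidden_size=768,  num_heads=12)
BERT_LARGE = dict(num_layers=24, hidden_size=1024, num_heads=16)

PRETRAINING = dict(
    data="Wikipedia (2.5B words) + BookCorpus (800M words)",
    batch_words=131_072,        # 1024 sequences * 128 tokens, or 256 sequences * 512 tokens
    train_steps=1_000_000,      # roughly 40 epochs over the ~3.3B-word corpus
    optimizer="AdamW",
    learning_rate=1e-4,
    lr_schedule="linear decay",
)
```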
Fine-Tuning Procedure
GLUE Results
● MultiNLI
  Premise: Hills and mountains are especially sanctified in Jainism.
  Hypothesis: Jainism hates nature.
  Label: Contradiction
● CoLA
  Sentence: The wagon rumbled down the road. → Label: Acceptable
  Sentence: The car honked down the road. → Label: Unacceptable
SQuAD 2.0
● Use token 0 ([CLS]) to emit a logit for "no answer".
● "No answer" directly competes with the answer span.
● Threshold is optimized on the dev set.
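A sketch of that decision rule (the span search and thresholding details here are illustrative, not the exact evaluation script):

```python
def squad2_predict(start_logits, end_logits, threshold, max_answer_len=30):
    """start_logits / end_logits: one score per token; position 0 is [CLS]."""
    # Score of predicting "no answer": start and end both point at [CLS] (token 0).
    null_score = start_logits[0] + end_logits[0]
    # Best non-null span (start <= end, length-bounded).
    best_score, best_span = float("-inf"), None
    for s in range(1, len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    # Answer only if the best span beats "no answer" by a margin tuned on the dev set.
    return best_span if best_score > null_score + threshold else None
```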
Effect of Pre-training Task
● Masked LM (compared to left-to-right LM) is very important on some tasks; Next Sentence Prediction is important on other tasks.
● The left-to-right model does very poorly on word-level tasks (SQuAD), although this is mitigated by adding a BiLSTM on top.
Effect of Directionality and Training Time
● Masked LM takes slightly longer to converge because we only predict 15% instead of 100%
● But absolute results are much better almost immediately
Effect of Model Size
● Big models help a lot
● Going from 110M -> 340M params helps even on datasets with 3,600 labeled examples
● Improvements have not asymptoted
Open Source Release
● One reason for BERT’s success was the open
source release
○ Minimal release (not part of a larger codebase)
○ No dependencies but TensorFlow (or PyTorch)
○ Abstracted so people could include a single file to use the model
○ End-to-end push-button examples to train SOTA models
○ Thorough README
○ Idiomatic code
○ Well-documented code
○ Good support (for the first few months)
Post-BERT Pre-training Advancements
RoBERTa
● RoBERTa: A Robustly Optimized BERT Pretraining
Approach (Liu et al, University of Washington and
Facebook, 2019)
● Trained BERT for more epochs and/or on more data
○ Showed that more epochs alone helps, even on same data
○ More data also helps
● Improved masking and pre-training data slightly
XLNet
● XLNet: Generalized Autoregressive Pretraining for
Language Understanding (Yang et al, CMU and
Google, 2019)
● Innovation #1: Relative position embeddings
○ Sentence: John ate a hot dog
○ Absolute attention: “How much should dog attend to hot (in any position), and how much should dog in position 4 attend to the word in position 3? (Or 508 attend to 507, …)”
○ Relative attention: “How much should dog attend to hot (in any
position) and how much should dog attend to the previous word?”
XLNet
● Innovation #2: Permutation Language Modeling
○ In a left-to-right language model, every word is predicted based on
all of the words to its left
○ Instead: Randomly permute the order for every training sentence
○ Equivalent to masking, but many more predictions per sentence
○ Can be done efficiently with Transformers
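A toy sketch of the factorization-order bookkeeping (this ignores the two-stream attention mechanism XLNet actually uses to implement it efficiently; the function name is illustrative):

```python
import random

def permutation_lm_examples(tokens):
    """List (target_token, visible_context) pairs under one random factorization order."""
    order = list(range(len(tokens)))
    random.shuffle(order)                      # a new permutation for every training sentence
    examples, seen = [], set()
    for idx in order:
        # The token at position idx is predicted from tokens earlier in the *permuted* order;
        # all tokens keep their original positions, only the prediction order changes.
        context = {i: tokens[i] for i in sorted(seen)}
        examples.append((tokens[idx], context))
        seen.add(idx)
    return examples

print(permutation_lm_examples("John ate a hot dog".split()))
```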
XLNet
● Also used more data and bigger models, but
showed that innovations improved on BERT even
with same data and model size
● XLNet results:
ALBERT
● ALBERT: A Lite BERT for Self-supervised Learning
of Language Representations (Lan et al, Google
and TTI Chicago, 2019)
● Innovation #1: Factorized embedding
parameterization
○ Use small embedding size (e.g., 128) and then project it to
Transformer hidden size (e.g., 1024) with parameter matrix

○ Embedding parameter matrix: [100k ⨉ 1024] vs. [100k ⨉ 128] ⨉ [128 ⨉ 1024]
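The parameter saving in numbers, using the sizes from the slide (a back-of-the-envelope sketch):

```python
vocab_size, embed_size, hidden_size = 100_000, 128, 1024

full       = vocab_size * hidden_size                            # 102,400,000 parameters
factorized = vocab_size * embed_size + embed_size * hidden_size  # 12,931,072 parameters

print(full / factorized)   # roughly an 8x reduction in embedding parameters
```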
ALBERT
● Innovation #2: Cross-layer parameter sharing
○ Share all parameters between Transformer layers
● Results:

● ALBERT is light in terms of parameters, not speed


T5
● Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer (Raffel et al,
Google, 2019)
● Ablated many aspects of pre-training:
○ Model size
○ Amount of training data
○ Domain/cleanness of training data
○ Pre-training objective details (e.g., span length of masked text)
○ Ensembling
○ Finetuning recipe (e.g., only allowing certain layers to finetune)
○ Multi-task training
T5
● Conclusions:
○ Scaling up model size and amount of training data helps a lot
○ Best model is 11B parameters (BERT-Large is 340M), trained on 120B words of cleaned Common Crawl text
○ Exact masking/corruption strategy doesn’t matter that much
○ Mostly negative results for better finetuning and multi-task strategies
● T5 results:
ELECTRA
● ELECTRA: Pre-training Text Encoders as
Discriminators Rather Than Generators (Clark et al,
2020)
● Train model to discriminate locally plausible text
from real text
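A sketch of the resulting replaced-token-detection data, assuming the corruptions come from a small generator model as in the ELECTRA paper (the `generator_sample` helper is a placeholder, not part of the released code):

```python
import random

def make_electra_example(tokens, generator_sample, corrupt_rate=0.15):
    """Corrupt some positions with generator proposals; labels mark which tokens were replaced."""
    corrupted, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < corrupt_rate:
            proposal = generator_sample(tokens, i)     # a plausible replacement for position i
            corrupted[i] = proposal
            labels.append(int(proposal != tok))        # 1 = replaced, 0 = kept the original
        else:
            labels.append(0)
    # The discriminator is trained to predict `labels` for every token of `corrupted`.
    return corrupted, labels
```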
ELECTRA
● Difficult to match SOTA results with less compute
Distillation
Applying Models to Production Services
● BERT and other pre-trained language models are
extremely large and expensive
● How are companies applying them to low-latency
production services?
Distillation
● Answer: Distillation (a.k.a., model compression)
● Idea has been around for a long time:
○ Model Compression (Bucila et al, 2006)
○ Distilling the Knowledge in a Neural Network (Hinton et al, 2015)
● Simple technique:
○ Train “Teacher”: Use SOTA pre-training + fine-tuning technique to
train model with maximum accuracy
○ Label a large amount of unlabeled input examples with Teacher
○ Train “Student”: Much smaller model (e.g., 50x smaller) which is
trained to mimic Teacher output
○ Student objective is typically Mean Square Error or Cross Entropy
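A sketch of the student's training step under this recipe, assuming PyTorch (the `student` and `teacher` models and the batch format are placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer):
    """One training step: the student mimics the teacher's output on (unlabeled) inputs."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(batch), dim=-1)   # soft labels from the fine-tuned teacher
    student_logits = student(batch)
    # Cross-entropy against the teacher's soft labels; MSE on logits is the other common choice.
    loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```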
Distillation
● Example distillation results
○ 50k labeled examples, 8M unlabeled examples
○ Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (Turc et al, 2020)
● Distillation works much better than pre-training + fine-tuning with a smaller model
Distillation
● Why does distillation work so well? A hypothesis:
○ Language modeling is the “ultimate” NLP task in many ways
■ I.e., a perfect language model is also a perfect question
answering/entailment/sentiment analysis model
○ Training a massive language model learns millions of latent features
which are useful for these other NLP tasks
○ Finetuning mostly just picks up and tweaks these existing latent
features
○ This requires an oversized model, because only a subset of the
features are useful for any given task
○ Distillation allows the model to only focus on those features
○ Supporting evidence: Simple self-distillation (distilling a smaller BERT
model) doesn’t work
Conclusions
Conclusions
● Pre-trained bidirectional language models work
incredibly well
● However, the models are extremely expensive
● Improvements (unfortunately) seem to mostly
come from even more expensive models and more
data
● The inference/serving problem is mostly “solved”
through distillation
