BERT
Jacob Devlin
Google AI Language
History and Background
Pre-training in NLP
● Word embeddings are the basis of deep learning for NLP
[Examples: pre-trained word embeddings for “king” and “queen”; contextual representation of “open a bank”]
History of Contextual Representations
● Improving Language Understanding by Generative Pre-Training, OpenAI, 2018
[Diagram: left-to-right Transformer LM predicting the next word from the prefix “<s> open a”]
Model Architecture
Transformer encoder (a minimal code sketch follows the list below)
● Multi-headed self attention
○ Models context
● Feed-forward layers
○ Computes non-linear hierarchical features
● Layer norm and residuals
○ Makes training deep networks healthy
● Positional embeddings
○ Allows model to learn relative positioning
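To make these components concrete, here is a minimal NumPy sketch of one encoder block. This is our own illustration, not code from the talk: single-head attention, toy dimensions, and ReLU in place of BERT's GELU; names such as encoder_block and self_attention are ours.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Layer norm keeps activations well-scaled so deep stacks train stably.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # Every position attends to every position ("models context").
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_block(x, p):
    # Sublayer 1: self-attention (single head here), residual + layer norm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sublayer 2: position-wise feed-forward (non-linear features), residual + layer norm.
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])     # ReLU here; BERT uses GELU
    return layer_norm(x + h @ p["W2"] + p["b2"])

# Toy usage: 4 tokens, hidden size 8, feed-forward size 16.
rng = np.random.default_rng(0)
d, d_ff, n = 8, 16, 4
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
          "W1": (d, d_ff), "b1": (d_ff,), "W2": (d_ff, d), "b2": (d,)}
p = {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}
tokens = rng.normal(size=(n, d))
positions = rng.normal(size=(n, d))   # positional embeddings are added to token embeddings
print(encoder_block(tokens + positions, p).shape)   # (4, 8)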
Model Architecture
● Empirical advantages of Transformer vs. LSTM:
1. Self-attention == no locality bias
● Long-distance context has “equal opportunity”
2. Single multiplication per layer == efficiency on TPU
● Effective batch size is number of words, not sequences
[Diagram: Transformer vs. LSTM; the Transformer multiplies all position vectors X_1_0 … X_1_3 by W in a single matrix multiplication, while the LSTM applies W one position at a time; see the code sketch below]
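A NumPy sketch of point 2 (our own illustration, with a deliberately simplified recurrent cell rather than a real LSTM): the Transformer-style layer touches every word of every sequence with one matrix multiplication, so the hardware's effective batch is batch_size * seq_len words, while a recurrence must loop over time steps.

import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d = 32, 128, 64
x = rng.normal(size=(batch, seq_len, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)

# Transformer-style layer: one multiply over all 32 * 128 = 4096 word vectors at once.
transformer_out = (x.reshape(-1, d) @ W).reshape(batch, seq_len, d)

# Recurrent (LSTM-style) layer: h_t depends on h_{t-1}, so the 128 time steps
# must run one after another, each with a much smaller multiply.
h = np.zeros((batch, d))
outputs = []
for t in range(seq_len):
    h = np.tanh(x[:, t, :] @ W + h @ W)   # simplified cell, not a full LSTM
    outputs.append(h)
recurrent_out = np.stack(outputs, axis=1)  # (32, 128, 64)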
BERT
Problem with Previous Methods
● Problem: Language models only use left context or right context, but language understanding is bidirectional.
● Why are LMs unidirectional?
● Reason 1: Directionality is needed to generate a well-formed probability distribution.
○ We don’t care about this.
● Reason 2: Words can “see themselves” in a bidirectional encoder (illustrated with the mask sketch below).
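Reason 2 can be illustrated with attention masks (a sketch of ours, not code from the talk): a left-to-right LM applies a causal mask so position t never attends to later positions, while a bidirectional encoder leaves attention unmasked, so a naive bidirectional LM would let each word attend to, and trivially copy, itself; BERT instead masks the words to be predicted out of the input.

import numpy as np

def attention_weights(scores, mask=None):
    # mask[i, j] = True means position i is allowed to attend to position j.
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # blocked pairs get ~zero weight
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))

# Unidirectional LM: causal (lower-triangular) mask, word t sees only words <= t.
causal = np.tril(np.ones((n, n), dtype=bool))
print(np.round(attention_weights(scores, causal), 2))   # upper triangle is 0

# Bidirectional encoder: no mask, every word sees every word, itself included.
print(np.round(attention_weights(scores), 2))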
Unidirectional vs. Bidirectional Models
[Diagram: unidirectional vs. bidirectional context; example masked-word predictions “store” and “gallon”]
MultiNLI
Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction

CoLA
Sentence: The wagon rumbled down the road.
Label: Acceptable
Sentence: The car honked down the road.
Label: Unacceptable
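For fine-tuning, examples like the two above are packed into a single input sequence with [CLS] and [SEP] markers and segment ids, and the label is predicted from the final [CLS] vector. A minimal sketch of that packing, with whitespace splitting standing in for BERT's WordPiece tokenizer:

def pack(sent_a, sent_b=None):
    # [CLS] sentence A [SEP] (sentence B [SEP]); segment id 0 for A, 1 for B.
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if sent_b is not None:
        b_tokens = sent_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)
    return tokens, segment_ids

# MultiNLI: premise + hypothesis pair, 3-way label (entailment / neutral / contradiction).
print(pack("Hills and mountains are especially sanctified in Jainism .",
           "Jainism hates nature ."))

# CoLA: single sentence, acceptable vs. unacceptable.
print(pack("The wagon rumbled down the road ."))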
SQuAD 2.0