L3 Transformer and PLMs
THUNLP
0
Content
• Transformer
• Attention Mechanism
• Transformer Structure
• Pretrained Language Models
• Language Modeling
• Pre-trained Language Models (PLMs)
• Fine-tuning Approaches
• PLMs after BERT
• Applications of Masked LM
• Frontiers of PLMs
• Transformers Tutorial
• Introduction
• Frequently-used APIs
• Quick Start
• Demo
1
Transformer
Yufei Huang
[email protected]
THUNLP
2
Content
• Attention Mechanism
• Seq2Seq With Attention
• Attention Mechanism
• Transformer Structure
• Overview
• Input Encoding
• Transformer Block
• Transformer Performance
3
Attention Mechanism
THUNLP
4
Attention
5
Attention
6
Seq2Seq with Attention
[Figure, slides 7-17: a seq2seq model with attention translating the Chinese source sentence "多 个 机场 都 被迫 关闭 了" ("many airports were forced to shut down") into English. At each decoder step, starting from <START>, attention scores over the encoder hidden states are computed, softmax turns them into the attention distribution $\alpha^t = \mathrm{softmax}(e^t)$, and the resulting attention output is combined with the decoder state to generate the next target word: "many", "airports", "were", ..., "down".]
The attention output is the weighted sum of the encoder hidden
states, weighted by the attention distribution: $o^t = \sum_i \alpha_i^t h_i$.
17
Attention Mechanism
• Equations (1)
• Encoder hidden states: $h_1, h_2, \dots, h_N \in \mathbb{R}^h$
• Decoder hidden state at time step $t$: $s_t \in \mathbb{R}^h$
• Compute attention scores $e^t$ for this step:
$e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
• Use softmax to get the attention distribution $\alpha^t$:
$\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
18
Attention Mechanism
• Equations (2)
• Use the attention distribution to compute the weighted sum
of the encoder hidden states as the attention output:
$o^t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
19
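A minimal NumPy sketch of the equations on slides 18-19 (the array sizes and variable names are illustrative, not taken from the slides):

import numpy as np

def softmax(x):
    x = x - x.max()                      # for numerical stability
    return np.exp(x) / np.exp(x).sum()

N, h = 7, 4                              # toy sizes: 7 encoder states of dimension 4
H = np.random.randn(N, h)                # encoder hidden states h_1 ... h_N
s_t = np.random.randn(h)                 # decoder hidden state at step t

e_t = H @ s_t                            # attention scores e^t = [s_t^T h_1, ..., s_t^T h_N]
alpha_t = softmax(e_t)                   # attention distribution, sums to 1
o_t = alpha_t @ H                        # attention output: weighted sum of encoder states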
Attention Variants
• More variants
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al. ICLR 2015
20
General Definition of Attention
• Intuition:
• Based on the query, the weighted sum is a selective summary
of the values.
• We can obtain a fixed-size representation of an arbitrary set of
representations via the attention mechanism.
21
General Definition of Attention
• Math Formulations
• If we have values $h_1, h_2, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$
• The attention technique is used to compute the attention
output $o \in \mathbb{R}^{d_1}$ from the attention scores $e \in \mathbb{R}^N$:
$\alpha = \mathrm{softmax}(e) \in \mathbb{R}^N$
$o = \sum_{i=1}^{N} \alpha_i h_i \in \mathbb{R}^{d_1}$
• There are several attention variants, differing in how they
compute $e \in \mathbb{R}^N$
22
Attention Variants
• Multiplicative attention:
$e_i = s^\top W h_i \in \mathbb{R}$, where $W \in \mathbb{R}^{d_2 \times d_1}$
23
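A matching sketch of the multiplicative score above; W is a learned matrix and all shapes are toy values:

import numpy as np

N, d1, d2 = 7, 4, 6                      # toy sizes
H = np.random.randn(N, d1)               # values h_1 ... h_N
s = np.random.randn(d2)                  # query
W = np.random.randn(d2, d1)              # learned weight matrix

e = H @ (W.T @ s)                        # multiplicative scores e_i = s^T W h_i, shape (N,)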
Insights of Attention
24
Transformer Structure
THUNLP
25
Transformer
• Motivations
• Sequential computation in RNNs prevents parallelization
26
Transformer
• Motivations
• Sequential computation in RNNs prevents parallelization
• Overview
• Architecture: encoder-decoder
28
Transformer
• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding
29
Transformer
• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding
• Model: stack of several
encoder/decoder blocks
30
Transformer
• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding
• Model: stack of several
encoder/decoder blocks
• Output: probability of the
translated word
• Loss function: standard cross-
entropy loss over a softmax layer
31
Input Encoding
• Byte pair encoding (BPE) example
Dictionary (freq : word)             Vocabulary
5 : l o w                            l, o, w, e, r, n, s, t, i, d
2 : l o w e r
6 : n e w e s t
3 : w i d e s t
The most frequent symbol pair is (e, s), with frequency 6 + 3 = 9.
Add the merged symbol "es" and replace the pair in the dictionary:
6 : n e w es t
3 : w i d es t                       l, o, w, e, r, n, s, t, i, d, es
32
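A minimal sketch of one BPE merge step on the toy dictionary above; the helper names are illustrative and this is not the reference implementation from the Sennrich et al. paper:

from collections import Counter

# toy dictionary: word (as a tuple of symbols) -> frequency
vocab = {('l','o','w'): 5, ('l','o','w','e','r'): 2,
         ('n','e','w','e','s','t'): 6, ('w','i','d','e','s','t'): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]           # ((a, b), count), ties broken by first occurrence

def merge_pair(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])    # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair, count = most_frequent_pair(vocab)      # (('e', 's'), 9)
vocab = merge_pair(vocab, pair)              # 'newest' -> ('n','e','w','es','t'), etc.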
Input Encoding
Dictionary Vocabulary
33
Input Encoding
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2016
34
Input Encoding
36
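The overview slides pair byte pair encoding with positional encoding; a short sketch of the sinusoidal positional encoding from the original Transformer paper (the slides may present a different variant; array names are illustrative):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                     # added to the token embeddings

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)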
Transformer Block
• Two sublayers
• Multi-Head Attention
• Feed-Forward Network (2-layer MLP)
• Two tricks
• Residual connection
• Layer normalization
• Changes input to have mean 0 and
variance 1
• $A(Q, K, V) = \mathrm{softmax}(QK^\top)\,V$
[Figure: $Q$ multiplied by $K^\top$, a row-wise softmax, then multiplication by $V$]
39
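A matrix-form sketch of the attention equation above; the 1/sqrt(d_k) scaling follows the original "Attention Is All You Need" paper and the slide's simplified formula omits it (shapes and names are illustrative):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) attention scores
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ V                   # (n_q, d_v) attention output

n, d_k, d_v = 5, 8, 8                    # toy sizes
Q, K, V = (np.random.randn(n, d) for d in (d_k, d_k, d_v))
out = attention(Q, K, V)                 # shape (5, 8)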
Transformer Block: Attention Layer
40
Transformer Block: Attention Layer
• Self-attention
• Let the word vectors themselves select each other
• Q, K, V are derived from the stack of word vectors from a
sentence
41
Transformer Block: Attention Layer
• Multi-head Attention
• Different head: same computation
component & different parameters
• Concatenate all outputs and feed into
the linear layer
• For $h$ heads, we have:
$\mathrm{head}_i = A(QW_i^Q, KW_i^K, VW_i^V)$
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$
42
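A compact sketch of the multi-head computation above; it reuses the softmax() and attention() helpers from the previous sketch, and all dimensions and random weights are illustrative:

import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    # W_q, W_k, W_v: lists of h per-head projection matrices; W_o: output projection
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o     # Concat(head_1 ... head_h) W^O

d_model, h = 64, 8
d_head = d_model // h
n = 5
Q = K = V = np.random.randn(n, d_model)             # self-attention: Q, K, V from the same input
W_q = [np.random.randn(d_model, d_head) for _ in range(h)]
W_k = [np.random.randn(d_model, d_head) for _ in range(h)]
W_v = [np.random.randn(d_model, d_head) for _ in range(h)]
W_o = np.random.randn(h * d_head, d_model)
out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h)   # shape (5, 64)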
Transformer Block
• Two sublayers
• Multi-head attention
• 2-layer feed-forward network
• Two tricks
• Residual connection
• Layer normalization
• Changes input to have mean 0 and
variance 1
• In each layer, Q, K, V are the same
as the previous layer’s output
• Two changes:
• Masked self-attention
• The word can only look at previous words
• Encoder-decoder attention
• Queries come from the decoder while
keys and values come from the encoder
• Blocks are also repeated 6 times
44
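A small sketch of the masked self-attention constraint described above (a word can only look at previous words); the mask construction below is one standard approach, not necessarily the slide's exact implementation:

import numpy as np

n = 5
scores = np.random.randn(n, n)                       # raw self-attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future positions
scores[mask] = -1e9                                  # block attention to future words
# after softmax, each position only attends to itself and earlier positions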
Transformer Decoder
• Complete Decoder
45
Transformer
• Other tricks
• Checkpoint averaging
• Adam optimizer
• Dropout during training at every layer just before adding
residual
• Label smoothing
• Auto-regressive decoding with beam search and length
penalties
46
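Of the tricks above, label smoothing is the easiest to show concretely; in PyTorch it is a single argument to the cross-entropy loss (the 0.1 value matches the original Transformer paper, and the tensors below are toy examples):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # requires PyTorch >= 1.10
logits = torch.randn(8, 30000)                          # (batch, vocab) decoder outputs
targets = torch.randint(0, 30000, (8,))
loss = criterion(logits, targets)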
Transformer Performance
47
Transformer
• Attention Visualization
• Implicit co-reference resolution
49
Transformer
• Advantages:
• The Transformer is a powerful model that has proven effective
in many NLP tasks
• The Transformer is well suited to parallelization
• It demonstrates the effectiveness of the attention mechanism
• It also underlies recent NLP advances such as BERT and GPT
• Disadvantages:
• The architecture is hard to optimize and sensitive to model
modifications
• $O(n^2)$ per-layer complexity makes it hard to apply to
extremely long documents (the max length is usually set to 512)
50
Pre-trained Language
Models
Yuan Yao
[email protected]
THUNLP
51
Content
• Language Modeling
• Pre-trained Language Models (PLMs)
• Fine-tuning Approaches
• GPT and BERT
• PLMs after BERT
• Applications of Masked LM
• Cross-lingual and Cross-modal LM Pre-training
• Frontiers of PLMs
• GPT-3, T5 and MoE
52
Language Modeling
THUNLP
53
Review of Language Modeling
[Figure: a language model predicting the next word of a context, e.g. candidate continuations "read" / "code"]
54
Language Modeling
55
Language Modeling
56
Language Modeling
57
Language Modeling
[Figure: a recurrent language model with inputs $x_1, x_2, \dots$, hidden states $h_0, h_1, h_2, \dots$, and outputs $y_1, y_2, \dots$]
58
Language Modeling
59
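For reference, the quantity these language modeling slides build on is the autoregressive factorization of a sequence's probability, so the model only ever has to predict the next word from its left context:

$$P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$$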
Pre-trained Language Models
(PLMs)
THUNLP
60
What are PLMs
61
Two Mainstreams of PLMs
• Feature-based approaches
• The most representative model of feature-based approaches
is word2vec
• Use the outputs of PLMs as the inputs of our downstream
models
• Fine-tuning approaches
• The most representative model of fine-tuning approaches is
BERT.
• The language models will also be the downstream models and
their parameters will be updated
62
Development of Fine-tuning
Approaches
THUNLP
63
GPT
• A huge Transformer LM
• Trained on 40GB of text
• SOTA perplexities on datasets it’s not even trained on
65
More than LM
• Zero-Shot Learning
• Ask LM to generate from a prompt
• Reading Comprehension:
• <context> <question> A:
• Summarization:
• <article> TL;DR:
• Question Answering:
• <question> A:
66
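A sketch of how such a zero-shot prompt can be fed to a generative LM, using the Transformers pipeline introduced later in this lecture; the small 'gpt2' checkpoint and the prompt text are illustrative stand-ins, not the setup used in the paper:

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

# Summarization-style zero-shot prompt: <article> TL;DR:
article = "The city's airports were forced to close after heavy snow ..."
prompt = article + " TL;DR:"
print(generator(prompt, max_new_tokens=30)[0]['generated_text'])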
Zero-Shot Results
67
GPT
68
BERT
70
Unidirectional vs. Bidirectional Models
71
BERT: Masked LM
72
BERT: Masked LM
73
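A minimal illustration of masked-token prediction with the Transformers fill-mask pipeline (the checkpoint and example sentence are illustrative):

from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
# BERT predicts the masked token from both left and right context
predictions = unmasker("The airport was [MASK] because of the storm.")
for p in predictions[:3]:
    print(p['token_str'], round(p['score'], 3))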
BERT: Next Sentence Prediction
74
BERT: Input Representation
75
BERT: GLUE Results
76
BERT: Effect of Pre-training Task
77
BERT: Effect of Model Size
78
BERT
79
Summary
80
PLMs after BERT
THUNLP
81
Is BERT really perfect?
82
RoBERTa
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ICLR 2020.
84
Applications of Masked LM
THUNLP
85
Masked LM
86
Cross-lingual LM Pre-training
88
Cross-Modal LM Pre-training
VideoBERT: A Joint Model for Video and Language Representation Learning. ICCV 2019.
89
Cross-Modal LM Pre-training
91
Frontiers of PLMs
THUNLP
92
GPT-3
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.
(1. The USA was created in 1776. 2. 1700 is earlier than 1776. 3. There was no president then.)
95
T5
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
96
Larger Model with MoE
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021.
97
Summary
98
Transformers Tutorial
Weize Chen
[email protected]
THUNLP
99
Outline
• Introduction
• What is Transformers package
• Why use it
• Frequently-used APIs
• Load / Save model
• Tokenize texts
•…
• Quick Start
• Downstream task evaluation
• Demo
100
Introduction
• Various pre-trained language models are being
proposed
😵💫 101
Introduction
• 🤗 Transformers is a package:
• Providing thousands of models
• Supporting PyTorch, TensorFlow, JAX
• Hosting pre-trained models for text, audio and vision
104
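A hedged sketch of the "frequently-used APIs" listed in the outline (load, apply and save a model and tokenizer); the checkpoint name and save path are just examples:

from transformers import AutoTokenizer, AutoModel

# Load a pre-trained model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Tokenize texts into model-ready tensors
inputs = tokenizer('I love this movie!', return_tensors='pt')
outputs = model(**inputs)                 # last_hidden_state, pooler_output, ...

# Save (and later re-load) locally
model.save_pretrained('./my-bert')
tokenizer.save_pretrained('./my-bert')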
Pipeline
• I want to directly use the off-the-shelf model on
down-stream tasks
• Use pipeline!
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('I love you! ')
# Fine-tuning on a downstream task: model, args, tokenizer, encoded_dataset
# and compute_metrics come from earlier tutorial slides (see the sketch after this slide)
from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()     # Start training!
trainer.evaluate()  # Evaluate on the validation set
108
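The Trainer call above assumes several objects defined on tutorial slides not reproduced in this handout; a hedged sketch of what they could look like for a sentiment-analysis setup (the dataset, metric and hyperparameters are illustrative, not necessarily those used in the demo):

import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments)

dataset = load_dataset('glue', 'sst2')                    # sentiment analysis data
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                           num_labels=2)

def preprocess(examples):
    return tokenizer(examples['sentence'], truncation=True)

encoded_dataset = dataset.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': float((preds == labels).mean())}

args = TrainingArguments(output_dir='out',
                         evaluation_strategy='epoch',
                         per_device_train_batch_size=16,
                         num_train_epochs=3)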
Demo
• We have provided a demo, which fine-tunes BERT
for sentiment analysis task.
• See https://colab.research.google.com/drive/1tcDiyHIKgEJp4TzGbGp27HYbdFWGolU_?usp=sharing
109
Q&A
THUNLP