
Lecture 3

Transformer and Pretrained Language Models
Yufei Huang, Yuan Yao, Weize Chen

THUNLP

0
Content
• Transformer
• Attention Mechanism
• Transformer Structure
• Pretrained Language Models
• Language Modeling
• Pre-trained Language Models (PLMs)
• Fine-tuning Approaches
• PLMs after BERT
• Applications of Masked LM
• Frontiers of PLMs
• Transformers Tutorial
• Introduction
• Frequently-used APIs
• Quick Start
• Demo
1
Transformer

Yufei Huang
[email protected]
THUNLP

2
Content
• Attention Mechanism
• Seq2Seq With Attention
• Attention Mechanism
• Transformer Structure
• Overview
• Input Encoding
• Transformer Block
• Transformer Performance

3
Attention Mechanism

THUNLP

4
Attention

• Seq2Seq: the bottleneck problem

5
Attention

• The Bottleneck Problem


• The single vector of source sentence encoding needs to
capture all information about the source sentence
• The single vector limits the representation capacity of the
encoder: the information bottleneck
• Attention
• Attention provides a solution to the bottleneck problem
• Core idea: at each step of the decoder, focus on a particular
part of the source sequence

6
Seq2Seq with Attention

Attention score (dot product): e_1^1 = s_1^T h_1

[Figure: RNN encoder over the source sentence "多 个 机场 都 被迫 关闭 了" ("many airports were forced to close"), producing hidden states h_1, ..., h_7; RNN decoder starting from <START>, with hidden state s_1]

7
Seq2Seq with Attention

Attention score (dot product): e_2^1 = s_1^T h_2

[Figure: the same encoder-decoder diagram; the score of the second encoder state h_2 against s_1]

8
Seq2Seq with Attention

Attention score (dot product): e_3^1 = s_1^T h_3

[Figure: the same encoder-decoder diagram; the score of the third encoder state h_3 against s_1]

9
Seq2Seq with Attention

Attention score (dot product): e_4^1 = s_1^T h_4

[Figure: the same encoder-decoder diagram; the score of the fourth encoder state h_4 against s_1]

10
Seq2Seq with Attention

Attention score (dot product): e_7^1 = s_1^T h_7

All attention scores for this step: e^1 = [s_1^T h_1, ..., s_1^T h_7]

[Figure: the same encoder-decoder diagram; scores of all encoder states h_1, ..., h_7 against s_1]

11
Seq2Seq with Attention

On this step, the decoder mostly focuses on the first two hidden states.

Attention distribution: α^1 = softmax(e^1)
Attention scores: e^1 = [s_1^T h_1, ..., s_1^T h_7]

[Figure: the same encoder-decoder diagram, with the softmax distribution over the source positions shown above the scores]

12
Seq2Seq with Attention
The attention output is the weighted sum of the encoder hidden states, weighted by the attention distribution.

Attention output: o^1 = Σ_{i=1}^{7} α_i^1 h_i
Attention distribution: α^1 = softmax(e^1)
Attention scores: e^1 = [s_1^T h_1, ..., s_1^T h_7]

[Figure: the same encoder-decoder diagram, with the attention output computed from h_1, ..., h_7 and α^1]

13
Seq2Seq with Attention

Use the concatenation of the attention output and the decoder hidden state to compute the predicted word, here "many":

ŷ_1 = [o_1; s_1]

[Figure: the same encoder-decoder diagram; the first target word "many" is predicted from [o_1; s_1]]

14
Seq2Seq with Attention

Attention output at step 2; predicted word "airports":

ŷ_2 = [o_2; s_2]

[Figure: the same diagram; the decoder has consumed "<START> many" and has hidden states s_1, s_2]

15
Seq2Seq with Attention

Attention output at step 3; predicted word "were":

ŷ_3 = [o_3; s_3]

[Figure: the same diagram; the decoder has consumed "<START> many airports" and has hidden states s_1, s_2, s_3]

16
Seq2Seq with Attention

Decoding continues step by step; at the final step the predicted word is "down":

ŷ_t = [o_t; s_t]

[Figure: the same diagram; the decoder has consumed "<START> many airports ... close", with hidden states s_1, s_2, s_3, ..., s_7]

17
Attention Mechanism

• Equations (1)
• Encoder hidden states: h_1, h_2, ..., h_N ∈ ℝ^h
• Decoder hidden state at time step t: s_t ∈ ℝ^h
• Compute attention scores e^t for this step:
  e^t = [s_t^T h_1, ..., s_t^T h_N] ∈ ℝ^N
• Use softmax to get the attention distribution α^t:
  α^t = softmax(e^t) ∈ ℝ^N

18
Attention Mechanism

• Equations (2)
• Use the attention distribution to compute the weighted sum of the encoder hidden states as the attention output:
  o^t = Σ_{i=1}^{N} α_i^t h_i ∈ ℝ^h
• Concatenate the attention output and the decoder hidden state to predict the word:
  [o^t; s_t] ∈ ℝ^{2h}

19
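As a concrete illustration of the two equation slides above, here is a minimal NumPy sketch of a single attention step. The shapes (N = 7 source states, hidden size h = 8) and the variable names are illustrative assumptions, not reference code from the lecture.

import numpy as np

def softmax(x):
    x = x - x.max()                     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

N, h = 7, 8                             # source length N, hidden size h
H = np.random.randn(N, h)               # encoder hidden states h_1, ..., h_N
s_t = np.random.randn(h)                # decoder hidden state s_t

e_t = H @ s_t                           # attention scores e^t in R^N
alpha_t = softmax(e_t)                  # attention distribution alpha^t
o_t = alpha_t @ H                       # attention output o^t in R^h
concat = np.concatenate([o_t, s_t])     # [o^t; s_t] in R^{2h}, used to predict the next word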
Attention Variants

• There are several attention variants, differing in how they compute e ∈ ℝ^N
• Additive attention:
  e_i = v^T tanh(W_1 h_i + W_2 s) ∈ ℝ
  • where W_1 ∈ ℝ^{d_3×d_1}, W_2 ∈ ℝ^{d_3×d_2} are weight matrices and v ∈ ℝ^{d_3} is a weight vector

• More variants

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al. ICLR 2015 20
General Definition of Attention

• A more general definition of attention:


• Given a query vector and a set of value vectors, the attention
technique computes a weighted sum of the values according
to the query

• Intuition:
• Based on the query, the weighted sum is a selective summary
of the values.
• We can obtain a fixed-size representation of an arbitrary set of
representations via the attention mechanism.

21
General Definition of Attention

• Math Formulations
• If we have values h_1, h_2, ..., h_N ∈ ℝ^{d_1} and a query s ∈ ℝ^{d_2}
• The attention technique computes the attention output o ∈ ℝ^{d_1} from the attention scores e ∈ ℝ^N:
  α = softmax(e) ∈ ℝ^N
  o = Σ_{i=1}^{N} α_i h_i ∈ ℝ^{d_1}
• There are several attention variants, differing in how they compute e ∈ ℝ^N

22
Attention Variants

• There are several attention variants, differing in how they compute e ∈ ℝ^N
• Basic dot-product attention:
  e_i = s^T h_i ∈ ℝ
  • This assumes the vector sizes match: d_1 = d_2
• Multiplicative attention:
  e_i = s^T W h_i ∈ ℝ, with W ∈ ℝ^{d_2×d_1}

23
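The scoring variants are easy to compare side by side. Below is a small NumPy sketch; all dimensions and weight matrices are random placeholders chosen only for illustration.

import numpy as np

d1, d2, d3 = 6, 5, 4
h_i = np.random.randn(d1)                       # one value (encoder state) h_i
s = np.random.randn(d2)                         # query (decoder state) s

# Multiplicative attention: e_i = s^T W h_i, with W in R^{d2 x d1}
W = np.random.randn(d2, d1)
e_mult = s @ W @ h_i

# Additive attention: e_i = v^T tanh(W1 h_i + W2 s)
W1 = np.random.randn(d3, d1)
W2 = np.random.randn(d3, d2)
v = np.random.randn(d3)
e_add = v @ np.tanh(W1 @ h_i + W2 @ s)

# Basic dot-product attention requires d1 == d2: e_i = s^T h_i
s_same = np.random.randn(d1)
e_dot = s_same @ h_i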
Insights of Attention

• Attention solves the bottleneck problem


• The decoder can directly attend to the source
• Attention helps with vanishing gradient problem
• By providing shortcuts to long-distance states
• Attention provides some
interpretability
• We can find out what the decoder
was focusing on by the attention map
• Attention allows the network to align
relevant words

24
Transformer Structure

THUNLP

25
Transformer

• Motivations
• Sequential computation in RNNs prevents parallelization

• Even with GRUs or LSTMs, RNNs still need an attention mechanism to provide access to any state
• Maybe we do not need RNNs?

26
Transformer

• Motivations
• Sequential computation in RNNs prevents parallelization

• Even with GRUs or LSTMs, RNNs still need an attention mechanism to provide access to any state
• Maybe we do not need RNNs?
• Attention is all you need (NeurIPS 2017)

Attention Is All You Need, Vaswani et al. NeurIPS 2017 27


Transformer

• Overview
• Architecture: encoder-decoder

28
Transformer

• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding

29
Transformer

• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding
• Model: stack of several
encoder/decoder blocks

30
Transformer

• Overview
• Architecture: encoder-decoder
• Input: byte pair encoding +
positional encoding
• Model: stack of several
encoder/decoder blocks
• Output: probability of the
translated word
• Loss function: standard cross-
entropy loss over a softmax layer

31
Input Encoding

• Byte Pair Encoding (BPE)


• A word segmentation algorithm
• Start with a vocabulary of characters
• Repeatedly merge the most frequent pair of symbols into a new symbol

Dictionary (freq : word)
  5 : l o w
  2 : l o w e r
  6 : n e w es t
  3 : w i d es t

Vocabulary: l, o, w, e, r, n, w, s, t, i, d
Add the pair (e, s) with frequency 9 → l, o, w, e, r, n, w, s, t, i, d, es

32
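A minimal sketch of one such merge step on the toy dictionary above (this simplifies real BPE, which also tracks end-of-word markers; ties in the pair counts are broken here by insertion order):

from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]              # e.g. (('e', 's'), 9)

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair, freq = most_frequent_pair(corpus)         # ('e', 's') with frequency 9
corpus = merge(corpus, pair)                    # "es" becomes a new vocabulary symbol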
Input Encoding

• Byte Pair Encoding (BPE)


• A word segmentation algorithm
• Start with a vocabulary of characters
• Repeatedly merge the most frequent pair of symbols into a new symbol

Dictionary (freq : word)
  5 : l o w
  2 : l o w e r
  6 : n e w est
  3 : w i d est

Vocabulary: l, o, w, e, r, n, w, s, t, i, d, es
Add the pair (es, t) with frequency 9 → l, o, w, e, r, n, w, s, t, i, d, es, est

33
Input Encoding

• Byte Pair Encoding (BPE)


• Solve the OOV (out of vocabulary) problem by encoding rare
and unknown words as sequences of subword units
• In the example above, the OOV word "lowest" would be segmented into "low est"
• The relation between "low" and "lowest" generalizes to, e.g., "smart" and "smartest"
• Models using BPE took top places in WMT 2016

Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2016 34
Input Encoding

• Byte Pair Encoding (BPE)


• Dimension: d
• Positional Encoding (PE)
• The Transformer block by itself is insensitive to word order: the same word at different positions would get the same representation
• Positional encoding is added so that the same word at different positions has different representations:
  PE(pos, 2i)   = sin(pos / 10000^(2i/d))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
• i indexes the embedding dimensions and ranges from 0 to d/2 - 1
• Input = BPE + PE
35
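A short NumPy sketch of the sinusoidal encoding above; max_len and d are illustrative values (d is assumed to be even).

import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]               # (max_len, 1) positions
    i = np.arange(d // 2)[None, :]                  # (1, d/2) dimension indices
    angle = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=128, d=512)
# input to the first block = BPE token embedding + pe[position]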
Visualization of PE

36
Transformer Block

• Two sublayers
• Multi-Head Attention
• Feed-Forward Network (2-layer MLP)
• Two tricks
• Residual connection
• Layer normalization
• Changes input to have mean 0 and
variance 1

Layer normalization. Ba et al. arXiv 2016.


Deep residual learning for image recognition. He et al. CVPR 2016. 37
Transformer Block: Attention Layer

• General Dot-Product Attention


• Inputs
• A query q and a set of key-value (k, v) pairs
• Queries and keys are vectors with dimension 𝑑D
• Values are vectors with dimension 𝑑E
• Output
• Weighted sum of values
• Weight of each value is computed by the dot product of the
query and corresponding key
  A(q, K, V) = Σ_i [ exp(q · k_i) / Σ_j exp(q · k_j) ] v_i
• Stacking multiple queries q into a matrix Q:
  A(Q, K, V) = softmax(QK^T) V
38
Transformer Block: Attention Layer

• A(Q, K, V) = softmax(QK^T) V

[Figure: Q × K^T gives the attention scores; a row-wise softmax gives the attention distribution; multiplying the distribution by V gives the output]

39
Transformer Block: Attention Layer

• Scaled Dot-Product Attention


• Problem
• As d_k gets large, the variance of q^T k increases
• The softmax gets very peaked, and the gradient gets smaller
• Solution
• Scale by the square root of the query/key dimension:
  A(Q, K, V) = softmax(QK^T / √d_k) V

40
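A NumPy sketch of scaled dot-product attention as defined above, A(Q, K, V) = softmax(QK^T / √d_k) V; the matrix sizes are examples.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of values

Q = np.random.randn(4, 64)    # 4 queries with d_k = 64
K = np.random.randn(6, 64)    # 6 keys
V = np.random.randn(6, 64)    # 6 values with d_v = 64
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)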
Transformer Block: Attention Layer

• Self-attention
• Let the word vectors themselves select each other
• Q, K, V are derived from the stack of word vectors from a
sentence

41
Transformer Block: Attention Layer

• Multi-head Attention
• Each head performs the same computation with different parameters
• Concatenate all head outputs and feed them into a linear layer
• For h heads, we have:

  head_i = A(Q W_i^Q, K W_i^K, V W_i^V)
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

42
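Continuing the NumPy sketches, here is multi-head self-attention built on the scaled_dot_product_attention function from the previous sketch; the model size, number of heads, and random projection matrices are illustrative.

import numpy as np

d_model, n_heads = 64, 8
d_head = d_model // n_heads
X = np.random.randn(10, d_model)                  # 10 token vectors (self-attention: Q = K = V = X)

W_Q = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_K = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_V = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_head, d_model)

heads = [scaled_dot_product_attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i])
         for i in range(n_heads)]                 # head_i = A(Q W_i^Q, K W_i^K, V W_i^V)
output = np.concatenate(heads, axis=-1) @ W_O     # Concat(head_1, ..., head_h) W^O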
Transformer Block

• Two sublayers
• Multi-head attention
• 2-layer feed-forward network
• Two tricks
• Residual connection
• Layer normalization
• Changes input to have mean 0 and
variance 1
• In each layer, Q, K, V are all derived from the previous layer's output

Layer normalization. Ba et al. arXiv 2016.


Deep residual learning for image recognition. He et al. CVPR 2016. 43
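Putting the pieces of this slide together, here is a rough PyTorch sketch of one encoder block with the two sublayers, residual connections, and (post-)layer normalization. The hyperparameters match the original Transformer-base configuration, but the class itself is only an illustration, not the lecture's reference implementation.

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)          # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))       # 2-layer feed-forward sublayer, same tricks
        return x

x = torch.randn(2, 10, 512)                   # (batch, sequence length, d_model)
y = TransformerEncoderBlock()(x)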
Transformer Decoder Block

• Two changes:
• Masked self-attention
• The word can only look at previous words
• Encoder-decoder attention
• Queries come from the decoder while
keys and values come from the encoder
• Blocks are also repeated 6 times

44
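Masked self-attention can be implemented by adding a causal mask to the attention scores before the softmax, so that position t cannot see positions after t. A small NumPy sketch with illustrative shapes:

import numpy as np

def causal_mask(n):
    # 0 where attention is allowed, -inf on future positions (strict upper triangle)
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.random.randn(5, 5)              # decoder self-attention scores
masked = scores + causal_mask(5)            # future positions become -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1; future weights are exactly 0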
Transformer Decoder

• Complete Decoder

45
Transformer

• Other tricks
• Checkpoint averaging
• ADAM optimizer
• Dropout during training at every layer just before adding
residual
• Label smoothing
• Auto-regressive decoding with beam search and length
penalties

46
Transformer Performance

• Experimental Results for Machine Translation

47
Transformer

• Attention Visualization in Layer 5


• Words start to focus on other words accordingly

Online Visualization Demo: https://poloclub.github.io/dodrio/ 48


Transformer

• Attention Visualization
• Implicit co-reference resolution

49
Transformer

• Advantage:
• The Transformer is a powerful model and proven to be effective
in many NLP tasks
• The Transformer is suitable for parallelization
• It proves the effectiveness of the attention mechanism
• It also gives insights to recent NLP advancements such as BERT
and GPT
• Disadvantage:
• The architecture is hard to optimize and sensitive to model modifications
• O(n²) per-layer complexity makes it hard to apply to extremely long documents (the maximum length is usually set to 512)
50
Pre-trained Language
Models
Yuan Yao
[email protected]
THUNLP

51
Content
• Language Modeling
• Pre-trained Language Models (PLMs)
• Fine-tuning Approaches
• GPT and BERT
• PLMs after BERT
• Applications of Masked LM
• Cross-lingual and Cross-modal LM Pre-training
• Frontiers of PLMs
• GPT-3, T5 and MoE

52
Language Modeling

THUNLP

53
Review of Language Modeling

• Language Modeling is the task of predicting the


upcoming word
• Compute the conditional probability of the upcoming word w_n:
  P(w_n | w_1, w_2, ..., w_{n-1})

Example: "Never too late to _____" → likely continuations: learn, code, read

54
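To make this concrete, here is a hedged sketch that scores the three candidate words with a causal LM from the Transformers library introduced later in this lecture. GPT-2 is used purely as an example model, and the snippet assumes each candidate (with its leading space) maps to a single token.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Never too late to", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, sequence length, vocabulary size)
probs = logits[0, -1].softmax(dim=-1)          # P(w_n | w_1, ..., w_{n-1})

for word in [" learn", " code", " read"]:
    token_id = tokenizer.encode(word)[0]       # assumes the word is a single token
    print(word, float(probs[token_id]))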
Language Modeling

• Language Modeling: the most basic and important NLP


task
• It captures a variety of knowledge useful for language understanding, e.g., linguistic knowledge and factual knowledge
• It requires only plain text, without any human annotations

55
Language Modeling

• The language knowledge learned by language models


can be transferred to other NLP tasks easily
• There are three representative models for transfer
learning of NLP
• Word2vec
• Pre-trained RNN
• GPT&BERT

56
Language Modeling

• Distributed Representations of Words and Phrases and


their Compositionality (word2vec)
• The basic idea is taken from “A Neural Probabilistic Language
Model”

57
Language Modeling

• Semi-supervised Sequence Learning (Pre-trained RNN)


• The basic idea is taken from “Recurrent neural network based
language model”

[Figure: an RNN language model with inputs x_1, x_2, ..., x_N, hidden states h_0, h_1, h_2, ..., h_N, and outputs y_1, y_2, ..., y_N]

58
Language Modeling

• GPT & BERT


• The basic idea is taken from “Character-Level Language
Modeling with Deeper Self-Attention”

59
Pre-trained Language Models
(PLMs)

THUNLP

60
What are PLMs

• We have mentioned several PLMs in the last section


• word2vec, GPT, BERT, …
• PLMs: language models having powerful transferability
for other NLP tasks
• word2vec is the first PLM
• Nowadays, the PLMs based on Transformers are very
popular (e.g. BERT)

61
Two Mainstreams of PLMs

• Feature-based approaches
• The most representative model of feature-based approaches
is word2vec
• Use the outputs of PLMs as the inputs of our downstream
models
• Fine-tuning approaches
• The most representative model of fine-tuning approaches is
BERT.
• The language models will also be the downstream models and
their parameters will be updated

62
Development of Fine-tuning
Approaches

THUNLP

63
GPT

• Inspired by the success of Transformers in different NLP


tasks, GPT is the first work to pre-train a PLM based on
Transformer
• Transformer + left-to-right LM
• Fine-tuned on downstream tasks

Improving Language Understanding by Generative Pre-Training. OpenAI 2018. 64


GPT-2

• A huge Transformer LM
• Trained on 40GB of text
• SOTA perplexities on datasets it’s not even trained on

Results of language modeling

65
More than LM

• Zero-Shot Learning
• Ask LM to generate from a prompt
• Reading Comprehension:
• <context> <question> A:
• Summarization:
• <article> TL;DR:
• Question Answering:
• <question> A:

66
Zero-Shot Results

67
GPT

• A very powerful generative model


• Also achieve very good transfer learning results on
downstream tasks
• Outperform ELMo significantly

• The key to success


• Big data (Large unsupervised corpus)
• Deep neural model (Transformer)

68
BERT

• The most popular PLM


• NAACL 2019 best paper
• Change the paradigm of NLP significantly

BERT: Bi-directional Encoder Representations from Transformers. NAACL 2019. 69


Problem with Previous Methods

• Problem: Language models only use left context or right


context, but language understanding is bidirectional

• Why are LMs unidirectional?


• Reason 1: Directionality is needed to generate a well-
formed probability distribution
• Reason 2: Words can “see themselves” in a bidirectional
encoder

70
Unidirectional vs. Bidirectional Models

71
BERT: Masked LM

• Solution: Mask out k% of the input words, and then


predict the masked words
• k=15% in BERT

• Too little masking: too expensive to train


• Too much masking: not enough context

72
BERT: Masked LM

• Problem: the [MASK] token is never seen at fine-tuning time


• Solution: for the 15% of words selected for prediction:
• 80% of the time, replace with [MASK]
went to the store → went to the [MASK]
• 10% of the time, replace with a random word
went to the store → went to the running
• 10% of the time, keep the same
went to the store → went to the store

73
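A sketch of this masking procedure (the token list, placeholder vocabulary, and random draws are illustrative; real BERT works on WordPiece ids rather than strings):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # this position will be predicted
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                   # 10%: keep the original word
        else:
            inputs.append(tok)
            labels.append(None)                      # not predicted
    return inputs, labels

print(mask_tokens("went to the store".split(), vocab=["running", "apple", "store"]))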
BERT: Next Sentence Prediction

• To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or just a random sentence

74
BERT: Input Representation

• Use a 30,000-token WordPiece vocabulary on the input
• Each token's representation is the sum of three embeddings (token, segment, and position)
• Packing both sentences into a single sequence is much more efficient

75
BERT: GLUE Results

76
BERT: Effect of Pre-training Task

• Masked LM (compared to left-to-right LM) is very


important on some tasks
• Next Sentence Prediction is important for other tasks

77
BERT: Effect of Model Size

• Big models help a lot


• Going from 110M -> 340M params helps even on
datasets with 3,600 labeled examples

78
BERT

• Empirical results from BERT are great, but biggest


impact on the field is:
• With pre-training, bigger == better, without clear limits (so far)

• Excellent performance for researchers and companies


building NLP systems

79
Summary

• Feature-based approaches transfer the contextualized


word embeddings for downstream tasks
• Fine-tuning approaches transfer the whole model for
downstream tasks
• Experimental results show that fine-tuning approaches
are better than feature-based approaches
• Hence, current research mainly focuses on fine-tuning
approaches

80
PLMs after BERT

THUNLP

81
Is BERT really perfect?

• Can the pre-training paradigm be further optimized?

• The gap between pre-training and fine-tuning


• [MASK] token will not appear in fine-tuning

• The efficiency of the Masked Language Model


• Only 15% of the tokens are predicted

82
RoBERTa

• Explore several pre-training approaches for a more


robust BERT
• Dynamic Masking
• Model Input Format
• Next Sentence Prediction
• Training with Large Batches
• Text Encoding
• Massive experiments

RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. 83


ELECTRA

• Recall: the efficiency of bi-directional pre-training


• Masked LM: predicts 15% of tokens
• Permutation LM: predicts about 1/6 to 1/7 of tokens
• Traditional LM: predicts 100% of tokens
• but only in a single direction
• Replaced Token Detection
• A new bi-directional pre-training task
• Makes a prediction at 100% of positions

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ICLR 2020. 84
Applications of Masked LM

THUNLP

85
Masked LM

• Basic idea: to use bi-direction information to predict the


target token
• Beyond token: use multi-modal or multi-lingual
information together by masking
• Input the objects from different domains together and
predict the target object based on the input objects

86
Cross-lingual LM Pre-training

• Translation Language Modeling (TLM)


• The TLM objective extends MLM to pairs of parallel
sentences (e.g., English-French)
• To predict a masked English word, the model can attend
to both the English sentence and its French translation,
and is encouraged to align English and French
representations

Cross-lingual Language Model Pretraining. NeurIPS 2019. 87


Cross-lingual LM Pre-training

• The translation language modeling (TLM) objective


improves cross-lingual language model pretraining by
leveraging parallel data

88
Cross-Modal LM Pre-training

• Pairs of videos and texts from automatic speech


recognition (ASR)
• Generate a sequence of “visual words” by applying
hierarchical vector quantization to features derived from
the video using a pre-trained model
• Encourages the model to focus on high-level semantics
and longer-range temporal dynamics in the video

VideoBERT: A Joint Model for Video and Language Representation Learning. ICCV 2019. 89
Cross-Modal LM Pre-training

Video captioning examples 90


Summary

• Masked LM inspired a variety of new pre-training tasks


• What’s your idea about transferring Masked LM?

91
Frontiers of PLMs

THUNLP

92
GPT-3

• A super large-scale PLM

Date      Organization  Model    Parameters  Data    Training Time

2018.6    OpenAI        GPT      110M        4GB     3 days
2018.10   Google        BERT     330M        16GB    50 days
2019.2    OpenAI        GPT-2    1.5B        40GB    200 days
2019.7    Facebook      RoBERTa  330M        160GB   3 years
2019.10   Google        T5       11B         800GB   66 years
2020.6    OpenAI        GPT-3    175B        560GB   90 years

Language Models are Few-Shot Learners. OpenAI. 2020. 93


GPT-3

• Excellent few-shot/in-context learning ability

Language Models are Few-Shot Learners. OpenAI. 2020. 94


GPT-3: Doesn’t know when to say “I do not
know”

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.
(The USA was created in 1776, and 1700 is earlier than 1776, so there was no president then.)

95
T5

• Reframe all NLP tasks into a unified text-to-text format where the input and output are always text strings
• Encoder-decoder architecture

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR. 2020. 96
Larger Model with MoE

• Enhance encoder-decoder with MoE (Mixture of Experts)


for billions of parameters
• GShard: 600B parameters
• Switch Transformer: 1,571B parameters

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021.
97
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021.
Summary

• The technique of PLMs is very important for NLP (from


word2vec to BERT)
• Fine-tuning approaches are widely used after BERT
• The idea of Masked LM inspired the research on
unsupervised learning
• Consider PLMs first when you plan to construct a new
NLP system

98
Transformers Tutorial

Weize Chen
[email protected]
THUNLP

99
Outline
• Introduction
• What is Transformers package
• Why use it
• Frequently-used APIs
• Load / Save model
• Tokenize texts
•…
• Quick Start
• Downstream task evaluation
• Demo

100
Introduction
• Various pre-trained language models are being
proposed

ALBERT, BART, BERT, BERTweet, BigBird


ByT5, CLIP, CodeGen, ConvBERT, CPM, CTRL
DeBERTa, DistilBERT, DPR, ELECTRA, GPT, GPT-J
GPT-Neo, GPT-2, ImageGPT, LongFormer, mBART
MT5, REALM, Reformer, RoBERTa, T5, T5v1.1
Transformer-XL, ViT, XLM, XLNet
Sooooooooo many new models

😵💫‍‍ 101
Introduction
• Various pre-trained language models are being
proposed

• 🤔 Is there any package that helps us:


• Reproduce the results easily
• Deploy the models quickly
• Customize your models freely

102
Introduction
• Various pre-trained language models are being
proposed

• 🤔 Is there any package that helps us:


• Reproduce the results easily
• Deploy the models quickly
• Customize your models freely

103
Introduction
• 🤗 Transformers is a package:
• Providing thousands of models
• Supporting PyTorch, TensorFlow, JAX
• Hosting pre-trained models for text, audio and vision

• ✅ Fairly easy to use. Low barrier to entry for


researchers.

• Almost all research on pre-trained models is built on 🤗 Transformers!

104
Pipeline
• I want to directly use an off-the-shelf model on downstream tasks
• Use pipeline!
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('I love you!')

from transformers import pipeline
question_answerer = pipeline('question-answering')
question_answerer({
    'question': 'What is the name of the repository?',
    'context': 'Pipeline has been included in the huggingface/transformers repository'
})

• Pipeline automatically uses a fine-tuned model and performs the downstream task
105
Tokenization
• Different pre-trained language models use different tokenization strategies
• BPE (Byte-Pair Encoding): GPT, Roberta, …
• WordPiece: BERT, Electra, …
• SentencePiece: ALBERT, T5, …

• No need to worry in 🤗 Transformers


from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("I love you.")

• The tokenizer automatically uses the tokenization


strategy of the given model to tokenize your text.
106
Frequently-used APIs
• Load the pre-trained models in a few lines
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

• Tokenize the texts


inputs = tokenizer("Hello World!", return_tensors='pt')

• Run the model


outputs = model(**inputs)

• Save the fine-tuned model in one line


model.save_pretrained("path_to_save_model")

🥳 That’s all! 107


Frequently-used APIs
• Train the model with Trainer

from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()     # Start training!
trainer.evaluate()

108
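The snippet above assumes an args object and a compute_metrics function. A hedged example of how they might be built is shown below; the output directory, hyperparameters, and the accuracy metric are illustrative choices, not values fixed by the lecture.

import numpy as np
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned",        # where checkpoints are written
    evaluation_strategy="epoch",        # evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}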
Demo
• We have provided a demo, which fine-tunes BERT for a sentiment analysis task.

• You will be able to use 🤗 Transformers after going


through this demo.

• See https://colab.research.google.com/drive/1tcDiyHIKgEJp4TzGbGp27HYbdFWGolU_?usp=sharing

109
Q&A

THUNLP
