self-supervised learning and Bert from a

Self-Supervised
Learning
Hung-yi Lee 李宏毅
https://www.sesameworkshop.org/what-we-do/sesame-streets-50th-anniversary

ELMo
(Embeddings from
Language Models)
BERT (Bidirectional
Encoder Representations
from Transformers)
ERNIE (Enhanced Representation
through Knowledge Integration)
Big Bird: Transformers for
Longer Sequences

Source of image: https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html
BERT
Bertolt
Hoover
340M
parameters

BERT
GPT-2
T5
GPT-3
ELMo
Source: https://youtu.be/wJJnjzNlMws

Source of image: https://huaban.com/pins/1714071707/
ELMO
(94M)
BERT
(340M)
GPT-2
(1542M)
The models become larger
and larger …

Megatron (8B)
GPT-2 T5 (11B)
Turing NLG
(17B)
The models become larger
and larger …
GPT-3 is 10 times larger than
Turing NLG.

BERT (340M)
GPT-3 (175B)
BERT
GPT-3
(the smelly otaku himself)
https://arxiv.org/abs/2101.03961
Switch
Transformer
(1.6T)

Outline
BERT series GPT series

Self-supervised Learning
Supervised
𝑥
𝑦
label
Model
^
𝑦
𝑥
𝑥′
𝑥′′
Model
Self-
supervised
𝑦

Masking Input
BERT
台
MASK
Random
(special
token)
https://arxiv.org/abs/1810.04805
灣大學
Transformer
Encoder
Linear
學 0.1
灣 0.7
台 0.1
大 0.1
…… ……
(all characters)
=
=
or
Randomly masking
some tokens
一、天、大、小
…
softmax

Masking Input
BERT
台
MASK
Random
(special
token)
灣大學
Transformer
Encoder
Linear
=
=
or
Randomly masking
some tokens
一、天、大、小
…
softmax 灣
Ground
truth
minimize cross
entropy
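
The masking step and the cross-entropy objective on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BERT's actual implementation: the function names, the `mask_prob` value, and the 50/50 choice between the special token and a random token are all hypothetical. The toy logits are chosen so that the softmax reproduces the distribution shown on the slide (灣 gets 0.7).

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Randomly corrupt tokens; return (corrupted tokens, corrupted positions)."""
    rng = rng or random.Random(0)
    corrupted, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            # Replace with the special token or with a random token (assumed 50/50).
            corrupted[i] = MASK if rng.random() < 0.5 else rng.choice(vocab)
            positions.append(i)
    return corrupted, positions

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss for one masked position: -log of the probability of the ground truth."""
    return -math.log(probs[target_index])

vocab = ["學", "灣", "台", "大"]
tokens = ["台", "灣", "大", "學"]
corrupted, positions = mask_tokens(tokens, vocab, mask_prob=0.5)

# Toy logits (hypothetical) that give the slide's distribution 0.1 / 0.7 / 0.1 / 0.1.
logits = [math.log(p) for p in (0.1, 0.7, 0.1, 0.1)]
probs = softmax(logits)
loss = cross_entropy(probs, vocab.index("灣"))  # ground truth is 灣
```

Training minimizes this loss over all masked positions, which pushes the probability of the ground-truth token toward 1.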

Next Sentence Prediction
BERT
[SEP]
Yes/No
[CLS]
Linear
Robustly optimized BERT approach
(RoBERTa)
w1 w2
Sentence 1
w3 w4 w5
Sentence 2
• Next sentence prediction is not very helpful (see RoBERTa).
• SOP: Sentence order prediction
Used in ALBERT

• Masked token prediction
• Next sentence prediction
BERT
Self-supervised
Learning
Model for
Task 1
Downstream Tasks
Model for
Task 2
Model for
Task 3
• The tasks we care about
• We have only a small amount of labeled data.
Fine-tune
Pre-train

GLUE
• Corpus of Linguistic Acceptability (CoLA)
• Stanford Sentiment Treebank (SST-2)
• Microsoft Research Paraphrase Corpus (MRPC)
• Quora Question Pairs (QQP)
• Semantic Textual Similarity Benchmark (STS-B)
• Multi-Genre Natural Language Inference (MNLI)
• Question-answering NLI (QNLI)
• Recognizing Textual Entailment (RTE)
• Winograd NLI (WNLI)
General Language Understanding
Evaluation (GLUE)
https://gluebenchmark.com/
GLUE also has a Chinese version, CLUE (https://www.cluebenchmarks.com/)

BERT and its Family
• GLUE scores
Source of image: https://arxiv.org/abs/1905.00537

How to use BERT – Case 1
BERT
[CLS] w1 w2 w3
Linear
class Input: sequence
output: class
sentence
Example:
Sentiment analysis
Random
initialization
Init by pre-train
This is the model
to be learned.
this is good
positive
Better than random

Pre-train vs. Random Initialization
Source of image: https://arxiv.org/abs/1908.05620
(fine-
tune)
(scratch)

BERT
[CLS] w1 w2 w3
Linear
class
Input: sequence
output: same as input
sentence
Linear
class
Linear
class
I saw a saw
N V DET N
Example:
POS tagging

Input: two sequences
Output: a class
premise: A person on a horse
jumps over a broken down airplane
hypothesis: A person is at a diner. contradiction
Model
contradiction
entailment
neutral
Example:
Natural Language Inference (NLI)

Linear
w1 w2
BERT
[CLS] [SEP]
Class
Sentence 1 Sentence 2
w3 w4 w5
Input: two sequences
Output: a class
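
The two-sequence input shown here can be sketched as a small helper. This is only an illustration: `build_pair_input` is a hypothetical name, and the segment ids (0 for the first sentence, 1 for the second) are an assumption that does not appear on the slide.

```python
def build_pair_input(sentence1, sentence2, cls="[CLS]", sep="[SEP]"):
    """Concatenate two token lists as: [CLS] sentence1 [SEP] sentence2."""
    tokens = [cls] + sentence1 + [sep] + sentence2
    # Assumed convention: mark which sentence each position belongs to.
    segment_ids = [0] * (len(sentence1) + 2) + [1] * len(sentence2)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(["w1", "w2"], ["w3", "w4", "w5"])
# tokens is ['[CLS]', 'w1', 'w2', '[SEP]', 'w3', 'w4', 'w5']
```

Only the output vector at the [CLS] position is passed to the linear layer that predicts the class.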

• Extraction-based Question
Answering (QA)
𝐷={𝑑1,𝑑2 ,⋯ ,𝑑𝑁 }
𝑄={𝑞1 , 𝑞2 , ⋯ , 𝑞𝑀 }
QA
Model
output: two integers (𝑠, 𝑒)
𝐴={𝑑𝑠 , ⋯ ,𝑑𝑒 }
Document:
Query:
Answer:
𝐷
𝑄
𝑠
𝑒
17
77 79
𝑠=17 , 𝑒=17
𝑠=77 , 𝑒=79

q1 q2
BERT
[CLS] [SEP]
question document
d1 d2 d3
inner product
Softmax
0.5
0.3 0.2
s = 2
Random
Initialized

q1 q2
BERT
[CLS] [SEP]
question document
d1 d2 d3
inner product
Softmax
0.2
0.1 0.7
The answer is “d2 d3”.
s = 2 e = 3
Random
Initialized
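
The span selection on these two slides can be sketched as follows. Two randomly initialized vectors (one for the start, one for the end) are compared with each document token's output by inner product, and a softmax picks the position. The vectors below are hypothetical toy values chosen so that the result matches the slide (s = 2, e = 3); `predict_span` is an illustrative name, not a library function.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def predict_span(doc_vectors, start_vec, end_vec):
    """Return 1-based (s, e) positions within the document tokens."""
    start_probs = softmax([dot(start_vec, d) for d in doc_vectors])
    end_probs = softmax([dot(end_vec, d) for d in doc_vectors])
    return argmax(start_probs) + 1, argmax(end_probs) + 1

# Toy BERT outputs for d1, d2, d3 (hypothetical values).
doc_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
start_vec = [0.0, 1.0]  # learned from scratch during fine-tuning
end_vec = [1.0, 0.0]    # learned from scratch during fine-tuning

s, e = predict_span(doc_vectors, start_vec, end_vec)  # the answer is d_s ... d_e
```

This naive version can return e < s; a practical system would need to handle that case, which the slides do not cover.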

Training BERT is challenging!
GLUE scores
This work was done by 姜成翰.
Research result of an industry-academia collaboration project with Delta Electronics (台達電).
Our ALBERT-base
Google’s ALBERT-base
Google’s BERT-base
The training data contains more than 3 billion words,
about 3,000 times the size of the Harry Potter series.
8 days with TPU v3

BERT Embryology
When does BERT know POS tagging,
syntactic parsing, semantics?
The answer is counterintuitive!

Pre-training a seq2seq model
w1 w2 w3
w5 w6 w7
w4
Cross
Attention
w8
Decoder
Encoder
w1 w2 w3 w4
Reconstruct the input
Corrupted

MASS / BART
BART
A B [SEP] C D E
A B [SEP] C D E
A B [SEP] C E
C D E [SEP] A B
D E A B [SEP] C
A B [SEP] E
MASS
(Delete
“D”)
Text Infilling
(permutation)
(rotation)
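
The corruption schemes listed on this slide can be sketched as simple list operations. The function names are hypothetical and the indices are picked to reproduce the slide's examples; the pre-training task is then to reconstruct the original sequence from the corrupted one.

```python
def mask_token(tokens, index, mask="[MASK]"):
    """Replace one token with a mask symbol (MASS-style masking)."""
    return tokens[:index] + [mask] + tokens[index + 1:]

def delete_token(tokens, index):
    """Remove one token without leaving a placeholder."""
    return tokens[:index] + tokens[index + 1:]

def permute_sentences(tokens, sep="[SEP]"):
    """Swap the two sentences around the separator."""
    i = tokens.index(sep)
    return tokens[i + 1:] + [sep] + tokens[:i]

def rotate(tokens, start):
    """Rotate the sequence so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]

original = ["A", "B", "[SEP]", "C", "D", "E"]

masked = mask_token(original, 4)        # A B [SEP] C [MASK] E
deleted = delete_token(original, 4)     # A B [SEP] C E
permuted = permute_sentences(original)  # C D E [SEP] A B
rotated = rotate(original, 4)           # D E A B [SEP] C
```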

T5 – Comparison
• Text-to-Text Transfer Transformer (T5)
• Colossal Clean Crawled Corpus (C4)

Why does BERT work?
BERT
台灣大學
Represent the
meaning of “ 大”
魚
鳥
草
電
吃蘋果
蘋果手機
embedding
The tokens with similar meaning
have similar embedding.
Context is considered.

Why does BERT work?
BERT
喝蘋果汁
BERT
蘋果電腦
compute cosine similarity
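
The comparison on this slide amounts to computing the cosine similarity between the embeddings of the same character in different contexts. A minimal sketch follows; the vectors are hypothetical toy values, not real BERT outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings of "蘋" in three contexts.
apple_juice = [0.9, 0.1]     # 喝蘋果汁 (fruit)
eat_apple = [0.8, 0.2]       # 吃蘋果 (fruit)
apple_computer = [0.1, 0.9]  # 蘋果電腦 (company)

same_sense = cosine_similarity(apple_juice, eat_apple)
different_sense = cosine_similarity(apple_juice, apple_computer)
```

With contextualized embeddings, the same-sense pair is expected to score higher than the different-sense pair, as it does with these toy vectors.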

Why does BERT work?
John Rupert Firth
You shall know a word by
the company it keeps
BERT
w1 w2 w3 w4
w2
word
embedding
Contextualized
word embedding

Why does BERT work?
• Applying BERT to protein, DNA, music classification
This work was done by 高瑋聰.
EI CCAGCTGCATCACAGGAGGCCAGCG
EI AGACCCGCCGGGAGGCGGAGGACC
IE AACGTGGCCTCCTTGTGCCCTTCCCC
IE CCACTCAGCCAGGCCCTTCTTCTCCT
IE CCTGATCTGGGTCTCCCCTCCCACCCT
IE AGCCCTCAACCCTTCTGTCTCACCCTC
IE CCACTCAGCCAGGCCCTTCTTCTCCT
N CTGTGTTCACCACATCAAGCGCCGGG
N GTGTTACCGAGGGCATTTCTAACAGT
N TCTGAGCTCTGCATTTGTCTATTCTCC
class DNA sequence

A we
T you
C he
G she
BERT
[CLS]
Linear
class
DNA sequence
Random
initialization
Init by pre-train
pre-train on English
Why does BERT work?
A G A C
we we
she he
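
The mapping on this slide can be written as a small lookup. Each DNA base is replaced by an arbitrary English word so that a BERT model pre-trained on English can take the sequence as input. The function name is hypothetical; the mapping itself is the one shown on the slide.

```python
DNA_TO_WORD = {"A": "we", "T": "you", "C": "he", "G": "she"}

def dna_to_words(sequence):
    """Replace each base with its assigned English word."""
    return [DNA_TO_WORD[base] for base in sequence]

words = dna_to_words("AGAC")  # ['we', 'she', 'we', 'he']
```

The resulting word list is then fed to BERT with a [CLS] token, and a linear layer predicts the class of the DNA sequence.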

Why does BERT work?
• Applying BERT to protein, DNA, music classification

To Learn More ……
BERT (Part 1): https://youtu.be/1_gRK9EIQpc
BERT (Part 2): https://youtu.be/Bywo7m6ySlk

Multi-lingual BERT
Multi-BERT
深度學習
Training one BERT model on many different languages.
Multi-BERT
high est moun tain Mask
Mask

Zero-shot Reading Comprehension
Training on the sentences of 104 languages
Multi-BERT
Doc1
Query1
Ans1
Doc2
Query2
Ans2
Doc3
Query3
Ans3
Doc4
Query4
Ans4
Doc5
Query5
Ans5
Doc1
Query1
? Doc3
Query3
?
Doc2
Query2
?
Train on English QA
training examples
Test on Chinese
QA test

Zero-shot Reading Comprehension
• English: SQuAD, Chinese: DRCD
The F1 score of human performance is 93.30%.
Model   Pre-train       Fine-tune           Test      EM     F1
QANet   none            Chinese             Chinese   66.1   78.1
BERT    Chinese         Chinese             Chinese   82.0   89.1
BERT    104 languages   Chinese             Chinese   81.2   88.7
BERT    104 languages   English             Chinese   63.3   78.8
BERT    104 languages   Chinese + English   Chinese   82.6   90.1
This work was done by 劉記良 and 許宗嫄.

Cross-lingual Alignment?
Multi-BERT
深度學習
high est moun tain
魚
兔
跳
游
swim
jump
rabbit
fish

Source of slides: 許宗嫄's master's thesis defense slides
Mean Reciprocal Rank (MRR):
Higher MRR, better alignment
Google’s
Multi-BERT
Our Multi-BERT
200k sentences
for each lang
How about 1000k?

The training is also challenging …
Two days ……
(the whole training took one week)

Source of slides: 許宗嫄's master's thesis defense slides
Mean Reciprocal Rank (MRR):
Higher MRR, better alignment
Google’s Multi-
BERT
Our Multi-BERT
200k sentences
for each lang
Our Multi-BERT
1000k sentences
The amount of training data is critical for alignment.
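
Mean Reciprocal Rank can be sketched as follows, assuming the usual definition: for each query word, find the rank of its correct translation among all candidates, take the reciprocal, and average over queries. The ranks in the example are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct translation ranked 1st, 2nd, and 4th for three query words.
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 1/2 + 1/4) / 3
```

An MRR of 1 means the correct translation is always the nearest neighbor, which is why a higher MRR indicates better cross-lingual alignment.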

魚
兔
跳
游
swim
jump
rabbit
fish
Multi-BERT
深度學習
high est moun tain
Reconstruction
深度學習
high est moun tain
Weird???
If the embedding is
language independent …
How to correctly
reconstruct?
There must be language
information.

Multi-BERT
Reconstruction
那有一貓
Where is
Language?
Average of
Chinese
Average of
English
This work was done by 劉記良, 許宗嫄, and 莊永松.
there is a cat
+ + + +
魚
兔
跳
游
swim
jump
rabbit
fish

If this is true …
Average of
Chinese
Average of
English
This work was done by 劉記良, 許宗嫄, and 莊永松.
魚
兔
跳
游
swim
jump
rabbit
fish
x
Unsupervised token-level translation 
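
The idea on this slide can be sketched with toy vectors: subtract the average English embedding, add the average Chinese embedding, and look up the nearest Chinese token. All vectors and function names below are hypothetical, and the nearest-neighbor lookup by squared Euclidean distance is an assumption for illustration.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def shift(vector, source_mean, target_mean):
    """Move an embedding from the source-language region to the target one."""
    return [x - s + t for x, s, t in zip(vector, source_mean, target_mean)]

def nearest(vector, candidates):
    """Name of the candidate embedding closest to `vector`."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(vector, candidates[name]))
    return min(candidates, key=squared_distance)

# Toy embeddings (hypothetical): the two languages differ by a constant offset.
english = {"fish": [1.0, 0.0], "rabbit": [0.0, 1.0]}
chinese = {"魚": [3.0, 2.0], "兔": [2.0, 3.0]}

english_mean = mean_vector(list(english.values()))
chinese_mean = mean_vector(list(chinese.values()))

shifted = shift(english["fish"], english_mean, chinese_mean)
translation = nearest(shifted, chinese)  # token-level translation of "fish"
```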

Predict Next Token
<BOS> 台灣
台灣大
h1 h2 h3 h4
Model
? ? ? ?
大
學
Linear
Transform
softmax
Cross
entropy
wt+1
from wt h𝑡
Training
data: “ 台灣
大學”
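
The training setup on this slide can be sketched by listing the (prefix, target) pairs derived from one training sentence. At each step the model sees only the tokens so far and is trained with cross-entropy to predict the next one. The function names and the toy probabilities are hypothetical.

```python
import math

def next_token_pairs(tokens, bos="<BOS>"):
    """Return (prefix, target) pairs: the model sees the prefix, predicts the target."""
    sequence = [bos] + list(tokens)
    return [(sequence[:t + 1], sequence[t + 1]) for t in range(len(tokens))]

def cross_entropy(probs, target):
    """-log of the probability the model assigns to the correct next token."""
    return -math.log(probs[target])

pairs = next_token_pairs("台灣大學")
# (['<BOS>'], '台'), (['<BOS>', '台'], '灣'), ...

# Toy model output after the prefix "<BOS> 台" (hypothetical probabilities).
predicted = {"灣": 0.6, "大": 0.3, "學": 0.1}
loss = cross_entropy(predicted, "灣")
```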

Predict Next Token
They can do generation.
https://talktotransformer.com/

How to use GPT?
Description
A few examples

“Few-
shot”
Learning
“One-
shot”
Learning
“Zero-
shot”
Learning
(no gradient
descent)
“In-context” Learning
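
The three settings differ only in how many examples are placed in the prompt, which can be sketched as plain string construction. The `build_prompt` helper, the `=>` separator, and the translation task are assumptions for illustration; no model parameters are updated in any of the three settings.

```python
def build_prompt(description, examples, query):
    """Task description, then zero or more solved examples, then the query."""
    lines = [description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

description = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

few_shot = build_prompt(description, examples, "plush giraffe")
one_shot = build_prompt(description, examples[:1], "plush giraffe")
zero_shot = build_prompt(description, [], "plush giraffe")
```

The model is then asked to continue the prompt, and its continuation is taken as the answer.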

To learn more ……
https://youtu.be/DOG1L9lvsDY

Beyond Text
Data Centric Prediction
Position, 2015
Jigsaw, 2017
Rotation, 2018 Cutout, 2015
RNNLM, 1997
word2v, 2013 audio2v, 2019
BERT, 2018 Mock, 2020
TERA, 2020
APC, 2019
NLP Speech CV
Contrastive
InfoNCE,
2017
CPC, 2019
MoCo, 2019
SimCLR, 2020
MoCov2, 2020
BYOL, 2020
SimSiam, 2020
This slide was provided by 劉廷緯.

Image - SimCLR
https://arxiv.org/abs/2002.05709
https://github.com/google-research/simclr

Image- BYOL
Bootstrap your own latent:
A new approach to self-supervised Learning

Speech
Audio version
BERT
深度學習

Speech GLUE- SUPERB
• Speech processing Universal PERformance
Benchmark
• Will be available soon
• Downstream: Benchmark with 10+ tasks
• The models need to know how to process
content, speaker, emotion, and even semantics.
• Toolkit: A flexible and modularized framework for
self-supervised speech models.
• https://github.com/s3prl/s3prl

https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning

Predict Next Token
They can do generation.

I forced a bot to watch over 1,000 hours of XXX
This is a meme: a human imitating a machine imitating a human!
