
Cutting-edge Deep Learning for NLP


Transformers

Isabel Segura-Bedmar
8 June 2022
Máster Universitario en Inteligencia Artificial, UPM
Convolutional Neural Network

Recurrent Neural Network (bidirectional)

[Figure: a bidirectional RNN for sentiment classification, producing outputs y1, y2, y3 labelled positive / neutral / negative]

Saroufim, C., Almatarky, A., & Hady, M. A. (2018, October). Language independent sentiment analysis with sentiment-specific word embeddings. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 14-23).
Pretrained word embeddings (word2vec)

source: Sentiment analysis on Twitter data with semi-supervised Doc2Vec


Advantages of pre-trained (static) word embeddings
- They capture the semantic relationships between words.
- The vectors used to represent texts have a much lower dimension than in BoW models.
- Training time is much shorter than with random initialization (the network has fewer parameters to learn!).
Disadvantages of pre-trained (static) word embeddings
- They fail to capture polysemy: ‘bank’ always has the same embedding:
  - I authorize my bank to debit the amount from my bank account.
  - I was walking along the right bank of the river.
Solution: Contextualized word embeddings
- Capture word semantics in different contexts to address the issue of polysemy.
- The word representation (embedding) depends on the context where the word occurs, meaning that the same word in different contexts can have different representations.
Transformers are language models
● All the Transformer models have been trained as language models.
● They are trained on large amounts of raw text in a self-supervised fashion.
● Some examples:
  a. predicting the next word in a sentence, having read the n previous words (causal language modelling)
  b. predicting a masked word in a sentence (masked language modelling)
Causal language modelling

● GPT-like (auto-regressive) models

Masked language modelling

● BERT-like (auto-encoding, masked LM) models
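A minimal sketch of both objectives using the Hugging Face transformers pipelines; the checkpoints gpt2 and bert-base-uncased are only illustrative choices, any compatible model would do.

from transformers import pipeline

# Causal LM (GPT-like): continue a prefix by predicting the next words.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture is", max_new_tokens=10))

# Masked LM (BERT-like): predict the word hidden behind [MASK] from its context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("I was walking along the right [MASK] of the river."))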


Transformer

● Attention mechanisms:
  ○ self-attention
  ○ multi-head attention
● They gather information about the relevant context of a given word, and then encode that context in a rich vector that smartly represents the word.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In NIPS.
Attention layers

Attention layers tell the model to pay specific attention to certain words in the input sentence it is given (and more or less ignore the others).
Basic idea of self-attention

source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention

Basic idea of self-attention
● Computes the interaction of each input word with all the other input words.
● The attention weights can be computed in parallel, for all tokens at once.
● Because every token also sees all the other inputs, self-attention can easily preserve long-term dependencies in long text sequences.
● Therefore, self-attention can completely replace RNNs.

source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention
How to implement self-attention

[Figure: attention weights]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://medium.com/analytics-vidhya/masking-in-transformers-self-attention-mechanism-bad3c9ec235c
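To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention in the spirit of Vaswani et al. (2017); the random matrices Wq, Wk and Wv stand in for the learned projection parameters.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Project the inputs into queries, keys and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # interaction of every token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                         # each output is a weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                   # 5 tokens, model dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                     # (5, 16): one contextualized vector per token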
Multi-head attention

The attention module repeats its computations multiple times in parallel. Each of these is called an attention head.

https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
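As an illustration, PyTorch ships a ready-made multi-head attention module; the sizes below (embedding dimension 768, 12 heads) mirror BERT-base and are used here only as an example.

import torch

mha = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(1, 9, 768)            # a batch with one sentence of 9 token embeddings
out, attn_weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)  # (1, 9, 768) and (1, 9, 9) attention weights (averaged over heads)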
BERT: Bidirectional Encoder Representations from Transformers

The BERT architecture is a stack of encoders.

BERT-base: 12 layers, embedding size = 768
BERT-large: 24 layers, embedding size = 1024
Maximum input size: 512 tokens

source: http://jalammar.github.io/illustrated-bert/
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
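These sizes can be checked directly from the model configuration, assuming the standard Hugging Face checkpoints bert-base-uncased and bert-large-uncased.

from transformers import AutoConfig

base = AutoConfig.from_pretrained("bert-base-uncased")
large = AutoConfig.from_pretrained("bert-large-uncased")
print(base.num_hidden_layers, base.hidden_size)      # 12 768
print(large.num_hidden_layers, large.hidden_size)    # 24 1024
print(base.max_position_embeddings)                  # 512 (maximum input length)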
Training BERT

BERT was trained on Wikipedia (2,500 million words) and BookCorpus (800 million words).

https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
Pre-Training BERT
● To train BERT, two approaches are used simultaneously:
○ Masked Language Model (MLM)
○ Next Sentence Prediction (NSP)

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
Masked Language Model (MLM)
- The model is fed with a sentence in which 15% of the words are masked.
- Then, BERT has to predict the masked words correctly given the context of the unmasked words.

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
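A minimal sketch of how this masking is typically prepared with the transformers library: the data collator randomly selects about 15% of the tokens as prediction targets (most of them are replaced with [MASK]). The example sentence is only illustrative.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("My dog is cute. He likes playing.")
batch = collator([encoding])                     # applies random masking and builds the labels
print(tokenizer.decode(batch["input_ids"][0]))   # typically some tokens now appear as [MASK]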
Next Sentence Prediction (NSP)
● The model is fed with 2 sentences and has to predict whether the second sentence follows the first one in the original document.
● Training dataset: in 50% of the inputs, the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.

https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
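A minimal sketch of the NSP head, assuming the standard bert-base-uncased checkpoint; the two sentences are only illustrative.

from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("My dog is cute.", "He likes playing.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.softmax(dim=-1))   # index 0: probability that sentence B follows A; index 1: B is a random sentence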
BERT input

[CLS] marks the beginning of an input sentence; [SEP] marks the separation/end of sentences.

Input: “My dog is cute. He likes playing” => [‘[CLS]’, ‘My’, ‘dog’, ‘is’, ‘cute’, ‘[SEP]’, ‘He’, ‘likes’, ‘play’, ‘##ing’, ‘[SEP]’]

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
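This preprocessing can be reproduced with a WordPiece tokenizer from the transformers library; the exact subword splits (e.g. whether ‘playing’ becomes ‘play’, ‘##ing’) depend on the checkpoint's vocabulary, so the printed output is only indicative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("My dog is cute.", "He likes playing")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'My', 'dog', 'is', 'cute', '.', '[SEP]', 'He', 'likes', 'play', '##ing', '[SEP]']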
BERT input - token embeddings
● A token embedding (a numerical vector) is generated for each token in the input sentence.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - segment embeddings
● Segment embeddings help to distinguish between the different sentences in a single input. For the input [‘[CLS]’, ‘my’, ‘dog’, ‘is’, ‘cute’, ‘[SEP]’, ‘he’, ‘likes’, ‘play’, ‘##ing’, ‘[SEP]’], the segment embeddings will be [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1].

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - position embeddings
● Position embeddings are generated internally by BERT and give the input data a sense of order.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - mask embeddings
● We also generate a mask (attention mask) of size 512, in which the positions corresponding to real words have 1s and the positions corresponding to padding have 0s.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
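A minimal sketch of the segment ids and the mask produced by a transformers tokenizer when padding to the maximum input length of 512; the sentence pair is only illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("my dog is cute", "he likes playing",
                     padding="max_length", max_length=512)

print(encoding["token_type_ids"][:12])   # segment embeddings: 0s for the first sentence, 1s for the second
print(encoding["attention_mask"][:12])   # mask: 1s for real tokens, 0s for padding
print(len(encoding["input_ids"]))        # 512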
Fine-tuning
- To perform fine-tuning, you first take a pretrained language model and then perform additional training with a dataset specific to your task.
- Since the pretrained model was already trained on lots of data, fine-tuning requires far less data to get decent results (the amount of time and resources needed to get good results is also much lower).
Fine-tuning for a specific task

Output dimension = 768 (base), 1024 (large)

● Text classification: spam detection, fake news detection, hate speech detection, sentiment analysis, ...
● Text similarity
● Sequence labelling tasks: NER, PoS tagging
● QA
● Text summarization
● Machine translation

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
How to fine-tune a pre-trained model?
● The entire pre-trained BERT model and the additional layers are trained on a specific task.
● For example, if the task is text classification, we can add a linear softmax layer on top of BERT for predicting the class label of the input text.
● In a NER task, we could add a CRF layer on top of BERT.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
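A minimal fine-tuning sketch in this spirit: a pretrained BERT encoder plus a linear classification head trained end-to-end. The two labels, the toy examples and the learning rate are placeholder choices; a real setup would iterate over a task-specific dataset.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["free money, click here!", "see you at the meeting"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])              # 1 = spam, 0 = not spam (toy labels)

outputs = model(**batch, labels=labels)    # loss computed by the added classification head
outputs.loss.backward()                    # gradients flow through the head and the whole encoder
optimizer.step()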
Later models based on BERT

● BETO (Spanish)
● RoBERTa
● DistilBERT
● XLM / mBERT
● ALBERT

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Conclusions
● Proper language models are key for developing NLP systems.
● Recurrent Neural Networks require a high computation cost to process long sequences.
● Transformers, based on attention mechanisms, can replace RNNs.
● BERT is a contextual language model capable of correctly representing polysemous words.
● Fine-tuning a BERT model for a specific task is relatively inexpensive.
● BERT is fully bidirectional and obtains state-of-the-art results in many NLP tasks.

Thank you
Question time!!!

[email protected]
https://hulat.inf.uc3m.es/nosotros/miembros/isegura
https://github.com/isegura
