
Cutting-edge Deep Learning for NLP


Transformers

Isabel Segura-Bedmar
8 June 2022
Máster Universitario en Inteligencia Artificial, UPM
Convolutional Neural Network

Recurrent Neural Network (bidirectional)

[Figure: a bidirectional RNN for sentiment classification, producing outputs y1, y2, y3 labelled positive / neutral / negative]

Saroufim, C., Almatarky, A., & Hady, M. A. (2018, October). Language independent sentiment analysis with sentiment-specific word embeddings. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 14-23).
Pretrained word embeddings (word2vec)

source: Sentiment analysis on Twitter data with semi-supervised Doc2Vec


Advantages of pre-trained (static) word embeddings
- They capture the semantic relationships between words.
- The vectors used to represent texts have a much lower dimension than in BoW models.
- Training time is much shorter than with random initialization (the network has fewer parameters to learn!).
Disadvantages of pre-trained (static) word embeddings
- They fail to capture polysemy: ‘bank’ always has the same embedding:
  - I authorize my bank to debit the amount from my bank account.
  - I was walking along the right bank of the river.
Solution: Contextualized word embeddings
- Capture word semantics in different contexts to address the issue of polysemy.
- The word representation (embedding) depends on the context where the word occurs, meaning that the same word in different contexts can have different representations.
Transformers are language models
● All the Transformer models have been trained as language models.
● They are trained on large amounts of raw text in a self-supervised fashion.
● Some examples:
  a. predicting the next word in a sentence, having read the n previous words (causal language modelling)
  b. predicting a masked word in a sentence (masked language modelling)
Causal language modelling

● GPT-like (auto-regressive) models

Masked language modelling

● BERT-like (auto-encoding, masked LM) models
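A minimal sketch of both objectives using the Hugging Face transformers pipelines; the checkpoints gpt2 and bert-base-uncased are only illustrative choices, any compatible model would do.

from transformers import pipeline

# Causal LM (GPT-like): continue a prefix by predicting the next words.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture is", max_new_tokens=10))

# Masked LM (BERT-like): predict the word hidden behind [MASK] from its context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("I was walking along the right [MASK] of the river."))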


Transformer

● Attention mechanisms:
  ○ self-attention
  ○ multi-head attention
● They gather information about the relevant context of a given word, and then encode that context in a rich vector that smartly represents the word.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In NIPS.
Attention layers

Attention layers tell the model to pay specific attention to certain words in the input sentence it is given (and more or less ignore the others).
Basic idea of self-attention

source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention

Basic idea of self-attention
● Computes the interaction of each input word with all the other input words.
● The attention weights can be computed in parallel, for all tokens at once.
● Because every token also sees all the other inputs, self-attention can easily preserve long-term dependencies in long text sequences.
● Therefore, self-attention can completely replace RNNs.

source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention
How to implement self-attention

[Figure: attention weights]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://medium.com/analytics-vidhya/masking-in-transformers-self-attention-mechanism-bad3c9ec235c
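To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention in the spirit of Vaswani et al. (2017); the random matrices Wq, Wk and Wv stand in for the learned projection parameters.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Project the inputs into queries, keys and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # interaction of every token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                         # each output is a weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                   # 5 tokens, model dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                     # (5, 16): one contextualized vector per token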
Multi-head attention

The attention module repeats its computations multiple times in parallel. Each of these is called an attention head.

https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
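As an illustration, PyTorch ships a ready-made multi-head attention module; the sizes below (embedding dimension 768, 12 heads) mirror BERT-base and are used here only as an example.

import torch

mha = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(1, 9, 768)            # a batch with one sentence of 9 token embeddings
out, attn_weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)  # (1, 9, 768) and (1, 9, 9) attention weights (averaged over heads)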
BERT: Bidirectional Encoder Representations from Transformers

The BERT architecture is a stack of encoders.

BERT-base: 12 layers, embedding size = 768
BERT-large: 24 layers, embedding size = 1024
Maximum input size: 512 tokens

source: http://jalammar.github.io/illustrated-bert/
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
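These sizes can be checked directly from the model configuration, assuming the standard Hugging Face checkpoints bert-base-uncased and bert-large-uncased.

from transformers import AutoConfig

base = AutoConfig.from_pretrained("bert-base-uncased")
large = AutoConfig.from_pretrained("bert-large-uncased")
print(base.num_hidden_layers, base.hidden_size)      # 12 768
print(large.num_hidden_layers, large.hidden_size)    # 24 1024
print(base.max_position_embeddings)                  # 512 (maximum input length)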
Training BERT

BERT was trained on Wikipedia (2,500 million words) and BookCorpus (800 million words).

https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
Pre-Training BERT
● To train BERT, two approaches are used simultaneously:
○ Masked Language Model (MLM)
○ Next Sentence Prediction (NSP)

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
Masked Language Model (MLM)
- The model is fed with a sentence in which 15% of the words are masked.
- Then, BERT has to predict the masked words correctly given the context of the unmasked words.

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
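A minimal sketch of how this masking is typically prepared with the transformers library: the data collator randomly selects about 15% of the tokens as prediction targets (most of them are replaced with [MASK]). The example sentence is only illustrative.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("My dog is cute. He likes playing.")
batch = collator([encoding])                     # applies random masking and builds the labels
print(tokenizer.decode(batch["input_ids"][0]))   # typically some tokens now appear as [MASK]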
Next Sentence Prediction (NSP)
● The model is fed with 2 sentences and has to predict whether the second sentence follows the first one in the original document.
● Training dataset: in 50% of the inputs, the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.

https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
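A minimal sketch of the NSP head, assuming the standard bert-base-uncased checkpoint; the two sentences are only illustrative.

from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("My dog is cute.", "He likes playing.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.softmax(dim=-1))   # index 0: probability that sentence B follows A; index 1: B is a random sentence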
BERT input

[CLS] marks the beginning of an input sentence; [SEP] marks the separation/end of sentences.

Input: “My dog is cute. He likes playing” => [‘[CLS]’, ‘My’, ‘dog’, ‘is’, ‘cute’, ‘[SEP]’, ‘He’, ‘likes’, ‘play’, ‘##ing’, ‘[SEP]’]

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
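This preprocessing can be reproduced with a WordPiece tokenizer from the transformers library; the exact subword splits (e.g. whether ‘playing’ becomes ‘play’, ‘##ing’) depend on the checkpoint's vocabulary, so the printed output is only indicative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("My dog is cute.", "He likes playing")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'My', 'dog', 'is', 'cute', '.', '[SEP]', 'He', 'likes', 'play', '##ing', '[SEP]']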
BERT input - token embeddings
● A token embedding (a numerical vector) is generated for each token in the input sentence.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - segment embeddings
● Segment embeddings help to distinguish between the different sentences in a single input. For the input [‘[CLS]’, ‘my’, ‘dog’, ‘is’, ‘cute’, ‘[SEP]’, ‘he’, ‘likes’, ‘play’, ‘##ing’, ‘[SEP]’], the segment embeddings will be [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1].

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - position embeddings
● Position embeddings are generated internally by BERT and give the input data a sense of order.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - mask embeddings
● We also generate a mask (attention mask) of size 512, in which the positions corresponding to real words have 1s and the positions corresponding to padding have 0s.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
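A minimal sketch of the segment ids and the mask produced by a transformers tokenizer when padding to the maximum input length of 512; the sentence pair is only illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("my dog is cute", "he likes playing",
                     padding="max_length", max_length=512)

print(encoding["token_type_ids"][:12])   # segment embeddings: 0s for the first sentence, 1s for the second
print(encoding["attention_mask"][:12])   # mask: 1s for real tokens, 0s for padding
print(len(encoding["input_ids"]))        # 512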
Fine-tuning
- To perform fine-tuning, you first take a pretrained language model and then perform additional training with a dataset specific to your task.
- Since the pretrained model was already trained on lots of data, fine-tuning requires far less data to get decent results (the amount of time and resources needed to get good results is also much lower).
Fine-tuning for a specific task

Output dimension = 768 (base), 1024 (large)

● Text classification: spam detection, fake news detection, hate speech detection, sentiment analysis, ...
● Text similarity
● Sequence labelling tasks: NER, PoS tagging
● QA
● Text summarization
● Machine translation

https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
How to fine-tune a pre-trained model?
● The entire pre-trained BERT model and the additional layers are trained on a specific task.
● For example, if the task is text classification, we can add a linear softmax layer on top of BERT for predicting the class label of the input text.
● In a NER task, we could add a CRF layer on top of BERT.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
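A minimal fine-tuning sketch in this spirit: a pretrained BERT encoder plus a linear classification head trained end-to-end. The two labels, the toy examples and the learning rate are placeholder choices; a real setup would iterate over a task-specific dataset.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["free money, click here!", "see you at the meeting"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])              # 1 = spam, 0 = not spam (toy labels)

outputs = model(**batch, labels=labels)    # loss computed by the added classification head
outputs.loss.backward()                    # gradients flow through the head and the whole encoder
optimizer.step()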
Later models based on BERT

● BETO (Spanish)
● RoBERTa
● DistilBERT
● XLM / mBERT
● ALBERT

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Conclusions
● Proper language models are key for developing NLP systems.
● Recurrent Neural Networks require a high computation cost to process long sequences.
● Transformers, based on attention mechanisms, can replace RNNs.
● BERT is a contextual language model capable of correctly representing polysemous words.
● Fine-tuning a BERT model for a specific task is relatively inexpensive.
● BERT is fully bidirectional and obtains state-of-the-art results in many NLP tasks.

Thank you
Question time!!!

[email protected]
https://hulat.inf.uc3m.es/nosotros/miembros/isegura
https://github.com/isegura
