Transformers MUIA
Transformers
Isabel Segura-Bedmar
8 June 2022
Máster Universitario en Inteligencia Artificial, UPM
Convolutional Neural Network
Recurrent Neural Network (bidirectional)
[Figure: a bidirectional RNN whose outputs y1, y2, y3 are mapped to the sentiment labels positive / neutral / negative]
Saroufim, C., Almatarky, A., & Hady, M. A. (2018, October). Language independent sentiment analysis with sentiment-specific word embeddings. In
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 14-23).
Pretrained word embeddings (word2vec)
● GPT-like (auto-regressive)
● BERT-like: masked language modelling
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017).
Attention layers
source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention
Basic idea of self-attention
● Computes the interaction between each input word and all the other input words.
● The attention weights for all tokens can be computed in parallel, at once.
● Because each token also sees all the other inputs, long-term dependencies in long text sequences are easily preserved.
● Therefore, self-attention could completely replace RNNs (see the sketch below).
source: https://www.kaggle.com/residentmario/transformer-architecture-self-attention
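The following is a minimal sketch of scaled dot-product self-attention in NumPy (not code from the slides); the toy dimensions (4 tokens, embedding size 8) and the random projection matrices are illustrative assumptions, not BERT's actual weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings -> contextual vectors of the same shape."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # interaction of every token with every other token
    weights = softmax(scores, axis=-1)         # attention weights, computed in parallel for all tokens
    return weights @ V                         # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                            # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                         # (4, 8)
```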
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017).
[Figure: attention weights between the tokens of an example sentence]
https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://medium.com/analytics-vidhya/masking-in-transformers-self-attention-mechanism-bad3c9ec235c
Multi-head attention
The attention module repeats its computations multiple times in parallel; each of these parallel copies is called an attention head.
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
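As a small illustration (not part of the slides), PyTorch's built-in nn.MultiheadAttention runs several heads in parallel; the toy sizes below (batch of 2, 4 tokens, embedding size 8, 2 heads) are assumptions for the example.

```python
import torch
import torch.nn as nn

# Two attention heads, each attending over the same sequence in parallel.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(2, 4, 8)        # (batch, seq_len, d_model) toy token embeddings
out, weights = mha(x, x, x)     # self-attention: queries = keys = values = x
print(out.shape)                # torch.Size([2, 4, 8])  contextual vectors
print(weights.shape)            # torch.Size([2, 4, 4])  attention weights (averaged over heads)
```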
BERT: Bidirectional Encoder Representations from Transformers
BERT-base = 12 layers, embedding size = 768
BERT-large = 24 layers, embedding size = 1024
source: http://jalammar.github.io/illustrated-bert/
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
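These sizes can be checked with the Hugging Face transformers library (an illustrative snippet, not from the slides; the checkpoint names are the standard published ones).

```python
from transformers import BertConfig

# BERT-base: 12 encoder layers, hidden (embedding) size 768.
base = BertConfig.from_pretrained("bert-base-uncased")
print(base.num_hidden_layers, base.hidden_size)     # 12 768

# BERT-large: 24 encoder layers, hidden size 1024.
large = BertConfig.from_pretrained("bert-large-uncased")
print(large.num_hidden_layers, large.hidden_size)   # 24 1024
```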
Training BERT
BERT was trained on Wikipedia (2,500 million words) and BookCorpus (800 million words).
https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
Pre-Training BERT
● To train BERT, two approaches are used simultaneously:
○ Masked Language Model (MLM)
○ Next Sentence Prediction (NSP)
https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
Masked Language Model (MLM)
- The model is fed a sentence in which 15% of the tokens are masked.
- BERT then has to predict the masked tokens correctly, given the context of the unmasked ones (see the sketch below).
https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
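A minimal sketch of MLM at inference time, using the Hugging Face fill-mask pipeline; the bert-base-uncased checkpoint and the example sentence are assumptions for illustration.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from the unmasked context.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```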
Next Sentence Prediction (NSP)
● The model is fed 2 sentences and has to predict whether the second sentence follows the first in the original document (see the sketch below).
● Training dataset: in 50% of the inputs the second sentence really is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad
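A small sketch of NSP with the Hugging Face transformers library; the bert-base-uncased checkpoint and the sentence pair are assumptions for illustration.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Sentence A followed by a candidate sentence B.
encoding = tokenizer("My dog is cute.", "He likes playing.", return_tensors="pt")
logits = model(**encoding).logits

# Index 0 = "B follows A" (IsNext), index 1 = "B is a random sentence" (NotNext).
print(torch.softmax(logits, dim=-1))
```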
BERT input
[CLS] marks the beginning of an input sentence; [SEP] marks the separation/end of sentences.
Input: "My dog is cute. He likes playing" => ['[CLS]', 'My', 'dog', 'is', 'cute', '[SEP]', 'He', 'likes', 'play', '##ing', '[SEP]']
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
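The special tokens and the WordPiece splitting can be inspected with the Hugging Face tokenizer (a sketch; the exact sub-tokens depend on the checkpoint's vocabulary, here assumed to be bert-base-uncased).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences are encoded as: [CLS] sentence A [SEP] sentence B [SEP]
encoding = tokenizer("My dog is cute", "He likes playing")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
```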
BERT input - token embeddings
● The embedding layer generates a token embedding (a numerical vector) for each token in the input sentence.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
BERT input - segment embeddings
● Segment embeddings help to distinguish between the different sentences in a single input. For the input ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '##ing', '[SEP]'], the segment embeddings will be [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1].
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
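In the Hugging Face implementation the segment ids are returned as token_type_ids (a sketch under the same bert-base-uncased assumption as above).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("My dog is cute", "He likes playing")
# 0 for [CLS] and the first sentence (up to its [SEP]), 1 for the second sentence.
print(encoding["token_type_ids"])   # e.g. [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```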
BERT input - position embeddings
● Position embeddings are generated internally by BERT and give the input data a sense of token order.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
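These learned position vectors can be inspected on a pretrained model (an illustrative check with the transformers library, assuming bert-base-uncased).

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# One learned vector per position, up to the maximum sequence length of 512.
print(model.embeddings.position_embeddings)   # Embedding(512, 768)
```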
BERT input - mask embeddings
● We also generate a mask embedding (attention mask) of size 512, in which the positions corresponding to real tokens have 1s and the positions corresponding to padding have 0s.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
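A sketch of what the mask looks like when padding to a fixed length with the Hugging Face tokenizer; the length 12 is a toy assumption (BERT's maximum is 512).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Real tokens get 1, padding positions get 0.
encoding = tokenizer("My dog is cute", padding="max_length", max_length=12)
print(encoding["attention_mask"])   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```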
Fine-tuning
- To perform fine-tuning, you first acquire a pretrained language model and then perform additional training with a dataset specific to your task.
- Since the pretrained model was already trained on lots of data, fine-tuning requires far less data to get decent results (the amount of time and resources needed to get good results is much lower).
Fine-tuning for a specific task
The BERT output representation (dimension 768 for BERT-base, 1024 for BERT-large) is fed to a task-specific layer (illustrated below).
Example tasks: text classification (e.g. spam detection), text similarity detection, fake news detection, hate speech detection, sentiment analysis, ...
https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef
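As an illustration of the representation that the task-specific layer receives (a sketch with the transformers library, assuming bert-base-uncased; the input sentence is made up).

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden state of the [CLS] token: size 768 for BERT-base (1024 for BERT-large).
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)   # torch.Size([1, 768])
```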
How to fine-tune a pre-trained model?
● The entire pre-trained BERT model and the additional layers are trained on a specific task.
● For example, if the task is text classification, we can add a linear softmax layer on top of BERT to predict the class label of the input text (see the sketch below).
● In a NER task, we could add a CRF layer on top of BERT.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
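A minimal sketch of this setup for text classification with the Hugging Face transformers library; the spam example, the toy labels, and the learning rate are assumptions, and real fine-tuning would iterate over a full labelled dataset.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialised linear classification layer on top of pretrained BERT.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["Win a free prize now!", "See you at the meeting."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])             # toy labels: 1 = spam, 0 = not spam

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # cross-entropy loss over the class logits
outputs.loss.backward()                   # gradients flow into BERT and the new layer
optimizer.step()                          # one training step of the whole model
```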
Later models based on BERT
● BETO (Spanish)
● RoBERTa
● DistilBERT
● XLM / mBERT
● ALBERT
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Conclusions
● Proper language models are key for developing NLP systems.
● Recurrent neural networks require a high computational cost to process long sequences.
● Transformers, based on attention mechanisms, can replace RNNs.
● BERT is a contextual language model capable of correctly representing polysemous words.
● Fine-tuning a BERT model for a specific task is relatively inexpensive.
● BERT is fully bidirectional and obtains state-of-the-art results in many NLP tasks.
Thank you
Question time!!!
[email protected]
https://hulat.inf.uc3m.es/nosotros/miembros/isegura
https://github.com/isegura