DA 5330 – Advanced Machine Learning Applications
Lecture 10 – Transformers
Maninda Edirisooriya
manindaw@uom.lk
Limitations of RNN Models
• Slow computation for longer sequences, as the computation cannot
be done in parallel due to the dependencies between timesteps
• As the number of timesteps grows, the backpropagation depth
increases, which worsens the Vanishing Gradient and Exploding
Gradient problems
• As information from the history is passed as a hidden state vector, the
amount of information is limited by the size of that vector
• As the information passed from the history gets updated at each time
step, the history is forgotten after a number of time steps
Attention-based Models
• Instead of processing all the time steps with the same weight, models
that give certain time steps an exponentially higher weight while
processing any given time step performed significantly better; these
are known as Attention Models
• Though Attention Models were significantly better, their processing
requirement (complexity) was quadratic (i.e. proportional to the
square of the number of time steps), which was an extra slowdown
• However, the paper “Attention Is All You Need” by Vaswani et al.
(2017) proposed that RNN units can be replaced with a higher
performance mechanism, keeping only the “Attention”
• This model is known as a Transformer Model
Transformer Model Architecture
(Architecture diagram: Encoder and Decoder)
Transformer Model
• The original paper defined this model (with both Encoder and Decoder) for the
application of Natural Language Translation
• However, later models used the Encoder and the Decoder independently for
different tasks
Source: https://pub.aimind.so/unraveling-the-power-of-language-models-understanding-llms-and-transformer-variants-71bfc42e0b21
Encoder Only (Autoencoding) Models
• Only the Encoder of the Transformer is used
• Pre-trained as Masked Language Models
• Some random tokens of the input sequence are masked
• The model tries to predict the missing (masked) tokens to reconstruct the
original sequence
• This process learns the Bidirectional Context of the tokens in a sequence
(the probabilities of appearing around certain tokens, to both the left and the right)
• Used in applications like Sentence Classification for Sentiment
Analysis, and token-level operations like Named Entity Recognition
• BERT and RoBERTa are some examples
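As an aside (not from the original slides), masked-token prediction with a pre-trained autoencoding model can be tried in a few lines using the Hugging Face transformers library; a minimal sketch, assuming the library is installed and using bert-base-uncased as an illustrative model:

from transformers import pipeline

# Fill in a masked token with BERT, which uses context from BOTH sides
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Transformers process all tokens in [MASK]."):
    print(candidate["token_str"], candidate["score"])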
Decoder Only (Autoregressive) Models
• Only the Decoder of the Transformer is used
• Pre-trained as Causal Language Models
• The last token of the input sequence is masked
• The model tries to predict that last token to reconstruct the original sequence
• Also known as a Full Language Model
• This process learns the Unidirectional Context of the tokens in a sequence
(the probability of being the next token, given the tokens to the left)
• Used in applications like Text Generation
• GPT and BLOOM are some examples
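Similarly, a minimal sketch of autoregressive generation with a pre-trained decoder-only model (again assuming the Hugging Face transformers library; gpt2 is an illustrative choice):

from transformers import pipeline

# Generate text left-to-right: each new token is predicted from the
# tokens to its left only (Unidirectional Context)
generator = pipeline("text-generation", model="gpt2")
print(generator("A Transformer is", max_new_tokens=20)[0]["generated_text"])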
Encoder Decoder (Sequence-to-Sequence) Models
• Use both the Encoder and the Decoder of the Transformer
• The pre-training objective may depend on the requirement. In the T5 model,
• In the Encoder, some random tokens of the input sequence are masked with a
unique placeholder token, added to the vocabulary, known as a Sentinel token
• This process is known as Span Corruption
• The Decoder tries to predict the missing (masked) tokens to reconstruct the
original sequence, replacing the Sentinel tokens, with auto-regression
• Used in applications like Translation, Summarization and Question-answering
• T5 and BART are some examples
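To make Span Corruption concrete, here is the input/target format used by T5 (example adapted from the T5 paper; <extra_id_0>, <extra_id_1>, … are T5's Sentinel tokens):

# Span Corruption: masked spans become Sentinel tokens in the Encoder
# input, and the Decoder target lists the dropped spans after each one
original       = "Thank you for inviting me to your party last week"
encoder_input  = "Thank you <extra_id_0> me to your party <extra_id_1> week"
decoder_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"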
Encoder – Input and Embedding
• The input is a sequence of tokens (words, in the case of
Natural Language Processing (NLP))
• Each input token is converted to a vector using an Input
Embedding (a Word Embedding in the case of NLP)
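A minimal sketch of the embedding lookup (the vocabulary size is illustrative; d_model = 512 is the value used in the original paper, and the random matrix stands in for a learned one):

import numpy as np

vocab_size, d_model = 10000, 512
embedding = np.random.randn(vocab_size, d_model)  # learned lookup table in practice

token_ids = np.array([17, 2301, 5])  # the tokenized input sequence
x = embedding[token_ids]             # one d_model-sized vector per token: (3, 512)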
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Positional Encoding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
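The figure on this slide is not reproduced here; for reference, the original paper defines the positional encoding as sinusoids of different frequencies added to the input embeddings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

x = np.random.randn(6, 512)              # token embeddings for 6 tokens
x = x + positional_encoding(6, 512)      # position information is ADDED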
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
• Multi-Head Attention applies multiple similar
operations, each known as a Single-Head Attention,
or simply Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
• The type of attention used here is known as Self
Attention, where each token attends to all the
tokens in the input sequence
• For the Encoder we take Q = K = V = X
Self Attention
Source: https://jalammar.github.io/illustrated-transformer/
• The Self Attention formula is inspired by
querying a data store, where Q is the query
that is matched against the keys K, and
V is the actual stored value
• QK^T is a measure of the similarity
between Q and K
• √d_k is used to normalize, by dividing by
the square root of the dimensionality of K
• Softmax is used to concentrate the attention
on the largest similarities
• Finally, the normalized similarity is used to
weight V, resulting in the Attention output
(see the sketch below)
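A minimal NumPy sketch of the Self Attention formula above (sizes are illustrative):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # QK^T / sqrt(d_k): similarity
    return softmax(scores) @ V                # similarity-weighted sum of V

X = np.random.randn(6, 512)                   # 6 tokens, d_model = 512
out = attention(X, X, X)                      # Self Attention: Q = K = V = X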
Self Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
• Where Single-Head Attention is defined as,
Attention(Q, K, V) = softmax(QK^T / √d_k) V
• a Multi-Head Attention head is defined as,
head_i(Q, K, V) = Attention(QW_i^Q, KW_i^K, VW_i^V)
• i.e. we can have an arbitrary number of heads, where
parameter weight matrices have to be defined for Q, K and
V for every head
• Multi-Head Attention is then defined as,
MultiHead(Q, K, V) = Concat(head_1, head_2, … head_h) W^O
• i.e. MultiHead is the concatenation of all the heads,
multiplied by another parameter matrix W^O
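A minimal NumPy sketch of Multi-Head Attention as defined above (h = 8 heads with d_k = d_model / h = 64, as in the original paper; the random matrices stand in for learned parameters):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d_model, h = 512, 8
d_k = d_model // h
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((h, d_model, d_k))      # W_i^Q for each head i
W_K = rng.standard_normal((h, d_model, d_k))      # W_i^K
W_V = rng.standard_normal((h, d_model, d_k))      # W_i^V
W_O = rng.standard_normal((h * d_k, d_model))     # output projection W^O

X = rng.standard_normal((6, d_model))             # 6 tokens
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O        # Concat(head_1…head_h) W^O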
Encoder – Add & Normalization
• The input given to the Multi-Head Attention is added to its
output as a Residual connection (remember ResNet?)
• Then the result is Layer Normalized
• Similar to Batch Norm, but instead of normalizing over the
items in the batch (or the minibatch), normalization happens
over the values in the layer
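A minimal sketch of the Add & Norm step (the learnable gain and bias of Layer Norm are omitted for brevity):

import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)   # statistics per token, over the
    std = x.std(axis=-1, keepdims=True)     # layer values (not the batch)
    return (x - mean) / (std + eps)

x = np.random.randn(6, 512)                 # input to the sub-layer
sublayer_out = np.random.randn(6, 512)      # e.g. Multi-Head Attention output
y = layer_norm(x + sublayer_out)            # Add (residual) & Norm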
Decoder – Masked Multi-Head Attention
• Multi-Head Attention for the Decoder is the same as for the
Encoder
• However, only the query Q is received from the
previous layer
• K and V are received from the Encoder output
• Here, K and V contain the context-related information
required to process Q, which is generated only
from the input to the Decoder
Masking the Multi-Head Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The model must not see the tokens to
the right of the current position in the sequence
• Therefore, the softmax outputs related to
those attention positions should be zero
• For that, all the values to the right
of the diagonal are replaced
with minus infinity (−∞), before the
Softmax is applied
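A minimal sketch of the causal mask (everything to the right of the diagonal is set to −∞, so its softmax weight becomes zero):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)         # QK^T / sqrt(d_k)
mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
scores = np.where(mask == 1, -np.inf, scores)      # hide the future tokens
weights = softmax(scores)                          # upper triangle is now 0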
Training a Transformer
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The vocabulary has special tokens,
• <SOS> for the Start of the Sentence
• <EOS> for the End of the Sentence
• The encoded output is given to the Decoder
(as K and V) to translate its input to
Italian
• A Linear layer maps the Decoder output to
the vocabulary size
• The Softmax layer outputs the probabilities of
the tokens at every position, all in one timestep
• Cross Entropy loss is used
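A minimal sketch of the loss computation at the output (a tiny vocabulary and sequence, with random logits standing in for the Linear layer output):

import numpy as np

vocab_size, seq_len = 7, 3
logits = np.random.randn(seq_len, vocab_size)      # Linear layer output
targets = np.array([4, 2, 6])                      # gold tokens (ends in <EOS>)

logits = logits - logits.max(axis=-1, keepdims=True)             # stability
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(seq_len), targets].mean()            # Cross Entropy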
Making Inferences with a Transformer
• Unlike when training a Transformer, while making inferences a
Transformer needs one timestep to generate each single token
• The reason is that the generated token has to be fed back in to
generate the next token
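A minimal sketch of this autoregressive (greedy) decoding loop; model is a hypothetical stand-in returning next-token logits:

import numpy as np

def model(tokens):                         # stand-in for a trained Transformer
    return np.random.randn(10)             # logits over a 10-token vocabulary

SOS, EOS = 0, 1
tokens = [SOS]
while tokens[-1] != EOS and len(tokens) < 20:
    next_token = int(np.argmax(model(tokens)))   # one token per timestep
    tokens.append(next_token)                    # fed back for the next step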
Questions?