DA 5330 – Advanced Machine Learning Applications
Lecture 10 – Transformers
Maninda Edirisooriya
manindaw@uom.lk
Limitations of RNN Models
• Slow computation for longer sequences, as the computation cannot
be done in parallel due to the dependencies between timesteps
• As the number of timesteps grows, the backpropagation depth
increases, which worsens the Vanishing Gradient and Exploding
Gradient problems
• As information from the history is passed as a hidden state vector, the
amount of information is limited by the size of that vector
• As the information passed from the history gets updated at each time
step, the history is forgotten after a number of time steps
Attention-based Models
• Instead of processing all the time steps with the same weight, models
that give certain time steps an exponentially higher weight while
processing any given time step performed significantly better; these
are known as Attention Models
• Though Attention Models were significantly better, their processing
requirement (complexity) was quadratic (i.e. proportional to the
square of the number of time steps), which was an extra slowdown
• However, the paper “Attention Is All You Need” by Vaswani et al.
(2017) proposed that RNN units can be replaced with a higher
performance mechanism, keeping only the “Attention”
• This model is known as a Transformer Model
Transformer Model Architecture
(Architecture diagram: Encoder and Decoder)
Transformer Model
• The original paper defined this model (with both Encoder and Decoder) for the
application of Natural Language Translation
• However, later models used the Encoder and the Decoder independently for
different tasks
Source: https://pub.aimind.so/unraveling-the-power-of-language-models-understanding-llms-and-transformer-variants-71bfc42e0b21
Encoder Only (Autoencoding) Models
• Only the Encoder of the Transformer is used
• Pre-trained as Masked Language Models
• Some random tokens of the input sequence are masked
• The model tries to predict the missing (masked) tokens to reconstruct the
original sequence
• This process learns the Bidirectional Context of the tokens in a sequence
(the probabilities of appearing around certain tokens, to both the left and the right)
• Used in applications like Sentence Classification for Sentiment
Analysis, and token-level operations like Named Entity Recognition
• BERT and RoBERTa are some examples
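As an aside (not from the original slides), masked-token prediction with a pre-trained autoencoding model can be tried in a few lines using the Hugging Face transformers library; a minimal sketch, assuming the library is installed and using bert-base-uncased as an illustrative model:

from transformers import pipeline

# Fill in a masked token with BERT, which uses context from BOTH sides
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Transformers process all tokens in [MASK]."):
    print(candidate["token_str"], candidate["score"])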
Decoder Only (Autoregressive) Models
• Only the Decoder of the Transformer is used
• Pre-trained as Causal Language Models
• The last token of the input sequence is masked
• The model tries to predict that last token to reconstruct the original sequence
• Also known as a Full Language Model
• This process learns the Unidirectional Context of the tokens in a sequence
(the probability of being the next token, given the tokens to the left)
• Used in applications like Text Generation
• GPT and BLOOM are some examples
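Similarly, a minimal sketch of autoregressive generation with a pre-trained decoder-only model (again assuming the Hugging Face transformers library; gpt2 is an illustrative choice):

from transformers import pipeline

# Generate text left-to-right: each new token is predicted from the
# tokens to its left only (Unidirectional Context)
generator = pipeline("text-generation", model="gpt2")
print(generator("A Transformer is", max_new_tokens=20)[0]["generated_text"])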
Encoder Decoder (Sequence-to-Sequence) Models
• Use both the Encoder and the Decoder of the Transformer
• The pre-training objective may depend on the requirement. In the T5 model,
• In the Encoder, some random tokens of the input sequence are masked with a
unique placeholder token, added to the vocabulary, known as a Sentinel token
• This process is known as Span Corruption
• The Decoder tries to predict the missing (masked) tokens to reconstruct the
original sequence, replacing the Sentinel tokens, with auto-regression
• Used in applications like Translation, Summarization and Question-answering
• T5 and BART are some examples
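To make Span Corruption concrete, here is the input/target format used by T5 (example adapted from the T5 paper; <extra_id_0>, <extra_id_1>, … are T5's Sentinel tokens):

# Span Corruption: masked spans become Sentinel tokens in the Encoder
# input, and the Decoder target lists the dropped spans after each one
original       = "Thank you for inviting me to your party last week"
encoder_input  = "Thank you <extra_id_0> me to your party <extra_id_1> week"
decoder_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"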
Encoder – Input and Embedding
• The input is a sequence of tokens (words, in the case of
Natural Language Processing (NLP))
• Each input token is converted to a vector using an Input
Embedding (a Word Embedding in the case of NLP)
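A minimal sketch of the embedding lookup (the vocabulary size is illustrative; d_model = 512 is the value used in the original paper, and the random matrix stands in for a learned one):

import numpy as np

vocab_size, d_model = 10000, 512
embedding = np.random.randn(vocab_size, d_model)  # learned lookup table in practice

token_ids = np.array([17, 2301, 5])  # the tokenized input sequence
x = embedding[token_ids]             # one d_model-sized vector per token: (3, 512)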
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Positional Encoding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
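The figure on this slide is not reproduced here; for reference, the original paper defines the positional encoding as sinusoids of different frequencies added to the input embeddings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

x = np.random.randn(6, 512)              # token embeddings for 6 tokens
x = x + positional_encoding(6, 512)      # position information is ADDED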
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
• Multi-Head Attention applies multiple similar
operations, each known as a Single-Head Attention,
or simply Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
• The type of attention used here is known as Self
Attention, where each token attends to all the
tokens in the input sequence
• For the Encoder we take Q = K = V = X
Self Attention
Source: https://jalammar.github.io/illustrated-transformer/
• The Self Attention formula is inspired by
querying a data store, where Q is the query
that is matched against the keys K, and
V is the actual stored value
• QK^T is a measure of the similarity
between Q and K
• √d_k is used to normalize, by dividing by
the square root of the dimensionality of K
• Softmax is used to concentrate the attention
on the largest similarities
• Finally, the normalized similarity is used to
weight V, resulting in the Attention output
(see the sketch below)
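A minimal NumPy sketch of the Self Attention formula above (sizes are illustrative):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # QK^T / sqrt(d_k): similarity
    return softmax(scores) @ V                # similarity-weighted sum of V

X = np.random.randn(6, 512)                   # 6 tokens, d_model = 512
out = attention(X, X, X)                      # Self Attention: Q = K = V = X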
Self Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
• Where Single-Head Attention is defined as,
Attention(Q, K, V) = softmax(QK^T / √d_k) V
• a Multi-Head Attention head is defined as,
head_i(Q, K, V) = Attention(QW_i^Q, KW_i^K, VW_i^V)
• i.e. we can have an arbitrary number of heads, where
parameter weight matrices have to be defined for Q, K and
V for every head
• Multi-Head Attention is then defined as,
MultiHead(Q, K, V) = Concat(head_1, head_2, … head_h) W^O
• i.e. MultiHead is the concatenation of all the heads,
multiplied by another parameter matrix W^O
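A minimal NumPy sketch of Multi-Head Attention as defined above (h = 8 heads with d_k = d_model / h = 64, as in the original paper; the random matrices stand in for learned parameters):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d_model, h = 512, 8
d_k = d_model // h
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((h, d_model, d_k))      # W_i^Q for each head i
W_K = rng.standard_normal((h, d_model, d_k))      # W_i^K
W_V = rng.standard_normal((h, d_model, d_k))      # W_i^V
W_O = rng.standard_normal((h * d_k, d_model))     # output projection W^O

X = rng.standard_normal((6, d_model))             # 6 tokens
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O        # Concat(head_1…head_h) W^O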
Encoder – Add & Normalization
• The input given to the Multi-Head Attention is added to its
output as a Residual connection (remember ResNet?)
• Then the result is Layer Normalized
• Similar to Batch Norm, but instead of normalizing over the
items in the batch (or the minibatch), normalization happens
over the values in the layer
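A minimal sketch of the Add & Norm step (the learnable gain and bias of Layer Norm are omitted for brevity):

import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)   # statistics per token, over the
    std = x.std(axis=-1, keepdims=True)     # layer values (not the batch)
    return (x - mean) / (std + eps)

x = np.random.randn(6, 512)                 # input to the sub-layer
sublayer_out = np.random.randn(6, 512)      # e.g. Multi-Head Attention output
y = layer_norm(x + sublayer_out)            # Add (residual) & Norm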
Decoder – Masked Multi-Head Attention
• Multi-Head Attention for the Decoder is the same as for the
Encoder
• However, only the query Q is received from the
previous layer
• K and V are received from the Encoder output
• Here, K and V contain the context-related information
required to process Q, which is generated only
from the input to the Decoder
Masking the Multi-Head Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The model must not see the tokens to
the right of the current position in the sequence
• Therefore, the softmax outputs related to
those attention positions should be zero
• For that, all the values to the right
of the diagonal are replaced
with minus infinity (−∞), before the
Softmax is applied
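A minimal sketch of the causal mask (everything to the right of the diagonal is set to −∞, so its softmax weight becomes zero):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)         # QK^T / sqrt(d_k)
mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
scores = np.where(mask == 1, -np.inf, scores)      # hide the future tokens
weights = softmax(scores)                          # upper triangle is now 0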
Training a Transformer
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The vocabulary has special tokens,
• <SOS> for the Start of the Sentence
• <EOS> for the End of the Sentence
• The encoded output is given to the Decoder
(as K and V) to translate its input to
Italian
• A Linear layer maps the Decoder output to
the vocabulary size
• The Softmax layer outputs the probabilities of
the tokens at every position, all in one timestep
• Cross Entropy loss is used
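A minimal sketch of the loss computation at the output (a tiny vocabulary and sequence, with random logits standing in for the Linear layer output):

import numpy as np

vocab_size, seq_len = 7, 3
logits = np.random.randn(seq_len, vocab_size)      # Linear layer output
targets = np.array([4, 2, 6])                      # gold tokens (ends in <EOS>)

logits = logits - logits.max(axis=-1, keepdims=True)             # stability
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(seq_len), targets].mean()            # Cross Entropy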
Making Inferences with a Transformer
• Unlike when training a Transformer, while making inferences a
Transformer needs one timestep to generate each single token
• The reason is that the generated token has to be fed back in to
generate the next token
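A minimal sketch of this autoregressive (greedy) decoding loop; model is a hypothetical stand-in returning next-token logits:

import numpy as np

def model(tokens):                         # stand-in for a trained Transformer
    return np.random.randn(10)             # logits over a 10-token vocabulary

SOS, EOS = 0, 1
tokens = [SOS]
while tokens[-1] != EOS and len(tokens) < 20:
    next_token = int(np.argmax(model(tokens)))   # one token per timestep
    tokens.append(next_token)                    # fed back for the next step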
Questions?