Lecture 19
Transformers and LLMs
Shikhar Agnihotri, Liangze Li
1
Part 1
Transformers
2
Transformers
• Tokenization
• Attention
• Value
• Softmax
• Decoder
4
Machine Translation
Targets
Inputs
I ate an apple
5
Processing Inputs
Inputs: I ate an apple
6
Inputs
I ate an apple → Tokenizer → Embedding Layer
WHERE IS THE CONTEXT?
9
Encoder
BLACK BOX OF SORTS
LEARN TO ADD CONTEXT
Attention
⍺[i, j]? From lecture 18:
• Query
• Key
• Value
16
Query, Key & Value
Database {Key, Value store}
18
Query, Key & Value
{Key, Value store} (Done at the same time!!)
{Query: “Order details of order_104”}
OR
{Query: “Order details of order_106”}
22
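To make the database analogy concrete: a hard key-value lookup returns the value of the one exactly matching key, while attention scores a query against every key and returns a softmax-weighted blend of all the values. A minimal sketch of that contrast (the database contents, vector sizes, and random numbers are made up for illustration):

```python
import numpy as np

# Hard lookup: the query must match a key exactly.
db = {"order_104": "2 apples", "order_106": "1 banana"}
print(db["order_104"])                               # -> 2 apples

# Soft "lookup": score the query against every key, softmax, then mix all the values.
rng = np.random.default_rng(0)
keys   = rng.standard_normal((5, 8))                 # 5 keys,   each 8-dimensional (toy sizes)
values = rng.standard_normal((5, 8))                 # 5 values, each 8-dimensional
query  = rng.standard_normal(8)

scores  = keys @ query                               # one similarity score per key
weights = np.exp(scores) / np.exp(scores).sum()      # softmax over the keys
output  = weights @ values                           # weighted mix of every value
print(weights.round(2), output.shape)                # weights sum to 1; output is 8-dim
```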
Query, Key & Value
23
Attention
Done at the same time!!
26
Attention
Parallelizable!!!
Attention Filter
I1 I2 I3 I4 I5
(Note: dimensions across Q, K, V have been dropped in the figures)

Each input Ii is projected with the shared weights WQ, WK, WV:
Q1 K1 V1   Q2 K2 V2   Q3 K3 V3   Q4 K4 V4   Q5 K5 V5

For output position 1:
• e1,j = Q1 ⊗ Kj (raw score of the first query against every key)
• α1,j = softmax(e1,1 … e1,5)
• Z1 = ∑j α1,j ⊗ Vj, a contextually rich embedding of I1
All of e1,1 … e1,5, and every other position, can be computed in parallel (Parallelized), as sketched in code below.
41
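The build-up above can be written out directly. A minimal numpy sketch of the computation for output position 1, under the same simplifications as the figures: toy sizes, shared WQ/WK/WV, and no scaling of the scores (the √d scaling is introduced later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                                  # 5 input tokens I1..I5, toy embedding size 8
I = rng.standard_normal((T, d))              # input embeddings

WQ, WK, WV = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = I @ WQ, I @ WK, I @ WV             # Qi, Ki, Vi for every position, computed in parallel

# Attention output for position 1 (index 0), as built up on the slides.
e1 = K @ Q[0]                                # e1,j = Q1 . Kj  for j = 1..5
a1 = np.exp(e1) / np.exp(e1).sum()           # α1,j = softmax over the scores
Z1 = a1 @ V                                  # Z1 = Σj α1,j · Vj  (contextually rich embedding)
print(a1.round(2), Z1.shape)
```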
Poll 1 @1296
Which of the following are true about attention? (Select all that apply)
a. To calculate attention weights for input I2, you would use key k2 and all queries
b. To calculate attention weights for input I2, you would use query q2 and all keys
c. We scale the QKᵀ product to bring attention weights into the range [0, 1]
d. We scale the QKᵀ product to allow for numerical stability
43
Positional Encoding
Positional Encoding
Possible candidates:
p_{t+1} = p_t + Δt
p_{t+1} = β^t · Δt
52
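One standard concrete choice for the position code is the fixed sinusoidal encoding from the original Transformer paper (Vaswani et al., 2017). A minimal sketch; the function and argument names are mine, and max_len and d_model are arbitrary toy values:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[t, 2i] = sin(t / 10000^(2i/d_model)),  PE[t, 2i+1] = cos(t / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i   = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16); row t is the position code added to token t's embedding
```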
Rotary Positional Embedding
54
Positional Encoding
Final Input Embeddings = Input Embeddings + Position Encodings
Input ("I ate an apple") → Tokenizer → Embedding Layer → Input Embeddings
57
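Putting the input pipeline together: the tokenizer's token IDs index an embedding table, and a position code of the same width is added elementwise. A toy sketch; the vocabulary, the initialization scale, and the random stand-in for the position table (sinusoidal or learned in practice) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<pad>": 0, "I": 1, "ate": 2, "an": 3, "apple": 4}           # toy vocabulary
token_ids = np.array([vocab[w] for w in "I ate an apple".split()])    # tokenizer output

d_model = 16
embedding_table = rng.standard_normal((len(vocab), d_model)) * 0.02   # learned in practice
token_emb = embedding_table[token_ids]                                # (4, d_model)

pos_table = rng.standard_normal((len(token_ids), d_model)) * 0.02     # stand-in for position codes
final_input_embeddings = token_emb + pos_table                        # what the encoder consumes
print(final_input_embeddings.shape)                                   # (4, 16)
```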
Encoder: ⍺[i, j], ∑ → CONTEXTUALLY RICH EMBEDDINGS
BLACK BOX OF SORTS: LEARN TO ADD CONTEXT
59
Self Attention
The animal didn’t cross the street because it was too wide
coreference resolution?
SELF: the sequence attends to itself, so queries, keys, and values all come from the same input
65
Self Attention
Input Embeddings × (WQ, WK, WV) → Q, K, V Projections
67
Self Attention
Z = softmax( Q_projection · K_projectionᵀ / √d_model ) · V_projection
The Q and K projections are (T × d_model), so the score matrix Q·Kᵀ is (T × T)
Attention: Z
71
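In matrix form, the whole self-attention layer for one sentence is only a few lines. A minimal sketch that follows the slide's scaling by √d_model (the original paper scales by √d_k instead); all sizes and weights are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, d_model = 5, 16
X = np.random.randn(T, d_model)                   # final input embeddings of one sentence
WQ, WK, WV = (np.random.randn(d_model, d_model) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV                  # (T, d_model) each
scores = Q @ K.T / np.sqrt(d_model)               # (T, T) attention scores, scaled as on the slide
A = softmax(scores, axis=-1)                      # each row sums to 1
Z = A @ V                                         # (T, d_model) contextually rich embeddings
print(A.shape, Z.shape)
```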
Self Attention
The animal didn’t cross the street because it was too wide
coreference resolution
A single set of WQ, WK, WV over the Input Embeddings captures only one notion of relevance
74
Multi-Head Attention
Heads 1, 2, …, H each have their own projection weights:
Inputs × WQi → Qi
Inputs × WKi → Ki
Inputs × WVi → Vi      (for head i = 1 … H)
76
Multi-Head Attention
Zi = softmax( Qi · Kiᵀ / √d_model ) · Vi
Z = CONCAT(Z1, Z2, …, Zh)
The animal didn’t cross the street because it was too wide
Input Norm(Z)
80
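A sketch of multi-head attention as drawn above. The per-head dimension split d_head = d_model / H, the scaling by √d_head, and the output projection WO applied after CONCAT are the standard Transformer choices; they are assumptions here rather than details spelled out on the slides:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, H = 5, 16, 4
d_head = d_model // H
X = np.random.randn(T, d_model)                      # input embeddings

heads = []
for i in range(H):                                   # each head has its own WQi, WKi, WVi
    WQi, WKi, WVi = (np.random.randn(d_model, d_head) for _ in range(3))
    Qi, Ki, Vi = X @ WQi, X @ WKi, X @ WVi
    Ai = softmax(Qi @ Ki.T / np.sqrt(d_head), axis=-1)
    heads.append(Ai @ Vi)                            # Zi, shape (T, d_head)

WO = np.random.randn(d_model, d_model)               # output projection after CONCAT
Z = np.concatenate(heads, axis=-1) @ WO              # (T, d_model)
print(Z.shape)
```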
Feed Forward (see the sketch below)
• Non-Linearity
• Complex Relationships
• Learn from each other
Residuals
Input Norm(Z)
81
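A sketch of the feed-forward block with residuals and layer norm. The ReLU non-linearity, the wider hidden size d_ff, and the post-norm placement are the original Transformer's choices, assumed here because the slides only name the ingredients:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

T, d_model, d_ff = 5, 16, 64
X = np.random.randn(T, d_model)                 # block input (needed for the residual)
Z = np.random.randn(T, d_model)                 # attention output for the same positions

W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

h  = layer_norm(X + Z)                          # Add & Norm around attention
ff = np.maximum(0.0, h @ W1 + b1) @ W2 + b2     # position-wise feed forward with ReLU non-linearity
out = layer_norm(h + ff)                        # Add & Norm around the feed-forward block
print(out.shape)
```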
Add & Norm
Feed Forward
Residuals
Input Norm(Z)
82
Encoders
ENCODER → ENCODER → … → ENCODER
The output of Encoder_i is the input to Encoder_(i+1)
84
Transformers
✓ Tokenization
✓ Attention
✓ Value
• Softmax
• Decoder
85
Machine Translation
Targets
Inputs
I ate an apple
86
Targets
Tokenizer
89
Masked Multi-Head Attention
Decoding step by step (using Teacher Forcing)

Inference:
1  <sos>
2  <sos> Ich
3  <sos> Ich habe
4  <sos> Ich habe einen
Parallelized?

Training:
1  <sos> - - - - - -
2  <sos> Ich - - - - -
101
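Masking is what lets training process all target positions in one parallel pass (teacher forcing) while each position still only attends to earlier tokens. A minimal sketch of the causal mask; the -1e9 fill value (rather than a true -inf) is a common implementation convenience, not something specified on the slides:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T = 4                                             # e.g. <sos> Ich habe einen
scores = np.random.randn(T, T)                    # Q·Kᵀ/√d for the target sequence (toy values)

mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future positions
scores = np.where(mask, -1e9, scores)             # block attention to future tokens
A = softmax(scores, axis=-1)
print(A.round(2))                                 # row t only attends to positions <= t
```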
Encoder Decoder Attention
NOTE: Every decoder block receives the same FINAL encoder output
Z'' = softmax( Q_d · K_eᵀ / √d_model ) · V_e
(Q_d comes from the decoder; K_e and V_e come from the FINAL encoder output)
105
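A toy sketch of the encoder-decoder attention above: queries come from the decoder states, keys and values from the shared final encoder output. All shapes and weights are arbitrary illustrative values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T_enc, T_dec, d_model = 4, 3, 16
enc_out = np.random.randn(T_enc, d_model)   # FINAL encoder output (same for every decoder block)
dec_h   = np.random.randn(T_dec, d_model)   # decoder states after masked self-attention

WQ, WK, WV = (np.random.randn(d_model, d_model) for _ in range(3))
Qd = dec_h   @ WQ                           # queries from the decoder
Ke = enc_out @ WK                           # keys   from the encoder output
Ve = enc_out @ WV                           # values from the encoder output

Z2 = softmax(Qd @ Ke.T / np.sqrt(d_model), axis=-1) @ Ve   # Z'' on the slide, shape (T_dec, d_model)
print(Z2.shape)
```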
Encoder Decoder Attention
Feed Forward
• Non-Linearity
• Complex Relationships
• Learn from each other
Residuals
DECODER
107
Decoder
DECODER → DECODER → … → DECODER → Decoder output
108
Linear
Linear weights are often tied with the input embedding matrix
softmax
Output Probabilities
Td
110
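A sketch of what the weight tying above means in practice: the output Linear layer reuses the transposed input embedding matrix, so no separate d_model-to-vocab projection is learned. All sizes are toy values:

```python
import numpy as np

vocab_size, d_model, T_d = 100, 16, 3
embedding = np.random.randn(vocab_size, d_model) * 0.02   # input embedding matrix

dec_out = np.random.randn(T_d, d_model)                   # final decoder hidden states

# "Tied" output layer: reuse the embedding matrix (transposed) as the Linear weights.
logits = dec_out @ embedding.T                            # (T_d, vocab_size)
probs  = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)                # softmax -> output probabilities
print(probs.shape, probs.sum(axis=-1))                    # each row sums to 1
```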
Poll 2 (@1297)
Targets
Inputs
I ate an apple
Machine Translation
113
Transformers
✓ Tokenization
✓ Attention
✓ Value
✓ Softmax
✓ Decoder
114
Part 2
LLM
109
Transformers, mid-2017

Representation:
• Input: input tokens
• Output: hidden states
• Model can see all timesteps
• Does not usually output tokens, so no inherent auto-regressivity
• Can also be adapted to generate tokens by appending a module that maps hidden state dimensionality to vocab size

Generation:
• Input: output tokens and hidden states*
• Output: output tokens
• Model can only see previous timesteps
• Model is auto-regressive with previous timesteps’ outputs
• Can also be adapted to generate hidden states by looking before token outputs

116
2018 – The Inception of the LLM Era
BERT (Representation): Oct 2018
GPT (Generation): Jun 2018
117
BERT - Bidirectional Encoder Representations
MLM (Masked Language Modeling)
BERT → Prediction head → e.g. "you" 60%, "they" 20%
122
BERT - Bidirectional Encoder Representations
NSP (Next Sentence Prediction)
BERT → Prediction head → is_next 95%, not_next 5%
123
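A toy sketch of how MLM training data is prepared: a fraction of the input tokens is replaced by [MASK] and the model is trained to recover them. The 15% rate matches the BERT paper; the sentence and the simplified all-[MASK] replacement (BERT also sometimes substitutes random or unchanged tokens) are illustrative:

```python
import random

random.seed(0)
tokens = "the cat sat on the mat".split()
MASK_PROB = 0.15                         # fraction of tokens masked during pretraining

inputs, labels = [], []
for tok in tokens:
    if random.random() < MASK_PROB:
        inputs.append("[MASK]")          # the model must reconstruct this position
        labels.append(tok)               # target for the MLM prediction head
    else:
        inputs.append(tok)
        labels.append(None)              # no MLM loss at unmasked positions
print(inputs, labels)
```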
BERT - Bidirectional Encoder Representations
BERT Fine-Tuning:
• Classification Tasks:
  • Add a feed-forward layer on top of the encoder output for the [CLS] token (sketched below)
• Question Answering Tasks:
  • Train two extra vectors to mark the beginning and end of the answer span in the paragraph
• …
124
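A toy sketch of the classification fine-tuning recipe above: take the encoder output at the [CLS] position and add a small feed-forward layer producing class logits. All sizes and weights here are illustrative:

```python
import numpy as np

d_model, num_classes = 16, 2
encoder_out = np.random.randn(5, d_model)        # BERT encoder output for [CLS] + 4 tokens
cls_vec = encoder_out[0]                         # position 0 is the [CLS] token

# Fine-tuning head: a single feed-forward (linear) layer on the [CLS] representation.
W, b = np.random.randn(d_model, num_classes), np.zeros(num_classes)
logits = cls_vec @ W + b
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(probs)                                     # class probabilities for the sentence
```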
BERT - Bidirectional Encoder Representations
BERT Evaluation:
What is our takeaway from BERT?
BERT (Representation): Oct 2018
GPT (Generation): Jun 2018
130
GPT – Generative Pretrained Transformer
GPT Fine-Tuning:
• Prompt-format task-specific text as a continuous stream for the model to fit (sketched below)
  QA: Answer the question based on the context. Context: … Answer: …
  Summarization: Summarize this article: …
133
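A toy sketch of what "prompt-formatting a continuous stream" can look like. The two templates follow the slide's QA and summarization examples, but the exact field layout and the sample texts are assumptions for illustration, not GPT's actual fine-tuning data:

```python
# Minimal illustration (assumed format) of packing task data into one text stream
# that a causal language model then fits with ordinary next-token prediction.
examples = [
    # (template, input text, target text) -- toy data for illustration only
    ("Answer the question based on the context.\nContext: {x}\nAnswer: {y}",
     "The Eiffel Tower is in Paris. Where is it?", "Paris"),
    ("Summarize this article: {x}\n{y}",
     "The council met Tuesday to plan a new park.", "The council planned a park."),
]

stream = "\n\n".join(t.format(x=x, y=y) for t, x, y in examples)
print(stream)          # fine-tuning = language-model training on this continuous stream
```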
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?
136
Poll
Piazza @1291
A. 117
B. 117K
C. 117M
D. 117B
The LLM Era – Paradigm Shift in Machine Learning
BERT (Representation): Oct 2018
GPT (Generation): Jun 2018
139
The LLM Era – Paradigm Shift in Machine Learning
Representation: BERT – 2018, DistilBERT – 2019, RoBERTa – 2019, ALBERT – 2019, ELECTRA – 2020, DeBERTa – 2020, …
Encoder-Decoder: T5 – 2019, BART – 2019, mT5 – 2021, …
Generation: GPT – 2018, GPT-2 – 2019, GPT-3 – 2020, GPT-Neo – 2021, GPT-3.5 (ChatGPT) – 2022, LLaMA – 2023, GPT-4 – 2023, …
140
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of capturing knowledge which is more fundamental than task-specific abilities.

Before LLMs:
• Feature Engineering: How do we design or select the best features for a task?
• Model Selection: Which model is best for which type of task?
• Transfer Learning: Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization: How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs: …
149
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?
• Attention and Transformer – is this the end?
153
Poll
Piazza @1292
Freeform response

The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?
• Solution: ???
156
Looking Back
It is true that language models are just programmed to predict the next token. But that
is not as simple as you might think.
In fact, all animals, including us, are just programmed to survive and reproduce, and yet
amazingly complex and beautiful stuff comes from it.
- Sam Altman*
*Paraphrased by IDL TAs
157