
Introduction to Deep Learning

Lecture 19
Transformers and LLMs
Shikhar Agnihotri Liangze Li

11-785, Fall 2023

1
Part 1

Transformers

2
Transformers

• Tokenization • Attention

• Input Embeddings • Self Attention

• Position Encodings • Multi Head Attention

• Residuals • Masked Attention

• Query • Encoder Decoder Attention

• Key • Output Probabilities / Logits

• Value • Softmax

• Add & Norm • Encoder-Decoder models

• Encoder • Decoder only models

• Decoder

3
Transformers

• Tokenization • Attention

• Input Embeddings • Self Attention

• Position Encodings • Multi Head Attention

• Residuals • Masked Attention

• Query • Encoder Decoder Attention

• Key • Output Probabilities / Logits

• Value • Softmax

• Add & Norm • Encoder-Decoder models

• Encoder • Decoder only models

• Decoder

4
Machine Translation

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

5
Inputs

Processing Inputs

Inputs

I ate an apple

6
Inputs

I ate an apple <eos>

Tokenizer

I ate an apple

Generate Input Embeddings


7
Inputs
dmodel

Embedding Layer

I ate an apple <eos>

Tokenizer

I ate an apple

Generate Input Embeddings


8
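To make the tokenizer -> embedding-layer pipeline on this slide concrete, here is a minimal PyTorch sketch; the toy whitespace tokenizer, vocabulary, token ids, and d_model value are illustrative assumptions, not the lecture's actual setup.

import torch
import torch.nn as nn

# Toy whitespace "tokenizer" and vocabulary (illustrative assumption).
vocab = {"<eos>": 0, "I": 1, "ate": 2, "an": 3, "apple": 4}
sentence = "I ate an apple"
token_ids = torch.tensor([[vocab[w] for w in sentence.split()] + [vocab["<eos>"]]])  # (1, 5)

d_model = 8  # embedding width; real models use e.g. 512
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

input_embeddings = embedding(token_ids)   # (1, 5, d_model)
print(input_embeddings.shape)             # torch.Size([1, 5, 8])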
Encoder

WHERE IS THE CONTEXT?

I ate an apple <eos>

9
Encoder

BLACK BOX
OF SORTS

I ate an apple <eos>


10
Encoder

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


11
Encoder
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


12
Encoder ⍺[ i j ] ?
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


13
Encoder ⍺[ i j ] ? ∑ ∏ ?
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


14
Attention

⍺[ i j ] ?
From lecture 18:

15
Attention

⍺[ i j ] ?
From lecture 18:

• Query
• Key
• Value

16
Query, Key & Value

Database
{Key, Value store}

17
Query, Key & Value

Database
{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

18
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

19
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

20
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

21
Query, Key & Value

Done at the same time!!

{Key, Value store}
{Query: “Order details of order_104”}
OR
{Query: “Order details of order_106”}

22
Query, Key & Value

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

Query:
1. Search for info

Key:
1. Interacts directly with Queries
2. Distinguishes one object from another
3. Identify which object is the most relevant and by how much

Value:
1. Actual details of the object
2. More fine grained

23
Attention

Query Key Value Store Key Value

24
Attention

Query Key Value Store Key Value

25
Attention

Done at the same time!!

Query Key Value Store Key Value

26
Attention

Parallelizable!!!

Query Key Value Store Key Value

Q    QKT    softmax( QKT / √dk )    softmax( QKT / √dk ) V
27
Attention

Parallelizable!!!

Attention Filter

Query Key Value Store Key Value

Q    QKT    softmax( QKT / √dk )    softmax( QKT / √dk ) V
28
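A minimal sketch of the scaled dot-product attention formula on this slide, softmax(QKT / √dk) V, written in PyTorch; the function name, tensor shapes, and random inputs are illustrative.

import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k); all rows are processed in parallel.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)         # the "attention filter"
    return weights @ V                              # contextualized outputs

Q = torch.randn(5, 16); K = torch.randn(5, 16); V = torch.randn(5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([5, 16])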
Attention

I1 I2 I3 I4 I5 29

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 30

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 31

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1

softmax

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 32

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗

softmax

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 33

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗

softmax

e1,1 e1,2

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 34

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗

softmax

e1,1 e1,2 e1,3

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 35

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗

softmax

e1,1 e1,2 e1,3 e1,4


⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 36

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 37

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 38

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 39

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 40

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

Parallelized!

softmax

e1,1 e1,2 e1,3 e1,4 e1,5

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 41

I ate an apple <eos>


Poll 1 @1296

Which of the following are true about attention? (Select all that apply)

a. To calculate attention weights for input I2, you would use key k2, and
all queries
b. To calculate attention weights for input I2, you would use query q2,
and all keys
c. We scale the QKT product to bring attention weights in the range of
[0,1]
d. We scale the QKT product to allow for numerical stability

42
Poll 1 @1296

Which of the following are true about attention? (Select all that apply)

a. To calculate attention weights for input I2, you would use key k2, and
all queries
b. To calculate attention weights for input I2, you would use query q2,
and all keys
c. We scale the QKT product to bring attention weights in the range of
[0,1]
d. We scale the QKT product to allow for numerical stability

43
Positional Encoding

I ate an apple <eos>

44
Positional Encoding

I ate an apple <eos>

apple ate an I <eos>

Positional Encoding
45
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Positional Encoding
46
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
47
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
48
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
49
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)

Positional Encoding
50
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)


M?
1. Should be a unitary matrix
2. Magnitudes of eigenvalues should be 1 -> norm preserving

Positional Encoding
51
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)


M
1. The matrix can be learnt
2. Produces unique rotated embeddings each time

Positional Encoding
52
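As an illustration of the norm-preserving property above, a 2x2 rotation matrix is one such M: it is unitary, its eigenvalues have magnitude 1, so P(t + t') = M_t' · P(t) rotates the encoding without changing its length. This is an illustrative sketch of the property, not the lecture's exact construction; the angle and vector are arbitrary.

import math
import torch

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])  # unitary: M.T @ M = I, |eigenvalues| = 1

P_t = torch.tensor([0.6, 0.8])   # some 2-D positional encoding, norm 1
M = rotation(0.3)                # "advance by t' = 0.3"
P_next = M @ P_t                 # P(t + t') = M_t' P(t)

print(torch.linalg.norm(P_t), torch.linalg.norm(P_next))  # both ≈ 1.0 (norm preserved)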
Rotary Positional
Embedding

REF: Rotary Positional Embeddings !


53
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Actual Candidates :
sine(g(t))
cosine(g(t))

Positional Encoding
54
Positional Encoding

Requirements for g(t)


• Must have same dimensions as input embeddings
• Must produce overall unique encodings

pos -> idx of the token in input sentence


i -> ith dimension out of d

Positional Encoding
55
Positional Encoding

Requirements for g(t)


• Must have same dimensions as input embeddings
• Must produce overall unique encodings

pos -> idx of the token in input sentence


i -> ith dimension out of d

Positional Encoding
56
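A minimal sketch of a sinusoidal g(t) satisfying the requirements above (same dimensionality as the input embeddings, unique overall encodings). The pos / 10000^(2i/d) form is the one from "Attention Is All You Need"; max_len and d_model below are illustrative.

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1): idx of the token
    i = torch.arange(0, d_model, 2).float()            # even dimension indices out of d
    angle = pos / (10000 ** (i / d_model))             # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dims use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dims use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
final_input = torch.randn(1, 50, 8) + pe   # input embeddings + position encodings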
Positional Encoding
Final Input Embeddings

Position Encodings

Input Embeddings

Embedding Layer

I ate an apple <eos> Tokens

Tokenizer

I ate an apple
Input 57
Encoder ⍺[ i j ] ∑
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


58
Self Attention

From lecture 18:

59
Self Attention

The animal didn’t cross the street because it was too wide

60
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution ?

61
Self Attention

62
Self Attention

63
Self Attention

64
Self Attention

SELF

Query Inputs = Key Inputs = Value Inputs

65
Self Attention

WQ WK Wv

Input Embeddings
66
Self Attention

Input Embeddings · WQ -> Q Projections

Input Embeddings · WK -> K Projections

Input Embeddings · WV -> V Projections

67
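A minimal sketch of forming the Q, K, and V projections from the same input embeddings with learned matrices WQ, WK, WV (self-attention: query inputs = key inputs = value inputs); the sizes and random embeddings are illustrative.

import torch
import torch.nn as nn

d_model = 16
W_Q = nn.Linear(d_model, d_model, bias=False)   # WQ
W_K = nn.Linear(d_model, d_model, bias=False)   # WK
W_V = nn.Linear(d_model, d_model, bias=False)   # WV

X = torch.randn(5, d_model)        # input embeddings for "I ate an apple <eos>"
Q, K, V = W_Q(X), W_K(X), W_V(X)   # all three projections come from the same inputs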
Self Attention

softmax( QProjection KProjectionT / √dmodel )

68
Self Attention
softmax( QProjection KProjectionT / √dmodel )

69
Self Attention
softmax( QProjection KProjectionT / √dmodel ) VProjection

70
Self Attention

Attention: Z

71
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution

72
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution

73
Self Attention

WQ WK Wv

Input Embeddings
74
Multi-Head Attention

Heads 1, 2, … H

WQ1, WQ2, … WQH    WK1, WK2, … WKH    WV1, WV2, … WVH

Input Embeddings
75
Multi-Head Attention

For each head i ∈ [1, H]:

Inputs · WQi -> Qi

Inputs · WKi -> Ki

Inputs · WVi -> Vi

76
Multi-Head Attention

Zi = softmax( Qi KiT / √dmodel ) Vi

for all i ∈ [1, h]


77
Multi-Head Attention

Z1 Z2 Zh

CONCAT

Multi Head Attention : Z


78
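A minimal sketch of multi-head attention as described above: per-head projections, scaled dot-product attention per head giving Z1 … Zh, then CONCAT and an output projection; the head count, sizes, and class name are illustrative assumptions.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=16, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # applied after CONCAT

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        def split(z):                           # -> (batch, heads, seq, d_head)
            return z.view(b, t, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        z = torch.softmax(scores, dim=-1) @ v   # one Zi per head
        z = z.transpose(1, 2).reshape(b, t, self.h * self.d_head)  # CONCAT
        return self.W_o(z)

out = MultiHeadSelfAttention()(torch.randn(1, 5, 16))   # (1, 5, 16)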
Multi-Head Attention

The animal didn’t cross the street because it was too wide

Sentence boundaries ? coreference resolution Context ?

Semantic relationships ? Part of Speech ? Comparisons ?


79
Add & Norm

Input    Norm(Z)

Normalization(Z):
• Mean 0, Std dev 1
• Stabilizes training
• Regularization effect

Add -> Residuals:
• Avoid vanishing gradients
• Train deeper networks

80
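A minimal sketch of the Add & Norm step: the sub-layer output Z is added back to its input (the residual) and layer-normalized. The exact placement of the normalization varies between implementations, so treat this as one common post-norm variant with illustrative shapes.

import torch
import torch.nn as nn

d_model = 16
layer_norm = nn.LayerNorm(d_model)   # per-token mean 0, std 1 (with learnable scale/shift)

x = torch.randn(1, 5, d_model)       # sub-layer input
z = torch.randn(1, 5, d_model)       # sub-layer output (e.g. attention Z)

out = layer_norm(x + z)              # Add (residual) & Norm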
Feed Forward

Feed Forward
• Non Linearity
• Complex Relationships
• Learn from each other

Feed Forward

Residuals

Input Norm(Z) 81
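A minimal sketch of the position-wise feed-forward block referenced above: two linear layers with a non-linearity, applied to each token independently, wrapped in a residual. The 4x hidden width is the common choice from the original Transformer paper, used here as an assumption.

import torch
import torch.nn as nn

d_model, d_ff = 16, 64     # d_ff = 4 * d_model is the usual choice

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),             # non-linearity -> more complex relationships
    nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 5, d_model)
out = x + feed_forward(x)  # residual around the feed-forward sub-layer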
Add & Norm
Add & Norm

Feed Forward

Residuals

Input Norm(Z) 82
Encoders
Encoder

ENCODER

83
Encoders
Encoder

ENCODER

.
.
.
Input to Encoderi+1

ENCODER

Output from Encoderi


ENCODER

84
Transformers

ü Tokenization ü Attention

ü Input Embeddings ü Self Attention

ü Position Encodings ü Multi Head Attention

ü Residuals • Masked Attention

ü Query • Encoder Decoder Attention

ü Key • Output Probabilities / Logits

ü Value • Softmax

ü Add & Norm • Encoder-Decoder models

ü Encoder • Decoder only models

• Decoder

85
Machine Translation

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

86
Targets
Targets

Ich habe einen Apfel gegessen

87
Targets

Embedding Layer + Positional Encoding

<sos> Ich have einen apfel gegessen <eos>

Tokenizer

Ich have einen apfel gegessen

Generate Target Embeddings


88
Masked Multi Head Attention

<sos> Ich have einen apfel gegessen <eos>

89
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Inference
1 <sos>

2 <sos> Ich

3 <sos> Ich have

4 <sos> Ich have einen

5 <sos> Ich have einen apfel

6 <sos> Ich have einen apfel gegessen

7 <sos> Ich have einen apfel gegessen <eos>

90
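A minimal sketch of the step-by-step decoding loop above at inference time: starting from <sos>, the decoder repeatedly consumes its own previous outputs until it emits <eos>. Here decode_step is a hypothetical stand-in for the full decoder + linear + softmax stack, and greedy (argmax) selection is assumed.

import torch

def greedy_decode(decode_step, sos_id, eos_id, max_len=50):
    # decode_step(prefix_ids) -> logits over the vocabulary for the next token.
    # It is a hypothetical placeholder for the decoder + linear + softmax stack.
    output = [sos_id]
    for _ in range(max_len):
        logits = decode_step(torch.tensor([output]))   # (1, vocab_size)
        next_id = int(logits.argmax(dim=-1))           # greedy choice
        output.append(next_id)
        if next_id == eos_id:
            break
    return output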
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Inference
1 <sos>

2 <sos> Ich

ed ?
liz
lle
3 <sos> Ich have

ra
Pa
4 <sos> Ich have einen

5 <sos> Ich have einen apfel

6 <sos> Ich have einen apfel gegessen

7 <sos> Ich have einen apfel gegessen <eos>

91
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Training

<sos> Ich have einen apfel gegessen <eos>

92
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Training

<sos> Ich have einen apfel gegessen <eos>

Outputs at time T should only pay attention to outputs


until time T-1

93
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> Ich have einen apfel gegessen <eos>

2 <sos> Ich have einen apfel gegessen <eos>

3 <sos> Ich have einen apfel gegessen <eos>

4 <sos> Ich have einen apfel gegessen <eos>

5 <sos> Ich have einen apfel gegessen <eos>

6 <sos> Ich have einen apfel gegessen <eos>

7 <sos> Ich have einen apfel gegessen <eos>

94
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> Ich have einen apfel gegessen <eos>

2 <sos> Ich have einen apfel gegessen <eos>

3 <sos> Ich have einen apfel gegessen <eos>

4 <sos> Ich have einen apfel gegessen <eos>

5 <sos> Ich have einen apfel gegessen <eos>

6 <sos> Ich have einen apfel gegessen <eos>

7 <sos> Ich have einen apfel gegessen <eos>

Mask the available attention values ?


95
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> - - - - - -

2 <sos> Ich - - - - -

3 <sos> Ich have - - - -

4 <sos> Ich have einen - - -

5 <sos> Ich have einen apfel - -

6 <sos> Ich have einen apfel gegessen -

7 <sos> Ich have einen apfel gegessen <eos>

96
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> - - - - - -

2 <sos> Ich - - - - -

3 <sos> Ich have - - - -

4 <sos> Ich have einen - - -

5 <sos> Ich have einen apfel - -

6 <sos> Ich have einen apfel gegessen -

7 <sos> Ich have einen apfel gegessen <eos>

Softmax -> 0.- -> 0


97
Masked Multi Head Attention
Masked Multi Head Attention

QKT Attention Mask: M Masked Attention

Masked Multi Head Attention : Z’


98
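A minimal sketch of the attention mask M: entries corresponding to future timesteps are set to -∞ before the softmax, so their attention weights become 0 and position T only attends to positions up to T. The sequence length and random tensors are illustrative.

import torch

T, d_k = 7, 16                                   # 7 target tokens: <sos> ... <eos>
Q = torch.randn(T, d_k); K = torch.randn(T, d_k); V = torch.randn(T, d_k)

scores = Q @ K.T / d_k ** 0.5                    # (T, T)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf")) # future positions -> -inf
weights = torch.softmax(scores, dim=-1)          # -inf -> 0 after softmax
Z_masked = weights @ V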
Masked Multi Head Attention
Masked Multi Head Attention

Masked Attention Values

Masked Multi Head Attention : Z’


99
Encoder Decoder Attention

Encoder Decoder Attention ?

Add & Norm

Input Norm(Z’) 100


Encoder Decoder Attention

Encoder Decoder Attention ?

101
Encoder Decoder Attention

Encoder Self Attention

1. Queries from Encoder Inputs


2. Keys from Encoder Inputs
3. Values from Encoder Inputs

Decoder Masked Self Attention

1. Queries from Decoder Inputs


2. Keys from Decoder Inputs
3. Values from Decoder Inputs
102
Attention

{Key, Value store}

{Query: “Order details of order_104”}

{Query: “Order details of order_106”}

103
Encoder Decoder Attention

Encoder Decoder

Keys from Encoder Outputs Queries from Decoder Inputs


Values from Encoder Outputs

NOTE: Every decoder block receives the same FINAL encoder output

104
Encoder Decoder Attention

Z’’ = softmax( Qd KeT / √dmodel ) Ve

softmax( Qd KeT / √dmodel )

Qd KeT
105
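A minimal sketch of encoder-decoder (cross) attention following the Z’’ formula above: queries come from the decoder states, keys and values from the final encoder output. The shapes and the separate projection layers are illustrative assumptions.

import torch
import torch.nn as nn

d_model, T_enc, T_dec = 16, 5, 7
encoder_output = torch.randn(1, T_enc, d_model)   # same final encoder output for every decoder block
decoder_states = torch.randn(1, T_dec, d_model)   # output of masked self-attention + Add & Norm

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Qd = W_q(decoder_states)                          # queries from the decoder
Ke = W_k(encoder_output)                          # keys from the encoder
Ve = W_v(encoder_output)                          # values from the encoder

scores = Qd @ Ke.transpose(-2, -1) / d_model ** 0.5   # (1, T_dec, T_enc)
Z_cross = torch.softmax(scores, dim=-1) @ Ve          # (1, T_dec, d_model)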
Encoder Decoder Attention

• Non Linearity
• Complex Relationships
• Learn from each other

Feed Forward

Residuals

Add & Norm   Decoder Self Attn   Norm(Z’’) 106


Decoder

DECODER

107
Decoder

DECODER

.
.
.

DECODER
Decoder output

DECODER

108
Linear
Linear weights are often tied with
input embedding matrix

softmax

Final Decoder Output Linear


109
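A minimal sketch of the final linear + softmax step, with the linear weights tied to the input embedding matrix as noted on the slide; the vocabulary size, d_model, and random decoder states are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 16
embedding = nn.Embedding(vocab_size, d_model)

decoder_output = torch.randn(1, 7, d_model)      # final decoder hidden states

# Weight tying: reuse the embedding matrix as the output projection.
logits = decoder_output @ embedding.weight.T     # (1, 7, vocab_size)
probs = F.softmax(logits, dim=-1)                # output probabilities per position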
Softmax

Output Probabilities

Td

110
Poll 2 (@1297)

Which of the following are true about transformers?

a. Transformers can always be run in parallel


b. Transformer decoders can only be parallelized during training
c. Positional encodings help parallelize the transformer encoder
d. Queries, keys, and values are obtained by splitting the input into 3 equal segments
e. Multiheaded attention helps transformers find different kinds of relations between the tokens
f. During decoding, decoder outputs function as queries and keys while the values come from the encoder
Poll 2 (@1126)

Which of the following are true about transformers?

a. Transformers can always be run in parallel


b. Transformer decoders can only be parallelized during training
c. Positional encodings help parallelize the transformer encoder
d. Queries, keys, and values are obtained by splitting the input into 3 equal segments
e. Multiheaded attention helps transformers find different kinds of relations between the tokens
f. During decoding, decoder outputs function as queries and keys while the values come from the encoder
Transformers

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

Machine Translation
113
Transformers

ü Tokenization ü Attention

ü Input Embeddings ü Self Attention

ü Position Encodings ü Multi Head Attention

ü Residuals ü Masked Attention

ü Query ü Encoder Decoder Attention

ü Key ü Output Probabilities / Logits

ü Value ü Softmax

ü Add & Norm • Encoder-Decoder models

ü Encoder • Decoder only models

ü Decoder

114
Part 2

LLM

109
Transformers, mid-2017

110
Transformers, mid-2017

Representation

111
Transformers, mid-2017

Representation Generation

112
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states

Generation:
• Input – output tokens and hidden states*
• Output – output tokens

113
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps

114
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps
• Does not usually output tokens, so no inherent auto-regressivity

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps
• Model is auto-regressive with previous timesteps’ outputs

115
Transformers, mid-2017
Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps
• Does not usually output tokens, so no inherent auto-regressivity
• Can also be adapted to generate tokens by appending a module that maps hidden state dimensionality to vocab size

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps
• Model is auto-regressive with previous timesteps’ outputs
• Can also be adapted to generate hidden states by looking before token outputs

116
2018 – The Inception of the LLM Era

BERT GPT
Oct 2018 Jun 2018

Representation Generation

117
BERT - Bidirectional Encoder Representations

• One of the biggest challenges in LM-building used to


be the lack of task-specific training data.

• What if we learn an effective representation that can


be applied to a variety of downstream tasks?
• Word2vec (2013)
• GloVe (2014)

118
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

119
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

BERT Pre-Training Tasks:


• MLM (Masked Language Modeling)
• NSP (Next Sentence Prediction)

120
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

BERT Pre-Training Tasks:


• MLM (Masked Language Modeling)
• NSP (Next Sentence Prediction)

BERT Pre-Training Results:


• BERT-Base – 110M Params
• BERT-Large – 340M Params

121
BERT - Bidirectional Encoder Representations
MLM (Masked Language Modeling)

you 60%
they 20%
Prediction
head … …

<CLS> How are <MASK> doing today <SEP>

BERT

<CLS> How are <MASK> doing today <SEP>

122
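A minimal sketch of building MLM training examples as in the figure above: a fraction of token positions is replaced by a <MASK> id and the model is trained to predict the original ids only at those positions. The 15% masking rate is BERT's published setup; the id values, the helper name, and skipping BERT's 80/10/10 replacement split are simplifying assumptions.

import torch

def make_mlm_batch(token_ids, mask_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer tensor.
    inputs = token_ids.clone()
    labels = token_ids.clone()
    masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    inputs[masked] = mask_id   # e.g. "How are <MASK> doing today"
    labels[~masked] = -100     # ignored by cross-entropy at unmasked positions
    return inputs, labels

ids = torch.randint(5, 100, (2, 8))
inputs, labels = make_mlm_batch(ids, mask_id=4)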
BERT - Bidirectional Encoder Representations
NSP (Next Sentence Prediction)

is_next 95%

Prediction not_next 5%
head

<CLS> … … <SEP> … … <SEP>

BERT

<CLS> … … <SEP> … … <SEP>

123
BERT - Bidirectional Encoder Representations

BERT Fine-Tuning:

• Simply add a task-specific module after the last


encoder layer to map it to the desired dimension.

• Classification Tasks:
• Add a feed-forward layer on top of the encoder
output for the [CLS] token
• Question Answering Tasks:
• Train two extra vectors to mark the beginning
and end of answer from paragraph
• …

124
BERT - Bidirectional Encoder Representations

BERT Evaluation:

• General Language Understanding Evaluation (GLUE)


• Sentence pair tasks
• Single sentence classification

• Stanford Question Answering Dataset (SQuAD)

125
BERT - Bidirectional Encoder Representations

BERT Evaluation:

126
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

127
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

• Different NLP tasks seem to be highly transferable


with each other...
• As long as we have effective representations, that
seems to form a general model which can serve as
the backbone for many specialized models.

128
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

• Different NLP tasks seem to be highly transferable


with each other...
• As long as we have effective representations, that
seems to form a general model which can serve as
the backbone for many specialized models.

• And scaling works!!!


• 340M is considered large in 2018
129
2018 – The Inception of the LLM Era

BERT GPT
Oct 2018 Jun 2018

Representation Generation

130
GPT – Generative Pretrained Transformer

• Similarly motivated as BERT, though differently designed

• Can we leverage large amounts of unlabeled data to


pretrain an LM that understands general patterns?

131
GPT – Generative Pretrained Transformer

GPT Pre-Training Corpus:


• Similarly, BooksCorpus and English Wikipedia

GPT Pre-Training Tasks:


• Predict the next token, given the previous tokens
• More learning signals than MLM

GPT Pre-Training Results:


• GPT – 117M Params
• Similarly competitive on GLUE and SQuAD

132
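A minimal sketch of the next-token-prediction objective: the targets are the inputs shifted by one position, so every position provides a learning signal. Here model is a hypothetical stand-in for a decoder-only LM that returns logits of shape (batch, seq, vocab).

import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len); model(token_ids) -> (batch, seq_len, vocab_size)
    logits = model(token_ids[:, :-1])   # predict from all but the last token...
    targets = token_ids[:, 1:]          # ...the very next token at each position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))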
GPT – Generative Pretrained Transformer
GPT Fine-Tuning:
• Prompt-format task-specific text as a continuous
stream for the model to fit
Summarization:
Summarize this article:
The summary is:

QA:
Answer the question based on the context.
Context:
Question:
Answer:
133
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

134
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

• Language Model as a Knowledge Base


• Specifically, a generatively pretrained model seems
to have a decent zero-shot performance on a range
of NLP tasks.

135
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

• Language Model as a Knowledge Base


• Specifically, a generatively pretrained model seems
to have a decent zero-shot performance on a range
of NLP tasks.

• And scaling works!!!

136
Poll
Piazza @1291

The original GPT’s parameter count is closest to…

A. 117
B. 117K
C. 117M
D. 117B
Poll
Piazza @1291

The original GPT’s parameter count is closest to…

A. 117
B. 117K
C. 117M
D. 117B
The LLM Era – Paradigm Shift in Machine Learning

BERT GPT
Oct 2018 Jun 2018

Representation Generation

139
The LLM Era – Paradigm Shift in Machine Learning

GPT – 2018
BERT – 2018 GPT-2 – 2019
DistilBERT – 2019 GPT-3 – 2020
RoBERTa – 2019 GPT-Neo – 2021
ALBERT – 2019 GPT-3.5 (ChatGPT) – 2022
ELECTRA – 2020 LLaMA – 2023
DeBERTa – 2020 T5 – 2019 GPT-4 – 2023
… BART – 2019 …
mT5 – 2021
Representation … Generation

140
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?

141
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?

142
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?
• Transfer Learning
• Given scarce labeled data, how do we
transfer knowledge from other domains?

143
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?
• Transfer Learning
• Given scarce labeled data, how do we
transfer knowledge from other domains?
• Overfitting vs Generalization
• How do we balance complexity and
capacity to prevent overfitting while
maintaining good performance?
144
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
145
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
146
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
• Prompting
  • How do we make models understand their task simply by describing it in natural language?
147
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
• Prompting
  • How do we make models understand their task simply by describing it in natural language?
• Interpretability and Explainability
  • How can we understand the inner workings of our own models?
148
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

149
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

• Problem in recurrent networks


• Information is effectively lost during encoding of long sequences
• Sequential nature disables parallel training and favors late timestep inputs

150
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

• Problem in recurrent networks


• Information is effectively lost during encoding of long sequences
• Sequential nature disables parallel training and favors late timestep inputs

• Solution: Attention mechanism


• Handling long-range dependencies
• Parallel training
• Dynamic attention weights based on inputs

151
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

152
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??

153
Poll
Piazza @1292

What might be a flaw of our current Transformer-based LLMs?

Freeform response
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??


• True understanding of the material vs. memorization and pattern-matching
• Cannot reliably follow rules – factual hallucination, e.g., inability to do arithmetic

155
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??


• True understanding of the material vs. memorization and pattern-matching
• Cannot reliably follow rules – factual hallucination, e.g., inability to do arithmetic

• Solution: ???

156
Looking Back

It is true that language models are just programmed to predict the next token. But that
is not as simple as you might think.

In fact, all animals, including us, are just programmed to survive and reproduce, and yet
amazingly complex and beautiful stuff comes from it.

- Sam Altman*
*Paraphrased by IDL TAs

157
