
Introduction to Deep Learning

Lecture 19
Transformers and LLMs
Shikhar Agnihotri Liangze Li

11-785, Fall 2023

1
Part 1

Transformers

2
Transformers

• Tokenization • Attention

• Input Embeddings • Self Attention

• Position Encodings • Multi Head Attention

• Residuals • Masked Attention

• Query • Encoder Decoder Attention

• Key • Output Probabilities / Logits

• Value • Softmax

• Add & Norm • Encoder-Decoder models

• Encoder • Decoder only models

• Decoder

3
Transformers

• Tokenization • Attention

• Input Embeddings • Self Attention

• Position Encodings • Multi Head Attention

• Residuals • Masked Attention

• Query • Encoder Decoder Attention

• Key • Output Probabilities / Logits

• Value • Softmax

• Add & Norm • Encoder-Decoder models

• Encoder • Decoder only models

• Decoder

4
Machine Translation

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

5
Inputs

Processing Inputs

Inputs

I ate an apple

6
Inputs

I ate an apple <eos>

Tokenizer

I ate an apple

Generate Input Embeddings


7
Inputs
dmodel

Embedding Layer

I ate an apple <eos>

Tokenizer

I ate an apple

Generate Input Embeddings


8
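To make the tokenizer -> embedding-layer pipeline on this slide concrete, here is a minimal PyTorch sketch; the toy whitespace tokenizer, vocabulary, token ids, and d_model value are illustrative assumptions, not the lecture's actual setup.

import torch
import torch.nn as nn

# Toy whitespace "tokenizer" and vocabulary (illustrative assumption).
vocab = {"<eos>": 0, "I": 1, "ate": 2, "an": 3, "apple": 4}
sentence = "I ate an apple"
token_ids = torch.tensor([[vocab[w] for w in sentence.split()] + [vocab["<eos>"]]])  # (1, 5)

d_model = 8  # embedding width; real models use e.g. 512
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

input_embeddings = embedding(token_ids)   # (1, 5, d_model)
print(input_embeddings.shape)             # torch.Size([1, 5, 8])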
Encoder

WHERE IS THE CONTEXT?

I ate an apple <eos>

9
Encoder

BLACK BOX
OF SORTS

I ate an apple <eos>


10
Encoder

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


11
Encoder
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


12
Encoder ⍺[ i j ] ?
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


13
Encoder ⍺[ i j ] ? ∑ ∏ ?
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


14
Attention

⍺[ i j ] ?
From lecture 18:

15
Attention

⍺[ i j ] ?
From lecture 18:

• Query
• Key
• Value

16
Query, Key & Value

Database
{Key, Value store}

17
Query, Key & Value

Database
{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

18
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

19
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

20
Query, Key & Value

{Key, Value store}

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

21
Query, Key & Value

Done at the same time!!

{Key, Value store}
{Query: “Order details of order_104”}
OR
{Query: “Order details of order_106”}

22
Query, Key & Value

{Query: “Order details of order_104”}


OR
{Query: “Order details of order_106”}

Query:
1. Search for info

Key:
1. Interacts directly with Queries
2. Distinguishes one object from another
3. Identify which object is the most relevant and by how much

Value:
1. Actual details of the object
2. More fine grained

23
Attention

Query Key Value Store Key Value

24
Attention

Query Key Value Store Key Value

25
Attention

Done at the same time!!

Query Key Value Store Key Value

26
Attention

Parallelizable!!!

Query Key Value Store Key Value

Q    QKT    softmax( QKT / √dk )    softmax( QKT / √dk ) V
27
Attention

Parallelizable!!!

Attention Filter

Query Key Value Store Key Value

Q    QKT    softmax( QKT / √dk )    softmax( QKT / √dk ) V
28
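A minimal sketch of the scaled dot-product attention formula on this slide, softmax(QKT / √dk) V, written in PyTorch; the function name, tensor shapes, and random inputs are illustrative.

import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k); all rows are processed in parallel.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)         # the "attention filter"
    return weights @ V                              # contextualized outputs

Q = torch.randn(5, 16); K = torch.randn(5, 16); V = torch.randn(5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([5, 16])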
Attention

I1 I2 I3 I4 I5 29

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 30

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 31

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1

softmax

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 32

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗

softmax

e1,1


Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 33

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗

softmax

e1,1 e1,2

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 34

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗

softmax

e1,1 e1,2 e1,3

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 35

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗

softmax

e1,1 e1,2 e1,3 e1,4


⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 36

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 37

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 38

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 39

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

softmax

e1,1 e1,2 e1,3 e1,4 e1,5




⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 40

I ate an apple <eos>


Attention
(Dimensions across Q, K, V have been dropped for brevity)

Z1 : contextually rich embedding

∑

α1,1 ⊗ α1,2 ⊗ α1,3 ⊗ α1,4 ⊗ α1,5 ⊗

Parallelized!

softmax

e1,1 e1,2 e1,3 e1,4 e1,5

⊗ ⊗

Q1 K1 V1 Q2 K2 V2 Q3 K3 V3 Q4 K4 V4 Q5 K5 V5

WQ WK WV WQ WK WV WQ WK WV WQ WK WV WQ WK WV

I1 I2 I3 I4 I5 41

I ate an apple <eos>


Poll 1 @1296

Which of the following are true about attention? (Select all that apply)

a. To calculate attention weights for input I2, you would use key k2, and
all queries
b. To calculate attention weights for input I2, you would use query q2,
and all keys
c. We scale the QKT product to bring attention weights in the range of
[0,1]
d. We scale the QKT product to allow for numerical stability

42
Poll 1 @1296

Which of the following are true about attention? (Select all that apply)

a. To calculate attention weights for input I2, you would use key k2, and
all queries
b. To calculate attention weights for input I2, you would use query q2,
and all keys
c. We scale the QKT product to bring attention weights in the range of
[0,1]
d. We scale the QKT product to allow for numerical stability

43
Positional Encoding

I ate an apple <eos>

44
Positional Encoding

I ate an apple <eos>

apple ate an I <eos>

Positional Encoding
45
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Positional Encoding
46
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
47
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic

Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
48
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :
"!"# = "! + ∆%
"!"# = & $! ∆%

"!"# = "!. !∆%

Positional Encoding
49
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)

Positional Encoding
50
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)


M?
1. Should be a unitary matrix
2. Magnitudes of eigenvalues should be 1 -> norm preserving

Positional Encoding
51
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Possible Candidates :

P(t + t’) = Mt’ x P(t)


M
1. The matrix can be learnt
2. Produces unique rotated embeddings each time

Positional Encoding
52
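As an illustration of the norm-preserving property above, a 2x2 rotation matrix is one such M: it is unitary, its eigenvalues have magnitude 1, so P(t + t') = M_t' · P(t) rotates the encoding without changing its length. This is an illustrative sketch of the property, not the lecture's exact construction; the angle and vector are arbitrary.

import math
import torch

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])  # unitary: M.T @ M = I, |eigenvalues| = 1

P_t = torch.tensor([0.6, 0.8])   # some 2-D positional encoding, norm 1
M = rotation(0.3)                # "advance by t' = 0.3"
P_next = M @ P_t                 # P(t + t') = M_t' P(t)

print(torch.linalg.norm(P_t), torch.linalg.norm(P_next))  # both ≈ 1.0 (norm preserved)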
Rotary Positional
Embedding

REF: Rotary Positional Embeddings !


53
Positional Encoding

Requirements for Positional Encodings


• Some representation of time ? (like seq2seq ?)
• Should be unique for each position – not cyclic
• Bounded
Actual Candidates :
sine(g(t))
cosine(g(t))

Positional Encoding
54
Positional Encoding

Requirements for g(t)


• Must have same dimensions as input embeddings
• Must produce overall unique encodings

pos -> idx of the token in input sentence


i -> ith dimension out of d

Positional Encoding
55
Positional Encoding

Requirements for g(t)


• Must have same dimensions as input embeddings
• Must produce overall unique encodings

pos -> idx of the token in input sentence


i -> ith dimension out of d

Positional Encoding
56
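A minimal sketch of a sinusoidal g(t) satisfying the requirements above (same dimensionality as the input embeddings, unique overall encodings). The pos / 10000^(2i/d) form is the one from "Attention Is All You Need"; max_len and d_model below are illustrative.

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1): idx of the token
    i = torch.arange(0, d_model, 2).float()            # even dimension indices out of d
    angle = pos / (10000 ** (i / d_model))             # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dims use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dims use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
final_input = torch.randn(1, 50, 8) + pe   # input embeddings + position encodings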
Positional Encoding
Final Input Embeddings

Position Encodings

Input Embeddings

Embedding Layer

I ate an apple <eos> Tokens

Tokenizer

I ate an apple
Input 57
Encoder ⍺[ i j ] ∑
CONTEXTUALLY RICH EMBEDDINGS

BLACK BOX
OF SORTS

LEARN TO
ADD
CONTEXT

I ate an apple <eos>


58
Self Attention

From lecture 18:

59
Self Attention

The animal didn’t cross the street because it was too wide

60
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution ?

61
Self Attention

62
Self Attention

63
Self Attention

64
Self Attention

SELF

Query Inputs = Key Inputs = Value Inputs

65
Self Attention

WQ WK Wv

Input Embeddings
66
Self Attention

Input Embeddings · WQ -> Q Projections

Input Embeddings · WK -> K Projections

Input Embeddings · WV -> V Projections

67
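A minimal sketch of forming the Q, K, and V projections from the same input embeddings with learned matrices WQ, WK, WV (self-attention: query inputs = key inputs = value inputs); the sizes and random embeddings are illustrative.

import torch
import torch.nn as nn

d_model = 16
W_Q = nn.Linear(d_model, d_model, bias=False)   # WQ
W_K = nn.Linear(d_model, d_model, bias=False)   # WK
W_V = nn.Linear(d_model, d_model, bias=False)   # WV

X = torch.randn(5, d_model)        # input embeddings for "I ate an apple <eos>"
Q, K, V = W_Q(X), W_K(X), W_V(X)   # all three projections come from the same inputs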
Self Attention

softmax( QProjection KProjectionT / √dmodel )

68
Self Attention
softmax( QProjection KProjectionT / √dmodel )

69
Self Attention
softmax( QProjection KProjectionT / √dmodel ) VProjection

70
Self Attention

Attention: Z

71
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution

72
Self Attention

The animal didn’t cross the street because it was too wide

coreference resolution

73
Self Attention

WQ WK Wv

Input Embeddings
74
Multi-Head Attention

Heads 1, 2, … H

WQ1, WQ2, … WQH    WK1, WK2, … WKH    WV1, WV2, … WVH

Input Embeddings
75
Multi-Head Attention

For each head i ∈ [1, H]:

Inputs · WQi -> Qi

Inputs · WKi -> Ki

Inputs · WVi -> Vi

76
Multi-Head Attention

Zi = softmax( Qi KiT / √dmodel ) Vi

for all i ∈ [1, h]


77
Multi-Head Attention

Z1 Z2 Zh

CONCAT

Multi Head Attention : Z


78
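A minimal sketch of multi-head attention as described above: per-head projections, scaled dot-product attention per head giving Z1 … Zh, then CONCAT and an output projection; the head count, sizes, and class name are illustrative assumptions.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=16, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # applied after CONCAT

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        def split(z):                           # -> (batch, heads, seq, d_head)
            return z.view(b, t, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        z = torch.softmax(scores, dim=-1) @ v   # one Zi per head
        z = z.transpose(1, 2).reshape(b, t, self.h * self.d_head)  # CONCAT
        return self.W_o(z)

out = MultiHeadSelfAttention()(torch.randn(1, 5, 16))   # (1, 5, 16)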
Multi-Head Attention

The animal didn’t cross the street because it was too wide

Sentence boundaries ? coreference resolution Context ?

Semantic relationships ? Part of Speech ? Comparisons ?


79
Add & Norm

Input    Norm(Z)

Normalization(Z):
• Mean 0, Std dev 1
• Stabilizes training
• Regularization effect

Add -> Residuals:
• Avoid vanishing gradients
• Train deeper networks

80
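A minimal sketch of the Add & Norm step: the sub-layer output Z is added back to its input (the residual) and layer-normalized. The exact placement of the normalization varies between implementations, so treat this as one common post-norm variant with illustrative shapes.

import torch
import torch.nn as nn

d_model = 16
layer_norm = nn.LayerNorm(d_model)   # per-token mean 0, std 1 (with learnable scale/shift)

x = torch.randn(1, 5, d_model)       # sub-layer input
z = torch.randn(1, 5, d_model)       # sub-layer output (e.g. attention Z)

out = layer_norm(x + z)              # Add (residual) & Norm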
Feed Forward

Feed Forward
• Non Linearity
• Complex Relationships
• Learn from each other

Feed Forward

Residuals

Input Norm(Z) 81
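A minimal sketch of the position-wise feed-forward block referenced above: two linear layers with a non-linearity, applied to each token independently, wrapped in a residual. The 4x hidden width is the common choice from the original Transformer paper, used here as an assumption.

import torch
import torch.nn as nn

d_model, d_ff = 16, 64     # d_ff = 4 * d_model is the usual choice

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),             # non-linearity -> more complex relationships
    nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 5, d_model)
out = x + feed_forward(x)  # residual around the feed-forward sub-layer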
Add & Norm
Add & Norm

Feed Forward

Residuals

Input Norm(Z) 82
Encoders
Encoder

ENCODER

83
Encoders
Encoder

ENCODER

.
.
.
Input to Encoderi+1

ENCODER

Output from Encoderi


ENCODER

84
Transformers

ü Tokenization ü Attention

ü Input Embeddings ü Self Attention

ü Position Encodings ü Multi Head Attention

ü Residuals • Masked Attention

ü Query • Encoder Decoder Attention

ü Key • Output Probabilities / Logits

ü Value • Softmax

ü Add & Norm • Encoder-Decoder models

ü Encoder • Decoder only models

• Decoder

85
Machine Translation

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

86
Targets
Targets

Ich habe einen Apfel gegessen

87
Targets

Embedding Layer + Positional Encoding

<sos> Ich have einen apfel gegessen <eos>

Tokenizer

Ich have einen apfel gegessen

Generate Target Embeddings


88
Masked Multi Head Attention

<sos> Ich have einen apfel gegessen <eos>

89
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Inference
1 <sos>

2 <sos> Ich

3 <sos> Ich have

4 <sos> Ich have einen

5 <sos> Ich have einen apfel

6 <sos> Ich have einen apfel gegessen

7 <sos> Ich have einen apfel gegessen <eos>

90
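A minimal sketch of the step-by-step decoding loop above at inference time: starting from <sos>, the decoder repeatedly consumes its own previous outputs until it emits <eos>. Here decode_step is a hypothetical stand-in for the full decoder + linear + softmax stack, and greedy (argmax) selection is assumed.

import torch

def greedy_decode(decode_step, sos_id, eos_id, max_len=50):
    # decode_step(prefix_ids) -> logits over the vocabulary for the next token.
    # It is a hypothetical placeholder for the decoder + linear + softmax stack.
    output = [sos_id]
    for _ in range(max_len):
        logits = decode_step(torch.tensor([output]))   # (1, vocab_size)
        next_id = int(logits.argmax(dim=-1))           # greedy choice
        output.append(next_id)
        if next_id == eos_id:
            break
    return output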
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Inference
1 <sos>

2 <sos> Ich

ed ?
liz
lle
3 <sos> Ich have

ra
Pa
4 <sos> Ich have einen

5 <sos> Ich have einen apfel

6 <sos> Ich have einen apfel gegessen

7 <sos> Ich have einen apfel gegessen <eos>

91
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Training

<sos> Ich have einen apfel gegessen <eos>

92
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

Training

<sos> Ich have einen apfel gegessen <eos>

Outputs at time T should only pay attention to outputs


until time T-1

93
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> Ich have einen apfel gegessen <eos>

2 <sos> Ich have einen apfel gegessen <eos>

3 <sos> Ich have einen apfel gegessen <eos>

4 <sos> Ich have einen apfel gegessen <eos>

5 <sos> Ich have einen apfel gegessen <eos>

6 <sos> Ich have einen apfel gegessen <eos>

7 <sos> Ich have einen apfel gegessen <eos>

94
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> Ich have einen apfel gegessen <eos>

2 <sos> Ich have einen apfel gegessen <eos>

3 <sos> Ich have einen apfel gegessen <eos>

4 <sos> Ich have einen apfel gegessen <eos>

5 <sos> Ich have einen apfel gegessen <eos>

6 <sos> Ich have einen apfel gegessen <eos>

7 <sos> Ich have einen apfel gegessen <eos>

Mask the available attention values ?


95
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> - - - - - -

2 <sos> Ich - - - - -

3 <sos> Ich have - - - -

4 <sos> Ich have einen - - -

5 <sos> Ich have einen apfel - -

6 <sos> Ich have einen apfel gegessen -

7 <sos> Ich have einen apfel gegessen <eos>

96
Masked Multi Head Attention
Decoding step by step (using Teacher Forcing)

1 <sos> - - - - - -

2 <sos> Ich - - - - -

3 <sos> Ich have - - - -

4 <sos> Ich have einen - - -

5 <sos> Ich have einen apfel - -

6 <sos> Ich have einen apfel gegessen -

7 <sos> Ich have einen apfel gegessen <eos>

Softmax -> 0.- -> 0


97
Masked Multi Head Attention
Masked Multi Head Attention

QKT Attention Mask: M Masked Attention

Masked Multi Head Attention : Z’


98
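A minimal sketch of the attention mask M: entries corresponding to future timesteps are set to -∞ before the softmax, so their attention weights become 0 and position T only attends to positions up to T. The sequence length and random tensors are illustrative.

import torch

T, d_k = 7, 16                                   # 7 target tokens: <sos> ... <eos>
Q = torch.randn(T, d_k); K = torch.randn(T, d_k); V = torch.randn(T, d_k)

scores = Q @ K.T / d_k ** 0.5                    # (T, T)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf")) # future positions -> -inf
weights = torch.softmax(scores, dim=-1)          # -inf -> 0 after softmax
Z_masked = weights @ V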
Masked Multi Head Attention
Masked Multi Head Attention

Masked Attention Values

Masked Multi Head Attention : Z’


99
Encoder Decoder Attention

Encoder Decoder Attention ?

Add & Norm

Input Norm(Z’) 100


Encoder Decoder Attention

Encoder Decoder Attention ?

101
Encoder Decoder Attention

Encoder Self Attention

1. Queries from Encoder Inputs


2. Keys from Encoder Inputs
3. Values from Encoder Inputs

Decoder Masked Self Attention

1. Queries from Decoder Inputs


2. Keys from Decoder Inputs
3. Values from Decoder Inputs
102
Attention

{Key, Value store}

{Query: “Order details of order_104”}

{Query: “Order details of order_106”}

103
Encoder Decoder Attention

Encoder Decoder

Keys from Encoder Outputs Queries from Decoder Inputs


Values from Encoder Outputs

NOTE: Every decoder block receives the same FINAL encoder output

104
Encoder Decoder Attention

Z’’ = softmax( Qd KeT / √dmodel ) Ve

softmax( Qd KeT / √dmodel )

Qd KeT
105
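A minimal sketch of encoder-decoder (cross) attention following the Z’’ formula above: queries come from the decoder states, keys and values from the final encoder output. The shapes and the separate projection layers are illustrative assumptions.

import torch
import torch.nn as nn

d_model, T_enc, T_dec = 16, 5, 7
encoder_output = torch.randn(1, T_enc, d_model)   # same final encoder output for every decoder block
decoder_states = torch.randn(1, T_dec, d_model)   # output of masked self-attention + Add & Norm

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Qd = W_q(decoder_states)                          # queries from the decoder
Ke = W_k(encoder_output)                          # keys from the encoder
Ve = W_v(encoder_output)                          # values from the encoder

scores = Qd @ Ke.transpose(-2, -1) / d_model ** 0.5   # (1, T_dec, T_enc)
Z_cross = torch.softmax(scores, dim=-1) @ Ve          # (1, T_dec, d_model)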
Encoder Decoder Attention

• Non Linearity
• Complex Relationships
• Learn from each other

Feed Forward

Residuals

Add & Norm   Decoder Self Attn   Norm(Z’’) 106


Decoder

DECODER

107
Decoder

DECODER

.
.
.

DECODER
Decoder output

DECODER

108
Linear
Linear weights are often tied with
input embedding matrix

softmax

Final Decoder Output Linear


109
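A minimal sketch of the final linear + softmax step, with the linear weights tied to the input embedding matrix as noted on the slide; the vocabulary size, d_model, and random decoder states are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 16
embedding = nn.Embedding(vocab_size, d_model)

decoder_output = torch.randn(1, 7, d_model)      # final decoder hidden states

# Weight tying: reuse the embedding matrix as the output projection.
logits = decoder_output @ embedding.weight.T     # (1, 7, vocab_size)
probs = F.softmax(logits, dim=-1)                # output probabilities per position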
Softmax

Output Probabilities

Td

110
Poll 2 (@1297)

Which of the following are true about transformers?

a. Transformers can always be run in parallel


b. Transformer decoders can only be parallelized during training
c. Positional encodings help parallelize the transformer encoder
d. Queries, keys, and values are obtained by splitting the input into 3 equal segments
e. Multiheaded attention helps transformers find different kinds of relations between the tokens
f. During decoding, decoder outputs function as queries and keys while the values come from the encoder
Poll 2 (@1126)

Which of the following are true about transformers?

a. Transformers can always be run in parallel


b. Transformer decoders can only be parallelized during training
c. Positional encodings help parallelize the transformer encoder
d. Queries, keys, and values are obtained by splitting the input into 3 equal segments
e. Multiheaded attention helps transformers find different kinds of relations between the tokens
f. During decoding, decoder outputs function as queries and keys while the values come from the encoder
Transformers

Targets

Ich habe einen Apfel gegessen

Inputs

I ate an apple

Machine Translation
113
Transformers

ü Tokenization ü Attention

ü Input Embeddings ü Self Attention

ü Position Encodings ü Multi Head Attention

ü Residuals ü Masked Attention

ü Query ü Encoder Decoder Attention

ü Key ü Output Probabilities / Logits

ü Value ü Softmax

ü Add & Norm • Encoder-Decoder models

ü Encoder • Decoder only models

ü Decoder

114
Part 2

LLM

109
Transformers, mid-2017

110
Transformers, mid-2017

Representation

111
Transformers, mid-2017

Representation Generation

112
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states

Generation:
• Input – output tokens and hidden states*
• Output – output tokens

113
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps

114
Transformers, mid-2017

Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps
• Does not usually output tokens, so no inherent auto-regressivity

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps
• Model is auto-regressive with previous timesteps’ outputs

115
Transformers, mid-2017
Representation:
• Input – input tokens
• Output – hidden states
• Model can see all timesteps
• Does not usually output tokens, so no inherent auto-regressivity
• Can also be adapted to generate tokens by appending a module that maps hidden state dimensionality to vocab size

Generation:
• Input – output tokens and hidden states*
• Output – output tokens
• Model can only see previous timesteps
• Model is auto-regressive with previous timesteps’ outputs
• Can also be adapted to generate hidden states by looking before token outputs

116
2018 – The Inception of the LLM Era

BERT GPT
Oct 2018 Jun 2018

Representation Generation

117
BERT - Bidirectional Encoder Representations

• One of the biggest challenges in LM-building used to


be the lack of task-specific training data.

• What if we learn an effective representation that can


be applied to a variety of downstream tasks?
• Word2vec (2013)
• GloVe (2014)

118
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

119
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

BERT Pre-Training Tasks:


• MLM (Masked Language Modeling)
• NSP (Next Sentence Prediction)

120
BERT - Bidirectional Encoder Representations

BERT Pre-Training Corpus:


• English Wikipedia - 2,500 million words
• Book Corpus - 800 million words

BERT Pre-Training Tasks:


• MLM (Masked Language Modeling)
• NSP (Next Sentence Prediction)

BERT Pre-Training Results:


• BERT-Base – 110M Params
• BERT-Large – 340M Params

121
BERT - Bidirectional Encoder Representations
MLM (Masked Language Modeling)

you 60%
they 20%
Prediction
head … …

<CLS> How are <MASK> doing today <SEP>

BERT

<CLS> How are <MASK> doing today <SEP>

122
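A minimal sketch of building MLM training examples as in the figure above: a fraction of token positions is replaced by a <MASK> id and the model is trained to predict the original ids only at those positions. The 15% masking rate is BERT's published setup; the id values, the helper name, and skipping BERT's 80/10/10 replacement split are simplifying assumptions.

import torch

def make_mlm_batch(token_ids, mask_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer tensor.
    inputs = token_ids.clone()
    labels = token_ids.clone()
    masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    inputs[masked] = mask_id   # e.g. "How are <MASK> doing today"
    labels[~masked] = -100     # ignored by cross-entropy at unmasked positions
    return inputs, labels

ids = torch.randint(5, 100, (2, 8))
inputs, labels = make_mlm_batch(ids, mask_id=4)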
BERT - Bidirectional Encoder Representations
NSP (Next Sentence Prediction)

is_next 95%

Prediction not_next 5%
head

<CLS> … … <SEP> … … <SEP>

BERT

<CLS> … … <SEP> … … <SEP>

123
BERT - Bidirectional Encoder Representations

BERT Fine-Tuning:

• Simply add a task-specific module after the last


encoder layer to map it to the desired dimension.

• Classification Tasks:
• Add a feed-forward layer on top of the encoder
output for the [CLS] token
• Question Answering Tasks:
• Train two extra vectors to mark the beginning
and end of answer from paragraph
• …

124
BERT - Bidirectional Encoder Representations

BERT Evaluation:

• General Language Understanding Evaluation (GLUE)


• Sentence pair tasks
• Single sentence classification

• Stanford Question Answering Dataset (SQuAD)

125
BERT - Bidirectional Encoder Representations

BERT Evaluation:

126
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

127
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

• Different NLP tasks seem to be highly transferable


with each other...
• As long as we have effective representations, that
seems to form a general model which can serve as
the backbone for many specialized models.

128
BERT - Bidirectional Encoder Representations
What is our takeaway from BERT?

• Pre-training tasks can be invented flexibly…


• Effective representations can be derived from a
flexible regime of pre-training tasks.

• Different NLP tasks seem to be highly transferable


with each other...
• As long as we have effective representations, that
seems to form a general model which can serve as
the backbone for many specialized models.

• And scaling works!!!


• 340M is considered large in 2018
129
2018 – The Inception of the LLM Era

BERT GPT
Oct 2018 Jun 2018

Representation Generation

130
GPT – Generative Pretrained Transformer

• Similarly motivated as BERT, though differently designed

• Can we leverage large amounts of unlabeled data to


pretrain an LM that understands general patterns?

131
GPT – Generative Pretrained Transformer

GPT Pre-Training Corpus:


• Similarly, BooksCorpus and English Wikipedia

GPT Pre-Training Tasks:


• Predict the next token, given the previous tokens
• More learning signals than MLM

GPT Pre-Training Results:


• GPT – 117M Params
• Similarly competitive on GLUE and SQuAD

132
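A minimal sketch of the next-token-prediction objective: the targets are the inputs shifted by one position, so every position provides a learning signal. Here model is a hypothetical stand-in for a decoder-only LM that returns logits of shape (batch, seq, vocab).

import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len); model(token_ids) -> (batch, seq_len, vocab_size)
    logits = model(token_ids[:, :-1])   # predict from all but the last token...
    targets = token_ids[:, 1:]          # ...the very next token at each position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))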
GPT – Generative Pretrained Transformer
GPT Fine-Tuning:
• Prompt-format task-specific text as a continuous
stream for the model to fit
Summarization:
Summarize this article:
The summary is:

QA:
Answer the question based on the context.
Context:
Question:
Answer:
133
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

134
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

• Language Model as a Knowledge Base


• Specifically, a generatively pretrained model seems
to have a decent zero-shot performance on a range
of NLP tasks.

135
GPT – Generative Pretrained Transformer
What is our takeaway from GPT?

• The Effectiveness of Self-Supervised Learning


• Specifically, the model seems to be able to learn
from generating the language itself, rather than
from any specific task we might cook up.

• Language Model as a Knowledge Base


• Specifically, a generatively pretrained model seems
to have a decent zero-shot performance on a range
of NLP tasks.

• And scaling works!!!

136
Poll
Piazza @1291

The original GPT’s parameter count is closest to…

A. 117
B. 117K
C. 117M
D. 117B
Poll
Piazza @1291

The original GPT’s parameter count is closest to…

A. 117
B. 117K
C. 117M
D. 117B
The LLM Era – Paradigm Shift in Machine Learning

BERT GPT
Oct 2018 Jun 2018

Representation Generation

139
The LLM Era – Paradigm Shift in Machine Learning

GPT – 2018
BERT – 2018 GPT-2 – 2019
DistilBERT – 2019 GPT-3 – 2020
RoBERTa – 2019 GPT-Neo – 2021
ALBERT – 2019 GPT-3.5 (ChatGPT) – 2022
ELECTRA – 2020 LLaMA – 2023
DeBERTa – 2020 T5 – 2019 GPT-4 – 2023
… BART – 2019 …
mT5 – 2021
Representation … Generation

140
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?

141
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?

142
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?
• Transfer Learning
• Given scarce labeled data, how do we
transfer knowledge from other domains?

143
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs Since LLMs

• Feature Engineering
• How do we design or select the best
features for a task?
• Model Selection
• Which model is best for which type of task?
• Transfer Learning
• Given scarce labeled data, how do we
transfer knowledge from other domains?
• Overfitting vs Generalization
• How do we balance complexity and
capacity to prevent overfitting while
maintaining good performance?
144
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
145
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
146
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
• Prompting
  • How do we make models understand their task simply by describing it in natural language?
147
The LLM Era – Paradigm Shift in Machine Learning
From both BERT and GPT, we learn that…
• Transformers seem to provide a new class of generalist models that are capable of
capturing knowledge which is more fundamental than task-specific abilities.
Before LLMs:
• Feature Engineering
  • How do we design or select the best features for a task?
• Model Selection
  • Which model is best for which type of task?
• Transfer Learning
  • Given scarce labeled data, how do we transfer knowledge from other domains?
• Overfitting vs Generalization
  • How do we balance complexity and capacity to prevent overfitting while maintaining good performance?

Since LLMs:
• Pre-training and Fine-tuning
  • How do we leverage large scales of unlabeled data out there previously under-leveraged?
• Zero-shot and Few-shot learning
  • How can we make models perform on tasks they are not trained on?
• Prompting
  • How do we make models understand their task simply by describing it in natural language?
• Interpretability and Explainability
  • How can we understand the inner workings of our own models?
148
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

149
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

• Problem in recurrent networks


• Information is effectively lost during encoding of long sequences
• Sequential nature disables parallel training and favors late timestep inputs

150
The LLM Era – Paradigm Shift in Machine Learning
• What has caused this paradigm shift?

• Problem in recurrent networks


• Information is effectively lost during encoding of long sequences
• Sequential nature disables parallel training and favors late timestep inputs

• Solution: Attention mechanism


• Handling long-range dependencies
• Parallel training
• Dynamic attention weights based on inputs

151
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

152
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??

153
Poll
Piazza @1292

What might be a flaw of our current Transformer-based LLMs?

Freeform response
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??


• True understanding of the material vs. memorization and pattern-matching
• Cannot reliably follow rules – factual hallucination, e.g., inability to do arithmetic

155
The LLM Era – Paradigm Shift in Machine Learning
• Attention and Transformer – is this the end?

• Problem in current Transformer-based LLMs??


• True understanding of the material vs. memorization and pattern-matching
• Cannot reliably follow rules – factual hallucination, e.g., inability to do arithmetic

• Solution: ???

156
Looking Back

It is true that language models are just programmed to predict the next token. But that
is not as simple as you might think.

In fact, all animals, including us, are just programmed to survive and reproduce, and yet
amazingly complex and beautiful stuff comes from it.

- Sam Altman*
*Paraphrased by IDL TAs

157
