Transformers Based Networks
Avinash Kak
Purdue University
So far you have seen two major architectural elements in the neural networks
meant for deep learning (DL): convolutional layers and recurrence layers. Until
recently, they were the primary reasons for the fame and glory that have been bestowed on DL.
But now we have another element: attention layers.
That difficult problems could be solved with neural networks through purely
attention based logic — that is, without convolutions and recurrence — was first
revealed in the paper "Attention is All You Need" by Vaswani et al. that you can
access here:
https://arxiv.org/pdf/1706.03762.pdf
To see the need for cross-attention, consider the fact that in the
English-to-Spanish translation example mentioned previously, the Spanish word
“averiguar” has several nuances in what it means: it can stand for “to discover”,
“to figure out”, “to find out”, etc.
With cross-attention, during the training phase, the neural network would learn
that when the context in the English sentence is “friend”, it would be appropriate
to use “averiguar” for the translation because one of its meanings is “to find out.”
Along the same lines, in English-to-Spanish translation, the English word
"running" would ordinarily be translated into the gerund "corriendo" in Spanish;
however, on account of the context, that would not be appropriate here.
Purdue University 5
Preamble (contd.)
To continue with the example at the bottom of the previous slide, on account of
the context "car" and through the mechanism of cross-attention, the neural
network would learn that "running" is being used in the context of a "car
engine", implying that a more appropriate Spanish translation would be
based on the verb "funcionar".
In this lecture, I’ll be teaching purely-attention based learning with the following
three inner classes in the Transformers module of DLStudio:
TransformerFG
TransformerPreLN
visTransformer
The first two, meant for seq2seq learning, are only slightly different variants of
the same implementation. I have kept them separate for educational reasons. The
last one shows how to use self-attention in Transformers for solving image
recognition problems in computer vision.
Purdue University 6
Preamble (contd.)
The suffix “FG” in TransformerFG stands for “First Generation”. And the suffix
"PreLN" in TransformerPreLN stands for "Pre Layer Norm".
The TransformerFG implementation is based on the transformers as first envisioned
in the seminal paper "Attention is All You Need" by Vaswani et al.:
https://arxiv.org/pdf/1706.03762.pdf
The class, visTransformer, meant for solving image recognition problems, is based
on the paper “An Image is Worth 16 × 16 Words: Transformers for Image
Recognition at Scale” by Dosovitskiy et al.:
https://arxiv.org/pdf/2010.11929.pdf
All three Transformer classes mentioned above are defined in the module file
Transformers.py in DLStudio.
Purdue University 7
Preamble (contd.)
About the dataset I’ll be using to demonstrate Transformers for seq2seq learning,
DLStudio comes with the following data archive:
en_es_xformer_8_90000.tar.gz
In the name of the archive, the number 8 refers to the maximum number of words
in a sentence, which translates into sentences with a maximum length of 10 when
you include the SOS and EOS tokens at the two ends of a sentence. The number
90,000 indicates how many English-Spanish sentence pairs there are in the archive.
The following two scripts in the ExamplesTransformers directory of the distribution
are your main entry points for experimenting with the seq2seq Transformer code
in DLStudio:
seq2seq_with_transformerFG.py
seq2seq_with_transformerPreLN.py
For the image recognition class visTransformer, I’ll use the CIFAR-10 dataset that
you are already very familiar with. The following two scripts in the same
ExamplesTransformers directory as mentioned above are your main entry points for
playing with the vision related Transformer code in DLStudio:
image_recog_with_visTransformer.py
test_checkpoint_for_visTransformer.py
Purdue University 8
Preamble – How to Learn from These Slides
At your first reading of these slides, just focus on thoroughly understanding the
following three topics:
What we mean by attention, and the theory and implementation of the QKV attention
that transformers are based on. This is explained on Slides 12 through 19, for a total
of 7 slides.
Your next topic should be coming to grips with the notion of multi-headed attention and
its DLStudio implementation as explained on Slides 20 through 29, for a total of 10 slides.
Now jump to the end and spend some time on the vision transformer on Slides 85 through
95, for a total of 10 slides.
That makes for a total of 27 slides for your first reading. Note, however, that in
this lecture in particular, the rest of the material not included above is just
as important. That said, after you have understood the core concept of
what exactly is meant by Attention, you should be able to breeze through
the rest with relative ease.
Purdue University 9
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 10
The Basic Idea of Dot-Product Attention
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 11
The Basic Idea of Dot-Product Attention
q = w · W_q        k = w · W_k        v = w · W_v                                        (1)
Purdue University 14
The Basic Idea of Dot-Product Attention
With all the words of a sentence packed into the tensor X , we can set
things up so that the network learns all of the matrices Wq , Wk , and
Wv for all the words in a sentence simultaneously. We can therefore
visualize a triplet of learnable tensors (WQ , WK , WV ) whose different
axes would correspond to the individual-word (Wq , Wk , Wv ) matrices.
Purdue University 15
The Basic Idea of Dot-Product Attention
The tensor Q packs all the word-based query vectors into a single
data object. The tensor K does the same for the word-based key
vectors, and the tensor V for the value vectors.
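As a quick, self-contained illustration of these ideas (this is not DLStudio code), here is how the per-word q, k, v vectors of Eq. (1) and the resulting dot-product attention can be computed for a single sentence tensor in PyTorch. The sentence length Nw and the slice size sqkv below are illustrative values:

import torch
import torch.nn as nn

Nw, sqkv = 10, 64                                ## 10 words, 64 numbers per word embedding
X = torch.randn(1, Nw, sqkv)                     ## a batch containing one sentence tensor

WQ = nn.Linear(sqkv, sqkv, bias=False)           ## plays the role of W_q in Eq. (1)
WK = nn.Linear(sqkv, sqkv, bias=False)           ## plays the role of W_k
WV = nn.Linear(sqkv, sqkv, bias=False)           ## plays the role of W_v

Q, K, V = WQ(X), WK(X), WV(X)                    ## each of shape (1, Nw, sqkv)

## Q . K^T yields an (Nw, Nw) matrix of inter-word attention weights:
A = torch.softmax(Q @ K.transpose(1, 2) / sqkv ** 0.5, dim=-1)
Z = A @ V                                        ## attended output, shape (1, Nw, sqkv)
print(A.shape, Z.shape)                          ## torch.Size([1, 10, 10]) torch.Size([1, 10, 64])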
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 20
Multi-Headed Attention
Multi-Headed Attention
What I have described in the previous section is referred to as
Single-Headed Attention. As it turns out, single-headed attention is
not sufficiently rich in its representational power for capturing all the
needed inter-word dependencies in a sentence. The remedy is Multi-Headed
Attention, in which the embedding axis of the input is divided among several
Attention Heads, each applying its own QKV attention to its portion.
That is, each Attention Head gets to focus on a slice along the
embedding dimension of the input sentence tensor.
For reasons that I’ll make clear later, I’ll denote the size of the
embedding slice given to each Attention Head by the same notation
sqkv that you saw earlier.
Purdue University 21
Multi-Headed Attention
Figure: Correction: In the upper part of the figure, read Z_K as Z_{N_H}. And, in the middle of the figure, read AH_k as
AH_{N_H}. The symbol N_H stands for the number of Attention Heads used.
Purdue University 22
Multi-Headed Attention
Each Attention Head learns its own values for the Q , K , and V
tensors with its own matrices for WQ , WK , and WV .
While each Attention Head receives only a sqkv -sized slice from the
embedding axis of the input sentence, the output tensors Q , K , V will
still be of shape (Nw , sqkv ) for the same reason as described in the
previous section.
Since for each Attention Head, Q and K are of shape (Nw , sqkv ) for an
Nw-word sentence, the inner product Q · K^T is of the same shape as
in the previous section, that is, Nw × Nw.
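The following toy example (again, not the DLStudio code) makes the slicing concrete: the embedding axis of size M is split into NH slices of size sqkv = M // NH, each Attention Head applies its own QKV attention to its slice, and the per-head outputs are concatenated back along the embedding axis so that the overall output has the same shape as the input:

import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):
    """Illustrative only: each head attends over its own slice of the embedding axis."""
    def __init__(self, M, NH):
        super().__init__()
        self.NH, self.sqkv = NH, M // NH
        self.WQ = nn.ModuleList([nn.Linear(self.sqkv, self.sqkv) for _ in range(NH)])
        self.WK = nn.ModuleList([nn.Linear(self.sqkv, self.sqkv) for _ in range(NH)])
        self.WV = nn.ModuleList([nn.Linear(self.sqkv, self.sqkv) for _ in range(NH)])

    def forward(self, X):                              ## X is of shape (B, Nw, M)
        slices = X.split(self.sqkv, dim=-1)            ## NH slices, each of shape (B, Nw, sqkv)
        outs = []
        for h, Xh in enumerate(slices):
            Q, K, V = self.WQ[h](Xh), self.WK[h](Xh), self.WV[h](Xh)
            A = torch.softmax(Q @ K.transpose(1, 2) / self.sqkv ** 0.5, dim=-1)   ## (B, Nw, Nw)
            outs.append(A @ V)                         ## (B, Nw, sqkv)
        return torch.cat(outs, dim=-1)                 ## back to (B, Nw, M)

Z = TinyMultiHead(M=256, NH=4)(torch.randn(1, 10, 256))
print(Z.shape)                                         ## torch.Size([1, 10, 256])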
Purdue University 23
Implementation of Attention in DLStudio’s Transformers
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 24
Implementation of Attention in DLStudio’s Transformers
class AttentionHead(nn.Module):
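Patterned on the CrossAttentionHead code shown later in this lecture, but with Q, K, and V all computed from the same sentence slice since this is self-attention, an attention head can be sketched as follows. The sketch is illustrative only; the actual class definition is in Transformers.py in DLStudio:

import torch
import torch.nn as nn

class AttentionHeadSketch(nn.Module):
    """Illustrative self-attention head; see Transformers.py for DLStudio's AttentionHead."""
    def __init__(self, max_seq_length, qkv_size):
        super().__init__()
        self.max_seq_length, self.qkv_size = max_seq_length, qkv_size
        self.WQ = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)
        self.WK = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)
        self.WV = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, sentence_slice):                  ## shape: (B, max_seq_length, qkv_size)
        B = sentence_slice.shape[0]
        flat = sentence_slice.reshape(B, -1).float()    ## Q, K, V all come from the same slice
        Q = self.WQ(flat).view(B, self.max_seq_length, self.qkv_size)
        K = self.WK(flat).view(B, self.max_seq_length, self.qkv_size)
        V = self.WV(flat).view(B, self.max_seq_length, self.qkv_size)
        Z = self.softmax(Q @ K.transpose(2, 1)) @ V     ## (B, max_seq_length, qkv_size)
        return Z / torch.sqrt(torch.tensor(self.qkv_size).float())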
Purdue University 26
Implementation of Attention in DLStudio’s Transformers
Purdue University 28
Implementation of Attention in DLStudio’s Transformers
Purdue University 29
The Encoder-Decoder Architecture of a Transformer
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 30
The Encoder-Decoder Architecture of a Transformer
Purdue University 32
The Encoder-Decoder Architecture of a Transformer
Purdue University 34
The Encoder-Decoder Architecture of a Transformer
Purdue University 35
The Master Encoder Class
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 36
The Master Encoder Class
Master Encoder
class MasterEncoder(nn.Module):
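The MasterEncoder's job is to apply a stack of BasicEncoder instances, one after the other, to the source sentence tensor; this chaining is possible because each BasicEncoder returns a tensor of the same shape as its input, as noted in the next section. A bare-bones illustrative version of that orchestration is shown below; the actual class definition is in Transformers.py:

import torch.nn as nn

class MasterEncoderSketch(nn.Module):
    """Illustrative only: chains a user-specified number of BasicEncoder instances."""
    def __init__(self, basic_encoders):                ## a list of BasicEncoder-like modules
        super().__init__()
        self.basic_encoder_arr = nn.ModuleList(basic_encoders)

    def forward(self, sentence_tensor):
        out_tensor = sentence_tensor
        for basic_encoder in self.basic_encoder_arr:   ## each call preserves the tensor shape,
            out_tensor = basic_encoder(out_tensor)     ## so the stack can be arbitrarily deep
        return out_tensor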
Purdue University 37
The Basic Encoder Class
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 38
The Basic Encoder Class
Basic Encoder
The BasicEncoder consists of a layer of self-attention (SA) followed by a
purely feed-forward layer (FFN). You already know what is
accomplished by SA. The role played by the FFN is the same as in
any neural network — to enhance the discrimination ability of the
network.
The output of SA goes through FFN and the output of FFN becomes
the output of the BasicEncoder.
To mitigate the problem of vanishing gradients, the output of each of
the two components — SA and FFN — is subject to Layer Norm. In
addition, we use residual connections, one that wraps around the SA
layer and the other that wraps around the FFN layer as shown in the
figure on Slide 35.
Deploying a stack of BasicEncoder instances becomes easier if the
output tensor from a BasicEncoder has the same shape as its input
tensor.
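Putting the above description into code, here is a minimal sketch of such a basic encoder. For brevity it uses torch.nn's built-in nn.MultiheadAttention in place of DLStudio's own SelfAttention class, so treat it as an illustration of the structure rather than as the DLStudio implementation:

import torch.nn as nn

class BasicEncoderSketch(nn.Module):
    """Illustrative only: SA and FFN, each wrapped with a residual connection and LayerNorm."""
    def __init__(self, embedding_size, num_atten_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embedding_size, num_atten_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embedding_size)
        self.ffn = nn.Sequential(nn.Linear(embedding_size, 4 * embedding_size),
                                 nn.ReLU(),
                                 nn.Linear(4 * embedding_size, embedding_size))
        self.norm2 = nn.LayerNorm(embedding_size)

    def forward(self, x):                               ## x is of shape (B, Nw, embedding_size)
        sa_out, _ = self.self_attn(x, x, x)             ## self-attention: Q, K, V all from x
        x = self.norm1(x + sa_out)                      ## residual connection around SA, then LayerNorm
        x = self.norm2(x + self.ffn(x))                 ## same pattern around the FFN layer
        return x                                        ## output shape equals input shape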
Purdue University 39
The Basic Encoder Class
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 41
Cross Attention
In what follows, I’ll use X_enc to represent the tensor at the output of
the MasterEncoder. Its shape will be the same as that of the source
sentence supplied to the MasterEncoder instance.
Purdue University 42
Cross Attention
Purdue University 43
Cross Attention
class CrossAttention(nn.Module):
Purdue University 44
Cross Attention
The dot products eventually modify the Value vector slices that are
also from the MasterEncoder output for the source sequence. About the
word ”slice” here, as mentioned earlier, what each attention head sees
is a slice along the embedding axis for the words in a sentence.
Purdue University 45
Cross Attention
The operations carried out in lines (N) through (R) can be described
more compactly as:
Z_cross = [ nn.Softmax(Q_source · K_target^T) / √M ] · V_source                    (6)
class CrossAttentionHead(nn.Module):
def __init__(self, dl_studio, max_seq_length, qkv_size, num_atten_heads):
super(TransformerFG.CrossAttentionHead, self).__init__()
self.dl_studio = dl_studio
self.qkv_size = qkv_size
self.max_seq_length = max_seq_length
self.WQ = nn.Linear( max_seq_length * self.qkv_size, max_seq_length * self.qkv_size ) ## (B)
self.WK = nn.Linear( max_seq_length * self.qkv_size, max_seq_length * self.qkv_size ) ## (C)
self.WV = nn.Linear( max_seq_length * self.qkv_size, max_seq_length * self.qkv_size ) ## (D)
self.softmax = nn.Softmax(dim=1) ## (E)
def forward(self, basic_decoder_slice, final_encoder_slice): ## (F)
Q = self.WQ( basic_decoder_slice.reshape(final_encoder_slice.shape[0],-1).float() ) ## (G)
K = self.WK( final_encoder_slice.reshape(final_encoder_slice.shape[0],-1).float() ) ## (H)
V = self.WV( final_encoder_slice.reshape(final_encoder_slice.shape[0],-1).float() ) ## (I)
Q = Q.view(final_encoder_slice.shape[0], self.max_seq_length, self.qkv_size) ## (J)
K = K.view(final_encoder_slice.shape[0], self.max_seq_length, self.qkv_size) ## (K)
V = V.view(final_encoder_slice.shape[0], self.max_seq_length, self.qkv_size) ## (L)
A = K.transpose(2,1) ## (M)
QK_dot_prod = Q @ A ## (N)
rowwise_softmax_normalizations = self.softmax( QK_dot_prod ) ## (O)
Z = rowwise_softmax_normalizations @ V ## (P)
coeff = 1.0/torch.sqrt(torch.tensor([self.qkv_size]).float()).to(self.dl_studio.device) ## (Q)
Z = coeff * Z ## (R)
return Z
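To see the shape bookkeeping in lines (G) through (R) with concrete numbers, here is a tiny stand-alone re-enactment with random tensors and illustrative sizes. It bypasses the DLStudio class machinery and is meant only to show the data flow:

import torch
import torch.nn as nn

B, max_seq_length, qkv_size = 2, 10, 64
basic_decoder_slice = torch.randn(B, max_seq_length, qkv_size)     ## target-side slice
final_encoder_slice = torch.randn(B, max_seq_length, qkv_size)     ## source-side slice

WQ = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)
WK = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)
WV = nn.Linear(max_seq_length * qkv_size, max_seq_length * qkv_size)

Q = WQ(basic_decoder_slice.reshape(B, -1)).view(B, max_seq_length, qkv_size)   ## lines (G), (J)
K = WK(final_encoder_slice.reshape(B, -1)).view(B, max_seq_length, qkv_size)   ## lines (H), (K)
V = WV(final_encoder_slice.reshape(B, -1)).view(B, max_seq_length, qkv_size)   ## lines (I), (L)

Z = nn.Softmax(dim=1)(Q @ K.transpose(2, 1)) @ V / qkv_size ** 0.5             ## lines (M) through (R)
print(Z.shape)                                                                 ## torch.Size([2, 10, 64])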
Purdue University 46
The Basic Decoder Class
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 47
The Basic Decoder Class
Purdue University 49
The Master Decoder Class
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 50
The Master Decoder Class
Master Decoder
The primary job of the Master Decoder is to orchestrate the
invocation of a stack of BasicDecoderWithMasking instances. The number
of BasicDecoderWithMasking instances used is a user-defined parameter.
With regard to the data flow through the network, note how the mask
is initialized in Line (D) on Slide 53. The mask is a vector of ones
that grows with the prediction for each output word. We start by
setting it equal to just a single-element vector containing a single ”1”.
Purdue University 51
The Master Decoder Class
MasterDecoderWithMasking (contd.)
Lines (E) and (F) in the code on the next slide declare the tensors
that will store the final output of the Master Decoder. This final
output consists of two tensors:
One tensor holds the integer index to the target-language vocabulary
word where the output log-prob is maximum. [This index is needed at
inference time to output the words in the translation.]
The other tensor holds the log-probs over the target language
vocabulary. The log-probs are produced by the nn.LogSoftmax in Line
(L).
Purdue University 52
The Master Decoder Class
MasterDecoderWithMasking (contd.)
class MasterDecoderWithMasking(nn.Module):
def __init__(self, dls, xformer, how_many_basic_decoders, num_atten_heads):
super(TransformerFG.MasterDecoderWithMasking, self).__init__()
self.dls = dls
self.max_seq_length = xformer.max_seq_length
self.embedding_size = xformer.embedding_size
self.target_vocab_size = xformer.vocab_es_size ## (A)
self.basic_decoder_arr = nn.ModuleList([xformer.BasicDecoderWithMasking( dls, xformer,
num_atten_heads) for _ in range(how_many_basic_decoders)]) ## (B)
## Need the following layer because we want the prediction of each target word to be a probability
## distribution over the target vocabulary. The conversion to probs would be done by the criterion
## nn.CrossEntropyLoss in the training loop:
self.out = nn.Linear(self.embedding_size, self.target_vocab_size) ## (C)
def forward(self, sentence_tensor, final_encoder_out): ## (D)
## This part is for training:
mask = torch.ones(1, dtype=int) ## (E)
## A tensor with two axes, one for the batch instance and the other for storing the predicted
## word ints for that batch instance:
predicted_word_index_values = torch.ones(sentence_tensor.shape[0], self.max_seq_length,
dtype=torch.long).to(self.dls.device) ## (F)
## A tensor with two axes, one for the batch instance and the other for storing the log-prob
## of predictions for that batch instance. The log_probs for each predicted word over the entire
## target vocabulary:
predicted_word_logprobs = torch.zeros( sentence_tensor.shape[0], self.max_seq_length,
self.target_vocab_size, dtype=float).to(self.dls.device) ## (G)
for mask_index in range(1, sentence_tensor.shape[1]):
masked_target_sentence = self.apply_mask(sentence_tensor, mask, self.max_seq_length,
self.embedding_size) ## (H)
## out_tensor will start as just the first word, then two first words, etc.
out_tensor = masked_target_sentence ## (I)
for i in range(len(self.basic_decoder_arr)): ## (J)
out_tensor = self.basic_decoder_arr[i](out_tensor, final_encoder_out, mask)
last_word_tensor = out_tensor[:,mask_index] ## (K)
last_word_onehot = self.out(last_word_tensor.view(sentence_tensor.shape[0],-1)) ## (L)
output_word_logprobs = nn.LogSoftmax(dim=1)(last_word_onehot) ## (M)
_, idx_max = torch.max(output_word_logprobs, 1) ## (N)
predicted_word_index_values[:,mask_index] = idx_max ## (P)
predicted_word_logprobs[:,mask_index] = output_word_logprobs ## (Q)
mask = torch.cat( ( mask, torch.ones(1, dtype=int) ) ) ## (R)
return predicted_word_logprobs, predicted_word_index_values ## (S)
def apply_mask(self, sentence_tensor, mask, max_seq_length, embedding_size):
out = torch.zeros_like(sentence_tensor).float().to(self.dls.device)
out[:,:len(mask),:] = sentence_tensor[:,:len(mask),:]
return out
Purdue University 53
Positional Encoding for the Words
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 54
Positional Encoding for the Words
The alternating columns of the 2D array shown on the next slide are
filled using sine and cosine functions whose periodicities vary with the
column index in the pattern.
In this case, the sentence tensor is of shape (10, 512). So the array of
positional-encoding numbers we need to construct will also be of
shape (10, 512). We need to fill the alternating columns of this
(10, 512) array with sin() and cos() values as shown above.
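Here is a minimal PyTorch sketch of one way to fill such a (10, 512) positional-encoding array with the sin/cos pattern just described; the base of 10000 for the periodicities follows the Vaswani et al. paper:

import torch

max_seq_length, embedding_size = 10, 512
pe = torch.zeros(max_seq_length, embedding_size)
positions = torch.arange(max_seq_length).float().unsqueeze(1)                 ## word position index
div_term = torch.pow(10000.0, torch.arange(0, embedding_size, 2).float() / embedding_size)
pe[:, 0::2] = torch.sin(positions / div_term)       ## even columns: sines with column-dependent periods
pe[:, 1::2] = torch.cos(positions / div_term)       ## odd columns:  cosines with the same periodicities
## pe is added to the (10, 512) sentence tensor before it enters the encoder/decoder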
Purdue University 56
Positional Encoding for the Words
Purdue University 58
TransformerFG and TransformerPreLN
Classes in DLStudio
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 59
TransformerFG and TransformerPreLN
Classes in DLStudio
Purdue University 60
TransformerFG and TransformerPreLN
Classes in DLStudio
Purdue University 62
TransformerFG and TransformerPreLN
Classes in DLStudio
Purdue University 63
Regarding the Difficulty of Training a Transformer Network
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 64
Regarding the Difficulty of Training a Transformer Network
When comparing the n-grams between the predicted and the target
sentences, you do NOT seek a position based matching of the
n-grams. For a given value of n, what BLEU calculates is the
occurrence count for an n-gram in the predicted sentence that has a
matching n-gram anywhere in the target sentence. The ratio of this
number to the total number of such n-grams in the predicted
sentence is the translation precision as measured for that n. Typically,
one constructs a weighted average of these ratios for n ∈ {1, 2, 3, 4}.
The above formula requires a critical modification in order to be
effective: You do not want the occurrence based count for an n-gram
in a predicted sentence to exceed the count for the same n-gram in
the target sentence. [To cite an example provided by the original authors of BLEU, consider the case
when the predicted sentence is a gibberish repetition of a commonly occurring word like "the", as in the predicted
sentence "the the the the the the the". Assume that the target sentence is "the cat is on the mat". A unigram
based precision in this case would return a value of 7/7 = 1, since the unigram "the" occurs 7 times in the predicted
sentence and it does occur at least once in the target sentence. To remedy this shortcoming, we require that the
count returned for any n-gram not exceed the count for the same n-gram in the target sentence. With that modification,
the value returned for the example would be 2/7. You would impose this constraint for all n in the n-grams used.]
Purdue University 67
Regarding the Difficulty of Training a Transformer Network
Since the n-gram based counts are based solely on the predicted
sentences (albeit on the basis that the same n-grams exist in the
target sentences), predicted sentences much shorter than the target
sentences will in general score higher. [Consider the case when the predicted
sentence is "the cat is" for the target sentence "the cat is on the mat". In this case, all of the unigram, digram,
and trigram precisions would equal 1 despite the translation being incomplete.] To counter this bias, the
BLEU metric multiplies the n-gram based scores with the brevity-penalty factor
e^(1 − r/c) when c < r, where c is the length of the predicted sentence
and r the length of the target sentence.
You use the BLEU metric in your code by calling on its implementation
provided by the Natural Language Toolkit (NLTK) library. If you
wish, you can download the source code for the BLEU metric from:
https://www.nltk.org/_modules/nltk/translate/bleu_score.html
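For instance, assuming NLTK is installed, a sentence-level BLEU score can be computed as shown below. A smoothing function is commonly supplied because short sentences often have no higher-order n-gram matches at all:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()           ## the target (reference) sentence
candidate = "the the the the the the the".split()     ## the predicted sentence
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),               ## average over n = 1, 2, 3, 4
                      smoothing_function=SmoothingFunction().method1)
print(score)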
Purdue University 68
Regarding the Difficulty of Training a Transformer Network
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 70
Results on the English-Spanish Dataset
Figure: Training loss vs. iterations for 20 epochs with the TransformerFG class in DLStudio.
Figure: Training loss vs. iterations for 60 epochs with the TransformerPreLN class in DLStudio.
Purdue University 71
Results on the English-Spanish Dataset
The training was carried out on RVL Cloud using a single GPU
(NVIDIA GeForce RTX 2080) and by executing the following command
in the ExamplesTransformers directory of DLStudio:
python3 seq2seq_with_transformerFG.py
Here are the parameters used for training the transformer network:
Batch size: 50
Embedding_size: 256
Number Basic Encoders: 4
Number Basic Decoders: 4
Number Attention Heads: 4
Number of Warmup Steps: 4000
Masking: False
Purdue University 72
Results on the English-Spanish Dataset
Purdue University 73
Results on the English-Spanish Dataset
TRANSLATIONS PRODUCED:
1. The input sentence pair: [’SOS anybody can read it EOS’] [’SOS cualquiera puede leerlo EOS’]
The translation produced by TransformerFG: EOS cualquiera puede leerlo EOS EOS EOS EOS EOS EOS [CORRECT]
2. The input sentence pair: [’SOS is he your teacher EOS’] [’SOS es tu profesor EOS’]
The translation produced by TransformerFG: EOS es tu profesor EOS EOS EOS EOS EOS EOS [CORRECT]
3. The input sentence pair: [’SOS i wanted to study french EOS’] [’SOS querı́a estudiar francés EOS’]
The translation produced by TransformerFG: EOS querı́a estudiar francés EOS EOS EOS EOS EOS EOS [CORRECT]
4. The input sentence pair: [’SOS what are you doing next monday EOS’] [’SOS qué vas a hacer el próximo lunes EOS’]
The translation produced by TransformerFG: EOS qué vas a hacer el próximo lunes EOS EOS [CORRECT]
5. The input sentence pair: [’SOS it was a beautiful wedding EOS’] [’SOS fue un hermoso casamiento EOS’]
The translation produced by TransformerFG: EOS fue un hermoso hermoso EOS EOS EOS EOS EOS [WRONG]
7. The input sentence pair: [’SOS he has a very interesting book EOS’] [’SOS él tiene un libro muy divertido EOS’]
The translation produced by TransformerFG: EOS él tiene un libro muy divertido EOS EOS EOS [CORRECT]
8. The input sentence pair: [’SOS i was waiting for tom EOS’] [’SOS estaba esperando a tom EOS’]
The translation produced by TransformerFG: EOS estaba esperando a tom EOS EOS EOS EOS EOS [CORRECT]
9. The input sentence pair: [’SOS mary has curlers in her hair EOS’] [’SOS mary lleva rulos en el pelo EOS’]
The translation produced by TransformerFG: EOS mary lleva tengo en el pelo EOS EOS EOS [WRONG]
10. The input sentence pair: [’SOS tom thought about mary a lot EOS’] [’SOS tom pensó mucho acerca de marı́a EOS’]
The translation produced by TransformerFG: EOS tom pensó mucho acerca de marı́a EOS EOS EOS [CORRECT]
11. The input sentence pair: [’SOS you are so shallow EOS’] [’SOS eres tan superficial EOS’]
The translation produced by TransformerFG: EOS eres tan superficial EOS EOS EOS EOS EOS EOS [CORRECT]
12. The input sentence pair: [’SOS can you solve this problem EOS’] [’SOS podéis resolver este problema EOS’]
The translation produced by TransformerFG: EOS puedes resolver este problema EOS EOS EOS EOS EOS [CORRECT]
13. The input sentence pair: [’SOS they were listening to the radio EOS’] [’SOS ellos estaban escuchando la radio EOS’]
The translation produced by TransformerFG: EOS ellos estaban escuchando la radio EOS EOS EOS EOS [CORRECT]
15. The input sentence pair: [’SOS when did you learn to swim EOS’] [’SOS cuándo aprendiste a nadar EOS’]
The translation produced by TransformerFG: EOS cuándo aprendiste a nadar EOS EOS EOS EOS EOS [CORRECT]
16. The input sentence pair: [’SOS tom has been busy all morning EOS’] [’SOS tom estuvo ocupado toda la mañana EOS’]
The translation produced by TransformerFG: EOS tom ha estado toda toda mañana EOS EOS EOS [WRONG]
17. The input sentence pair: [’SOS i just want to read EOS’] [’SOS solo quiero leer EOS’]
The translation produced by TransformerFG: EOS solo quiero leer EOS EOS EOS EOS EOS EOS [CORRECT]
18. The input sentence pair: [’SOS tell us something EOS’] [’SOS dı́ganos algo EOS’]
The translation produced by TransformerFG: EOS dinos algo EOS EOS EOS EOS EOS EOS EOS [Semantically CORRECT]
19. The input sentence pair: [’SOS how often does tom play hockey EOS’] [’SOS con qué frecuencia juega tom al hockey EOS’]
The translation produced by TransformerFG: EOS con qué frecuencia juega tom al hockey EOS EOS [CORRECT]
20. The input sentence pair: [’SOS he was reelected mayor EOS’] [’SOS él fue reelegido alcalde EOS’]
The translation produced by TransformerFG: EOS él fue a alcalde EOS EOS EOS EOS EOS [WRONG]
Purdue University 76
Results on the English-Spanish Dataset
I could have gotten more “juice” out of my small dataset if I had also
incorporated in the learning framework the commonly used step of
tokenization as a front-end and trained the model with the tokens.
Purdue University 77
Results on the English-Spanish Dataset
https://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 81
Transformers with Generative Pre-Training (GPT)
I’ll now present some insights gleaned from the paper “Improving
Language Understanding by Generative Pre-Training” by Radford et al.:
https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
Purdue University 82
Transformers with Generative Pre-Training (GPT)
The pre-training objective is the standard language-modeling log-likelihood L(U) = Σ_i log P(u_i | u_{i−k}, . . . , u_{i−1}; Θ),
where U represents the “tokens” in the corpus and k the size of the
context window. [Using the words directly in creating a language model can result in too large a
vocabulary — you’ll need a separate representation for every possible inflection of each noun and every possible
conjugation of each verb. Besides, you will also run into problems with “synthesized” words like
“overparameterized”. Language modeling becomes more efficient if the words are first decomposed into tokens
through a step called tokenization. As you would expect, tokenization is highly language specific.]
Purdue University 84
The Vision Transformer Class visTransformer
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 85
The Vision Transformer Class visTransformer
I have illustrated this idea in the figure on the next slide that shows
us representing an image with a 5 × 5 array of nonoverlapping
patches. If a patch consists of p × p pixels, and assuming that we are
talking about color images, we will have a total of p² × 3 numeric
values in a patch.
Purdue University 88
The Vision Transformer Class visTransformer
The authors of the ViT paper discovered that if they gave the transformer a “cell” in
which it could store its understanding of what was unique to all the images for the same
class, that helped the neural network make a correct guess for the class label. This “cell”
is referred to as the class token in the ViT architecture.
As you know already, transformer based learning for languages required positional
encoding for the words that gave the network a sense of the order in which the words
existed in a sentence. As you’ll recall, I presented sinusoidal positional encoding for the
case of language modeling on Slides 54-58. The question now is: How does one do that
for patch sequences? Here again, the solution consisted of providing another “cell”, but
this time on a per-patch basis, where the network can put away its understanding of the
order in which the patches relate to one another spatially. These per-patch cells are
referred to as positional-encodings in ViT.
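In code, both of these "cells" typically amount to nothing more than learnable tensors that are attached to the patch-embedding sequence before it enters the encoder, roughly as in the following sketch. The variable names and sizes here are made up for illustration; the visTransformer class has its own variables for the same idea:

import torch
import torch.nn as nn

B, num_patches, M = 8, 4, 256                        ## batch size, patches per image, embedding size

class_token = nn.Parameter(torch.zeros(1, 1, M))                   ## the learnable class "cell"
pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, M))   ## one positional "cell" per patch
                                                                   ## (+1 for the class token)
patch_embeddings = torch.randn(B, num_patches, M)                  ## output of the patch-embedding step
x = torch.cat([class_token.expand(B, -1, -1), patch_embeddings], dim=1)
x = x + pos_embedding                                ## this is what enters the encoder stack
print(x.shape)                                       ## torch.Size([8, 5, 256])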
Purdue University 89
The Vision Transformer Class visTransformer
where 3 is for the three color channels of a training image, M the embedding size
and P the kernel size. By setting both the kernel and the stride to the same value,
you will directly output the embedding vector of size M for each non-overlapping
P × P patch in the image. If your training dataset is CIFAR-10, your input images
are of size 32 × 32. If you want your patches to be of size 16 × 16, you would set
P = M = 16.
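As a concrete sketch of this first approach, the convolution described above can be written as shown below for CIFAR-10 sized images. The embedding size M used here is an illustrative choice, not a DLStudio default:

import torch
import torch.nn as nn

P, M = 16, 64                                 ## patch size and an illustrative embedding size
conv = nn.Conv2d(3, M, kernel_size=P, stride=P)      ## kernel == stride ==> non-overlapping patches

images = torch.randn(8, 3, 32, 32)            ## a batch of CIFAR-10 sized color images
out = conv(images)                            ## shape (8, M, 2, 2): one M-vector per 16x16 patch
patch_embeddings = out.flatten(2).transpose(1, 2)    ## shape (8, 4, M): a sequence of 4 patch embeddings
print(patch_embeddings.shape)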
The second approach is based on separately extracting the patches by calling
torch.tensor.unfold() and then mapping them with an nn.Linear layer to the
embedding vectors, as shown below:
for i, data in enumerate(self.train_data_loader):
input_images, labels = data
...
patch_sequences = input_images.unfold(2, self.patch_size[0], self.patch_size[1]).unfold(3, \
self.patch_size[0], self.patch_size[1])
patch_sequence_embeddings = patch_embedding_generator( patch_sequences )
...
Purdue University 92
The Vision Transformer Class visTransformer
visTransformer in DLStudio
The visTransformer class in the Transformers module in DLStudio
consists of the following inner classes and methods:
class visTransformer(nn.Module)
class PatchEmbeddingGenerator(nn.Module)
class MasterEncoder(nn.Module)
class BasicEncoder(nn.Module)
class SelfAttention(nn.Module)
class AttentionHead(nn.Module)
The names I have used for the main vision transformer class visTransformer
and its five inner classes should make them self-explanatory.
Of the five inner classes, you have already seen the last four. In what
follows, I’ll present the definitions for the main class visTransformer and
its inner class PatchEmbeddingGenerator.
Purdue University 93
The Vision Transformer Class visTransformer
test_checkpoint_for_visTransformer.py
When you run the first script, it outputs checkpoints every 10 epochs
(by default). As the first script continues to train, you can test
the quality of the model learned in a checkpoint by executing the
second script. See the doc section of the second script for how to specify a
particular checkpoint.
Without any hyperparameter tuning, shown below are some results on
the testing-portion of the CIFAR-10 dataset:
Displaying the confusion matrix:
plane car bird cat deer dog frog horse ship truck
plane: 65.63 1.60 3.41 3.71 2.91 1.90 2.40 1.20 13.23 4.01
car: 4.80 60.46 0.30 4.70 1.30 2.20 1.40 0.80 10.31 13.71
bird: 8.81 0.50 38.34 14.21 15.12 7.51 7.91 3.90 2.60 1.10
cat: 2.51 0.90 5.92 51.05 7.22 15.75 10.23 2.81 1.71 1.91
deer: 3.70 0.20 7.80 10.40 53.50 6.20 10.20 3.80 2.70 1.50
dog: 1.81 0.30 6.02 29.29 6.42 44.53 6.42 2.91 1.50 0.80
frog: 1.40 0.70 5.00 12.70 7.90 4.10 65.40 0.70 1.10 1.00
horse: 4.11 1.00 2.71 12.34 9.93 11.74 2.81 52.26 1.00 2.11
ship:   7.82  3.41  1.40  3.71  2.71  1.91  1.00  0.90 72.92  4.21
truck:  5.10 11.50  0.70  7.00  2.00  3.60  2.40  2.30  7.50 57.90
Purdue University 97
The Vision Transformer Class visTransformer
Purdue University 98
Using QKV Modeling for Inter-Pixel Attention
Outline
1 The Basic Idea of Dot-Product Attention 11
2 Multi-Headed Attention 20
3 Implementation of Attention in DLStudio’s Transformers 24
4 The Encoder-Decoder Architecture of a Transformer 30
5 The Master Encoder Class 36
6 The Basic Encoder Class 38
7 Cross Attention 41
8 The Basic Decoder Class 47
9 The Master Decoder Class 50
10 Positional Encoding for the Words 54
11 TransformerFG and TransformerPreLN Classes in DLStudio 59
12 Regarding the Difficulty of Training a Transformer Network 64
13 Results on the English-Spanish Dataset 70
14 Transformers with Generative Pre-Training (GPT) 81
15 The Vision Transformer Class visTransformer 85
16 Using QKV Modeling for Inter-Pixel Attention 99
Purdue University 99
Using QKV Modeling for Inter-Pixel Attention
As you would expect, the data that is input into the UNet is of shape
(B, C , H, W ). For calculating the inter-pixel attention, for each pixel
in the H × W array, we consider the C floating-point values along the
channel axis as the embedding vector representation of that pixel.
Subsequently, (1) we first flatten the H × W array of pixels into a
1-dimensional pixel array — just to make it easier to write the
dot-product code later. (2) We use a 1-dimensional convolution on
the 1-dimensional array of pixels to convert the C channels associated
with each pixel into 3C channels.
Since the channel axis is used as the embedding vector at each pixel,
increasing the number of channels gives us more latitude in dividing
the channel axis into portions reserved for Q, K , and V .
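A sketch of these two steps, followed by the usual dot-product attention over the pixels, is shown below. It is illustrative only and is not the DLStudio/UNet code; the sizes are arbitrary:

import torch
import torch.nn as nn

B, C, H, W = 4, 64, 16, 16
x = torch.randn(B, C, H, W)

to_qkv = nn.Conv1d(C, 3 * C, kernel_size=1)          ## step (2): C channels become 3C channels per pixel

flat = x.flatten(2)                                  ## step (1): shape (B, C, H*W), pixels as a 1-d array
q, k, v = to_qkv(flat).chunk(3, dim=1)               ## each of shape (B, C, H*W)

attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)     ## (B, H*W, H*W) inter-pixel attention
out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
print(out.shape)                                     ## torch.Size([4, 64, 16, 16])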
The next slide mentions a very interesting computationally efficient
way of implementing Vaswani attention for the inter-pixel case.
Purdue University 101
Using QKV Modeling for Inter-Pixel Attention