L22_Attention in Deep Learning
• For example, translating “What are you doing today?” from English to
Chinese has an input of 5 words and an output of 7 symbols.
In machine translation, for example, it is important to have attention scores between the
source and target sequences, and also within the source sequence itself, hence
self-attention.
Soft vs Hard Attention
• In the Show, Attend and Tell paper, the attention mechanism is
applied to images to generate captions.
• Soft Attention: the alignment weights are learned and placed “softly”
over all patches in the source image; essentially the same type of
attention as in Bahdanau et al., 2015 (see the sketch after this list).
– Pro: the model is smooth and differentiable.
– Con: expensive when the source input is large.
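A rough NumPy sketch of the soft-attention idea: a context vector is built as a softmax-weighted sum over all patch features. The patch features, decoder state, and dimensions below are made-up placeholders, not values from the paper.

import numpy as np

# Hypothetical setup: 4 image-patch feature vectors of dimension 3
# and one decoder hidden state of the same dimension.
patches = np.random.randn(4, 3)        # (num_patches, feature_dim)
decoder_state = np.random.randn(3)     # current decoder state

scores = patches @ decoder_state                  # one relevance score per patch
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> soft alignment weights
context = weights @ patches                       # weighted sum over ALL patches
# Every patch contributes a little, which keeps the model smooth and differentiable.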
The first step in calculating self-attention is to create three vectors from each of the
encoder’s input vectors (in this case, the embedding of each word). So for each word, we
create a Query vector, a Key vector, and a Value vector. These vectors are created by
multiplying the embedding by three matrices that are learned during training.
Calculating Self-Attention
1. First, we need to create three vectors from each of the encoder’s
input vectors:
– Query Vector
– Key Vector
– Value Vector.
The projection matrices that produce these vectors are learned and updated during training (see the sketch below).
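A minimal NumPy sketch of the computation, with toy dimensions and randomly initialized matrices standing in for the learned projections W_q, W_k, W_v:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X holds one embedding per word, shape (seq_len, d_model).
    Q = X @ W_q                        # Query vectors
    K = X @ W_k                        # Key vectors
    V = X @ W_v                        # Value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # scaled dot-product scores between all positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                 # each output is a weighted sum of Value vectors

d_model, d_k = 8, 4                                  # toy sizes
X = np.random.randn(3, d_model)                      # embeddings of a 3-word sentence
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)               # shape (3, d_k), one vector per word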
With eight attention heads, self-attention produces eight output matrices. The feed-forward
layer is not expecting eight matrices; it expects a single matrix (a vector for each word).
So we need a way to condense these eight down into a single matrix.
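One common way to do this, shown here as a sketch rather than any particular implementation, is to concatenate the eight per-head outputs along the feature axis and multiply by an output projection matrix (called W_O below; all weights are random placeholders):

import numpy as np

seq_len, num_heads, d_head, d_model = 3, 8, 8, 64
# Hypothetical outputs of eight self-attention heads, one matrix per head.
head_outputs = [np.random.randn(seq_len, d_head) for _ in range(num_heads)]

concat = np.concatenate(head_outputs, axis=-1)       # (seq_len, num_heads * d_head)
W_O = np.random.randn(num_heads * d_head, d_model)   # learned output projection
Z = concat @ W_O                 # the single matrix the feed-forward layer expects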
Masked Multi-Head Attention
• The Decoder has masked multi-head attention where it masks
or blocks the decoder inputs from the future steps.
• During training, the multi-head attention of the Decoder hides
the future decoder inputs.
• For the machine translation task of translating the sentence “I
enjoy nature” from English to Hindi using the Transformer,
the Decoder considers all the input words “I, enjoy,
nature” to predict the first word.
• The Decoder would block the inputs from future steps, as sketched below.
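A sketch of how such masking is typically implemented: scores for future positions are set to negative infinity before the softmax, so their attention weights become zero. The scores below are random placeholders.

import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw decoder self-attention scores

# Causal mask: position i may attend only to positions <= i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
# Each row now puts zero weight on future positions.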
Layer Normalization:
• Normalizes the inputs across each of
the features and is independent of
other examples.
• Layer normalization reduces the
training time in feed-forward neural
networks.
• In Layer normalization, we compute
mean and variance from all of the
summed inputs to the neurons in a
layer on a single training case.
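A minimal sketch of layer normalization on a single training case, assuming placeholder values for the learned gain (gamma) and bias (beta):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x holds the summed inputs to the neurons of one layer for ONE example.
    mean = x.mean()                    # statistics over this example's features only
    var = x.var()
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta        # learned scale and shift

x = np.random.randn(6)                                   # one training case, 6 features
out = layer_norm(x, gamma=np.ones(6), beta=np.zeros(6))  # independent of other examples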
Fully Connected Layer
• The Encoder and Decoder in the Transformer both have a
fully connected feed-forward network, which consists of two
linear transformations with a ReLU activation in between
(sketched below).
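A sketch of this position-wise feed-forward network; the dimensions d_model = 512 and d_ff = 2048 follow the original paper, while the weights here are random placeholders:

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied to each position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 3
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
x = np.random.randn(seq_len, d_model)
out = feed_forward(x, W1, b1, W2, b2)   # same shape as x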
Features of Transformers
The drawbacks of the seq2seq model are addressed by
Transformer
• Parallelizing Computation:
– The Transformer’s architecture removes the recurrence
used in the Seq2Seq model and relies entirely on Self-Attention
to draw global dependencies between input and output.
– Self-Attention helps significantly with parallelizing the
computation
• Reduced number of operations:
– The Transformer relates any two positions with a constant number of
operations; the resolution lost by averaging attention-weighted
positions is counteracted by multi-head attention.
• Long-range dependencies:
– A key factor that impacts the learning of long-range dependencies is
the length of the forward and backward paths the signals
have to traverse in the network.
– The shorter the route between any combination of positions in
the input and output sequences, the easier it is to learn long-
range dependencies.
– A Self-Attention layer connects all positions with a constant
number of sequentially executed operations, which makes learning
long-range dependencies easier.
Limitations of the Transformer
• The Transformer is undoubtedly a huge improvement
over RNN-based seq2seq models.
• But it comes with its own share of limitations:
– Attention can only deal with fixed-length text strings. The
text has to be split into a certain number of segments or
chunks before being fed into the system as input.
– This chunking of text causes context fragmentation. For
example, if a sentence is split from the middle, then a
significant amount of context is lost.
• https://jalammar.github.io/illustrated-transformer/
• https://towardsdatascience.com/transformers-141e32e69591
• https://stackoverflow.com/questions/58127059/how-to-understand-masked-multi-head-attention-in-transformer