
Attention in Deep Learning

Encoder-Decoder: Sequence to Sequence Model

• Introduced in 2014 by Google, a sequence to sequence model maps an input
sequence to an output sequence, where the lengths of the input and output
may differ.

• For example, translating “What are you doing today?” from English to
Chinese has an input of 5 words and an output of 7 symbols.

• A sequence to sequence model lies behind numerous systems you use on a
daily basis. For instance, seq2seq models power applications like Google
Translate, voice-enabled devices and online chatbots.
• These applications involve tasks such as machine translation, speech
recognition and video captioning.
• The power of this model lies in the fact that it can map sequences of
different lengths to each other.
How does the Sequence to Sequence Model work?

The model consists of 3 parts: encoder, intermediate (encoder) vector and
decoder.
Encoder

• A stack of several recurrent units (LSTM or GRU cells for better
performance), where each accepts a single element of the input sequence,
collects information for that element and propagates it forward.
• In a question-answering problem, the input sequence is a collection of
all words from the question. Each word is represented as x_i, where i is
the order of that word.
• The hidden state h_t at step t is computed using the formula:
      h_t = f(W^(hh) h_(t-1) + W^(hx) x_t)
Encoder Vector

• This is the final hidden state produced from the encoder part of the
model. It is calculated using the same formula as h_t above.
• This vector aims to encapsulate the information for all input elements in
order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
Decoder

• A stack of several recurrent units where each predicts an output y_t at a
time step t.
• Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of
all words from the answer. Each word is represented as y_i, where i is the
order of that word.
• Any hidden state h_t is computed using the formula:
      h_t = f(W^(hh) h_(t-1))

• The output y_t at time step t is computed using the formula:
      y_t = softmax(W^S h_t)

Softmax is used to create a probability vector which helps us determine the
final output, e.g. the word in the question-answering problem. A minimal
code sketch of this encoder-decoder setup is shown below.
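The following is a rough, non-authoritative sketch of the encoder, encoder
vector and decoder described above, assuming PyTorch; the class names, GRU
cells, vocabulary size, dimensions and start-token id are illustrative
choices, not taken from the lecture.

```python
# Minimal seq2seq sketch (assumes PyTorch; names and sizes are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        h_all, h_last = self.rnn(self.embed(src))
        return h_last                            # encoder vector: (1, batch, hid_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden):       # one decoding step
        h_all, hidden = self.rnn(self.embed(prev_token), hidden)
        logits = self.out(h_all)                 # softmax over these gives y_t
        return logits, hidden

# Usage: the encoder vector initialises the decoder's hidden state.
enc, dec = Encoder(1000), Decoder(1000)
src = torch.randint(0, 1000, (2, 5))             # batch of 2 source "sentences"
hidden = enc(src)
prev = torch.zeros(2, 1, dtype=torch.long)       # assumed <start> token id 0
logits, hidden = dec(prev, hidden)
probs = torch.softmax(logits, dim=-1)            # probability vector for the next word
```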
Encoder and Decoder
Drawbacks of Encoder-Decoder
• If the encoder makes a bad summary, the translation will also be bad. And
indeed it has been observed that the encoder creates a bad summary when it
tries to understand longer sentences. This is called the long-range
dependency problem of RNNs/LSTMs.

• RNNs cannot remember longer sentences and sequences due to the
vanishing/exploding gradient problem. They can only remember the parts
which they have just seen.
Drawbacks of Encoder-Decoder
• Even Cho et al (2014), who proposed the encoder-decoder network,
demonstrated that the performance of the encoder-decoder network degrades
rapidly as the length of the input sentence increases.

• Although an LSTM is supposed to capture long-range dependencies better
than a plain RNN, it tends to become forgetful in specific cases.

• Another problem is that there is no way to give more importance to some
of the input words compared to others while translating the sentence.
Enhancements of Simple Seq to Seq Model

• Due to these drawbacks, multiple enhancements have been introduced. Each
one aims to strengthen the performance of this model on more complex tasks
with long input and output sequences. Examples are:
• Reversing the order of the input sequence.
• Using LSTM or GRU cells.
• Introducing the Attention mechanism.
• and many more.
Attention
• In psychology, attention is the cognitive process of selectively
concentrating on one or a few things while ignoring others.

– A neural network is considered to be an effort to mimic human brain
actions in a simplified manner.
– The Attention Mechanism is also an attempt to implement the same action
of selectively concentrating on a few relevant things, while ignoring
others, in deep neural networks.
Attention in Deep Learning
• The attention mechanism emerged as an improvement over the
encoder-decoder-based neural machine translation system in natural
language processing (NLP).

• Later, this mechanism, or its variants, was used in other applications,
including computer vision, speech processing, etc.
Idea Behind the Attention
• Suppose, we want to predict the next word in a sentence, and its context is
located a few words back.
• Here’s an example – “Despite originally being from Uttar Pradesh, as he
was brought up in Bengal, he is more comfortable in Bengali”. In this
sentence, if we want to predict the word “Bengali”, the phrase “brought up”
and the word “Bengal” should be given more weight while predicting it. And
although Uttar Pradesh is another state’s name, it should be “ignored”.
• So is there any way we can keep all the relevant information in the input
sentences intact while creating the context vector?
• Bahdanau et al (2015) came up with a simple but elegant idea where they
suggested that not only can all the input words be taken into account in the
context vector, but relative importance should also be given to each one of
them.
• So, whenever the proposed model generates a sentence, it searches for a set
of positions in the encoder hidden states where the most relevant
information is available. This idea is called ‘Attention’.
Attention Mechanism
Attention Mechanism
• The alignment between the source and target is learned and controlled by
the context vector.

• Essentially the context vector consumes three pieces of information:
– encoder hidden states;
– decoder hidden states;
– alignment between source and target.
Attention Mechanism
• We have a source sequence x of length n and try to output a target
sequence y of length m.

• The encoder is a bidirectional RNN with a forward hidden state and a
backward one.
• A simple concatenation of the two represents the encoder state.
• The motivation is to include both the preceding and following words in
the annotation of one word.
Attention Mechanism
• The decoder network has a hidden state s_t for the output word at
position t, t = 1,…,m, where the context vector c_t is a sum of the hidden
states of the input sequence, weighted by alignment scores:
      c_t = Σ_i α_t,i h_i
Attention Mechanism
• The alignment model assigns a score α_t,i to the pair of input at
position i and output at position t, (y_t, x_i), based on how well they
match.
• The set of {α_t,i} are weights defining how much of each source hidden
state should be considered for each output.
• In Bahdanau’s paper, the alignment score α is parametrized by a
feed-forward network with a single hidden layer, and this network is
jointly trained with other parts of the model.
Attention Mechanism
• The score function is therefore in the following form, given that tanh is
used as the non-linear activation function:
      score(s_t, h_i) = v_a^T tanh(W_a [s_t ; h_i])

• where both v_a and W_a are weight matrices to be learned in the alignment
model (see the sketch below).
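Below is a minimal sketch of this additive (Bahdanau-style) scoring and the
resulting context vector c_t, assuming PyTorch; the class name
AdditiveAttention and all dimensions are illustrative assumptions, not part
of the original slides.

```python
# Additive (Bahdanau-style) attention score - a sketch, assuming PyTorch.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim=64):
        super().__init__()
        self.W_a = nn.Linear(dec_dim + enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim) previous decoder state
        # enc_states: (batch, src_len, enc_dim) encoder hidden states h_i
        src_len = enc_states.size(1)
        s_rep = s_prev.unsqueeze(1).expand(-1, src_len, -1)         # repeat s for every i
        scores = self.v_a(torch.tanh(self.W_a(torch.cat([s_rep, enc_states], dim=-1))))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)           # alignment weights α_t,i
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # c_t = Σ_i α_t,i h_i
        return context, alpha

# Usage with random tensors standing in for decoder and encoder states.
attn = AdditiveAttention(dec_dim=256, enc_dim=512)
c_t, alpha = attn(torch.randn(2, 256), torch.randn(2, 7, 512))
```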
Self-Attention
• Self-attention, also known as intra-attention, is an attention mechanism
relating different positions of a single sequence in order to compute a
representation of the same sequence.

• It has been shown to be very useful in machine reading, abstractive
summarization, or image description generation.
Self-Attention

When doing machine translation, for example, it is important to have
attention scores between the source and target sequences, and also within
the source sequence itself – thus self-attention.
Soft vs Hard Attention
• In the Show, Attend and Tell paper, the attention mechanism is applied to
images to generate captions.

• The image is first encoded by a CNN to extract features.

• Then an LSTM decoder consumes the convolutional features to produce
descriptive words one by one, where the weights are learned through
attention.

• The visualization of the attention weights clearly demonstrates which
regions of the image the model is paying attention to so as to output a
certain word.
Soft vs Hard Attention
The distinction between “soft” and “hard” attention is based on whether the
attention has access to the entire image or only a patch:

• Soft Attention: the alignment weights are learned and placed “softly”
over all patches in the source image; essentially the same type of
attention as in Bahdanau et al., 2015.
– Pro: the model is smooth and differentiable.
– Con: expensive when the source input is large.

• Hard Attention: only selects one patch of the image to attend to at a
time.
– Pro: less calculation at inference time.
– Con: the model is non-differentiable and requires more complicated
techniques such as variance reduction or reinforcement learning to train.
(Luong, et al., 2015)
Global vs Local Attention
• Luong, et al., 2015 proposed “global” and “local” attention.
• Global attention is similar to soft attention, while local attention is
an interesting blend between hard and soft, an improvement over hard
attention to make it differentiable.
• Computing attention over the entire input sequence as in global attention
is sometimes unnecessary because, despite its simplicity, it can be
computationally expensive. Local attention resulted as a solution for this,
as it considers only a subset of the input units/tokens.
• In global attention, we require as many weights as the source sentence
length, whereas in local attention it is less because attention is placed
over only a few source states.
Advantages of Attention Mechanism

• The Attention mechanism is a powerful tool for improving the performance
of deep learning models, and it has several key advantages. Some of the
main advantages of the attention mechanism include the following:
• Improved accuracy: By allowing the model to pay attention to the most
relevant information, the attention mechanism can help to improve the
accuracy of predictions.
• Improved efficiency: The attention mechanism can make the model
more efficient by only processing the most important data, reducing the
computational resources required and making the model more
scalable.
• Improved interpretability: The attention weights learned by the model
can provide insight into which parts of the data are the most important,
which can help improve the model's interpretability.
Disadvantages of Attention Mechanism

• Difficulty of training: The attention mechanism can be challenging to
train, especially for large and complex tasks. This is because the
attention weights must be learned from the data, which can require a large
amount of data and computational resources.
• Overfitting: The attention mechanism might be prone to overfitting, which
means that the model may perform considerably well on the training data
but not generalize well to new data. Regularization techniques can mitigate
this, but it can still be challenging when working with large and complex
tasks.
• Exposure bias: The attention mechanism can suffer from the problem of
exposure bias, which occurs when the model is trained to generate the
output sequence one step at a time, but at test time, it is required to
generate the entire sequence at once. This can lead to poor performance on
the test data, as the model may not be able to generate the entire output
sequence accurately.
Problems of LSTMs and RNNs

• Sequential computation inhibits parallelization
• No explicit modeling of long- and short-range dependencies
• “Distance” between positions is linear
Convolutional Neural Networks

• Convolutional Neural Networks help solve these problems. With them:
• Computation is trivial to parallelize (per layer)
• Local dependencies are exploited
• Distance between positions is logarithmic
• The problem is that Convolutional Neural Networks do not necessarily help
with figuring out dependencies when translating sentences. That’s why
Transformers were created; they combine ideas from CNNs with attention.
Transformer
• The Transformer in NLP is a novel architecture that aims to solve
sequence-to-sequence tasks while handling long-range
dependencies with ease.
• It relies entirely on self-attention to compute representations of
its input and output WITHOUT using sequence-aligned RNNs or
convolution.
• Transformer is a model that uses attention to boost the speed.
More specifically, it uses self-attention.
• Transformers were recently used by OpenAI in their
language models, and also used recently by DeepMind for
AlphaStar — their program to defeat a top professional Starcraft
player.
• The Transformer was proposed in the paper Attention Is All You
Need.
Transformer

– “The Transformer is the first transduction model relying entirely on
self-attention to compute representations of its input and output without
using sequence-aligned RNNs or convolution.”

❑ “Transduction” means the conversion of input sequences into output
sequences.
❑ The idea behind the Transformer is to handle the dependencies between
input and output entirely with attention, doing away with recurrence.
Architecture

The Transformer consists of a stack of six encoders and six decoders.
Transformer (Encoder)
• Each encoder consists of two layers: Self-Attention and a Feed-Forward
Neural Network.

The encoder’s inputs first flow through a self-attention layer – a layer
that helps the encoder look at other words in the input sentence as it
encodes a specific word.
Transformer (Decoder)
• The decoder has both those layers, but between them is an
attention layer that helps the decoder focus on relevant parts
of the input sentence
Transformer
• For implementing this we need word embeddings.
• A word embedding is a means of building a low-dimensional vector
representation from a corpus of text, which preserves the contextual
similarity of words. Common techniques include word2vec, CBOW (Continuous
Bag-Of-Words) and Skip-gram.
Transformer

The embedding only happens in the bottom-most encoder. The abstraction that
is common to all the encoders is that they receive a list of vectors, each
of size 512. In the bottom encoder these would be the word embeddings, but
in other encoders it would be the output of the encoder that’s directly
below.
Transformer

As we’ve mentioned already, an encoder receives a list of vectors as input.
It processes this list by passing these vectors into a ‘self-attention’
layer, then into a feed-forward neural network, and then sends the output
upwards to the next encoder.
• Here we see one key property of the Transformer, which is that the word
in each position flows through its own path in the encoder.
• There are dependencies between these paths in the self-attention layer.
• The feed-forward layer does not have those dependencies, however, and
thus the various paths can be executed in parallel while flowing through
the feed-forward layer.
Architecture
Architecture
Inputs to Encoder and Decoder
• All input and output tokens to the Encoder/Decoder are converted to
vectors using learned embeddings.

• These input embeddings are then passed to Positional Encoding.
Positional Encoding
• The Transformer’s architecture does not contain any recurrence or
convolution and hence has no notion of word order.
• All the words of the input sequence are fed to the network with no
special order or position, as they all flow simultaneously through the
Encoder and Decoder stack.
• To understand the meaning of a sentence, it is essential to understand
the position and the order of words.
• Positional encoding is added to the model to help inject information
about the relative or absolute position of the words in the sentence.
• Positional encoding has the same dimension as the input embedding so that
the two can be summed, as in the sketch below.
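A minimal sketch of the sinusoidal positional encoding used in the original
“Attention Is All You Need” paper, assuming PyTorch; the sequence length
and d_model values below are illustrative.

```python
# Sinusoidal positional encoding - a sketch, assuming PyTorch; sizes are illustrative.
import torch

def positional_encoding(max_len=50, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1).float()       # (max_len, 1) positions
    i = torch.arange(0, d_model, 2).float()                 # even feature indices
    angle = pos / torch.pow(10000.0, i / d_model)           # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                          # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                          # odd dims: cosine
    return pe

# Same dimension as the embeddings, so the two can simply be summed:
embeddings = torch.randn(50, 512)                           # one 50-token sentence
encoder_input = embeddings + positional_encoding(50, 512)
```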
Self Attention
• Attention, in simple terms, is a way to get a better understanding of the
meaning and the context of words in a sentence.
• A self-attention layer connects all positions with a constant number of
sequentially executed operations and hence is faster than recurrent layers.
• An attention function in a Transformer is described as mapping a query
and a set of key-value pairs to an output.
• Query, key, and value are all vectors.
• Attention weights are calculated using Scaled Dot-Product Attention for
each word in the sentence.
• The final score is the weighted sum of the values.
Self attention Examples
Self-Attention in Detail

The first step in calculating self-attention is to create three vectors from each of the
encoder’s input vectors (in this case, the embedding of each word). So for each word, we
create a Query vector, a Key vector, and a Value vector. These vectors are created by
multiplying the embedding by three matrices that we trained during the training process.
Calculating Self-Attention
1. First, we need to create three vectors from each of the encoder’s
input vectors:
– Query Vector
– Key Vector
– Value Vector.
These vectors are trained and updated during the training process.

2. Next, we will calculate self-attention for every word in the input
sequence.
3. Consider this phrase – “Action gets results”. To calculate the
self-attention for the first word “Action”, we will calculate scores for
all the words in the phrase with respect to “Action”. This score determines
the importance of other words when we are encoding a certain word in an
input sequence.
Calculating Self-Attention
1. The score for the first word is calculated by taking the dot product of
the Query vector (q1) with the Key vectors (k1, k2, k3) of all the words.
Calculating Self-Attention
2. Then, these scores are divided by 8, which is the square root of the
dimension (64) of the key vectors.
Calculating Self-Attention
3. Next, these scores are normalized using the
softmax activation function
Calculating Self-Attention
4. These normalized scores are then multiplied by the value vectors (v1,
v2, v3), and the resultant vectors are summed up to arrive at the final
vector (z1). This is the output of the self-attention layer. It is then
passed on to the feed-forward network as input.
Calculating Self-Attention
z1 is the self-attention vector for the first word of the input sequence
“Action gets results”. We can get the vectors for the rest of the words in
the input sequence in the same fashion. A compact code sketch of these
steps follows below.
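The four steps above can be written compactly as scaled dot-product
attention. This is a sketch assuming PyTorch, using the 3-word phrase and
illustrative dimensions (d_k = 64, hence the division by 8); the randomly
initialized matrices stand in for the trained W_Q, W_K, W_V.

```python
# Scaled dot-product self-attention for "Action gets results" - a sketch (PyTorch).
import torch

torch.manual_seed(0)
d_model, d_k = 512, 64
x = torch.randn(3, d_model)                      # embeddings of the 3 words

W_q = torch.randn(d_model, d_k)                  # stand-ins for trained matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v              # query, key, value vectors per word

scores = Q @ K.T                                 # step 1: dot products q_i . k_j
scores = scores / (d_k ** 0.5)                   # step 2: divide by sqrt(d_k) = 8
weights = torch.softmax(scores, dim=-1)          # step 3: softmax over each row
Z = weights @ V                                  # step 4: weighted sum of values
print(Z[0])                                      # z1, self-attention output for "Action"
```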
Multi-Head Attention
• Self-attention is computed not once but multiple times in the
Transformer’s architecture, in parallel and independently.
• It is therefore referred to as Multi-head Attention.
Multi-Head Attention
• Each attention head has a different linear transformation applied to the
same input representation.
• The Transformer uses eight different attention heads, which are computed
in parallel and independently.
• With eight attention heads, we have eight different sets of Query, Key,
and Value weight matrices for each Encoder and Decoder, and each of these
sets is initialized randomly.

– “Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions.”
Multi-Head Attention

The feed-forward layer is not expecting eight matrices – it’s expecting a
single matrix (a vector for each word). So we need a way to condense these
eight down into a single matrix: the outputs of the heads are concatenated
and multiplied by an additional weight matrix, as in the sketch below.
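A sketch of multi-head self-attention assuming PyTorch: eight heads are
computed in parallel, then concatenated and projected back to a single
matrix by an output weight matrix (W_o below). The class name and
dimensions are illustrative assumptions.

```python
# Multi-head self-attention - a sketch, assuming PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)    # condenses the heads back to one matrix

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        def split(t):                              # split last dim into the 8 heads
            return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)  # (b, heads, n, d_k)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ V                        # (b, heads, n, d_k), computed in parallel
        concat = heads.transpose(1, 2).reshape(b, n, -1)   # concatenate the 8 heads
        return self.W_o(concat)                    # single matrix expected by the FFN

# Usage
mha = MultiHeadAttention()
out = mha(torch.randn(2, 5, 512))                  # (2, 5, 512)
```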
Masked Multi-Head Attention
• The Decoder has masked multi-head attention where it masks
or blocks the decoder inputs from the future steps.
• During training, the multi-head attention of the Decoder hides
the future decoder inputs.
• For the machine translation task of translating the sentence “I enjoy
nature” from English to Hindi using the Transformer, the Decoder will
consider all the input words “I, enjoy, nature” to predict the first word.
Masked Multi-Head Attention
• The Decoder would block the inputs from future steps, typically by
masking them out before the softmax, as in the sketch below.
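A sketch of the masking step, assuming PyTorch: scores at future positions
are set to -inf before the softmax, so their attention weights become zero.
The 3x3 example size is illustrative.

```python
# Masking future positions for a 3-token target sequence - a sketch (PyTorch).
import torch

scores = torch.randn(3, 3)                                  # raw decoder attention scores
mask = torch.triu(torch.ones(3, 3), diagonal=1).bool()      # True above the diagonal (future)
scores = scores.masked_fill(mask, float('-inf'))            # block future steps
weights = torch.softmax(scores, dim=-1)                     # future positions get weight 0
print(weights)                                               # row t only attends to positions <= t
```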
Layer Normalization:
• Normalizes the inputs across each of
the features and is independent of
other examples.
• Layer normalization reduces the
training time in feed-forward neural
networks.
• In layer normalization, we compute the mean and variance from all of the
summed inputs to the neurons in a layer on a single training case, as in
the sketch below.
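A sketch of layer normalization, assuming PyTorch: mean and variance are
computed per example across the features, and the result matches
nn.LayerNorm without learnable scale and shift.

```python
# Layer normalization over the feature dimension - a sketch, assuming PyTorch.
import torch

def layer_norm(x, eps=1e-5):
    # x: (batch, seq_len, d_model); statistics are per token, across features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 5, 512)
reference = torch.nn.LayerNorm(512, elementwise_affine=False)(x)
print(torch.allclose(layer_norm(x), reference, atol=1e-5))   # True
```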
Fully Connected Layer
• The Encoder and Decoder in the Transformer both have a fully connected
feed-forward network, which consists of two linear transformations with a
ReLU activation in between (see the sketch below).
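A sketch of the position-wise feed-forward network, assuming PyTorch; the
inner dimension of 2048 and d_model of 512 follow the original paper.

```python
# Position-wise feed-forward network: two linear layers with ReLU in between (sketch).
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(512, 2048),   # first linear transformation
    nn.ReLU(),              # ReLU activation in between
    nn.Linear(2048, 512),   # second linear transformation, back to d_model
)
```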
Features of Transformers
The drawbacks of the seq2seq model are addressed by the Transformer:
• Parallelizing Computation:
– The Transformer’s architecture removes the sequential recurrence used in
the Seq2Seq model and relies entirely on Self-Attention to understand
global dependencies between input and output.
– Self-Attention helps significantly with parallelizing the computation.
• Reduced number of operations:
– Transformers have a constant number of operations as the
attention weights are averaged in multi-head attention
Features of Transformers
The drawbacks of the seq2seq model are addressed by the Transformer:
• Long-range dependencies:
– A factor that impacts the learning of long-range dependencies is the
length of the forward and backward paths the signals have to traverse in
the network.
– The shorter the route between any combination of positions in the input
and output sequences, the easier it is to learn long-range dependencies.
– The Self-Attention layer connects all positions with a constant number
of sequentially executed operations, making it easier to learn long-range
dependencies.
Limitations of the Transformer
• Transformer is undoubtedly a huge improvement
over the RNN based seq2seq models.
• But it comes with its own share of limitations:
– Attention can only deal with fixed-length text strings. The
text has to be split into a certain number of segments or
chunks before being fed into the system as input
– This chunking of text causes context fragmentation. For
example, if a sentence is split from the middle, then a
significant amount of context is lost.
• https://jalammar.github.io/illustrated-transformer/
• https://towardsdatascience.com/transformers-141e32e69591
• https://stackoverflow.com/questions/58127059/how-to-understand-masked-multi-head-attention-in-transformer
