Deep Neural Network Module 7: Attention and Transformers
References and Recommended Readings: Draft Chapters 9 and 10, Speech and Language Processing, Daniel Jurafsky & James H. Martin.
RNN Architectures for NLP Tasks
• Applications:
• Machine translation
• Summarization
• Question answering
• Dialogue modeling.
Encoder - Decoder
Encoder:
• Input sequence (x1...n) → sequence of contextualized representations (h1...n)
• Ex: LSTM, CNN, Transformer, etc.
Context:
• c, a function of (h1...n)
Decoder:
• c → arbitrary-length sequence of hidden states (h1...m) → output sequence of states (y1...m)
Encoder - Decoder
Encoder - Decoder for Language Translation
Source and target sentences are concatenated with a separator token in between, and the decoder uses context information from the encoder’s last hidden state.
Encoder - Decoder
Encoder - Decoder for Language Translation
Solution: Make the context vector c available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state.
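A minimal NumPy sketch of this idea, not the lecture's actual implementation: toy dimensions, randomly initialized weights, greedy decoding, and an assumed start-token id of 0. The point is that the encoder's last hidden state c is fed into every decoder step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # hidden size (illustrative)
V = 20                     # toy vocabulary size

# Randomly initialized parameters (illustration only, no training)
W_enc = rng.normal(size=(d, d)); U_enc = rng.normal(size=(d, d))
W_dec = rng.normal(size=(d, d)); U_dec = rng.normal(size=(d, d))
C_dec = rng.normal(size=(d, d))          # projects the context c into the decoder update
E = rng.normal(size=(V, d))              # embedding table
W_out = rng.normal(size=(d, V))          # hidden state -> vocabulary logits

def encode(src_ids):
    """Run a simple recurrent encoder; return all hidden states."""
    h = np.zeros(d)
    states = []
    for t in src_ids:
        h = np.tanh(E[t] @ W_enc + h @ U_enc)
        states.append(h)
    return np.stack(states)

def decode_step(y_prev_id, h_prev, c):
    """One decoder step: the context c is an input at *every* step."""
    h = np.tanh(E[y_prev_id] @ W_dec + h_prev @ U_dec + c @ C_dec)
    return h, h @ W_out                  # new state, vocabulary logits

src = [3, 7, 1, 4]                       # made-up source token ids
enc_states = encode(src)
c = enc_states[-1]                       # static context: encoder's last hidden state

h, y = np.zeros(d), 0                    # 0 = <s> start token (assumption)
for _ in range(5):                       # generate a few tokens greedily
    h, logits = decode_step(y, h, c)
    y = int(np.argmax(logits))
    print(y, end=" ")
```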
Encoder - Decoder
Training
Encoder - Decoder
Teacher Forcing
• Force the system to use the gold target token from training as the next input xt+1, rather than allowing it to rely on the (possibly erroneous) decoder output ŷt
• Speeds up training
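A hedged sketch of teacher forcing with a toy recurrent decoder (random weights, made-up gold tokens, assumed start-token id 0): the gold token, not the model's own prediction, is fed back as the next input while the cross-entropy loss is accumulated.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 8, 20                              # toy sizes (illustrative)
E = rng.normal(size=(V, d))               # embedding table
W, U, W_out = (rng.normal(size=s) for s in [(d, d), (d, d), (d, V)])

def step(prev_id, h):
    """One toy decoder step: previous token id + hidden state -> new state, logits."""
    h = np.tanh(E[prev_id] @ W + h @ U)
    return h, h @ W_out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

gold = [5, 9, 2, 11]                      # gold target tokens y_1..y_4 (made up)

# Teacher forcing: feed the GOLD token as the next input, regardless of what
# the model predicted; accumulate cross-entropy against the gold tokens.
h, prev, loss = np.zeros(d), 0, 0.0       # 0 = <s> start token (assumption)
for y_t in gold:
    h, logits = step(prev, h)
    loss += -np.log(softmax(logits)[y_t])
    prev = y_t                            # <-- gold token, not argmax(logits)

print("teacher-forced loss:", loss / len(gold))
```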
Attention !
Without attention, the decoder sees the same context vector at every step, a static function of all the encoder hidden states. With attention, the decoder sees a different, dynamic context at each step, which is also a function of all the encoder hidden states.
Attention !
With attention, the decoder gets information from all the hidden states of the encoder, not just the last hidden state of the encoder.
Each context vector is obtained by taking a weighted sum of all the encoder hidden states.
The weights focus on (‘attend to’) the particular part of the source text that is most relevant for the token the decoder is currently producing.
Attention !
Step 1: Find out how relevant each encoder state is to the present decoder state. Compute a similarity score between the previous decoder state h^d_{i−1} and every encoder state h^e_j:
score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j
Step 2: Normalize the scores with a softmax to create a vector of weights α_ij giving the relevance of each encoder hidden state j to the current decoder state:
α_ij = softmax(score(h^d_{i−1}, h^e_j))
Attention !
Step 3: Given the distribution in α, compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:
c_i = Σ_j α_ij h^e_j
Attention !
Plus: In Step 1, we can get a more powerful scoring function by parameterizing the score with its own set of weights, Ws:
score(h^d_{i−1}, h^e_j) = h^d_{i−1} Ws h^e_j
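The three steps above, plus the parameterized score, in a small NumPy sketch (dimensions and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
enc_states = rng.normal(size=(6, d))      # h^e_1..h^e_6: encoder hidden states
h_dec = rng.normal(size=d)                # h^d_{i-1}: previous decoder state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 1: dot-product similarity between the decoder state and every encoder state
scores = enc_states @ h_dec               # score(h^d_{i-1}, h^e_j) = h^d_{i-1} . h^e_j

# Step 2: normalize the scores into a distribution alpha over encoder positions
alpha = softmax(scores)

# Step 3: fixed-length context = weighted average of the encoder hidden states
c_i = alpha @ enc_states                  # shape (d,), fed to the decoder at step i

# Parameterized score: score(h^d_{i-1}, h^e_j) = h^d_{i-1} Ws h^e_j
W_s = rng.normal(size=(d, d))
alpha_param = softmax(enc_states @ W_s.T @ h_dec)
c_i_param = alpha_param @ enc_states

print(alpha.round(2), c_i.shape)
```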
Transformers
• 2017, NIPS, Vaswani et al., "Attention Is All You Need"
• Transformers map sequences of input vectors (x1, ..., xn) to sequences of output vectors (y1, ..., yn) of the same length
• Made up of transformer blocks, in which the key component is self-attention layers
[ Self-attention allows a network to directly extract and use information from arbitrarily large contexts ]
• Transformers are not based on recurrent connections ⇒ parallel implementations possible ⇒ efficient to scale (compared to LSTMs)
Self-Attention | Transformers
• Attention ⇒ the ability to compare an item of interest to a collection of other items in a way that reveals their relevance in the current context
• Self-attention:
> The comparisons are to other elements within the same sequence
> These comparisons are used to compute an output for the current input
Self-Attention | Transformers
In processing each element of the sequence, the model attends to all the inputs up to, and including,
the current one.
Unlike RNNs, the computations at each time step are independent of all the other steps and therefore can be
performed in parallel.
Self-Attention | Transformers
• Let us understand how transformers use self-attention!
Self-Attention | Transformers
[ Figure: query (qi), key (ki), and value (vi) vectors computed from each input xi ]
In Vaswani et al. (2017), d was 1024.
Self-Attention | Transformers
• Each output yi is computed independently
• The entire process can be parallelized
Self-Attention | Transformers
Matrix form: pack the inputs into a matrix X (one row per token); then Q = X W^Q, K = X W^K, V = X W^V, and SelfAttention(Q, K, V) = softmax(Q K^T / √dk) V
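A minimal NumPy sketch of this matrix form, with a causal mask so each position attends only to itself and earlier positions, as described on the previous slide (toy dimensions; W^Q, W^K, W^V randomly initialized):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8                                 # sequence length, model dimension (toy)
X = rng.normal(size=(n, d))                 # input embeddings x_1..x_n

W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # queries, keys, values

scores = Q @ K.T / np.sqrt(d)               # scaled dot-product scores, shape (n, n)

# Causal (left-to-right) mask: position i may not look at positions j > i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -1e9, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax

Y = weights @ V                             # outputs y_1..y_n, one per input position
print(Y.shape)                              # (5, 8); every row computed in parallel
```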
Transformer Blocks | Transformers
[ Figure: a transformer block = self-attention layer + residual connection + layer normalization + feed-forward layer + residual connection + layer normalization ]
Transformer Blocks | Transformers
LayerNorm: LayerNorm(x) = γ (x − μ) / σ + β, where μ and σ are the mean and standard deviation of the components of x
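A hedged sketch of one transformer block in the post-norm arrangement of Vaswani et al. (residual connection, then LayerNorm, then a position-wise feed-forward layer). The identity "attention" passed in at the end and the FFN width d_ff are assumptions just to exercise the shapes; a real block would plug in the self-attention from the previous sketch.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """LayerNorm over the feature dimension: scale/shift the standardized vector."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def transformer_block(X, self_attention, W1, b1, W2, b2):
    """Post-norm block: x -> LayerNorm(x + SelfAttn(x)) -> LayerNorm(. + FFN(.))."""
    Z = layer_norm(X + self_attention(X))            # attention sublayer + residual
    FFN = np.maximum(0, Z @ W1 + b1) @ W2 + b2       # 2-layer position-wise FFN (ReLU)
    return layer_norm(Z + FFN)                       # feed-forward sublayer + residual

# Toy usage with an identity "attention", only to check that shapes are preserved
rng = np.random.default_rng(4)
n, d, d_ff = 5, 8, 16
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(transformer_block(X, lambda x: x, W1, b1, W2, b2).shape)   # (5, 8)
```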
Multihead-Attention | Transformers
• Different words in a sentence can relate to each other in many different ways
simultaneously
>> A single transformer block is inadequate for learning to capture all of the different kinds of parallel relations among its inputs.
• Multihead self-attention layers
>> Heads ⇒ sets of self-attention layers that reside in parallel at the same depth in a model, each with its own set of parameters.
>> Each head learns different aspects of the relationships that exist among inputs at the same level of abstraction.
Multihead-Attention | Transformers
Each of the heads in a multihead self-attention layer is provided with its own set of key, query, and value weight matrices.
The outputs from each of the heads are concatenated and then projected down to d, producing an output of the same size as the input, so layers can be stacked.
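A sketch of multi-head self-attention following the description above: each head gets its own W^Q, W^K, W^V, the head outputs are concatenated, and an output projection W^O maps the result back to d. Head count, sizes, and weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, n_heads = 5, 8, 2
d_head = d // n_heads                        # per-head dimensionality
X = rng.normal(size=(n, d))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def one_head(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax_rows(Q @ K.T / np.sqrt(d_head)) @ V     # (n, d_head)

# Each head has its own query/key/value projections
heads = [
    one_head(X, *(rng.normal(size=(d, d_head)) for _ in range(3)))
    for _ in range(n_heads)
]

W_O = rng.normal(size=(n_heads * d_head, d))  # output projection back to d
Y = np.concatenate(heads, axis=-1) @ W_O      # concatenate heads, project down
print(Y.shape)                                # (5, 8): same size as the input
```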
Positional Embeddings | Transformers
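Since the slide's figure is not reproduced here, this is a generic sketch: a position embedding is added to each token embedding before the first transformer block so that the model can distinguish word order. The sinusoidal scheme from Vaswani et al. is shown; learned position embeddings are an equally common choice.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Standard sinusoidal positions: even dims get sin, odd dims get cos."""
    pos = np.arange(n)[:, None]                       # (n, 1)
    i = np.arange(d // 2)[None, :]                    # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)       # (n, d/2)
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

rng = np.random.default_rng(6)
n, d = 5, 8
token_embeddings = rng.normal(size=(n, d))            # toy token embeddings
X = token_embeddings + sinusoidal_positions(n, d)     # input to the first block
print(X.shape)
```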
Transformers as Language Models
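A hedged sketch of how a causally masked transformer is used as a language model at generation time: run the model over the current prefix, take the logits at the last position, pick the next token, append it, and repeat. The `transformer_lm` function below is a placeholder that returns random logits, purely to show the loop.

```python
import numpy as np

rng = np.random.default_rng(7)
V = 20                                         # toy vocabulary size

def transformer_lm(prefix_ids):
    """Placeholder for a causally masked transformer LM: returns logits over the
    vocabulary for every position in the prefix (random here, for illustration)."""
    return rng.normal(size=(len(prefix_ids), V))

prefix = [0]                                   # 0 = <s> start token (assumption)
for _ in range(8):
    logits = transformer_lm(prefix)            # (len(prefix), V)
    next_id = int(np.argmax(logits[-1]))       # predict the token after the prefix
    prefix.append(next_id)

print(prefix[1:])
```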
Vision Transformers (ViT)
• Ref: Dosovitskiy et al. (2021), ICLR, "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"
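The ViT idea from the reference above, sketched with made-up sizes and random weights: split the image into 16x16 patches, flatten each patch, project it linearly to d, prepend a [CLS] token, add position embeddings, and feed the resulting token sequence to a standard transformer encoder. The random vectors standing in for the learned [CLS] token and position embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
H = W = 64; C = 3; P = 16; d = 32              # image, patch and model sizes (toy)
image = rng.normal(size=(H, W, C))

# "An image is worth 16x16 words": cut the image into non-overlapping P x P patches
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)        # (num_patches, P*P*C)

W_embed = rng.normal(size=(P * P * C, d))       # linear patch projection
cls = rng.normal(size=(1, d))                   # stand-in for the learned [CLS] token
tokens = np.concatenate([cls, patches @ W_embed], axis=0)
tokens = tokens + rng.normal(size=tokens.shape) * 0.02   # stand-in position embeddings

print(tokens.shape)   # (1 + 16 patches, d) -> fed to a transformer encoder
```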
Readings
1. Draft Chapters 9 and 10, Speech and Language Processing. Daniel Jurafsky & James H. Martin.
2. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information
processing systems 30 (2017).
3. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image
recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
Thank you