
Attention Models & Transformers

Department of CSIS | WILP | BITS Pilani

1


Topics
★ Brief review of Sequence Models (RNNs, LSTMs)
★ Encoder - Decoder Models
★ Attention
★ Transformers
★ Vision Transformers

2
References and recommended reading: Draft Chapters 9 and 10, Speech and Language Processing, Daniel Jurafsky & James H. Martin.
RNN Architectures for NLP Tasks

Examples: POS tagging and named entity tagging; sentiment analysis; next-word prediction; language translation

3
Encoder - Decoder
• Goal: Develop an architecture capable of generating contextually appropriate, arbitrary-length output sequences

• Applications:
• Machine translation
• Summarization
• Question answering
• Dialogue modeling.

4
Encoder - Decoder

Encoder:
• Input sequence (x_1, ..., x_n) → sequence of contextualized representations (h_1, ..., h_n)
• Ex: LSTM, CNN, Transformer, etc.

Context:
• c, a function of (h_1, ..., h_n)

Decoder:
• c → arbitrary-length sequence of hidden states (h_1, ..., h_m) → sequence of output states (y_1, ..., y_m)
5
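To make the abstraction concrete, here is a minimal numpy sketch of an encoder-decoder built from a vanilla RNN. The function names, shapes, and the choice of the encoder's last hidden state as the context are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: new hidden state from the current input and previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

def encode(xs, W_xh, W_hh, b):
    """Run the encoder over the input sequence and return all hidden states h_1..h_n."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        hs.append(h)
    return hs

def decode(c, num_steps, W_hh, b):
    """Run the decoder for num_steps conditioned on the context vector c.
    Here c simply initialises the decoder's hidden state; inputs and the output
    projection are omitted to keep the sketch small."""
    h, ys = c, []
    for _ in range(num_steps):
        h = np.tanh(W_hh @ h + b)   # a real decoder would also consume y_{t-1}
        ys.append(h)
    return ys

# Toy usage (d = hidden size, d_in = input embedding size)
rng = np.random.default_rng(0)
d, d_in = 8, 4
W_xh, W_hh, b = rng.normal(size=(d, d_in)), rng.normal(size=(d, d)), np.zeros(d)
hs = encode([rng.normal(size=d_in) for _ in range(5)], W_xh, W_hh, b)
c = hs[-1]                          # context: the encoder's last hidden state
ys = decode(c, num_steps=3, W_hh=rng.normal(size=(d, d)), b=np.zeros(d))
```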
Encoder - Decoder
Encoder - Decoder for Language Translation

Source and target sentences are concatenated with a separator token in between, and the decoder uses context information from the encoder’s last hidden state.

6
Encoder - Decoder
Encoder - Decoder for Language Translation

Weakness of this approach: the influence of the context vector, c, will wane as the output sequence is generated.

Solution: Make the context vector c available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state.

Source and target sentences are concatenated with a separator token in between, and the decoder uses context information from the encoder’s last hidden state.

7
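A minimal sketch of this fix (the weight-matrix names are my own, not from the slides): the context vector c becomes an input to every hidden-state update, instead of only seeding the decoder.

```python
import numpy as np

def decoder_step_with_context(y_prev, h_prev, c, W_yh, W_hh, W_ch, b):
    """Decoder update h_t = g(y_{t-1}, h_{t-1}, c): the context vector c is fed in
    at every step, so its influence does not wane as the output sequence grows."""
    return np.tanh(W_yh @ y_prev + W_hh @ h_prev + W_ch @ c + b)
```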
Encoder - Decoder
Encoder - Decoder for Language Translation

9
Encoder - Decoder
Encoder - Decoder for Language Translation

10
Encoder - Decoder
Training

11
Encoder - Decoder
Teacher Forcing
• Force the system to use the gold target token from training as the next input x_{t+1}, rather than allowing it to rely on the (possibly erroneous) decoder output ŷ_t.
• Speeds up training

12
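The sketch below shows where teacher forcing enters the decoder's training loop. `step_fn`, `start_token`, and the handling of the loss are placeholders assumed for illustration, not parts of the slides.

```python
def decoder_training_step(gold_targets, h0, start_token, step_fn, teacher_forcing=True):
    """Unroll the decoder over one training example.

    step_fn(inp, h) -> (prediction, new_hidden) is any single decoder step.
    With teacher forcing, the input at step t+1 is the gold token y_t from the
    training data rather than the model's own (possibly erroneous) prediction ŷ_t.
    """
    h, inp, predictions = h0, start_token, []
    for gold in gold_targets:
        pred, h = step_fn(inp, h)
        predictions.append(pred)      # each pred is compared against `gold` in the loss
        inp = gold if teacher_forcing else pred
    return predictions
```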
Attention !

In an encoder-decoder architecture, the final hidden state acts as a bottleneck:


● It must represent absolutely everything about the meaning of the source text
● The only thing the decoder knows about the source text is what’s in this context
vector

13
Attention !

Without attention, a decoder sees the same context vector, which is a static function of all the encoder hidden states.

14
Attention !

Without attention, a decoder sees the same context vector, which is a static function of all the encoder hidden states. With attention, the decoder sees a different, dynamic context, which is a function of all the encoder hidden states.

15
Attention !

Without attention, a decoder sees the same context vector, which is a static function of all the encoder hidden states. With attention, the decoder sees a different, dynamic context, which is a function of all the encoder hidden states.

With attention, the decoder gets information from all the hidden states of the encoder, not just the last hidden state of the encoder.
Each context vector is obtained by taking a weighted sum of all the encoder hidden states.
The weights focus on (‘attend to’) a particular part of the source text that is relevant for the token the decoder is currently producing.

16
Attention !
Step 1: Find out how relevant each encoder state is to the current decoder state.
Compute a score of similarity between the prior decoder hidden state h^d_{i−1} and all the encoder states h^e_j:

Dot-product attention: score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j

Step 2: Normalize all the scores with a softmax to create a vector of weights, α_{ij}.
α_{ij} indicates the proportional relevance of each encoder hidden state j to the prior decoder hidden state, h^d_{i−1}:

α_{ij} = softmax(score(h^d_{i−1}, h^e_j)) over all j

17
Attention !
Step 3: Given the distribution in α, compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

c_i = Σ_j α_{ij} · h^e_j
18
Attention !
Step 3: Given the distribution in α, compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

c_i = Σ_j α_{ij} · h^e_j

Plus: In Step 1, we can get a more powerful scoring function by parameterizing the score with its own set of weights, W_s:

score(h^d_{i−1}, h^e_j) = h^d_{i−1} · W_s · h^e_j

W_s is trained during normal end-to-end training.

W_s gives the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.

19
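A numpy sketch of Steps 1–3, including the optional parameterized score with W_s. Variable names and shapes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(h_dec_prev, enc_states, W_s=None):
    """Compute the attention context vector for the current decoder step.

    h_dec_prev : previous decoder hidden state, shape (d,)
    enc_states : encoder hidden states stacked row-wise, shape (n, d)
    W_s        : optional (d, d) weight matrix for the parameterized score
    """
    if W_s is None:
        scores = enc_states @ h_dec_prev            # step 1: dot-product attention
    else:
        scores = enc_states @ (h_dec_prev @ W_s)    # score = h_dec · W_s · h_enc
    alphas = softmax(scores)                        # step 2: normalize to weights
    return alphas @ enc_states, alphas              # step 3: weighted sum of encoder states

# Toy usage
rng = np.random.default_rng(0)
h_dec = rng.normal(size=6)
H_enc = rng.normal(size=(4, 6))
context, weights = attention_context(h_dec, H_enc)
```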
Attention !

20
Transformers
• 2017, NeurIPS (NIPS), Vaswani et al., "Attention Is All You Need"
• Transformers map sequences of input vectors (x_1, ..., x_n) to sequences of output vectors (y_1, ..., y_n) of the same length
• Made up of transformer blocks in which the key component is self-attention layers
  [Self-attention allows a network to directly extract and use information from arbitrarily large contexts]
• Transformers are not based on recurrent connections ⇒ parallel implementations possible ⇒ efficient to scale (compared with LSTMs)
21
Self-Attention | Transformers
• Attention ⇒ ability to compare an item of interest to a collection of other items in a way that reveals their relevance in the current context.
• Self-attention ⇒
  > The comparisons are to other elements within the same sequence
  > Use these comparisons to compute an output for the current input

22
Self-Attention | Transformers

In processing each element of the sequence, the model attends to all the inputs up to, and including,
the current one.

Unlike RNNs, the computations at each time step are independent of all the other steps and therefore can be
performed in parallel.
23
Self-Attention | Transformers

24
Self-Attention | Transformers
• Let us understand how transformers use self-attention!

25
Self-Attention | Transformers
In Vaswani et al., 2017, d was 1024.

• Let us understand how transformers use self-attention!

Each input embedding x_i plays three different roles in the computation of self-attention: Query (Q), Key (K), and Value (V).

Q: as the current focus of attention when being compared to all of the other preceding inputs.
K: in its role as a preceding input being compared to the current focus of attention.
V: as a value used to compute the output for the current focus of attention.

26
Self-Attention | Transformers
In Vaswani et al., 2017, d was 1024.

• Let us understand how transformers use self-attention!

Each input x_i is projected into query, key, and value vectors: q_i = x_i W^Q, k_i = x_i W^K, v_i = x_i W^V

The simple dot product can be arbitrarily large; a scaled dot product is used in transformers:

score(x_i, x_j) = (q_i · k_j) / √d_k
27
Self-Attention | Transformers

• Each output, y_i, is computed independently
• Entire process can be parallelized

Calculating the value of y_3, the third element of a sequence, using causal (left-to-right) self-attention
28
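The calculation of a single output such as y_3 can be sketched as follows; the causal restriction appears as the slice over positions j ≤ i. Shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def self_attention_output(i, X, W_Q, W_K, W_V):
    """Compute y_i for position i with causal (left-to-right) self-attention.

    X : input embeddings, shape (n, d); only positions j <= i are attended to.
    """
    d_k = W_K.shape[1]
    q_i = X[i] @ W_Q                               # query for the current focus
    K = X[: i + 1] @ W_K                           # keys for all positions up to i
    V = X[: i + 1] @ W_V                           # values for all positions up to i
    scores = (K @ q_i) / np.sqrt(d_k)              # scaled dot-product scores
    alphas = softmax(scores)
    return alphas @ V                              # weighted sum of the values

# Toy usage: y_3, the third element of the sequence
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
y_3 = self_attention_output(2, X, W_Q, W_K, W_V)
```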
Self-Attention | Transformers
• Pack the input embeddings of the N input tokens into a single matrix

> Each row of X is the embedding of one token of the input

• Multiply X by the query, key, and value (d×d) matrices: Q = X W^Q, K = X W^K, V = X W^V

29
Self-Attention | Transformers

Matrix form: SelfAttention(Q, K, V) = softmax(QK^T / √d_k) V

Note: The upper-triangle portion of the comparisons matrix QK^T is zeroed out (set to −∞, which the softmax will turn to zero) to preserve the causal, left-to-right order.

30
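A numpy sketch of the matrix form with the causal mask; the −∞ entries in the upper triangle become zeros after the row-wise softmax. Names and shapes are assumptions for illustration.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Matrix form of causal self-attention over a whole sequence at once.

    X : (N, d) matrix whose rows are the input token embeddings.
    Returns an (N, d_v) matrix whose rows are the outputs y_1..y_N.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # pack all queries/keys/values
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (N, N) comparison matrix
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)            # future positions -> -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Toy usage
rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Y = causal_self_attention(X, W_Q, W_K, W_V)             # Y[i] depends only on X[:i+1]
```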
Transformer Blocks | Transformers

31
Transformer Blocks | Transformers

LayerNorm: LayerNorm(x) = γ · (x − μ) / σ + β, where μ and σ are the mean and standard deviation of the components of x, and γ and β are learned gain and offset parameters.

32
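A sketch of one block in the original post-layer-norm arrangement (residual connection plus LayerNorm around both the self-attention and feed-forward sublayers); the `self_attention` and `feed_forward` callables and the parameter names are placeholders.

```python
import numpy as np

def layer_norm(x, gain, offset, eps=1e-5):
    """Normalize each row to zero mean and unit variance, then apply learned gain/offset."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + offset

def transformer_block(X, self_attention, feed_forward, params):
    """One transformer block: residual + LayerNorm around self-attention,
    then residual + LayerNorm around the position-wise feed-forward layer."""
    g1, b1, g2, b2 = params
    Z = layer_norm(X + self_attention(X), g1, b1)
    Y = layer_norm(Z + feed_forward(Z), g2, b2)
    return Y
```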
Multihead-Attention | Transformers
• Different words in a sentence can relate to each other in many different ways simultaneously
  >> A single transformer block is inadequate to learn to capture all of the different kinds of parallel relations among its inputs.
• Multihead self-attention layers
  >> Heads ⇒ sets of self-attention layers that reside in parallel at the same depth in a model, each with its own set of parameters.
  >> Each head learns different aspects of the relationships that exist among inputs at the same level of abstraction.

33
Multihead-Attention | Transformers

Each of the multihead self-attention layers is provided with its own set of key, query, and value weight matrices. The outputs from each of the heads are concatenated and then projected down to d, producing an output of the same size as the input, so layers can be stacked.

35
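A sketch of the concatenate-then-project pattern described above; `sdp_attention` is a plain (non-causal) scaled dot-product attention included only so the example runs, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def sdp_attention(X, W_Q, W_K, W_V):
    """Plain (non-causal) scaled dot-product self-attention, used only for the demo below."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multihead_attention(X, head_params, W_O, attention_fn):
    """Run several attention heads in parallel, concatenate their outputs,
    and project back down to the model dimension d.

    X           : (N, d) input embeddings
    head_params : list of (W_Q, W_K, W_V) tuples, one per head (each d x d_head)
    W_O         : (num_heads * d_head, d) output projection
    """
    head_outputs = [attention_fn(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in head_params]
    concat = np.concatenate(head_outputs, axis=-1)   # (N, num_heads * d_head)
    return concat @ W_O                              # (N, d): same size as the input

# Toy usage with 4 heads of size d/4
rng = np.random.default_rng(0)
N, d, h = 6, 16, 4
X = rng.normal(size=(N, d))
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(d, d))
Y = multihead_attention(X, heads, W_O, sdp_attention)
```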
Positional Embeddings | Transformers

A simple way to model position: add an embedding representation of the absolute position to the input word embedding to produce a new embedding of the same dimensionality.

A combination of sine and cosine functions with differing frequencies was used in the original transformer work.

36
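One common way to implement these sinusoidal positional embeddings, following the pattern from the original transformer paper (this sketch assumes an even embedding dimension d):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d):
    """Sine/cosine positional embeddings:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    Assumes d is even."""
    positions = np.arange(max_len)[:, None]             # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d, 2) / d)     # (d/2,) frequencies
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# Each input embedding is summed with the embedding of its absolute position:
# X_with_pos = X + sinusoidal_positional_embeddings(X.shape[0], X.shape[1])
```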
Transformers as Language Models

37
Vision Transformers (ViT)
• Ref: Dosovitskiy et al. (2021), ICLR, "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"

38
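The core ViT preprocessing step can be sketched as follows: cut the image into non-overlapping 16×16 patches, flatten each patch, and project it linearly so the patches form a token sequence for a standard transformer encoder. The class token and positional embeddings are omitted here; names and shapes are illustrative assumptions.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, W_E):
    """Split an image into non-overlapping patches, flatten each patch, and
    linearly project it to the model dimension, as in ViT.

    image : (H, W, C) array; H and W assumed divisible by patch_size
    W_E   : (patch_size * patch_size * C, d) projection matrix
    Returns a (num_patches, d) sequence of patch embeddings, which the
    transformer encoder then treats like a sequence of token embeddings.
    """
    H, W, C = image.shape
    p = patch_size
    patches = (image.reshape(H // p, p, W // p, p, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, p * p * C))             # (num_patches, p*p*C)
    return patches @ W_E

# Toy usage: a 224x224 RGB image with 16x16 patches -> 196 patch embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
W_E = rng.normal(size=(16 * 16 * 3, 64))
tokens = image_to_patch_embeddings(img, 16, W_E)          # shape (196, 64)
```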
Readings

1. Draft Chapters 9 and 10, Speech and Language Processing, Daniel Jurafsky & James H. Martin.
2. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
3. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

39
Thank you

40
