Transformer

This document summarizes the Transformer, a novel neural network architecture introduced in the 2017 paper "Attention Is All You Need". The Transformer is based solely on attention mechanisms, using multi-head self-attention and positional encoding to let tokens attend to other tokens regardless of distance. It is more parallelizable than RNN and CNN models and establishes new state-of-the-art results on two machine translation tasks while requiring less training time than previous models.


Attention Is All You Need

Vaswani et al. NeurIPS 2017


Presented by Luke Song
Abstract
● Presents a new neural architecture named the Transformer
● Based solely on the attention mechanism widely used in seq2seq models
● More parallelizable than existing state-of-the-art (SOTA) models
● Achieves SOTA on 2 machine translation datasets
Outline
1. Important Background
2. Model Architecture
3. Experimental Results
4. Model Variation Study
5. Conclusion & Limitation
6. Discussion Time :)
Important Background
What is Attention Mechanism?

● Mechanism used to let individual tokens "attend" to other tokens regardless of the distance between them
● The Transformer uses only self-attention, i.e., attention over the same sentence
● Think of self-attention as recalculating the representation of each token based on how its meaning is influenced by the other tokens in the same sentence

Source: https://github.com/jessevig/bertviz
Model Architecture
High Level

● The input embedding is first added to the Positional Encoding
● 3 components in each encoder/decoder layer: (Masked) Multi-Head Attention, Addition & Normalization, Feed Forward Network

Source: Attention Is All You Need


Model Architecture
Attention Function

● Maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors
● Q: Queries, K: Keys, V: Values, d_k: dimension of the keys (64 in the paper)
● Uses dot-product attention due to its empirical speed/space advantage
● Scales the dot product by 1/sqrt(d_k) because large values of d_k may push the softmax function into a region where it has extremely small gradients (see the code sketch below)
Source: Attention Is All You Need
Source: Illustrated Transformer
Adding it all together...
Source: Illustrated Transformer
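
The attention computation above is, in the paper's notation, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. As a minimal illustration (not the authors' implementation), here is a NumPy sketch; the sequence length of 4 and the random inputs are arbitrary assumptions, while d_k = 64 matches the paper.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so large dot products do not push the softmax
    # into a region with extremely small gradients
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one attention distribution per query
    return weights @ V                  # weighted sum of the value vectors

# Toy example: 4 tokens, d_k = d_v = 64 (as in the paper)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
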
Model Architecture
Multi-Head Attention

● Applies attention to different projected versions of Q, K, V
● Expands the model's ability to focus on different positions
● Generates multiple "representation subspaces" in order to give the model a better representation of the input
● Uses 8 attention heads, which are concatenated and fed into a linear layer at the end (sketched in code below)

Source: Attention Is All You Need


Source: Illustrated Transformer
Combining everything attention-wise...
Source: Illustrated Transformer
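
As a rough sketch of the idea (not the paper's implementation), the NumPy code below splits d_model = 512 features into 8 heads of 64 dimensions each, runs attention independently in each subspace, then concatenates the heads and applies a final linear projection; the random, untrained matrices are stand-ins for the learned projections W_Q, W_K, W_V, W_O.

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(x, num_heads=8, d_model=512, seed=0):
    # Random, untrained matrices stand in for the learned projections.
    d_head = d_model // num_heads                 # 512 / 8 = 64 per head
    rng = np.random.default_rng(seed)
    W_q, W_k, W_v, W_o = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Each head attends within its own 64-dimensional "representation subspace"
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    # Concatenate the 8 heads and feed them through a final linear layer
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.default_rng(1).normal(size=(5, 512))  # 5 tokens, d_model = 512
print(multi_head_attention(x).shape)                # (5, 512)
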
Before moving on..
• In the encoder, all queries, keys, and values come from the same place
• In the encoder-decoder attention layer, queries come from the previous decoder layer, and keys and values come from the output of the encoder
• This mimics the typical encoder-decoder attention mechanism
• In the decoder, to ensure the auto-regressive property, the model masks everything to the right of the token currently being attended to (see the masking sketch below)
Source: Illustrated Transformer; Positional Embedding
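
A minimal sketch of how such a mask can be realized, assuming (as is common) that positions to the right of the current token are set to -inf in the score matrix so the softmax assigns them zero weight:

import numpy as np

def apply_causal_mask(scores):
    # scores: (seq_len, seq_len) raw QK^T / sqrt(d_k) values for one sentence.
    # Entries above the diagonal correspond to "future" tokens; setting them
    # to -inf makes their softmax weight exactly zero.
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)

print(apply_causal_mask(np.zeros((4, 4))))  # upper triangle is -inf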

Model Architecture
Positional Encoding

• Since the attention mechanism in the Transformer does not process words auto-regressively (no recurrence nor convolution), the model needs something to let it know the relative position of tokens in the sentence
• Positional Encoding is a combination of sine and cosine functions of different frequencies (see the sketch below)
• Advantages: the distance between two tokens is symmetric, and distances between tokens are easy to calculate
Source: Illustrated Transformer
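
A minimal NumPy sketch of the sinusoidal scheme, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length of 10 is an arbitrary choice for illustration.

import numpy as np

def positional_encoding(seq_len, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimensions
    angles = pos / np.power(10000.0, i / d_model)    # different frequencies
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10)
print(pe.shape)  # (10, 512) -- added element-wise to the input embeddings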

Model Architecture
Layer Normalization & Residual Connection

• Layer normalization (Ba et al. 2016) is applied to the output of each sub-layer added to its input, i.e., LayerNorm(x + Sublayer(x))
• Layer normalization normalizes the input across the features
• Empirically shown to reduce training time
• A residual connection is a connection that skips a few layers (here, 1); see the sketch below
Source: Illustrated Transformer
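
A minimal sketch of the "Add & Norm" step, LayerNorm(x + Sublayer(x)); the learned gain and bias of layer normalization and the sub-layer itself are simplified stand-ins here.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position across its features (the last dimension);
    # the learned gain/bias of Ba et al. (2016) are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection that skips the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(5, 512))
out = add_and_norm(x, lambda h: h * 0.5)  # stand-in for an attention/FFN sub-layer
print(out.shape)  # (5, 512)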

Model Architecture
Position-wise Feed Forward Networks

• A fully connected feed-forward network
• Two linear transformations with a ReLU activation in between
• The inner layer has a dimensionality of 2048
• Applied to each position separately and identically
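
As a simplified stand-in (not the trained model), the computation is FFN(x) = max(0, xW1 + b1)W2 + b2 with d_model = 512 and an inner dimensionality of 2048; the random weights below are illustrative assumptions.

import numpy as np

def position_wise_ffn(x, d_model=512, d_ff=2048, seed=0):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically.
    # Random untrained weights stand in for the learned parameters.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
    b1 = np.zeros(d_ff)
    W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
    b2 = np.zeros(d_model)
    hidden = np.maximum(0.0, x @ W1 + b1)   # inner layer of width 2048 + ReLU
    return hidden @ W2 + b2

x = np.random.default_rng(1).normal(size=(5, 512))  # 5 positions, d_model = 512
print(position_wise_ffn(x).shape)                   # (5, 512)
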
Combining all elements...
Source: Illustrated Transformer

In case you are curious


Why Self-Attention?

• Lower total computational complexity per layer
• More parallelizable than existing fully auto-regressive models
• Shortens the path between tokens, enabling the model to learn long-term dependencies better
• Tang et al. (EMNLP 2018) claim that self-attention outperforms RNN/CNN as a semantic feature extractor and empirically show that it excels on the word sense disambiguation task (but not on subject-verb agreement over long distances!)
Experimental Results

● Achieves SOTA on 2 machine translation datasets
● Lower training cost than existing SOTA models
Model Variation Study

● Attention key size is important
● More heads doesn't necessarily mean better performance
● Learned positional embedding is not better than sinusoidal positional encoding
Conclusion & Limitation
● Introduces a groundbreaking new model that is solely based on attention
● Faster and better than existing models
● Still not fully parallelized, due to the decoder being auto-regressive
● The context is of fixed length, so the model cannot attend to long-term dependencies beyond it
● Stacking more encoders/decoders might lead to vanishing gradients
References
● “Attention Is All You Need,” Vaswani et al. NeurIPS 2017
● “Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures,” Tang et al.
EMNLP 2018
● “The Annotated Transformer,” https://nlp.seas.harvard.edu/2018/04/03/attention.html
● “The Illustrated Transformer,” http://jalammar.github.io/illustrated-transformer/
● “Positional Embedding,” https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
● “BertViz,” https://github.com/jessevig/bertviz
Thank you! &
Discussion Time :)
