Transformer

This document summarizes the Transformer, a novel neural network architecture introduced in the 2017 paper "Attention Is All You Need". The Transformer is based solely on attention mechanisms, using multi-head self-attention and positional encoding to let tokens attend to other tokens regardless of distance. It is more parallelizable than RNN and CNN models and establishes new state-of-the-art results on two machine translation tasks while requiring less training time than previous models.


Attention Is All You Need

Vaswani et al. NeurIPS 2017


Presented by Luke Song
Abstract
● Presents a new neural architecture named the Transformer
● Based solely on the attention mechanism widely used in seq2seq models
● More parallelizable than existing state-of-the-art (SOTA) models
● Achieves SOTA on 2 machine translation datasets
Outline
1. Important Background
2. Model Architecture
3. Experimental Results
4. Model Variation Study
5. Conclusion & Limitation
6. Discussion Time :)
Important Background
What is Attention Mechanism?

● Mechanism used to let individual tokens "attend" to other tokens regardless of the distance between them
● The Transformer uses only self-attention, i.e., attention over the same sentence
● Think of self-attention as recalculating the representation of each token based on how its meaning is influenced by the other tokens in the same sentence

Source: https://github.com/jessevig/bertviz
Model Architecture
High Level

● The input embedding is first added to the Positional Encoding
● 3 components in each encoder/decoder layer: (Masked) Multi-Head Attention, Addition & Normalization, Feed Forward Network

Source: Attention Is All You Need


Model Architecture
Attention Function

● Maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors
● Q: Queries, K: Keys, V: Values, d_k: dimension of the keys (64 in the paper)
● Uses dot-product attention due to its empirical speed/space advantage
● Scales the dot product by 1/sqrt(d_k) because large values of d_k may push the softmax function into a region where it has extremely small gradients (see the code sketch below)
Source: Attention Is All You Need
Source: Illustrated Transformer
Adding it all together...
Source: Illustrated Transformer
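
The attention computation above is, in the paper's notation, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. As a minimal illustration (not the authors' implementation), here is a NumPy sketch; the sequence length of 4 and the random inputs are arbitrary assumptions, while d_k = 64 matches the paper.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so large dot products do not push the softmax
    # into a region with extremely small gradients
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one attention distribution per query
    return weights @ V                  # weighted sum of the value vectors

# Toy example: 4 tokens, d_k = d_v = 64 (as in the paper)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
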
Model Architecture
Multi-Head Attention

● Applies attention to different projected versions of Q, K, V
● Expands the model's ability to focus on different positions
● Generates multiple "representation subspaces" in order to give the model a better representation of the input
● Uses 8 attention heads, which are concatenated and fed into a linear layer at the end (sketched in code below)

Source: Attention Is All You Need


Source: Illustrated Transformer
Combining everything attention-wise...
Source: Illustrated Transformer
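
As a rough sketch of the idea (not the paper's implementation), the NumPy code below splits d_model = 512 features into 8 heads of 64 dimensions each, runs attention independently in each subspace, then concatenates the heads and applies a final linear projection; the random, untrained matrices are stand-ins for the learned projections W_Q, W_K, W_V, W_O.

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(x, num_heads=8, d_model=512, seed=0):
    # Random, untrained matrices stand in for the learned projections.
    d_head = d_model // num_heads                 # 512 / 8 = 64 per head
    rng = np.random.default_rng(seed)
    W_q, W_k, W_v, W_o = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Each head attends within its own 64-dimensional "representation subspace"
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    # Concatenate the 8 heads and feed them through a final linear layer
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.default_rng(1).normal(size=(5, 512))  # 5 tokens, d_model = 512
print(multi_head_attention(x).shape)                # (5, 512)
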
Before moving on..
• In the encoder, all queries, keys, and values come from the same place
• In the encoder-decoder attention layer, queries come from the previous decoder layer, and keys and values come from the output of the encoder
• This mimics the typical encoder-decoder attention mechanism
• In the decoder, to ensure the auto-regressive property, the model masks everything to the right of the token currently being attended to (see the masking sketch below)
Source: Illustrated Transformer; Positional Embedding
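
A minimal sketch of how such a mask can be realized, assuming (as is common) that positions to the right of the current token are set to -inf in the score matrix so the softmax assigns them zero weight:

import numpy as np

def apply_causal_mask(scores):
    # scores: (seq_len, seq_len) raw QK^T / sqrt(d_k) values for one sentence.
    # Entries above the diagonal correspond to "future" tokens; setting them
    # to -inf makes their softmax weight exactly zero.
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)

print(apply_causal_mask(np.zeros((4, 4))))  # upper triangle is -inf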

Model Architecture
Positional Encoding

• Since the attention mechanism in the Transformer does not process words auto-regressively (no recurrence nor convolution), the model needs something to let it know the relative position of tokens in the sentence
• Positional Encoding is a combination of sine and cosine functions of different frequencies (see the sketch below)
• Advantages: the distance between two tokens is symmetric, and distances between tokens are easy to calculate
Source: Illustrated Transformer
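
A minimal NumPy sketch of the sinusoidal scheme, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length of 10 is an arbitrary choice for illustration.

import numpy as np

def positional_encoding(seq_len, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimensions
    angles = pos / np.power(10000.0, i / d_model)    # different frequencies
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10)
print(pe.shape)  # (10, 512) -- added element-wise to the input embeddings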

Model Architecture
Layer Normalization & Residual Connection

• Layer normalization (Ba et al. 2016) is applied to the output of each sub-layer added to its input, i.e., LayerNorm(x + Sublayer(x))
• Layer normalization normalizes the input across the features
• Empirically shown to reduce training time
• A residual connection is a connection that skips a few layers (here, 1); see the sketch below
Source: Illustrated Transformer
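
A minimal sketch of the "Add & Norm" step, LayerNorm(x + Sublayer(x)); the learned gain and bias of layer normalization and the sub-layer itself are simplified stand-ins here.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position across its features (the last dimension);
    # the learned gain/bias of Ba et al. (2016) are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection that skips the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(5, 512))
out = add_and_norm(x, lambda h: h * 0.5)  # stand-in for an attention/FFN sub-layer
print(out.shape)  # (5, 512)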

Model Architecture
Position-wise Feed Forward Networks

• A fully connected feed-forward network
• Two linear transformations with a ReLU activation in between
• The inner layer has a dimensionality of 2048
• Applied to each position separately and identically
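
As a simplified stand-in (not the trained model), the computation is FFN(x) = max(0, xW1 + b1)W2 + b2 with d_model = 512 and an inner dimensionality of 2048; the random weights below are illustrative assumptions.

import numpy as np

def position_wise_ffn(x, d_model=512, d_ff=2048, seed=0):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically.
    # Random untrained weights stand in for the learned parameters.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
    b1 = np.zeros(d_ff)
    W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
    b2 = np.zeros(d_model)
    hidden = np.maximum(0.0, x @ W1 + b1)   # inner layer of width 2048 + ReLU
    return hidden @ W2 + b2

x = np.random.default_rng(1).normal(size=(5, 512))  # 5 positions, d_model = 512
print(position_wise_ffn(x).shape)                   # (5, 512)
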
Combining all elements...
Source: Illustrated Transformer

In case you are curious


Why Self-Attention?

• Lower total computational complexity per layer
• More parallelizable than existing fully auto-regressive models
• Shortens the path between tokens, enabling the model to learn long-term dependencies better
• Tang et al. (EMNLP 2018) claim that self-attention outperforms RNN/CNN as a semantic feature extractor and empirically show that it excels on the word sense disambiguation task (but not on subject-verb agreement over long distances!)
Experimental Results

● Achieves SOTA on 2 machine translation datasets
● Lower training cost than existing SOTA models
Model Variation Study

● Attention key size is important
● More heads doesn't necessarily mean better performance
● Learned positional embedding is not better than sinusoidal positional encoding
Conclusion & Limitation
● Introduces a groundbreaking new model that is solely based on attention
● Faster and better than existing models
● Still not fully parallelized, due to the decoder being auto-regressive
● The context is of fixed length, so the model cannot attend to long-term dependencies beyond it
● Stacking more encoders/decoders might lead to vanishing gradients
References
● “Attention Is All You Need,” Vaswani et al. NeurIPS 2017
● “Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures,” Tang et al.
EMNLP 2018
● “The Annotated Transformer,” https://nlp.seas.harvard.edu/2018/04/03/attention.html
● “The Illustrated Transformer,” http://jalammar.github.io/illustrated-transformer/
● “Positional Embedding,” https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
● “BertViz,” https://github.com/jessevig/bertviz
Thank you! &
Discussion Time :)
