The Transformer Model


by Stefania Cristina on November 4, 2021 in Attention


We have already familiarized ourselves with the concept of self-attention as implemented by the Transformer attention mechanism for neural machine translation. We will now shift our focus to the details of the Transformer architecture itself, to discover how self-attention can be implemented without relying on recurrence and convolutions.

In this tutorial, you will discover the network architecture of the Transformer model.

After completing this tutorial, you will know:

How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
How the Transformer encoder and decoder work.
How the Transformer self-attention compares to the use of recurrent and convolutional layers.

Let’s get started.

The Transformer Model. Photo by Samule Sun, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

The Transformer Architecture
    The Encoder
    The Decoder
Sum Up: The Transformer Model
Comparison to Recurrent and Convolutional Layers

Prerequisites
For this tutorial, we assume that you are already familiar
with:

The concept of attention
The attention mechanism
The Transformer attention mechanism

The Transformer Architecture


The Transformer architecture follows an encoder-decoder
structure, but does not rely on recurrence and
convolutions in order to generate an output.

The Encoder-Decoder Structure of the Transformer Architecture. Taken from “Attention Is All You Need“.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.

At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

– Attention Is All You Need, 2017.

The Encoder

The Encoder Block of the Transformer Architecture. Taken from “Attention Is All You Need“.

The encoder consists of a stack of N = 6 identical layers, where each layer is composed of two sublayers:

1. The first sublayer implements a multi-head self-attention mechanism. We have already seen that the multi-head mechanism implements h heads, each of which receives a different, linearly projected version of the queries, keys, and values, producing h outputs in parallel that are then used to generate a final result.

2. The second sublayer is a fully connected feed-forward network, consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between:

FFN(x) = ReLU(W_1 x + b_1) W_2 + b_2


The six layers of the Transformer encoder apply the same linear transformations to all of the words in the input sequence, but each layer employs different weight (W_1, W_2) and bias (b_1, b_2) parameters to do so.
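
To make this sublayer concrete, here is a minimal NumPy sketch of the position-wise feed-forward network (not the implementation of any particular library). The dimensions d_model = 512 and d_ff = 2048 follow the paper; the function name, random initialization, and sequence length are illustrative only.

```python
import numpy as np

# Position-wise feed-forward network: two linear transformations with a ReLU
# in between, applied identically and independently to every position.
# (The row-vector convention x @ W1 is used here; the column-vector form
# W1 x in the text is the same operation up to transposition.)
def position_wise_ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10           # d_model and d_ff as in the paper
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))          # a sequence of d_model-dimensional vectors
y = position_wise_ffn(x, W1, b1, W2, b2)         # output shape: (seq_len, d_model)
```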

Furthermore, each of these two sublayers has a residual connection around it.

Each sublayer is also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):

layernorm(x + sublayer(x))
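
A minimal sketch of this residual-plus-normalization step (“Add & Norm” in the architecture figure), assuming NumPy and omitting the learnable gain and bias parameters that a complete layer normalization would include:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (the learnable scale and shift parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # layernorm(x + sublayer(x)): the residual connection followed by normalization.
    return layer_norm(x + sublayer_output)
```
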
An important consideration to keep in mind is that the
Transformer architecture cannot inherently capture any
information about the relative positions of the words in
the sequence, since it does not make use of recurrence.
This information has to be injected by introducing
positional encodings to the input embeddings.

The positional encoding vectors have the same dimension as the input embeddings and are generated using sine and cosine functions of different frequencies. They are then simply added to the input embeddings in order to inject the positional information.
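
The following NumPy sketch generates these sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), assuming an even d_model; the function name is illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]              # (1, d_model // 2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)     # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions use cosine
    return pe

# The encodings are simply added to the embeddings to inject positional information:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```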

The Decoder

The Decoder Block of the Transformer Architecture. Taken from “Attention Is All You Need“.

The decoder shares several similarities with the encoder. It also consists of a stack of N = 6 identical layers, each of which is composed of three sublayers:

1. The first sublayer receives the previous output of the decoder stack, augments it with positional information, and implements multi-head self-attention over it. While the encoder is designed to attend to all words in the input sequence, regardless of their position in the sequence, the decoder is modified to attend only to the preceding words. Hence, the prediction for a word at position i can only depend on the known outputs for the words that come before it in the sequence. In the multi-head attention mechanism (which implements multiple, single attention functions in parallel), this is achieved by introducing a mask over the values produced by the scaled multiplication of matrices Q and K^T. This masking is implemented by suppressing the matrix values that would otherwise correspond to illegal connections (a minimal sketch of this masking follows the list below):

$$
\text{mask}(QK^T) = \text{mask}\left(\begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}\right) =
\begin{bmatrix}
e_{11} & -\infty & \dots & -\infty \\
e_{21} & e_{22} & \dots & -\infty \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}
$$

The Multi-Head Attention in the Decoder Implements Several Masked, Single Attention Functions. Taken from “Attention Is All You Need“.

The masking makes the decoder unidirectional (unlike the bidirectional encoder).

– Advanced Deep Learning with Python, 2019.

2. The second sublayer implements a multi-head attention mechanism (encoder-decoder attention), which is similar to the one implemented in the first sublayer of the encoder. On the decoder side, this multi-head mechanism receives the queries from the previous decoder sublayer, and the keys and values from the output of the encoder. This allows the decoder to attend to all of the words in the input sequence.

3. The third sublayer implements a fully connected feed-forward network, which is similar to the one implemented in the second sublayer of the encoder.
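
As noted in the first sublayer above, here is a minimal NumPy sketch of the causal masking used in the decoder. It shows a single masked, scaled dot-product attention function; in the model, h such functions run in parallel on linearly projected queries, keys, and values, and those projections, batching, and learned weights are omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_scaled_dot_product_attention(Q, K, V):
    # Score matrix of e_ij values: (Q K^T) / sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: entries above the main diagonal correspond to "illegal"
    # connections to succeeding positions; setting them to -inf before the
    # softmax gives them zero attention weight.
    illegal = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(illegal, -np.inf, scores)
    return softmax(scores) @ V

seq_len, d_k = 5, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
out = masked_scaled_dot_product_attention(Q, K, V)   # position i attends only to positions 0..i
```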

Furthermore, the three sublayers on the decoder side also have residual connections around them, and are succeeded by a normalization layer.

Positional encodings are also added to the input embeddings of the decoder, in the same manner as previously explained for the encoder.

Sum Up: The Transformer Model


The Transformer model runs as follows:

1. Each word forming an input sequence is transformed into a d_model-dimensional embedding vector.

2. Each embedding vector representing an input word is augmented by summing it (element-wise) with a positional encoding vector of the same d_model dimension, hence introducing positional information into the input.

3. The augmented embedding vectors are fed into the encoder block, which consists of the two sublayers explained above. Since the encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, the Transformer encoder is bidirectional.

4. The decoder receives as input its own predicted output word at time step t-1.

5. The input to the decoder is also augmented by positional encoding, in the same manner as on the encoder side.

6. The augmented decoder input is fed into the three sublayers comprising the decoder block explained above. Masking is applied in the first sublayer in order to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all of the words in the input sequence.

7. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
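
To make the final step concrete, here is a small NumPy sketch of the projection and softmax that produce the next-word distribution. The random weight matrix is only a stand-in for the trained linear layer, and the decoder output is faked with random values; the names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_next_word(decoder_output, W_out):
    # Project the decoder output to vocabulary logits, apply a softmax to obtain
    # a probability distribution, and greedily pick the most likely next word
    # from the last generated position.
    logits = decoder_output @ W_out                  # (seq_len, vocab_size)
    probs = softmax(logits)
    return int(np.argmax(probs[-1]))

d_model, vocab_size, seq_len = 512, 1000, 5
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.02, size=(d_model, vocab_size))   # stand-in for the trained layer
decoder_output = rng.normal(size=(seq_len, d_model))         # stand-in for the decoder stack output
next_token_id = predict_next_word(decoder_output, W_out)
# In auto-regressive use, the predicted word is appended to the (shifted-right)
# decoder input and the model is run again for the next time step.
```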

Comparison to Recurrent and Convolutional Layers

Vaswani et al. (2017) explain that their motivation for abandoning the use of recurrence and convolutions was based on several factors:

1. Self-attention layers were found to be faster than recurrent layers for shorter sequence lengths, and can be restricted to consider only a neighbourhood in the input sequence for very long sequence lengths.

2. The number of sequential operations required by a recurrent layer is based upon the sequence length, whereas this number remains constant for a self-attention layer.

3. In convolutional neural networks, the kernel width directly affects the long-term dependencies that can be established between pairs of input and output positions. Tracking long-term dependencies would require the use of large kernels, or stacks of convolutional layers, that could increase the computational cost.

Further Reading
This section provides more resources on the topic if you
are looking to go deeper.

Books
Advanced Deep Learning with Python, 2019.

Papers
Attention Is All You Need, 2017.

Summary
In this tutorial, you discovered the network architecture of
the Transformer model.

Specifically, you learned:

How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
How the Transformer encoder and decoder work.
How the Transformer self-attention compares to recurrent and convolutional layers.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.



About Stefania Cristina

Stefania Cristina, PhD is a Lecturer with the Department of Systems and Control Engineering at the University of Malta.
View all posts by Stefania Cristina →

Tags: attention, machine translation, transformer

