The Transformer Model


by Stefania Cristina on November 4, 2021 in Attention


We have already familiarized ourselves with the concept of self-attention as implemented by the Transformer attention mechanism for neural machine translation. We will now shift our focus to the details of the Transformer architecture itself, to discover how self-attention can be implemented without relying on recurrence and convolutions.

In this tutorial, you will discover the network architecture of the Transformer model.

After completing this tutorial, you will know:

How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
How the Transformer encoder and decoder work.
How the Transformer self-attention compares to the use of recurrent and convolutional layers.

Let’s get started.

The Transformer Model. Photo by Samule Sun, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

The Transformer Architecture
    The Encoder
    The Decoder
Sum Up: The Transformer Model
Comparison to Recurrent and Convolutional Layers

Prerequisites
For this tutorial, we assume that you are already familiar
with:

The concept of attention
The attention mechanism
The Transformer attention mechanism

The Transformer Architecture


The Transformer architecture follows an encoder-decoder
structure, but does not rely on recurrence and
convolutions in order to generate an output.

The Encoder-Decoder Structure of the Transformer Architecture. Taken from “Attention Is All You Need“.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.

At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

– Attention Is All You Need, 2017.

The Encoder

The Encoder Block of the Transformer Architecture. Taken from “Attention Is All You Need“.

The encoder consists of a stack of N = 6 identical layers, where each layer is composed of two sublayers:

1. The first sublayer implements a multi-head self-attention mechanism. We have already seen that the multi-head mechanism implements h heads, each of which receives a different, linearly projected version of the queries, keys, and values, producing h outputs in parallel that are then used to generate a final result.

2. The second sublayer is a fully connected feed-forward network, consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between:

FFN(x) = ReLU(W_1 x + b_1) W_2 + b_2


The six layers of the Transformer encoder apply the same linear transformations to all of the words in the input sequence, but each layer employs different weight (W_1, W_2) and bias (b_1, b_2) parameters to do so.
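
To make this sublayer concrete, here is a minimal NumPy sketch of the position-wise feed-forward network (not the implementation of any particular library). The dimensions d_model = 512 and d_ff = 2048 follow the paper; the function name, random initialization, and sequence length are illustrative only.

```python
import numpy as np

# Position-wise feed-forward network: two linear transformations with a ReLU
# in between, applied identically and independently to every position.
# (The row-vector convention x @ W1 is used here; the column-vector form
# W1 x in the text is the same operation up to transposition.)
def position_wise_ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10           # d_model and d_ff as in the paper
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))          # a sequence of d_model-dimensional vectors
y = position_wise_ffn(x, W1, b1, W2, b2)         # output shape: (seq_len, d_model)
```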

Furthermore, each of these two sublayers has a residual connection around it.

Each sublayer is also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):

layernorm(x + sublayer(x))
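
A minimal sketch of this residual-plus-normalization step (“Add & Norm” in the architecture figure), assuming NumPy and omitting the learnable gain and bias parameters that a complete layer normalization would include:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (the learnable scale and shift parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # layernorm(x + sublayer(x)): the residual connection followed by normalization.
    return layer_norm(x + sublayer_output)
```
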
An important consideration to keep in mind is that the
Transformer architecture cannot inherently capture any
information about the relative positions of the words in
the sequence, since it does not make use of recurrence.
This information has to be injected by introducing
positional encodings to the input embeddings.

The positional encoding vectors have the same dimension as the input embeddings and are generated using sine and cosine functions of different frequencies. They are then simply added to the input embeddings in order to inject the positional information.
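
The following NumPy sketch generates these sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), assuming an even d_model; the function name is illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]              # (1, d_model // 2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)     # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions use cosine
    return pe

# The encodings are simply added to the embeddings to inject positional information:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```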

The Decoder

The Decoder Block of the Transformer Architecture. Taken from “Attention Is All You Need“.

The decoder shares several similarities with the encoder. It also consists of a stack of N = 6 identical layers, each of which is composed of three sublayers:

1. The first sublayer receives the previous output of the decoder stack, augments it with positional information, and implements multi-head self-attention over it. While the encoder is designed to attend to all words in the input sequence, regardless of their position in the sequence, the decoder is modified to attend only to the preceding words. Hence, the prediction for a word at position i can only depend on the known outputs for the words that come before it in the sequence. In the multi-head attention mechanism (which implements multiple, single attention functions in parallel), this is achieved by introducing a mask over the values produced by the scaled multiplication of matrices Q and K^T. This masking is implemented by suppressing the matrix values that would otherwise correspond to illegal connections (a minimal sketch of this masking follows the list below):

$$
\text{mask}(QK^T) = \text{mask}\left(\begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}\right) =
\begin{bmatrix}
e_{11} & -\infty & \dots & -\infty \\
e_{21} & e_{22} & \dots & -\infty \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}
$$

The Multi-Head Attention in the Decoder Implements Several Masked, Single Attention Functions. Taken from “Attention Is All You Need“.

The masking makes the decoder unidirectional (unlike the bidirectional encoder).

– Advanced Deep Learning with Python, 2019.

2. The second sublayer implements a multi-head attention mechanism (encoder-decoder attention), which is similar to the one implemented in the first sublayer of the encoder. On the decoder side, this multi-head mechanism receives the queries from the previous decoder sublayer, and the keys and values from the output of the encoder. This allows the decoder to attend to all of the words in the input sequence.

3. The third sublayer implements a fully connected feed-forward network, which is similar to the one implemented in the second sublayer of the encoder.
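
As noted in the first sublayer above, here is a minimal NumPy sketch of the causal masking used in the decoder. It shows a single masked, scaled dot-product attention function; in the model, h such functions run in parallel on linearly projected queries, keys, and values, and those projections, batching, and learned weights are omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_scaled_dot_product_attention(Q, K, V):
    # Score matrix of e_ij values: (Q K^T) / sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: entries above the main diagonal correspond to "illegal"
    # connections to succeeding positions; setting them to -inf before the
    # softmax gives them zero attention weight.
    illegal = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(illegal, -np.inf, scores)
    return softmax(scores) @ V

seq_len, d_k = 5, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
out = masked_scaled_dot_product_attention(Q, K, V)   # position i attends only to positions 0..i
```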

Furthermore, the three sublayers on the decoder side also have residual connections around them, and are succeeded by a normalization layer.

Positional encodings are also added to the input embeddings of the decoder, in the same manner as previously explained for the encoder.

Sum Up: The Transformer Model


The Transformer model runs as follows:

1. Each word forming an input sequence is transformed into a d_model-dimensional embedding vector.

2. Each embedding vector representing an input word is augmented by summing it (element-wise) with a positional encoding vector of the same d_model dimension, hence introducing positional information into the input.

3. The augmented embedding vectors are fed into the encoder block, which consists of the two sublayers explained above. Since the encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, the Transformer encoder is bidirectional.

4. The decoder receives as input its own predicted output word at time step t-1.

5. The input to the decoder is also augmented by positional encoding, in the same manner as on the encoder side.

6. The augmented decoder input is fed into the three sublayers comprising the decoder block explained above. Masking is applied in the first sublayer in order to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all of the words in the input sequence.

7. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
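
To make the final step concrete, here is a small NumPy sketch of the projection and softmax that produce the next-word distribution. The random weight matrix is only a stand-in for the trained linear layer, and the decoder output is faked with random values; the names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_next_word(decoder_output, W_out):
    # Project the decoder output to vocabulary logits, apply a softmax to obtain
    # a probability distribution, and greedily pick the most likely next word
    # from the last generated position.
    logits = decoder_output @ W_out                  # (seq_len, vocab_size)
    probs = softmax(logits)
    return int(np.argmax(probs[-1]))

d_model, vocab_size, seq_len = 512, 1000, 5
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.02, size=(d_model, vocab_size))   # stand-in for the trained layer
decoder_output = rng.normal(size=(seq_len, d_model))         # stand-in for the decoder stack output
next_token_id = predict_next_word(decoder_output, W_out)
# In auto-regressive use, the predicted word is appended to the (shifted-right)
# decoder input and the model is run again for the next time step.
```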

Comparison to Recurrent and Convolutional Layers

Vaswani et al. (2017) explain that their motivation for abandoning the use of recurrence and convolutions was based on several factors:

1. Self-attention layers were found to be faster than recurrent layers for shorter sequence lengths, and can be restricted to consider only a neighbourhood in the input sequence for very long sequence lengths.

2. The number of sequential operations required by a recurrent layer is based upon the sequence length, whereas this number remains constant for a self-attention layer.

3. In convolutional neural networks, the kernel width directly affects the long-term dependencies that can be established between pairs of input and output positions. Tracking long-term dependencies would require the use of large kernels, or stacks of convolutional layers, that could increase the computational cost.

Further Reading
This section provides more resources on the topic if you
are looking to go deeper.

Books
Advanced Deep Learning with Python, 2019.

Papers
Attention Is All You Need, 2017.

Summary
In this tutorial, you discovered the network architecture of
the Transformer model.

Specifically, you learned:

How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
How the Transformer encoder and decoder work.
How the Transformer self-attention compares to recurrent and convolutional layers.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.



About Stefania Cristina

Stefania Cristina, PhD is a Lecturer with the Department of Systems and Control Engineering at the University of Malta.
View all posts by Stefania Cristina →

Tags: attention, machine translation, transformer

