The Transformer Model
Tutorial Overview
This tutorial is divided into three parts; they are:
Prerequisites
For this tutorial, we assume that you are already familiar
with:
Figure: The encoder-decoder structure of the Transformer architecture (from "Attention Is All You Need", 2017).
The Encoder
The encoder is a stack of identical layers, each containing two sublayers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each sublayer is wrapped in a residual connection followed by a normalization layer, so the output of a sublayer is given by:
layernorm(x + sublayer(x))
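As a minimal illustrative sketch (assuming TensorFlow's Keras layers, which this tutorial does not prescribe; the class name and hyperparameter defaults below are hypothetical, with the dimensions taken from the paper), one encoder layer wrapping both of its sublayers in this way could look as follows:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer, LayerNormalization, MultiHeadAttention

# Illustrative sketch (not the tutorial's code): one encoder layer in which each
# sublayer is followed by a residual connection and layer normalization,
# i.e. layernorm(x + sublayer(x)).
class EncoderLayer(Layer):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, **kwargs):
        super().__init__(**kwargs)
        self.attention = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        self.feed_forward = tf.keras.Sequential([Dense(d_ff, activation="relu"), Dense(d_model)])
        self.norm1 = LayerNormalization()
        self.norm2 = LayerNormalization()

    def call(self, x):
        # First sublayer: multi-head self-attention over the input sequence
        attention_output = self.attention(query=x, value=x, key=x)
        x = self.norm1(x + attention_output)        # layernorm(x + sublayer(x))
        # Second sublayer: position-wise feed-forward network
        feed_forward_output = self.feed_forward(x)
        return self.norm2(x + feed_forward_output)  # layernorm(x + sublayer(x))

Calling EncoderLayer()(tf.random.uniform((1, 10, 512))), for example, returns a tensor of the same shape as its input, which is what allows several such layers to be stacked.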
An important consideration to keep in mind is that the Transformer architecture cannot inherently capture any information about the relative positions of the words in a sequence, since it does not make use of recurrence. This information has to be injected by adding positional encodings to the input embeddings.
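The original paper uses fixed sinusoidal functions of the token position for these encodings. As a rough sketch only (assuming NumPy; the function name and the assumption of an even d_model are mine, not the tutorial's), they could be computed like this:

import numpy as np

# Sinusoidal positional encodings as defined in "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
# Assumes d_model is even.
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]     # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    encoding[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return encoding

The resulting (seq_len, d_model) matrix is simply added element-wise to the input embeddings before they enter the first encoder or decoder layer.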
The Decoder
The decoder is similar in structure, except that each of its layers contains three sublayers: a masked multi-head self-attention over the decoder's own (right-shifted) outputs, a multi-head attention over the output of the encoder, and a position-wise feed-forward network, each again wrapped in a residual connection followed by layer normalization. The masking, combined with shifting the output embeddings by one position, ensures that the prediction for a given position can depend only on the known outputs at earlier positions.
Further Reading
This section provides more resources on the topic if you
are looking to go deeper.
Books
Advanced Deep Learning with Python, 2019.
Papers
Attention Is All You Need, 2017.
Summary
In this tutorial, you discovered the network architecture of
the Transformer model.