The Transformer
Lecture 20
Xavier Giro-i-Nieto
Associate Professor
Universitat Politecnica de Catalunya
@DocXavi
xavier.giro@upc.edu
2
Video-lecture
3
Acknowledgments
Marta R. Costa-jussà
Associate Professor
Universitat Politècnica de Catalunya
Carlos Escolano
PhD Candidate
Universitat Politècnica de Catalunya
Gerard I. Gállego
PhD Student
Universitat Politècnica de Catalunya
gerard.ion.gallego@upc.edu
@geiongallego
Outline
1. Reminders
4
5
Reminder
Nikhil Shah, “Attention? An Other Perspective!” (2020).
6
Reminder
Attention is a mechanism to compute a context vector (c) for a query (Q) as a
weighted sum of values (V).
Figure: Nikhil Shah, “Attention? An Other Perspective! [Part 1]” (2020)
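A minimal PyTorch sketch of this weighted sum (toy tensors, not from the original slides):

```python
import torch
import torch.nn.functional as F

d = 8                                  # toy embedding size (assumption)
q = torch.randn(1, d)                  # query (Q)
K = torch.randn(4, d)                  # keys, one per value
V = torch.randn(4, d)                  # values (V)

scores = q @ K.T / d ** 0.5            # similarity of the query with each key
weights = F.softmax(scores, dim=-1)    # attention weights (sum to 1)
c = weights @ V                        # context vector: weighted sum of the values
```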
7
Reminder
Nikhil Shah, “Attention? An Other Perspective!” (2020).
8
Reminder: Seq2Seq with Cross-Attention
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
In this case, cross-attention
refers to the attention
between the encoder and
decoder states.
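A minimal sketch of cross-attention for one decoder step, assuming toy tensors rather than the full Seq2Seq model from the slide:

```python
import torch
import torch.nn.functional as F

d = 8
enc_states = torch.randn(6, d)     # encoder states for a 6-token source sequence
dec_state = torch.randn(1, d)      # current decoder state (the query)

# Cross-attention: the decoder state attends over the encoder states
scores = dec_state @ enc_states.T / d ** 0.5
weights = F.softmax(scores, dim=-1)
context = weights @ enc_states     # context vector used for the next prediction
```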
9
Nikhil Shah, “Attention? An Other Perspective!” (2020).
What may the term “self” refer to, in contrast to “cross”-attention?
Outline
1. Motivation
2. Self-attention
10
11
Self-Attention (or intra-Attention)
Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. A structured self-attentive sentence embedding.
ICLR 2017.
Figure:
Jay Alammar,
“The Illustrated Transformer”
Self-attention refers to attending to other elements from the SAME sequence.
12
Self-Attention (or intra-Attention)
Nikhil Shah, “Attention? An Other Perspective!” (2020).
Query (Q): g(x) = WQ x
Key (K): f(x) = WK x
Value (V): h(x) = WV x
WQ, WK and WV are projection layers shared across all words.
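A sketch of these shared projections in PyTorch (dimensions are illustrative assumptions):

```python
import torch

d_model, seq_len = 16, 4
x = torch.randn(seq_len, d_model)                     # word embeddings e1..e4

# One set of projection layers, shared across all positions
W_Q = torch.nn.Linear(d_model, d_model, bias=False)   # g(x) = WQ x
W_K = torch.nn.Linear(d_model, d_model, bias=False)   # f(x) = WK x
W_V = torch.nn.Linear(d_model, d_model, bias=False)   # h(x) = WV x

Q, K, V = W_Q(x), W_K(x), W_V(x)                      # queries, keys, values per token
```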
13
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
A (scaled) dot-product is computed between each pair of word embeddings (e.g. e1 and e2)...
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
14
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
… a softmax layer normalizes the attention scores to obtain the attention distribution...
15
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
...the same word embeddings are combined to obtain the contextual representation e2’.
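Putting the three steps together, a minimal sketch with toy tensors (projections omitted for brevity):

```python
import torch
import torch.nn.functional as F

d_model = 16
E = torch.randn(4, d_model)              # embeddings e1..e4

scores = E @ E.T / d_model ** 0.5        # (4, 4) scaled dot-products between all pairs
weights = F.softmax(scores, dim=-1)      # attention distribution for each token
E_ctx = weights @ E                      # contextual representations e1'..e4'
e2_ctx = E_ctx[1]                        # e2': weighted combination of all four embeddings
```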
16
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
17
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
18
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
19
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
20
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
21
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
22
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
23
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
24
Self-Attention (or intra-Attention): Scaled dot-product attention
Figure: Jay Alammar, “The illustrated Transformer” (2018)
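For reference, the scaled dot-product attention of Vaswani et al. (2017) is Attention(Q, K, V) = softmax(QKᵀ / √dk) V, where dk is the dimensionality of the keys.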
25
Case study: Self-Attention for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Figure:
Frank Xu
Generator (G): Details can be generated using cues from all feature locations.
Discriminator: Can check consistency between features in distant portions of the image.
26
Case study: Self-Attention for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Query locations. Attention maps for different query locations.
Outline
1. Motivation
2. Self-attention
3. Multi-head Self-Attention (MHSA)
27
28
Multi-Head Self-Attention (MHSA)
Nikhil Shah, “Attention? An Other Perspective!” (2020).
In vanilla self-attention, a single set of projection matrices WQ, WK, WV is used.
29
Nikhil Shah, “Attention? An Other Perspective!” (2020).
In multi-head self-attention, multiple sets of projection matrices are used; each head can provide a different contextual representation for the same input token.
Multi-Head Self-Attention (MHSA)
30
The multi-head self-attended E’i matrices are concatenated:
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
31
A fully connected layer on top combines everything into a new E’.
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
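A sketch with PyTorch’s built-in module, which performs the per-head projections, the concatenation and the final linear layer described above (sizes are illustrative assumptions):

```python
import torch

d_model, n_heads, seq_len = 16, 4, 5
x = torch.randn(seq_len, 1, d_model)      # (sequence, batch, embedding)

# Self-attention: the same tensor is used as query, key and value.
# Each head has its own WQ, WK, WV; the per-head outputs are concatenated
# and mixed by a final linear layer.
mhsa = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
out, attn = mhsa(x, x, x)                 # out: (seq_len, 1, d_model)
```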
Multi-head Self-Attention: Visualization
32
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
Each colour corresponds to a head.
Blue: first head only.
Multi-color: multiple heads.
33
Self-Attention and Convolutional Layers
Cordonnier, J. B., Loukas, A., & Jaggi, M. On the relationship between self-attention and convolutional layers. ICLR 2020.
[tweet] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
34
Positional Encoding
35
Given that the attention mechanism allows accessing all input (and output) tokens, we no longer need to maintain a memory through recurrent layers.
Positional Encoding
36
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative position of tokens in the sequence encoded?
Positional Encoding
37
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative position of tokens in the sequence encoded?
Positional Encoding
38
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
Sinusoidal functions are typically used to provide positional encodings.
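A sketch of the sinusoidal encoding from Vaswani et al. (2017), where even dimensions use sine and odd dimensions cosine (sizes below are illustrative):

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / (10000 ** (i / d_model))                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)   # added to the token embeddings
```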
Positional Encoding
39
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
40
The Transformer
41
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
The Transformer removed the recurrence mechanism thanks to self-attention.
The Transformer
42
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Positional Encoding over the output sequence.
Positional Encoding over the input sequence.
Auto-regressive (at test time).
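A hypothetical greedy-decoding loop to illustrate the auto-regressive behaviour at test time; `model.encode`, `model.decode`, `bos_id` and `eos_id` are assumptions, not part of the slides:

```python
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    memory = model.encode(src_tokens)                      # encoder runs once over the input
    out = [bos_id]
    for _ in range(max_len):
        logits = model.decode(torch.tensor(out), memory)   # decoder re-reads its own prefix
        next_token = int(logits[-1].argmax())              # most likely next output token
        out.append(next_token)
        if next_token == eos_id:                           # stop at end-of-sequence
            break
    return out
```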
The Transformer
43
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Cross-Attention (or inter-attention) between the input and output tokens.
Self-attention for the input tokens.
Self-attention for the output tokens.
The Transformer: Layers
44
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
N decoder layers
N encoder layers
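A sketch using `torch.nn.Transformer`, whose default arguments correspond to the 6 encoder and 6 decoder layers of the base model (tensors below are illustrative):

```python
import torch

# Explicit arguments shown for clarity; these match the module's defaults.
model = torch.nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
src = torch.randn(10, 2, 512)   # (source length, batch, d_model)
tgt = torch.randn(7, 2, 512)    # (target length, batch, d_model)
out = model(src, tgt)           # (target length, batch, d_model)
```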
The Transformer: Layers
45
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
A bird’s-eye view of attention across all of the model’s layers and heads.
The Transformer: Visualization
46
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
47
Are Transformers for language only? NO!
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
48
49
(extra) PyTorch Lab on Google Colab
DL resources from UPC Telecos:
● Lectures (with Slides & Videos)
● Labs
Gerard Gallego
gerard.ion.gallego@upc.edu
PhD Student
Universitat Politecnica de Catalunya
Technical University of Catalonia
50
Software
● Transformers in HuggingFace.
● GPT-Neo by EleutherAI
○ Similar results to GPT-3, but smaller and open source.
● Andrej Karpathy, minGPT (2020).
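A minimal generation example with the HuggingFace `transformers` library; the checkpoint name `EleutherAI/gpt-neo-125M` is an assumption (any causal language model on the Hub would work):

```python
from transformers import pipeline

# Assumed checkpoint: a small GPT-Neo model released by EleutherAI
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
print(generator("The Transformer removed recurrence thanks to", max_length=30))
```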
51
Learn more
Ashish Vaswani, Stanford CS224N 2019.
52
Learn more
● Tutorials
○ Sebastian Ruder, Deep Learning for NLP Best Practices # Attention (2017).
○ Chris Olah, Shan Carter, “Attention and Augmented Recurrent Neural Networks”. distill.pub 2016.
○ Lilian Weng, “The Transformer Family”. Lil’Log 2020.
● Twitter threads
○ Christian Wolf (INSA Lyon)
● Scientific publications
○ #Perceiver Jaegle, Andrew, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira.
"Perceiver: General perception with iterative attention." arXiv preprint arXiv:2103.03206 (2021).
○ #Longformer Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv
preprint arXiv:2004.05150 (2020).
○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. ICML 2020.
○ Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye
Teh, Tim Harley, Razvan Pascanu, “Multiplicative Interactions and Where to Find Them”. ICLR 2020. [tweet]
○ Self-attention in language
■ Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733.
○ Self-attention in images
■ Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. ICML
2018.
■ Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks." In CVPR 2018.
53
Questions?
