Vision Transformer Overview
Presented by
Hayat Ullah
Supervisor:
Arslan Munir
• Problem Statement
• Key Observations
• Although CNNs perform well, they struggle with global feature extraction because they rely on local regions (receptive fields).
• CNNs have limited performance on tasks with long-range dependencies, such as complex video understanding or semantic segmentation.
Architecture Overview
[Figure: ViT architecture (Dosovitskiy et al., ICLR 2021). An input image x ∈ R^(H×W×C) is split into a patch grid and flattened into a sequence of 2D patches x_p ∈ R^(N×(P²·C)). A learnable linear projection E maps the patches to embeddings x_p·E ∈ R^(N×D); an extra learnable (class) embedding is prepended and learnable position embeddings E_pos ∈ R^((N+1)×D) are added. The embedded patches pass through L× Transformer encoder blocks (Norm → Multi-Head Self-Attention → Norm → MLP, each with a residual connection), and an MLP head on z_L^0 predicts the class (e.g., Bird, Ball, Car, Plane).]
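For reference, the forward pass shown in the figure is summarized by the equations of the ViT paper (notation as above; LN denotes LayerNorm):

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos,   E ∈ R^((P²·C)×D), E_pos ∈ R^((N+1)×D)
z'_ℓ = MSA(LN(z_(ℓ-1))) + z_(ℓ-1),   ℓ = 1 … L
z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ,   ℓ = 1 … L
y = LN(z_L^0)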
Architecture Overview (Patch Embedding)
[Figure: patch embedding implemented as an nn.Conv2D layer with channel_in = 3, channel_out = 768, kernel_size = 16×16, stride = (16,16). Applied to a 224×224×3 image, it produces a 14×14 grid of 196 patch embeddings.]
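A minimal sketch of this step in PyTorch, assuming the ViT-B/16 settings shown above (variable names are illustrative, not from any original code):

import torch
import torch.nn as nn

# Patch embedding as a strided convolution: each 16x16x3 patch is mapped
# to a 768-dimensional token in a single Conv2d call.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)             # input image (B, C, H, W)
tokens = patch_embed(x)                     # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): N = 196 patch tokens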
Architecture Overview (Class Token)
[Figure: the same nn.Conv2D patch embedding (channel_in = 3, channel_out = 768, kernel_size = 16×16, stride = (16,16)) producing 196 embeddings of size 768 from 16×16×3 patches, with an extra learnable class token prepended to the sequence.]
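A minimal sketch of prepending the learnable class token, assuming PyTorch (names are illustrative):

import torch
import torch.nn as nn

# Extra learnable (class) embedding, shared across the batch.
cls_token = nn.Parameter(torch.zeros(1, 1, 768))

tokens = torch.randn(1, 196, 768)                # patch tokens from the linear projection
cls = cls_token.expand(tokens.shape[0], -1, -1)  # one class token per image in the batch
tokens = torch.cat([cls, tokens], dim=1)         # (1, 197, 768): class token at index 0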
Architecture Overview (Position Embedding)
▪ Given an image of size 224×224×3.
▪ Divide it into 196 (14×14) patches of size 16×16×3; each RGB patch is flattened into a vector of size 16×16×3 = 768.
▪ The patch embeddings z_p^i·E ∈ R^(N×D), i ∈ {1, 2, 3, …, N}, together with the class token z_0^0 (which serves as the image-level representation), form a sequence of size (196+1)×768.
▪ Learnable position embeddings E_pos ∈ R^((N+1)×D) are added to all 197 token embeddings.
[Figure: the nn.Conv2D patch embedding (channel_in = 3, channel_out = 768, kernel_size = 16×16, stride = (16,16)) followed by position embeddings 0, 1, 2, …, 196 added to the class token and the 196 patch tokens.]
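A minimal sketch of adding the learnable position embeddings E_pos ∈ R^((N+1)×D), assuming PyTorch (names are illustrative):

import torch
import torch.nn as nn

# E_pos: one learnable embedding per token (196 patches + 1 class token).
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))

tokens = torch.randn(1, 197, 768)  # class token + patch tokens
z0 = tokens + pos_embed            # z_0: input sequence to the Transformer encoder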
Architecture Overview (Encoder MLP Block)
[Figure: the MLP block inside the current encoder layer (repeated L× / N× times). The 197×768 token sequence is the input to the MLP block; fc1 expands the embedding dimension by ×4 (768 → 3072), a GELU activation follows, and fc2 projects back (3072 → 768), so the output of the MLP block keeps the 197×768 shape.]
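A minimal sketch of this encoder MLP block (fc1 → GELU → fc2 with a 4× hidden expansion), assuming PyTorch (names are illustrative):

import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, dim=768, expansion=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * expansion)  # 768 -> 3072
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expansion, dim)  # 3072 -> 768

    def forward(self, x):                           # x: (B, 197, 768)
        return self.fc2(self.act(self.fc1(x)))      # output keeps the (B, 197, 768) shape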
Multi-Head Self-Attention
[Figure: left, single self-attention via scaled dot-product attention (MatMul of Q and K → Scale → SoftMax → MatMul with V); right, multi-head self-attention (Linear projections of Q, K, V per head → scaled dot-product attention in parallel → Concat → Linear).]
▪ For an input sequence z of n tokens with embedding size d, the queries, keys, and values are computed with a single projection matrix: [q, k, v] = z · U_qkv.
▪ For ViT-B/16: image = 224×224×3, patch size = 16, number of patches = 196 (14×14), so z is (196×768). Per head, U_qkv is (768×192) and z(196×768) × U_qkv(768×192) = (196×192), which splits into q, k, v of size (196×64) each, with head dimension Dh = 768/12 = 64.
▪ Single-head self-attention: SA(z) = SoftMax(q·kᵀ / √Dh) · v, where q·kᵀ is (196×196) and the output is (196×64).
▪ Multi-head self-attention runs the exact same process over multiple attention heads and concatenates the results: MSA(z) = [SA_1(z); SA_2(z); …; SA_h(z)] · U_msa, where U_msa ∈ R^((h·Dh)×D) is a projection matrix that maps the concatenated head outputs (h·Dh) back to the model embedding dimension D.
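A minimal sketch of multi-head self-attention with h = 12 heads and Dh = 64, assuming PyTorch (names are illustrative):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads, self.dh = heads, dim // heads  # Dh = 768 / 12 = 64
        self.qkv = nn.Linear(dim, dim * 3)         # U_qkv: q, k, v for all heads at once
        self.proj = nn.Linear(dim, dim)            # U_msa: maps h*Dh back to D

    def forward(self, z):                          # z: (B, 197, 768)
        B, N, D = z.shape
        qkv = self.qkv(z).reshape(B, N, 3, self.heads, self.dh).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                   # each: (B, heads, N, Dh)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5  # scaled dot product: (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                     # (B, heads, N, Dh)
        out = out.transpose(1, 2).reshape(B, N, D)         # concatenate heads: (B, N, h*Dh)
        return self.proj(out)                              # project back to model dimension D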
Architecture Recap (MLP Head)
▪ During both pre-training and fine-tuning, a classification head is attached to z_L^0 (the class-token output of the last encoder layer).
▪ A fully connected (Linear) layer reduces the dimension to the number of classes K, e.g., 3072 ➔ 1000 (ImageNet-1K).
[Figure: pre-training vs. fine-tuning classification heads.]
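A minimal sketch of the fine-tuning classification head attached to z_L^0, assuming PyTorch, a 768-dimensional class token, and 1000 ImageNet-1K classes (names are illustrative):

import torch
import torch.nn as nn

num_classes = 1000                  # K classes (ImageNet-1K)
head = nn.Linear(768, num_classes)  # fully connected layer: embedding dim -> K

z_L = torch.randn(1, 197, 768)      # output of the last encoder layer
logits = head(z_L[:, 0])            # use only the class-token output z_L^0 -> (1, 1000)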
Limitations
▪ Dependence on pre-training: ViT underperforms comparable CNNs when trained only on mid-sized datasets such as ImageNet-1K, and needs large-scale pre-training to excel.
▪ Scaling challenges: self-attention cost grows quadratically with the number of tokens, so high-resolution inputs become expensive.
References
1. Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR (2021).
2. Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. "Training Data-Efficient Image Transformers & Distillation Through Attention." ICML (2021), pp. 10347-10357. PMLR.
3. https://tugot17.github.io/Vision-Transformer-Presentation/#/19
Thank You
Q&A