
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Presented by: Hayat Ullah
Supervisor: Arslan Munir

Department of Computer and Electrical Engineering and Computer Science
Outline

• Problem Statement

• Solution (Vision Transformer)

• Experiments & Results

• Key Observations

• Limitations of Vision Transformers

• Questions & Answers


Problem Statement

• CNNs, although they perform well, struggle with global feature extraction
because they rely on local regions (receptive fields).

• CNNs show limited performance on tasks with long-range dependencies, such as
complex video understanding or semantic segmentation.

On the other hand:

• Vision Transformers lack such inductive biases and therefore attend more to
global regions of the image.

• Vision Transformers generalize well when trained on large datasets.

Vision Transformer Architecture

Figure: the Vision Transformer (ViT) architecture. The input image x ∈ R^(H×W×C)
is split into a grid of patches and flattened into a sequence of 2D patches
x_p ∈ R^(N×(P²·C)), which a learnable linear projection E maps to D-dimensional
tokens (x_p E ∈ R^(N×D)). An extra learnable (class) embedding is prepended and
learnable position embeddings E_pos ∈ R^((N+1)×D) are added. The sequence passes
through L Transformer encoder layers (Norm, Multi-Head Self-Attention, residual
addition; then Norm, MLP, residual addition), and an MLP head on the class token
state z_L^0 produces the class prediction (e.g., Bird, Ball, Car, Plane, ...).
Architecture Overview (Patch Embedding)

▪ Given an image of size (224×224×3)

▪ Divide it into 196 (14×14) patches of size 16×16×3.

• N = HW/P² ➔ N = (224×224)/16² ➔ N = 196

▪ Equivalently, the total number of pixels is N×(P²×C) ➔ 196×(16²×3) ➔ 150,528 = 224×224×3

Figure: each 16×16×3 patch is converted from 2D space into a 1D vector of size
16×16×3 = 768, giving a matrix of flattened patches of shape (196, 768). In
practice the embedding is implemented with nn.Conv2D using channel_in = 3,
channel_out = 768, kernel_size = 16×16, stride = (16, 16).
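A minimal PyTorch sketch of this patch-embedding step (the class name and default
arguments are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to a 768-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for 224/16
        # A Conv2d with kernel_size = stride = patch_size is equivalent to
        # flattening non-overlapping patches and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                  # x: (B, 3, 224, 224)
        x = self.proj(x)                   # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, 768)
        return x
```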
Architecture Overview (Class Token)

▪ Given an image of size (224×224×3), divide it into 196 (14×14) patches of size 16×16×3.

• N = HW/P² ➔ N = (224×224)/16² ➔ N = 196

▪ The class token z_0^0 is a vector of size (1, 768): a special learnable embedding
prepended to the patch embeddings z_p^i (i ∈ 1, 2, …, N) of every input example.
Its final-layer state serves as the image representation used for the label.

Figure: concatenating the class token with the (196, 768) matrix of flattened,
projected patches yields a matrix of shape (197, 768).
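A minimal sketch of prepending the class token, assuming patch tokens of shape
(batch, 196, 768) from the previous step; variable names are illustrative:

```python
import torch
import torch.nn as nn

embed_dim, batch_size = 768, 8
patch_tokens = torch.randn(batch_size, 196, embed_dim)   # output of PatchEmbedding

# One learnable (1, 1, 768) class token, shared across all images.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Expand over the batch and prepend: (B, 196, 768) -> (B, 197, 768)
cls = cls_token.expand(batch_size, -1, -1)
tokens = torch.cat([cls, patch_tokens], dim=1)
print(tokens.shape)  # torch.Size([8, 197, 768])
```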
Architecture Overview (Position Embedding)

▪ Given an image of size (224×224×3), divide it into 196 (14×14) patches of size 16×16×3.

• N = HW/P² ➔ N = (224×224)/16² ➔ N = 196

▪ Position embeddings E_pos ∈ R^((N+1)×D) are added to the patch embeddings to retain
position information. The authors use standard learnable 1D position embeddings.

▪ Note the difference: the class token z_0^0 is concatenated to the sequence of patch
embeddings z_p^i E ∈ R^(N×D) (concatenation, not addition), while the position
embeddings are added element-wise:

z_0 = [z_0^0; z_p^1 E; z_p^2 E; …; z_p^N E] + E_pos, a matrix of shape (196+1)×768.

Figure: each of the 197 tokens of size 768 receives its own learnable position
embedding by element-wise addition.
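Continuing the sketch, the learnable 1D position embeddings are simply added
element-wise to the 197 tokens (again, names are illustrative):

```python
import torch
import torch.nn as nn

num_patches, embed_dim, batch_size = 196, 768, 8
tokens = torch.randn(batch_size, num_patches + 1, embed_dim)  # [class] + patch tokens

# Learnable 1D position embeddings, one row per token (class token included).
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

# Element-wise addition, broadcast over the batch: z_0 = tokens + E_pos
z0 = tokens + pos_embed
print(z0.shape)  # torch.Size([8, 197, 768])
```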
Architecture Overview (Position Embedding)

▪ Similarity of position embeddings of ViT-L/32: the grid shows the cosine similarity
between the position embedding of the patch at the indicated row and column and the
position embeddings of all other patches. Nearby patches tend to have more similar
embeddings.

▪ Ablation of ViT-B/16 using different types of position embedding:
1D position embeddings yield the best results.
Architecture Overview (Transformer Encoder)

▪ The encoder is a stack of L identical layers (L = 12 for ViT-Base, 24 for ViT-Large,
32 for ViT-Huge). Each layer applies Norm ➔ multi-head self-attention ➔ residual
addition, then Norm ➔ MLP block ➔ residual addition. The input to the first layer is
the embedded sequence of shape 197×768 from the previous step; each subsequent layer
takes the output of the previous encoder layer.

▪ MLP block: the first fully connected layer (fc1) increases the embedding dimension
by a factor of 4 (768 ➔ 3072), followed by a GELU non-linearity; the second fully
connected layer (fc2) reduces the hidden dimension back by a factor of 4 (3072 ➔ 768)
so the result can be added element-wise to the output of the multi-head attention
sub-layer, z_l'.
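A minimal sketch of one encoder layer under these assumptions, using PyTorch's
built-in nn.MultiheadAttention for the MSA sub-layer (an explicit MSA sketch
follows after the next slide); class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: pre-norm MSA and MLP blocks with residual connections."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),  # fc1: 768 -> 3072
            nn.GELU(),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),  # fc2: 3072 -> 768
        )

    def forward(self, z):                                  # z: (B, 197, 768)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # z_l' = MSA(Norm(z)) + z
        z = z + self.mlp(self.norm2(z))                    # z_l  = MLP(Norm(z_l')) + z_l'
        return z
```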
Multi-Head Self-Attention

▪ For an input sequence z of N tokens with embedding size D, a single self-attention
head computes queries, keys, and values with a learnable projection matrix:

[q, k, v] = z · U_qkv,  where z ∈ R^(N×D), U_qkv ∈ R^(D×3Dh), and Dh = D/H
(H = number of heads).

▪ Scaled dot-product attention:

SA(z) = SoftMax(q·kᵀ / √Dh) · v

▪ Multi-head self-attention runs the exact same process over H attention heads in
parallel, concatenates their outputs, and projects back to the model dimension D:

MSA(z) = [SA_1(z); SA_2(z); …; SA_H(z)] · U_msa,  where U_msa ∈ R^((H·Dh)×D) is the
projection matrix that transforms the concatenated output (H × Dh) to D.

▪ Example for ViT-B/16: image = 224×224×3, patch size = 16, number of patches = 196,
D = 768, H = 12, so Dh = 768/12 = 64. Per head, (196×768) × (768×192) = (196×192),
split into q, k, v of shape (196×64) each; q·kᵀ gives (196×196) attention weights,
and the weighted sum with v gives (196×64) per head.
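A minimal PyTorch sketch of the MSA computation above; it fuses the per-head
projections U_qkv into a single linear layer, which is mathematically equivalent
(class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Explicit multi-head self-attention as described on this slide."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads          # Dh = D/H = 64 for ViT-B
        self.qkv = nn.Linear(dim, dim * 3)        # U_qkv for all heads at once
        self.proj = nn.Linear(dim, dim)           # U_msa

    def forward(self, z):                          # z: (B, N, D)
        B, N, D = z.shape
        # (B, N, 3D) -> (3, B, H, N, Dh)
        qkv = self.qkv(z).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                           # (B, H, N, Dh)
        out = out.transpose(1, 2).reshape(B, N, D)               # concatenate heads
        return self.proj(out)                                    # project back to D
```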
Architecture Recap (MLP Head)

▪ Both during pre-training and fine-tuning, the classification head is attached to
z_L^0, the final-layer state of the class token.

▪ At pre-training time the head is an MLP with one hidden layer: a normalization
layer (Norm) is applied to z_L^0, a fully connected (linear) layer increases the
embedding dimension by a factor of 4 (768 ➔ 3072), a GELU introduces non-linearity
to the data, and a second fully connected (linear) layer reduces the dimension to
the number of classes K (3072 ➔ 1000 for ImageNet-1K). The resulting class scores
(e.g., Bird, Ball, Car, Plane, ...) are compared against the ground-truth label y.

▪ At fine-tuning time the head is a single linear layer.
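A hedged sketch of the two head variants described above (dimensions follow this
slide; the function name is illustrative):

```python
import torch.nn as nn

def make_head(embed_dim=768, num_classes=1000, pretraining=True):
    """Classification head applied to the class token state z_L^0."""
    if pretraining:
        # MLP with one hidden layer: Norm -> 768 -> 3072 -> GELU -> num_classes
        return nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, num_classes),
        )
    # Fine-tuning: a single linear layer
    return nn.Linear(embed_dim, num_classes)
```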
ViT Model Variants

Model       Layers   Embedding dim. D   MLP size (4×D)   Heads   Params
ViT-Base      12           768               3072          12      86M
ViT-Large     24          1024               4096          16     307M
ViT-Huge      32          1280               5120          16     632M

▪ ViT-Base and ViT-Large are adopted directly from the BERT configurations;
ViT-Huge, the largest variant of the ViT architecture family, is introduced by
the authors (Dosovitskiy et al.).
Experiments (Datasets)

Pretraining:

• ILSVRC-2012 ImageNet: 1,000 classes, 1.2M images
• ImageNet-21K: 21,841 classes, 14M images
• JFT-300M (by Google): 18,000 classes, 303M images

Finetuning:

• ILSVRC-2012 ImageNet: 1,000 classes, 1.2M images
• ImageNet-ReaL: 1,000 classes, 50,000 images
• CIFAR-10: 10 classes, 60,000 images
• CIFAR-100: 100 classes, 60,000 images
• Oxford-IIIT Pets: 37 classes, 7,400 images
• Oxford Flowers-102: 102 classes, 8,189 images
• VTAB: 19 tasks, 1,000 images per task
Experiments & Results (Pre-training Data Requirements)

Figure: transfer accuracy as a function of pre-training dataset size. ViT benefits
from the larger pre-training datasets (ImageNet-21K and JFT-300M).
Experiments & Results (Scaling Study)

Figure: as the model size increases, performance keeps improving.
Experiments & Results (Comparison with State-of-the-Art)

Figure: comparison with the state of the art, broken down by pretraining tasks and
finetuning/downstream tasks.
Experiments (Inspecting Vision Transformer)

Figure: learned embedding filters and attention behaviour across layers. As the
network depth increases, the attended region grows, yielding more global attention.
Key Observations

▪ ViT performs worse than ResNet when pre-trained on smaller datasets (e.g., ImageNet).

▪ Performance improves when pre-trained on large datasets (i.e., ImageNet-21K and
JFT-300M).

▪ When pre-trained at this scale, ViT outperforms much bigger CNNs (BiT).

▪ The Transformer's sequence length is inversely proportional to the square of the
patch size (N = HW/P²). Thus, models with smaller patch sizes are computationally
more expensive, as their sequence length is larger (see the worked example below).
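A quick worked example of that last point, following N = HW/P² for a 224×224 input:

```python
# Sequence length N = HW / P^2 for a 224x224 input at different patch sizes.
# Self-attention cost grows roughly with N^2, so halving the patch size
# quadruples N and increases the attention cost by about 16x.
H = W = 224
for P in (32, 16, 8):
    N = (H * W) // (P * P)
    print(f"patch {P}x{P}: N = {N} tokens, N^2 = {N * N} attention weights")
# patch 32x32: N = 49 tokens
# patch 16x16: N = 196 tokens
# patch 8x8:   N = 784 tokens
```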
Limitations of Vision Transformer

▪ Data Hunger & Training Efficiency

▪ Overfitting on Small Datasets

▪ Dependence on Pretraining

▪ Scaling Challenges
References

1. Dosovitskiy, A., et al. "An image is worth 16x16 words: Transformers for image
recognition at scale." ICLR (2021).

2. Vaswani, A., et al. "Attention is all you need." Advances in Neural Information
Processing Systems (2017).

3. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. and Jégou, H., 2021.
Training data-efficient image transformers & distillation through attention. In
International Conference on Machine Learning (pp. 10347-10357). PMLR.

4. https://tugot17.github.io/Vision-Transformer-Presentation/#/19
Thank You
Q&A
