
VISUAL RECOGNITION – PART 2

Lecture 6: Transformer-based Language Modelling; Transformers for Image Classification
http://jalammar.github.io/illustrated-transformer/

Position Encoding in Transformers

Will changing the order of the input sequence affect the corresponding 𝑧 values?

We need to add position information to every token to preserve the order of the sequence.
https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
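As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the fixed sinusoidal position encoding proposed in the paper above; the resulting matrix is simply added to the token embeddings so that every token carries its position:

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    (assumes d_model is even)
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosines
    return pe

# Position information is injected by adding PE to the token embeddings:
# x = token_embeddings + sinusoidal_position_encoding(seq_len, d_model)
```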

Machine Translation: Self-Attention + Cross-Attention
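A minimal sketch of scaled dot-product attention used both ways, with hypothetical shapes and a single head (a real model would first apply learned Q/K/V projections and masking): in self-attention the queries, keys and values all come from the same sequence, while in cross-attention the decoder's queries attend over the encoder output.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)           # (batch, q_len, k_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over the keys
    return weights @ V                                          # (batch, q_len, d_v)

batch, src_len, tgt_len, d = 2, 7, 5, 64
encoder_out = np.random.randn(batch, src_len, d)   # encoder hidden states (source sentence)
decoder_h   = np.random.randn(batch, tgt_len, d)   # decoder hidden states (target sentence)

# Self-attention: Q, K, V all come from the same sequence.
self_attn  = scaled_dot_product_attention(decoder_h, decoder_h, decoder_h)

# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_attn = scaled_dot_product_attention(decoder_h, encoder_out, encoder_out)
```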


Representation Learning with Self-Supervision + Transformers: BERT

Bi-directional modelling done with the help of [MASK] tokens

Mask a percentage of the input tokens at random (e.g. 15%)

Predict the masked tokens using the Transformer encoder architecture
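A minimal sketch of this masking step, with a hypothetical [MASK] token id (the full BERT recipe also replaces some of the selected tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import numpy as np

MASK_ID = 103          # hypothetical vocabulary id for the [MASK] token
MASK_PROB = 0.15       # fraction of tokens masked at random

def mask_tokens(token_ids: np.ndarray, rng: np.random.Generator):
    """Return (masked_ids, labels); labels are -100 at unmasked positions."""
    token_ids = token_ids.copy()
    mask = rng.random(token_ids.shape) < MASK_PROB
    labels = np.where(mask, token_ids, -100)   # loss is computed only where masked
    token_ids[mask] = MASK_ID
    return token_ids, labels

rng = np.random.default_rng(0)
ids = np.array([7, 42, 11, 98, 5, 61, 23, 8])
masked_ids, labels = mask_tokens(ids, rng)
# The Transformer encoder sees `masked_ids` and is trained to predict `labels`.
```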

Sentence Embeddings:
Generative Modeling with Self-Supervision + Transformers: GPT
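GPT instead trains a decoder-only Transformer to predict the next token, which requires a causal attention mask so that no position can attend to later tokens. A minimal single-head NumPy sketch of such masked attention (assumed shapes, no learned projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (no peeking at future tokens)."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                                 # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -1e9, scores)                           # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # softmax over allowed keys
    return weights @ V

x = np.random.randn(6, 32)        # 6 tokens, 32-dimensional states
out = causal_attention(x, x, x)   # token i only attends to tokens 0..i
```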
Vision Transformer (ViT)

Transformers can replace CNNs in image recognition!


Vision Transformer Steps:

• Split an image into fixed-size patches
• Linearly embed each of them
• Add position embeddings
• Feed the resulting sequence of vectors to a standard Transformer encoder

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE


ViT: Patch Creation

An H×W×C image is split into N = HW/P² patches x_p^1, …, x_p^N, each of size P×P×C, which are then flattened (P = patch size).
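A minimal NumPy sketch of this patch creation step, assuming a 224×224×3 image and P = 16:

```python
import numpy as np

def image_to_patches(img: np.ndarray, P: int) -> np.ndarray:
    """Split an (H, W, C) image into N = (H/P)*(W/P) flattened patches of length P*P*C."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

img = np.random.rand(224, 224, 3)
x_p = image_to_patches(img, P=16)   # (196, 768): 14*14 patches of 16*16*3 values each
```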



ViT: Patch Input Embedding

Each flattened patch x_p^i is projected to D dimensions with a learnable matrix E, a learnable [class] token x_class is prepended, and learnable position embeddings E_pos are added:

z_0 = [x_class; x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos,  E ∈ ℝ^{(P²·C)×D},  E_pos ∈ ℝ^{(N+1)×D}
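A minimal sketch of building z_0, with assumed dimensions and randomly initialised (untrained) parameters standing in for the learnable projection, [class] token and position embeddings:

```python
import numpy as np

N, patch_dim, D = 196, 768, 512        # number of patches, P*P*C, model width

rng = np.random.default_rng(0)
E     = rng.normal(scale=0.02, size=(patch_dim, D))   # learnable patch projection
x_cls = rng.normal(scale=0.02, size=(1, D))           # learnable [class] token
E_pos = rng.normal(scale=0.02, size=(N + 1, D))       # learnable position embeddings

x_p = rng.random((N, patch_dim))                      # flattened patches (see sketch above)

# z0 = [x_class; x_p^1 E; ... ; x_p^N E] + E_pos
z0 = np.concatenate([x_cls, x_p @ E], axis=0) + E_pos  # (N + 1, D), fed to the encoder
```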



ViT: Encoder & Final MLP Head

z_0 is passed through a standard Transformer encoder; the encoder output at the [class] token position is fed to an MLP head, which produces the class probabilities (e.g. Cat, Dog, Horse, Pattern).
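A minimal sketch of this final step, with assumed dimensions and untrained weights: the encoder output at the [class] token position goes through a (here single-layer) head followed by a softmax to give class probabilities.

```python
import numpy as np

D, num_classes = 512, 4                      # e.g. classes: Cat, Dog, Horse, ...

rng = np.random.default_rng(0)
z_L = rng.random((197, D))                   # encoder output for [class] + 196 patch tokens

W_head = rng.normal(scale=0.02, size=(D, num_classes))
b_head = np.zeros(num_classes)

logits = z_L[0] @ W_head + b_head            # use only the [class] token output
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the classes
```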



Vision Transformer – Attention Maps

