CNNs and Transformers
[U-Net architecture diagram: the encoder reduces the feature maps from 96x96x32 at full resolution down to 6x6x256 at the bottleneck, and the decoder brings them back up to 96x96x32, with skip connections between encoder and decoder at each resolution.]
Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation," MICCAI, 2015
The output has size 96x96x4, i.e. we get four channels for each pixel:
➢ Left ventricle: light blue area
➢ Myocardium: pink area
➢ Right ventricle: red area
➢ Background: blue area
Next, convert these logits into a probability distribution with the softmax activation function.
Example calculation:
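A minimal sketch of such a calculation in Python/NumPy; the logit values for a single pixel are made up purely for illustration:

```python
import numpy as np

# Hypothetical logits for one pixel, one value per class:
# [left ventricle, myocardium, right ventricle, background]
logits = np.array([2.0, 0.5, -1.0, 0.1])

# Softmax: exponentiate and normalise so the values sum to 1.
# Subtracting the maximum first is a standard numerical-stability trick.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(probs)        # approx. [0.703 0.157 0.035 0.105] -> "left ventricle" is most likely
print(probs.sum())  # 1.0
```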
The Transformer
• Hugely influential
• Basis for ChatGPT (Generative Pre-trained Transformer)
• Also used for vision

[Diagram: the Transformer consists of an Encoder and a Decoder connected by a latent code; the Encoder processes a set of tokens $x_1, x_2, x_3$ using attention.]
https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/abs/1706.03762
Self-Attention
Self-attention:
● every element in the sequence can influence every other element
● learn a weighting ("attention") for each pair of elements
With a one-hot score vector (e.g. 0, 1, 0) the query matches exactly one key, and the output is the corresponding value:

$z = \sum_{j=1}^{n} \mathbb{1}[\boldsymbol{q} = \boldsymbol{k}_j]\, \boldsymbol{v}_j$
Relaxed Query
With relaxed (soft) scores, e.g. (.1, .1, .7, .1), the output is a weighted sum of all values:

$\boldsymbol{z} = \sum_{j=1}^{n} \mathrm{score}_j\, \boldsymbol{v}_j, \qquad \mathrm{score}_j = \mathrm{softmax}_j\!\big(\mathrm{similarity}(\boldsymbol{q}, \boldsymbol{k}_j)\big), \qquad \mathrm{similarity}(\boldsymbol{q}, \boldsymbol{k}_j) = \frac{\boldsymbol{q}^{T}\boldsymbol{k}_j}{\sqrt{d_k}}$
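A minimal NumPy sketch of this single-query attention step; the dimensions and the random keys/values are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 4                       # query/key dimension (illustrative)
q = np.random.randn(d_k)      # one query
K = np.random.randn(3, d_k)   # keys k_1 .. k_3
V = np.random.randn(3, d_k)   # values v_1 .. v_3

sim = K @ q / np.sqrt(d_k)    # similarity(q, k_j) = q^T k_j / sqrt(d_k), shape (3,)
scores = softmax(sim)         # soft weights, sum to 1
z = scores @ V                # z = sum_j score_j * v_j, shape (d_k,)
print(scores, z)
```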
Self-Attention
Application to sequence: matrix multiplication
Stacking the queries, keys and values of all sequence elements into matrices $Q$, $K$ and $V$ turns the whole computation into matrix multiplications:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$

The $\mathrm{softmax}(QK^{T}/\sqrt{d_k})$ factor is the attention matrix: one weight for every pair of sequence elements.
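The same computation for a whole sequence, as a short NumPy sketch with illustrative sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_k = 5, 4                          # sequence length and key dimension (illustrative)
Q = np.random.randn(n, d_k)            # one query per sequence element
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_k)

A = softmax(Q @ K.T / np.sqrt(d_k))    # attention matrix, shape (n, n); rows sum to 1
Z = A @ V                              # all attention outputs at once, shape (n, d_k)
print(A.shape, Z.shape)
```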
Multi-headed self-attention
Multiple attention heads for increased model capacity
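A hedged sketch using PyTorch's torch.nn.MultiheadAttention; the embedding size, number of heads and sequence length below are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 8, 2, 3
x = torch.randn(1, seq_len, embed_dim)   # (batch, tokens, features)

# Each head has its own W_Q, W_K, W_V projections; the head outputs are
# concatenated and mixed by a final linear layer.
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

out, attn = mha(x, x, x)                 # self-attention: query = key = value = x
print(out.shape)   # torch.Size([1, 3, 8])
print(attn.shape)  # torch.Size([1, 3, 3]) -- attention weights, averaged over heads
```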
The Decoder - Part 1
Autoregressive Generative Models
Autoregressive Models
Problem:
• Model a difficult, high-dimensional distribution
Idea:
• Factorise the distribution:
$p_\theta(\boldsymbol{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \ldots, x_{i-1})$

Neural network:
• Parameters $\theta$
• Input: $x_1, \ldots, x_{i-1}$
• Output: distribution over $x_i$

[Figure: true distribution, model distribution and data samples.]
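To make the factorisation concrete, here is a toy Python sketch that evaluates $\log p_\theta(\boldsymbol{x})$ as a sum of per-step conditionals; the stand-in "model" below is a hypothetical placeholder, not the network from the lecture:

```python
import numpy as np

vocab = 4                             # number of possible values per element (illustrative)

def model(prefix):
    """Stand-in for p_theta(x_i | x_1, ..., x_{i-1}): maps a prefix to a
    distribution over the next value. Here simply uniform; a real network
    would compute this from the prefix."""
    return np.full(vocab, 1.0 / vocab)

x = [2, 0, 3, 1]                      # an example sequence
log_p = 0.0
for i in range(len(x)):
    probs = model(x[:i])              # p_theta( . | x_1, ..., x_{i-1})
    log_p += np.log(probs[x[i]])      # probability assigned to the observed x_i

print(log_p)                          # log p_theta(x) = sum_i log p_theta(x_i | x_<i)
```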
[Figure: an audio waveform (pressure over time); given the previous samples $x_1, \ldots, x_{i-1}$, the model predicts $p_\theta(x_i \mid x_1, \ldots, x_{i-1})$ for the next sample.]
Sampling from the Model
• Predict dist. for next audio sample
• Sample from distribution
• Append new sample
• Repeat
Image Generation:
• Sample one pixel
• Apply network
• Repeat
Summary
• Interpret data as a sequence
• Train a neural network to model $p_\theta(\boldsymbol{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \ldots, x_{i-1})$
• Sampling:
  • One sample at a time
  • Slow, involves repeated application of the model
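A minimal sketch of this sampling loop in Python; the vocabulary size and the uniform stand-in model are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 256                           # e.g. 8-bit quantised audio samples (illustrative)

def model(prefix):
    """Stand-in for p_theta(x_i | x_1, ..., x_{i-1}); here uniform."""
    return np.full(vocab, 1.0 / vocab)

x = []                                # start from an empty sequence
for i in range(16):                   # generate 16 samples, one at a time
    probs = model(x)                  # predict distribution for the next sample
    nxt = rng.choice(vocab, p=probs)  # sample from that distribution
    x.append(int(nxt))                # append the new sample and repeat

print(x)
```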
The Decoder - Part 2
Putting Everything Together
The Encoder
The encoder maps the input tokens $x_1, x_2, x_3$ to a latent code.
Self-Attention Idea:
• Every token makes a query, a key and a value via learned projections:
  $Q = X W_Q, \quad K = X W_K, \quad V = X W_V$
  where $X$ is the matrix of input tokens (e.g. the latent code).
Cross-Attention
• Keys and values ($W_K$, $W_V$) are computed from the encoder's latent code
• Queries ($W_Q$) are computed from the decoder's tokens
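A hedged PyTorch sketch of cross-attention using torch.nn.MultiheadAttention; the (random) encoder outputs, decoder states and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 8, 2
enc = torch.randn(1, 5, embed_dim)   # encoder latent code: 5 tokens (illustrative)
dec = torch.randn(1, 3, embed_dim)   # decoder states: 3 tokens (illustrative)

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
out, weights = attn(query=dec, key=enc, value=enc)
print(out.shape)      # torch.Size([1, 3, 8])
print(weights.shape)  # torch.Size([1, 3, 5]) -- one row of weights per decoder token
```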
Decoding Strategies
https://huggingface.co/blog/how-to-generate
Training the Transformer
[Diagram: training data, latent code, and the loss function applied to the model's outputs.]
Masked Attention
Autoregressive model:
• Output: distribution over the next item
• Input: only previous items
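A small PyTorch sketch of such a causal mask (all sizes are illustrative); positions marked True in the mask may not be attended to, so every token only sees itself and earlier tokens:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 8, 2, 4
x = torch.randn(1, seq_len, embed_dim)

# Causal mask: entry (i, j) is True when j > i, i.e. attention to the future is blocked.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, weights = mha(x, x, x, attn_mask=mask)

print(weights[0])   # upper triangle is zero: no attention to future positions
```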
Transformer Variants
• Decoder-only: generate the sequence autoregressively until an <EOS> (end-of-sequence) token
Vision Transformers
Transformers for Vision
Why?
Very successful in NLP, especially with large data! Maybe the same holds for vision?
Naive approach: flatten the image and treat each pixel as a sequence element.
BUT: self-attention has quadratic complexity in the sequence length -> not feasible.
Possible solutions:
● Apply a convolutional backbone first and use the lower-resolution feature map
● Use image patches (see the patch-embedding sketch below)
https://sites.google.com/view/cvpr-2022-beyond-cnn
https://arxiv.org/abs/2010.11929
[ViT diagram: the image is split into patches, each patch is linearly embedded and a position embedding is added; the resulting tokens pass through the Transformer encoder and a classifier head produces the prediction.]
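A minimal PyTorch sketch of this input pipeline; the image size, patch size and embedding dimension are illustrative assumptions, not the exact values from the lecture:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 96, 96)      # (batch, channels, H, W)
patch, dim = 16, 128                 # 16x16 patches -> (96/16)^2 = 36 tokens

# A conv layer with kernel = stride = patch size patchifies the image and
# linearly projects each patch in one step.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img)                       # (1, 128, 6, 6)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 36, 128): one token per patch

# Learned position embedding, added to every token.
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
tokens = tokens + pos                         # ready for the Transformer encoder
# (For classification, ViT additionally prepends a learnable class token whose
#  output is fed to the classifier head.)
print(tokens.shape)                           # torch.Size([1, 36, 128])
```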
Encoder
Layer                                | Input shape       | Output shape      | Notes
Conv1                                | (4, 1, 96, 96)    | (4, 128, 48, 48)  | kernel size 7x7, stride=2
EncoderBottleneck1                   | (4, 128, 48, 48)  | (4, 256, 24, 24)  | stride=2, downsampling
EncoderBottleneck2                   | (4, 256, 24, 24)  | (4, 512, 12, 12)  | stride=2, downsampling
Tokenisation                         | (4, 1024, 6, 6)   | (4, 36, 1024)     | convert patches to tokens, each of size 1x1x1024
ViT Projection + Position Encoding   | (4, 36, 1024)     | (4, 36, 1024)     | positional embedding
Transformer Encoder                  | (4, 36, 1024)     | (4, 36, 1024)     | 8 cascades of self-attention layers
Patchification                       | (4, 36, 1024)     | (4, 1024, 6, 6)   | recover patches from tokens
Conv2                                | (4, 1024, 6, 6)   | (4, 512, 6, 6)    | convolution with reduced channel size

Decoder
Layer                | Input shape       | Output shape      | Notes
DecoderBottleneck1   | (4, 512, 6, 6)    | (4, 256, 12, 12)  | concatenate with Encoder2 output (512, 12, 12)
DecoderBottleneck2   | (4, 256, 12, 12)  | (4, 128, 24, 24)  | concatenate with Encoder1 output (256, 24, 24)
DecoderBottleneck3   | (4, 128, 24, 24)  | (4, 64, 48, 48)   | concatenate with the initial output (128, 48, 48)
DecoderBottleneck4   | (4, 64, 48, 48)   | (4, 16, 96, 96)   | upsample to original size
Final Conv           | (4, 16, 96, 96)   | (4, 4, 96, 96)    | use 1x1 convolution to adjust channels (classes)
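A short PyTorch sketch of the Tokenisation and Patchification steps from the table above; only the reshapes are shown, the surrounding layers are omitted:

```python
import torch

feat = torch.randn(4, 1024, 6, 6)           # feature map entering Tokenisation (per the table)

# Tokenisation: each 1x1x1024 spatial position becomes one token.
tokens = feat.flatten(2).transpose(1, 2)    # (4, 36, 1024)

# ... ViT projection + position encoding + Transformer encoder operate on `tokens` ...

# Patchification: recover the spatial feature map from the tokens.
feat_back = tokens.transpose(1, 2).reshape(4, 1024, 6, 6)

print(tokens.shape, feat_back.shape)        # torch.Size([4, 36, 1024]) torch.Size([4, 1024, 6, 6])
```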
Thanks!