CNNs and Transformers

The document discusses Convolutional Neural Networks (CNNs) and Transformers, highlighting their architectures and applications in tasks like biomedical image segmentation and natural language processing. It explains the significance of self-attention mechanisms in Transformers, including key-value pairs and multi-headed attention, while also addressing the challenges of applying Transformers to vision tasks. Additionally, it covers the concept of autoregressive models and the importance of sampling in generating outputs from these models.


Convolutional Neural Networks

[Figure: U-Net-style encoder-decoder architecture; feature maps range from 96x96x32 down to 6x6x256 at the bottleneck and back up to 96x96 resolution.]
Ronneberger, et al. "U-Net: Convolutional networks for biomedical image segmentation" MICCAI, 2015
The network output is 96x96x4: we get four channels for each pixel:
➢ Left ventricle: light blue area
➢ Myocardium: pink area
➢ Right ventricle: red area
➢ Background: blue area

Next we convert these logits into a probability distribution with the Softmax activation function.

Softmax is given as follows:

$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

The denominator ensures that the sum of all outputs is 1, making it a valid probability distribution.

Example Calculation:

➢ Given raw logits for the four classes
➢ After applying the Softmax activation:
   the first class (logit = 3.0) has the highest probability (75.56%)
➢ Final Softmax output: a valid probability distribution over the four classes

We need a loss to compute the distance between the predicted distribution and the ground-truth labels.
We use the cross entropy, which measures the distance between two distributions:

$H(p, q) = -\sum_{i} p_i \log q_i$

where $p$ is the ground-truth (one-hot) distribution and $q$ is the predicted Softmax output.
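As a minimal illustration (not from the slides), the per-pixel Softmax and cross-entropy loss can be written in a few lines of NumPy; the logit values and function names below are made up for the example:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    z = logits - logits.max(axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_class):
    # Negative log-probability of the ground-truth class (one-hot target).
    return -np.log(probs[target_class])

logits = np.array([3.0, 1.0, 0.5, 0.2])   # hypothetical per-pixel logits, 4 classes
probs = softmax(logits)
print(probs)                               # highest probability for class 0 (logit = 3.0)
print(cross_entropy(probs, target_class=0))
```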
Introduction to the
Transformer
Thanks to Alexander Krull, Constantin Pape and Alex Ecker!
• Sequence-to-Sequence Architecture

• Hugely influential
• Basis for ChatGPT (Generative Pre-trained Transformer)
• Also for vision
The Transformer
• Encoder
• Decoder
• Latent Code

The Encoder
• Processes a set of tokens 𝑥1, 𝑥2, 𝑥3, …
• Tokens remain separate (except for the attention layers)
• Tokens don’t have an order (except for the positional encoding)
• The encoder output is the latent code
Attention
https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/abs/1706.03762

Self-Attention
● Every element in the sequence can influence every other element
● A weighting (“attention”) is learned for each pair of elements

Attention: a learnable pairwise weighting that depends on another sequence
Self-Attention: the input sequence itself is used to compute the attention
Key-Value Pairs
• Familiar from JSON files and Python dictionaries

Query: Date of birth

Keys              Values
Name              Jane Doe
Address           37 Coronation Street, B12 9TK
Date of birth     May 5th 2000
Place of birth    Hull

Result: May 5th 2000 (the value whose key matches the query)
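As a minimal illustration (not from the slides), the same lookup as a Python dictionary; the record is the example from the table above:

```python
record = {
    "Name": "Jane Doe",
    "Address": "37 Coronation Street, B12 9TK",
    "Date of birth": "May 5th 2000",
    "Place of birth": "Hull",
}

query = "Date of birth"
result = record[query]   # exact key match returns exactly one value
print(result)            # "May 5th 2000"
```

Attention relaxes this exact-match lookup: rather than returning the single value of the matching key, it returns a weighted sum of all values, weighted by how similar each key is to the query.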
𝑥1 𝑥2 𝑥3 𝑥4

Idea:
• Every token produces three vectors via learned weight matrices 𝑊𝑄, 𝑊𝐾, 𝑊𝑉:
  • a Query
  • a Key
  • a Value
• Stacking the tokens into a matrix 𝑋, all queries, keys and values are computed at once
Query (hard, one-hot scores such as 0, 1, 0): the result is the value whose key exactly matches the query,

$\boldsymbol{z} = \sum_{j=1}^{n} \mathbb{1}[\boldsymbol{q} = \boldsymbol{k}_j] \, \boldsymbol{v}_j$

Relaxed Query (soft scores such as .1, .1, .7, .1): the result is a weighted sum of all values,

$\boldsymbol{z} = \sum_{j=1}^{n} \text{score}_j \, \boldsymbol{v}_j, \qquad \text{score}_j = \operatorname{softmax}_j\big(\operatorname{similarity}(\boldsymbol{q}, \boldsymbol{k}_j)\big) = \operatorname{softmax}_j\!\left(\frac{\boldsymbol{q}^T \boldsymbol{k}_j}{\sqrt{d_k}}\right)$
https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/abs/1706.03762
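A minimal NumPy sketch of this relaxed lookup for a single query; the function name and the random vectors below are illustrative, not from the slides:

```python
import numpy as np

def attend(q, K, V):
    """Soft key-value lookup: weighted sum of values, weights from query-key similarity."""
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)           # similarity of the query with every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the keys
    return weights @ V                      # weighted sum of the values

n, d_k, d_v = 4, 8, 8                       # hypothetical sizes
rng = np.random.default_rng(0)
q = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(attend(q, K, V).shape)                # (8,) - one output vector for this query
```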

Self-Attention
Application to a sequence: matrix multiplication

• Q, K, V are computed from X (the embedding of the input sequence) with learned weight matrices:
  $Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$
• All tokens are processed at once; the attention matrix $\operatorname{softmax}(Q K^T / \sqrt{d_k})$ weights the values:
  $Z = \operatorname{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/abs/1706.03762
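A compact NumPy sketch of this matrix formulation; sizes and the random weight matrices are arbitrary stand-ins for learned parameters:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence X of shape (n_tokens, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # project every token to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_tokens, n_tokens) similarity scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax: the attention matrix
    return A @ V                                  # each output token is a weighted sum of values

n_tokens, d_model, d_k = 5, 16, 8                 # hypothetical sizes
rng = np.random.default_rng(1)
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)     # (5, 8)
```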


Multi-headed self-attention
• Multiple attention heads for increased model capacity
• Combined with an additional feed-forward layer
• Concatenate the per-head outputs z_i and project to z with a learned weight matrix

https://jalammar.github.io/illustrated-transformer/
Ashish Vaswani et al. https://arxiv.org/pdf/1706.03762.pdf
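A rough, self-contained sketch of this multi-head combination; head count, sizes and the random weights are arbitrary stand-ins for learned parameters:

```python
import numpy as np

def _attention(Q, K, V):
    # Scaled dot-product attention for one head.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def multi_head_self_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per attention head."""
    Z = [_attention(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]  # one z_i per head
    return np.concatenate(Z, axis=-1) @ W_O   # concat the z_i, project to z with learned W_O

n_heads, n_tokens, d_model, d_k = 4, 5, 16, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
print(multi_head_self_attention(X, heads, W_O).shape)   # (5, 16)
```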
The Transformer
• Encoder → Latent Code → Decoder

The Decoder - Part 1:
Autoregressive Generative Models
Autoregressive Models
Problem:
• Model a difficult, high-dimensional distribution

$p_\theta(\boldsymbol{x}) = p_\text{data}(\boldsymbol{x}), \quad \text{with } \boldsymbol{x} = (x_1, \dots, x_n)$

Idea:
• Factorise the distribution

$p_\theta(\boldsymbol{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})$

Neural network:
• Parameters $\theta$
• Input: the previous values $x_1, \dots, x_{i-1}$
• Output: a distribution over $x_i$, i.e. $p_\theta(x_i \mid x_1, \dots, x_{i-1})$
[Figure: audio waveform, pressure plotted over time; the next sample $x_i$ is predicted from the previous samples $x_1, \dots, x_{i-1}$.]

• Predict a distribution for the next audio sample:
  $p_\theta(x_i \mid x_1, \dots, x_{i-1})$
• Fully convolutional architecture:
  • Simultaneous prediction for all timepoints
Sampling from the Model
• Predict the distribution for the next audio sample, $p_\theta(x_i \mid x_1, \dots, x_{i-1})$
• Sample from this distribution
• Append the new sample to the sequence
• Repeat
Image Generation:
• Sample one pixel
• Apply the network
• Repeat

Summary
• Interpret the data as a sequence:
  $p_\theta(\boldsymbol{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})$
• Train a neural network
  • Input: the previous values $x_1, \dots, x_{i-1}$
  • Output: a distribution over possible next values, $p_\theta(x_i \mid x_1, \dots, x_{i-1})$
    • e.g. as a histogram
    • or a parametric distribution
  • Ensure the correct receptive field, e.g. with special (causal) convolutions
• Sampling (see the sketch below):
  • One sample at a time
  • Slow: involves repeated application of the model
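A minimal sketch of this sampling loop for a discrete-valued sequence; predict_next_dist is a hypothetical stand-in for a trained autoregressive model:

```python
import numpy as np

rng = np.random.default_rng(3)

def predict_next_dist(sequence, n_values=256):
    # Placeholder for a trained model p_theta(x_i | x_1, ..., x_{i-1});
    # here it simply returns a uniform histogram over the possible values.
    return np.full(n_values, 1.0 / n_values)

def sample_sequence(length, n_values=256):
    sequence = []
    for _ in range(length):
        probs = predict_next_dist(sequence, n_values)   # distribution over the next value
        next_value = rng.choice(n_values, p=probs)      # sample from it
        sequence.append(int(next_value))                # append and repeat
    return sequence

print(sample_sequence(10))   # one value at a time: sampling is inherently sequential
```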
The Decoder - Part 2:
Putting Everything Together

The Transformer
• Encoder: processes the input tokens 𝑥1, 𝑥2, 𝑥3 into the latent code; executed once
• Decoder: generates the output from the latent code; executed repeatedly

Self-Attention (in the decoder)
• Every decoder token in 𝑋 produces a Query, Key and Value via 𝑊𝑄, 𝑊𝐾, 𝑊𝑉

Cross-Attention
• Queries (via 𝑊𝑄) come from the decoder tokens
• Keys and Values (via 𝑊𝐾, 𝑊𝑉) come from the encoder's latent code
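A small sketch of cross-attention under the assumption stated above (queries from the decoder tokens, keys and values from the encoder latent code); all names and sizes are illustrative:

```python
import numpy as np

def cross_attention(X_dec, Z_enc, W_Q, W_K, W_V):
    """Queries from the decoder tokens X_dec, keys/values from the encoder latent code Z_enc."""
    Q = X_dec @ W_Q
    K, V = Z_enc @ W_K, Z_enc @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_dec, n_enc)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # each decoder token attends over encoder tokens
    return A @ V

n_dec, n_enc, d_model, d_k = 3, 5, 16, 8
rng = np.random.default_rng(4)
X_dec = rng.normal(size=(n_dec, d_model))
Z_enc = rng.normal(size=(n_enc, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(cross_attention(X_dec, Z_enc, W_Q, W_K, W_V).shape)   # (3, 8)
```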


Jay Alammar: https://jalammar.github.io/illustrated-transformer/
Decoding Strategies
• Greedy
• Beam Search
• Random sampling with temperature

https://huggingface.co/blog/how-to-generate
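A brief sketch contrasting greedy decoding with random sampling at a temperature for a single step; the logits are made up, and beam search is omitted (see the Hugging Face post above for full decoding loops):

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.5, 0.3, -1.0])             # hypothetical next-token logits

greedy_token = int(np.argmax(logits))                 # greedy: always take the most likely token

temperature = 0.7                                     # < 1 sharpens, > 1 flattens the distribution
probs = softmax(logits / temperature)
sampled_token = int(rng.choice(len(logits), p=probs)) # random sampling with temperature

print(greedy_token, sampled_token, probs.round(3))
```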
Training the Transformer
• Loss function: compare the predicted distributions with the training data
• The latent code comes from the encoder

Masked Attention
• Autoregressive model:
  • Output: distribution of the next item
  • Input: only the previous items
• The attention is masked so that each position can only attend to earlier positions (see the sketch below)
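A minimal sketch of such a causal mask applied to the attention scores; sizes and inputs are arbitrary:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Self-attention where position i may only attend to positions j <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)           # future positions get zero attention weight
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

n, d = 4, 8
rng = np.random.default_rng(6)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(masked_self_attention(Q, K, V).round(2))
```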
Transformer Variants
• Decoder-only: no separate encoder; generation proceeds token by token until an <EOS> token is produced
Vision Transformers
Transformers for Vision
Why? Very successful in NLP, especially with large data. Maybe the same holds for vision?

Problem: an image is a 2D input, but the transformer expects a 1D sequence
-> we need a tokenization for images

Naive approach: flatten the image, treating each pixel as a sequence element
BUT: self-attention has quadratic complexity in the sequence length -> not feasible

Possible solutions:
● Apply a convolutional backbone first and use the lower-resolution feature map
● Use image patches
https://sites.google.com/view/cvpr-2022-beyond-cnn
https://arxiv.org/abs/2010.11929

Vision Transformer (ViT)
• “Tokenization” of the image into patches
• Patch embedding: a strided convolution
• Position embedding
• Transformer encoder (same as in the previous section)
• Classifier head

https://sites.google.com/view/cvpr-2022-beyond-cnn
https://arxiv.org/abs/2010.11929
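A minimal NumPy sketch of the patch tokenization and embedding, implemented here as reshape plus linear projection (equivalent to the strided convolution mentioned above); image size, patch size and embedding dimension are made up:

```python
import numpy as np

def patchify_and_embed(image, patch_size, W_embed):
    """Split an (H, W, C) image into non-overlapping patches and project each to a token."""
    H, W, C = image.shape
    ph = pw = patch_size
    patches = (image.reshape(H // ph, ph, W // pw, pw, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, ph * pw * C))        # (n_patches, pixels per patch)
    return patches @ W_embed                          # (n_patches, d_model) token embeddings

H, W, C, patch_size, d_model = 96, 96, 3, 16, 256     # hypothetical sizes
rng = np.random.default_rng(7)
image = rng.normal(size=(H, W, C))
W_embed = rng.normal(size=(patch_size * patch_size * C, d_model))
tokens = patchify_and_embed(image, patch_size, W_embed)
print(tokens.shape)                                   # (36, 256): 6x6 patches become 36 tokens
```

In the full ViT, a position embedding is then added to each token and a learnable class token is prepended before the transformer encoder.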

CNN vs ViT: Inductive biases

Property                                            CNN                         ViT
Shift invariance                                    Mostly                      No
Permutation invariance                              No                          Mostly
Spatially local processing                          Yes (3x3, 5x5 conv, etc.)   Partially (only at patch level)
Parameter sharing                                   Yes                         Yes
Increased depth: lower resolution, wider features   Yes (pyramidal design)      No

ViT lacks inductive biases for image data.
● Can it learn to overcome this (with enough training data)?
● If so, does it bring anything, e.g. better performance?

https://sites.google.com/view/cvpr-2022-beyond-cnn
https://arxiv.org/abs/2010.11929

What does attention learn?


Chen, et al. “Transunet: Transformers make strong encoders for medical image segmentation” MICCAI, 2021
Layer/Block                          Input shape         Output shape        Explanation

Encoder
Conv1                                (4, 1, 96, 96)      (4, 128, 48, 48)    kernel size 7x7, stride=2
EncoderBottleneck1                   (4, 128, 48, 48)    (4, 256, 24, 24)    stride=2, downsampling
EncoderBottleneck2                   (4, 256, 24, 24)    (4, 512, 12, 12)    stride=2, downsampling
EncoderBottleneck3                   (4, 512, 12, 12)    (4, 1024, 6, 6)     stride=2, downsampling
Tokenisation                         (4, 1024, 6, 6)     (4, 36, 1024)       convert patches to tokens, each of size 1x1x1024
ViT Projection + Position Encoding   (4, 36, 1024)       (4, 36, 1024)       positional embedding
Transformer Encoder                  (4, 36, 1024)       (4, 36, 1024)       8 cascades of self-attention layers
Patchification                       (4, 36, 1024)       (4, 1024, 6, 6)     recover patches from tokens
Conv2                                (4, 1024, 6, 6)     (4, 512, 6, 6)      convolution with reduced channel size

Decoder
DecoderBottleneck1                   (4, 512, 6, 6)      (4, 256, 12, 12)    concatenate with Encoder2 output (512, 12, 12)
DecoderBottleneck2                   (4, 256, 12, 12)    (4, 128, 24, 24)    concatenate with Encoder1 output (256, 24, 24)
DecoderBottleneck3                   (4, 128, 24, 24)    (4, 64, 48, 48)     concatenate with the initial output (128, 48, 48)
DecoderBottleneck4                   (4, 64, 48, 48)     (4, 16, 96, 96)     upsample to original size
Final Conv                           (4, 16, 96, 96)     (4, 4, 96, 96)      1x1 convolution to adjust channels (classes)
Thanks!
