DL6 - Convnets 4
Convolutional Neural Net: Building Blocks
• Convolution layers
• Pooling layers
Are fully connected layers invariant to translation?
• Fully connected layer with image input: neuron $k$ computes $a_k = \sum_{i,j} W[i,j,k]\,\mathbf{x}[i,j]$ over the pixels $\mathbf{x}[i,j]$
• Translation invariant? No: the weights $W[i,j,k]$ are tied to pixel positions, so shifting the image changes the response (see the check below)
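A minimal numeric check of this claim; the layer size and the single-pixel input are assumptions for illustration:

```python
import torch

torch.manual_seed(0)
H = W = 8
x = torch.zeros(H, W)
x[2, 2] = 1.0                                # a single bright pixel
x_shift = torch.roll(x, shifts=1, dims=1)    # same pixel, shifted right

fc = torch.nn.Linear(H * W, 1, bias=False)   # one neuron k with weights W[i,j,k]
print(fc(x.flatten()), fc(x_shift.flatten()))
# The two responses differ (with probability 1 for random weights):
# the weights are tied to pixel positions, so the layer is NOT
# translation invariant.
```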
Why is invariance not enough?
• Shallow layer features: “eyes” + “nose” + “mouth”
• A deeper layer should detect “face”, but only if all the features are spatially related; invariant features discard exactly this spatial information
• $f$ is equivariant to $\tau$ if $f(\tau(\mathbf{x})) = \tau(f(\mathbf{x}))$ for all $\mathbf{x}$
• Example: edge detection; shifting the image shifts its edge map by the same amount
Convnets: built-in translation invariance
• Label unaffected by translation → the classifier should be translation invariant
• Features should translate with the image → convnet hidden layers should be translation equivariant
How do we build equivariant layers?
Exercise: Let $f(\mathbf{x}) = \sigma(W\mathbf{x})$, where $\sigma$ is a component-wise invertible non-linearity and $W$ is a linear map. Prove that $f$ is equivariant to a transformation family $\{\tau\}$ if and only if $W$ commutes with every $\tau$, i.e. $W\tau = \tau W$. (A numeric check for convolutions and shifts follows.)
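A numeric check of the “if” direction for the case we care about, a sketch assuming circular convolution as $W$ and a circular shift as $\tau$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)
tau = lambda t: torch.roll(t, shifts=(2, 3), dims=(-2, -1))         # translation
W = lambda t: F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)  # linear map

lhs = torch.tanh(W(tau(x)))   # f(tau(x))
rhs = tau(torch.tanh(W(x)))   # tau(f(x))
print(torch.allclose(lhs, rhs, atol=1e-6))  # True: here W tau = tau W
```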
2D Convolution on a single map (channel)
Examples of 2D Convolutions
• Convolving with an edge-detection kernel highlights intensity changes
• Convolving with a sharpening kernel boosts the image's high frequencies (see the sketch below)
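A sketch applying the two effects; the specific kernels below are the standard textbook choices, not taken from the slide:

```python
import torch
import torch.nn.functional as F

edge = torch.tensor([[0., 1., 0.],
                     [1., -4., 1.],
                     [0., 1., 0.]])        # Laplacian: responds to edges
sharpen = torch.tensor([[0., -1., 0.],
                        [-1., 5., -1.],
                        [0., -1., 0.]])    # identity + Laplacian

img = torch.rand(1, 1, 32, 32)             # any grayscale image
edges = F.conv2d(img, edge.view(1, 1, 3, 3), padding=1)
sharp = F.conv2d(img, sharpen.view(1, 1, 3, 3), padding=1)
```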
Zero Padding
• The input map is padded with a border of zeros before convolving; padding = 0, 1, 2 yield progressively larger output maps (see the shape check below)
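A quick shape check (sizes are assumptions), matching $H_{\text{out}} = H + 2P - K + 1$:

```python
import torch

x = torch.zeros(1, 1, 8, 8)
for p in (0, 1, 2):
    y = torch.nn.Conv2d(1, 1, kernel_size=3, padding=p)(x)
    print(p, y.shape)   # 6x6, 8x8, 10x10
```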
Convolution Layer: Single Input, Two Outputs
• Each learned kernel is convolved with the input map (image) to produce one output map
• Another learned kernel produces a second output map
Convolution Layer: Single Input, Many Outputs
• N learned kernels produce N output maps
Convolution Layer: Many Inputs, One Output
• Each input map is convolved with its own learned kernel and the results are summed into one output map
Convolution Layer: Many Inputs, Many Outputs
• M input maps, N output maps: M·N K×K kernels in total, one set of M kernels per output map
• Q: What does a 1×1 kernel do?
• A: “Point-wise convolution”: scalars multiplying entire maps, useful for cheaply combining maps (see the sketch below)
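A sketch of these shapes in PyTorch, with assumed M = 3, N = 2, K = 5:

```python
import torch

M, N, K = 3, 2, 5
conv = torch.nn.Conv2d(M, N, kernel_size=K, bias=False)
print(conv.weight.shape)        # (N, M, K, K): M*N KxK kernels
print(conv.weight.numel())      # M*N*K*K parameters

pointwise = torch.nn.Conv2d(M, N, kernel_size=1, bias=False)
print(pointwise.weight.shape)   # (N, M, 1, 1): N weighted sums of the M maps
```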
Grouped Convolution
• The M input maps are split into groups, and each group of output maps is computed from its own group of input maps with its own K×K kernels
• M input maps → M output maps, with kernels connecting only maps within the same group (parameter counts are sketched below)
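A parameter-count sketch (channel counts are assumptions): with groups = g, each output map sees only M/g input maps, cutting parameters by a factor of g.

```python
import torch

M = N = 8
full = torch.nn.Conv2d(M, N, 3, bias=False)
grouped = torch.nn.Conv2d(M, N, 3, groups=4, bias=False)
depthwise = torch.nn.Conv2d(M, M, 3, groups=M, bias=False)  # extreme case: one kernel per map
print(full.weight.numel(), grouped.weight.numel(), depthwise.weight.numel())
# 576, 144, 72
```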
Common Cheap Option: Separable Convolution
• Key idea: divide the convolution into two steps [introduced in the MobileNet architecture]
• Step 1 (depthwise): M K×K kernels, one per input map → M intermediate maps
• Step 2 (point-wise): M·N 1×1 kernels combine the intermediate maps → N output maps
• Q: #parameters in comparison to standard convolution? A: standard $MNK^2$, separable $MK^2 + MN$
• Q: #multiplications in comparison to standard convolution? A: the same ratio, since each parameter count is multiplied by the output map size (see the comparison below)
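A sketch of the two-step layer and the parameter comparison above, with assumed M = 32, N = 64, K = 3:

```python
import torch

M, N, K = 32, 64, 3
standard = torch.nn.Conv2d(M, N, K, padding=1, bias=False)
depthwise = torch.nn.Conv2d(M, M, K, padding=1, groups=M, bias=False)  # M KxK kernels
pointwise = torch.nn.Conv2d(M, N, 1, bias=False)                       # M*N 1x1 kernels

p_std = standard.weight.numel()                              # M*N*K^2 = 18432
p_sep = depthwise.weight.numel() + pointwise.weight.numel()  # M*K^2 + M*N = 2336
print(p_std, p_sep, p_std / p_sep)                           # roughly 8x cheaper

x = torch.randn(1, M, 28, 28)
y = pointwise(depthwise(x))   # same output shape as standard(x)
```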
Convolutional Neural Net: Building Blocks
• Convolution layers: translation equivariant ✓
• Pooling layers: no
Pooling Layers
• Result: single-pixel shifts can change the classification [Azulay & Weiss, JMLR 2019]
• The problem can be solved, even for fractional shifts [Hagay et al., CVPR 2023]
• But what is the problem?
What is the problem? Aliasing!
• Subsampling below the Nyquist rate lets high frequencies masquerade as low ones: the signal reconstructed from the subsampled values differs from the original $x$, and a small shift of $x$ changes the subsampled values (see the demo below)
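A minimal aliasing demo; the signal and its frequency are assumptions chosen to sit near the Nyquist limit:

```python
import numpy as np

t = np.arange(64)
x = np.sin(2 * np.pi * 0.45 * t)    # frequency close to Nyquist
a = x[::2]                          # stride-2 subsampling
b = np.roll(x, 1)[::2]              # same signal shifted by one sample
print(np.abs(a - b).max())          # large: the subsampled signals disagree
# Low-pass filtering before subsampling (anti-aliasing) removes this.
```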
Convnet Backpropagation
For a convolution layer $\mathbf{y} = \mathbf{w} \ast \mathbf{x}$, both gradients are themselves convolutions:
• Kernel gradient: $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{x} \ast \frac{\partial L}{\partial \mathbf{y}}$ (correlate the input with the upstream gradient)
• Input gradient: $\frac{\partial L}{\partial \mathbf{x}} = \operatorname{flip}(\mathbf{w}) \ast \operatorname{zeropad}\!\big(\frac{\partial L}{\partial \mathbf{y}}\big)$ (a “full” convolution: zero-pad the upstream gradient, then convolve with the 180°-flipped kernel)
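A numeric check of both identities against autograd, a sketch with assumed single-channel sizes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 6, 6, requires_grad=True)
w = torch.randn(1, 1, 3, 3, requires_grad=True)
y = F.conv2d(x, w)                  # y = w * x (cross-correlation), 4x4
g = torch.randn_like(y)             # upstream gradient dL/dy
y.backward(g)

# dL/dw: correlate the input with the upstream gradient
dw = F.conv2d(x.detach(), g)
# dL/dx: zero-pad dL/dy and correlate with the flipped kernel
dx = F.conv2d(F.pad(g, (2, 2, 2, 2)), torch.flip(w.detach(), [-2, -1]))
print(torch.allclose(dw, w.grad, atol=1e-4),
      torch.allclose(dx, x.grad, atol=1e-4))   # True True
```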
History & Architectures
History – first CNN (1993)
ImageNet challenge
• Innovations (AlexNet, 2012):
• Max pooling, ReLU nonlinearity
• More data and bigger model (7 hidden layers, 650K units, 60M params)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Dropout regularization (later)
ImageNet top-5 error (%) by year: 2012 (AlexNet) 16.4 → 2013 11.7 → 2014(1) (VGG) 7.3 → 2014(2) 6.7 → 2015 (ResNet) 3.57; human ≈ 5
Comparing architectures
https://culurciello.github.io/tech/2016/06/04/nets.html
Basic Residual Architectures
• A residual block computes $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$: the layers learn a residual correction on top of an identity skip connection
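A minimal sketch of such a block, assuming the common two-conv design with batch norm:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # identity skip + residual

y = ResidualBlock(16)(torch.randn(1, 16, 32, 32))  # shape preserved
```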
More efficient variations
• Concatenation?
• Architecture search
• Standard hyperparameters (learning rate, etc.) need tuning for a new architecture
Neural Architecture Search (NAS)
• Motivation: automating the architectural design process
• Huge search space → use a reduced space
• Search on small datasets (CIFAR10), apply to large datasets (ImageNet)
• Optimization methods (a toy search sketch follows):
• Evolutionary algorithms [AmoebaNet]
• Reinforcement learning [NASNet]
• Grid search [EfficientNet]
• Gradient-based methods (e.g., DARTS) …
• Hardware objectives (FLOPS, power, latency) can be added [MnasNet, EfficientNetV2]
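A toy random-search sketch over a reduced space; the space, the builder, and the parameter-count score are stand-ins (assumptions) for a real proxy objective such as CIFAR10 accuracy:

```python
import random
import torch.nn as nn

space = {"depth": [2, 4, 6], "width": [16, 32, 64], "kernel": [3, 5]}

def build(cfg):
    layers, c = [], 3
    for _ in range(cfg["depth"]):
        layers += [nn.Conv2d(c, cfg["width"], cfg["kernel"], padding="same"),
                   nn.ReLU()]
        c = cfg["width"]
    return nn.Sequential(*layers)

best = None
for _ in range(10):                        # random search over the space
    cfg = {k: random.choice(v) for k, v in space.items()}
    model = build(cfg)
    params = sum(p.numel() for p in model.parameters())
    score = -params                        # stand-in: real NAS uses proxy accuracy,
    if best is None or score > best[0]:    # possibly penalized by FLOPS/latency
        best = (score, cfg)
print(best[1])
```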
Extensions
Other Uses in Vision Tasks
Convnets for Speech Classification
• The 1D speech signal (amplitude over time) is converted into a 2D input over time, e.g. a spectrogram, and fed to a convnet (see the sketch below)
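A sketch of this pipeline; all parameter values (sample rate, FFT size, channel counts) are assumptions:

```python
import torch

wave = torch.randn(16000)                  # 1 s of audio at 16 kHz
spec = torch.stft(wave, n_fft=256, hop_length=128,
                  window=torch.hann_window(256),
                  return_complex=True).abs()
x = spec.log1p()[None, None]               # (1, 1, freq, time) "image"
y = torch.nn.Conv2d(1, 8, kernel_size=3, padding=1)(x)
print(spec.shape, y.shape)
```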
Summary
• Hierarchical representation using local connectivity
• Invariance and Equivariance using convolutions
• Building blocks
• Architectures
• Extensions
Some slides and visuals adapted from the courses cs231n (Stanford) and 236278 (Technion), and from “A guide to convolution arithmetic for deep learning”.