DL6 - Convnets 4

Convolutional neural networks use three main types of layers: convolution layers, activation layers, and pooling layers. Convolution layers apply filters to input data to extract features. These layers incorporate translation invariance, allowing the network to detect patterns regardless of position. Deeper convolution layers detect more complex patterns by processing information from larger regions of the input. Overall, convolutional neural networks use local connectivity and weight sharing to efficiently process visual input data and learn hierarchical representations.


Convolutional neural networks

Deep Learning – 046211


Daniel Soudry and Yossi Keshet
https://deepdreamgenerator.com/
Motivation - Classification task progress

[Figure: ImageNet top-5 error (%) by year. 2010: 28.2, 2011: 25.8, 2012: 16.4 (start of the deep learning era), 2013: 11.7, 2014(1): 7.3, 2014(2): 6.7, 2015: 3.57; human-level error is marked near 5.]

Convolutional Neural Nets: Building Blocks

• Convolution layers

• Activation Layers (ReLU)

• Pooling layers

• Fully connected layers


Convolutional Neural Nets Properties

• Property 1: Deeper layers represent more complex parts of the image

• Property 2: Translational invariance/equivariance


Local Connectivity for Hierarchical Representations
• Multilayer neural networks: each neuron is connected to every neuron in the next layer
• Convnets: only local connectivity

• Deeper neurons are affected by larger input regions
• Hierarchical representation:
  • Shallow layers detect local features
  • Deeper layers detect global features
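The growth of the input region with depth can be made concrete with the standard receptive-field recurrence (receptive field grows by (k − 1) · jump per layer, where jump is the product of earlier strides). A minimal pure-Python sketch, with an illustrative function name:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    Recurrence: rf += (k - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stacked 3x3 convs (stride 1) see a 7x7 input region,
# so deeper neurons are affected by larger parts of the image.
print(receptive_field([(3, 1)] * 3))  # 7
```

With strides the region grows even faster: two 3x3 convs with stride 2 already cover a 7-pixel window.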
Invariance to translations

Invariance: moving the image does not change its class

Note: not necessary if all images (train+test) are pre-aligned
(e.g., see DeepFace: a convnet without weight sharing)
Are fully connected layers invariant to translation?
• Fully connected layer with an image input: pixels x[i, j], weight W[i, j, k] for neuron k
• Cat detector: on this training image, the red weights will change to better detect a cat
• On this test image (cat translated), the green weights will not detect a cat
• Not invariant to translation!
Convnets: built-in translation invariance
• Label unaffected by translation
  ⇒ classifier should be translation invariant
Why invariance is not enough?
• Shallow layer features: "eyes" + "nose" + "mouth"
• A deeper layer should detect "face",
  but only if all the features are spatially related

Don't lose location information!

Invariant layers will detect a face in a Picasso picture:

Invariance / equivariance
• f is invariant to a transformation family T if f(τ(x)) = f(x) for all τ ∈ T
  (here τ is a transformation that changes the order of the components of x, e.g., a translation)
• Example: image classification: f(image) = "Cat" and f(translated image) = "Cat"

• f is equivariant to T if f(τ(x)) = τ(f(x)) for all τ ∈ T
• Example: edge detection: translating the image translates the edge map
Convnets: built-in translation invariance
• Label unaffected by translation
  ⇒ the classifier (output) should be translation invariant
• Features should translate with the image
  ⇒ convnet hidden layers are equivariant
How do we build equivariant layers?
Exercise: Let f(x) = σ(Wx), where σ is a component-wise invertible non-linearity.
Prove that f is equivariant to a transformation family T if and only if Wτ = τW for all τ ∈ T.

Example: find W equivariant to the cyclic translation
(x₁, x₂, x₃, x₄) → (x₂, x₃, x₄, x₁)
⇒ W is a cyclic convolution (a circulant matrix) in 1D!

Exercise: Find conditions on W and b so that f(x) = σ(Wx + b)
is equivariant to permutations of the components of x.
[Maron et al. 2020]
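The cyclic-translation example can be checked numerically. The sketch below (plain Python, illustrative names) builds a circulant matrix W from a 1D kernel and verifies the equivariance W(τx) = τ(Wx):

```python
def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def cyclic_shift(x):
    """The cyclic translation (x1, x2, x3, x4) -> (x2, x3, x4, x1)."""
    return x[1:] + x[:1]

def circulant(kernel):
    """Circulant matrix: each row is a cyclic shift of `kernel`,
    so W x computes a cyclic (circular) 1D convolution."""
    n = len(kernel)
    return [[kernel[(j - i) % n] for j in range(n)] for i in range(n)]

W = circulant([2.0, 1.0, 0.0, -1.0])
x = [1.0, 2.0, 3.0, 4.0]
# Equivariance: shifting then filtering == filtering then shifting.
print(matvec(W, cyclic_shift(x)) == cyclic_shift(matvec(W, x)))  # True
```

A generic (non-circulant) W fails this check, which is the content of the exercise above: commuting with the shift forces the circulant structure.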
Convolution and Cross-correlation in 2D
• Two real signals x and y
• 2D convolution:
  (x ∗ y)[i, j] = Σₘ Σₙ x[m, n] · y[i − m, j − n]
• 2D cross-correlation:
  (x ⋆ y)[i, j] = Σₘ Σₙ x[m, n] · y[i + m, j + n]

Q: Which one is used in convolutional neural nets?
A: Cross-correlation! (but we call it convolution anyway)

Q: Are both translation equivariant?
A: Convolution: yes. Cross-correlation: only to translations of y.
2D Convolution on a single map (channel)
Examples of 2D convolutions:
• Edge-detection kernel
• Sharpening kernel
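As a concrete version of these examples, here is a minimal "valid" 2D cross-correlation in plain Python (the operation convnets actually compute), applied with a common 3x3 sharpening kernel; function and variable names are illustrative:

```python
def cross_correlate_2d(x, w):
    """'Valid' 2D cross-correlation of map x with square kernel w."""
    H, W_ = len(x), len(x[0])
    k = len(w)
    return [[sum(x[i + m][j + n] * w[m][n] for m in range(k) for n in range(k))
             for j in range(W_ - k + 1)] for i in range(H - k + 1)]

# Sharpening kernel: amplifies the center pixel relative to its neighbors.
sharpen = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]
img = [[1, 1, 1, 1],
       [1, 2, 2, 1],
       [1, 2, 2, 1],
       [1, 1, 1, 1]]
print(cross_correlate_2d(img, sharpen))  # [[4, 4], [4, 4]]
```

Note the "valid" output is smaller than the input (4x4 in, 2x2 out); padding, covered next, controls this shrinkage.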
Zero Padding
Padding = 0 Padding = 1 Padding = 2

kernel size=3, stride=1, dilation=1


Stride
Stride = 1 Stride = 2

kernel size=3, padding=1, dilation=1


Dilation
Dilation=2

kernel size=3, stride=1, padding=0
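The combined effect of padding, stride, and dilation on the output size can be summarized in one formula, matching the sizing convention used by common frameworks (e.g., PyTorch's Conv2d); the function name is illustrative:

```python
def conv_output_size(n, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (floor division)."""
    # Dilation spreads the kernel taps apart, enlarging its footprint.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1

# kernel 3, stride 1: padding=1 keeps the size, padding=0 shrinks it by 2.
print(conv_output_size(28, 3, padding=1))            # 28
print(conv_output_size(28, 3, padding=0))            # 26
print(conv_output_size(28, 3, stride=2, padding=1))  # 14
print(conv_output_size(28, 3, dilation=2))           # 24
```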


Upsampled/Transposed convolution
• More relevant for generative models (e.g., Generative Adversarial Networks, GANs)
• Not "Deconvolution"!
Convolution Layer: Single Input, Single Output
[Figure: input map → learned kernel → output map]
Convolution Layer: Single Input, Two Outputs
[Figure: the same input map convolved with another learned kernel gives a second output map]
Convolution Layer: Single Input, Many Outputs
[Figure: one input map, N kernels → N output maps]
Convolution Layer: Many Inputs, One Output
Convolution Layer: Many Inputs, Many Outputs
[Figure: M input maps, M·N K×K kernels → N output maps; each output map sums M filtered maps, one per input map]

Q: What does a 1x1 kernel do?
A: "Point-wise convolution". Scalars multiplying entire maps. Useful for cheaply combining maps.
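The point-wise (1x1) case can be sketched directly: each output map is just a weighted sum of the input maps, with no spatial mixing. Plain Python, with an illustrative helper name:

```python
def pointwise_conv(maps, weights):
    """1x1 convolution: each output map is a weighted sum of the input maps.

    maps: list of M input maps (2D lists); weights: N rows of M scalars.
    """
    H, W = len(maps[0]), len(maps[0][0])
    return [[[sum(w[m] * maps[m][i][j] for m in range(len(maps)))
              for j in range(W)] for i in range(H)]
            for w in weights]

a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
# Two input maps -> one output map equal to a + 0.1 * b.
out = pointwise_conv([a, b], [[1.0, 0.1]])
print(out)  # [[[2.0, 4.0], [6.0, 8.0]]]
```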
Grouped Convolution
[Figure: M input maps split into G groups; each group of output maps sees only its own group of input maps, using G groups of K×K kernels]
Depth-wise convolution
• Grouped convolution with #groups = M = N
[Figure: M input maps, M K×K kernels → M output maps, one kernel per map]
Common Cheap Option: Separable Convolution
• Key idea: divide the convolution into two steps [introduced in the MobileNet architecture]

Depthwise convolution (spatial domain), then pointwise convolution (channel domain):
[Figure: M input maps → M K×K kernels → M intermediate maps → M·N 1×1 kernels → N output maps]

Q: #parameters in comparison to standard convolution?
Q: #multiplications in comparison to standard convolution?
A: standard: M·N·K², separable: M·K² + M·N (multiplications scale by the same ratio, since each kernel weight is applied once per output position)
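The savings can be checked with a quick count (pure Python; function names are illustrative):

```python
def standard_conv_params(M, N, K):
    # One KxK kernel per (input map, output map) pair.
    return M * N * K * K

def separable_conv_params(M, N, K):
    # Depthwise: M KxK kernels; pointwise: M*N 1x1 kernels.
    return M * K * K + M * N

M, N, K = 64, 128, 3
print(standard_conv_params(M, N, K))   # 73728
print(separable_conv_params(M, N, K))  # 8768
```

For these typical sizes the separable version uses roughly 8x fewer parameters (and, per output position, correspondingly fewer multiplications).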
Convolutional Neural Nets: Building Blocks

• Convolution layers

• Activation Layers (ReLU)

• Pooling layers

• Fully connected layers


Pooling layer
• Makes the representation smaller
• Operates map-wise
• Alternatives: strided or dilated convolutions
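A 2x2 max pooling, operating on a single map, can be sketched in plain Python (illustrative helper name):

```python
def max_pool_2d(x, size=2, stride=2):
    """Map-wise max pooling: keep the largest value in each window."""
    H, W = len(x), len(x[0])
    return [[max(x[i + di][j + dj] for di in range(size) for dj in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 1, 5, 6],
     [2, 2, 7, 8]]
print(max_pool_2d(x))  # [[4, 2], [2, 8]]
```

Each 2x2 window collapses to its maximum, halving both spatial dimensions.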
Are convnets really equivariant/invariant?

Building blocks: Equivariant?
• Convolution layers: Yes
• Activation layers (e.g., ReLU, GeLU): Yes
• Normalization layers: Yes (for some types)
• Pooling layers: No

Result: single-pixel shifts can change the classification
[Azulay & Weiss, JMLR 2019]
• The problem can be solved, even for fractional shifts [Hagay et al., CVPR 2023]
• But what is the problem?
What is the problem? Aliasing!

[Figure: original vs. reconstructed signal, showing aliasing artifacts] [Karras et al., NeurIPS 2021]

Improved performance (in GANs) [Karras et al., NeurIPS 2021]

A few comments on convolution layers
(1) Convolution layer vs. fully connected layer
• Convnets use the same kernel for every output neuron (shared parameters)
• Example: 300 × 300 input map, 300 × 300 output map
Q: How many parameters in a
  • Convolution layer (5 × 5 kernel)? 26 parameters (25 weights + 1 bias)
  • Fully connected layer? 8.1 × 10⁹ parameters (90,000 × 90,000)
Q: Are convnets better only due to the savings in #params?
A: No; even with similar parameter counts, convnets outperform fully connected networks [Malach & Shalev-Shwartz, ICLR 2020]
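The two counts above can be reproduced directly (ignoring biases in the fully connected case):

```python
H = W = 300  # input and output map sizes from the slide

# Convolution layer: one shared 5x5 kernel (+ bias) for all output positions.
conv_params = 5 * 5 + 1
# Fully connected layer: every output pixel connects to every input pixel.
fc_params = (H * W) * (H * W)

print(conv_params)  # 26
print(fc_params)    # 8100000000, i.e. 8.1e9
```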
(2) Can we use FFT to accelerate the convolution operation?
• The most common convolution kernel is 3x3
  → the best convolution implementations are spatial (not Fourier)

• Fourier can be beneficial in non-standard domains
  [Spherical CNNs, Clebsch–Gordan Nets]

• Usually better: approximate the domain
  so we can use spatial convolution [Cohen et al.]
(3) How to backpropagate through convolution?
For a convolution layer y = x ⋆ w (cross-correlation):

• Gradient w.r.t. the kernel:
  ∂L/∂w = x ⋆ (∂L/∂y)
  (cross-correlate the input with the upstream gradient)

• Gradient w.r.t. the input:
  ∂L/∂x = pad(∂L/∂y) ∗ flip(w)
  (a "full" convolution: zero-pad the upstream gradient and convolve with the flipped kernel)
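Both gradient rules can be verified numerically. The sketch below (plain Python, 1D, "valid" correlation, illustrative names) computes both gradients and checks the kernel gradient against finite differences:

```python
def corr1d(x, w):
    """Valid 1D cross-correlation (the forward 'convolution' in convnets)."""
    K = len(w)
    return [sum(x[i + m] * w[m] for m in range(K)) for i in range(len(x) - K + 1)]

def grads(x, w, g):
    """Backprop through y = corr1d(x, w), given upstream gradient g = dL/dy."""
    K, N = len(w), len(x)
    # dL/dw: cross-correlate the input with the output gradient.
    dw = [sum(g[i] * x[i + m] for i in range(len(g))) for m in range(K)]
    # dL/dx: 'full' convolution of the (implicitly zero-padded) output
    # gradient with the kernel.
    dx = [sum(g[j - m] * w[m] for m in range(K) if 0 <= j - m < len(g))
          for j in range(N)]
    return dw, dx

# Numerical check of dL/dw with L = sum(y), so g is all ones.
x, w = [1.0, 2.0, -1.0, 3.0], [0.5, -2.0]
g = [1.0] * (len(x) - len(w) + 1)
dw, dx = grads(x, w, g)
eps = 1e-6
for m in range(len(w)):
    w2 = list(w); w2[m] += eps
    num = (sum(corr1d(x, w2)) - sum(corr1d(x, w))) / eps
    assert abs(num - dw[m]) < 1e-4
print(dw, dx)  # [2.0, 4.0] [0.5, -1.5, -1.5, -2.0]
```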
History & Architectures
History – first CNN (1993)
ImageNet challenge

• ~14 million labeled images, 20k classes

• Images gathered from Internet

• Human labels via Amazon Mechanical Turk

• ImageNet Large-Scale Visual Recognition


Challenge (ILSVRC):
1.2 million training images, 1000 classes
AlexNet – ILSVRC 2012 winner

• Innovation:
• Max pooling, ReLU nonlinearity
• More data and bigger model (7 hidden layers, 650K units, 60M params)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Dropout regularization (later)

• Achieved a top-5 error ~11 percentage points lower than the second-place entry!


• AlexNet: arguably the most influential paper published in computer vision
VGG (2014)
• Deeper network
• Improved AlexNet by 9%

• Only 3x3 convolution


How do we train more than 30 layers?
Skip Connections! [He et al. 2015]

[Figure: ImageNet train (thin) and test (bold) error curves, without and with skip connections; adding skip connections lets the deeper network train]

Q: What happens to the activations with standard inits?
A: They explode with depth!
Q: Solution?
A: Normalization layers, or modify the init (later lecture)
ImageNet progress

[Figure: ImageNet top-5 error (%) by year. 2010: 28.2, 2011: 25.8, 2012: 16.4 (AlexNet, start of the deep learning era), 2013: 11.7, 2014(1): 7.3 (VGG), 2014(2): 6.7, 2015: 3.57 (ResNet); human-level error is marked near 5.]
Comparing architectures

https://culurciello.github.io/tech/2016/06/04/nets.html
Basic Residual Architectures
More efficient variations:
• ResNet Module
• ResNext Module (grouped convolution)
• MobileNetV2 Module ("Inverted Bottleneck", also different activations)
Densenet
• Densenet Block

• Concatenation?

• Putting it all together


How to find new architectures?
• (low risk) Start from current good architecture
• Make operations more efficient (e.g., group conv)
• If it does not hurt much, increase depth/width to improve performance

• (high risk) Try something new


• In many cases, extending existing good ideas (e.g., Resnet -> Densenet)

• Architecture search

• Standard hyperparameters need tuning (learning rate, etc.) for new architecture
Neural architecture search (NAS)
• Motivation: automating the architecture design process
• Huge search space → use a reduced space
• Search on small datasets (CIFAR10), apply to large datasets (ImageNet)
• Optimization methods:
  • Evolutionary algorithms [AmoebaNet]
  • Reinforcement learning [NasNet]
  • Grid search [EfficientNet]
  • Gradient-based methods (e.g., DARTS) …
• Hardware objectives (FLOPS, power, latency) can be added [MnasNet, EfficientNetV2]
Extensions
Other Uses in Vision Tasks

Different tasks require architecture changes, e.g.:
Convnets for Speech Classification
[Figure: a speech signal (amplitude vs. time) is converted by the short-time Fourier transform into a time-frequency image (frequency vs. time), which serves as the 2D input to a convnet]
Summary
• Hierarchical representation using local connectivity
• Invariance and Equivariance using convolutions
• Building blocks
• Architectures
• Extensions

Some slides and visuals adapted from the courses cs231n (Stanford), 236278 (Technion), and "A guide to convolution arithmetic for deep learning"
