Convolutional Networks and CNN Normalization
Recap: backpropagation. The forward pass computes outputs; the backward pass computes gradients. At each node f, local gradients are combined with the upstream gradient to produce downstream gradients for the inputs and the weights W.
Recap: fully-connected networks. A fully-connected network x -> W1 -> h -> W2 -> s flattens the 3x32x32 input image into a 3072-dimensional vector, e.g. with a 100-unit hidden layer producing 10 output scores; the spatial structure of the image is lost.
Components of a Convolutional Network: fully-connected layers and activation functions (as in the networks so far), plus convolution layers, pooling layers, and normalization.
Fully-Connected Layer (recap): the 3072-dimensional input is multiplied by a 10 x 3072 weight matrix to produce a 10-dimensional output. Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
Convolution Layer
The input is a 3x32x32 image: 3 depth / channels, 32 height, 32 width. Convolve a 3x5x5 filter with the image; filters always extend the full depth of the input volume.
At each position we get 1 number: the result of taking a dot product between the filter and a small 3x5x5 chunk of the image (i.e. 3*5*5 = 75-dimensional dot product + bias): $w^{\mathsf{T}}x + b$.
Sliding the filter over all spatial positions gives a 1x28x28 activation map (28 x 28 output positions for a 32 x 32 input with a 5x5 filter). A second filter produces a second, separate 1x28x28 activation map.
With a bank of 6 filters (a 6x3x5x5 weight tensor, plus a 6-dim bias vector), stack the activations to get a 6x28x28 output image. Two equivalent views of this output: 6 activation maps, each 1x28x28; or a 28x28 grid with a 6-dim vector at each point.
The same applies to a batch of inputs: a 2x3x32x32 batch of images gives a 2x6x28x28 batch of outputs (still using the 6x3x5x5 filters and 6-dim bias vector).
In general: input N x Cin x H x W; weights Cout x Cin x Kw x Kh (Cout filters); bias vector of length Cout; output N x Cout x H' x W'.
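As a quick check of these shapes, here is a minimal PyTorch sketch (PyTorch availability and the specific tensor values are assumptions; the shapes match the example above):

import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)                    # batch of 2 images, each 3x32x32
conv = nn.Conv2d(in_channels=3, out_channels=6,  # 6 filters, each 3x5x5
                 kernel_size=5)
print(conv.weight.shape)                         # torch.Size([6, 3, 5, 5])
print(conv.bias.shape)                           # torch.Size([6])
print(conv(x).shape)                             # torch.Size([2, 6, 28, 28])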
Problem: stacking two convolution layers back-to-back is still a linear operation, equivalent to a single convolution. Solution: add an activation function between conv layers:
Conv -> ReLU -> Conv -> ReLU -> Conv -> ReLU -> ... (e.g. spatial sizes 32 -> 28 -> 26 -> ...)
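A minimal sketch of such a stack (the channel count of the second conv is an illustrative assumption; kernel sizes are chosen to reproduce the 32 -> 28 -> 26 spatial sizes):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # 3x32x32 -> 6x28x28
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=3),   # 6x28x28 -> 10x26x26
    nn.ReLU(),
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10, 26, 26])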
What do convolutional filters learn?
With Conv -> ReLU (W1: 6x3x5x5, b1: 6), the input N x 3 x 32 x 32 maps to a first hidden layer of size N x 6 x 28 x 28.
An MLP learns a bank of whole-image templates. First-layer conv filters instead learn local image templates (often oriented edges and opposing colors). Example: AlexNet's first layer has 64 filters, each 3x11x11.
A closer look at spatial dimensions
Input: 7x7, Filter: 3x3, Output: 5x5.
In general: Input: W, Filter: K, Output: W - K + 1.
Problem: feature maps "shrink" with each layer!
Solution: padding. Add zeros around the input.
In general: Input: W, Filter: K, Padding: P, Output: W - K + 1 + 2P.
Very common: set P = (K - 1) / 2 to make the output have the same size as the input.
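A small helper (a sketch, not taken from the slides) that evaluates this output-size formula:

def conv_output_size(W, K, P=0):
    """Spatial output size for a stride-1 convolution: input W, kernel K, padding P."""
    return W - K + 1 + 2 * P

print(conv_output_size(7, 3))        # 5
print(conv_output_size(7, 3, P=1))   # 7 -- with P = (K - 1) // 2 the size is unchanged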
Receptive Fields
Each output element depends on a K x K receptive field in the previous layer, and stacking convolutions grows the receptive field in the original input. Be careful: "receptive field in the input" vs "receptive field in the previous layer" mean different things. Hopefully clear from context!
Example: input volume 3 x 32 x 32, 10 5x5 filters with stride 1, pad 2. What is the output volume size, how many learnable parameters are there, and how many multiply-add operations does it take? (Worked arithmetic below.)
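A sketch of the arithmetic for this example, using the output-size formula above generalized to stride S, i.e. (W - K + 2P) / S + 1:

Cin, H, W = 3, 32, 32
Cout, K, S, P = 10, 5, 1, 2

H_out = (H - K + 2 * P) // S + 1            # 32
W_out = (W - K + 2 * P) // S + 1            # 32
print((Cout, H_out, W_out))                 # output volume: (10, 32, 32)

params = Cout * (Cin * K * K + 1)           # 760 learnable parameters (weights + biases)
macs = Cout * H_out * W_out * Cin * K * K   # 768,000 multiply-add operations
print(params, macs)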
1x1 Convolution
A 1x1 CONV with 32 filters applied to a 56x56 input with 64 channels: each filter has size 1x1x64 and performs a 64-dimensional dot product at each spatial position, giving a 56x56 output with 32 channels. Stacking 1x1 conv layers gives an MLP operating on each input position. (Lin et al, "Network in Network", ICLR 2014)
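A minimal PyTorch sketch of this 1x1 convolution example (tensor values are placeholders):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)               # 64 channels at each of 56x56 positions
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)   # each filter is 1x1x64: a 64-dim dot product per position
print(conv1x1(x).shape)                      # torch.Size([1, 32, 56, 56])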
Other types of convolution
So far: 2D convolution. Input: Cin x H x W; weights: Cout x Cin x K x K.
1D convolution. Input: Cin x W; weights: Cout x Cin x K.
3D convolution. Input: Cin x H x W x D (a Cin-dim vector at each point in the volume); weights: Cout x Cin x K x K x K.
PyTorch Convolution Layer
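These layers correspond to torch.nn.Conv1d / Conv2d / Conv3d. A sketch of the 2D case with the hyperparameters discussed in this lecture (the values shown are just the earlier example, not required defaults):

import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,    # Cin
    out_channels=6,   # Cout = number of filters
    kernel_size=5,    # K: each filter is Cin x 5 x 5
    stride=1,
    padding=2,        # P = (K - 1) // 2 keeps the spatial size unchanged
)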
Pooling Layers: another way to downsample.
Hyperparameters: kernel size, stride, pooling function.
Example: 2x2 max pooling with stride 2 takes the maximum over each 2x2 region of the input. Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
Output: C x H' x W', where
- H' = (H - K) / S + 1
- W' = (W - K) / S + 1
Learnable parameters: None!
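A sketch of the common max-pooling setting in PyTorch (input sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 224, 224)               # C x H x W = 64 x 224 x 224 (plus a batch dim)
pool = nn.MaxPool2d(kernel_size=2, stride=2)   # common setting: max, K = 2, S = 2
print(pool(x).shape)                           # torch.Size([1, 64, 112, 112]); H' = (224 - 2)/2 + 1 = 112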
Components of a Convolutional Network: convolution layers, pooling layers, fully-connected layers, activation functions, normalization.
Example: LeNet-5, a classic convolutional network that combines these components.
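A minimal sketch of a LeNet-style network built from these components; the specific layer sizes here are illustrative assumptions rather than the exact LeNet-5 configuration:

import torch
import torch.nn as nn

lenet_style = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2),   # 1x28x28 -> 20x28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # -> 20x14x14
    nn.Conv2d(20, 50, kernel_size=5, padding=2),  # -> 50x14x14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # -> 50x7x7
    nn.Flatten(),                                 # -> 2450
    nn.Linear(50 * 7 * 7, 500),
    nn.ReLU(),
    nn.Linear(500, 10),                           # 10 class scores
)
print(lenet_style(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])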
Batch Normalization (Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015)
Input: $x \in \mathbb{R}^{N \times D}$
Per-channel mean, shape D: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}$
Per-channel variance, shape D: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2$
Normalized x, shape N x D: $\hat{x}_{i,j} = \dfrac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \varepsilon}}$
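A minimal sketch of this computation on an N x D input (training-mode statistics only; the learnable scale/shift parameters and the running averages used at test time are omitted):

import torch

def batchnorm_forward(x, eps=1e-5):
    # x: N x D
    mu = x.mean(dim=0)                        # per-channel mean, shape D
    var = x.var(dim=0, unbiased=False)        # per-channel variance, shape D
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized x, shape N x D
    return x_hat

x = torch.randn(32, 100)
print(batchnorm_forward(x).mean(dim=0).abs().max())  # ~0: each channel is now zero-mean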
Batch normalization is usually inserted after fully-connected or convolutional layers and before the nonlinearity: ... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ..., with
$\hat{x} = \dfrac{x - E[x]}{\sqrt{\mathrm{Var}[x]}}$
(Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015)
[Figure: ImageNet accuracy over training for networks with and without batch normalization.]
Layer Normalization
Batch Normalization (fully-connected networks) vs. Layer Normalization (fully-connected networks):
- x: N x D in both cases
- Normalization statistics: $\mu, \sigma$: 1 x D (batch norm) vs. N x 1 (layer norm)
- Learnable parameters: $\gamma, \beta$: 1 x D in both cases
- Output in both cases: $y = \gamma \dfrac{x - \mu}{\sigma} + \beta$
Instance Normalization
Batch Normalization (convolutional networks) vs. Instance Normalization (convolutional networks):
- x: N x C x H x W in both cases
- Normalization statistics: $\mu, \sigma$: 1 x C x 1 x 1 (batch norm) vs. N x C x 1 x 1 (instance norm)
- Learnable parameters: $\gamma, \beta$: 1 x C x 1 x 1 in both cases
- Output in both cases: $y = \gamma \dfrac{x - \mu}{\sigma} + \beta$
Comparison of Normalization Layers
[Figure: batch, layer, and instance normalization compared by which dimensions of the N x C x H x W activation tensor the statistics are computed over.]
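The practical difference between these layers is just which axes the statistics are averaged over. A sketch for an N x C x H x W tensor, following the shapes listed above (the layer-norm shape for convolutional inputs, N x 1 x 1 x 1, is an assumption consistent with the fully-connected case):

import torch

x = torch.randn(8, 64, 32, 32)  # N x C x H x W

mu_batch    = x.mean(dim=(0, 2, 3), keepdim=True)  # batch norm:    1 x C x 1 x 1
mu_instance = x.mean(dim=(2, 3),    keepdim=True)  # instance norm: N x C x 1 x 1
mu_layer    = x.mean(dim=(1, 2, 3), keepdim=True)  # layer norm:    N x 1 x 1 x 1
print(mu_batch.shape, mu_instance.shape, mu_layer.shape)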
Summary, components of a convolutional network: convolution layers (which typically account for most of the computation), pooling layers, fully-connected layers, activation functions, normalization.