
Lecture 7:

Convolutional Networks

Justin Johnson Lecture 7 - 1 January 31, 2022




Lecture Format
- We will remain remote for at least another 2-3 weeks
- Idea: book a conference room for “watch parties?”
Or just use lecture hall
- COVID cases in MI have (hopefully!) peaked? If they continue to drop, we will consider in-person OH in the next 1-2 weeks
- May revisit after Spring Break
- Feel free to raise hand to ask questions in Zoom!
- Midterm will be remote (but still working on exact format)

Justin Johnson Lecture 7 - 4 January 31, 2022


Reminder: A2

Due last Friday

Justin Johnson Lecture 7 - 5 January 31, 2022


A3

Will be released tonight, covering:

- Backpropagation with modular API


- Different update rules (Momentum, RMSProp, Adam, etc)
- Batch Normalization
- Dropout
- Convolutional Networks

Justin Johnson Lecture 7 - 6 January 31, 2022


Last Time: Backpropagation
Represent complex expressions as computational graphs.
During the backward pass, each node in the graph receives upstream gradients and multiplies them by local gradients to compute downstream gradients.

[Figure: computational graph of a linear classifier. x and W feed a multiply node (*) producing scores s, which feed a hinge loss; W also feeds a regularizer R; the two terms are added (+) to give the loss L. A generic node f receives an upstream gradient, multiplies it by local gradients, and emits downstream gradients.]

Forward pass computes outputs
Backward pass computes gradients

Justin Johnson Lecture 7 - 7 January 31, 2022


f(x,W) = Wx
Problem: So far our classifiers don't respect the spatial structure of images!

[Figure: stretch pixels into a column. A 2x2 input image with pixel values 56, 231, 24, 2 is flattened into a (4,) column; in general a 3072-dim input x passes through W1 to a hidden layer h of size 100 and through W2 to scores s of size 10.]

Justin Johnson Lecture 7 - 8 January 31, 2022


f(x,W) = Wx
Problem: So far our classifiers don't respect the spatial structure of images!
Solution: Define new computational nodes that operate on images!

[Figure: same pixel-stretching diagram as above.]
Justin Johnson Lecture 7 - 9 January 31, 2022


Components of a Fully-Connected Network
Fully-Connected Layers Activation Function

x h s

Justin Johnson Lecture 7 - 10 January 31, 2022


Components of a Convolutional Network
Fully-Connected Layers Activation Function

x h s

Convolution Layers Pooling Layers Normalization

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 11 January 31, 2022




Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1 vector
Weights: 10 x 3072 matrix
Output: 10 x 1 vector

Justin Johnson Lecture 7 - 13 January 31, 2022


Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1 vector
Weights: 10 x 3072 matrix
Output: 10 x 1 vector
Each output element is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)

Justin Johnson Lecture 7 - 14 January 31, 2022


Convolution Layer
3x32x32 image: preserve spatial structure (3 depth / channels, 32 height, 32 width)

Justin Johnson Lecture 7 - 15 January 31, 2022
Convolution Layer
3x32x32 image
3x5x5 filter
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"

Justin Johnson Lecture 7 - 16 January 31, 2022
Convolution Layer
Filters always extend the full depth of the input volume
3x32x32 image
3x5x5 filter
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"

Justin Johnson Lecture 7 - 17 January 31, 2022
Convolution Layer
3x32x32 image
3x5x5 filter
1 number: the result of taking a dot product between the filter and a small 3x5x5 chunk of the image
(i.e. 3*5*5 = 75-dimensional dot product + bias): w^T x + b

Justin Johnson Lecture 7 - 18 January 31, 2022
Convolution Layer
3x32x32 image
3x5x5 filter
Convolve (slide) over all spatial locations, producing a 1x28x28 activation map

Justin Johnson Lecture 7 - 19 January 31, 2022


Convolution Layer
3x32x32 image
3x5x5 filter
Consider repeating with a second (green) filter: convolve (slide) over all spatial locations, producing two 1x28x28 activation maps

Justin Johnson Lecture 7 - 20 January 31, 2022


Convolution Layer
3x32x32 image
Consider 6 filters, each 3x5x5 (a 6x3x5x5 filter tensor)
Stack activations to get a 6x28x28 output image: 6 activation maps, each 1x28x28

Justin Johnson Lecture 7 - 21 January 31, 2022
Convolution Layer
3x32x32 image
6x3x5x5 filters; also a 6-dim bias vector
Stack activations to get a 6x28x28 output image: 6 activation maps, each 1x28x28

Justin Johnson Lecture 7 - 22 January 31, 2022
Convolution Layer
Another view of the output: a 28x28 grid, with a 6-dim vector at each point

Justin Johnson Lecture 7 - 23 January 31, 2022
Convolution Layer
Batch of images: 2x3x32x32
6x3x5x5 filters; also a 6-dim bias vector
Batch of outputs: 2x6x28x28

Justin Johnson Lecture 7 - 24 January 31, 2022


Convolution Layer
Batch of images: N x Cin x H x W
Filters: Cout x Cin x Kh x Kw, plus a Cout-dim bias vector
Batch of outputs: N x Cout x H' x W'

Justin Johnson Lecture 7 - 25 January 31, 2022


Stacking Convolutions

Input: N x 3 x 32 x 32
Conv W1: 6x3x5x5, b1: 6 -> First hidden layer: N x 6 x 28 x 28
Conv W2: 10x6x3x3, b2: 10 -> Second hidden layer: N x 10 x 26 x 26
Conv W3: 12x10x3x3, b3: 12 -> ...

Justin Johnson Lecture 7 - 26 January 31, 2022
Stacking Convolutions
Q: What happens if we stack two convolution layers?
(Same stack of convolutions as above.)

Justin Johnson Lecture 7 - 27 January 31, 2022
Stacking Convolutions
Q: What happens if we stack two convolution layers?
A: We get another convolution! (Recall y = W2 W1 x is a linear classifier)

Justin Johnson Lecture 7 - 28 January 31, 2022
Stacking Convolutions
A: We get another convolution!
Solution: Add activation function between conv layers:

Input: N x 3 x 32 x 32
Conv W1: 6x3x5x5, b1: 6 -> ReLU -> First hidden layer: N x 6 x 28 x 28
Conv W2: 10x6x3x3, b2: 10 -> ReLU -> Second hidden layer: N x 10 x 26 x 26
Conv W3: 12x10x3x3, b3: 12 -> ReLU -> ...

Justin Johnson Lecture 7 - 29 January 31, 2022
What do convolutional filters learn?
(Conv -> ReLU stack as above; see the shape-check sketch below.)
Justin Johnson Lecture 7 - 30 January 31, 2022
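A minimal PyTorch sketch of the stack above (layer sizes follow the slide; the code itself is illustrative, not from the lecture), just to confirm the intermediate shapes:

import torch
import torch.nn as nn

# Shapes follow the slide: N x 3 x 32 x 32 -> N x 6 x 28 x 28 -> N x 10 x 26 x 26
x = torch.randn(2, 3, 32, 32)            # batch of N=2 images
conv1 = nn.Conv2d(3, 6, kernel_size=5)   # W1: 6x3x5x5, b1: 6
conv2 = nn.Conv2d(6, 10, kernel_size=3)  # W2: 10x6x3x3, b2: 10
relu = nn.ReLU()

h1 = relu(conv1(x))
h2 = relu(conv2(h1))
print(h1.shape)  # torch.Size([2, 6, 28, 28])
print(h2.shape)  # torch.Size([2, 10, 26, 26])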
What do convolutional filters learn?
Linear classifier: One template per class

Justin Johnson Lecture 7 - 31 January 31, 2022
What do convolutional filters learn?
MLP: Bank of whole-image templates

Justin Johnson Lecture 7 - 32 January 31, 2022
What do convolutional filters learn?
First-layer conv filters: local image templates
(Often learns oriented edges, opposing colors)
AlexNet: 64 filters, each 3x11x11

Justin Johnson Lecture 7 - 33 January 31, 2022


A closer look at spatial dimensions

(Consider the first conv layer from above: input N x 3 x 32 x 32, Conv with 6x3x5x5 filters and b1: 6, followed by ReLU, output N x 6 x 28 x 28.)
Justin Johnson Lecture 7 - 34 January 31, 2022
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3

Justin Johnson Lecture 7 - 35 January 31, 2022
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3
Output: 5x5

Justin Johnson Lecture 7 - 39 January 31, 2022
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3
Output: 5x5
In general:
Input: W
Filter: K
Output: W – K + 1
Problem: Feature maps "shrink" with each layer!
Justin Johnson Lecture 7 - 40 January 31, 2022
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3
Output: 5x5
In general:
Input: W
Filter: K
Output: W – K + 1
Problem: Feature maps "shrink" with each layer!
Solution: padding. Add zeros around the input.
[Figure: the 7x7 input surrounded by a border of zeros]
Justin Johnson Lecture 7 - 41 January 31, 2022
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3
Output: 5x5
In general:
Input: W
Filter: K
Padding: P
Output: W – K + 1 + 2P
Very common: Set P = (K – 1) / 2 to make output have same size as input!

Justin Johnson Lecture 7 - 42 January 31, 2022


Receptive Fields
For convolution with kernel size K, each element in the
output depends on a K x K receptive field in the input


Justin Johnson Lecture 7 - 43 January 31, 2022


Receptive Fields
Each successive convolution adds K – 1 to the receptive field size
With L layers the receptive field size is 1 + L * (K – 1)

Be careful: "receptive field in the input" vs "receptive field in the previous layer".
Hopefully clear from context!

Justin Johnson Lecture 7 - 44 January 31, 2022
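A quick sketch of the 1 + L * (K – 1) growth rule (plain Python, assuming stride-1 convolutions; the helper name is my own):

def receptive_field(num_layers, kernel_size):
    """Receptive field in the input after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

# Each 3x3 conv adds K - 1 = 2 to the receptive field:
for L in [1, 2, 3, 10]:
    print(L, receptive_field(L, 3))  # 1 -> 3, 2 -> 5, 3 -> 7, 10 -> 21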


Receptive Fields
Each successive convolution adds K – 1 to the receptive field size
With L layers the receptive field size is 1 + L * (K – 1)

Problem: For large images we need many layers for each output to "see" the whole image

Justin Johnson Lecture 7 - 45 January 31, 2022


Receptive Fields
Each successive convolution adds K – 1 to the receptive field size
With L layers the receptive field size is 1 + L * (K – 1)

Problem: For large images we need many layers for each output to "see" the whole image
Solution: Downsample inside the network

Justin Johnson Lecture 7 - 46 January 31, 2022


Strided Convolution
Input: 7x7
Filter: 3x3
Stride: 2

Justin Johnson Lecture 7 - 47 January 31, 2022




Strided Convolution
Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3

Justin Johnson Lecture 7 - 49 January 31, 2022


Strided Convolution
Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3
In general:
Input: W
Filter: K
Padding: P
Stride: S
Output: (W – K + 2P) / S + 1
Justin Johnson Lecture 7 - 50 January 31, 2022
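A small helper (a sketch; the function name is my own) that implements the output-size formula and reproduces the 7x7 / 3x3 / stride-2 example above:

def conv_output_size(W, K, P, S):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(W=7, K=3, P=0, S=2))  # 3, matching the slide
print(conv_output_size(W=7, K=3, P=1, S=1))  # 7, "same" padding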
Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: ?

Justin Johnson Lecture 7 - 51 January 31, 2022


Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size:


(32+2*2-5)/1+1 = 32 spatially, so
10 x 32 x 32

Justin Johnson Lecture 7 - 52 January 31, 2022


Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: 10 x 32 x 32


Number of learnable parameters: ?

Justin Johnson Lecture 7 - 53 January 31, 2022


Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: 10 x 32 x 32


Number of learnable parameters: 760
Parameters per filter: 3*5*5 + 1 (for bias) = 76
10 filters, so total is 10 * 76 = 760

Justin Johnson Lecture 7 - 54 January 31, 2022


Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: 10 x 32 x 32


Number of learnable parameters: 760
Number of multiply-add operations: ?

Justin Johnson Lecture 7 - 55 January 31, 2022


Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: 10 x 32 x 32


Number of learnable parameters: 760
Number of multiply-add operations: 768,000
10*32*32 = 10,240 outputs; each output is the inner product
of two 3x5x5 tensors (75 elems); total = 75*10240 = 768K
Justin Johnson Lecture 7 - 56 January 31, 2022
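A sketch that re-derives these numbers, and (assuming PyTorch is available) cross-checks the parameter count against nn.Conv2d:

import torch.nn as nn

Cin, H, W = 3, 32, 32
Cout, K, P, S = 10, 5, 2, 1

H_out = (H - K + 2 * P) // S + 1               # 32
W_out = (W - K + 2 * P) // S + 1               # 32
params = Cout * (Cin * K * K + 1)              # 10 * (75 + 1) = 760
macs = (Cout * H_out * W_out) * (Cin * K * K)  # 10,240 outputs * 75 = 768,000

print(H_out, W_out, params, macs)              # 32 32 760 768000

# Cross-check the learnable parameter count with PyTorch:
conv = nn.Conv2d(Cin, Cout, kernel_size=K, stride=S, padding=P)
print(sum(p.numel() for p in conv.parameters()))  # 760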
Example: 1x1 Convolution
Input: 64 x 56 x 56
1x1 CONV with 32 filters (each filter has size 1x1x64, and performs a 64-dimensional dot product)
Output: 32 x 56 x 56

Justin Johnson Lecture 7 - 57 January 31, 2022


Example: 1x1 Convolution
Input: 64 x 56 x 56
1x1 CONV with 32 filters (each filter has size 1x1x64, and performs a 64-dimensional dot product)
Output: 32 x 56 x 56
Stacking 1x1 conv layers gives an MLP operating on each input position
Lin et al, "Network in Network", ICLR 2014

Justin Johnson Lecture 7 - 58 January 31, 2022
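A minimal check (a sketch assuming PyTorch; the weight-copying is my own illustration) that a 1x1 convolution is the same as applying one fully-connected layer independently at every spatial position:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)            # 64 x 56 x 56 input
conv = nn.Conv2d(64, 32, kernel_size=1)   # 1x1 conv with 32 filters

fc = nn.Linear(64, 32)                    # same weights viewed as a linear layer
fc.weight.data = conv.weight.data.view(32, 64)
fc.bias.data = conv.bias.data

y_conv = conv(x)                                          # 1 x 32 x 56 x 56
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)      # apply per position
print(torch.allclose(y_conv, y_fc, atol=1e-5))            # True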


Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW
giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H’ x W’ where:
- H’ = (H – K + 2P) / S + 1
- W’ = (W – K + 2P) / S + 1
Justin Johnson Lecture 7 - 59 January 31, 2022
Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H' x W' where:
- H' = (H – K + 2P) / S + 1
- W' = (W – K + 2P) / S + 1
Common settings:
- KH = KW (Small square filters)
- P = (K – 1) / 2 ("Same" padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (Downsample by 2)
Justin Johnson Lecture 7 - 60 January 31, 2022
Other types of convolution
So far: 2D Convolution
Input: Cin x H x W
Weights: Cout x Cin x K x K

Justin Johnson Lecture 7 - 61 January 31, 2022
Other types of convolution
So far: 2D Convolution
Input: Cin x H x W
Weights: Cout x Cin x K x K

1D Convolution
Input: Cin x W
Weights: Cout x Cin x K
Justin Johnson Lecture 7 - 62 January 31, 2022
Other types of convolution
So far: 2D Convolution
Input: Cin x H x W
Weights: Cout x Cin x K x K

3D Convolution (a Cin-dim vector at each point in the volume)
Input: Cin x H x W x D
Weights: Cout x Cin x K x K x K
Justin Johnson Lecture 7 - 63 January 31, 2022
PyTorch Convolution Layer

Justin Johnson Lecture 7 - 64 January 31, 2022


PyTorch Convolution Layers

Justin Johnson Lecture 7 - 65 January 31, 2022
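The documentation screenshots are not reproduced here; as a stand-in, a hedged sketch of the corresponding torch.nn layers (1D, 2D, and 3D convolution; the specific channel counts are my own choices) and their input/output shapes:

import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
conv2d = nn.Conv2d(in_channels=3,  out_channels=6,  kernel_size=5)
conv3d = nn.Conv3d(in_channels=3,  out_channels=8,  kernel_size=3)

print(conv1d(torch.randn(2, 16, 100)).shape)        # (2, 32, 100): Cin x W input
print(conv2d(torch.randn(2, 3, 32, 32)).shape)      # (2, 6, 28, 28): Cin x H x W input
print(conv3d(torch.randn(2, 3, 16, 16, 16)).shape)  # (2, 8, 14, 14, 14): Cin x H x W x D input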


Components of a Convolutional Network
Fully-Connected Layers Activation Function

x h s

Convolution Layers Pooling Layers Normalization

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 66 January 31, 2022


Pooling Layers: Another way to downsample
64 x 224 x 224 -> 64 x 112 x 112

Hyperparameters:
Kernel Size
Stride
Pooling function

Justin Johnson Lecture 7 - 67 January 31, 2022


Max Pooling
Single depth slice of the input, x:
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pooling with 2x2 kernel size and stride 2 gives y:
6 8
3 4
Introduces invariance to small spatial shifts
No learnable parameters!
Justin Johnson Lecture 7 - 68 January 31, 2022
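A sketch (assuming PyTorch) that reproduces the 4x4 example above with a 2x2 max pool of stride 2:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).view(1, 1, 4, 4)  # N x C x H x W

y = F.max_pool2d(x, kernel_size=2, stride=2)
print(y.view(2, 2))
# tensor([[6., 8.],
#         [3., 4.]])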
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H' x W' where
- H' = (H – K) / S + 1
- W' = (W – K) / S + 1
Learnable parameters: None!
Common settings:
- max, K = 2, S = 2
- max, K = 3, S = 2 (AlexNet)
Justin Johnson Lecture 7 - 69 January 31, 2022
Components of a Convolutional Network
Fully-Connected Layers Activation Function

x h s

Convolution Layers Pooling Layers Normalization

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 70 January 31, 2022


Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC

Example: LeNet-5

Lecun et al, “Gradient-based learning applied to document recognition”, 1998

Justin Johnson Lecture 7 - 71 January 31, 2022


Example: LeNet-5
Layer Output Size Weight Size
Input 1 x 28 x 28
Conv (Cout=20, K=5, P=2, S=1) 20 x 28 x 28 20 x 1 x 5 x 5
ReLU 20 x 28 x 28
MaxPool(K=2, S=2) 20 x 14 x 14
Conv (Cout=50, K=5, P=2, S=1) 50 x 14 x 14 50 x 20 x 5 x 5
ReLU 50 x 14 x 14
MaxPool(K=2, S=2) 50 x 7 x 7
Flatten 2450
Linear (2450 -> 500) 500 2450 x 500
ReLU 500
Linear (500 -> 10) 10 500 x 10
Lecun et al, “Gradient-based learning applied to document recognition”, 1998

Justin Johnson Lecture 7 - 72 January 31, 2022
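A hedged PyTorch sketch of this LeNet-5 variant, following the table above (layer sizes from the slide; an illustrative re-implementation, not the original 1998 model):

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2, stride=1),   # 20 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2, stride=1),  # 50 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 50 x 7 x 7
    nn.Flatten(),                                           # 2450
    nn.Linear(2450, 500),
    nn.ReLU(),
    nn.Linear(500, 10),
)

scores = lenet5(torch.randn(1, 1, 28, 28))
print(scores.shape)  # torch.Size([1, 10])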






Example: LeNet-5
As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total "volume" is preserved!)
Some modern architectures break this trend -- stay tuned!
Lecun et al, "Gradient-based learning applied to document recognition", 1998
Justin Johnson Lecture 7 - 81 January 31, 2022
Problem: Deep Networks very hard to train!

Justin Johnson Lecture 7 - 82 January 31, 2022


Components of a Convolutional Network
Fully-Connected Layers Activation Function

x h s

Convolution Layers Pooling Layers Normalization

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 83 January 31, 2022


Batch Normalization
Idea: “Normalize” the outputs of a layer so they have zero mean
and unit variance

Why? Helps reduce “internal covariate shift”, improves optimization

We can normalize a batch of activations like this:


x̂ = (x − E[x]) / √(Var[x])

This is a differentiable function, so we can use it as an operator in our networks and backprop through it!
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015

Justin Johnson Lecture 7 - 84 January 31, 2022


Batch Normalization
Input: x ∈ ℝ^(N×D)

μⱼ = (1/N) Σᵢ xᵢⱼ   (Per-channel mean, shape is D)

σⱼ² = (1/N) Σᵢ (xᵢⱼ − μⱼ)²   (Per-channel std, shape is D)

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)   (Normalized x, shape is N x D)
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015

Justin Johnson Lecture 7 - 85 January 31, 2022


Batch Normalization
Input: x ∈ ℝ^(N×D)
Per-channel mean μⱼ, per-channel std σⱼ, and normalized x̂ᵢⱼ as on the previous slide.

Problem: What if zero-mean, unit variance is too hard of a constraint?
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015

Justin Johnson Lecture 7 - 86 January 31, 2022


Batch Normalization
Input: x ∈ ℝ^(N×D)
μⱼ = (1/N) Σᵢ xᵢⱼ   (Per-channel mean, shape is D)
σⱼ² = (1/N) Σᵢ (xᵢⱼ − μⱼ)²   (Per-channel std, shape is D)
x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)   (Normalized x, shape is N x D)

Learnable scale and shift parameters: γ, β ∈ ℝ^D
yᵢⱼ = γⱼ x̂ᵢⱼ + βⱼ   (Output, shape is N x D)

Learning γ = σ, β = μ will recover the identity function (in expectation)

Justin Johnson Lecture 7 - 87 January 31, 2022
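A minimal sketch of the training-time forward pass above (plain PyTorch tensor ops; the eps value and function name are my own choices):

import torch

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """x: (N, D); gamma, beta: (D,). Returns y of shape (N, D)."""
    mu = x.mean(dim=0)                        # per-channel mean, shape (D,)
    var = x.var(dim=0, unbiased=False)        # per-channel variance, shape (D,)
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized, shape (N, D)
    return gamma * x_hat + beta               # scale and shift

x = torch.randn(8, 4) * 3 + 2
y = batchnorm_train(x, gamma=torch.ones(4), beta=torch.zeros(4))
print(y.mean(dim=0), y.std(dim=0, unbiased=False))  # ~0 and ~1 per channel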


Batch Normalization
Problem: Estimates of μ and σ depend on the minibatch; can't do this at test-time!
(Per-channel mean, std, normalization, and learnable scale/shift as on the previous slide.)

Justin Johnson Lecture 7 - 88 January 31, 2022


Batch Normalization: Test-Time
Input: x ∈ ℝ^(N×D)
μⱼ = (Running) average of values seen during training   (Per-channel mean, shape is D)
σⱼ² = (Running) average of values seen during training   (Per-channel std, shape is D)
x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)   (Normalized x, shape is N x D)

Learnable scale and shift parameters: γ, β ∈ ℝ^D
yᵢⱼ = γⱼ x̂ᵢⱼ + βⱼ   (Output, shape is N x D)
Learning γ = σ, β = μ will recover the identity function (in expectation)

Justin Johnson Lecture 7 - 89 January 31, 2022


Batch Normalization: Test-Time
Input: x ∈ ℝ^(N×D)
At test time, μⱼ and σⱼ² are replaced by (running) averages of the values seen during training:

For each training iteration:
μⱼ = (1/N) Σᵢ xᵢⱼ
μⱼ^test = 0.99 μⱼ^test + 0.01 μⱼ
(Similar for σ)

x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)   (Normalized x, shape is N x D)
yᵢⱼ = γⱼ x̂ᵢⱼ + βⱼ   (Output, shape is N x D)

Justin Johnson Lecture 7 - 90 January 31, 2022
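A sketch of the running-average bookkeeping (momentum 0.99/0.01 as on the slide; function and variable names are mine, not a library API):

import torch

def batchnorm_update_running_stats(x, running_mu, running_var, momentum=0.99):
    """Training step: compute batch stats and update the test-time running averages in place."""
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    running_mu.mul_(momentum).add_((1 - momentum) * mu)
    running_var.mul_(momentum).add_((1 - momentum) * var)
    return mu, var

def batchnorm_test(x, running_mu, running_var, gamma, beta, eps=1e-5):
    """Test time: use running averages instead of batch statistics."""
    x_hat = (x - running_mu) / torch.sqrt(running_var + eps)
    return gamma * x_hat + beta

running_mu, running_var = torch.zeros(4), torch.ones(4)
for _ in range(100):
    batchnorm_update_running_stats(torch.randn(8, 4), running_mu, running_var)
y = batchnorm_test(torch.randn(8, 4), running_mu, running_var, torch.ones(4), torch.zeros(4))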




Batch Normalization: Test-Time
At test time, μⱼ and σⱼ² are (running) averages of the values seen during training, so:
x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)
yᵢⱼ = γⱼ x̂ᵢⱼ + βⱼ

During testing batchnorm becomes a linear operator!
Can be fused with the previous fully-connected or conv layer

Justin Johnson Lecture 7 - 92 January 31, 2022


Batch Normalization for ConvNets

Batch Normalization for fully-connected networks:
x: N × D
μ, σ: 1 × D
γ, β: 1 × D
y = γ (x − μ) / σ + β

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
x: N × C × H × W
μ, σ: 1 × C × 1 × 1
γ, β: 1 × C × 1 × 1
y = γ (x − μ) / σ + β
Justin Johnson Lecture 7 - 93 January 31, 2022
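A hedged sketch of the corresponding PyTorch modules and the parameter shapes given above (feature counts are my own example values):

import torch
import torch.nn as nn

bn_fc = nn.BatchNorm1d(num_features=64)    # for x: N x D
bn_conv = nn.BatchNorm2d(num_features=32)  # spatial batchnorm, for x: N x C x H x W

print(bn_fc(torch.randn(8, 64)).shape)            # (8, 64)
print(bn_conv(torch.randn(8, 32, 14, 14)).shape)  # (8, 32, 14, 14)
print(bn_conv.weight.shape, bn_conv.bias.shape)   # gamma, beta: both shape (32,)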
Batch Normalization
Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity:
... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...

x̂ = (x − E[x]) / √(Var[x])

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015

Justin Johnson Lecture 7 - 94 January 31, 2022


Batch Normalization
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!

[Figure: ImageNet accuracy vs. training iterations, comparing training with and without batch normalization]

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015

Justin Johnson Lecture 7 - 95 January 31, 2022


Batch Normalization
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Not well-understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015

Justin Johnson Lecture 7 - 96 January 31, 2022


Layer Normalization

Batch Normalization for fully-connected networks:
x: N × D
μ, σ: 1 × D
γ, β: 1 × D
y = γ (x − μ) / σ + β

Layer Normalization for fully-connected networks:
Same behavior at train and test! Used in RNNs, Transformers
x: N × D
μ, σ: N × 1
γ, β: 1 × D
y = γ (x − μ) / σ + β
Justin Johnson Lecture 7 - 97 January 31, 2022
Instance Normalization

Batch Normalization for convolutional networks:
x: N × C × H × W
μ, σ: 1 × C × 1 × 1
γ, β: 1 × C × 1 × 1
y = γ (x − μ) / σ + β

Instance Normalization for convolutional networks:
x: N × C × H × W
μ, σ: N × C × 1 × 1
γ, β: 1 × C × 1 × 1
y = γ (x − μ) / σ + β
Justin Johnson Lecture 7 - 98 January 31, 2022
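A sketch (assuming PyTorch; sizes are my own example values) contrasting these normalization layers by the statistics shapes described above:

import torch
import torch.nn as nn

x_fc = torch.randn(8, 64)            # N x D
x_conv = torch.randn(8, 32, 14, 14)  # N x C x H x W

ln = nn.LayerNorm(64)                       # per-sample stats over D; same at train and test
inorm = nn.InstanceNorm2d(32, affine=True)  # per-sample, per-channel stats over H, W

print(ln(x_fc).shape)       # (8, 64)
print(inorm(x_conv).shape)  # (8, 32, 14, 14)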
Comparison of Normalization Layers

Wu and He, “Group Normalization”, ECCV 2018

Justin Johnson Lecture 7 - 99 January 31, 2022


Group Normalization

Wu and He, “Group Normalization”, ECCV 2018

Justin Johnson Lecture 7 - 100 January 31, 2022


Components of a Convolutional Network
Convolution Layers Pooling Layers Fully-Connected Layers

x h s

Activation Function Normalization


x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 101 January 31, 2022


Components of a Convolutional Network
Convolution Layers Pooling Layers Fully-Connected Layers

x h s
Most computationally expensive!

Activation Function Normalization


x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 102 January 31, 2022


Summary: Components of a Convolutional Network
Convolution Layers Pooling Layers Fully-Connected Layers

x h s

Activation Function Normalization


x̂ᵢⱼ = (xᵢⱼ − μⱼ) / √(σⱼ² + ε)

Justin Johnson Lecture 7 - 103 January 31, 2022


Summary: Components of a Convolutional Network

Problem: What is the right way to combine all these components?

Justin Johnson Lecture 7 - 104 January 31, 2022


Next time:
CNN Architectures

Justin Johnson Lecture 7 - 105 January 31, 2022
