12 Convolutional Neural Networks
Fundamentals of Image Processing
(Slides adapted from Fei-Fei Li, Andrej Karpathy & Justin Johnson.)
[Figure: a 32x32x3 input image, i.e. 32 (height) x 32 (width) x 3 (depth)]
Convolution Layer
[Figure: a 5x5x3 filter over a 32x32x3 image]
Convolution Layer
Filters always extend the full depth of the input volume.
[Figure: a 5x5x3 filter over a 32x32x3 image]
Convolution Layer
Convolving the filter with the image at one position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image.
Convolution Layer
[Figure: sliding the 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28x1 activation map]
Convolution Layer
Consider a second (green) filter: convolving it over the same 32x32x3 image yields a second 28x28x1 activation map.
Convolution Layer
For example, if we had six 5x5 filters, we'd get 6 separate activation maps. We stack these up to get a "new image" of size 28x28x6.
Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.
[Figure: 32x32x3 input -> CONV + ReLU (e.g. six 5x5x3 filters) -> 28x28x6 output]
Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.
[Figure: 32x32x3 input -> CONV + ReLU (six 5x5x3 filters) -> 28x28x6 -> CONV + ReLU (ten 5x5x6 filters) -> 24x24x10 -> CONV + ReLU -> ...]
Preview [From recent Yann LeCun slides]
One filter => one activation map.
[Figure: example 5x5 filters (32 total) and their activation maps]
We call the layer convolutional because it is related to the convolution of two signals: an elementwise multiplication and sum of a filter and the signal (image).
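To make "elementwise multiplication and sum" concrete, here is a minimal numpy sketch at one filter position (sizes follow the running 32x32x3 example; the bias value is an arbitrary assumption):

```python
import numpy as np

# One output value of a conv layer: elementwise multiply-and-sum
# of a filter with one chunk of the image.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # 32x32x3 input volume
filt = rng.standard_normal((5, 5, 3))      # 5x5x3 filter (full depth)
bias = 0.1                                 # assumed bias, for illustration

chunk = image[0:5, 0:5, :]                 # one 5x5x3 chunk of the image
value = np.sum(chunk * filt) + bias        # 75-dimensional dot product + bias
print(value)                               # 1 number
```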
Preview
A closer look at spatial dimensions:
[Figure: a 5x5x3 filter over a 32x32x3 image produces a 28x28x1 activation map]
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 1
=> 5x5 output
7x7 input (spatially), assume a 3x3 filter applied with stride 2
=> 3x3 output!
7x7 input (spatially), a 3x3 filter applied with stride 3?
Doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size, for input width N and filter size F:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer: doesn't fit!)
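As a quick sketch, the same formula in Python (the helper name is ours):

```python
def conv_output_size(n, f, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    if (n - f) % stride != 0:
        raise ValueError(f"{f}x{f} filter at stride {stride} doesn't fit a {n}x{n} input")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) -> ValueError: doesn't fit
```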
In practice: common to zero-pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, padded with a 1-pixel border of zeros => what is the output?
(recall: (N - F) / stride + 1)
7x7 output! With padding P the formula generalizes to (N + 2P - F) / stride + 1, giving (7 + 2*1 - 3)/1 + 1 = 7 here.
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
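Extending the earlier sketch to include padding (again, a helper of our own naming):

```python
def conv_output_size_padded(n, f, stride, pad):
    """Spatial output size with zero padding: (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1

# Padding with (F - 1) / 2 at stride 1 preserves spatial size:
for f in (3, 5, 7):
    print(f, conv_output_size_padded(7, f, stride=1, pad=(f - 1) // 2))  # all print 7
```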
Remember back: a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 -> ...). Shrinking too fast is not good; it doesn't work well. Zero padding lets us preserve spatial size instead.
Recap: Convolution Layer
- Accepts a volume of size W1 x H1 x D1
- Requires four hyperparameters: number of filters K, their spatial extent F, the stride S, the amount of zero padding P
- Produces a volume of size W2 x H2 x D2, where W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K
- With parameter sharing, it introduces F*F*D1 weights per filter, for a total of (F*F*D1)*K weights and K biases
Example time:
Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2.
Output volume size? (32 + 2*2 - 5)/1 + 1 = 32 spatially, so: 32x32x10.
Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (including the bias), so 76 * 10 = 760 in total.
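A small sketch that checks both answers (the function name is ours):

```python
def conv_layer_stats(in_h, in_w, in_depth, num_filters, f, stride, pad):
    """Output shape and parameter count for a conv layer."""
    out_h = (in_h + 2 * pad - f) // stride + 1
    out_w = (in_w + 2 * pad - f) // stride + 1
    params = num_filters * (f * f * in_depth + 1)  # +1 bias per filter
    return (out_h, out_w, num_filters), params

shape, params = conv_layer_stats(32, 32, 3, num_filters=10, f=5, stride=1, pad=2)
print(shape, params)  # (32, 32, 10) 760
```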
Common settings:
- K (number of filters) = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = whatever fits
- F = 1, S = 1, P = 0
(btw, 1x1 convolution layers make perfect sense)
[Figure: a 56x56x64 volume through a 1x1 CONV with 32 filters -> 56x56x32. Each filter has size 1x1x64 and performs a 64-dimensional dot product.]
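A numpy sketch of why this makes sense: a 1x1 convolution is just a per-pixel dot product across the depth dimension.

```python
import numpy as np

# 1x1 convolution = a 64-dimensional dot product at every spatial position.
rng = np.random.default_rng(0)
x = rng.standard_normal((56, 56, 64))   # input volume
w = rng.standard_normal((64, 32))       # 32 filters, each of size 1x1x64

y = x @ w                               # dot product over depth at each pixel
print(y.shape)                          # (56, 56, 32)
```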
The brain/neuron view of a CONV layer
[Figure: a 5x5x3 filter over a 32x32x3 image]
1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product)
In this view, a "5x5 filter" is a "5x5 receptive field for each neuron": each neuron in the 28x28 activation map looks only at a local 5x5 region of the input.
Activation Functions
- Sigmoid: σ(x) = 1 / (1 + e^(-x))
- tanh(x)
- ReLU: max(0, x)
Sigmoid: 3 problems:
1. Saturated neurons "kill" the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
tanh(x): squashes numbers to the range [-1, 1]; zero-centered (nice); but still kills gradients when saturated.
ReLU (Rectified Linear Unit):
- Computes f(x) = max(0, x)
- Does not saturate (in the + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
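A minimal sketch of the three activations discussed above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # saturates; not zero-centered; uses exp()

def tanh(x):
    return np.tanh(x)                  # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)          # no saturation in the + region; cheap

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```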
Two more layers to go: POOL/FC.
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
Max Pooling
Single depth slice; max pool with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8   ->   6 8
3 2 1 0        3 4
1 2 3 4
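The same example as a minimal numpy sketch (2x2 filters, stride 2, one depth slice):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# Reshape each 2x2 block onto its own axes, then take the max per block.
h, w = x.shape
pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8]
               #  [3 4]]
```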
Common settings for pooling:
F = 2, S = 2
F = 3, S = 2
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary neural networks
[ConvNetJS demo: training on CIFAR-10]
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Case Studies
Case Study: LeNet-5 [LeCun et al., 1998]
Case Study: AlexNet [Krizhevsky et al. 2012]
Pooling layers have no weights to learn, so for them: parameters: 0!
AlexNet was the best model in the ILSVRC 2012 competition.
Case Study: VGGNet [Simonyan and Zisserman, 2014]

(not counting biases)
INPUT:     [224x224x3]   memory: 224*224*3   = 150K  params: 0
CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M  params: (3*3*3)*64    = 1,728
CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M  params: (3*3*64)*64   = 36,864
POOL2:     [112x112x64]  memory: 112*112*64  = 800K  params: 0
CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M  params: (3*3*64)*128  = 73,728
CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M  params: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]   memory: 56*56*128   = 400K  params: 0
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]   memory: 28*28*256   = 200K  params: 0
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]   memory: 14*14*512   = 100K  params: 0
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]     memory: 7*7*512     = 25K   params: 0
FC:        [1x1x4096]    memory: 4096               params: 7*7*512*4096  = 102,760,448
FC:        [1x1x4096]    memory: 4096               params: 4096*4096     = 16,777,216
FC:        [1x1x1000]    memory: 1000               params: 4096*1000     = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (forward pass only; roughly double for the backward pass)
TOTAL params: 138M parameters

Note: most of the memory is in the early CONV layers; most of the parameters are in the late FC layers.
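A quick check of the parameter total above, with the layer list reproduced in code:

```python
# Tally the VGGNet parameter counts from the table (biases omitted for
# conv layers, as in the table; FC rows as listed).
layers = [
    ("CONV3-64", 3*3*3*64), ("CONV3-64", 3*3*64*64),
    ("CONV3-128", 3*3*64*128), ("CONV3-128", 3*3*128*128),
    ("CONV3-256", 3*3*128*256), ("CONV3-256", 3*3*256*256), ("CONV3-256", 3*3*256*256),
    ("CONV3-512", 3*3*256*512), ("CONV3-512", 3*3*512*512), ("CONV3-512", 3*3*512*512),
    ("CONV3-512", 3*3*512*512), ("CONV3-512", 3*3*512*512), ("CONV3-512", 3*3*512*512),
    ("FC-4096", 7*7*512*4096), ("FC-4096", 4096*4096), ("FC-1000", 4096*1000),
]
total = sum(p for _, p in layers)
print(f"{total:,}")  # 138,344,128 -> ~138M parameters
```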
Case Study: GoogLeNet [Szegedy et al., 2014]
[Figure: the Inception module]
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error)
2-3 weeks of training on an 8-GPU machine.
At runtime: faster than a VGGNet! (even though it has 8x more layers)
[Figure: the 224x224x3 input is reduced to a spatial dimension of only 56x56 within the first layers]
[Figure: an AlexNet-style pipeline (slide by Yisong Yue)]
RGB input image (224x224x3)
-> 7x7x3 convolution (96 filters), 3x3 max pooling, downsample 4x -> 55x55x96
-> 5x5x96 convolution (256 filters), 3x3 max pooling, downsample 4x -> 13x13x256
-> 3x3x256 convolution (384 filters) -> 13x13x384
-> 3x3x384 convolution (384 filters) -> 13x13x384
-> 3x3x384 convolution (256 filters), 3x3 max pooling, downsample 2x -> 6x6x256
-> standard layer (4096 units) -> standard layer (4096 units) -> logistic regression (~1000 classes)
http://www.image-net.org/

Visualizing CNN features (Layers 1 through 5)
[Figures: per-layer feature visualizations from Zeiler & Fergus]
http://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
Tips and Tricks
• Shuffle the training samples
Input representation
“Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image”
Data Augmentation
• The neural net has 60M real-valued parameters and 650,000 neurons
• It overfits a lot. Therefore, they train on 224x224 patches extracted randomly from 256x256 images, and also their horizontal reflections.
“This increases the size of our training set…” (slide by Alex Krizhevsky)
Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
A minimal sketch of the basic crop/flip version follows below.
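A numpy sketch of the AlexNet-style scheme described above: random 224x224 crops of a 256x256 image plus random horizontal reflections (the function is ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=224):
    """Random crop plus random horizontal reflection."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]      # horizontal reflection
    return patch

image = rng.standard_normal((256, 256, 3))
print(augment(image).shape)            # (224, 224, 3)
```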
Transfer Learning with ConvNets
1. Train on ImageNet.
2. Small dataset: use the ConvNet as a feature extractor. Freeze the lower layers and train only the top classifier layer.
3. Medium dataset: finetuning. More data = retrain more of the network (or all of it), freezing only the earliest layers.
Tip: when finetuning, use only ~1/10th of the original learning rate for the top layer, and ~1/100th for intermediate layers. (A sketch follows below.)
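One possible realization of this recipe in PyTorch; the model choice, the 10-class head, and the base learning rate are illustrative assumptions, not part of the slides:

```python
import torch
import torchvision

# 1. Start from an ImageNet-pretrained model.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# 2. Small dataset: freeze everything, train only a new top layer.
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new classifier head

# 3. Medium dataset: unfreeze and finetune with scaled-down learning rates.
for param in model.parameters():
    param.requires_grad = True

base_lr = 0.01  # assumed original learning rate
optimizer = torch.optim.SGD([
    {"params": model.fc.parameters(), "lr": base_lr / 10},   # top layer: ~1/10th
    {"params": (p for name, p in model.named_parameters()
                if not name.startswith("fc")),
     "lr": base_lr / 100},                                   # intermediate: ~1/100th
])
```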
Today ConvNets are everywhere:
- Classification and retrieval [Krizhevsky 2012]
- Detection [Faster R-CNN: Ren, He, Girshick, Sun 2015] and segmentation [Farabet et al., 2012]
- Self-driving cars (e.g. running on the NVIDIA Tegra X1)
- Face recognition [Taigman et al. 2014]
- Playing Atari games [Mnih 2013]
- Image captioning
- DeepDream art: reddit.com/r/deepdream