CONVOLUTIONAL NEURAL NETWORKS
OUTLINE
Additional Reading: http://cs231n.github.io/convolutional-networks/
Visual Recognition
Image Representation
Challenges
REPRESENTING AN IMAGE AS A MATRIX
COMPUTER VISION – MAKE SENSE OF NUMBERS
To a computer, an image is just a grid of intensity values:
255 255 240 255
255 248 232 255
252 247 238 239
255 255 255 255
VISUAL RECOGNITION
Design algorithms that are capable of:
Classifying images or videos
Detecting and localizing objects in images
Estimating semantic and geometric attributes
Classifying human activities and events
HOW MANY OBJECT CATEGORIES ARE THERE?
CHALLENGES – SHAPE AND APPEARANCE VARIATIONS
CHALLENGES – VIEWPOINT VARIATIONS
CHALLENGES – ILLUMINATION
CHALLENGES – BACKGROUND CLUTTER
CHALLENGES – SCALE
CHALLENGES – OCCLUSION
CHALLENGES DO NOT APPEAR IN ISOLATION!
Task: Detect phones in this image
Appearance variations
Viewpoint variations
Illumination variations
Background clutter
Scale changes
Occlusion
CONVOLUTIONAL NEURAL NETWORK

A CNN (or ConvNet) is a feed-forward neural network specially designed for images.

A two-dimensional array of pixels -> CNN -> "X" or "O"
FOR EXAMPLE
A clean "X" image -> CNN -> X
A clean "O" image -> CNN -> O

TRICKIER CASES
A shifted or distorted "X" -> CNN -> X
A shifted or distorted "O" -> CNN -> O
DECIDING IS HARD
Is a new, slightly different "X" image equal to the stored reference "X"?

WHAT COMPUTERS SEE

Reference "X" (9x9 grid of -1/+1 pixel values):
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1  1 -1 -1 -1 -1 -1  1 -1
-1 -1  1 -1 -1 -1  1 -1 -1
-1 -1 -1  1 -1  1 -1 -1 -1
-1 -1 -1 -1  1 -1 -1 -1 -1
-1 -1 -1  1 -1  1 -1 -1 -1
-1 -1  1 -1 -1 -1  1 -1 -1
-1  1 -1 -1 -1 -1 -1  1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

New "X" (shifted and distorted):
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1  1 -1 -1
-1  1 -1 -1 -1  1 -1 -1 -1
-1 -1  1  1 -1  1 -1 -1 -1
-1 -1 -1 -1  1 -1 -1 -1 -1
-1 -1 -1  1 -1  1  1 -1 -1
-1 -1 -1  1 -1 -1 -1  1 -1
-1 -1  1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
COMPUTERS ARE LITERAL
Compared pixel by pixel, the two grids are not equal, so a naive equality test rejects the match even though both images show an "X".
CONVNETS MATCH PIECES OF THE IMAGE
Matching whole images fails, but small pieces of the new "X" still match pieces of the reference exactly.
PIECES OF THE IMAGE ARE CALLED FEATURES
The three features used to match an "X":

Feature 1 (\ diagonal):
 1 -1 -1
-1  1 -1
-1 -1  1

Feature 2 (small central "x"):
 1 -1  1
-1  1 -1
 1 -1  1

Feature 3 (/ diagonal):
-1 -1  1
-1  1 -1
 1 -1 -1
HOW COMPUTERS MATCH FEATURES:
CONVOLUTION (LINEAR FILTERING)

Convolution is a neighborhood operation in which each output pixel is the weighted sum of neighboring input pixels. The matrix of weights is called the convolution kernel, also known as the filter.

Kernel (the \ diagonal feature):
 1 -1 -1
-1  1 -1
-1 -1  1

Input (the 9x9 "X"):
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1  1 -1 -1 -1 -1 -1  1 -1
-1 -1  1 -1 -1 -1  1 -1 -1
-1 -1 -1  1 -1  1 -1 -1 -1
-1 -1 -1 -1  1 -1 -1 -1 -1
-1 -1 -1  1 -1  1 -1 -1 -1
-1 -1  1 -1 -1 -1  1 -1 -1
-1  1 -1 -1 -1 -1 -1  1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
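To make the arithmetic concrete, here is a minimal NumPy sketch of this matching operation (plain sliding-window filtering with the slides' divide-by-9 normalization); the function name match_filter and the loop-based image construction are illustrative, not from the slides.

import numpy as np

def match_filter(image, kernel):
    # Slide the kernel over every "valid" position; at each position take the
    # elementwise product with the patch beneath it, sum it, and divide by the
    # number of kernel entries (9 for a 3x3 kernel), as in the slides.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+kH, j:j+kW] * kernel).sum() / kernel.size
    return out

# The 9x9 "X" image: -1 everywhere, +1 on both diagonals inside the border.
X = -np.ones((9, 9))
for r in range(1, 8):
    X[r, r] = X[r, 8 - r] = 1

diag = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

scores = match_filter(X, diag)
print(scores[0, 0])  # 0.777... -> the 0.77 in the slides' output map
print(scores[1, 1])  # 1.0: the kernel aligns perfectly with the diagonal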
CONVOLUTION

To score a feature at one location, multiply each kernel entry (scaled by 1/9) by the image pixel beneath it and add the results: a perfect alignment of the \ diagonal kernel with the image gives 9/9 = 1.00, while partial alignments give values such as 0.55 and 0.77. Sliding the kernel over every position yields a 7x7 map of match scores whose first row is:

0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
LINEAR FILTERS: EXAMPLES

1 1 1
1 1 1
1 1 1

Original -> Blur (with a mean filter)
Source: D. Lowe
PRACTICE WITH LINEAR FILTERS

0 0 0
0 1 0
0 0 0

Original -> Filtered (no change)
Source: D. Lowe
PRACTICE WITH LINEAR FILTERS

0 0 0
0 0 1
0 0 0

Original -> Shifted left by 1 pixel
Source: D. Lowe
[Filtering examples on a photograph (image from http://www.texasexplorer.com/austincap2.jpg), showing the magnitude of filter responses.]
Source: Kristen Grauman
Fully Connected Layer
Example: 200x200 image, 40K hidden units
-> ~2B parameters!!! (200*200 inputs x 40,000 units = 1.6 x 10^9 weights)
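A quick back-of-the-envelope check of that count, using the sizes from the slide:

inputs = 200 * 200          # pixels in a 200x200 image
hidden = 40_000             # hidden units
weights = inputs * hidden   # every pixel connects to every unit
print(f"{weights:,}")       # 1,600,000,000 -> ~2B parameters (biases ignored)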
Locally Connected Layer
STATIONARITY? Statistics are similar at different locations.
Ranzato
Convolutional Layer
Ranzato
CONVOLUTION
Border Handling: Zero-Padding
CONVOLUTION

Convolving the small central "x" kernel
 1 -1  1
-1  1 -1
 1 -1  1
with the "X" image gives, along the middle row of the 7x7 feature map:
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
Convolving a 32x32x3 image with one 5x5x3 filter produces a 28x28x1 activation map.
A closer look at spatial dimensions:

7x7 input (spatially), assume a 3x3 filter:
applied with stride 1 => 5x5 output
applied with stride 2 => 3x3 output
applied with stride 3? => doesn't fit! cannot apply a 3x3 filter on a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
In practice: common to zero pad the border.

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1-pixel border => 7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
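A small NumPy check of the (F-1)/2 rule; the array contents are arbitrary, only the shapes matter:

import numpy as np

x = np.arange(49.0).reshape(7, 7)   # 7x7 input
F, stride = 3, 1
pad = (F - 1) // 2                  # 1-pixel border for a 3x3 filter
xp = np.pad(x, pad)                 # zero-padded to 9x9
out = (xp.shape[0] - F) // stride + 1
print(out)                          # 7 -> spatial size preserved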
Remember back to...
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn't work well.

32x32x3 input -> CONV + ReLU (e.g. six 5x5x3 filters) -> 28x28x6 -> CONV + ReLU (e.g. ten 5x5x6 filters) -> 24x24x10 -> ...
CONVOLUTION LAYER
N -> size of the input (spatially)
F -> size of the filter
S -> stride
P -> padding

Output size: (N - F + 2P)/S + 1
e.g. N = 7, F = 3, S = 2, P = 1: (7 - 3 + 2)/2 + 1 = 4
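The formula is easy to wrap in a helper; this sketch (the function name is mine, not the slides') also flags the stride-3 case above:

def conv_output_size(N, F, S=1, P=0):
    # Output size of a conv layer: (N - F + 2P)/S + 1.
    span = N - F + 2 * P
    if span % S != 0:
        raise ValueError(f"filter F={F} with stride S={S} does not fit input N={N}")
    return span // S + 1

print(conv_output_size(7, 3, S=1))       # 5
print(conv_output_size(7, 3, S=2))       # 3
print(conv_output_size(7, 3, S=2, P=1))  # 4 (the example above)
# conv_output_size(7, 3, S=3) raises: the filter doesn't fit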
(btw, 1x1 convolution layers make perfect sense)

1x1 CONV with 32 filters: each filter has size 1x1x64 and performs a 64-dimensional dot product, so a 56x56x64 input volume becomes 56x56x32.
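Because each filter sees a single pixel across all channels, a 1x1 convolution is just a matrix multiply at every spatial position. A NumPy sketch with random stand-in data:

import numpy as np

x = np.random.randn(56, 56, 64)   # input volume, 64 channels
W = np.random.randn(64, 32)       # 32 filters, each of size 1x1x64
y = x @ W                         # a 64-dim dot product at every pixel
print(y.shape)                    # (56, 56, 32)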
The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter.
1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product).
Pooling Layer

Let us assume the filter is an "eye" detector. How can we make the detection robust to the exact location of the eye? By "pooling" (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features.
Ranzato
Pooling layer:
- makes the representations smaller and more manageable
- operates over each activation map independently
MAX POOLING

A 2x2 window steps across each 7x7 feature map in strides of two and keeps only the largest value in each window (the final window in each row and column may hang off the edge). Each 7x7 map of match scores shrinks to a 4x4 map that preserves the strongest responses, with rows such as 0.55 0.33 1.00 0.11 and 0.77 0.33 0.55 0.33.
POOLING LAYER
Summary:
Accepts a volume of size W1 x H1 x D1
Requires two hyper-parameters:
  Kernel Size F
  Stride S
Produces a volume of size W2 x H2 x D2 where:
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1
  D2 = D1
Introduces zero parameters since it computes a fixed function of the input
Note: zero-padding is not common for pooling layers
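A minimal sketch of that fixed function (strict windows, so a 7x7 map with F=2, S=2 gives 3x3 here; the slides' toy example instead lets the last window hang off the edge to get 4x4):

import numpy as np

def max_pool(fmap, F=2, S=2):
    # Keep only the largest activation in each FxF window, stepping by S.
    H, W = fmap.shape
    out = np.zeros(((H - F) // S + 1, (W - F) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*S:i*S+F, j*S:j*S+F].max()
    return out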
RECTIFIED LINEAR UNITS (RELU)

ReLU replaces every negative value in a feature map with zero, f(x) = max(0, x), e.g.:
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33  ->  0.77 0 0.11 0.33 0.55 0 0.33
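In code, ReLU is a one-liner; applied to the example row above:

import numpy as np

relu = lambda fmap: np.maximum(fmap, 0.0)   # f(x) = max(0, x), elementwise
row = np.array([0.77, -0.11, 0.11, 0.33, 0.55, -0.11, 0.33])
print(relu(row))   # [0.77 0.   0.11 0.33 0.55 0.   0.33]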
RELU LAYER

Applied to the whole stack, ReLU zeroes every negative entry in each of the three 7x7 feature maps while leaving the positive match scores unchanged; the maps keep their size.
LAYERS GET STACKED

The output of one layer becomes the input of the next. The 9x9 image passes through convolution, ReLU, and max pooling to become a stack of 4x4 feature maps; a second round of the same layers shrinks these to 2x2 maps of high-level features (values such as 1.00 and 0.55).
FULLY CONNECTED LAYER

The final 2x2 feature maps are flattened into a single vector of feature values (1.00, 0.55, 0.55, 1.00, 1.00, 0.55, ...). Every value gets a weighted vote for each category: some positions are strong evidence for "X", others for "O".
FULLY CONNECTED LAYER

A new input produces the feature vector 0.9, 0.65, 0.45, 0.87, 0.96, 0.73, 0.23, 0.63, 0.44, 0.89, 0.94, 0.53; the weighted votes from these values decide between "X" and "O".
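A sketch of the voting step; the weight matrix here is a random stand-in (in a trained network it is learned), so only the shapes mirror the slides:

import numpy as np

def fully_connected(features, W, b):
    # Every feature value contributes a weighted vote to every class score.
    return W @ features + b

feats = np.array([0.9, 0.65, 0.45, 0.87, 0.96, 0.73,
                  0.23, 0.63, 0.44, 0.89, 0.94, 0.53])
rng = np.random.default_rng(0)
W = rng.standard_normal((2, feats.size))   # one row of weights per class (X, O)
b = np.zeros(2)
print(fully_connected(feats, W, b))        # two class scores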
PUTTING IT ALL TOGETHER

A 9x9 input image of an "X" flows through the complete stack (repeated convolution, ReLU, and pooling layers, followed by fully connected layers) and comes out as a pair of votes: one for "X" and one for "O".
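Composing the sketches above gives a toy forward pass for the whole pipeline; it reuses the names defined earlier (match_filter, relu, max_pool, fully_connected, X, diag, rng), and W2/b2 are hypothetical fully-connected parameters:

# CONV -> RELU -> POOL -> flatten -> FC
fmap = match_filter(X, diag)               # 9x9 image -> 7x7 match scores
fmap = relu(fmap)                          # negative scores -> 0
fmap = max_pool(fmap)                      # 7x7 -> 3x3 (strict windows)
feats2 = fmap.ravel()                      # flatten to a 9-vector
W2 = rng.standard_normal((2, feats2.size)) # hypothetical learned weights
b2 = np.zeros(2)
votes = fully_connected(feats2, W2, b2)    # one score for "X", one for "O"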
IMPLEMENTATION – CIFAR10
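The slides' CIFAR-10 code did not survive extraction; as a stand-in, here is a minimal Keras model in the same spirit (layer sizes are my assumptions, not the slides' exact configuration):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                         # CIFAR-10 images
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),                  # 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])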
FAMOUS CNN ARCHITECTURES
IMAGENET
• The ImageNet project is a large visual database designed for use in visual object recognition software research. As of 2016, over ten million URLs of images had been hand-annotated by ImageNet to indicate what objects are pictured.
VARIOUS CNN ARCHITECTURES – PERFORMANCE
Case Study: LeNet-5
[LeCun et al., 1998]
Conv filters were 5x5, applied at stride 1; subsampling (pooling) layers were 2x2, applied at stride 2; overall architecture: [CONV-POOL-CONV-POOL-FC-FC].
Case Study: AlexNet
[Krizhevsky et al. 2012]
Eight learned layers (five convolutional, three fully connected); first major use of ReLU; ILSVRC 2012 winner (16.4% top-5 error).
VGGNET
[Simonyan & Zisserman, 2014]
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most of the memory is in the early CONV layers; most of the parameters are in the late FC layers.
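A quick script to reproduce the parameter total above (weights only, biases ignored):

conv_shapes = [  # (kh, kw, in_ch, out_ch) for the 13 conv layers above
    (3, 3, 3, 64), (3, 3, 64, 64),
    (3, 3, 64, 128), (3, 3, 128, 128),
    (3, 3, 128, 256), (3, 3, 256, 256), (3, 3, 256, 256),
    (3, 3, 256, 512), (3, 3, 512, 512), (3, 3, 512, 512),
    (3, 3, 512, 512), (3, 3, 512, 512), (3, 3, 512, 512),
]
conv = sum(kh * kw * cin * cout for kh, kw, cin, cout in conv_shapes)
fc = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
print(f"{(conv + fc) / 1e6:.0f}M")   # ~138M, matching the total above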
GOOGLENET
[Szegedy et al., 2014]
Inception Module
ILSVRC 2014 winner (6.7% top-5 error)
(ILSVRC: ImageNet Large Scale Visual Recognition Challenge)
RESNET
[He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error)
SUMMARY
Visual Recognition
Challenges