Deep Learning CNN
2
Presentations
• You can use any presentation tool (e.g., PowerPoint, Keynote,
LaTeX) provided that the tool can export the slides to PDF.
• Each presentation should be clear, well organized, and very
technical, and roughly 30 minutes long.
• You are allowed to reuse material that already exists on the web
as long as you clearly cite the source of the media that you
have used in your presentation.
• Extra credit will be awarded to those students who also
conduct some experiments demonstrating how the method
works in practice.
3
Presentations
Deadline:
• You should meet with me 3-4 days before the
presentation date to discuss your slides
• The presentation should be submitted by the night before
the class
• The presentation grading rubric is on the webpage
4
Suggested Presentation Outline
• High-level overview of the paper (main contributions)
• Problem statement and motivation (clear definition of the
problem, why it is interesting and important)
• Key technical ideas (overview of the approach)
• Experimental set-up (datasets, evaluation metrics, applications)
• Strengths and weaknesses (discussion of the results obtained)
• Connections with other work (how it relates to other
approaches, its similarities and differences)
• Future direction (open research questions)
5
Homework
Due March 15, 2016
(by 12:30 pm)
Hubel & Wiesel: Receptive Fields, Binocular Interaction and Functional
Architecture in the Cat's Visual Cortex (1962)
1968...
https://youtu.be/8VdFf3egwfg?t=1m10s 7
8
A bit of history
Topographical mapping in the cortex: nearby cells in cortex represented
nearby regions in the visual field
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
9
10
Hierarchical organization
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
The “Halle Berry” Neuron
Invariant visual representation by single neurons
in the human brain [Quiroga et al., Nature, 2005]
Neocognitron
[Fukushima 1980]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
12
A bit of history
Gradient-based learning
applied to document
recognition
[LeCun, Bottou, Bengio, Haffner
1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
LeNet-5
13
A bit of history
ImageNet Classification with Deep
Convolutional Neural Networks
[Krizhevsky, Sutskever, Hinton, 2012]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
“AlexNet”
14
Convolutional Neural Networks
(First without the brain stuff)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
15
Convolution Layer
32x32x3 image: 32 wide, 32 high, 3 deep
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
16
Convolution Layer
32x32x3 image
5x5x3 filter
17
Convolution Layer
Filters always extend the full
depth of the input volume
32x32x3 image
5x5x3 filter
18
Convolution Layer
32x32x3 image
5x5x3 filter
1 number: the result of taking a dot product between the filter and a 5x5x3
chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
19
Convolution Layer
activation map
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations => 28x28x1 activation map
20
Convolution Layer
consider a second, green filter
32x32x3 image, a second 5x5x3 filter => a second 28x28x1 activation map
21
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps:
stacking them up gives a new volume of size 28x28x6
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
22
Preview: ConvNet is a sequence of Convolutional Layers, interspersed with
activation functions
32x32x3 input => CONV, ReLU (e.g. 6 5x5x3 filters) => 28x28x6
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
23
Preview: ConvNet is a sequence of Convolutional Layers, interspersed with
activation functions
32x32x3 input => CONV, ReLU (e.g. 6 5x5x3 filters) => 28x28x6
=> CONV, ReLU (e.g. 10 5x5x6 filters) => 24x24x10 => ….
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
24
25
[From recent Yann LeCun slides]
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
26
[From recent Yann LeCun slides]
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
one filter => one activation map
example 5x5 filters (32 total)
elementwise multiplication and sum of a filter and the signal (image)
27
28
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:
activation map
32x32x3 image, 5x5x3 filter => 28x28x1 activation map
29
A closer look at spatial dimensions:
7x7 input (spatially)
assume 3x3 filter
=> 5x5 output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
34
A closer look at spatial dimensions:
7x7 input (spatially)
assume 3x3 filter applied with stride 2
=> 3x3 output!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
37
A closer look at spatial dimensions:
7x7 input (spatially)
assume 3x3 filter applied with stride 3?
doesn't fit! cannot apply a 3x3 filter to a 7x7 input with stride 3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
39
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
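A quick way to sanity-check the formula is to compute it directly. A minimal Python sketch (the function name conv_output_size is mine, not from the slides):

def conv_output_size(n, f, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    out = (n - f) / stride + 1
    if out != int(out):
        raise ValueError("filter size %d with stride %d does not fit an input of size %d" % (f, stride, n))
    return int(out)

# Reproduces the slide's example (N = 7, F = 3):
print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) raises ValueError: stride 3 doesn't fit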
40
In practice: Common to zero pad the border
e.g. input 7x7, 3x3 filter applied with stride 1,
pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
42
In practice: Common to zero pad the border
e.g. input 7x7, 3x3 filter applied with stride 1,
pad with 1 pixel border => what is the output?
7x7 output!
in general, common to see CONV layers with stride 1, filters of size FxF, and
zero-padding with (F-1)/2, which preserves the spatial size
(e.g. F = 3 => pad with 1, F = 5 => pad with 2, F = 7 => pad with 3)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
43
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes
spatially! (32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn't
work well.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
44
Examples time:
45
Examples time:
32x32x10
46
Examples time:
47
Examples time:
48
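The worked example slides above are not legible in this extract; below is a small computation in their spirit, assuming (my assumption, consistent with the 32x32x10 volume shown above) an input of 32x32x3 and 10 5x5 filters applied with stride 1 and pad 2:

# Assumed setup: 32x32x3 input, 10 filters of size 5x5x3, stride 1, pad 2.
N, depth_in = 32, 3
F, K, S, P = 5, 10, 1, 2

out = (N + 2 * P - F) // S + 1          # output size with zero padding
print(out, out, K)                      # 32 32 10 -> output volume is 32x32x10

params_per_filter = F * F * depth_in + 1   # 5*5*3 weights + 1 bias = 76
print(params_per_filter * K)               # 760 parameters in this layer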
49
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Common settings:
50
(btw, 1x1 convolution layers make perfect sense)
1x1 CONV with 32 filters
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
56x56x64 input => 56x56x32 output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
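A 1x1 convolution is just a per-position dot product across depth, so the 56x56x64 => 56x56x32 case above can be written as a matrix multiply. A minimal numpy sketch (shapes from the slide, variable names mine):

import numpy as np

x = np.random.randn(56, 56, 64)    # input volume: 56x56x64
w = np.random.randn(32, 64)        # 32 filters, each of size 1x1x64
b = np.zeros(32)

# Each output position is a 64-dimensional dot product per filter.
y = x @ w.T + b                    # shape (56, 56, 32)
print(y.shape)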
51
52
Example: CONV layer in Torch
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
53
Example: CONV layer in Caffe
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
54
Example: CONV layer in Lasagne
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
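The framework code on these three slides is not reproduced in this extract; as a rough stand-in, a minimal PyTorch sketch of the same kind of layer definition (my choice of framework; the parameter values follow the running 32x32x3 / 5x5x3 example rather than any specific slide):

import torch
import torch.nn as nn

# One conv layer with 6 filters of size 5x5x3, stride 1, no padding, plus a ReLU.
conv = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=0),
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)   # a batch of one 32x32x3 image (NCHW layout)
y = conv(x)
print(y.shape)                  # torch.Size([1, 6, 28, 28])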
The brain/neuron view of CONV Layer
32x32x3 image
5x5x3 filter
1 number: the result of taking a dot product between the filter and this part
of the image (i.e. 5*5*3 = 75-dimensional dot product)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
55
56
The brain/neuron view of CONV Layer
“5x5 filter” -> “5x5 receptive field for each neuron”
57
The brain/neuron view of CONV Layer
58
59
two more layers to go: POOL/FC
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
60
Max Pooling
example: a single 4x4 depth slice, max pooled with 2x2 filters and stride 2
(each output value is the max over a 2x2 region)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
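A minimal numpy sketch of that operation, using a 4x4 depth slice consistent with the grid fragments above (the exact slide example is partly illegible, so treat the input values as illustrative):

import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])     # a single 4x4 depth slice

# 2x2 max pooling with stride 2: take the max over each 2x2 block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]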
61
62
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
63
Common settings:
F = 2, S = 2
F = 3, S = 2
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary
Neural Networks
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
64
[ConvNetJS demo: training on CIFAR-10]
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
65
Case Study: LeNet-5 [LeCun et al., 1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
66
Case Study: AlexNet
[Krizhevsky et al. 2012]
67
Case Study: AlexNet
[Krizhevsky et al. 2012]
68
Case Study: AlexNet
[Krizhevsky et al. 2012]
69
Case Study: AlexNet
[Krizhevsky et al. 2012]
70
Case Study: AlexNet
[Krizhevsky et al. 2012]
71
Case Study: AlexNet
[Krizhevsky et al. 2012]
Parameters: 0!
72
Case Study: AlexNet
[Krizhevsky et al. 2012]
73
Case Study: AlexNet
[Krizhevsky et al. 2012]
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Training details:
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
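The slides for the earlier AlexNet layers are not legible in this extract, so the sketch below fills them in from the commonly cited configuration: CONV1/CONV2 hyperparameters are my assumption, while CONV3 onward follows the listing above (response normalization and dropout omitted). A PyTorch sketch of the single-stream network:

import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),      # assumed CONV1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),    # assumed CONV2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # CONV3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # CONV4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # CONV5
    nn.MaxPool2d(kernel_size=3, stride=2),                      # MAX POOL3 -> 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(),                    # FC6
    nn.Linear(4096, 4096), nn.ReLU(),                           # FC7
    nn.Linear(4096, 1000),                                      # FC8 (class scores)
)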
75
Case Study: ZFNet [Zeiler and Fergus, 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
best model
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
77
Case Study: VGGNet [Simonyan and Zisserman, 2014]
(not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note: most memory is in the early CONV layers; most params are in the late FC layers.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
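The per-layer numbers above are easy to reproduce; a small Python helper (names mine) that recomputes activation memory and weight counts for a 3x3, stride 1, pad 1 conv layer, ignoring biases as in the table:

def conv3_stats(in_size, in_depth, out_depth):
    memory = in_size * in_size * out_depth      # output keeps the input's spatial size
    params = 3 * 3 * in_depth * out_depth       # not counting biases
    return memory, params

print(conv3_stats(224, 3, 64))     # (3211264, 1728)    -> the CONV3-64 row
print(conv3_stats(28, 256, 512))   # (401408, 1179648)  -> the first 28x28 CONV3-512 row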
80
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Inception module
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
81
Case Study: GoogLeNet
Fun features:
Compared to AlexNet:
- 12x fewer params
- 2x more compute
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
82
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner
(3.6% top 5 error)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
at runtime: faster
than a VGGNet!
(even though it has
8x more layers)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
224x224x3 input => spatial dimension only 56x56!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
87
88
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
89
90
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
91
92
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
93
Case Study Bonus: DeepMind’s
AlphaGo
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
policy network:
[19x19x48] Input
CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
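A hedged PyTorch sketch of that policy-network shape (the 48 input planes and layer counts follow the slide; the final softmax over board positions and everything else are my assumptions):

import torch
import torch.nn as nn

layers = [nn.Conv2d(48, 192, kernel_size=5, stride=1, padding=2), nn.ReLU()]   # CONV1
for _ in range(11):                                                            # CONV2..12
    layers += [nn.Conv2d(192, 192, kernel_size=3, stride=1, padding=1), nn.ReLU()]
layers += [nn.Conv2d(192, 1, kernel_size=1, stride=1, padding=0)]              # final 1x1 CONV

policy_net = nn.Sequential(*layers)

board = torch.randn(1, 48, 19, 19)               # [19x19x48] input features
logits = policy_net(board).view(1, -1)           # one score per board position
move_probs = logits.softmax(dim=1).view(19, 19)  # probability map of promising moves
print(move_probs.shape)                          # torch.Size([19, 19])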
94
Summary
95
Tips and Tricks
96
• Shuffle the training samples
97
Input representation
“Given a rectangular image, we first rescaled the image such that the shorter
side was of length 256, and then cropped out the central 256×256 patch from
the resulting image”
• Centered (0-mean) RGB values.
slide by Alex Krizhevsky
Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
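A minimal sketch of that kind of random augmentation pipeline using torchvision (my choice of library and of every parameter value; the slide does not prescribe a tool):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=15,              # rotation
                            translate=(0.1, 0.1),    # translation
                            scale=(0.9, 1.1),        # stretching
                            shear=10),               # shearing
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # a new random variant every time it is called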
102
Data augmentation improves human
learning, not just deep learning
If you're trying to improve your golf swing or master that tricky guitar
chord progression, here's some good news from researchers at Johns
Hopkins University: You may be able to double how quickly you learn
skills like these by introducing subtle variations into your practice
routine.
The received wisdom on learning motor skills goes something like this:
You need to build up "muscle memory" in order to perform mechanical
tasks, like playing musical instruments or sports, quickly and efficiently.
And the way you do that is via rote repetition — return hundreds of
tennis serves, play that F major scale over and over until your fingers
bleed, etc.
The wisdom on this isn't necessarily wrong, but the Hopkins research suggests
it's incomplete. Rather than doing the same thing over and over, you might be
able to learn things even faster (like, twice as fast) if you change up your
routine. Practicing your baseball swing? Change the size and weight of your
bat. Trying to nail a 12-bar blues in A major on the guitar? Spend 20 minutes
playing the blues in E major, too. Practice your backhand using tennis rackets
of varying size and weight.
https://www.washingtonpost.com/news/wonk/wp/2016/02/12/how-to-learn-new-skills-twice-as-fast/
103
Convolutions alone are not enough!
scale, rotation, shift, color, space
Convolutions alone cannot handle this!
slide by Alex Smola
Figure 4. Analysis of vertical translation, scale, and rotation invariance
within the model (rows a-c respectively). Col 1: example images undergoing the
transformations. Col 2 & 3: Euclidean distance between feature vectors from the
original and transformed images, and P(true class) as a function of the
transformation, for five example classes (Lawn Mower, Shih-Tzu, African
Crocodile, African Grey, Entertainment Center).
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, 2014]
105
Invariance and Covariance
Table 3: Relative variance and intrinsic dimensionality averaged over
experiments for different object categories and viewpoints (3D orientation,
translation, and scale). Each cell: top – rel. variance; bottom – intrinsic
dim. (The intrinsic dim. of L is not reported since it is typically larger
than 1K across the experiments and expensive to compute.)
Figure 4: PCA embeddings for different factors ((a) Lighting, (b) Scale,
(c) Object color, (d) Background color) using AlexNet pool5 features on “car”
images. Colors in (a) correspond to location of the light source (green – center).
The results support the intuition that higher layers are more invariant to viewpoint.
slide by Joan Bruna
107
Excerpts from Visualizing and Understanding Convolutional Networks [Zeiler and
Fergus, 2014] on how well the learned features generalize:
- Occlusion analysis: the probability of the true class (e.g. “pomeranian”)
drops significantly when the occluding gray square covers the relevant object
region, such as the dog's face.
- Caltech-101 (Table 4): keeping the ImageNet-trained convnet fixed and
training a new softmax classifier on top reaches 83.8 / 86.5 % accuracy
(15 / 30 training images per class), while training the same architecture
from scratch does terribly, only achieving 46.5%.
- Caltech-256 (Table 5): the ImageNet-pretrained convnet reaches
65.7 / 70.6 / 72.7 / 74.2 % (15 / 30 / 45 / 60 images per class) versus
9.0 / 22.5 / 31.2 / 38.8 % without pre-training, beating the previous best
(Bo et al., 2013: 55.2% at 60 images/class) by a significant margin; with the
pre-trained model, just 6 training images per class are enough to beat methods
using 10 times as many images.
- PASCAL VOC 2012: a 20-way softmax trained on top of the ImageNet-pretrained
convnet is not ideal, as PASCAL images can contain multiple objects and the
model provides a single exclusive prediction.
- Layer-by-layer breakdown (Table 7): training a linear SVM or softmax on
features from successively higher layers gives steadily better accuracy,
supporting the premise that deeper feature hierarchies learn increasingly
powerful features. (Cal-101 30/class, Cal-256 60/class:)
  SVM (1): 44.8 ± 0.7 / 24.6 ± 0.4
  SVM (2): 66.2 ± 0.5 / 39.6 ± 0.3
  SVM (3): 72.3 ± 0.4 / 46.0 ± 0.3
  SVM (4): 76.6 ± 0.4 / 51.3 ± 0.1
  SVM (5): 86.2 ± 0.8 / 65.6 ± 0.3
  SVM (7): 85.5 ± 0.4 / 71.7 ± 0.2
  Softmax (5): 82.9 ± 0.4 / 65.7 ± 0.5
  Softmax (7): 85.4 ± 0.4 / 72.6 ± 0.1
  Higher layers generally produce more discriminative features.
108
Stability: Transfer learning
Transfer Learning with ConvNets
• A ConvNet trained on a (large enough) dataset generalizes to other visual tasks
Figure 4. t-SNE map of 20,000 Flickr test images based on features extracted
from the last layer of an AlexNet trained with K = 1,000. A full-resolution map
is presented in the supplemental material. The inset shows a cluster of sports.
A model trained on classification datasets such as Pascal VOC is unlikely to
learn similar structure unless an explicit target taxonomy is defined (as in
the Imagenet dataset); the results suggest that such taxonomies can be learned
from weakly labeled data instead.
slide by Joan Bruna
1. Train on
Imagenet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
110
Transfer Learning with ConvNets
1. Train on Imagenet
2. Small dataset: feature extractor
   (freeze the pretrained layers, train only the new classifier on top)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
111
Transfer Learning with ConvNets
112
Transfer Learning with ConvNets
3. Medium dataset: finetuning. More data = retrain more of the network (or all
of it). Tip: use only ~1/10th of the original learning rate in the finetuned
top layer, and ~1/100th on intermediate layers.
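A minimal PyTorch sketch of the "small dataset: feature extractor" recipe above (freeze the pretrained layers, train only a new classifier); the backbone choice and layer names are mine, not from the slides:

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # 1. start from an ImageNet-trained ConvNet

for p in model.parameters():                    # 2. freeze the pretrained layers
    p.requires_grad = False

num_classes = 10                                # e.g. a small target dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)   # 3. new, trainable classifier on top

# Pass only model.fc.parameters() to the optimizer. For the medium-dataset
# (finetuning) case, unfreeze more layers and train them with a ~10-100x
# smaller learning rate, as the slide suggests.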
113
Today ConvNets are everywhere
Classification Retrieval
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
[Krizhevsky 2012]
114
Today ConvNets are everywhere
Detection Segmentation
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
[Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]
115
Today ConvNets are everywhere
NVIDIA Tegra X1
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
self-driving cars
116
Today ConvNets are everywhere
[Taigman et al. 2014]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
117
Today ConvNets are everywhere
[Mnih 2013]
118
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
119
Today ConvNets are everywhere
120
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
121
Today ConvNets are everywhere
Image
Captioning
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
122
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
reddit.com/r/deepdream
123
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
126
Understanding ConvNets
- Visualize patches that maximally activate neurons
- Visualize the weights
- Visualize the representation space (e.g. with t-SNE)
- Occlusion experiments
- Human experiment comparisons
- Deconv approaches (single backward pass)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
127
Visualize patches that maximally
activate neurons
one-stream AlexNet
pool5
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Rich feature hierarchies for accurate object detection and semantic segmentation
[Girshick, Donahue, Darrell, Malik]
128
Visualize the filters/kernels
(raw weights)
one-stream AlexNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
conv1 weights (from ConvNetJS CIFAR-10 demo)
130
131
The Gabor-like filters fatigue
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Visualizing the representation
fc7 layer
132
Visualizing the representation
t-SNE visualization
[van der Maaten & Hinton]
http://cs.stanford.edu/people/
karpathy/cnnembed/
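A hedged sketch of the recipe behind such a map: run the images through the network, keep their fc7 codes, and embed the codes in 2D with t-SNE (scikit-learn is used here for illustration; the feature file is a placeholder):

import numpy as np
from sklearn.manifold import TSNE

# One 4096-dimensional fc7 feature vector per image, e.g. collected by
# running the dataset through a trained AlexNet (assumed, precomputed file).
codes = np.load("fc7_codes.npy")                 # shape (num_images, 4096)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(codes)
# embedding[i] is the 2D position at which image i is placed in the visualization.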
134
Occlusion experiments
[Zeiler & Fergus 2013]
(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
135
Occlusion experiments
[Zeiler & Fergus 2013]
(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
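A minimal sketch of the occlusion experiment: slide a gray square over the input and record the correct-class probability as a function of the square's position (the model interface, square size and stride below are my assumptions):

import numpy as np

def occlusion_map(image, model, true_class, patch=32, stride=16, fill=0.5):
    """P(true class) as a function of the occluder position."""
    H, W, _ = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch, :] = fill   # the "square of zeros" (gray)
            probs = model(occluded)        # assumed: callable returning class probabilities
            heatmap[i, j] = probs[true_class]
    return heatmap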
136
Visualizing Activations
http://yosinski.com/deepvis
YouTube video
https://www.youtube.com/watch?v=AgkfIQ4IGaM
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
(4min)
137
138
Deconv approaches
1. Feed image into net
139
140
Deconv approaches
1. Feed image into net
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Deconv approaches
1. Feed image into net
2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
141
Deconv approaches
1. Feed image into net
2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
“Guided backpropagation”: instead of the usual backward pass, only positive
gradients are propagated back through each ReLU (and only where the forward
activation was positive), which produces much cleaner visualizations.
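A minimal numpy sketch of the ReLU backward rule in the three variants (ordinary backprop, the "deconvnet" rule, and guided backpropagation); function names are mine:

import numpy as np

def relu_backward(dout, x):
    return dout * (x > 0)               # ordinary backprop: gate by the forward activation

def deconv_relu_backward(dout, x):
    return dout * (dout > 0)            # "deconvnet": gate by the backward signal only

def guided_relu_backward(dout, x):
    return dout * (x > 0) * (dout > 0)  # guided backprop: gate by both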
142
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
143
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
144
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
145
[Striving for Simplicity: The all convolutional net,
Springenberg, Dosovitskiy, et al., 2015]
Visualization of patterns
learned by the layer
conv6 (top) and layer
conv9 (bottom) of the
network trained on
ImageNet.
146
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
bit weird
147
[Visualizing and Understanding Convolutional Networks
Zeiler & Fergus, 2013]
Visualizing arbitrary neurons along the way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
148
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
149
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
150
Optimization to Image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
1. feed in zeros (a zero image)
153
Optimization to Image
1. feed in zeros (a zero image)
154
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
155
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
156
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
2. Visualize the
Data gradient:
157
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
2. Visualize the
Data gradient:
158
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
159
We can in fact do this for arbitrary neurons
along the ConvNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Repeat:
1. Forward an image
2. Set activations in layer of interest to all zero, except
for a 1.0 for a neuron of interest
3. Backprop to image
4. Do an “image update”
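A hedged PyTorch sketch of that loop for a class score (step size, number of steps, the L2 pull toward zero and the model interface are my assumptions; any neuron's activation can be substituted for the class score):

import torch

def visualize_class(model, target_class, steps=200, lr=1.0, weight_decay=1e-4):
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)    # start from a zero image
    for _ in range(steps):
        score = model(img)[0, target_class]   # score of the class/neuron of interest
        score.backward()                      # backprop to the image
        with torch.no_grad():
            img += lr * img.grad              # "image update" (gradient ascent)
            img -= weight_decay * img         # small L2 regularizer on the image
            img.grad.zero_()
    return img.detach()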
160
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
162
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
163
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
164
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
165
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html
166
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html
167
Multifaceted Feature Visualization: Uncovering the Different Types of Features
Learned By Each Neuron in Deep Neural Networks
[Nguyen et al., '16]
169
Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al.,’16]
Figure 6. Multifaceted visualization of example neuron feature detectors from all eight layers of a deep convolutional neural network.
The images reflect the true sizes of the receptive fields at different layers. For each neuron, we show visualizations of 4 different
170
Question: Given a CNN code, is it
possible to reconstruct the original
image?
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
171
Find an image such that:
- Its code is similar to a given code
- It “looks natural” (image prior regularization)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
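That objective can be written down directly; a hedged sketch in the spirit of Mahendran and Vedaldi's formulation (the total-variation prior weight and optimizer settings are my assumptions):

import torch

def reconstruct(feature_extractor, target_code, steps=500, lr=0.05, tv_weight=1e-4):
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        code_loss = ((feature_extractor(x) - target_code) ** 2).sum()   # code is similar to the given code
        tv = ((x[:, :, 1:, :] - x[:, :, :-1, :]) ** 2).sum() + \
             ((x[:, :, :, 1:] - x[:, :, :, :-1]) ** 2).sum()            # "looks natural": total-variation prior
        (code_loss + tv_weight * tv).backward()
        opt.step()
    return x.detach()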
172
Understanding Deep Image Representations by Inverting Them
[Mahendran and Vedaldi, 2014]
original image; reconstructions from the 1000 log probabilities for ImageNet
(ILSVRC) classes
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
173
Reconstructions from the representation after last pooling
layer (immediately before the first Fully Connected layer)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
174
175
Reconstructions from intermediate layers
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Multiple reconstructions. Images in
quadrants all “look” the same to the CNN
(same code)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
176
177
DeepDream https://github.com/google/deepdream
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
178
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream: set dx = x :)
jitter regularizer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
“image update”
179
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream modifies the image in a way that “boosts” all activations, at any layer
this creates a feedback loop: e.g. any slightly detected dog face will be made more
and more dog like over time
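A minimal sketch of one DeepDream step following the "set the gradient equal to the activations" trick (the layer choice, step size and jitter amount are my assumptions):

import torch

def deepdream_step(model_up_to_layer, img, lr=0.01, jitter=16):
    # jitter regularizer: randomly shift the image before each step
    ox, oy = torch.randint(-jitter, jitter + 1, (2,))
    img = torch.roll(img, shifts=(int(ox), int(oy)), dims=(2, 3)).detach().requires_grad_(True)

    acts = model_up_to_layer(img)       # forward up to the chosen layer
    acts.backward(gradient=acts)        # set dx = x: boost whatever already activates the layer
    with torch.no_grad():
        img += lr * img.grad / img.grad.abs().mean()   # normalized "image update"

    return torch.roll(img.detach(), shifts=(-int(ox), -int(oy)), dims=(2, 3))  # undo the jitter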
180
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream modifies the image in a way that “boosts” all activations, at any layer
181
inception_3b/5x5_reduce
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream modifies the image in a way that “boosts” all activations, at any layer
182
Bonus videos
Deep Dream Grocery Trip
https://www.youtube.com/watch?v=DgPaCWJL7XI
Deep Dreaming Fear & Loathing in Las Vegas: the Great San Francisco Acid Wave
https://www.youtube.com/watch?v=oyxSerkkP4o
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
183
NeuralStyle
[A Neural Algorithm of Artistic Style
by Leon A. Gatys, Alexander S. Ecker, and
Matthias Bethge, 2015]
good implementation by Justin in Torch:
https://github.com/jcjohnson/neural-style
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
184
185
make your own easily on deepart.io
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Step 1: Extract content targets (ConvNet activations of
all layers for the given content image)
content activations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
e.g.
at CONV5_1 layer we would have a [14x14x512] array of target activations
186
Step 2: Extract style targets (Gram matrices of ConvNet
activations of all layers for the given style image)
e.g.
at CONV1 layer (with [224x224x64] activations) would give a [64x64] Gram
matrix of all pairwise activation covariances (summed across spatial locations)
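A minimal sketch of that Gram-matrix computation (the activation layout is assumed to be channels x height x width):

import torch

def gram_matrix(activations):
    C, H, W = activations.shape        # e.g. 64 x 224 x 224 at the CONV1 layer
    F = activations.view(C, H * W)     # flatten spatial locations
    return F @ F.t()                   # C x C Gram matrix (64x64 in the slide's example)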
187
Step 3: Optimize over image to have:
- The content of the content image (activations match
content)
- The style of the style image (Gram matrices of
activations match style)
match content
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
match style
188
We can pose an optimization over the input
image to maximize any class score.
That seems useful.
189
[Intriguing properties of neural networks,
Szegedy et al., 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
190
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]
>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
191
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]
>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
192
These kinds of results were around even
before ConvNets…
[Exploring the Representation Capabilities of the HOG Descriptor,
Tatu et al., 2011]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
193
Explaining and Harnessing Adversarial Examples
[Goodfellow, Shlens & Szegedy, 2014]
194
Lets fool a binary linear classifier:
(logistic regression)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
195
Lets fool a binary linear classifier:
x 2 -1 3 -2 2 2 1 -4 5 1 input example
w -1 -1 1 -1 1 -1 1 1 -1 1 weights
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
196
Lets fool a binary linear classifier:
x 2 -1 3 -2 2 2 1 -4 5 1 input example
w -1 -1 1 -1 1 -1 1 1 -1 1 weights
197
Lets fool a binary linear classifier:
x 2 -1 3 -2 2 2 1 -4 5 1 input example
w -1 -1 1 -1 1 -1 1 1 -1 1 weights
adversarial x ? ? ? ? ? ? ? ? ? ?
198
Lets fool a binary linear classifier:
x 2 -1 3 -2 2 2 1 -4 5 1 input example
w -1 -1 1 -1 1 -1 1 1 -1 1 weights
adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5
199
Lets fool a binary linear classifier:
x 2 -1 3 -2 2 2 1 -4 5 1 input example
w -1 -1 1 -1 1 -1 1 1 -1 1 weights
adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5
x · w = -3 => probability of class 1 is 1/(1+e^(-(-3))) = 0.0474
adversarial x · w = -1.5+1.5+3.5+2.5+2.5-1.5+1.5-3.5-4.5+1.5 = 2
=> probability of class 1 is now 1/(1+e^(-(2))) = 0.88
i.e. we improved the class 1 probability from 5% to 88%
(A 224x224 input image has 150,528 input dimensions. It's significantly easier
with more numbers: you need a smaller nudge for each.)
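The arithmetic above is easy to reproduce; a minimal numpy sketch of the same nudge (a step of 0.5 in the direction of sign(w), exactly as in the adversarial row of the table):

import numpy as np

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)    # input example
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)  # weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(w @ x))           # 0.047... : P(class 1) for the original input (w.x = -3)

x_adv = x + 0.5 * np.sign(w)    # nudge every dimension by 0.5 toward class 1
print(x_adv)                    # [ 1.5 -1.5  3.5 -2.5  2.5  1.5  1.5 -3.5  4.5  1.5]
print(sigmoid(w @ x_adv))       # 0.88... : P(class 1) for the adversarial input (w.x = 2)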
200
Blog post: Breaking Linear Classifiers
on ImageNet
Recall CIFAR-10
linear classifiers:
ImageNet classifiers:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
201
mix in a tiny bit of
Goldfish classifier weights
+ =
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
100% Goldfish
202
203
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
204
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson