Week 7: ConvNets and Transfer Learning
LeNet-5
▪ Created by Yann LeCun in the 1990s
▪ Used on the MNIST data set
▪ Novel idea: use convolutions to efficiently learn features from the data set
LeNet—Structure Diagram
Input: a 32 x 32 grayscale image (a 28 x 28 MNIST image with 2 pixels of padding all around).
Next, we have a convolutional layer with six 5x5 kernels, giving a 6x28x28 output.
What is the total number of weights in this layer?
Answer: Each kernel has 5x5 = 25 weights (plus a bias term, so actually 26 weights). So total weights = 6 x 26 = 156.
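As a quick check on that arithmetic, here is a minimal sketch, assuming PyTorch (the slides show no code), that builds the same layer and counts its parameters:

import torch.nn as nn

# Six 5x5 kernels over a single (grayscale) input channel, as in LeNet's first conv layer.
conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

# 6 kernels x (5*5 weights + 1 bias) = 6 x 26 = 156 parameters.
print(sum(p.numel() for p in conv1.parameters()))  # 156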
Next is a 2x2 pooling layer (with stride 2).
So the output size is 6x14x14 (we downsample by a factor of 2).
Note: The original paper actually uses a more complicated pooling operation than max or average pooling, but this is considered obsolete now.
No weights! (Pooling layers have no weights to be learned – pooling is a fixed operation.)
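To illustrate both points, a small sketch, again assuming PyTorch and using max pooling as the modern stand-in for the original pooling operation:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# A dummy batch shaped like the first conv layer's output: (batch, 6, 28, 28).
x = torch.randn(1, 6, 28, 28)
print(pool(x).shape)                              # torch.Size([1, 6, 14, 14])
print(sum(p.numel() for p in pool.parameters()))  # 0 -- nothing to learn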
The next convolutional layer has sixteen 5x5 kernels. The kernels “take in” the full depth of the previous layer, so each 5x5 kernel now “looks at” 6x5x5 pixels.
Each kernel has 6x5x5 = 150 weights + a bias term = 151.
A second 2x2 pooling layer (stride 2) then downsamples the 16x10x10 output to 16x5x5.
We “flatten” this 16x5x5 output to a length-400 vector (not shown in the diagram).
The following layers are just fully connected layers!
From 400 to 120, then from 120 to 84, then from 84 to 10, and finally a softmax output of size 10 for the 10 digits.
LeNet-5
How many total weights in the network?
Conv1: 1*6*5*5 + 6 = 156
Conv3: 6*16*5*5 + 16 = 2416
FC1: 400*120 + 120 = 48120
FC2: 120*84 + 84 = 10164
FC3: 84*10 + 10 = 850
Total: 61706
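The same total can be reproduced with a short sketch. This is one plausible PyTorch rendering of the layer sizes above (tanh activations and max pooling are stand-ins, not exactly what the original paper used); its parameter count comes out to 61706:

import torch.nn as nn

# A LeNet-5-style network following the layer sizes in the slides.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28   (156 weights)
    nn.Tanh(),
    nn.MaxPool2d(2, 2),               # -> 6x14x14           (no weights)
    nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10          (2416 weights)
    nn.Tanh(),
    nn.MaxPool2d(2, 2),               # -> 16x5x5            (no weights)
    nn.Flatten(),                     # -> 400
    nn.Linear(400, 120),              # (48120 weights)
    nn.Tanh(),
    nn.Linear(120, 84),               # (10164 weights)
    nn.Tanh(),
    nn.Linear(84, 10),                # (850 weights); softmax is applied on these 10 outputs
)

print(sum(p.numel() for p in lenet5.parameters()))  # 61706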
Motivation
▪ Early layers in a neural network are the hardest (i.e. slowest) to train
▪ This is due to the vanishing gradient problem
▪ But these “primitive” features should be general across many image classification tasks
Motivation
▪ Later layers in the network capture features that are more particular to the specific image classification problem
▪ Later layers are easier (quicker) to train, since adjusting their weights has a more immediate impact on the final result
Motivation
▪ Famous, competition-winning models are difficult to train from scratch
– Huge datasets (like ImageNet)
– Large number of training iterations
– Very heavy computing machinery
– Time spent experimenting to get hyper-parameters just right
Transfer Learning
▪ However, the basic features (edges, shapes) learned in the early layers of the network should generalize
▪ The results of training are just weights (numbers) that are easy to store
▪ Idea: keep the early layers of a pre-trained network, and re-train the later layers for a specific application
▪ This is called Transfer Learning
Transfer Learning
Diagram: the pre-trained network, with convolutional layers feeding into fully connected layers and a softmax classifier.
First, train just the last layer on the new data, keeping the pre-trained convolutional and fully connected layers fixed.
Perhaps, after a while, train back a few more layers (or even the whole network) on the new data.
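A minimal sketch of this workflow, assuming PyTorch with torchvision 0.13 or later and an ImageNet-pretrained ResNet-18 as the source network (the slides do not name a specific model, so the layer names here are illustrative):

import torch.nn as nn
from torchvision import models

# Start from a network whose weights were pre-trained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze all pre-trained layers: their weights will not be updated.
for p in model.parameters():
    p.requires_grad = False

# Replace the final fully connected layer with a fresh one for the new task
# (e.g. 2 classes: dogs vs. cats). Only this new layer is trained at first.
model.fc = nn.Linear(model.fc.in_features, 2)

# Later, to fine-tune "a few more layers", unfreeze the last convolutional block too
# (unfreezing everything would re-train the entire network from this starting point).
for p in model.layer4.parameters():
    p.requires_grad = True

# The optimizer should then be given only the trainable parameters, e.g.
# torch.optim.Adam(p for p in model.parameters() if p.requires_grad).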
Transfer Learning Options
▪ The additional training of a pre-trained network on a specific new dataset is referred to as “fine-tuning”
▪ There are different options for “how much” and “how far back” to fine-tune
– Should I train just the very last layer?
– Go back a few layers?
– Re-train the entire network (from the starting point of the existing network)?
Guiding Principles for Fine-Tuning
While there are no “hard and fast” rules, there are some guiding principles to keep in mind.
1) The more similar your data and problem are to the source data of the pre-trained network, the less fine-tuning is necessary.
E.g. Using a network trained on ImageNet to distinguish “dogs” from “cats” should need relatively little fine-tuning. It already distinguishes different breeds of dogs and cats, so it likely has all the features you will need.
2) The more data you have about your specific problem, the more the network will benefit from longer and deeper fine-tuning.
E.g. If you have only 100 dogs and 100 cats in your training data, you probably want to do very little fine-tuning. If you have 10,000 dogs and 10,000 cats, you may get more value from longer and deeper fine-tuning.
3) If your data is substantially different in nature from the data the source model was trained on, Transfer Learning may be of little value.
E.g. A network that was trained to recognize typed Latin alphabet characters would not be useful for distinguishing cats from dogs. But it likely would be useful as a starting point for recognizing Cyrillic alphabet characters.