Tips for Training DNN
Recipe of Deep Learning
[Flowchart] Step 1: define a set of functions (Neural Network) → Step 2: goodness of function → Step 3: pick the best function.
Good results on training data? NO → go back and improve the three steps. YES ↓
Good results on testing data? NO → Overfitting! Go back and fix it. YES → done.
Do not always blame Overfitting
Bad results on testing data do not necessarily mean overfitting: always check the results on training data first. If the training results are already bad, the network is simply not well trained.
Different approaches for different problems.
Recipe of Deep Learning
• Good results on training data? Fixes: new activation function, adaptive learning rate.
• Good results on testing data? Fixes: early stopping, regularization, dropout.
First topic: new activation function.
Vanishing Gradient Problem
[Figure: a deep sigmoid network with inputs x1 … xN and outputs y1 … yM compared against targets ŷ1 … ŷM. Perturbing a weight near the input by a large +Δw changes the cost C by only a small +ΔC: each sigmoid layer squashes the change, so gradients are small near the input (those layers learn very slowly) and large near the output (those layers learn fast).]
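A minimal numpy sketch (mine, not from the slides) that makes the attenuation visible: backprop through a stack of sigmoid layers and print the gradient norm per layer. The depth, width, and initialization are arbitrary choices; exact numbers vary with the random seed, but the norms shrink toward the input layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 10, 16
Ws = [rng.normal(0.0, width ** -0.5, (width, width)) for _ in range(depth)]

# Forward pass, keeping every layer's activation for backprop.
a = rng.normal(size=width)
acts = [a]
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# Backward pass for a dummy loss L = sum(outputs). Each layer multiplies the
# error signal by sigmoid'(z) = a * (1 - a) <= 0.25, so the gradient norm
# shrinks as we move from the output (layer 9) back toward the input (layer 0).
delta = np.ones(width)                                   # dL/da at the output
for i in reversed(range(depth)):
    dz = delta * acts[i + 1] * (1.0 - acts[i + 1])       # dL/dz for layer i
    grad_W = np.outer(dz, acts[i])                       # dL/dW_i
    print(f"layer {i}: |dL/dW| = {np.linalg.norm(grad_W):.2e}")
    delta = Ws[i].T @ dz                                 # dL/da one layer down
```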
ReLU (Rectified Linear Unit)
$a = z$ when $z > 0$; $a = 0$ when $z \le 0$.
[Figure: a network in which the neurons whose output is 0 are removed.]
What remains is a thinner, linear network: on the active ($a = z$) paths the activation passes gradients through unchanged, so they do not have smaller gradients.
ReLU - variants
• Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ otherwise.
• Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ otherwise, where $\alpha$ is also learned by gradient descent.
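A small numpy sketch of the three activations (function names are mine); in the parametric version, $\alpha$ would be updated together with the weights:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 * z otherwise
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # Same shape as leaky ReLU, but alpha is a parameter
    # learned by gradient descent along with the weights.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                          # [0.    0.    0.    1.5]
print(leaky_relu(z))                    # [-0.02  -0.005  0.    1.5]
print(parametric_relu(z, alpha=0.1))    # [-0.2   -0.05   0.    1.5]
```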
Maxout (ReLU is a special case of Maxout)
Group the linear outputs of a layer and take the max within each group.
[Figure: inputs x1, x2. First layer: one group computes 5 and 7 (max → 7), another computes −1 and 1 (max → 1). Second layer: one group computes 1 and 2 (max → 2), another computes 4 and 3 (max → 4).]
How ReLU is a special case:
• ReLU neuron: $z = wx + b$, $a = \max(z, 0)$.
• Maxout neuron with a group of two: $z_1 = wx + b$ and $z_2 = 0$ (an element with zero weight and zero bias); then $a = \max\{z_1, z_2\}$, which is exactly ReLU.
Maxout - More than ReLU
• ReLU: $z = wx + b$, $a = \max(z, 0)$.
• General maxout: $z_1 = wx + b$, $z_2 = w'x + b'$, $a = \max\{z_1, z_2\}$.
The result is a learnable activation function: its shape (a piecewise linear function of $x$) is determined by the learned parameters $w, b, w', b'$.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function in a maxout network can be any piecewise linear convex function
• The number of pieces depends on how many elements are in a group
[Figure: a two-layer maxout network; each group of z-values feeds a max operation that produces one activation a.]
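A minimal numpy sketch of a maxout layer (layout assumptions are mine: the rows of W are ordered group by group). The usage example checks the special case above, where fixing one element of the group to $z_2 = 0$ recovers ReLU:

```python
import numpy as np

def maxout(x, W, b, group_size):
    """x: (d_in,), W: (n_groups * group_size, d_in), b: matching 1-D bias."""
    z = W @ x + b                      # all linear pieces
    z = z.reshape(-1, group_size)      # one row per group
    return z.max(axis=1)               # max within each group

rng = np.random.default_rng(1)
x = rng.normal(size=3)
w = rng.normal(size=3)

# Group of two with the second element pinned to z2 = 0 (zero weight, zero
# bias): the maxout activation equals max(w.x + b, 0), i.e. ReLU.
W = np.stack([w, np.zeros(3)])
b = np.array([0.5, 0.0])
print(maxout(x, W, b, group_size=2), max(w @ x + 0.5, 0.0))  # same value
```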
Maxout - Training
• Given a training example x, we know which z in each group is the max.
[Figure: the max selects one element per group, so for this example the network reduces to a thin, linear network.]
• Train this thin and linear network: gradients flow through the selected element of each group, just as they flow through the active part of ReLU.
• Different examples give different thin and linear networks, so every parameter still gets trained.
Recipe of Deep Learning
• Good results on training data? Fixes: new activation function, adaptive learning rate.
• Good results on testing data? Fixes: early stopping, regularization, dropout.
Next topic: adaptive learning rate.
Adagrad
[Figure: error-surface contours over $w_1$ and $w_2$; the flat direction $w_1$ should use a larger learning rate, the steep direction $w_2$ a smaller one.]
$w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$
Each parameter divides the learning rate by the root of the sum of its past squared gradients, so parameters with consistently large gradients take smaller steps.
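A minimal numpy sketch of the update above (function and variable names are mine; the eps term is a common practical guard against division by zero, not on the slide):

```python
import numpy as np

def adagrad_update(w, g, hist_sq, eta=0.1, eps=1e-8):
    """One step of w <- w - eta / sqrt(sum of past g^2) * g, per parameter."""
    hist_sq += g ** 2                            # accumulate squared gradients
    w -= eta / (np.sqrt(hist_sq) + eps) * g
    return w, hist_sq

# Toy quadratic loss 0.1*w1^2 + 4*w2^2: flat in w1, steep in w2.
w = np.array([1.0, 1.0])
hist_sq = np.zeros_like(w)
for _ in range(100):
    g = np.array([0.2 * w[0], 8.0 * w[1]])      # the loss gradient
    w, hist_sq = adagrad_update(w, g, hist_sq)
print(w)  # both coordinates make progress despite very different curvatures
```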
RMSProp
[Figure: contours over $w_1$ and $w_2$ again; even along a single direction, the appropriate learning rate can change as the error surface changes.]
$w^1 \leftarrow w^0 - \dfrac{\eta}{\sigma^0} g^0$, where $\sigma^0 = g^0$
$w^2 \leftarrow w^1 - \dfrac{\eta}{\sigma^1} g^1$, where $\sigma^1 = \sqrt{\alpha(\sigma^0)^2 + (1-\alpha)(g^1)^2}$
$w^3 \leftarrow w^2 - \dfrac{\eta}{\sigma^2} g^2$, where $\sigma^2 = \sqrt{\alpha(\sigma^1)^2 + (1-\alpha)(g^2)^2}$
……
$w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sigma^t} g^t$, where $\sigma^t = \sqrt{\alpha(\sigma^{t-1})^2 + (1-\alpha)(g^t)^2}$
$\sigma^t$ is the root mean square of the gradients, with previous gradients being decayed.
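A minimal numpy sketch of the recursion above. One deliberate deviation: the slide initializes $\sigma^0 = g^0$, while this sketch starts the running average at zero (the more common library behavior); after a few steps the two agree closely. The eps guard is mine.

```python
import numpy as np

def rmsprop_update(w, g, sigma_sq, eta=0.01, alpha=0.9, eps=1e-8):
    # sigma_t^2 = alpha * sigma_{t-1}^2 + (1 - alpha) * g_t^2
    sigma_sq = alpha * sigma_sq + (1.0 - alpha) * g ** 2
    w = w - eta / (np.sqrt(sigma_sq) + eps) * g
    return w, sigma_sq

# Same toy quadratic as the Adagrad example above.
w = np.array([1.0, 1.0])
sigma_sq = np.zeros_like(w)
for _ in range(100):
    g = np.array([0.2 * w[0], 8.0 * w[1]])
    w, sigma_sq = rmsprop_update(w, g, sigma_sq)
print(w)  # step sizes adapt per parameter as the decayed RMS changes
```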
Hard to find optimal network parameters
[Figure: total loss as a function of the parameters; the surface has plateaus (progress is very slow), saddle points (training can get stuck), and local minima.]
Plain gradient descent:
• Start at point $\theta^0$.
• Compute the gradient at $\theta^0$; move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$.
• Compute the gradient at $\theta^1$; move to $\theta^2 = \theta^1 - \eta \nabla L(\theta^1)$.
• ……
• Stop when $\nabla L(\theta^t) \approx 0$, which unfortunately also happens at plateaus and saddle points, not only at minima.
Momentum
Movement: movement of last step minus gradient at present.
• Start at point $\theta^0$ with movement $v^0 = 0$.
• Compute the gradient at $\theta^0$; movement $v^1 = \lambda v^0 - \eta \nabla L(\theta^0)$; move to $\theta^1 = \theta^0 + v^1$.
• Compute the gradient at $\theta^1$; movement $v^2 = \lambda v^1 - \eta \nabla L(\theta^1)$; move to $\theta^2 = \theta^1 + v^2$.
• ……
Movement is not just based on the current gradient, but also on the previous movement.
$v^i$ is actually the weighted sum of all the previous gradients: for example, $v^2 = \lambda v^1 - \eta\nabla L(\theta^1) = -\lambda\eta\nabla L(\theta^0) - \eta\nabla L(\theta^1)$.
[Figure: a ball rolling on the loss surface. The real movement is the sum of the negative of the gradient and the momentum, so even where $\partial L / \partial w = 0$ (a plateau or local minimum), the accumulated momentum can carry the parameters onward.]
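A minimal numpy sketch of exactly the update above, $v \leftarrow \lambda v - \eta\nabla L$, $\theta \leftarrow \theta + v$ (names and the toy loss are mine):

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, lam=0.9):
    v = lam * v - eta * grad     # movement: last movement minus gradient
    return theta + v, v

# Toy loss L(theta) = theta^2, so grad = 2 * theta. The accumulated movement
# keeps the parameter moving even where individual gradients are small.
theta, v = 5.0, 0.0
for _ in range(50):
    theta, v = momentum_step(theta, grad=2.0 * theta, v=v)
print(theta)  # heads toward the minimum at 0, overshooting and settling
```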
Adam: RMSProp + Momentum
[Figure: the Adam algorithm; one exponential decay rate drives the momentum term and another drives the RMSProp term.]
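A minimal numpy sketch of the standard Adam step (Kingma & Ba), which combines the two ideas above; $\beta_1$ is the decay rate for the momentum term and $\beta_2$ for the RMSProp term, with the usual default values:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # momentum: decayed mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # RMSProp: decayed mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):                    # t starts at 1 for bias correction
    g = np.array([0.2 * w[0], 8.0 * w[1]])
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```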
Recipe of Deep Learning
• Good results on training data? Fixes: new activation function, adaptive learning rate.
• Good results on testing data? Fixes: early stopping, regularization, dropout.
Next topic: early stopping.
Early Stopping
[Figure: total loss vs. epochs. The training-set loss keeps decreasing, but the testing/validation-set loss eventually starts to rise; stop training at the epoch where the validation loss is lowest.]
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
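The Keras FAQ linked above points to the EarlyStopping callback; a minimal sketch with a recent Keras (the toy model and random data are placeholders of my own, not from the slides):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Toy data standing in for a real dataset.
x = np.random.rand(1000, 20)
y = (x.sum(axis=1) > 10).astype('float32')

model = Sequential([Dense(32, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Stop once val_loss has not improved for 2 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```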
Recipe of Deep Learning
• Good results on training data? Fixes: new activation function, adaptive learning rate.
• Good results on testing data? Fixes: early stopping, regularization, dropout.
Next topic: regularization.
Regularization
New loss function to be minimized: $L'(\theta) = L(\theta) + \lambda\,\tfrac{1}{2}\|\theta\|_2^2$, where $\|\theta\|_2^2 = (w_1)^2 + (w_2)^2 + \cdots$ (L2 regularization; biases are usually not regularized).
Gradient: $\dfrac{\partial L'}{\partial w} = \dfrac{\partial L}{\partial w} + \lambda w$
Update: $w^{t+1} \leftarrow w^t - \eta\left(\dfrac{\partial L}{\partial w} + \lambda w^t\right) = (1 - \eta\lambda)\, w^t - \eta\dfrac{\partial L}{\partial w}$
Since $1 - \eta\lambda$ is slightly smaller than 1, every update multiplies the weight by a factor just below 1, pulling it closer to zero: weight decay.
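A minimal numpy sketch of the L2 update above (names and constants are mine). Setting the loss gradient to zero isolates the decay effect:

```python
import numpy as np

def l2_update(w, grad, eta=0.1, lam=0.01):
    # w <- (1 - eta*lam) * w - eta * dL/dw
    return (1.0 - eta * lam) * w - eta * grad

w = 3.0
for _ in range(100):
    w = l2_update(w, grad=0.0)   # even with zero loss gradient...
print(w)  # ...the weight decays toward zero: 3.0 * 0.999^100 ~= 2.71
```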
Regularization - L1
New loss function: $L'(\theta) = L(\theta) + \lambda\,\|\theta\|_1$, where $\|\theta\|_1 = |w_1| + |w_2| + \cdots$
Update: $w^{t+1} \leftarrow w^t - \eta\left(\dfrac{\partial L}{\partial w} + \lambda\,\mathrm{sgn}(w^t)\right) = w^t - \eta\dfrac{\partial L}{\partial w} - \eta\lambda\,\mathrm{sgn}(w^t)$
L1 always deletes a fixed amount $\eta\lambda$ from $|w|$, whereas L2 always multiplies $w$ by $1-\eta\lambda$. So L1 drives small weights all the way to zero (sparse solutions), while L2 shrinks large weights faster but never reaches exactly zero.
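A minimal numpy sketch contrasting the two updates (values are mine; the loss gradient is again zeroed to isolate the penalties). Note the raw L1 step does not stop at zero: once $|w| < \eta\lambda$ it jitters around zero by up to $\eta\lambda$.

```python
import numpy as np

eta, lam = 0.1, 0.1                  # learning rate and penalty strength
w_l1 = np.array([5.0, 0.05])
w_l2 = np.array([5.0, 0.05])
for _ in range(200):
    w_l1 = w_l1 - eta * lam * np.sign(w_l1)   # L1: always delete eta*lam
    w_l2 = (1.0 - eta * lam) * w_l2           # L2: always multiply by 1 - eta*lam
print(w_l1)  # large weight shrinks linearly to 3.0; small one reaches ~0 and jitters
print(w_l2)  # both shrink proportionally and never become exactly zero
```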
Regularization - Weight Decay
• Our brain also prunes away the useless links between neurons.
Recipe of Deep Learning
• Good results on testing data? Fixes: early stopping, regularization, dropout.
Next topic: dropout.
Dropout
Training:
• Each time before updating the parameters, each neuron has a p% chance to dropout.
• The structure of the network is changed: it becomes thinner.
• Use the new, thinner network for training (a different subnetwork for each update).
Testing:
• No dropout.
• If the dropout rate at training is p%, multiply all the weights by (1−p)%.
• Example: with a dropout rate of 50%, if a weight is $w = 1$ after training, set $w = 0.5$ for testing.
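A minimal numpy sketch of exactly this scheme (names are mine): drop activations at training time, scale at testing time. Modern libraries usually use "inverted dropout" (scale by 1/(1−p) during training instead), but the effect is equivalent; this sketch follows the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    mask = rng.random(a.shape) >= p   # each neuron survives with prob 1 - p
    return a * mask                   # dropped neurons output 0: thinner network

def dropout_test(a, p=0.5):
    return a * (1.0 - p)              # no dropping; scale by 1 - p instead

a = np.ones(6)
print(dropout_train(a))   # e.g. [1. 0. 1. 1. 0. 0.] (random per call)
print(dropout_test(a))    # [0.5 0.5 0.5 0.5 0.5 0.5]
```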
Dropout - Intuitive Reason
• Training with dropout is like practicing with heavy weights tied to your legs; at testing time the weights come off, so you become much stronger.
• Another intuition: during training each neuron thinks "my partner may slack off, so I have to work hard." At testing time no one is dropped, so everyone works, and the combined output would be too strong.
• That is why the weights are multiplied by 1−p% at testing: with a dropout rate of 50%, about half of a neuron's inputs are present during training; at testing all of them are present, so halving the weights ($w_3 \to 0.5 \times w_3$, $w_4 \to 0.5 \times w_4$) keeps the pre-activation roughly unchanged: $z' \approx z$.
Dropout is a kind of ensemble.
• Ensemble: train several networks (possibly with different structures) on different sampled training sets; at testing time, average their outputs y1, y2, y3, y4.
• Training with dropout: each minibatch (minibatch 1, 2, 3, 4, …) trains a different sampled subnetwork. With M neurons there are $2^M$ possible subnetworks, and they all share the same parameters.
• Testing with dropout: averaging the outputs of all $2^M$ subnetworks is infeasible. Instead, use the full network with all the weights multiplied by 1−p%; perhaps surprisingly, its output y approximates the ensemble average.
Testing of Dropout
Take a single neuron $z = w_1x_1 + w_2x_2$ and apply dropout to its two inputs. The four possible dropout patterns give:
• keep both: $z = w_1x_1 + w_2x_2$
• drop $x_1$: $z = w_2x_2$
• drop $x_2$: $z = w_1x_1$
• drop both: $z = 0$
Their average is $z = \tfrac{1}{2}w_1x_1 + \tfrac{1}{2}w_2x_2$, exactly the full network with each weight multiplied by $\tfrac{1}{2}$, i.e. by 1−p% with p = 50%. For a linear network the weight-scaling rule reproduces the ensemble average exactly; with nonlinear activations it is only an approximation, but it works well in practice.
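A quick numeric check of the example above (the concrete values are my own): averaging the four dropout patterns equals evaluating the full neuron with halved weights.

```python
import numpy as np

w1, w2, x1, x2 = 0.8, -1.2, 2.0, 3.0
patterns = [w1 * x1 + w2 * x2,   # keep both inputs
            w2 * x2,             # drop x1
            w1 * x1,             # drop x2
            0.0]                 # drop both
print(np.mean(patterns))                 # ensemble average over the 4 subnetworks
print(0.5 * w1 * x1 + 0.5 * w2 * x2)     # weights times 1-p% (p = 50%): identical
```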
Recipe of Deep Learning (recap)
[Flowchart] Step 1: define a set of functions (Neural Network) → Step 2: goodness of function → Step 3: pick the best function.
Good results on training data? NO → improve training (new activation function, adaptive learning rate). YES ↓
Good results on testing data? NO → Overfitting! (early stopping, regularization, dropout). YES → done.