4 - DNN Tip
Recipe of Deep Learning
Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function
Good results on training data? NO → go back and modify the three steps. YES ↓
Good results on testing data? NO → overfitting! YES → done.
Do not always blame overfitting.
Bad results on testing data do not necessarily mean overfitting; the network may simply be not well trained. Always check the results on the training data first.
Different approaches for different problems.
Recipe of Deep Learning
Good results on training data? → new activation function, adaptive learning rate
Good results on testing data? → early stopping, regularization, dropout
First topic: new activation function.
Vanishing Gradient Problem
In a deep network (inputs x1 … xN, outputs y1 … yM), the gradients are small for weights near the input layer and large for weights near the output layer.
Intuition: apply a large change $+\Delta w$ to a weight near the input; by the time its effect has passed through many layers, the change $+\Delta l$ it causes in the loss at the output is small, so $\partial l / \partial w$ is small for that weight.
ReLU
$a = z$ for $z > 0$; $a = 0$ for $z \le 0$.
Neurons whose output is 0 can be removed from the network; what remains is a thinner, linear network, which does not have smaller gradients (no vanishing gradient).
ReLU - variants
Leaky ReLU: $a = 0.01z$ for $z < 0$
Parametric ReLU: $a = \alpha z$ for $z < 0$, where $\alpha$ is also learned by gradient descent
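A minimal NumPy sketch of these activations (the function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # negative inputs are scaled by a small fixed slope instead of being zeroed
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # alpha is a learnable parameter, updated by gradient descent with the weights
    return np.where(z > 0, z, alpha * z)
```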
Maxout — ReLU is a special case of Maxout
Group the linear outputs of a layer and output the max of each group. Example with inputs x1, x2 and groups of two elements: the first layer produces {5, 7} → max 7 and {−1, 1} → max 1; the next layer produces {1, 2} → max 2 and {4, 3} → max 4.
ReLU is a special case of Maxout:
• A ReLU neuron: $z = wx + b$, $a = \max(0, z)$.
• An equivalent Maxout group: $z_1 = wx + b$, $z_2 = 0$, $a = \max(z_1, z_2)$ — the same input-output function.
Maxout — more than ReLU
A Maxout group with $z_1 = wx + b$ and $z_2 = w'x + b'$, $a = \max(z_1, z_2)$, gives a learnable activation function: its shape depends on the learned parameters $w, b, w', b'$.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function of a maxout network can be any piecewise linear convex function
• The number of pieces depends on how many elements are in a group
• For a given example, only the maximum element of each group is active, so we effectively train a thin and linear network; different examples activate different thin and linear networks, so all the parameters still get trained. A sketch of a maxout layer follows.
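A minimal NumPy sketch of a maxout layer under assumed shapes (the names W, b, group_size are illustrative):

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    # x: (batch, d_in), W: (d_in, d_out * group_size), b: (d_out * group_size,)
    z = x @ W + b                              # all linear pieces
    z = z.reshape(x.shape[0], -1, group_size)  # (batch, d_out, group_size)
    return z.max(axis=-1)                      # keep only the max of each group

# usage: 3 maxout units, each taking the max of 2 linear pieces
x = np.random.randn(4, 5)
W = np.random.randn(5, 6)
b = np.zeros(6)
a = maxout(x, W, b)   # shape (4, 3)
```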
Recipe of Deep Learning — good results on training data? New activation function done; next: adaptive learning rate.
Adaptive Learning Rate
Different parameters need different learning rates: a larger learning rate along the flat direction of the error surface and a smaller one along the steep direction (the $w_1$–$w_2$ contour plot).
Adagrad: $w^{t+1} \leftarrow w^{t} - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\, g^{t}$
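A one-step Adagrad update in NumPy (a sketch; the eps term is added here for numerical stability and is not on the slide):

```python
import numpy as np

def adagrad_update(w, g, g_sq_sum, eta=0.01, eps=1e-8):
    # accumulate the sum of all past squared gradients
    g_sq_sum = g_sq_sum + g**2
    # divide the learning rate by the root of that sum
    w = w - eta / (np.sqrt(g_sq_sum) + eps) * g
    return w, g_sq_sum
```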
The error surface of a neural network can be very complex: even for the same parameter, the learning rate that is needed can change during training — sometimes larger, sometimes smaller.

RMSProp:
$w^{1} \leftarrow w^{0} - \dfrac{\eta}{\sigma^{0}} g^{0}, \qquad \sigma^{0} = g^{0}$
$w^{2} \leftarrow w^{1} - \dfrac{\eta}{\sigma^{1}} g^{1}, \qquad \sigma^{1} = \sqrt{\alpha(\sigma^{0})^{2} + (1-\alpha)(g^{1})^{2}}$
$w^{3} \leftarrow w^{2} - \dfrac{\eta}{\sigma^{2}} g^{2}, \qquad \sigma^{2} = \sqrt{\alpha(\sigma^{1})^{2} + (1-\alpha)(g^{2})^{2}}$
……
$w^{t+1} \leftarrow w^{t} - \dfrac{\eta}{\sigma^{t}} g^{t}, \qquad \sigma^{t} = \sqrt{\alpha(\sigma^{t-1})^{2} + (1-\alpha)(g^{t})^{2}}$

$\sigma^{t}$ is the root mean square of the gradients, with previous gradients being decayed.
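A one-step RMSProp update in NumPy following the formulas above (eps is an extra numerical guard not on the slide):

```python
import numpy as np

def rmsprop_update(w, g, sigma, eta=0.001, alpha=0.9, eps=1e-8):
    # sigma: running root mean square of past gradients, older ones decayed by alpha
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)
    w = w - eta / (sigma + eps) * g
    return w, sigma
```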
Hard to find the optimal network parameters: on the total loss surface, learning can be very slow at a plateau ($\partial L / \partial w \approx 0$), stuck at a saddle point ($\partial L / \partial w = 0$), or stuck at a local minimum ($\partial L / \partial w = 0$).
(Vanilla) Gradient Descent
Start at position $\theta^{0}$.
Compute gradient $\nabla L(\theta^{0})$; move to $\theta^{1} = \theta^{0} - \eta \nabla L(\theta^{0})$.
Compute gradient $\nabla L(\theta^{1})$; move to $\theta^{2} = \theta^{1} - \eta \nabla L(\theta^{1})$.
……
Stop when $\nabla L(\theta^{t}) \approx 0$.
Momentum
Movement = movement of the last step minus gradient at present.
Start at point $\theta^{0}$ with movement $v^{0} = 0$.
Compute gradient at $\theta^{0}$; movement $v^{1} = \lambda v^{0} - \eta \nabla L(\theta^{0})$; move to $\theta^{1} = \theta^{0} + v^{1}$.
Compute gradient at $\theta^{1}$; movement $v^{2} = \lambda v^{1} - \eta \nabla L(\theta^{1})$; move to $\theta^{2} = \theta^{1} + v^{2}$.
……
The movement is not based on the current gradient only, but also on the previous movement.
$v^{i}$ is actually a weighted sum of all the previous gradients $\nabla L(\theta^{0}), \nabla L(\theta^{1}), \dots, \nabla L(\theta^{i-1})$:
$v^{0} = 0$
$v^{1} = -\eta \nabla L(\theta^{0})$
$v^{2} = -\lambda\eta \nabla L(\theta^{0}) - \eta \nabla L(\theta^{1})$
……
Because of this accumulated movement, the update can keep going even where $\partial L / \partial w = 0$ (plateaus, saddle points, local minima).
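A one-step momentum update in NumPy following the formulas above (the names are illustrative):

```python
import numpy as np

def momentum_update(theta, grad, v, eta=0.01, lam=0.9):
    # movement = previous movement scaled by lam, minus eta times the current gradient
    v = lam * v - eta * grad
    theta = theta + v
    return theta, v
```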
Adam: RMSProp + Momentum — it maintains a decaying average of the gradients (for momentum) and a decaying average of the squared gradients (for RMSProp).
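A one-step Adam update in NumPy combining the two ideas, following the standard Adam algorithm (bias correction included; the hyperparameter defaults are the usual ones, not taken from the slides):

```python
import numpy as np

def adam_update(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # momentum: decayed average of gradients
    v = beta2 * v + (1 - beta2) * g**2     # RMSProp: decayed average of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```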
Recipe of Deep Learning — good results on testing data? Next: early stopping.
Early Stopping
As the number of epochs grows, the loss on the training set keeps decreasing, but the loss on the validation/testing set eventually stops improving; stop training at that point.
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
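Following the Keras FAQ linked above, early stopping is done with the EarlyStopping callback (model, x_train, and y_train are assumed to exist already):

```python
from keras.callbacks import EarlyStopping

# stop when the validation loss has not improved for 2 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

model.fit(x_train, y_train,
          validation_split=0.1,   # hold out part of the training set for validation
          epochs=100,
          callbacks=[early_stopping])
```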
Recipe of Deep Learning — good results on testing data? Next: regularization.
Regularization
Define a new loss function to be minimized: the original loss plus a regularization term,
$L'(\theta) = L(\theta) + \lambda \tfrac{1}{2}\|\theta\|_{2}^{2}$
where $L(\theta)$ is the original loss (e.g. minimize square error, cross entropy, …) and the L2 regularization term is $\|\theta\|_{2}^{2} = (w_1)^2 + (w_2)^2 + \cdots$ (the biases are usually not regularized).
Gradient: $\dfrac{\partial L'}{\partial w} = \dfrac{\partial L}{\partial w} + \lambda w$
Update:
$w^{t+1} \leftarrow w^{t} - \eta \dfrac{\partial L'}{\partial w} = w^{t} - \eta\left(\dfrac{\partial L}{\partial w} + \lambda w^{t}\right) = (1 - \eta\lambda)\, w^{t} - \eta \dfrac{\partial L}{\partial w}$
Since $1 - \eta\lambda$ is slightly smaller than 1, every update first pulls the weight a little closer to zero — this is called weight decay.
L1 regularization: $\|\theta\|_{1} = |w_1| + |w_2| + \cdots$
Update:
$w^{t+1} \leftarrow w^{t} - \eta\left(\dfrac{\partial L}{\partial w} + \lambda\,\mathrm{sgn}(w^{t})\right) = w^{t} - \eta \dfrac{\partial L}{\partial w} - \eta\lambda\,\mathrm{sgn}(w^{t})$
L1 always deletes a fixed amount $\eta\lambda$ from the magnitude of $w^{t}$, whereas L2 shrinks $w^{t}$ by a fixed proportion. See the update-rule sketch below.
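A NumPy sketch of the two update rules above, for a single parameter tensor (the names are illustrative):

```python
import numpy as np

def l2_update(w, grad, eta=0.01, lam=1e-4):
    # weight decay: shrink w by a factor (1 - eta*lam), then take a gradient step
    return (1 - eta * lam) * w - eta * grad

def l1_update(w, grad, eta=0.01, lam=1e-4):
    # L1: subtract a fixed amount eta*lam in the direction of sign(w)
    return w - eta * grad - eta * lam * np.sign(w)
```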
Regularization - Weight Decay
• Our brain also prunes out the useless links between neurons.
Recipe of Deep Learning — good results on testing data? Next: dropout.
Dropout
Training: before each update, every neuron has a p% chance of being dropped, so each update trains a thinner network.
Testing: no dropout. If the dropout rate at training is p%, multiply all the weights by (1 − p%). For example, with a dropout rate of 50%, a weight trained to w = 1 is set to w = 0.5 for testing.
Dropout - Intuitive Reason
Training with dropout is like practicing with weights tied to your legs; at testing time the weights are removed, so you perform much better.
Dropout - Intuitive Reason
Each neuron is trained knowing its partners may be dropped: "my partner may slack off, so I have to do the job well myself."
Dropout is a kind of ensemble.
Training: each minibatch trains a different network sampled by dropout; with M neurons there are $2^{M}$ possible networks, and they share parameters.
Testing: ideally we would average the outputs $y^{1}, y^{2}, y^{3}, \dots$ of all these networks, but that is impractical. Instead we use the full network with all the weights multiplied by (1 − p%), and its output is approximately equal to that average.
Why does multiplying the weights by (1 − p%) work? Consider a single linear neuron $z = w_1 x_1 + w_2 x_2$ with dropout rate 50%. The four possible dropout patterns give $z = w_1 x_1 + w_2 x_2$, $z = w_2 x_2$, $z = w_1 x_1$, and $z = 0$; their average is $z = \tfrac{1}{2} w_1 x_1 + \tfrac{1}{2} w_2 x_2$, exactly the output of the full network with the weights halved.
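A minimal NumPy sketch of this dropout scheme (mask neurons at training time, scale the weights at testing time; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # training: each neuron's activation is dropped (set to 0) with probability p
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test_weights(w, p=0.5):
    # testing: no dropout, but every weight is multiplied by (1 - p),
    # e.g. with p = 0.5 a trained weight of 1 becomes 0.5
    return w * (1 - p)
```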
Recipe of Deep Learning (recap)
Step 1: define a set of functions (a neural network); Step 2: goodness of function; Step 3: pick the best function. If the results on the training data are bad, go back and improve training; if the results on the testing data are bad, it is overfitting — use different approaches for the two problems.
Try another task: document classification.
The machine reads a news article and decides whether it is about politics (政治), economics (經濟), sports (體育), or finance (財經), e.g. from words such as "stock" or "president" appearing in the document.
http://top-breaking-news.com/
Live Demo