
Tips for Deep Learning

Recipe of Deep Learning
Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function
→ Good results on training data? If NO, go back and revise the three steps.
→ If YES: good results on testing data? If NO, it is overfitting; go back and revise the three steps.
→ If YES, done.
Do not always blame Overfitting
• If a deeper network gives worse results on the testing data, do not immediately call it overfitting: check the training data first.
• If the error on the training data is also high, the network is simply not well trained, not overfitted.
(Figure: error curves on training data and testing data for networks of different depth.)
• Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
Recipe of Deep Learning
• Different approaches for different problems.
• e.g. dropout is for getting good results on testing data; it does not help results on training data.
Recipe of Deep Learning
• For good results on testing data: Early Stopping, Regularization, Dropout
• For good results on training data: New activation function, Adaptive Learning Rate
Hard to get the power of Deep …
• Results on training data: deeper usually does not imply better.


Vanishing Gradient Problem
(Figure: a deep network with inputs x1 … xN and outputs y1 … yM.)
• Layers near the input: smaller gradients, learn very slowly, still almost random.
• Layers near the output: larger gradients, learn very fast, already converged (based on the nearly random lower layers).
Vanishing Gradient Problem
• Intuitive way to compute the derivative of the cost with respect to a weight: perturb the weight and see how the cost changes,
  $$\frac{\partial C}{\partial w} \approx \frac{\Delta C}{\Delta w}$$
• Add a large $\Delta w$ to a weight near the input. Each sigmoid layer squashes the change, so by the time it reaches the output the change $\Delta C$ in the cost is small.
• A large $\Delta w$ producing only a small $\Delta C$ means the gradient for the weights near the input is small: those layers learn slowly.
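A minimal NumPy sketch (not from the slides) of this perturbation argument: it estimates $\partial C/\partial w \approx \Delta C/\Delta w$ for one weight in the first layer and one in the last layer of a randomly initialized sigmoid network. The layer sizes, loss, and initialization are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(ws, x):
    """Pass x through a stack of fully connected sigmoid layers."""
    a = x
    for w in ws:
        a = sigmoid(w @ a)
    return a

rng = np.random.default_rng(0)
n_layers, width = 10, 20
ws = [rng.normal(0.0, 1.0, (width, width)) for _ in range(n_layers)]
x = rng.normal(0.0, 1.0, width)
target = np.zeros(width)

def cost(ws):
    y = forward(ws, x)
    return 0.5 * np.sum((y - target) ** 2)

# Perturb a single weight in the first layer and in the last layer,
# and estimate dC/dw by the resulting change in the cost.
delta = 1e-3
for layer in (0, n_layers - 1):
    ws_pert = [w.copy() for w in ws]
    ws_pert[layer][0, 0] += delta
    print(f"layer {layer}: dC/dw ~= {(cost(ws_pert) - cost(ws)) / delta:.2e}")
# The first-layer estimate is typically orders of magnitude smaller:
# the vanishing gradient problem.
```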
ReLU
• Rectified Linear Unit (ReLU): $a = z$ for $z > 0$, $a = 0$ for $z \le 0$.
• Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to an infinite number of sigmoids with different biases
  4. Handles the vanishing gradient problem
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
ReLU
• For a given input, the neurons whose ReLU output is 0 contribute nothing and can be removed from the network.
• What remains is a thinner, linear network: every remaining neuron operates on the $a = z$ part of ReLU, so the gradients do not get smaller as they propagate backwards.
ReLU - variant
• Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$
• Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent
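A small NumPy sketch (my own, not from the slides) of the three activations just described; the test values are arbitrary.

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 * z otherwise
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # Same shape as leaky ReLU, but alpha is a parameter that would be
    # updated by gradient descent together with the network weights.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.1))
```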
Maxout (ReLU is a special case of Maxout)
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The pre-activation values of a layer are grouped, and each group outputs only its maximum.
  Example: for inputs x1, x2, the first layer computes the values 5, 7, −1, 1; taking the max within each pair gives 7 and 1. The next layer computes 1, 2, 4, 3; the maxes are 2 and 4.
• You can have more than 2 elements in a group.


Maxout (ReLU is a special case of Maxout)
• ReLU: $z = wx + b$, $a = \max(z, 0)$.
• Maxout with a group of two: $z_1 = wx + b$, $z_2 = 0 \cdot x + 0 = 0$, $a = \max\{z_1, z_2\}$.
• With the second element's weight and bias fixed to 0, the maxout unit computes exactly the same function as ReLU.
Maxout (more than ReLU)
• Maxout with a group of two: $z_1 = wx + b$, $z_2 = w'x + b'$, $a = \max\{z_1, z_2\}$.
• Since $w'$ and $b'$ are learnable, the activation function itself is learned and need not look like ReLU.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• How many pieces it has depends on how many elements are in a group: 2 elements give 2 pieces, 3 elements give 3 pieces (see the sketch below).
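A minimal NumPy sketch (my own) of a single maxout layer as described above; the layer sizes and group size are arbitrary.

```python
import numpy as np

def maxout_layer(x, W, b, group_size):
    """One maxout layer: compute all pre-activations z = W @ x + b,
    split them into groups, and keep only the maximum of each group."""
    z = W @ x + b                    # shape: (n_units * group_size,)
    z = z.reshape(-1, group_size)    # shape: (n_units, group_size)
    return z.max(axis=1)             # shape: (n_units,)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
n_units, group_size = 3, 2
W = rng.normal(size=(n_units * group_size, 4))
b = rng.normal(size=n_units * group_size)
print(maxout_layer(x, W, b, group_size))  # 3 outputs, each the max of a group of 2
```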


Maxout - Training
• Given a training example x, we know which z in each group would be the max, i.e. which element every max operation selects.
(Figure: a two-layer maxout network with the selected element of each group marked.)
Maxout - Training
• Keeping only the selected element of each group, the max operations disappear and what is left is a thin, linear network; train that network on this example.
• Different examples select different elements, so a different thin, linear network is trained for each example, and over the whole training set every parameter still gets updated.
Recipe of Deep Learning
• For good results on testing data: Early Stopping, Regularization, Dropout
• For good results on training data: New activation function, Adaptive Learning Rate (next)
Review: Adagrad
• Different parameters need different learning rates: smaller in steep directions, larger in flat directions.
• Adagrad update:
  $$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^{i})^{2}}}\, g^{t}$$
• The accumulated first derivatives are used to estimate the second derivative.
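A minimal NumPy sketch (mine, not from the slides) of the Adagrad update above; the toy quadratic loss is only for illustration.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, steps=100, eps=1e-8):
    """Adagrad: divide the learning rate by the root of the sum of all
    past squared gradients, separately for each parameter."""
    w = np.array(w0, dtype=float)
    sum_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g
    return w

# Toy loss L(w) = 0.5 * (w1^2 + 10 * w2^2), so the gradient is [w1, 10 * w2]:
# the w2 direction is steep and automatically gets a smaller effective step.
grad = lambda w: np.array([w[0], 10.0 * w[1]])
print(adagrad(grad, w0=[3.0, 3.0]))
```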


RMSProp
• The error surface can be very complex when training a neural network: even for the same parameter, sometimes a smaller learning rate is needed and sometimes a larger one, depending on where we are on the surface.
RMSProp
$$w^{1} \leftarrow w^{0} - \frac{\eta}{\sigma^{0}} g^{0} \qquad \sigma^{0} = g^{0}$$
$$w^{2} \leftarrow w^{1} - \frac{\eta}{\sigma^{1}} g^{1} \qquad \sigma^{1} = \sqrt{\alpha (\sigma^{0})^{2} + (1-\alpha)(g^{1})^{2}}$$
$$w^{3} \leftarrow w^{2} - \frac{\eta}{\sigma^{2}} g^{2} \qquad \sigma^{2} = \sqrt{\alpha (\sigma^{1})^{2} + (1-\alpha)(g^{2})^{2}}$$
$$\cdots$$
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}} g^{t} \qquad \sigma^{t} = \sqrt{\alpha (\sigma^{t-1})^{2} + (1-\alpha)(g^{t})^{2}}$$
$\sigma^{t}$ is the root mean square of the gradients, with previous gradients being decayed (by the factor $\alpha$).
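A minimal NumPy sketch (mine) of the RMSProp recursion above, using the same toy quadratic loss as the Adagrad sketch.

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, alpha=0.9, steps=200, eps=1e-8):
    """RMSProp: like Adagrad, but older squared gradients are decayed by alpha,
    so sigma tracks a moving root mean square instead of a growing sum."""
    w = np.array(w0, dtype=float)
    sigma_sq = None
    for _ in range(steps):
        g = grad_fn(w)
        if sigma_sq is None:
            sigma_sq = g ** 2                                   # sigma^0 = g^0
        else:
            sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
        w -= eta / (np.sqrt(sigma_sq) + eps) * g
    return w

grad = lambda w: np.array([w[0], 10.0 * w[1]])  # toy quadratic loss, as before
print(rmsprop(grad, w0=[3.0, 3.0]))
```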
Hard to find optimal network parameters
• The total loss as a function of a network parameter w is not convex: gradient descent can be very slow at a plateau, get stuck at a saddle point, or get stuck at a local minimum. At all of these points the gradient is small or exactly zero.
(Figure: total loss versus the value of a network parameter w.)

In physical world ……
• Momentum: a ball rolling down a slope does not stop the moment the surface becomes flat; its accumulated momentum keeps it moving.
• How about putting this phenomenon into gradient descent?
Review: Vanilla Gradient Descent
• Start at position $\theta^{0}$.
• Compute the gradient at $\theta^{0}$; move to $\theta^{1} = \theta^{0} - \eta \nabla L(\theta^{0})$.
• Compute the gradient at $\theta^{1}$; move to $\theta^{2} = \theta^{1} - \eta \nabla L(\theta^{1})$.
• ……
• Stop when the gradient $\nabla L(\theta^{t})$ is (close to) zero.
• Each movement is simply $-\eta$ times the gradient at the current position.
Momentum
• Movement: λ × (movement of last step) − η × (gradient at present).
• Start at point $\theta^{0}$, with initial movement $v^{0} = 0$.
• Compute the gradient at $\theta^{0}$: movement $v^{1} = \lambda v^{0} - \eta \nabla L(\theta^{0})$; move to $\theta^{1} = \theta^{0} + v^{1}$.
• Compute the gradient at $\theta^{1}$: movement $v^{2} = \lambda v^{1} - \eta \nabla L(\theta^{1})$; move to $\theta^{2} = \theta^{1} + v^{2}$.
• ……
• The movement is not based only on the current gradient, but also on the previous movement.
Momentum
• $v^{i}$ is actually a weighted sum of all the previous gradients $\nabla L(\theta^{0}), \nabla L(\theta^{1}), \ldots, \nabla L(\theta^{i-1})$:
  $$v^{0} = 0, \qquad v^{1} = -\eta \nabla L(\theta^{0}), \qquad v^{2} = -\lambda\eta \nabla L(\theta^{0}) - \eta \nabla L(\theta^{1}), \qquad \ldots$$
• The movement is not based only on the current gradient, but also on the previous movement.
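A minimal NumPy sketch (mine) of gradient descent with momentum as defined above, on the same toy quadratic loss used in the earlier sketches.

```python
import numpy as np

def gd_with_momentum(grad_fn, theta0, eta=0.01, lam=0.9, steps=300):
    """v = lam * v - eta * grad;  theta = theta + v  (with v^0 = 0)."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = lam * v - eta * g      # weighted sum of all previous gradients
        theta = theta + v
    return theta

grad = lambda th: np.array([th[0], 10.0 * th[1]])  # toy quadratic loss, as before
print(gd_with_momentum(grad, theta0=[3.0, 3.0]))
```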
Momentum
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.
• At a plateau or a local minimum, $\partial L / \partial w = 0$, so the negative gradient alone would stop the update; the real movement (negative of $\partial L / \partial w$ plus momentum) can still carry the parameters forward and possibly over the barrier.
Adam: RMSProp + Momentum
• In the Adam update, one decay rate ($\beta_1$) controls the momentum term and another ($\beta_2$) controls the RMSProp term.
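A minimal NumPy sketch (mine, not from the slides) of the Adam update, combining the momentum and RMSProp ideas above with the usual bias correction; the toy gradient is the same as in the previous sketches.

```python
import numpy as np

def adam(grad_fn, w0, eta=0.01, beta1=0.9, beta2=0.999, steps=200, eps=1e-8):
    """Adam = RMSProp + Momentum.
    m: decayed moving average of gradients          -> momentum part (beta1)
    v: decayed moving average of squared gradients  -> RMSProp part (beta2)"""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([w[0], 10.0 * w[1]])  # toy quadratic loss, as before
print(adam(grad, w0=[3.0, 3.0]))
```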
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping (next), Regularization, Dropout
Early Stopping
• As the number of epochs grows, the loss on the training set keeps decreasing, but the loss on the testing set eventually starts to increase.
• Since the real testing set is not available during training, use a validation set to decide where to stop: stop at the epoch where the validation loss is lowest.
• Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
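A minimal sketch of early stopping with the Keras callback that the FAQ link above describes, written here against tf.keras; the model architecture and the training arrays are placeholders of my own, not from the slides.

```python
from tensorflow import keras

# Placeholder model: a small classifier on 784-dimensional inputs.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch
)

# x_train / y_train are placeholders for the training data:
# model.fit(x_train, y_train, epochs=100,
#           validation_split=0.2, callbacks=[early_stopping])
```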
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping, Regularization (next), Dropout
Regularization
• New loss function to be minimized: find a set of weights that not only minimizes the original loss but is also close to zero.
  $$L'(\theta) = L(\theta) + \lambda \tfrac{1}{2} \|\theta\|_{2}^{2}$$
• $L(\theta)$ is the original loss (e.g. square error, cross entropy), $\theta = \{w_1, w_2, \ldots\}$, and the L2 regularization term is
  $$\|\theta\|_{2}^{2} = (w_1)^2 + (w_2)^2 + \cdots$$
• The regularization term usually does not include the biases.
Regularization (L2)
• New loss function: $L'(\theta) = L(\theta) + \lambda \tfrac{1}{2} \|\theta\|_{2}^{2}$, so the gradient is
  $$\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda w$$
• Update:
  $$w^{t+1} = w^{t} - \eta \left( \frac{\partial L}{\partial w} + \lambda w^{t} \right) = (1 - \eta\lambda)\, w^{t} - \eta \frac{\partial L}{\partial w}$$
• Since $1 - \eta\lambda$ is slightly smaller than 1, every update first shrinks the weight towards zero (weight decay) and then applies the usual gradient step.
Regularization (L1)
• L1 regularization: $L'(\theta) = L(\theta) + \lambda \|\theta\|_{1}$, with $\|\theta\|_{1} = |w_1| + |w_2| + \cdots$, so the gradient is
  $$\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda\, \mathrm{sgn}(w)$$
• Update:
  $$w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w} - \eta\lambda\, \mathrm{sgn}(w^{t})$$
• L1 always subtracts (deletes) a fixed amount from every weight, whereas L2 multiplies the weight by $1 - \eta\lambda$; both push the weights towards zero, as in the sketch below.
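A minimal NumPy sketch (mine) comparing the two update rules just derived; to isolate the regularization effect, the original loss is pretended to be flat here, so its gradient is zero.

```python
import numpy as np

def l2_update(w, grad, eta=0.1, lam=0.01):
    # L2: w <- (1 - eta*lam) * w - eta * dL/dw   (multiplicative shrink)
    return (1 - eta * lam) * w - eta * grad

def l1_update(w, grad, eta=0.1, lam=0.01):
    # L1: w <- w - eta * dL/dw - eta*lam*sgn(w)  (subtract a fixed amount)
    return w - eta * grad - eta * lam * np.sign(w)

grad = np.zeros(3)                   # pretend the original loss is flat here

w = np.array([2.0, -0.003, 0.5])
for _ in range(50):
    w = l2_update(w, grad)
print("L2 after 50 steps:", w)       # every weight shrinks by the same factor

w = np.array([2.0, -0.003, 0.5])
for _ in range(50):
    w = l1_update(w, grad)
print("L1 after 50 steps:", w)       # small weights reach exactly zero and stay there
```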
Regularization - Weight Decay
• Our brain prunes away the useless links between neurons.
• Doing the same thing to the machine's "brain", by decaying useless weights towards zero, improves the performance.
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping, Regularization, Dropout (next)
Dropout
Training:
• Each time before updating the parameters, each neuron has a p% chance to be dropped out.
• The structure of the network is changed: the dropped neurons are removed, leaving a thinner network, and this new network is used for the update.
• For each mini-batch, we resample which neurons are dropped out.
Dropout
Testing:
• No dropout at testing time.
• If the dropout rate at training is p%, all the weights are multiplied by (1 − p%).
• Example: with a dropout rate of 50%, a weight learned as w during training is used as 0.5w at testing time (see the sketch below).
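A minimal NumPy sketch (mine) of the training-time masking and testing-time weight scaling described above. Note that most libraries implement the equivalent "inverted dropout", which rescales at training time instead; this sketch follows the slides' formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training: each activation is dropped (set to 0) with probability p.
    The mask is resampled for every mini-batch / parameter update."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test_weights(w, p):
    """Testing: no dropout, but all weights are multiplied by (1 - p)."""
    return w * (1.0 - p)

a = np.ones(8)
print(dropout_train(a, p=0.5))                              # roughly half zeroed
print(dropout_test_weights(np.array([1.0, -2.0]), p=0.5))   # [0.5, -1.0]
```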
Dropout - Intuitive Reason
• Training with dropout is like practising with heavy weights tied to your legs.
• Testing without dropout is like taking the weights off: you become much stronger.
Dropout - Intuitive Reason
• "My partner will slack off, so I have to work hard."
• When people team up, if everyone expects their partner to do the work, nothing gets done in the end.
• However, if you know your partner may drop out, you will do better yourself.
• When testing, no one actually drops out, so the results are good in the end.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% is the dropout rate) when testing?
• Training of dropout (assume the dropout rate is 50%): on average half of the inputs to a neuron are dropped, so the expected pre-activation is about half of $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$.
• Testing of dropout (no dropout): using the trained weights directly gives $z' \approx 2z$.
• Multiplying every weight by $0.5$ (i.e. $0.5 w_1, 0.5 w_2, 0.5 w_3, 0.5 w_4$) gives $z' \approx z$, matching what the neuron saw during training.
Dropout is a kind of ensemble.
• Ensemble (training): split the training set into several subsets (set 1, set 2, set 3, set 4) and train a bunch of networks with different structures, one on each subset.
Dropout is a kind of ensemble.
• Ensemble (testing): feed the testing data x into every network (network 1, 2, 3, 4), obtain the outputs y1, y2, y3, y4, and average them.
Dropout is a kind of ensemble.
• Training of dropout: with M neurons, there are 2^M possible thinned networks.
• Each mini-batch (mini-batch 1, 2, 3, 4, …) is used to train one of these networks.
• Some parameters in the networks are shared, so each weight is still trained by many mini-batches.
Dropout is a kind of ensemble.
• Testing of dropout: ideally we would feed the testing data x into all 2^M networks, obtain y1, y2, y3, …, and average them, but that is infeasible.
• Instead, use the full network with all the weights multiplied by (1 − p%); its output is approximately equal to the average y of the ensemble.
Testing of Dropout
• For a single linear neuron $z = w_1 x_1 + w_2 x_2$ with dropout on its two inputs, there are four possible thinned networks:
  $$z = w_1 x_1 + w_2 x_2, \qquad z = w_2 x_2, \qquad z = w_1 x_1, \qquad z = 0$$
• Their average is $\tfrac{1}{2} w_1 x_1 + \tfrac{1}{2} w_2 x_2$, which is exactly the full network with its weights multiplied by $1/2$ (i.e. by $1 - 50\%$); the sketch below checks this numerically.
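A small NumPy check (mine) of the claim above: for a linear neuron, the average over all thinned networks equals the full network with weights scaled by 1 − p. With nonlinear activations the equality is only approximate.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
w = rng.normal(size=2)
x = rng.normal(size=2)

# Average the outputs of all 2^2 thinned networks (each input kept or dropped) ...
outputs = [np.dot(w * np.array(mask), x) for mask in product([0, 1], repeat=2)]
ensemble_avg = np.mean(outputs)

# ... and compare with the full network whose weights are scaled by 1 - p (p = 0.5).
scaled = np.dot(0.5 * w, x)
print(ensemble_avg, scaled)   # identical for a linear neuron
```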
Recipe of Deep Learning
Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function
→ Good results on training data? If NO, go back and revise the three steps.
→ If YES: good results on testing data? If NO, it is overfitting; go back and revise the three steps.
→ If YES, done.
