
Tips for Deep Learning

Recipe of Deep Learning
Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function
→ Good results on training data? If NO, go back and revise the three steps.
→ If YES: good results on testing data? If NO, it is overfitting; go back and revise the three steps.
→ If YES, done.
Do not always blame Overfitting
• If a deeper network gives worse results on the testing data, do not immediately call it overfitting: check the training data first.
• If the error on the training data is also high, the network is simply not well trained, not overfitted.
(Figure: error curves on training data and testing data for networks of different depth.)
• Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
Recipe of Deep Learning
• Different approaches for different problems.
• e.g. dropout is for getting good results on testing data; it does not help results on training data.
Recipe of Deep Learning
• For good results on testing data: Early Stopping, Regularization, Dropout
• For good results on training data: New activation function, Adaptive Learning Rate
Hard to get the power of Deep …
• Results on training data: deeper usually does not imply better.


Vanishing Gradient Problem
(Figure: a deep network with inputs x1 … xN and outputs y1 … yM.)
• Layers near the input: smaller gradients, learn very slowly, still almost random.
• Layers near the output: larger gradients, learn very fast, already converged (based on the nearly random lower layers).
Vanishing Gradient Problem
• Intuitive way to compute the derivative of the cost with respect to a weight: perturb the weight and see how the cost changes,
  $$\frac{\partial C}{\partial w} \approx \frac{\Delta C}{\Delta w}$$
• Add a large $\Delta w$ to a weight near the input. Each sigmoid layer squashes the change, so by the time it reaches the output the change $\Delta C$ in the cost is small.
• A large $\Delta w$ producing only a small $\Delta C$ means the gradient for the weights near the input is small: those layers learn slowly.
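A minimal NumPy sketch (not from the slides) of this perturbation argument: it estimates $\partial C/\partial w \approx \Delta C/\Delta w$ for one weight in the first layer and one in the last layer of a randomly initialized sigmoid network. The layer sizes, loss, and initialization are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(ws, x):
    """Pass x through a stack of fully connected sigmoid layers."""
    a = x
    for w in ws:
        a = sigmoid(w @ a)
    return a

rng = np.random.default_rng(0)
n_layers, width = 10, 20
ws = [rng.normal(0.0, 1.0, (width, width)) for _ in range(n_layers)]
x = rng.normal(0.0, 1.0, width)
target = np.zeros(width)

def cost(ws):
    y = forward(ws, x)
    return 0.5 * np.sum((y - target) ** 2)

# Perturb a single weight in the first layer and in the last layer,
# and estimate dC/dw by the resulting change in the cost.
delta = 1e-3
for layer in (0, n_layers - 1):
    ws_pert = [w.copy() for w in ws]
    ws_pert[layer][0, 0] += delta
    print(f"layer {layer}: dC/dw ~= {(cost(ws_pert) - cost(ws)) / delta:.2e}")
# The first-layer estimate is typically orders of magnitude smaller:
# the vanishing gradient problem.
```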
ReLU
• Rectified Linear Unit (ReLU): $a = z$ for $z > 0$, $a = 0$ for $z \le 0$.
• Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to an infinite number of sigmoids with different biases
  4. Handles the vanishing gradient problem
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
ReLU
• For a given input, the neurons whose ReLU output is 0 contribute nothing and can be removed from the network.
• What remains is a thinner, linear network: every remaining neuron operates on the $a = z$ part of ReLU, so the gradients do not get smaller as they propagate backwards.
ReLU - variant
• Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$
• Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent
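A small NumPy sketch (my own, not from the slides) of the three activations just described; the test values are arbitrary.

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 * z otherwise
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # Same shape as leaky ReLU, but alpha is a parameter that would be
    # updated by gradient descent together with the network weights.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.1))
```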
Maxout (ReLU is a special case of Maxout)
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The pre-activation values of a layer are grouped, and each group outputs only its maximum.
  Example: for inputs x1, x2, the first layer computes the values 5, 7, −1, 1; taking the max within each pair gives 7 and 1. The next layer computes 1, 2, 4, 3; the maxes are 2 and 4.
• You can have more than 2 elements in a group.


Maxout (ReLU is a special case of Maxout)
• ReLU: $z = wx + b$, $a = \max(z, 0)$.
• Maxout with a group of two: $z_1 = wx + b$, $z_2 = 0 \cdot x + 0 = 0$, $a = \max\{z_1, z_2\}$.
• With the second element's weight and bias fixed to 0, the maxout unit computes exactly the same function as ReLU.
Maxout (more than ReLU)
• Maxout with a group of two: $z_1 = wx + b$, $z_2 = w'x + b'$, $a = \max\{z_1, z_2\}$.
• Since $w'$ and $b'$ are learnable, the activation function itself is learned and need not look like ReLU.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• How many pieces it has depends on how many elements are in a group: 2 elements give 2 pieces, 3 elements give 3 pieces (see the sketch below).
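A minimal NumPy sketch (my own) of a single maxout layer as described above; the layer sizes and group size are arbitrary.

```python
import numpy as np

def maxout_layer(x, W, b, group_size):
    """One maxout layer: compute all pre-activations z = W @ x + b,
    split them into groups, and keep only the maximum of each group."""
    z = W @ x + b                    # shape: (n_units * group_size,)
    z = z.reshape(-1, group_size)    # shape: (n_units, group_size)
    return z.max(axis=1)             # shape: (n_units,)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
n_units, group_size = 3, 2
W = rng.normal(size=(n_units * group_size, 4))
b = rng.normal(size=n_units * group_size)
print(maxout_layer(x, W, b, group_size))  # 3 outputs, each the max of a group of 2
```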


Maxout - Training
• Given a training example x, we know which z in each group would be the max, i.e. which element every max operation selects.
(Figure: a two-layer maxout network with the selected element of each group marked.)
Maxout - Training
• Keeping only the selected element of each group, the max operations disappear and what is left is a thin, linear network; train that network on this example.
• Different examples select different elements, so a different thin, linear network is trained for each example, and over the whole training set every parameter still gets updated.
Recipe of Deep Learning
• For good results on testing data: Early Stopping, Regularization, Dropout
• For good results on training data: New activation function, Adaptive Learning Rate (next)
Review: Adagrad
• Different parameters need different learning rates: smaller in steep directions, larger in flat directions.
• Adagrad update:
  $$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^{i})^{2}}}\, g^{t}$$
• The accumulated first derivatives are used to estimate the second derivative.
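A minimal NumPy sketch (mine, not from the slides) of the Adagrad update above; the toy quadratic loss is only for illustration.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, steps=100, eps=1e-8):
    """Adagrad: divide the learning rate by the root of the sum of all
    past squared gradients, separately for each parameter."""
    w = np.array(w0, dtype=float)
    sum_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g
    return w

# Toy loss L(w) = 0.5 * (w1^2 + 10 * w2^2), so the gradient is [w1, 10 * w2]:
# the w2 direction is steep and automatically gets a smaller effective step.
grad = lambda w: np.array([w[0], 10.0 * w[1]])
print(adagrad(grad, w0=[3.0, 3.0]))
```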


RMSProp
• The error surface can be very complex when training a neural network: even for the same parameter, sometimes a smaller learning rate is needed and sometimes a larger one, depending on where we are on the surface.
RMSProp
$$w^{1} \leftarrow w^{0} - \frac{\eta}{\sigma^{0}} g^{0} \qquad \sigma^{0} = g^{0}$$
$$w^{2} \leftarrow w^{1} - \frac{\eta}{\sigma^{1}} g^{1} \qquad \sigma^{1} = \sqrt{\alpha (\sigma^{0})^{2} + (1-\alpha)(g^{1})^{2}}$$
$$w^{3} \leftarrow w^{2} - \frac{\eta}{\sigma^{2}} g^{2} \qquad \sigma^{2} = \sqrt{\alpha (\sigma^{1})^{2} + (1-\alpha)(g^{2})^{2}}$$
$$\cdots$$
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}} g^{t} \qquad \sigma^{t} = \sqrt{\alpha (\sigma^{t-1})^{2} + (1-\alpha)(g^{t})^{2}}$$
$\sigma^{t}$ is the root mean square of the gradients, with previous gradients being decayed (by the factor $\alpha$).
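A minimal NumPy sketch (mine) of the RMSProp recursion above, using the same toy quadratic loss as the Adagrad sketch.

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, alpha=0.9, steps=200, eps=1e-8):
    """RMSProp: like Adagrad, but older squared gradients are decayed by alpha,
    so sigma tracks a moving root mean square instead of a growing sum."""
    w = np.array(w0, dtype=float)
    sigma_sq = None
    for _ in range(steps):
        g = grad_fn(w)
        if sigma_sq is None:
            sigma_sq = g ** 2                                   # sigma^0 = g^0
        else:
            sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
        w -= eta / (np.sqrt(sigma_sq) + eps) * g
    return w

grad = lambda w: np.array([w[0], 10.0 * w[1]])  # toy quadratic loss, as before
print(rmsprop(grad, w0=[3.0, 3.0]))
```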
Hard to find optimal network parameters
• The total loss as a function of a network parameter w is not convex: gradient descent can be very slow at a plateau, get stuck at a saddle point, or get stuck at a local minimum. At all of these points the gradient is small or exactly zero.
(Figure: total loss versus the value of a network parameter w.)

In physical world ……
• Momentum: a ball rolling down a slope does not stop the moment the surface becomes flat; its accumulated momentum keeps it moving.
• How about putting this phenomenon into gradient descent?
Review: Vanilla Gradient Descent
• Start at position $\theta^{0}$.
• Compute the gradient at $\theta^{0}$; move to $\theta^{1} = \theta^{0} - \eta \nabla L(\theta^{0})$.
• Compute the gradient at $\theta^{1}$; move to $\theta^{2} = \theta^{1} - \eta \nabla L(\theta^{1})$.
• ……
• Stop when the gradient $\nabla L(\theta^{t})$ is (close to) zero.
• Each movement is simply $-\eta$ times the gradient at the current position.
Momentum
• Movement: λ × (movement of last step) − η × (gradient at present).
• Start at point $\theta^{0}$, with initial movement $v^{0} = 0$.
• Compute the gradient at $\theta^{0}$: movement $v^{1} = \lambda v^{0} - \eta \nabla L(\theta^{0})$; move to $\theta^{1} = \theta^{0} + v^{1}$.
• Compute the gradient at $\theta^{1}$: movement $v^{2} = \lambda v^{1} - \eta \nabla L(\theta^{1})$; move to $\theta^{2} = \theta^{1} + v^{2}$.
• ……
• The movement is not based only on the current gradient, but also on the previous movement.
Momentum
• $v^{i}$ is actually a weighted sum of all the previous gradients $\nabla L(\theta^{0}), \nabla L(\theta^{1}), \ldots, \nabla L(\theta^{i-1})$:
  $$v^{0} = 0, \qquad v^{1} = -\eta \nabla L(\theta^{0}), \qquad v^{2} = -\lambda\eta \nabla L(\theta^{0}) - \eta \nabla L(\theta^{1}), \qquad \ldots$$
• The movement is not based only on the current gradient, but also on the previous movement.
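A minimal NumPy sketch (mine) of gradient descent with momentum as defined above, on the same toy quadratic loss used in the earlier sketches.

```python
import numpy as np

def gd_with_momentum(grad_fn, theta0, eta=0.01, lam=0.9, steps=300):
    """v = lam * v - eta * grad;  theta = theta + v  (with v^0 = 0)."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = lam * v - eta * g      # weighted sum of all previous gradients
        theta = theta + v
    return theta

grad = lambda th: np.array([th[0], 10.0 * th[1]])  # toy quadratic loss, as before
print(gd_with_momentum(grad, theta0=[3.0, 3.0]))
```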
Momentum
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.
• At a plateau or a local minimum, $\partial L / \partial w = 0$, so the negative gradient alone would stop the update; the real movement (negative of $\partial L / \partial w$ plus momentum) can still carry the parameters forward and possibly over the barrier.
Adam: RMSProp + Momentum
• In the Adam update, one decay rate ($\beta_1$) controls the momentum term and another ($\beta_2$) controls the RMSProp term.
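A minimal NumPy sketch (mine, not from the slides) of the Adam update, combining the momentum and RMSProp ideas above with the usual bias correction; the toy gradient is the same as in the previous sketches.

```python
import numpy as np

def adam(grad_fn, w0, eta=0.01, beta1=0.9, beta2=0.999, steps=200, eps=1e-8):
    """Adam = RMSProp + Momentum.
    m: decayed moving average of gradients          -> momentum part (beta1)
    v: decayed moving average of squared gradients  -> RMSProp part (beta2)"""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([w[0], 10.0 * w[1]])  # toy quadratic loss, as before
print(adam(grad, w0=[3.0, 3.0]))
```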
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping (next), Regularization, Dropout
Early Stopping
• As the number of epochs grows, the loss on the training set keeps decreasing, but the loss on the testing set eventually starts to increase.
• Since the real testing set is not available during training, use a validation set to decide where to stop: stop at the epoch where the validation loss is lowest.
• Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
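A minimal sketch of early stopping with the Keras callback that the FAQ link above describes, written here against tf.keras; the model architecture and the training arrays are placeholders of my own, not from the slides.

```python
from tensorflow import keras

# Placeholder model: a small classifier on 784-dimensional inputs.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch
)

# x_train / y_train are placeholders for the training data:
# model.fit(x_train, y_train, epochs=100,
#           validation_split=0.2, callbacks=[early_stopping])
```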
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping, Regularization (next), Dropout
Regularization
• New loss function to be minimized: find a set of weights that not only minimizes the original loss but is also close to zero.
  $$L'(\theta) = L(\theta) + \lambda \tfrac{1}{2} \|\theta\|_{2}^{2}$$
• $L(\theta)$ is the original loss (e.g. square error, cross entropy), $\theta = \{w_1, w_2, \ldots\}$, and the L2 regularization term is
  $$\|\theta\|_{2}^{2} = (w_1)^2 + (w_2)^2 + \cdots$$
• The regularization term usually does not include the biases.
Regularization (L2)
• New loss function: $L'(\theta) = L(\theta) + \lambda \tfrac{1}{2} \|\theta\|_{2}^{2}$, so the gradient is
  $$\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda w$$
• Update:
  $$w^{t+1} = w^{t} - \eta \left( \frac{\partial L}{\partial w} + \lambda w^{t} \right) = (1 - \eta\lambda)\, w^{t} - \eta \frac{\partial L}{\partial w}$$
• Since $1 - \eta\lambda$ is slightly smaller than 1, every update first shrinks the weight towards zero (weight decay) and then applies the usual gradient step.
Regularization (L1)
• L1 regularization: $L'(\theta) = L(\theta) + \lambda \|\theta\|_{1}$, with $\|\theta\|_{1} = |w_1| + |w_2| + \cdots$, so the gradient is
  $$\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda\, \mathrm{sgn}(w)$$
• Update:
  $$w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w} - \eta\lambda\, \mathrm{sgn}(w^{t})$$
• L1 always subtracts (deletes) a fixed amount from every weight, whereas L2 multiplies the weight by $1 - \eta\lambda$; both push the weights towards zero, as in the sketch below.
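A minimal NumPy sketch (mine) comparing the two update rules just derived; to isolate the regularization effect, the original loss is pretended to be flat here, so its gradient is zero.

```python
import numpy as np

def l2_update(w, grad, eta=0.1, lam=0.01):
    # L2: w <- (1 - eta*lam) * w - eta * dL/dw   (multiplicative shrink)
    return (1 - eta * lam) * w - eta * grad

def l1_update(w, grad, eta=0.1, lam=0.01):
    # L1: w <- w - eta * dL/dw - eta*lam*sgn(w)  (subtract a fixed amount)
    return w - eta * grad - eta * lam * np.sign(w)

grad = np.zeros(3)                   # pretend the original loss is flat here

w = np.array([2.0, -0.003, 0.5])
for _ in range(50):
    w = l2_update(w, grad)
print("L2 after 50 steps:", w)       # every weight shrinks by the same factor

w = np.array([2.0, -0.003, 0.5])
for _ in range(50):
    w = l1_update(w, grad)
print("L1 after 50 steps:", w)       # small weights reach exactly zero and stay there
```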
Regularization - Weight Decay
• Our brain prunes away the useless links between neurons.
• Doing the same thing to the machine's "brain", by decaying useless weights towards zero, improves the performance.
Recipe of Deep Learning
• For good results on training data: New activation function, Adaptive Learning Rate
• For good results on testing data: Early Stopping, Regularization, Dropout (next)
Dropout
Training:
• Each time before updating the parameters, each neuron has a p% chance to be dropped out.
• The structure of the network is changed: the dropped neurons are removed, leaving a thinner network, and this new network is used for the update.
• For each mini-batch, we resample which neurons are dropped out.
Dropout
Testing:
• No dropout at testing time.
• If the dropout rate at training is p%, all the weights are multiplied by (1 − p%).
• Example: with a dropout rate of 50%, a weight learned as w during training is used as 0.5w at testing time (see the sketch below).
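A minimal NumPy sketch (mine) of the training-time masking and testing-time weight scaling described above. Note that most libraries implement the equivalent "inverted dropout", which rescales at training time instead; this sketch follows the slides' formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training: each activation is dropped (set to 0) with probability p.
    The mask is resampled for every mini-batch / parameter update."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test_weights(w, p):
    """Testing: no dropout, but all weights are multiplied by (1 - p)."""
    return w * (1.0 - p)

a = np.ones(8)
print(dropout_train(a, p=0.5))                              # roughly half zeroed
print(dropout_test_weights(np.array([1.0, -2.0]), p=0.5))   # [0.5, -1.0]
```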
Dropout - Intuitive Reason
• Training with dropout is like practising with heavy weights tied to your legs.
• Testing without dropout is like taking the weights off: you become much stronger.
Dropout - Intuitive Reason
• "My partner will slack off, so I have to work hard."
• When people team up, if everyone expects their partner to do the work, nothing gets done in the end.
• However, if you know your partner may drop out, you will do better yourself.
• When testing, no one actually drops out, so the results are good in the end.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% is the dropout rate) when testing?
• Training of dropout (assume the dropout rate is 50%): on average half of the inputs to a neuron are dropped, so the expected pre-activation is about half of $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$.
• Testing of dropout (no dropout): using the trained weights directly gives $z' \approx 2z$.
• Multiplying every weight by $0.5$ (i.e. $0.5 w_1, 0.5 w_2, 0.5 w_3, 0.5 w_4$) gives $z' \approx z$, matching what the neuron saw during training.
Dropout is a kind of ensemble.
• Ensemble (training): split the training set into several subsets (set 1, set 2, set 3, set 4) and train a bunch of networks with different structures, one on each subset.
Dropout is a kind of ensemble.
• Ensemble (testing): feed the testing data x into every network (network 1, 2, 3, 4), obtain the outputs y1, y2, y3, y4, and average them.
Dropout is a kind of ensemble.
• Training of dropout: with M neurons, there are 2^M possible thinned networks.
• Each mini-batch (mini-batch 1, 2, 3, 4, …) is used to train one of these networks.
• Some parameters in the networks are shared, so each weight is still trained by many mini-batches.
Dropout is a kind of ensemble.
• Testing of dropout: ideally we would feed the testing data x into all 2^M networks, obtain y1, y2, y3, …, and average them, but that is infeasible.
• Instead, use the full network with all the weights multiplied by (1 − p%); its output is approximately equal to the average y of the ensemble.
Testing of Dropout
• For a single linear neuron $z = w_1 x_1 + w_2 x_2$ with dropout on its two inputs, there are four possible thinned networks:
  $$z = w_1 x_1 + w_2 x_2, \qquad z = w_2 x_2, \qquad z = w_1 x_1, \qquad z = 0$$
• Their average is $\tfrac{1}{2} w_1 x_1 + \tfrac{1}{2} w_2 x_2$, which is exactly the full network with its weights multiplied by $1/2$ (i.e. by $1 - 50\%$); the sketch below checks this numerically.
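A small NumPy check (mine) of the claim above: for a linear neuron, the average over all thinned networks equals the full network with weights scaled by 1 − p. With nonlinear activations the equality is only approximate.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
w = rng.normal(size=2)
x = rng.normal(size=2)

# Average the outputs of all 2^2 thinned networks (each input kept or dropped) ...
outputs = [np.dot(w * np.array(mask), x) for mask in product([0, 1], repeat=2)]
ensemble_avg = np.mean(outputs)

# ... and compare with the full network whose weights are scaled by 1 - p (p = 0.5).
scaled = np.dot(0.5 * w, x)
print(ensemble_avg, scaled)   # identical for a linear neuron
```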
Recipe of Deep Learning
Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function
→ Good results on training data? If NO, go back and revise the three steps.
→ If YES: good results on testing data? If NO, it is overfitting; go back and revise the three steps.
→ If YES, done.
