Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
DEEP LEARNING WORKSHOP
Dublin City University
27-28 April 2017
Training Deep Networks with Backprop
Day 1 Lecture 4
1
Recall: multi-layer perceptron (slides 2-4: figures not preserved in this export)
Fitting deep networks to data
We need an algorithm to find good weight configurations.
This is an unconstrained continuous optimization problem.
We can use standard iterative optimization methods like gradient descent.
To use gradient descent, we need a way to find the gradient of the loss with
respect to the parameters (weights and biases) of the network.
Error backpropagation is an efficient algorithm for finding these gradients.
Basically an application of the multivariate chain rule and dynamic programming.
In practice, computing the full gradient is expensive. Backpropagation is typically
used with stochastic gradient descent.
Gradient descent
If we had a way to compute the gradient of the loss with respect to the parameters, we could use gradient descent to optimize them.
6
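As a concrete illustration (my own sketch, not from the slides), the plain full-batch gradient descent update w ← w − α ∇L(w) looks like this; `loss_grad` is a hypothetical function returning the gradient of the loss over the whole training set:

```python
import numpy as np

def gradient_descent(w, loss_grad, lr=0.1, num_steps=100):
    """Full-batch gradient descent: w <- w - lr * grad(L)(w)."""
    for _ in range(num_steps):
        grad = loss_grad(w)   # gradient of the loss w.r.t. all parameters
        w = w - lr * grad     # step downhill in parameter space
    return w
```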
Stochastic gradient descent
Computing the gradient for the full dataset at each step is slow
● Especially if the dataset is large!
For most losses we care about, the total loss can be expressed as a sum (or
average) of losses on the individual examples
The gradient is the average of the gradients on individual examples
7
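In standard notation (my phrasing, consistent with but not copied from the slide), the total loss over N examples and its gradient are:

```latex
L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(\mathbf{x}_i;\mathbf{w}),\, y_i\big),
\qquad
\nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mathbf{w}}\, \ell\big(f(\mathbf{x}_i;\mathbf{w}),\, y_i\big)
```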
Stochastic gradient descent
SGD: estimate the gradient using a subset of the examples
● Pick a single random training example
● Compute the loss on this single training example
● Compute the gradient of this loss (the stochastic gradient), a noisy estimate of the full gradient
● Take a gradient descent step using this estimated gradient
8
Stochastic gradient descent
Advantages
● Very fast (only need to compute gradient on single example)
● Memory efficient (does not need the full dataset to compute gradient)
● Online (don’t need full dataset at each step)
Disadvantages
● Gradient is very noisy, may not always point in correct direction
● Convergence can be slower
Improvement
● Estimate the gradient on a small batch of training examples (say 50)
● Known as mini-batch stochastic gradient descent (a minimal sketch follows below)
9
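A minimal mini-batch SGD sketch (illustrative only; `loss_and_grad` is a hypothetical function returning the average loss and gradient over a batch, and batch_size=1 recovers plain SGD):

```python
import numpy as np

def minibatch_sgd(w, X, y, loss_and_grad, lr=0.01, batch_size=50, num_epochs=10):
    """Mini-batch stochastic gradient descent over a dataset (X, y)."""
    n = len(X)
    for _ in range(num_epochs):
        order = np.random.permutation(n)                       # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            loss, grad = loss_and_grad(w, X[batch], y[batch])  # noisy estimate
            w = w - lr * grad                                  # gradient descent step
    return w
```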
Finding the gradient with backprop
Combination of the chain rule and dynamic programming
Chain rule: allows us to find the gradient of the loss with respect to any input, activation, or parameter
Dynamic programming: reuse computations from previous steps. You don't need to evaluate the full chain for every parameter.
The chain rule
Easily differentiate compositions of functions.
11
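For reference (standard notation, not reproduced from the slide), the chain rule for a composition and its multivariate form are:

```latex
z = f(g(x)) \;\Rightarrow\; \frac{dz}{dx} = \frac{df}{dg}\cdot\frac{dg}{dx},
\qquad
\frac{\partial L}{\partial x} = \sum_{k}\frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial x}
```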
The chain rule (slides 12-14: worked examples not preserved in this export)
Dynamic programming (slide 15: content not preserved in this export)
Modular backprop
You could use the chain rule on all the individual neurons to compute the gradients with respect to the parameters and backpropagate the error signal.
It is much more useful to use the layer abstraction.
Then define the backpropagation algorithm in terms of three operations that layers need to be able to do.
This is called modular backpropagation
The layer abstraction (slides 17-18: content not preserved in this export)
Linear layer (slide 19: equations not preserved)
ReLU layer (slide 20: equations not preserved; an illustrative sketch of both layers follows below)
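A minimal sketch of what such layers might look like (my own illustration, not code from the slides): each layer implements forward and backward, where backward receives the gradient of the loss with respect to its output and returns the gradient with respect to its input, caching parameter gradients along the way.

```python
import numpy as np

class Linear:
    """Fully connected layer: y = x @ W + b."""
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                        # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # grad_out is dL/dy; store parameter gradients, return dL/dx
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T

class ReLU:
    """Elementwise max(0, x); no parameters."""
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask       # gradient flows only where x > 0
```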
Modular backprop
Using this idea, it is possible to create
many types of layers
● Linear (fully connected layers)
● Activation functions (sigmoid, ReLU)
● Convolutions
● Pooling
● Dropout
Once layers support the backward and forward operations, they can be plugged together to create more complex functions (see the sketch after this slide).
[Slide 21 figure: a stack of Convolution, ReLU, and Linear layers; the output error from layer L+1 is propagated back to the input error at layer L, and gradients are computed for the Convolution and Linear layers]
21
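Continuing the illustrative sketch above, plugging layers together just means chaining forward calls and running backward in reverse order:

```python
class Sequential:
    """Chain layers: forward runs left to right, backward right to left."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad_out):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out)
        return grad_out

# Example: a small MLP built from the Linear and ReLU layers sketched earlier
net = Sequential([Linear(784, 128), ReLU(), Linear(128, 10)])
```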
Implementation notes
Caffe and Torch
Libraries like Caffe and Torch implement
backpropagation this way
To define a new layer, you need to create a class and define the forward and backward operations
Theano and TensorFlow
Libraries like Theano and TensorFlow
operate on a computational graph
To define a new layer, you only need to specify the forward operation: autodiff is used to automatically derive the backward pass.
You also don't need to implement backprop manually in Theano or TensorFlow; they use computational graph optimizations to automatically factor out common computations.
22
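For instance (a minimal sketch using the TensorFlow 2 eager API, which postdates these 2017 slides but illustrates the same idea), you specify only the forward computation and ask the framework for gradients:

```python
import tensorflow as tf

w = tf.Variable([1.0, -2.0])             # parameters
x = tf.constant([0.5, 3.0])              # input
y = tf.constant(1.0)                     # target

with tf.GradientTape() as tape:
    pred = tf.reduce_sum(w * x)          # forward pass only
    loss = tf.square(pred - y)           # scalar loss

grad = tape.gradient(loss, w)            # dL/dw, no hand-written backward pass
```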
Practical tips for training deep nets
23
Choosing hyperparameters
Can already see we have lots of
hyperparameters to choose:
1. Learning rate
2. Regularization constant
3. Number of epochs
4. Number of hidden layers
5. Nodes in each hidden layer
6. Weight initialization strategy
7. Loss function
8. Activation functions
9. …
:(
Choosing these is a bit of an art.
Good news: in practice many configurations work well.
There are some reasonable heuristics, e.g.:
1. Try 0.1 for the learning rate. If this diverges, divide by 3. Repeat.
2. Try an existing network architecture and adapt it for your problem
3. Try to overfit the data with a big model, then regularize
You can also do a hyperparameter search
if you have enough compute:
● Randomized search tends to work well
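A toy sketch of randomized hyperparameter search (illustrative only; `train_and_evaluate` is a hypothetical function that trains a model with the given configuration and returns its validation loss):

```python
import random

def random_search(train_and_evaluate, num_trials=20):
    """Sample hyperparameters at random and keep the best configuration."""
    best_loss, best_config = float("inf"), None
    for _ in range(num_trials):
        config = {
            "lr": 10 ** random.uniform(-4, -1),        # log-uniform learning rate
            "l2": 10 ** random.uniform(-6, -2),        # regularization constant
            "hidden_layers": random.choice([1, 2, 3]),
            "hidden_units": random.choice([64, 128, 256, 512]),
        }
        val_loss = train_and_evaluate(config)           # hypothetical trainer
        if val_loss < best_loss:
            best_loss, best_config = val_loss, config
    return best_config, best_loss
```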
Choosing the learning rate
For first order optimization methods, we need to
choose a learning rate (aka step size)
● Too large: overshoots local minimum, loss
increases
● Too small: makes very slow progress, can get stuck
● Good learning rate: makes steady progress toward
local minimum
Usually want a higher learning rate at the start and
a lower one later on.
Common strategy in practice:
● Start off with a high LR (e.g. 0.1 - 0.001)
● Run for several epochs (1 - 10)
● Decrease the LR by multiplying by a constant factor (0.1 - 0.5); a step-decay sketch follows after this slide
[Figure: loss L as a function of weight w over steps t — α too large overshoots the minimum, a good α makes steady progress, α too small moves very slowly]
25
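The "decrease the LR by a constant factor every few epochs" strategy can be written as a simple step-decay schedule (illustrative sketch):

```python
def step_decay_lr(initial_lr=0.1, decay_factor=0.5, epochs_per_step=5):
    """Return a function giving the learning rate at a given epoch."""
    def lr_at(epoch):
        return initial_lr * (decay_factor ** (epoch // epochs_per_step))
    return lr_at

lr_schedule = step_decay_lr()
# lr_schedule(0) == 0.1, lr_schedule(5) == 0.05, lr_schedule(12) == 0.025
```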
Training and monitoring progress
1. Split data into train, validation, and test sets
○ Keep 10-30% of data for validation
2. Fit model parameters on train set using SGD
3. After each epoch:
○ Test model on validation set and compute loss
■ Also compute whatever other metrics you
are interested in, e.g. top-5 accuracy
○ Save a snapshot of the model
4. Plot learning curves as training progresses
5. Stop when validation loss starts to increase
6. Use the model from the epoch with the minimum validation loss (see the training-loop sketch below)
[Figure: training and validation loss curves vs. epoch; the best model is at the minimum of the validation loss]
26
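A sketch of the monitoring loop described above (illustrative only; `train_one_epoch`, `evaluate`, and `save_snapshot` are hypothetical helpers passed in by the caller):

```python
def fit(model, train_data, val_data, train_one_epoch, evaluate, save_snapshot,
        max_epochs=100, patience=5):
    """Train with SGD, track validation loss, snapshot each epoch, stop early."""
    best_val, best_epoch, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model, train_data)    # SGD over training set
        val_loss = evaluate(model, val_data)               # validation loss/metrics
        save_snapshot(model, epoch)                        # keep a snapshot
        history.append((train_loss, val_loss))             # for learning curves
        if val_loss < best_val:
            best_val, best_epoch = val_loss, epoch         # new best model so far
        elif epoch - best_epoch >= patience:               # validation loss rising
            break
    return best_epoch, history
```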
Divergence
Symptoms:
● Training loss keeps increasing
● Inf, NaN in loss
Try:
● Reduce learning rate
● Zero center and scale inputs/targets
● Check weight initialization strategy (monitor
gradients)
● Numerically check your gradients
● Clip the gradient norm (see the sketch after this slide)
[Figure: with α too large, gradient descent overshoots the minimum and the loss increases at each step]
27
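Gradient norm clipping, mentioned above, can be done in a few lines (illustrative numpy sketch over a list of gradient arrays):

```python
import numpy as np

def clip_gradient_norm(grads, max_norm=5.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```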
Slow convergence
Symptoms:
● Training loss decreases slowly
● Training loss does not decrease
Try:
● Increase learning rate
● Zero center and scale inputs/targets
● Check weight initialization strategy
(monitor gradients)
● Numerically check gradients (see the sketch after this slide)
● Use ReLUs
● Increase batch size
● Change loss, model architecture
[Figure: loss as a function of weight w over steps — the iterates make very slow progress toward the minimum]
28
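Numerically checking gradients means comparing the analytic gradient against central finite differences (illustrative sketch; `loss_fn` is a hypothetical scalar loss of a flat parameter vector):

```python
import numpy as np

def gradient_check(loss_fn, analytic_grad, w, eps=1e-5, tol=1e-4):
    """Compare analytic gradients to finite differences; return (ok, max relative error)."""
    num_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        num_grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    rel_err = np.abs(num_grad - analytic_grad) / (np.abs(num_grad) + np.abs(analytic_grad) + 1e-12)
    return rel_err.max() < tol, rel_err.max()
```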
Overfitting
Symptoms:
● Validation loss decreases at first, then starts
increasing
● Training loss continues to go down
Try:
● Find more training data
● Add stronger regularization
○ dropout, drop-connect, L2
● Data augmentation: flips, rotations, noise (see the sketch after this slide)
● Reduce complexity of your model
[Figure: loss vs. epoch — training loss keeps decreasing while validation loss turns upward]
29
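Simple data augmentation such as flips and additive noise can be sketched as follows (illustrative only, assuming images are float arrays of shape (H, W, C) with values in [0, 1]):

```python
import numpy as np

def augment(image, rng=np.random):
    """Randomly flip an image horizontally and add small Gaussian pixel noise."""
    if rng.rand() < 0.5:
        image = image[:, ::-1, :]                          # horizontal flip
    image = image + rng.normal(0.0, 0.01, image.shape)     # additive noise
    return np.clip(image, 0.0, 1.0)                        # keep values in range
```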
Underfitting
Symptoms:
● Training loss decreases at first but then stops
● Training loss still high
● Training loss tracks validation loss
Try:
● Increase model capacity
○ Add more layers, increase layer size
● Use more suitable network architecture
○ E.g. multi-scale architecture
● Decrease regularization strength
[Figure: loss vs. epoch — training and validation loss track each other and plateau at a high value]
30
Questions?
31