Regularization for
Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
(Goodfellow 2016)
Definition
• “Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error.”
(Goodfellow 2016)
Weight Decay as Constrained Optimization
Figure 7.1: contours in the (w1, w2) parameter plane; w* marks the unregularized optimum and w̃ the solution found when the weights are subject to an L2 weight-decay penalty or norm constraint.
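As a rough illustration of the constrained-optimization view, here is a minimal sketch (not from the slides, assuming NumPy arrays): instead of adding a penalty to the loss, each gradient step is followed by projecting the weights back onto an L2 ball of radius `radius`.

```python
# Hypothetical sketch: weight decay as an explicit norm constraint,
# enforced by projecting w onto the L2 ball ||w||_2 <= radius after each step.
import numpy as np

def project_l2_ball(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def constrained_sgd_step(w, grad, lr=0.1, radius=1.0):
    w = w - lr * grad                    # ordinary gradient step on the task loss
    return project_l2_ball(w, radius)    # re-impose the constraint on the weights
```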
(Goodfellow 2016)
Norm Penalties
• L1: Encourages sparsity, equivalent to MAP
Bayesian estimation with Laplace prior
• Squared L2: Encourages small weights, equivalent to
MAP Bayesian estimation with Gaussian prior
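The two penalties can be written as a small sketch (illustrative, not from the slides); the penalty weight `alpha` is a hyperparameter chosen by the practitioner.

```python
# Hypothetical sketch: adding an L1 or squared-L2 norm penalty to a
# training objective J(w).
import numpy as np

def penalized_loss(J, w, alpha=1e-2, norm="l2"):
    if norm == "l1":
        # L1 penalty: encourages sparse weights (MAP with a Laplace prior)
        return J + alpha * np.sum(np.abs(w))
    # squared L2 penalty: encourages small weights (MAP with a Gaussian prior)
    return J + 0.5 * alpha * np.sum(w ** 2)
```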
(Goodfellow 2016)
Dataset Augmentation
Affine distortion
Noise
Elastic deformation
Horizontal flip
Random translation
Hue shift
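A few of the augmentations named above can be sketched directly on a 2-D grayscale image array (a minimal illustration, not from the slides; real pipelines typically use a library such as torchvision or albumentations).

```python
# Hypothetical sketch: horizontal flip, small random translation, and
# additive Gaussian noise applied to a 2-D NumPy image.
import numpy as np

def augment(img, rng=np.random.default_rng(0)):
    if rng.random() < 0.5:                          # horizontal flip
        img = img[:, ::-1]
    dy, dx = rng.integers(-2, 3, size=2)            # random translation by a few pixels
    img = np.roll(img, shift=(dy, dx), axis=(0, 1))
    img = img + rng.normal(0.0, 0.05, img.shape)    # additive noise
    return img
```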
(Goodfellow 2016)
Multi-Task Learning
1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in figure 7.2.
2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in figure 7.2.
Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks. A shared representation h(shared) is computed from the input x; task-specific layers h(1), h(2), h(3) sit on top of it, with outputs y(1) and y(2) produced from h(1) and h(2).
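A minimal forward pass in the spirit of figure 7.2 (a sketch, not from the slides; the weight shapes and tanh nonlinearity are illustrative assumptions): one shared lower layer feeds two task-specific heads.

```python
# Hypothetical sketch of a shared-trunk, multi-head network.
import numpy as np

def multi_task_forward(x, W_shared, W_task1, W_task2):
    h_shared = np.tanh(W_shared @ x)   # generic parameters, shared across tasks
    y1 = W_task1 @ h_shared            # task-specific head for task 1
    y2 = W_task2 @ h_shared            # task-specific head for task 2
    return y1, y2
```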
(Goodfellow 2016)
Learning Curves
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (x-axis: time in epochs, 0 to 250; y-axis: loss, 0.00 to 0.20) for the training set and the validation set. The training set loss decreases steadily, while the validation set loss eventually begins to increase again.
Early stopping: terminate while validation set performance is better
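A typical early-stopping loop looks like the sketch below (illustrative only; `train_one_epoch` and `validation_loss` are placeholder callables supplied by the user, and `patience` is an assumed hyperparameter): keep the parameters with the best validation loss and stop once it has not improved for several epochs.

```python
# Hypothetical sketch of early stopping with patience.
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              patience=10, max_epochs=250):
    best_loss, best_model, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss, best_model = val_loss, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # stop while validation performance is still near its best
    return best_model, best_loss
```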
(Goodfellow 2016)
Early Stopping and Weight Decay
Figure 7.4: two panels in the (w1, w2) plane, with w* the unregularized optimum and w̃ the solution reached; stopping the optimization trajectory early (left) has an effect similar to explicit L2 weight decay (right).
(Goodfellow 2016)
Sparse Representations

$$
\begin{bmatrix} -14 \\ 1 \\ 19 \\ 2 \\ 23 \end{bmatrix}
=
\begin{bmatrix}
 3 & -1 &  2 & -5 &  4 &  1 \\
 4 &  2 & -3 & -1 &  1 &  3 \\
-1 &  5 &  4 &  2 & -3 & -2 \\
 3 &  1 &  2 & -3 &  0 & -3 \\
-5 &  4 & -2 &  2 & -5 & -1
\end{bmatrix}
\begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ -3 \\ 0 \end{bmatrix}
\qquad y \in \mathbb{R}^m,\ B \in \mathbb{R}^{m \times n},\ h \in \mathbb{R}^n
\tag{7.47}
$$

In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function a norm penalty Ω(h) on the representation.
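For instance, the penalty can be the L1 norm of the code h, as in the sketch below (an illustration, not from the slides; `alpha` and the squared-error data term are assumptions):

```python
# Hypothetical sketch: penalize the representation h rather than the parameters,
# adding Omega(h) = alpha * ||h||_1 to encourage a sparse code for y ≈ B h.
import numpy as np

def representation_penalized_loss(y, B, h, alpha=0.1):
    reconstruction = 0.5 * np.sum((y - B @ h) ** 2)   # data-fitting term
    sparsity = alpha * np.sum(np.abs(h))              # Omega(h): L1 norm of the code
    return reconstruction + sparsity
```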
(Goodfellow 2016)
Bagging
(Panels: original dataset; first resampled dataset → first ensemble member; second resampled dataset → second ensemble member.)
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8; on this dataset, the detector learns that a loop on top of the digit corresponds to an 8.
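The procedure itself is short: resample the training set with replacement k times, fit one model per resampled dataset, and average the predictions. In the sketch below (illustrative only), `fit` and `predict` are placeholder callables for whatever base learner is used.

```python
# Hypothetical sketch of bagging: bootstrap resampling plus ensemble averaging.
import numpy as np

def bagging_predict(X, y, X_test, fit, predict, k=5, rng=np.random.default_rng(0)):
    predictions = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        model = fit(X[idx], y[idx])                  # train one ensemble member
        predictions.append(predict(model, X_test))
    return np.mean(predictions, axis=0)              # average the members' predictions
```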
(Goodfellow 2016)
Dropout
Figure 7.6: a base network with inputs x1, x2, hidden units h1, h2 and output y, shown alongside the ensemble of subnetworks formed by removing non-output units from it in all possible combinations.
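In practice each subnetwork is sampled by masking units at training time. Below is a minimal sketch of the common "inverted dropout" variant (an illustration, not the slides' exact formulation): units are kept with probability `keep_prob` and the surviving activations are rescaled so no change is needed at test time.

```python
# Hypothetical sketch of inverted dropout applied to a layer's activations h.
import numpy as np

def dropout(h, keep_prob=0.5, rng=np.random.default_rng(0)):
    mask = rng.random(h.shape) < keep_prob   # sample a binary subnetwork
    return h * mask / keep_prob              # rescale kept units by 1/keep_prob
```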
(Goodfellow 2016)
Adversarial Examples
x ("panda", 57.7% confidence) + .007 × sign(∇x J(θ, x, y)) ("nematode", 8.2% confidence) = x + ε sign(∇x J(θ, x, y)) ("gibbon", 99.3% confidence)
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).
Linear functions are easy to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖₁, which can be a very large amount if w is high-dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data.
Training on adversarial examples is mostly
intended to improve security, but can sometimes
provide generic regularization.
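The perturbation shown in Figure 7.8 is the fast gradient sign method, sketched below (illustrative only; `grad_wrt_input` is a placeholder for whatever routine computes ∇x J(θ, x, y), e.g. via automatic differentiation, and ε = .007 matches the figure).

```python
# Hypothetical sketch of the fast gradient sign method:
# x_adv = x + epsilon * sign(grad_x J(theta, x, y)).
import numpy as np

def fgsm_example(x, y, grad_wrt_input, epsilon=0.007):
    perturbation = epsilon * np.sign(grad_wrt_input(x, y))
    return x + perturbation
```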
(Goodfellow 2016)
Tangent Propagation
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output to change slowly along directions tangent to the data manifold (sketched as a curve in the (x1, x2) plane with its normal and tangent directions).
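The penalty itself can be written as the sum of squared directional derivatives of the output along the known tangent vectors, as in the sketch below (illustrative only; `grad_f` is a placeholder routine that returns the gradient of the output with respect to the input).

```python
# Hypothetical sketch of the tangent prop penalty:
# Omega(f) = sum_i (grad f(x) . v_i)^2, so f changes little along tangent directions v_i.
import numpy as np

def tangent_prop_penalty(x, tangent_vectors, grad_f):
    g = grad_f(x)                                   # gradient of the output w.r.t. the input
    return sum(float(g @ v) ** 2 for v in tangent_vectors)
```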