DEEP FEEDFORWARD NETWORKS
AND REGULARIZATION
LICHENG ZHANG
OVERVIEW
• Regularization
• L2/L1/elastic
• Dropout
• Batch normalization
• Data augmentation
• Early stopping
• Neural network
• Perceptron
• Activation functions
• Back-propagation
FEEDFORWARD NETWORK
“3-layer neural net” or “2-hidden-layer neural net”
FEEDFORWARD NETWORK (ANIMATION)
NEURON (UNIT)
PERCEPTRON FORWARD PASS
Inputs 2, 3, −1 with weights 0.1, 0.5, 2.5, plus a bias input 1 with bias weight 3.0, feed the weighted sum ∑, followed by the activation function f.
Output = f( (2∗0.1) + (3∗0.5) + (−1∗2.5) + (1∗3.0) )
PERCEPTRON FORWARD PASS
With the same inputs (2, 3, −1, bias 1) and weights (0.1, 0.5, 2.5, 3.0), the weighted sum is 2.2.
Output = f(2.2) = σ(2.2) = 1 / (1 + e^(−2.2)) ≈ 0.90
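A minimal NumPy sketch of this forward pass (the function name perceptron_forward and the use of NumPy are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron_forward(x, w, b, f):
    """Weighted sum of inputs plus bias, passed through activation f."""
    return f(np.dot(x, w) + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Slide example: inputs 2, 3, -1 with weights 0.1, 0.5, 2.5 and bias weight 3.0 (bias input 1)
x = np.array([2.0, 3.0, -1.0])
w = np.array([0.1, 0.5, 2.5])
out = perceptron_forward(x, w, b=3.0, f=sigmoid)
print(round(out, 2))            # 0.9  (= sigmoid(2.2))
```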
MULTI-OUTPUT PERCEPTRON
Inputs x0, x1, x2 (input layer) are fully connected to a single output o0 (output layer).
MULTI-OUTPUT PERCEPTRON
Inputs x0, x1, x2 (input layer) are fully connected to two outputs o0, o1 (output layer).
MULTI-LAYER PERCEPTRON (MLP)
Input layer: x0, x1, x2 → Hidden layer: h0, h1, h2, h3 → Output layer: o0, o1
DEEP NEURAL NETWORK
Input layer: x0, x1, x2 → Hidden layers: h0, h1, h2, h3 (…… repeated) → Output layer: o0, o1
http://www.asimovinstitute.org/neural-network-zoo/
UNIVERSAL APPROXIMATION THEOREM
“A feedforward network with a linear output layer and at least one hidden layer with any ‘squashing’
activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired nonzero amount of error, provided that the network
is given enough hidden units.”
• ----- Hornik et al. (1989); Cybenko (1989)
COMPUTATIONAL GRAPHS
Examples:
• z = x ∗ y
• y = σ(wx + b)
• H = relu(WX + b) = max(0, WX + b)
• y = wx with weight-decay penalty u(3) = λ∑ⱼ wⱼ²
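As a rough sketch of evaluating one of these graph nodes, here is the H = relu(WX + b) example in NumPy (the shapes are made-up assumptions):

```python
import numpy as np

# Minimal sketch of the H = relu(WX + b) node from the computational-graph examples.
# Shapes are illustrative: 3 inputs, 4 hidden units, batch of 2.
X = np.random.randn(3, 2)          # input matrix (features x batch)
W = np.random.randn(4, 3)          # weight matrix
b = np.random.randn(4, 1)          # bias vector (broadcast over the batch)

H = np.maximum(0, W @ X + b)       # relu(WX + b) = max(0, WX + b)
print(H.shape)                     # (4, 2)
```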
LOSS FUNCTION
• A loss function (cost function) tells us how good our current model is, or how far our model is from the real answer.
L(w) = (1/N) ∑ᵢ loss( f(xᵢ; w), yᵢ ),  where N = # examples, f(xᵢ; w) is the prediction and yᵢ the actual label
• Hinge loss
• Softmax loss
• Mean Squared Error (L2 loss) → Regression: L(w) = (1/N) ∑ᵢ ( f(xᵢ; w) − yᵢ )²
• Cross-entropy loss → Classification: L(w) = −(1/N) ∑ᵢ [ yᵢ log f(xᵢ; w) + (1 − yᵢ) log(1 − f(xᵢ; w)) ]
• …
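A minimal sketch of the two named losses in NumPy (the eps clipping is an added safeguard, not from the slide):

```python
import numpy as np

def mse_loss(pred, y):
    """Mean Squared Error (L2 loss), used for regression."""
    return np.mean((pred - y) ** 2)

def cross_entropy_loss(pred, y, eps=1e-12):
    """Binary cross-entropy, used for classification; pred are probabilities in (0, 1)."""
    pred = np.clip(pred, eps, 1 - eps)                       # avoid log(0)
    return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

pred = np.array([0.9, 0.2, 0.7])
y    = np.array([1.0, 0.0, 1.0])
print(mse_loss(pred, y), cross_entropy_loss(pred, y))
```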
GRADIENT DESCENT
• Designing and training a neural network is not much different from training any other machine learning model with gradient descent: use calculus to get the derivatives of the loss function with respect to each parameter.
wⱼ = wⱼ − α ∂L(w)/∂wⱼ,  where α is the learning rate
https://developers.google.com/machine-learning/crash-course/fitter/graph
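A tiny illustration of the update rule on a made-up one-dimensional loss L(w) = (w − 3)², assuming an analytic gradient:

```python
# Minimal sketch of the update w_j <- w_j - alpha * dL/dw_j on a toy loss L(w) = (w - 3)^2.
def grad(w):                 # analytic derivative: dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w, alpha = 0.0, 0.1          # initial weight and learning rate (illustrative values)
for step in range(100):
    w = w - alpha * grad(w)  # gradient descent update
print(round(w, 4))           # converges toward the minimum at w = 3
```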
GRADIENT DESCENT
• In practice, instead of using all data points, we do
• Stochastic gradient descent (using 1 sample at each iteration)
• Mini-Batch gradient descent (using n samples at each iteration)
Problems with SGD:
• If the loss changes quickly in one direction and slowly in another → jitter along the steep direction
• If the loss function has a local minimum or saddle point → zero gradient, SGD gets stuck
Solutions:
• SGD + momentum, etc.
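A hedged sketch of mini-batch SGD with momentum (function and parameter names are illustrative assumptions, not an exact recipe from the slides):

```python
import numpy as np

def sgd_momentum(w, grad_fn, data, lr=0.01, rho=0.9, batch_size=32, epochs=10):
    """Minimal sketch of mini-batch SGD with momentum; grad_fn(w, batch) is assumed given."""
    v = np.zeros_like(w)                          # velocity
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            g = grad_fn(w, batch)                 # gradient on the mini-batch
            v = rho * v + g                       # accumulate a running direction
            w = w - lr * v                        # momentum damps the jitter of plain SGD
    return w
```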
BACK-PROPAGATION
• It allows the information from the loss to flow backward through the network in order to compute the gradient.
Network: x0 →(W1)→ h0 →(W2)→ O0 → L(w)
Chain rule: ∂L(w)/∂w2 = ∂L(w)/∂O0 ∗ ∂O0/∂w2
BACK-PROPAGATION
• It allows the information from the loss to flow backward through the network in order to compute the gradient.
Network: x0 →(W1)→ h0 →(W2)→ O0 → L(w)
Chain rule: ∂L(w)/∂w1 = ∂L(w)/∂O0 ∗ ∂O0/∂h0 ∗ ∂h0/∂w1
BACK-PROPAGATION: SIMPLE EXAMPLE
f(x, y, z) = (x + y) z,  e.g. x = −2, y = 5, z = −4
Forward pass: q = x + y = 3, f = q z = −12
Local gradients: ∂q/∂x = 1, ∂q/∂y = 1; ∂f/∂q = z, ∂f/∂z = q
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Backward pass: ∂f/∂f = 1; ∂f/∂z = q = 3; ∂f/∂q = z = −4
Chain rule: ∂f/∂y = (∂f/∂q)(∂q/∂y) = −4; ∂f/∂x = (∂f/∂q)(∂q/∂x) = −4
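The same worked example as straight-line Python, so the numbers can be checked:

```python
# The slide's example, f(x, y, z) = (x + y) * z, done as an explicit forward + backward pass.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass (chain rule), starting from df/df = 1
df_dq = z            # -4
df_dz = q            #  3
df_dx = df_dq * 1.0  # dq/dx = 1  ->  -4
df_dy = df_dq * 1.0  # dq/dy = 1  ->  -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```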
ACTIVATION FUNCTIONS
The importance of activation functions f is that they introduce non-linearity into the network.
ACTIVATION FUNCTIONS
For output layer:
• Sigmoid
• Softmax
• Tanh
For hidden layer:
• ReLU
• LeakyReLU
• ELU
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
Sigmoid gate: σ(x) = 1 / (1 + e^(−x))
Gradient that flows back through the gate: ∂L/∂x = (∂σ/∂x) ∗ (∂L/∂σ)
• What happens when x = −10?
• What happens when x = 0?
• What happens when x = 10?
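A quick numeric check of those three questions, using the sigmoid's local gradient ∂σ/∂x = σ(x)(1 − σ(x)):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # local gradient of the sigmoid gate

for x in (-10.0, 0.0, 10.0):
    print(x, round(dsigmoid(x), 6))
# -10.0 0.000045   <- saturated: almost no gradient flows back
#   0.0 0.25       <- the largest the local gradient ever gets
#  10.0 0.000045   <- saturated again
```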
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron is always positive:
f( ∑ᵢ wᵢ xᵢ + b )
What can we say about the gradients on w?
∂L/∂wᵢ = (∂L/∂f) ∗ (∂f/∂wᵢ) = (∂L/∂f) ∗ xᵢ
Always all positive or all negative → inefficient!
(this is also why you want zero-mean data!)
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive
ACTIVATION FUNCTIONS
• Not zero-centered output
• An annoyance when x < 0: the gradient there is zero
People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
ACTIVATION FUNCTIONS
ELU (Clevert et al., 2015)
MAXOUT “NEURON”
IN PRACTICE (GOOD RULE OF THUMB)
• For hidden layers:
• Use ReLU. Be careful with your learning rates
• Try out Leaky ReLU / Maxout / ELU
• Try out tanh but don’t expect too much
• Don’t use Sigmoid
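A rough sketch of the activations mentioned above (the α defaults are common conventions, not values given in the slides):

```python
import numpy as np

# Sketches of the activations discussed in this section.
def relu(x):               return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))
def tanh(x):               return np.tanh(x)
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```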
REGULARIZATION
• Regularization is “any modification we make to the
learning algorithm that is intended to reduce the
generalization error, but not its training error”.
REGULARIZATION
L(W) = (1/N) ∑ᵢ Lᵢ( f(xᵢ; W), yᵢ )
Data loss: model predictions should match training data
REGULARIZATION
L(W) = (1/N) ∑ᵢ Lᵢ( f(xᵢ; W), yᵢ ) + λR(W)
Data loss: model predictions should match training data
Regularization: model should be “simple”, so it works on test data
Occam’s Razor: “Among competing hypotheses, the simplest is the best” — William of Ockham, 1285–1347
REGULARIZATION
• In common use:
• L2 regularization
• L1 regularization
• Elastic net (L1 + L2)
• Dropout
• Batch normalization
• Data Augmentation
• Early Stopping
L2 regularization: R(w) = ∑ⱼ wⱼ²
L1 regularization: R(w) = ∑ⱼ |wⱼ|
Elastic net: R(w) = ∑ⱼ ( β wⱼ² + |wⱼ| )
Regularization is a technique designed to counter neural network over-fitting.
L(W) = (1/N) ∑ᵢ Lᵢ( f(xᵢ; W), yᵢ ) + λR(W)
L2 REGULARIZATION
• penalizes the squared value of each weight (which also explains the “2” in the name)
• tends to drive all the weights to smaller values
L(W) = (1/N) ∑ᵢ Lᵢ( f(xᵢ; W), yᵢ ) + λ∑ⱼ wⱼ²
Weights distribution: no regularization vs. L2 regularization
L1 REGULARIZATION
• penalizes the absolute value of each weight (a V-shaped function)
• tends to drive some weights to exactly zero (introducing sparsity in the model), while allowing some weights to stay big
L(W) = (1/N) ∑ᵢ Lᵢ( f(xᵢ; W), yᵢ ) + λ∑ⱼ |wⱼ|
Weights distribution: no regularization vs. L1 regularization
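A minimal sketch of how the L1 and L2 penalties (and their gradients) would be added to a data loss; lam plays the role of λ and the data-loss numbers are placeholders:

```python
import numpy as np

def l2_penalty(w, lam):  return lam * np.sum(w ** 2)
def l2_grad(w, lam):     return 2 * lam * w            # pushes all weights toward smaller values

def l1_penalty(w, lam):  return lam * np.sum(np.abs(w))
def l1_grad(w, lam):     return lam * np.sign(w)       # constant pull -> some weights hit exactly zero

w = np.array([0.5, -0.01, 2.0])
data_loss, data_grad = 1.3, np.array([0.2, -0.1, 0.4])  # placeholders for L_i and its gradient
total_loss = data_loss + l2_penalty(w, lam=1e-3)
total_grad = data_grad + l2_grad(w, lam=1e-3)
print(total_loss, total_grad)
```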
DROPOUT
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
If neurons are randomly dropped out of the network during training, other neurons have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
DROPOUT
Another interpretation:
• Dropout is training a large ensemble of models (that
share parameters)
• Each binary mask is one model
A fully connected layer with 4096 units has 2^4096 ≈ 10^1233 possible masks!
Only ~10^82 atoms in the universe…
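A sketch of a dropout layer using the common “inverted dropout” formulation (the scaling by 1/(1 − p) is that variant's convention; the slides do not specify it):

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted-dropout sketch: zero activations with probability p_drop during training,
    and scale the survivors so nothing extra is needed at test time."""
    if not train:
        return h                                   # test time: use all neurons as-is
    mask = (np.random.rand(*h.shape) >= p_drop)    # one random binary mask per forward pass
    return h * mask / (1.0 - p_drop)

h = np.random.randn(4, 8)
print(dropout_forward(h, p_drop=0.5, train=True))
```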
DENSE-SPARSE-DENSE TRAINING
https://arxiv.org/pdf/1607.04381v1.pdf
BATCH NORMALIZATION
“you want unit Gaussian activations? Just make them so.”
BATCH NORMALIZATION
Usually inserted after fully
connected or convolutional layers,
and before nonlinearity.
• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependence on initialization
• Acts as a form of regularization in a funny way,
and slightly reduces the need for dropout, maybe
Note: at test time BatchNorm layer
functions differently:
The mean/std are not computed
based on the batch. Instead, a
single fixed empirical mean of
activations during training is used.
(e.g. can be estimated during
training with running averages)
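A sketch of the forward pass just described, assuming a (batch, features) input and running averages for the test-time statistics (names and the momentum value are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, train=True):
    """Sketch of a BatchNorm layer over a (batch, features) input; gamma/beta are learned."""
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)            # batch statistics
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var  = momentum * running_var  + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var                # fixed empirical statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)                  # unit-Gaussian activations
    return gamma * x_hat + beta, running_mean, running_var
```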
DATA AUGMENTATION
The best way to make a machine learning model generalize better is to train it on more data.
DATA AUGMENTATION
Horizontal flips
Random crops and scales
Color Jitter
• Simple: Randomize
contrast and brightness
Get creative for your problem!
• Translation
• Rotation
• Stretching
• Shearing
• Lens distortions
• (go crazy)
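A hedged sketch of a few of these augmentations on a NumPy image array (the crop fraction and jitter range are made-up parameters):

```python
import numpy as np

def augment(img):
    """Sketch of simple augmentations on an (H, W, C) uint8 image."""
    if np.random.rand() < 0.5:                       # horizontal flip
        img = img[:, ::-1, :]
    h, w, _ = img.shape                              # random crop to 7/8 of each side
    ch, cw = int(h * 7 / 8), int(w * 7 / 8)
    top, left = np.random.randint(h - ch + 1), np.random.randint(w - cw + 1)
    img = img[top:top + ch, left:left + cw, :]
    scale = np.random.uniform(0.8, 1.2)              # simple brightness/contrast jitter
    return np.clip(img.astype(np.float32) * scale, 0, 255).astype(np.uint8)
```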
EARLY STOPPING
It is probably the most commonly used form of
regularization in deep learning to prevent overfitting:
• Effective
• Simple
Think of this as a hyperparameter selection
algorithm. The number of training steps is another
hyperparameter.
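A sketch of early stopping as a training-loop wrapper; model, train_step, val_loss_fn and get_params are hypothetical placeholders:

```python
def train_with_early_stopping(model, train_step, val_loss_fn, max_steps=10000, patience=10):
    """Stop when validation loss has not improved for `patience` evaluations,
    and keep the parameters from the best step (all names here are illustrative)."""
    best_loss, best_params, bad_evals = float("inf"), None, 0
    for step in range(max_steps):
        train_step(model)                      # one optimization step on the training set
        loss = val_loss_fn(model)              # monitor generalization on held-out data
        if loss < best_loss:
            best_loss, best_params, bad_evals = loss, model.get_params(), 0
        else:
            bad_evals += 1
            if bad_evals >= patience:          # no improvement for a while: stop training
                break
    return best_params
```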
REFERENCE
• Deep Learning book ----- http://www.deeplearningbook.org/
• Stanford CNN course ----- http://cs231n.stanford.edu/index.html
• Regularization in deep learning ----- https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0
• So much more to learn, go explore!
• THANK YOU