Regularization Slides (2)
Uploaded by kylorensw407

Regularization and
Hyperparameters in Deep Learning
Improve performance on unseen data by reducing overfitting
1. Preliminary understanding
2. Regularization - What & Why?
3. Regularization Techniques
a. L1, L2 regularization
b. Early stopping
c. Ensemble methods - Dropout, DropConnect
d. Dataset augmentation
e. Adding noise to the inputs / outputs
4. Hyperparameters in DL
1. Preliminary understanding
➔ Model Fitting (Train-Test)
➔ Gold Standard Practice - A case study
➔ Performance of model
➔ Model Generalization (Validate)

Model Fitting -
How well does a model perform on the training & evaluation datasets?
Three situations:
➔ Underfit
➔ Overfit
➔ Optimal fit / Good fit
Regression model fitting…
➔ Under-fitting
Shows poor performance in training
Dataset or model capacity is poor
➔ Optimal / good / robust fitting
Balanced - model at the sweet spot
between underfitting and overfitting
➔ Overfitting
Model capacity increased beyond what the data supports
Gold standard practice - Limit overfit
➔ Use a resampling technique to estimate model accuracy - k-fold cross validation
➔ Hold back a validation dataset: a subset of the data (e.g., 60% train, with the remaining 40% split as 20% validation + 20% test)
➔ A case study: compare 2 fits of the same model
K-fold Cross validation
Hold Back a validation dataset
Fitting vs Bias-Variance
➔ Bias
Represents the extent to which the average prediction over all training datasets differs from the true values
➔ Variance
Represents the extent to which the model is sensitive to the particular choice of dataset (test data)
➔ Relationship: model fitting and bias-variance
Model Performance
➔ Error (generalization)
Generalization error = prediction error against train data + prediction error against test data
➔ Error due to: underfit + overfit
➔ Robust model = min {Error}
Reasons and countermeasures
Reason for overfit:
Deep neural networks are highly complex models - many parameters, many non-linearities. It is easy for them to overfit and drive training error to 0.
Countermeasures: Underfit
Increase the capacity (number of layers); incorporate more data
Countermeasures: Overfit
Model capacity is so big that it adapts too well to training samples and is unable to generalize well to new, unseen samples
Solution - Regularization
A modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.
Strategy: more data, or reducing the network's capacity. Optimal?
2. Regularization
➔ Reduces risk of overfitting
◆ Makes a learning model perform well on train data and new input data
➔ Encourages a preference towards simple models
➔ Fewer parameters reduce the computational power needed
➔ Best / balanced performing model
Regularization in DeepNet?
How does regularization work on a deep neural network model?
Simple neural network model
Weight update to minimize the loss
3. Regularization Techniques
➔ L2, L1, group regularization (weight)
➔ Dataset augmentation
➔ Early stopping
➔ Ensemble method: Dropout / DropConnect

Zero out input: Ridge (L2), Lasso (L1)

L2 adds the "squared magnitude" of the coefficients as a penalty term to the loss function:
Loss = Error + λ Σ wᵢ²
L1 adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss:
Loss = Error + λ Σ |wᵢ|
Weight penalties → smaller weights → simpler model → less overfit.
math(W) represents the actual regularization operation.
λ (lambda) determines how strongly the regularization will influence the network's training (0 to n).
L1 (Lasso) Regularization
Zeros out certain inputs:
Adds a penalty for having weights of large absolute value.
Encourages the model to make as many weights zero as possible.
Zero out inputs (L1) & maximum punishment (L2)
Example: the weights corresponding to "Variable x (Blood pressure)" and "Variable y (Body weight)" are not useful in predicting future diagnosis of diseases.

L1 regularization: a weight of 0.5 gets a penalty of 0.5
L2 regularization: a weight of 0.5 gets a penalty of 0.25
L1 gives a push to squish even small weights towards zero, more so than L2 regularization.

L1 regularization: a weight of -9 gets a penalty of 9, but
L2 regularization: a weight of -9 gets a penalty of 81
Thus, bigger-magnitude weights are punished much more severely in L2 regularization.
L1 & L2 regularization at the same time (Elastic Net)
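The penalty arithmetic above can be sketched in a few lines of Python (a minimal, framework-free illustration):

```python
def l1_penalty(weights, lam=1.0):
    """L1 (Lasso) penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=1.0):
    """L2 (Ridge) penalty: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

# A weight of 0.5: L1 penalty 0.5, L2 penalty 0.25
print(l1_penalty([0.5]), l2_penalty([0.5]))  # 0.5 0.25
# A weight of -9: L1 penalty 9, L2 penalty 81
print(l1_penalty([-9]), l2_penalty([-9]))    # 9 81
```

Adding both penalties to the loss at once gives the combined (Elastic Net style) regularization mentioned above.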
Early stopping
There is a point during training a large neural net when the model will stop generalizing; thereafter it focuses only on learning the statistical noise in the training dataset.
Solution:
• Stop whenever generalization error increases

Track the validation error.
Have a patience parameter "p".
If you are at step "k" and there was no improvement in validation error in the previous "p" steps,
then stop training and return the model stored at step k − p.
Tip: the Keras implementation has an option to save the best weights.
https://keras.io/callbacks/
Callback during training
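The patience rule above can be sketched without any framework (the validation errors are simulated here; in a real Keras run you would instead pass an EarlyStopping callback with restore_best_weights=True):

```python
def early_stopping(val_errors, patience):
    """Return the index of the model checkpoint to keep, following the rule:
    stop at step k if validation error has not improved in the previous
    `patience` steps, and return the model stored at step k - patience."""
    best = float("inf")
    steps_since_improvement = 0
    for k, err in enumerate(val_errors):
        if err < best:
            best = err
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1
        if steps_since_improvement >= patience:
            return k - patience  # checkpoint saved before errors started rising
    return len(val_errors) - 1  # never triggered: keep the final model

# Validation error improves, then rises for 3 consecutive steps:
print(early_stopping([0.9, 0.7, 0.5, 0.6, 0.65, 0.7], patience=3))  # 2
```

The returned index 2 is the last step at which validation error improved, i.e., the model stored at k − p.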
Zero out the nodes or connections: Dropout, DropConnect
Dropout drops out some nodes / links - an efficient, approximate way of combining exponentially many different neural networks.
Dropping out can be seen as temporarily deactivating or ignoring neurons of the network.
Dropping nodes / links lets the network concentrate on other features.

Dropout
Randomly select a subset of the units and clamp their output to zero, regardless of the input;
this effectively removes those units from the model.
A different subset of units is randomly selected every time.

DropConnect
Disable individual weights (i.e., set them to zero) instead of nodes, so a node can remain partially active.
DropConnect is a generalization of Dropout because it produces even more possible models.
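The "clamp a random subset of outputs to zero" step can be sketched as inverted dropout in plain Python (a minimal illustration; rescaling survivors by 1/(1−p) is a common practical convention assumed here, not stated on the slide):

```python
import random

def dropout(activations, p=0.5, training=True, seed=None):
    """Inverted dropout: zero each unit with probability p and scale the
    survivors by 1/(1-p); at test time, return activations unchanged."""
    if not training or p == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0] * 8, p=0.5, seed=0)
print(out)  # a random mix of 0.0 (dropped) and 2.0 (kept, rescaled)
```

A different mask is drawn on every call, which is exactly the "different subset every time" behaviour described above; DropConnect would instead apply such a mask to the weight matrix.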
Dataset Augmentation
Typically, more data = better learning.
Works well for NLP, image classification, object recognition, speech processing.
For some tasks it may not be clear how to generate such data.
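For image tasks, one classic augmentation is a horizontal flip - sketched here on a nested-list "image" (illustrative only; real pipelines would use a library such as torchvision or tf.image):

```python
def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left-to-right - a label-preserving
    transform that effectively doubles the dataset for many vision tasks."""
    return [list(reversed(row)) for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
print(horizontal_flip(img))  # [[3, 2, 1], [6, 5, 4]]
```

Flipping twice recovers the original image, which is why such transforms are safe to apply randomly during training.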
4. Hyperparameters in Deepnet
➔ What?
➔ HP related to network structure
➔ HP related to training methods
➔ Methods used to find hyperparameters
Hyperparameters
Variables used to control the learning process.
They determine the network structure and how the network is trained.
By contrast, the values of other parameters are derived via training - e.g., weights, biases.
HP related to network structure
➔ Number of hidden layers: just keep adding layers until the test error does not improve anymore.
➔ Dropout: used on a larger network, giving the model more of an opportunity to learn.
➔ Network weight initialization: chosen according to the activation function.
➔ Activation function: generally, the rectifier (ReLU) activation function.
HP related to training methods
● Learning rate defines how quickly a network updates its parameters.
● A low learning rate slows down the learning process but converges smoothly.
● A larger learning rate speeds up the learning but may not converge.
● A decaying learning rate is preferred.
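The "decaying learning rate" idea can be sketched with a simple exponential schedule (one common choice among several - step decay, cosine, etc.; the decay rate 0.5 below is purely illustrative):

```python
def exponential_decay(initial_lr, decay_rate, step):
    """Exponentially decay the learning rate: lr = lr0 * decay_rate ** step."""
    return initial_lr * (decay_rate ** step)

for step in range(4):
    print(step, exponential_decay(0.1, 0.5, step))
# 0 0.1 / 1 0.05 / 2 0.025 / 3 0.0125
```

Early steps take large, fast updates; later steps shrink so the optimizer can settle smoothly, matching the trade-off in the bullets above.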
HP related to training methods
● Momentum: helps to know the direction of the next step from the knowledge of the previous steps.
○ A typical choice is between 0.5 and 0.9.
● Number of epochs is the number of times the whole training data is shown to the network while training.
○ Too many epochs can lead to overfitting.
● Batch size: the number of sub-samples given to the network, after which a parameter update happens.
○ A good default for batch size might be 32.
○ Also try 32, 64, 128, 256, and so on.
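The momentum bullet can be made concrete with the classical (heavy-ball) SGD-with-momentum update - a standard formulation assumed here, since the slide gives no equations:

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One classical momentum update: the velocity v accumulates an
    exponentially decaying average of past gradients."""
    v = momentum * v - lr * grad
    w = w + v
    return w, v

# Two steps with the same gradient: the second step moves further,
# because the velocity remembers the first step's direction.
w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, v, grad=1.0)  # v = -0.1, w = 0.9
w, v = sgd_momentum_step(w, v, grad=1.0)  # v ≈ -0.19, w ≈ 0.71
print(w, v)
```

With momentum 0.0 every step would be plain SGD; values near 0.9 make consistent gradient directions compound, which is why 0.5-0.9 is the typical range quoted above.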
Find out HP?
Methods used to find out hyperparameters:
● Manual search
● Grid search
● Random search
● Bayesian optimization
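Grid search, the simplest automated method above, exhaustively evaluates every combination of hyperparameter values. A minimal sketch (the score function is a stand-in; in practice it would train and validate a model):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination of values in param_grid and return the
    best-scoring parameter dictionary together with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in score: pretend lr=0.01 with batch_size=64 validates best.
score = lambda p: -abs(p["lr"] - 0.01) - abs(p["batch_size"] - 64)
grid = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 64, 128]}
best, _ = grid_search(grid, score)
print(best)  # {'lr': 0.01, 'batch_size': 64}
```

Random search samples the same space instead of enumerating it, and Bayesian optimization chooses each trial based on the results of previous ones - both scale better when the grid is large.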
