Unit-2 L3 (3)

Regularization

Early Stopping
• Increase in validation set error:
• When training large models with sufficient representational
capacity to overfit the task, training error decreases steadily
over time, but validation set error begins to rise again.
• An example of this behavior is shown next
Early Stopping
• Increase in validation set error:
• The learning curves show how the negative log-likelihood loss
changes over training time (measured in passes through the data
set, or epochs). In this example, a maxout network (maxout
generalizes ReLU) is trained on MNIST. The training objective
decreases consistently over time, but the average validation
set loss eventually begins to increase again, forming an
asymmetric U-shaped curve
Early Stopping
• Saving parameters
• We can thus obtain a model with better validation set error
(and therefore better test error) by returning to the parameter
setting at the point in time with the lowest validation set error.
• Every time the error on the validation set improves, we store a
copy of the model parameters.
• When the training algorithm terminates, we return to these
parameters, rather than the latest set.
Early Stopping
• Early stopping meta algorithm:
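The meta-algorithm referenced above (shown as a figure in the original slides) can be sketched as follows. This is a minimal sketch: the model representation and the `train_step` and `validation_error` callables are hypothetical placeholders, and `patience` is an assumed hyperparameter for how many non-improving evaluations to tolerate before halting.

```python
import copy

def early_stopping_train(model, train_step, validation_error, patience=5):
    """Sketch of the early stopping meta-algorithm: train while
    periodically evaluating on the validation set, keep a copy of the
    parameters with the lowest validation error seen so far, and halt
    after `patience` evaluations in a row with no improvement."""
    best_params = copy.deepcopy(model)
    best_error = validation_error(model)
    best_step = 0
    evals_without_improvement = 0
    step = 0
    while evals_without_improvement < patience:
        train_step(model)                      # one optimization step
        step += 1
        err = validation_error(model)
        if err < best_error:                   # improved: store a copy
            best_params = copy.deepcopy(model)
            best_error, best_step = err, step
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
    return best_params, best_step, best_error  # best, not latest, params
```

Returning `best_step` matters because the retraining strategies described later reuse it as the optimal number of training steps.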
Early Stopping
• Strategy of Early Stopping:
• The above strategy is known as Early Stopping
• It is the most common form of regularization in deep learning
• Its popularity is due to its effectiveness and its simplicity
• We can think of early stopping as a very efficient hyperparameter
selection algorithm
• In this view, the number of training steps is just another hyperparameter
• This hyperparameter has a U-shaped validation set performance curve
• Most hyperparameters that control model capacity have such a
U-shaped validation set performance curve
• In the case of early stopping, we are controlling the effective capacity
of the model by determining how many steps it can take to fit the
training set
Early Stopping
• Early Stopping as Regularization:
• Early stopping is an unobtrusive form of regularization
• It requires almost no change to the underlying training
procedure, the objective function, or the set of allowable
parameter values
• So it is easy to use early stopping without damaging the
learning dynamics
• This is in contrast to weight decay, where we must be careful
not to use too much
• Otherwise, we trap the network in a bad local minimum
corresponding to pathologically small weights
Early Stopping
• Use of a second training step:
• Early stopping requires a validation set
• Thus some training data is not fed to the model
• To best exploit this extra data, one can perform extra
training after the initial training with early stopping has
completed
• In the second extra training step, all the training data is
included
• There are two basic strategies for the second training
procedure
Early Stopping
• First Strategy for Retraining
• One strategy is to initialize the model again and retrain
on all the data
• In the second training pass, we train for the same number of
steps that the early stopping procedure determined was
optimal in the first pass
• Should we retrain for the same number of parameter
updates or the same number of passes through the data set?
• On the second round, each pass through the data set requires
more parameter updates, because the data set is bigger
Early Stopping
• First meta-algorithm for retraining
• A meta-algorithm for using early stopping to
determine how long to train, then retraining on all
the data
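This first retraining strategy can be sketched as below; the `make_model` and `train_step_full` callables are hypothetical placeholders for fresh model construction and one optimization step on the full data set, respectively.

```python
def retrain_on_all_data(make_model, train_step_full, best_step):
    """First retraining strategy (sketch): re-initialize the model and
    train on all the data for the number of steps that early stopping
    on the training subset determined was optimal."""
    model = make_model()           # start again from a fresh initialization
    for _ in range(best_step):     # same number of steps as the first pass
        train_step_full(model)     # but each step now sees all the data
    return model
```

Whether `best_step` should count parameter updates or passes through the data set is exactly the ambiguity raised in the slide above; this sketch counts parameter updates.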
Early Stopping
• Second strategy for retraining
• Keep all the parameters obtained from the first round of
training and then continue training but now using all the
data
• We no longer have a guide for when to stop in terms of
the number of steps
• Instead, we monitor the average loss function on the
validation set and continue training until it falls below
the value of the training set objective at which early
stopping halted
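A sketch of this second strategy, with hypothetical `train_step_full` and `validation_loss` callables; the `max_steps` cap is an added safeguard, since the validation loss is not guaranteed to ever reach the target value.

```python
def continue_training(model, train_step_full, validation_loss,
                      target_objective, max_steps=1000):
    """Second retraining strategy (sketch): keep the parameters from
    the first round of training and continue training on all the data
    until the average validation loss falls below the training set
    objective value at which early stopping halted."""
    for step in range(max_steps):
        if validation_loss(model) <= target_objective:
            return model, step       # target reached
        train_step_full(model)
    return model, max_steps          # give up: target never reached
```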
Early Stopping
• Second meta-algorithm for retraining
• Meta-algorithm using early stopping to determine at
what objective value we start to overfit, then continue
training until that value is reached
Early Stopping
• Early stopping as a regularizer:
• So far we have stated that early stopping is a regularization
strategy
• But supported the claim only by showing learning curves
where the validation set error has a U shaped curve
• What is the actual mechanism by which early stopping
regularizes the model?
• Early stopping has the effect of restricting the optimization
procedure to a relatively small volume of parameter space in
the neighborhood of the initial parameters θ0
Early Stopping
• Early Stopping vs L2 Regularization:
• (Figure: two weights; solid contour lines are contours of the negative
log-likelihood)
• Left: dashed lines indicate the trajectory of SGD. Rather than stopping at the
point w* that minimizes the cost, early stopping results in an earlier point on
the trajectory
• Right: dashed circles indicate contours of the L2 penalty, which causes the
minimum of the total cost to lie nearer the origin than the minimum of
the unregularized cost
Regularization
Parameter Tying and Parameter Sharing
Parameter Tying
• L2 regularization (or weight decay) penalizes model
parameters for deviating from the fixed value of zero
• Sometimes we need other ways to express prior knowledge of
parameters
• We may know from domain and model architecture that there
should be some dependencies between model parameters
• We want to express that certain parameters should be close to one
another
Parameter Tying
• A scenario of parameter tying:
• Two models performing the same classification task (with same set of
classes) but with somewhat different input distributions
• Model A with parameters w(A)
• Model B with parameters w(B)
• The two models map the input to two different but related outputs
Parameter Tying
• L2 penalty for parameter tying:
• If the tasks are similar enough (perhaps with similar input and
output distributions), then we believe that the model
parameters should be close to each other:
for all i, w_i^(A) should be close to w_i^(B)
• We can leverage this information via regularization, using a
parameter norm penalty:
Ω(w^(A), w^(B)) = ||w^(A) − w^(B)||_2^2
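The penalty above can be sketched in a few lines of NumPy; the coefficient `lam` is a hypothetical regularization strength, not from the original slides.

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=1.0):
    """Parameter-tying penalty (sketch): the squared L2 distance
    between the parameter vectors of models A and B. Added to the
    training objective, it pulls the two sets of parameters toward
    each other without forcing them to be exactly equal."""
    return lam * np.sum((w_a - w_b) ** 2)

def tying_penalty_grad_wa(w_a, w_b, lam=1.0):
    """Gradient of the penalty with respect to w_a (the gradient
    with respect to w_b is the negative of this)."""
    return 2.0 * lam * (w_a - w_b)
```

Note the contrast with weight decay: the penalty is minimized when the two parameter vectors match each other, not when they are zero.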
Parameter Tying
• Use of Parameter Tying
• Approach was used for regularizing the parameters of one
model, trained as a supervised classifier, to be close to the
parameters of another model, trained in an unsupervised
paradigm (to capture the distribution of the input data)
• Example from unsupervised learning: k-means clustering. The input x is
mapped to a one-hot vector h: if x belongs to cluster i, then h_i = 1
and all other entries are zero
• Such a representation could be trained using an autoencoder with k
hidden units
Parameter Sharing
• Parameter sharing is where we:
• Force sets of parameters to be equal
• Because we interpret various models or model components as sharing a unique set of
parameters
• Only a subset of the parameters needs to be stored in memory
• In a CNN, this yields a significant reduction in the memory footprint of the model
Parameter Sharing
• CNN parameters
Parameter Sharing
• Use of parameter sharing in CNNs
• Most extensive use of parameter sharing is in convolutional neural
networks (CNNs)
• Natural images have many statistical properties that are invariant to
translation
• Ex: photo of a cat remains a photo of a cat if it is translated one pixel
to the right
• CNNs take this property into account by sharing parameters across
multiple image locations
• Thus we can find a cat with the same cat detector whether the cat
appears at column i or column i+1 in the image
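Parameter sharing in a convolution can be illustrated with a tiny 1-D example: the same two kernel weights are reused at every input position, so a "step detector" fires wherever the step occurs. This is an illustrative sketch, not code from the slides.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D convolution (cross-correlation) sketch: the *same* kernel
    weights are applied at every position of the input, so one small
    set of shared parameters covers all locations."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

# A step-up detector: responds with +1 wherever the input rises.
detector = np.array([-1.0, 1.0])
```

Shifting the input by one position shifts the detector's response by one position, with no change to the two shared weights — the translation-invariance property described above.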
Parameter Sharing
• Simple description of CNN
