Unit-2 L3 (3)
Early Stopping
• Increase in validation set error:
• When training large models with sufficient representational
capacity to overfit the task, training error decreases steadily
over time, but validation set error begins to rise again.
• An example of this behavior is shown next
Early Stopping
• Increase in validation set error:
• The learning curves show how the negative log-likelihood loss changes over time (measured in training iterations over the data set, or epochs).
• In this example, we train a maxout network on MNIST (maxout further generalizes ReLU).
• The training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
Early Stopping
• Saving parameters
• We can thus obtain a model with better validation set error (and thus better test error) by returning to the parameter setting at the point in time with the lowest validation set error.
• Every time the error on the validation set improves, we store a
copy of the model parameters.
• When the training algorithm terminates, we return to these
parameters, rather than the latest set.
Early Stopping
• Early stopping meta-algorithm (sketched below):
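• A minimal Python sketch of this meta-algorithm (not from the slides; model.get_params()/set_params(), train_n_steps, and validation_error are assumed helpers):

import copy

def early_stopping(model, train_n_steps, validation_error, n=100, patience=5):
    # n: number of training steps between validation evaluations
    # patience: number of evaluations without improvement before stopping
    best_params = copy.deepcopy(model.get_params())   # best parameters so far
    best_error = validation_error(model)              # lowest validation error so far
    best_step = 0                                      # step at which it occurred
    step = 0
    evals_without_improvement = 0
    while evals_without_improvement < patience:
        train_n_steps(model, n)                        # run the training algorithm n more steps
        step += n
        err = validation_error(model)
        if err < best_error:                           # validation error improved:
            best_error, best_step = err, step          # remember this point in training
            best_params = copy.deepcopy(model.get_params())
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
    model.set_params(best_params)                      # return to the best parameters found
    return best_params, best_step, best_error

• The returned best_step is the number of training steps that the retraining strategies described later reuse.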
Early Stopping
• Strategy of Early Stopping:
• The above strategy is known as Early Stopping
• It is the most common form of regularization in deep learning
• Its popularity is due to its effectiveness and its simplicity
• We can think of early stopping as a very efficient hyperparameter
selection algorithm
• Unlike most hyperparameters, it can be tuned in a single training run, since one run sweeps through every candidate number of steps
• In this view, the number of training steps is just another hyperparameter
• This hyperparameter has a U-shaped validation set performance curve
• Most hyperparameters have such a U-shaped validation set
performance curve, as seen below
• In the case of early stopping, we are controlling the effective capacity
of the model by determining how many steps it can take to fit the
training set
Early Stopping
• Early Stopping as Regularization:
• Early stopping is an unobtrusive form of regularization
• It requires almost no change to the underlying training
procedure, the objective function, or the set of allowable
parameter values
• So it is easy to use early stopping without damaging the
learning dynamics
• This is in contrast to weight decay, where we must be careful not to
use too much weight decay
• Otherwise, we trap the network in a bad local minimum
corresponding to pathologically small weights (see the sketch below)
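• For contrast, a minimal sketch of how weight decay changes the objective itself (treating weights as a list of NumPy arrays; the coefficient value is illustrative):

import numpy as np

def penalized_loss(task_loss, weights, weight_decay=1e-4):
    # Weight decay modifies the objective: J_tilde = J + (lambda / 2) * sum ||w||^2.
    # Too large a lambda pushes the weights toward zero; early stopping instead
    # leaves the objective untouched and only limits how long we optimize it.
    l2_penalty = 0.5 * weight_decay * sum(np.sum(w ** 2) for w in weights)
    return task_loss + l2_penalty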
Early Stopping
• Use of a second training step:
• Early stopping requires a validation set
• Thus some training data is not fed to the model
• To best exploit this extra data, one can perform extra
training after the initial training with early stopping has
completed
• In the second extra training step, all the training data is
included
• There are two basic strategies for the second training
procedure
Early Stopping
• First Strategy for Retraining
• One strategy is to initialize the model again and retrain
on all the data
• In this second training pass, we train for the same number of
steps as the early stopping procedure determined was
optimal in the first pass
• It is unclear whether to retrain for the same number of parameter
updates or the same number of passes through the data set
• On the second round, each pass through the data set will require
more parameter updates, because the data set is bigger
Early Stopping
• First meta-algorithm for retraining
• A meta-algorithm for using early stopping to
determine how long to train, then retraining on all
the data (sketched below)
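• A minimal sketch of this first strategy (init_model, train_step, all_data.sample_batch(), and the best_step found by early stopping above are assumed names):

def retrain_from_scratch(init_model, train_step, all_data, best_step):
    # Re-initialize the model and train on all the data (training + validation)
    # for the number of parameter updates that early stopping found optimal.
    model = init_model()
    for _ in range(best_step):
        batch = all_data.sample_batch()   # assumed helper: draw a minibatch
        train_step(model, batch)
    return model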
Early Stopping
• Second strategy for retraining
• Keep all the parameters obtained from the first round of
training and then continue training, but now using all the
data
• We no longer have a guide for when to stop in terms of
the number of steps
• Instead, we monitor the average loss function on the
validation set and continue training until it falls below
the value of the training set objective at which early
stopping halted
Early Stopping
• Second meta-algorithm for retraining
• A meta-algorithm using early stopping to determine at
what objective value we start to overfit, then continuing
training until that value is reached (sketched below)
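• A minimal sketch of this second strategy (stop_objective is the training set objective value recorded when early stopping halted; the other names are assumed helpers):

def continue_on_all_data(model, train_step, all_data, validation_error,
                         stop_objective, max_steps=100_000, eval_every=100):
    # Keep the parameters from the first round and keep training on all the
    # data until the average validation loss falls below stop_objective.
    for step in range(max_steps):
        batch = all_data.sample_batch()
        train_step(model, batch)
        if step % eval_every == 0 and validation_error(model) <= stop_objective:
            break
    return model

• This avoids the cost of a full second training run, but the validation loss may never reach the target value, which is why the sketch caps the number of steps.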
Early Stopping
• Early stopping as a regularizer:
• So far we have stated that early stopping is a regularization
strategy
• But we have supported this claim only by showing learning curves
where the validation set error has a U-shaped curve
• What is the actual mechanism by which early stopping
regularizes the model?
• Early stopping has the effect of restricting the optimization
procedure to a relatively small volume of parameter space in
the neighborhood of the initial parameter value θ0, as the calculation below makes concrete
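• A short calculation (a sketch assuming plain gradient descent with learning rate ε run for τ steps):

\theta_\tau \;=\; \theta_0 \;-\; \epsilon \sum_{t=0}^{\tau-1} \nabla_\theta J(\theta_t)
\qquad\Longrightarrow\qquad
\|\theta_\tau - \theta_0\| \;\le\; \epsilon\,\tau\,\max_{t}\|\nabla_\theta J(\theta_t)\|

• So the product of the learning rate and the number of steps bounds how far the parameters can move from θ0, which is the sense in which limiting the number of steps limits the model's effective capacity.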
Early Stopping
• Early Stopping vs L2 regularization