Lecture 6: Regularization
Xavier Bresson
https://twitter.com/xbresson
https://www.cs.cornell.edu/courses/cs4780/2018fa
https://knmnyn.github.io/cs3244-2210
https://arxiv.org/pdf/1812.11118.pdf
Prof Xavier Bresson, CS6208 NUS, Advanced Topics in Artificial Intelligence, 2023
Outline
Reducing over-fitting
Loss regularization
Cross-validation
Early stopping
Double descent
Conclusion
[Figure (recap): train error decreases with model complexity while test error follows a U-shape curve.]
Reducing over-fitting
How to avoid under-fitting? (easy)
Increase the expressivity of the learner.
How to avoid over-fitting? (difficult)
Use a regularization loss, with cross-validation to estimate the regularization parameter.
Use early stopping with a validation set.
Regularize with stochastic gradient descent (SGD):
SGD not only speeds up gradient descent by computing an approximate gradient with a mini-batch of data points, it also regularizes the predictive function w.r.t. its parameters θ, allowing better generalization performance.
Theoretically, we should use mini-batches of a single data point for the best generalization, but this would be too slow. A mini-batch of e.g. 512 data points is a good trade-off between speed and accuracy (see the sketch below).
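A minimal NumPy sketch (mine, not from the slides) of mini-batch SGD for linear regression, to make the trade-off concrete; the learning rate, batch size, and epoch count are illustrative assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=512, epochs=100, seed=0):
    """Mini-batch SGD on the MSE loss (1/n)||X theta - y||^2 (illustrative sketch)."""
    n, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(n)                       # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # approximate gradient computed on one mini-batch only
            grad = (2.0 / len(b)) * X[b].T @ (X[b] @ theta - y[b])
            theta -= lr * grad
    return theta
```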
Loss regularization
Can we reduce the hypothesis space ℋ₁₀ to ℋ₂?
We have ℋ₁₀ = { f_θ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + ⋯ + θ₁₀x¹⁰ }
and ℋ₂ = { f_θ(x) = θ₀ + θ₁x + θ₂x² }.
[Figure: training data with the ℋ₂ fit and the ℋ₁₀ fit.]
Let us recall the MSE optimization problem for the regression task:
min_θ L_ℋ(θ) = (1/n) Σ_{i=1}^n ( f_ℋ(x^(i)) − y^(i) )²   (unconstrained optimization)
Let us relax the hard constraints θ_j = 0 for j ≥ 3, letting the optimization select the best values of these θ_j:
min_θ L(θ) s.t. Σ_{j≥3} θ_j² ≤ C
Relationship between hypothesis spaces: ℋ₁₀ is larger, and the constraint value C controls the size of the intermediate spaces;
ℋ₂ = { f_θ(x) = θ₀ + θ₁x + θ₂x² } is recovered when C = 0.
Let us consider the general constrained regularized problem:
(1) min_θ L(θ) s.t. Ω(θ) ≤ C   vs.   (2) min_θ L(θ) + λ Ω(θ)
For each value C, there exists a value λ such that (1) is equivalent to (2) (Lagrange multiplier).
Additionally, C ∝ 1/λ: a tight constraint (small C) corresponds to strong regularization (large λ).
Study the influence of the regularization parameter λ on the solution of
min_θ L(θ) + λ θᵀθ ⇔ min_θ L(θ) s.t. θᵀθ ≤ C
Normal equations for linear regression, without and with loss regularization:
∇L = ∂L/∂θ = 0 ⇒ Xᵀ(Xθ − y) = 0 ⇒ θ = (XᵀX)⁻¹ Xᵀy
∇L = ∂L/∂θ = 0 ⇒ (1/n) Xᵀ(Xθ − y) + λθ = 0 ⇒ θ = (XᵀX + λnI)⁻¹ Xᵀy
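The regularized normal equations translate directly into code; a minimal NumPy sketch (the function name and arguments are mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve theta = (X^T X + lam * n * I)^(-1) X^T y from the regularized normal equations."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)   # solve() is more stable than an explicit inverse
```

For λ > 0 the matrix XᵀX + λnI is positive definite, hence always invertible, which is a practical side benefit of the regularizer.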
Understanding the solution of the normal equations:
min_θ L(θ) = L_MSE(θ) s.t. L_REG(θ) ≤ C
with L_MSE(θ) = (1/n) ‖Xθ − y‖² and L_REG(θ) = θᵀθ = ‖θ‖₂²
Let us plot the landscapes of the L_MSE loss and the L_REG loss.
" #
Landscape of LMSE loss : LMSE(%)= C% − D
& LMSE(!)
Quadratic and convex function.
Let us suppose we have two parameters, i.e. %=(θ1, θ2), for visualization.
θ* θ* θ)
Landscape of the L_REG loss: L_REG(θ) = θᵀθ = ‖θ‖₂², with gradient ∇L_REG(θ) = 2θ.
[Figure: circular level sets of L_REG in the (θ₁, θ₂) plane; the constraint set L_REG(θ) ≤ C is the disk of radius √C.]
Minimizer of the total loss:
θ* = argmin_θ L_MSE(θ) s.t. L_REG(θ) ≤ C ⇔ θ* = argmin_θ L_MSE(θ) + λ L_REG(θ)
[Figure: θ* lies where a level set of L_MSE touches the boundary of the L2-ball L_REG(θ) ≤ C.]
The L2-ball regularization can be generalized to the Lp-ball, p ∈ [0, +∞]:
min_θ L_MSE(θ) s.t. ‖θ‖_p^p ≤ C ⇔ min_θ L_MSE(θ) + λ ‖θ‖_p^p, where ‖θ‖_p^p = Σ_{j=1}^d |θ_j|^p
MSE + L2 regularization vs. L1 regularization
Lp regularization, 0 < p ≤ 1
Advantages: very sparse solutions, even sparser than with L1 regularization.
Limitations: non-convex and non-differentiable; the solution depends on the initial condition.
Lp regularization, p = ∞
Never used in practice (not stable).
Lp regularized loss for any predictive task:
min_θ L_Task(θ) s.t. ‖θ‖_p^p ≤ C ⇔ min_θ L_Task(θ) + λ ‖θ‖_p^p
where L_Task(θ) = (1/n) Σ_{i=1}^n ℓ_Task( f_θ(x^(i)), y^(i) )
and ‖θ‖_p^p = Σ_{j=1}^d |θ_j|^p
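As an illustrative sketch (mine, not from the slides), the penalty and its (sub)gradient for the two most common cases, p = 1 and p = 2, which is all a gradient-based solver needs to add the regularizer to any task loss:

```python
import numpy as np

def lp_penalty(theta, p):
    """||theta||_p^p = sum_j |theta_j|^p."""
    return np.sum(np.abs(theta) ** p)

def lp_penalty_grad(theta, p):
    """(Sub)gradient of ||theta||_p^p for the common cases."""
    if p == 2:
        return 2.0 * theta        # gradient of sum theta_j^2
    if p == 1:
        return np.sign(theta)     # a subgradient of sum |theta_j| (0 at theta_j = 0)
    raise NotImplementedError("0 < p < 1 is non-convex and needs specialized solvers")

# One regularized gradient step on a task loss would then be:
# theta -= lr * (task_grad(theta) + lam * lp_penalty_grad(theta, p))
```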
Summary
Adding a regularization loss, a.k.a. regularizer, can reduce over-fitting.
With the right amount of regularization, controlled by the hyper-parameter λ (or C), the regularizer can decrease the complexity of the predictive model, i.e. its variance, without affecting the bias.
With no regularization, the model over-specializes to the training data, i.e. high variance.
With too much regularization, the model becomes too simple, i.e. high bias.
How to choose λ, i.e. the right amount of regularization?
[Figure: train and test error vs. the regularization parameter λ; the test error follows a U-shape curve, whose minimum marks the right model complexity (right-fitting, minimal over-fitting).]
Cross-validation
We need a surrogate of the test set to estimate the regularization parameter λ, which identifies the right model complexity that minimizes the test error.
The simplest approach is to split the training set into two datasets: a smaller training set and a validation set.
Given a training set S of n data points, S is split into
a smaller training set S_train of n − m data points,
a validation set S_val of m data points.
How do we use S_val to estimate the regularization value λ?
Use p hypotheses/values to estimate λ:
We consider p values λ₁, …, λ_p.
For each λ_i, train a learner f_{λ_i} on S_train and compute its validation loss on S_val.
After selecting the best-performing value λ*, learn the final f_{λ*} using all training points of S.
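A hold-out sketch of this selection loop; `fit` and `val_loss` stand for any training routine and validation metric (e.g. the `ridge_fit` above with an MSE loss), and these names are my assumptions:

```python
import numpy as np

def select_lambda(X, y, lambdas, m, fit, val_loss, seed=0):
    """Pick lambda* on a held-out validation set of m points, then refit on all n points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    val, train = idx[:m], idx[m:]                  # m validation, n - m training points
    losses = [val_loss(fit(X[train], y[train], lam), X[val], y[val])
              for lam in lambdas]
    lam_star = lambdas[int(np.argmin(losses))]
    return fit(X, y, lam_star), lam_star           # final f learned on the full set S
```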
How robust is the estimation of the validation loss L_val?
Suppose that
ℓ(f_θ(x), y) is the loss value for the data point (x, y),
E_{(x,y)∼U}[ ℓ(f_θ(x), y) ] = L_test(f), which is the mean error of the predictive function f applied to U, the set of all points unseen by f during training, i.e. S_test and S_val,
Var_{(x,y)∼U}[ ℓ(f_θ(x), y) ] = σ², which corresponds to the variance of the prediction error,
L_val(f) = (1/m) Σ_{k=1}^m ℓ( f(x^(k)), y^(k) ) is the validation loss.
Then, we have
E_{(x,y)∼U}[ L_val(f) ] = (1/m) Σ_{k=1}^m E_{(x,y)∼U}[ ℓ(f(x^(k)), y^(k)) ] = L_test(f)
Var_{(x,y)∼U}[ L_val(f) ] = (1/m²) Σ_{k=1}^m Var_{(x,y)∼U}[ ℓ(f(x^(k)), y^(k)) ] = (1/m²) · m σ² = σ²/m ⇒ Std = σ/√m
Hence L_val = L_test ± O(1/√m).
Consequence: a small validation set does not provide a good estimate of the test error.
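A quick simulation (mine, not from the lecture) checking the σ/√m rate: draw per-point losses with standard deviation σ and measure the spread of their m-point average.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 2.0, 10_000
for m in (10, 100, 1_000):
    # each row is one validation set of m per-point losses; average them
    L_val = rng.normal(loc=0.5, scale=sigma, size=(trials, m)).mean(axis=1)
    print(f"m={m:5d}  empirical std={L_val.std():.4f}  sigma/sqrt(m)={sigma/np.sqrt(m):.4f}")
```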
In practice, we have two situations.
Big datasets
Modern situation: training sets are large, e.g. millions of data points.
We can use a small fraction, e.g. m = 100,000 data points, as the validation set.
The validation set will approximate the test distribution well.
Small datasets
Situation before 2012, or today for datasets that are highly expensive or challenging to collect (e.g. nuclear fusion) or protected (e.g. medical data).
For limited datasets, e.g. n = 1,000 data points, it is not possible to simultaneously get good estimates of both the predictive function and the validation error.
For small datasets, we have two opposite cases.
Recall: model f is trained on the full training set of n data points, and f′ is trained on the n − m remaining training points.
Case #1: small number m of validation data / large number n − m of training data (e.g. n − m = 90 train, m = 10 validation): f′ is a good learner close to f, but L_val is a noisy estimate of the test error (Std = σ/√m).
Case #2 is the opposite: a large m gives a reliable L_val, but f′ is trained on too few points to be a good proxy for f.
k-fold cross-validation technique:
Split the original training set S into k parts, i.e. each fold has n/k data points.
Repeat for all folds: train on k − 1 parts and leave one part out as the validation set.
Advantages
Each data point in the original training set is used once as validation data.
For each fold, we have a large training set to train a good learner, i.e. L_test(f) ≈ L_test(f′).
We also have a good estimate of the validation error by averaging over all folds, i.e. L_test ≈ mean_folds L_val(f′).
[Figure: S split into k = 10 folds of 10% each; in each round, one fold serves as S_val and the remaining folds form S_train.]
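A minimal k-fold sketch, reusing the assumed `fit`/`val_loss` interface from the hold-out example above:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, val_loss, lam, seed=0):
    """Average the validation loss over k folds: L_test ~ mean over folds of L_val(f')."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(val_loss(fit(X[train], y[train], lam), X[val], y[val]))
    return float(np.mean(scores))
```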
Example: model selection using cross-validation
Models: linear and constant models.
Original training set S: n = 3 data points.
Cross-validation values: m = 1 (validation set S_val), n − m = 2 (training set S_train), i.e. k = 3 folds.
For each model, L_CV = (1/3)(ℓ₁ + ℓ₂ + ℓ₃), where ℓ_i is the validation loss on fold i.
Linear model: L_CV = 8.2. Constant model: L_CV = 4.3.
Cross-validation shows that the constant model is a better fit than the linear model for this dataset.
In practice
Hypotheses are an arbitrary set of choices, e.g. choices of predictive models {f_θ}, choices of parameter values {λ}, etc.
Note that parameters such as the number of layers in a neural network are not differentiable, i.e. gradient descent cannot be used to select their optimal value.
For very small datasets, we cannot afford to leave out more than a single training point for validation, so we use k = n folds (i.e. m = 1 validation point), a.k.a. Leave-One-Out Cross-Validation (LOOCV).
Telescopic search is a standard two-step approach to determine parameter values (sketched below).
First step: find the best order of magnitude for λ, e.g. λ = 0.01, 0.1, 1, 10, 100.
Second step: do a fine-grained search around the best λ found in the first step. For example, if 10 is the best-performing value from the first step, then try λ = 3, 6, 10, 30, 60, 90.
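A sketch of the two-step telescopic search on top of the `k_fold_cv` helper above (it assumes `X`, `y`, `fit`, and `val_loss` are in scope; the ×0.3/×0.6/×3/... multipliers are my generalization of the 3, 6, 10, 30, 60, 90 example):

```python
# Step 1: coarse order-of-magnitude search
coarse = [0.01, 0.1, 1.0, 10.0, 100.0]
lam1 = min(coarse, key=lambda lam: k_fold_cv(X, y, 10, fit, val_loss, lam))

# Step 2: fine-grained search around the winner
# (e.g. 3, 6, 10, 30, 60, 90 when lam1 = 10)
fine = [lam1 * f for f in (0.3, 0.6, 1.0, 3.0, 6.0, 9.0)]
lam_star = min(fine, key=lambda lam: k_fold_cv(X, y, 10, fit, val_loss, lam))
```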
Summary
Cross-validation is a sound technique that works well in practice for different hypotheses and different training-set sizes.
The validation set is a surrogate of the test set, but its capacity to represent the test distribution well depends on its size.
Best-case scenario: training and validation sets are both large enough.
Worst-case scenario: either the training or the validation set is small; then k-fold cross-validation is required.
Even in the best-case scenario, the learner must still be fully trained for each hypothesis, which can be time-consuming, e.g. in deep learning.
Can we develop a faster regularization technique to avoid over-fitting?
Early stopping
The fastest regularization technique to avoid over-fitting.
Stop optimization after T gradient steps, when the validation error starts increasing, even if the optimization has not converged yet.
Not really satisfying from an optimization-theory perspective, but it works well in practice.
One of the most common regularization techniques in deep learning to control over-fitting.
[Figure: train error decreases with the number of gradient steps (iterations) while the validation error, a proxy for the test error, follows a U-shape curve; optimization is stopped at step T, the validation minimum (right-fitting, minimal over-fitting).]
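A framework-agnostic sketch of the early-stopping loop; `step_fn`, `val_loss_fn`, and the `patience` rule are my assumptions standing in for one SGD step, the validation metric, and the stopping criterion.

```python
import copy

def train_with_early_stopping(model, step_fn, val_loss_fn, max_steps, patience=10):
    """Keep the best-validated parameters; stop once validation stops improving."""
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for t in range(max_steps):
        step_fn(model)                      # one gradient step on the training loss
        v = val_loss_fn(model)
        if v < best_loss:                   # validation error still decreasing
            best_loss, best_model, since_best = v, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:      # validation error keeps increasing: stop here (T = t)
                break
    return best_model
```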
Double descent
Classical machine learning (ML)
Bias-variance trade-off curve w.r.t. model complexity.
U-shape curve for the test/generalization error.
[Figure: train error decreases with model complexity while test error follows a U-shape curve; bias² decreases and variance increases with model complexity.]
How to interpret the U-shape bias-variance trade-off curve?
In classical ML theory, when model complexity increases, variance and generalization error also increase.
However, (deep learning) practitioners have observed the opposite phenomenon: when model complexity increases, the generalization error decreases!
This empirical result contradicts the conventional theory and shows a significant gap between theory and practice.
To reconcile this inconsistency and better understand the properties of modern large ML models, a new learning mechanism known as "double descent" was introduced in 2018.
Double descent curve w.r.t. the ratio between the model complexity and the dataset size:
[Figure: train and test error vs. the ratio model complexity / dataset size = # parameters / # data, on a log scale from 10⁻² to 10²; the classical ML regime shows the U-shape, and the modern ML regime shows a second descent. Marked points: right-fitting (early stopping), min over-fitting (interpolation point), double descent (min test error).]
Double descent curve for bias and variance:
[Figure: on the same axis, bias² decreases while variance peaks near the interpolation point and then decreases again in the modern ML regime.]
New learning curves
Let p be the number of parameters of the learner and n the number of training data points.
Under-parametrized functions are defined by p ≪ n (classical ML).
Over-parametrized functions are defined by p ≫ n (modern ML).
We also introduce the interpolation point, p = n, the minimal capacity needed to over-fit the training set.
In the classical ML paradigm, the optimal test error is at the minimum of the bias-variance trade-off and is captured in practice with early stopping using a validation set.
Classical ML establishes the existence of a right balance between under-fitting and over-fitting. Beyond this balance point, i.e. in over-fitting, generalization fails.
In the modern ML regime, over-fitting is actually considered beneficial, and over-parametrized functions with high model complexity lead to successful generalization.
Understanding the second descent (new learning mechanism)
When p = n, the model possesses just enough parameters to over-fit all the training data. However, it also exhibits a significant variance, making it unable to generalize (standard result).
When p ≫ n, the model has many more parameters than training data points. In this regime, the learner f_θ(x) continues to over-fit but, critically, the L2 norm of its parameters ‖θ‖₂² is significantly minimized by SGD, effectively reducing the model capacity (regularization effect).
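A small demonstration (mine, under the assumption of linear over-parametrized regression) of this norm-minimization effect: among the infinitely many interpolating solutions when p ≫ n, the pseudo-inverse picks the one with minimal L2 norm, which is also the solution gradient descent from zero converges to for linear least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                       # p >> n: over-parametrized regime
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta = np.linalg.pinv(X) @ y        # minimum-norm interpolating solution
print(np.allclose(X @ theta, y))     # True: the training data is fit exactly
print(np.linalg.norm(theta))         # smallest ||theta||_2 among all interpolators
```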
GD vs. SGD
Observe that the full gradient of the loss is zero in the over-fitting region. Consequently, the parameters cannot change, and the second descent phenomenon cannot emerge when using standard GD.
In contrast, the stochastic mini-batch gradient is never zero in the over-fitting region. The parameters continue to be updated, and the second descent can occur.
SGD is critical in deep learning for several reasons:
It helps escape saddle points in the loss landscape during optimization.
It finds better local (or even global) minima, allowing successful generalization.
It speeds up computation (by updating the parameters more often).
It is necessary for the double descent phenomenon to emerge.
Illustration
[Figure: an under-parametrized function (classical ML) vs. an over-parametrized function (modern ML), with the corresponding test error, bias², and variance curves.]
Farewell to early stopping?
Do we "simply" need over-parametrized functions, train them, and achieve minimal error?
Unfortunately, the double descent regularization only emerges with exceedingly large networks. The critical threshold to enable double descent is p* = O(n·k), where k is the number of classes.
Computer vision
ImageNet: n = 10⁶ (1.3M images), k = 10³ (1k classes) ⇒ p* = 10⁹; ResNet-152 has p = 60.2M (10⁷) parameters ≪ 10⁹.
ViT: n = 10⁹ (4B images), k = 10⁴ (30k classes) ⇒ p* = 10¹³; ViT-22B has p = 22B (10¹⁰) parameters ≪ 10¹³.
NLP
n = 10¹¹ (300B training tokens), k = 10⁴ (35k unique tokens) ⇒ p* = 10¹⁵; GPT-3 has p = 175B (10¹¹) parameters ≪ 10¹⁵.
At present, practitioners use early stopping as their primary regularization technique.
By design, early stopping does not lead to the double descent phenomenon.
Some key observations
The double descent learning mechanism applies to both non-linear and linear ML models, including techniques such as decision trees, kernel methods, and deep learning.
The phenomenon is independent of the nature of the datasets involved.
One important ML principle is that more data provides better results; both theory and empirical experiments align on this principle.
However, this trend continually increases the critical threshold p* of network parameters required for the double descent phenomenon to manifest.
Conclusion
Over-fitting and high variance are among the most common issues in machine learning.
Regularization techniques
Stochastic gradient descent: the smaller the mini-batch, the better the generalization, but the slower the training.
Loss regularization and cross-validation to estimate the right amount of regularization.
Early stopping to terminate optimization before over-fitting.
Double descent with over-parametrized functions.
Questions?