
Lecture6 Regularization

The document summarizes techniques for regularization in machine learning models. It discusses how overfitting occurs when models perfectly fit training data but do not generalize to new data. It introduces regularization as a technique to reduce overfitting by adding constraints or penalties to model parameters during training. Specifically, it covers loss regularization which adds penalty terms to the loss function related to model complexity. Cross-validation and early stopping are also covered as techniques to select hyperparameters and avoid overfitting.

CS3244 : Machine Learning


Semester 1 2023/24

Lecture 6 : Regularization Techniques

Xavier Bresson
https://twitter.com/xbresson

Department of Computer Science


National University of Singapore (NUS)

Xavier Bresson 1

Material used for preparation

Prof Kilian Weinberger, CS4780 Cornell, Machine Learning, 2018

https://www.cs.cornell.edu/courses/cs4780/2018fa

Prof Min-Yen Kan, CS3244 NUS, Machine Learning, 2022

https://knmnyn.github.io/cs3244-2210

Prof Mikhail Belkin, 2018

https://arxiv.org/pdf/1812.11118.pdf

Prof Xavier Bresson, CS6208 NUS, Advanced Topics in Artificial Intelligence, 2023


Outline

Fitting the training set

Reducing over-fitting

Loss regularization

Cross-validation

Early stopping

Double descent

Conclusion


Fitting the training set


Goal : Fitting perfectly the training set
Example : Regression task
Training set : 5 data points sampled from data distribution (blue curve) + small noise
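Model : 4th order polynomial function, i.e. $f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

[Figure: data points (train set), target curve (test set) and fitted curve $f_\theta$.]

$\mathrm{Loss}(f(x \in S_{\text{train}})) = 0$ (good)
$\mathrm{Loss}(f(x \in S_{\text{test}})) = \text{large}$ (bad)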
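
A minimal sketch of this example, assuming a sine curve as the data distribution, a noise level of 0.1 and a fixed random seed (none of these values come from the lecture). It fits a 4th order polynomial to 5 noisy points and compares the train and test losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical ground-truth curve standing in for the data distribution.
    return np.sin(2 * np.pi * x)

# 5 noisy training points and a dense noise-free test grid.
x_train = np.linspace(0.0, 1.0, 5)
y_train = target(x_train) + 0.1 * rng.standard_normal(5)
x_test = np.linspace(0.0, 1.0, 200)
y_test = target(x_test)

# A 4th order polynomial has 5 coefficients, so it can interpolate 5 points.
theta = np.polyfit(x_train, y_train, deg=4)

train_loss = np.mean((np.polyval(theta, x_train) - y_train) ** 2)
test_loss = np.mean((np.polyval(theta, x_test) - y_test) ** 2)
print(f"train loss = {train_loss:.2e}, test loss = {test_loss:.2e}")
```

The train loss is zero up to floating-point error, while the test loss is not, which is the situation the slide describes.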


Fitting the training set


Another example

                 2nd order    10th order
L_train          0.029        0.0001
L_test           0.120        7680.0
L(f(S_train))    large        0
L(f(S_test))     small        large

[Figure: data with the $\mathcal{H}_2$ fit and the $\mathcal{H}_{10}$ fit.]


Fitting the training set


Under-fitting and over-fitting
Two problems are encountered when training on a dataset.
These problems relate to how well what is learned from the training set extrapolates to unknown data.
Under-fitting : The learner is not expressive enough. It makes errors even on the provided training set, i.e. it is unable to benefit from all the information present in the training data. In this case, both the training error and the test error are high.
Over-fitting : The learner is too expressive and becomes over-specialized to the training data, unable to extrapolate to unseen data because of high variance. In this situation, the training error is small and the test error is high.


Fitting the training set


Under-fitting, over-fitting and right-fitting

[Figure: train error and test error vs. model complexity. The test error follows a U-shape curve. Marked points: right-fitting (right model, right complexity) and min over-fitting. Regions from left to right: under-fitting (simple models, low complexity), close-fitting, over-fitting (complex models, high complexity).]

Reducing over-fitting
How to avoid under-fitting? (easy)
Increase the expressivity of the learner.
How to avoid over-fitting? (difficult)
Use a regularization loss, with cross-validation to estimate the regularization parameter.
Early stopping with a validation set.
Regularization with stochastic gradient descent (SGD).
SGD not only speeds up gradient descent by computing an approximate gradient on a mini-batch of data points, it also regularizes the predictive function w.r.t. its parameters $\theta$, allowing better generalization performance.
Theoretically, we should use mini-batches of a single data point for best generalization, but it would be too slow. Using mini-batches of e.g. 512 data points is a good trade-off between speed and accuracy.
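
The mini-batch update described above can be sketched as follows for linear regression with an MSE loss. The function name `sgd_linear_regression`, the learning rate, the batch size and the synthetic data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=32, lr=0.05, epochs=200, seed=0):
    """Mini-batch SGD on the MSE loss (1/n) * ||X @ theta - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data at every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            # Approximate gradient, computed on the mini-batch only.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            theta -= lr * grad
    return theta

# Synthetic noise-free data (hypothetical) to check the sketch.
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta_hat = sgd_linear_regression(X, y)
print(theta_hat)
```

A smaller `batch_size` gives noisier gradient estimates and more parameter updates per epoch, which is the speed versus regularization trade-off mentioned above.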


Loss regularization
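Can we reduce the hypothesis space $\mathcal{H}_{10}$ to $\mathcal{H}_2$ ?
We have $\mathcal{H}_{10} = \{ f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_{10} x^{10} \}$

and $\mathcal{H}_2 = \{ f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 \}$

Then $\mathcal{H}_{10} = \mathcal{H}_2$ when $\theta_3 = \theta_4 = \cdots = \theta_{10} = 0$

[Figure: data with the $\mathcal{H}_2$ fit and the $\mathcal{H}_{10}$ fit.]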


Loss regularization
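Let us recall the MSE optimization problem for the regression task :

$\min_\theta L_{\mathcal{H}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_{\mathcal{H}}(x^{(i)}) - y^{(i)})^2$ (unconstrained optimization)

Equivalent optimization problems :

$\min_\theta L_{\mathcal{H}_2}(\theta) \Leftrightarrow \min_\theta L_{\mathcal{H}_{10}}(\theta)$ such that $\theta_3 = \theta_4 = \cdots = \theta_{10} = 0$ (constrained optimization with hard constraints)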


Loss regularization
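Let us relax the hard constraints $\theta_{j \ge 3} = 0$ to let the optimization select the best value of $\theta_{j \ge 3}$ :

$\min_\theta L_{\mathcal{H}_{10}}(\theta)$ such that $\sum_{j \ge 3} \theta_j^2 \le C$, $C > 0$ (constrained optimization with soft constraints)

Hyper-parameter C controls how far the parameters $\theta_{j \ge 3}$ are allowed to deviate from zero.

A small value of C keeps the $\theta_{j \ge 3}$ close to zero, i.e. $\mathcal{H}_{10} \approx \mathcal{H}_2$.

A large value of C allows non-zero $\theta_{j \ge 3}$, i.e. $\mathcal{H}_{10} \gg \mathcal{H}_2$.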


Loss regularization
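Relationship between hypothesis spaces (from the largest $\mathcal{H}$ to the smallest $\mathcal{H}$) :

$C = \infty$ : $\mathcal{H}_{10} = \{ f_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_{10} x^{10} \}$

$C > 0$ : $\mathcal{H}_C = \{ f_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_{10} x^{10}$ such that $\sum_{j=3}^{10} \theta_j^2 \le C \}$

$C = 0$ : $\mathcal{H}_2 = \{ f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 \}$
$= \{ f_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_{10} x^{10}$ such that $\theta_3 = \theta_4 = \cdots = \theta_{10} = 0 \}$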


Loss regularization
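Let us consider the general constrained regularized problem :

$\min_\theta L(\theta)$ such that $\sum_{j \ge 0} \theta_j^2 = \theta^\top \theta \le C$, $C > 0$ (1)

There exists an equivalent unconstrained optimization problem (easier to solve) :

$\min_\theta L(\theta) + \lambda\, \theta^\top \theta$, $\lambda > 0$ (2)

For each value C, there exists a value λ such that (1) is equivalent to (2) (Lagrange multiplier).
Additionally, C and λ are inversely related (informally, $C \propto 1/\lambda$) : a larger λ corresponds to a smaller C.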
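
The inverse relation between λ and C can be checked numerically. The sketch below assumes a linear model with MSE loss, for which the penalized problem (2) has a closed-form minimizer; the random data and the grid of λ values are made up. For each λ it reports the implied constraint level $C = \theta_\lambda^\top \theta_\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))   # hypothetical data matrix
y = rng.standard_normal(40)        # hypothetical targets
n, d = X.shape

def penalized_solution(lam):
    # Minimizer of (1/n) * ||X theta - y||^2 + lam * theta^T theta.
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

lambdas = [0.01, 0.1, 1.0, 10.0]
# The penalized solution lies on the sphere theta^T theta = C.
implied_C = []
for lam in lambdas:
    theta_lam = penalized_solution(lam)
    implied_C.append(float(theta_lam @ theta_lam))
    print(f"lambda = {lam:6.2f}  ->  C = {implied_C[-1]:.5f}")
```

The printed values of C decrease as λ grows, although not exactly as 1/λ.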


Loss regularization
Study the influence of the regularization parameter λ on the solution of
$\min_\theta L(\theta) + \lambda\, \theta^\top \theta \Leftrightarrow \min_\theta L(\theta)$ s.t. $\theta^\top \theta \le C$

Example : Regression task with 4th order polynomial function

λ      0              10              100              1,000
C      ∞              0.1             0.01             0.001
Fit    Over-fitting   Right-fitting   Under-fitting    Under-fitting


Loss regularization
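Normal equations for linear regression with loss regularization :

Original MSE loss :
$\min_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\theta^\top x^{(i)} - y^{(i)})^2 = \frac{1}{n} \| X\theta - y \|_2^2$

Set gradient of loss to zero :
$\nabla L = \frac{\partial L}{\partial \theta} = 0 \Rightarrow X^\top (X\theta - y) = 0 \Rightarrow \theta = (X^\top X)^{-1} X^\top y$

Regularized MSE loss :
$\min_\theta L(\theta) = \frac{1}{n} \| X\theta - y \|_2^2$ s.t. $\theta^\top \theta \le C$
$\Leftrightarrow \min_\theta L(\theta) = \frac{1}{n} \| X\theta - y \|_2^2 + \lambda\, \theta^\top \theta$

Set gradient of loss to zero :
$\nabla L = \frac{\partial L}{\partial \theta} = 0 \Rightarrow \frac{1}{n} X^\top (X\theta - y) + \lambda \theta = 0 \Rightarrow \theta = (X^\top X + \lambda n I)^{-1} X^\top y$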
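
The regularized normal equations translate directly into a few lines of NumPy. This is a sketch on made-up random data; the helper name `ridge_normal_equations` is hypothetical. It solves the linear system rather than forming the inverse explicitly, then checks that the gradient of the regularized loss vanishes at the solution.

```python
import numpy as np

def ridge_normal_equations(X, y, lam):
    """Solve (X^T X + lam * n * I) theta = X^T y."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # hypothetical data matrix
y = rng.standard_normal(50)        # hypothetical targets
n = X.shape[0]
lam = 0.1

theta_reg = ridge_normal_equations(X, y, lam)
theta_mse = ridge_normal_equations(X, y, 0.0)   # lam = 0 recovers the plain MSE solution

# Gradient of (1/n) * ||X theta - y||^2 + lam * theta^T theta at theta_reg.
grad = 2.0 / n * X.T @ (X @ theta_reg - y) + 2.0 * lam * theta_reg
print("gradient norm at solution:", np.linalg.norm(grad))
print("norms:", np.linalg.norm(theta_reg), "<", np.linalg.norm(theta_mse))
```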


Loss regularization
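Understanding the solution of normal equations :
$\min_\theta L(\theta) = L_{\text{MSE}}(\theta)$ s.t. $L_{\text{REG}}(\theta) \le C$
with $L_{\text{MSE}}(\theta) = \frac{1}{n} \| X\theta - y \|_2^2$ and $L_{\text{REG}}(\theta) = \theta^\top \theta = \| \theta \|_2^2$

Let us plot the landscape of the $L_{\text{MSE}}$ loss and the $L_{\text{REG}}$ loss.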


Loss regularization
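Landscape of $L_{\text{MSE}}$ loss : $L_{\text{MSE}}(\theta) = \frac{1}{n} \| X\theta - y \|_2^2$
Quadratic and convex function.
Let us suppose we have two parameters, i.e. $\theta = (\theta_1, \theta_2)$, for visualization.

[Figure: surface and contour plots of $L_{\text{MSE}}(\theta)$ over $(\theta_1, \theta_2)$, showing the level sets $\{ \theta$ s.t. $L_{\text{MSE}}(\theta) = \text{constant} \}$, the minimizer $\theta_{\text{MSE}} = \arg\min_\theta L_{\text{MSE}}(\theta)$ and the gradient $\nabla L_{\text{MSE}}(\theta)$ at a point $\theta$.]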


Loss regularization
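Landscape of $L_{\text{REG}}$ loss : $L_{\text{REG}}(\theta) = \theta^\top \theta = \| \theta \|_2^2$

Quadratic and convex function.

[Figure: contour plot over $(\theta_1, \theta_2)$ showing the L2-sphere $\{ \theta$ s.t. $L_{\text{REG}}(\theta) = \theta^\top \theta = C \}$, the L2-ball $\{ \theta$ s.t. $L_{\text{REG}}(\theta) = \theta^\top \theta \le C \}$ and the gradient $\nabla L_{\text{REG}}(\theta) = 2\theta$ at a point $\theta$.]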


Loss regularization
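Minimizer of the total loss : $\theta^* = \arg\min_\theta L_{\text{MSE}}(\theta)$ s.t. $L_{\text{REG}}(\theta) \le C$
$\Leftrightarrow \theta^* = \arg\min_\theta L_{\text{MSE}}(\theta) + \lambda L_{\text{REG}}(\theta)$

Solution $\theta^*$ is as close to the MSE solution $\theta_{\text{MSE}}$ as allowed by the L2-ball constraint $\| \theta \|_2^2 \le C$.

Gradient of loss at $\theta^*$ :
$\nabla (L_{\text{MSE}}(\theta^*) + \lambda L_{\text{REG}}(\theta^*)) = 0 \Rightarrow \nabla L_{\text{MSE}}(\theta^*) = -\lambda \nabla L_{\text{REG}}(\theta^*)$
Gradients of $L_{\text{MSE}}$ and $L_{\text{REG}}$ are aligned (in the opposite direction) at the solution.

[Figure: level sets of $L_{\text{MSE}}$ around $\theta_{\text{MSE}}$ and the L2-ball of level C in the $(\theta_1, \theta_2)$ plane; at $\theta^*$ on the boundary of the ball, the vectors $\nabla L_{\text{MSE}}(\theta^*)$, $\nabla L_{\text{REG}}(\theta^*)$ and $-\lambda \nabla L_{\text{REG}}(\theta^*)$ are drawn.]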


Loss regularization
The L2-ball regularization can be generalized to Lp-ball, p ∈ [0, +∞].

$\min_\theta L_{\text{MSE}}(\theta)$ s.t. $\| \theta \|_p^p \le C \Leftrightarrow \min_\theta L_{\text{MSE}}(\theta) + \lambda \| \theta \|_p^p$ where $\| \theta \|_p = \left( \sum_{j \ge 1} |\theta_j|^p \right)^{1/p}$

L2-ball/L2 regularization, a.k.a. weight decay
Advantages : Strictly convex, differentiable, fast optimization, robust w.r.t. perturbation.
Limitations : Although the $\theta_j$ values are shrunk, solutions are dense, i.e. $\theta_j \ne 0$.
This means no feature selection in e.g. $f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_{10} x^{10}$, as all data features are used for prediction.
L1 regularization
Advantages : Convex (but not strictly), fast optimization algorithms exist, robust w.r.t. perturbation, solutions are typically sparse, meaning feature selection, as only a few data features are used for prediction.
Limitations : Not differentiable wherever a parameter $\theta_j$ is zero (in particular at the origin).
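
The dense versus sparse behaviour can be illustrated in the simplest setting, where each parameter is penalized independently around an unregularized value z. In this one-dimensional case both problems have closed-form minimizers: a rescaling for L2 and a soft-thresholding for L1. The vector `z` and the value of `lam` below are made-up illustrations.

```python
import numpy as np

def l2_shrink(z, lam):
    # Coordinate-wise argmin over theta of (theta - z)^2 + lam * theta^2.
    return z / (1.0 + lam)

def l1_soft_threshold(z, lam):
    # Coordinate-wise argmin over theta of (theta - z)^2 + lam * |theta|.
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

z = np.array([3.0, -0.2, 0.05, -1.5, 0.3])  # hypothetical unregularized parameters
lam = 1.0

theta_l2 = l2_shrink(z, lam)
theta_l1 = l1_soft_threshold(z, lam)
print("L2:", theta_l2)  # every entry is shrunk, none becomes exactly zero
print("L1:", theta_l1)  # entries with |z| <= lam / 2 become exactly zero
```

With L2 all five parameters stay non-zero, while with L1 only the two largest survive, which is the feature selection effect described above.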


Loss regularization
MSE + L2 regularization vs. L1 regularization

L2 regularization : The probability to have a solution on the axes is almost zero, most solutions lie on the quadrants of the L2 ball.
L1 regularization : The probability to have a solution on diagonal edges is almost zero, most solutions lie on a tip of the L1 ball.


Loss regularization
Lp regularization, 0 < p < 1
Advantages : Very sparse solutions, better than L1 regularization.
Limitations : Non-convex, non-differentiable, solution depends on initial condition.

Lp regularization, p = ∞
Never used in practice (not stable)


Loss regularization
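Lp regularized loss for any predictive task :
$\min_\theta L_{\text{Task}}(\theta)$ s.t. $\| \theta \|_p^p \le C \Leftrightarrow \min_\theta L_{\text{Task}}(\theta) + \lambda \| \theta \|_p^p$
where $L_{\text{Task}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{Task}}(f_\theta(x^{(i)}), y^{(i)})$
and $\| \theta \|_p^p = \sum_{j \ge 1} |\theta_j|^p$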


Loss regularization
Summary
Adding a regularization loss, a.k.a. regularizer, can reduce over-fitting.
With the right amount of regularization, controlled by the hyper-parameter λ (or C), the regularizer can decrease the complexity of the predictive model, i.e. its variance, without significantly increasing the bias.
With no regularization, the model over-specializes to the training data, i.e. high variance.
With too much regularization, the model becomes too simple, i.e. high bias.


Loss regularization
How to choose λ, i.e. the right amount of regularization?

[Figure: train error and test error plotted against the regularization parameter λ. The test error follows a U-shape curve. Marked points: right-fitting (right model, right complexity) and min over-fitting. Regions: under-fitting (simple models, low complexity), close-fitting, over-fitting (complex models, high complexity).]

Cross-validation
We need a surrogate of the test set to estimate the regularization parameter λ which identifies
the right model complexity that minimizes the test error.
The simplest operation is to split the training set into two datasets : a smaller training set and a validation set.

[Figure: the original train set is split into a smaller train set plus a validation set; the test set is left unchanged.]

Never use the test set during training!


Cross-validation
Given a training set S of n data points, S is split into
A smaller training set $S_{\text{train}}$ of n − m data points.
A validation set $S_{\text{val}}$ of m data points.
How to use $S_{\text{val}}$ to estimate the regularization value λ?

[Figure: S (n data points) is split into $S_{\text{train}}$ (n − m data points) and $S_{\text{val}}$ (m data points).]


Cross-validation
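Use p hypotheses/values to estimate λ :
We consider p values $\lambda_1, \dots, \lambda_p$.
Use $S_{\text{train}}$ to learn $f_{\lambda_j}$ for each λ value.
Evaluate $f_{\lambda_j}$ using $S_{\text{val}}$ : $L_{\text{val}}(f_{\lambda_j})$ for j = 1, …, p.
Select the value $\lambda^* = \lambda_j$ with the smallest $L_{\text{val}}(f_{\lambda_j})$.
After selecting $\lambda^*$, learn $f_{\lambda^*}$ using all training points.

[Figure: S (size n) is split into $S_{\text{train}}$ (size n − m) and $S_{\text{val}}$ (size m); each hypothesis $\mathcal{H}_1 / \lambda_1$, $\mathcal{H}_2 / \lambda_2$, …, $\mathcal{H}_p / \lambda_p$ yields a function $f_{\lambda_j}$ and a loss $L_{\text{val}}(f_{\lambda_j})$; the $\lambda^*$ that minimizes $L_{\text{val}}(f_{\lambda_j})$ is chosen.]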
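
The selection procedure above can be sketched as follows, assuming L2-regularized linear regression as the learner. The helper names `fit_ridge`, `mse` and `select_lambda`, the synthetic data and the grid of λ values are illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed-form minimizer of (1/n) * ||X theta - y||^2 + lam * theta^T theta.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def mse(X, y, theta):
    return float(np.mean((X @ theta - y) ** 2))

def select_lambda(X, y, lambdas, m, seed=0):
    """Hold out m points as S_val and pick the lambda with the smallest validation loss."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    val_idx, train_idx = perm[:m], perm[m:]
    val_losses = []
    for lam in lambdas:
        theta = fit_ridge(X[train_idx], y[train_idx], lam)    # learn on S_train
        val_losses.append(mse(X[val_idx], y[val_idx], theta))  # evaluate on S_val
    best = lambdas[int(np.argmin(val_losses))]
    # After selecting lambda*, retrain on all n training points.
    return best, fit_ridge(X, y, best), val_losses

# Hypothetical synthetic data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
y = X @ rng.standard_normal(8) + 0.5 * rng.standard_normal(100)
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_lam, theta_star, val_losses = select_lambda(X, y, lambdas, m=20)
print("selected lambda:", best_lam)
```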


Cross-validation
How robust is the estimation of the validation loss $L_{\text{val}}$ ?
Suppose that
$\ell(f_\theta(x), y)$ is the loss value for the data point (x,y)
$\mathbb{E}_{(x,y) \sim U}\, \ell(f_\theta(x), y) = L_{\text{test}}(f_\theta)$, which is the mean error of the predictive function $f_\theta$ applied to U, the set of all points unseen by $f_\theta$ during training, i.e. $S_{\text{test}}$ and $S_{\text{val}}$.
$\mathrm{Var}_{(x,y) \sim U}\, \ell(f_\theta(x), y) = \sigma^2$, which corresponds to the variance of the prediction error.
$L_{\text{val}}(f_\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell(f_\theta(x^{(i)}), y^{(i)})$ is the validation loss.
@


Cross-validation
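Then, for validation points drawn independently from U, we have

$\mathbb{E}_{(x,y) \sim U} [ L_{\text{val}}(f_\theta) ] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{(x,y) \sim U} [ \ell(f_\theta(x^{(i)}), y^{(i)}) ] = L_{\text{test}}(f_\theta)$

$\mathrm{Var}_{(x,y) \sim U} [ L_{\text{val}}(f_\theta) ] = \frac{1}{m^2} \sum_{i=1}^{m} \mathrm{Var}_{(x,y) \sim U} [ \ell(f_\theta(x^{(i)}), y^{(i)}) ] = \frac{1}{m^2} \cdot m \sigma^2 = \frac{\sigma^2}{m} \Rightarrow \mathrm{Std} = O\!\left( \frac{1}{\sqrt{m}} \right)$

$L_{\text{val}} = L_{\text{test}} \pm O\!\left( \frac{1}{\sqrt{m}} \right)$

Consequence : A small validation set does not provide a good estimate of the test error.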
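
The $1/\sqrt{m}$ behaviour can be checked with a small simulation. This sketch assumes the per-point losses are i.i.d. draws from an exponential distribution with standard deviation `sigma`; the distribution and all numbers are illustrative, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0      # assumed standard deviation of the per-point loss
trials = 2000    # number of simulated validation sets per size m

def std_of_validation_loss(m):
    # Each row holds the m i.i.d. per-point losses of one validation set.
    # An exponential distribution with scale sigma has mean sigma and std sigma.
    losses = rng.exponential(scale=sigma, size=(trials, m))
    return float(losses.mean(axis=1).std())

for m in (10, 100, 1000):
    print(f"m = {m:5d}: simulated std = {std_of_validation_loss(m):.4f}, "
          f"sigma / sqrt(m) = {sigma / np.sqrt(m):.4f}")
```

Multiplying the validation set size by 100 divides the spread of the validation loss by about 10.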


Cross-validation
In practice, we have two situations
Big datasets
Modern situation, training sets are large, e.g. millions of data points.
We can use a small fraction, e.g. m = 100,000 data points as validation set.
The validation set will approximate well the test set distribution.
Small datasets
Situation before 2012 or today for highly expensive or challenging datasets to collect
(e.g. nuclear fusion) or for protected datasets (e.g. medical data).
For limited datasets, e.g. n = 1,000 data points, it is not possible to get good estimates of both the predictive function and the validation error simultaneously.


Cross-validation
For small datasets, we have 2 opposite cases
Recall : Model f is trained on the full training set of n data points, and $f^-$ is trained on the smaller training set of n − m data points.
Case #1 : Small number m of validation data / large number n − m of training data (figure example : 90 training points, 10 validation points)
Advantage : $L_{\text{test}}(f) \approx L_{\text{test}}(f^-)$ as $f^-$ is well estimated.
Limitation : $L_{\text{test}}(f^-) \ne L_{\text{val}}(f^-)$ as the validation set is too small.
Case #2 : Large number m of validation data / small number n − m of training data (figure example : 10 training points, 90 validation points)
Advantage : $L_{\text{test}}(f^-) \approx L_{\text{val}}(f^-)$ as the validation set is large enough.
Limitation : $L_{\text{test}}(f) \ne L_{\text{test}}(f^-)$ as $f^-$ is badly estimated.
How to reconcile the two cases?


Cross-validation
k-fold cross-validation technique :
Split the original training set into k parts, i.e. each fold has n/k data points.
Repeat for all folds : Train on k−1 parts and leave one part out as validation set.
Advantages
Each data point in the original training set will be used as a validation point.
For each fold, we have a large training set to train a good learner, i.e. $L_{\text{test}}(f) \approx L_{\text{test}}(f^-)$.
We also have a good estimate of the validation error by averaging the validation error over all folds, i.e. $L_{\text{test}} \approx \mathrm{mean}_{\text{folds}}\, L_{\text{val}}(f^-)$.

[Figure: the training set S is cut into parts of 10% each; in every round one part serves as $S_{\text{val}}$ and the remaining parts form $S_{\text{train}}$.]
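
A minimal sketch of the k-fold procedure, again assuming L2-regularized linear regression as the learner. The names `fit_ridge` and `k_fold_cv_loss` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed-form minimizer of (1/n) * ||X theta - y||^2 + lam * theta^T theta.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def k_fold_cv_loss(X, y, lam, k=10, seed=0):
    """Average validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    losses = []
    for i in range(k):
        val_idx = folds[i]  # the part left out as validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = fit_ridge(X[train_idx], y[train_idx], lam)
        losses.append(np.mean((X[val_idx] @ theta - y[val_idx]) ** 2))
    return float(np.mean(losses))

# Hypothetical synthetic data.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
y = X @ rng.standard_normal(5) + 0.3 * rng.standard_normal(60)

cv_loss = k_fold_cv_loss(X, y, lam=0.1, k=10)
loo_loss = k_fold_cv_loss(X, y, lam=0.1, k=len(y))  # one validation point per fold
print(f"10-fold CV loss = {cv_loss:.4f}, leave-one-out loss = {loo_loss:.4f}")
```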


Cross-validation
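Example : Model selection using cross-validation
Models : Linear and constant models
Original training set S : n = 3 data points
Cross-validation value : m = 1 (validation set $S_{\text{val}}$), n − m = 2 (training set $S_{\text{train}}$)
Cross-validation loss : $L_{\text{CV}} = \frac{1}{3} (\ell_1 + \ell_2 + \ell_3)$
Linear model : $L_{\text{CV}} = 8.2$
Constant model : $L_{\text{CV}} = 4.3$
Cross-validation shows that the constant model is a better fit than the linear model for this dataset.

[Figure: for each model, three fits obtained by leaving one data point out, with the resulting validation errors $\ell_1$, $\ell_2$, $\ell_3$.]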


Cross-validation
In practice
Hypotheses are an arbitrary set of choices, e.g. choices of predictive models {f!}, choices of
parameter values {λ}, etc.
Note that hyper-parameters such as the number of layers in neural networks are not differentiable, i.e. gradient descent cannot be used to select their optimal value.
For very small datasets, we cannot afford to leave out more than a single training data point for validation, so we use k = n folds (i.e. m = 1 validation point), a.k.a. Leave-One-Out Cross-Validation (LOOCV).
Telescopic search is a standard two-step approach to determine parameter values.
First step : Find the best order of magnitude for λ, e.g. λ = 0.01,0.1,1,10,100.
Second step : Do a fine-grained search around the best λ found in first step. For
example, if 10 is the best performing value from first step, then try out
λ=3,6,10,30,60,90.
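
The two-step telescopic search can be sketched as below. The function `telescopic_search` and the toy `validation_loss` (whose minimum is placed at λ = 25) are hypothetical; in practice `validation_loss` would train the learner and return its validation or cross-validation loss.

```python
import math

def telescopic_search(validation_loss,
                      coarse=(0.01, 0.1, 1.0, 10.0, 100.0),
                      factors=(0.3, 0.6, 1.0, 3.0, 6.0, 9.0)):
    """Step 1: best order of magnitude. Step 2: finer grid around it."""
    best_coarse = min(coarse, key=validation_loss)
    fine = [best_coarse * f for f in factors]  # e.g. 3, 6, 10, 30, 60, 90 around 10
    return min(fine, key=validation_loss)

def validation_loss(lam):
    # Toy stand-in for a real validation loss, minimal at lam = 25.
    return (math.log10(lam) - math.log10(25.0)) ** 2

best_lam = telescopic_search(validation_loss)
print("selected lambda:", best_lam)
```

With this toy loss the coarse step selects 10 and the fine step then selects 30, the closest value to 25 on the second grid.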


Cross-validation
Summary
Cross-validation is a sound technique, which works well in practice for different hypotheses
and different sizes of training set.
The validation set is a surrogate of the test set but its capacity to represent well the test
distribution depends on its size.
Best-case scenario : Training and validation sets are large enough.
Worst-case scenario : Either training or validation set is small.
Then k-fold cross-validation is required.
Even in the best-case scenario, it is still required to fully train the learner for each
hypothesis, which can be time consuming, e.g. deep learning.
Can we develop a faster regularization technique to avoid over-fitting?


Early stopping
The fastest regularization technique to avoid over-fitting.
Stop optimization after T gradient steps, when the validation error starts increasing, even if the optimization has not converged yet.
Not really satisfying from an optimization theory perspective but it works well in practice.
One of the most common regularization techniques in deep learning to control over-fitting.
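
A minimal sketch of early stopping for linear regression trained by gradient descent. The `patience` parameter (how many non-improving steps are tolerated before stopping) is an assumption added here to make the rule robust to small fluctuations; the data, learning rate and function name are also illustrative.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_steps=5000, patience=20):
    theta = np.zeros(X_tr.shape[1])
    best_theta, best_val, best_step = theta.copy(), np.inf, 0
    for step in range(1, max_steps + 1):
        grad = 2.0 * X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr)
        theta = theta - lr * grad
        val_loss = float(np.mean((X_val @ theta - y_val) ** 2))
        if val_loss < best_val:
            # Validation error still decreasing: remember these parameters.
            best_theta, best_val, best_step = theta.copy(), val_loss, step
        elif step - best_step >= patience:
            break  # validation error has stopped improving
    return best_theta, best_val, best_step

# Hypothetical noisy data with almost as many parameters as training points.
rng = np.random.default_rng(0)
d = 25
true_theta = rng.standard_normal(d)
X_tr = rng.standard_normal((30, d))
y_tr = X_tr @ true_theta + rng.standard_normal(30)
X_val = rng.standard_normal((30, d))
y_val = X_val @ true_theta + rng.standard_normal(30)

theta_es, val_es, T = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print(f"stopped at T = {T} steps with validation loss {val_es:.4f}")
```

The returned parameters are those of step T, the step with the smallest validation loss seen so far, not those of the last step.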

[Figure: train error, validation error and test error vs. the number of gradient steps (iterations). The validation and test errors follow a U-shape curve. The stopping step T is marked at right-fitting, before min over-fitting. Regions: under-fitting, close-fitting, over-fitting.]



Double descent
Classical machine learning (ML)
Bias-variance trade-off curve w.r.t. model complexity
U-shape curve for test/generalization error

[Figure: two panels plotted against model complexity. Left: train error and U-shape test error. Right: bias², variance and U-shape test error. Both panels mark right-fitting (early stopping) and min over-fitting (interpolation point), across the under-fitting, close-fitting and over-fitting regions.]



Double descent
How to interpret the U-shape bias-variance trade-off curve?
In classical ML theory, when model complexity increases, variance and generalization error
also increase.
However, (deep learning) practitioners have observed the opposite phenomenon!
When model complexity increases further, the generalization error can decrease!
This empirical result contradicts the conventional theory and shows a significant gap between
theory and practice.
To reconcile this inconsistency and better understand the properties of modern large ML
models, a new learning mechanism known as “double descent” was introduced in 2018.


Double descent
Double descent curve w.r.t. the ratio between the model complexity and the dataset size

[Figure: train error and test error vs. the ratio model complexity / dataset size = # parameters / # data (log axis from $10^{-2}$ to $10^{2}$). Ratios below 1: under-parametrized functions (classical ML). Ratios above 1: over-parametrized functions (modern ML). Marked points: right-fitting (early stopping), min over-fitting (interpolation point), double descent (min test error). Regions: under-fitting, close-fitting, over-fitting.]



Double descent
Double descent curve for bias and variance

[Figure: test error, bias² and variance vs. the ratio model complexity / dataset size = # parameters / # data (log axis from $10^{-2}$ to $10^{2}$). Ratios below 1: under-parametrized functions (classical ML). Ratios above 1: over-parametrized functions (modern ML). Marked points: right-fitting (early stopping), min over-fitting (interpolation point), double descent (min test error). Regions: under-fitting, close-fitting, over-fitting.]



Double descent
New learning curves
Let p be the number of parameters of the learner and n the number of training data points.
Under-parametrized functions are defined by p ≪ n (classical ML)
Over-parametrized functions are defined by p ≫ n (modern ML)
We also introduce the interpolation point, i.e. p = n, the minimal capacity needed to overfit
the training set.
In the classical ML paradigm, the optimal test error is at the minimum of the bias-variance
trade-off and is captured in practice with early stopping using a validation set.
Classical ML establishes the existence of a right balance between under-fitting and over-fitting. Beyond this balance point, i.e. in the over-fitting regime, generalization fails.
In the modern ML regime, over-fitting is actually considered beneficial, and over-parametrized
functions with high model complexity lead to successful generalization.


Double descent
Understanding the second descent (new learning mechanism)
When p=n, the model possesses just enough parameters to over-fit all the training data.
However, it also exhibits a significant variance, making it unable to generalize (standard
result).
When p≫n, the model has many more parameters than training data points. In this regime, the learner $f_\theta(x)$ continues to over-fit but, critically, the L2 norm of its parameters $\| \theta \|_2$ is significantly reduced by SGD, effectively reducing the model capacity (regularization effect).

[Figure: two function spaces. Left: space of functions that just overfit (interpolation point); the function with the smallest norm has $\| \theta \|_2 = 8.7$. Right: space of functions that overfit and possess high capacity; the function with the smallest norm has $\| \theta \|_2 = 0.3$ (larger space ⇒ more functions ⇒ lower $\| \theta \|_2$ value than at interpolation point).]
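
The "larger space ⇒ lower norm" argument can be illustrated with a linear model on random features. This sketch does not run SGD: as a stand-in it uses the pseudo-inverse, which returns the minimum-L2-norm solution among all parameter vectors that fit the training data. The feature matrix, targets and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                    # number of training points
y = rng.standard_normal(n)                # hypothetical targets
features = rng.standard_normal((n, 200))  # hypothetical random features

def min_norm_interpolant(p):
    # Keep the first p features; pinv gives the minimum-L2-norm least-squares solution.
    X = features[:, :p]
    theta = np.linalg.pinv(X) @ y
    return theta, X

theta_n, X_n = min_norm_interpolant(n)        # interpolation point, p = n
theta_big, X_big = min_norm_interpolant(200)  # over-parametrized, p >> n

print("p = n   : ||theta||_2 =", np.linalg.norm(theta_n))
print("p = 10n : ||theta||_2 =", np.linalg.norm(theta_big))
```

Both models fit the training data exactly, but the over-parametrized one does so with a smaller parameter norm, because every interpolating solution with p = n features is also available (padded with zeros) when more features are added.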

Double descent
GD vs. SGD
Observe that the full gradient of the loss is zero in the over-fitting region. Consequently,
the parameters cannot change, and the second descent phenomenon cannot emerge when
using standard GD.
In contrast, the stochastic mini-batch gradient is never zero in the over-fitting region. The
parameters continue to be updated and the second descent can occur.
SGD is critical in deep learning for several reasons
It helps to leave saddle points in the loss landscape during optimization.
It finds better local (or even global) minima, allowing successful generalization.
It speeds up computational time (by updating the parameters more often).
It is necessary for the double descent phenomenon to emerge.


Double descent
Illustration
[Figure: test error, bias² and variance vs. the ratio model complexity / dataset size (log axis from $10^{-2}$ to $10^{2}$). Ratios below 1: under-parametrized functions (classical ML). Ratios above 1: over-parametrized functions (modern ML). Marked points: right-fitting (early stopping), min over-fitting (interpolation point), double descent (min test error). Regions: under-fitting, close-fitting, over-fitting.]


Double descent
Farewell to early stopping?
Do we “simply” need over-parametrized functions, train them, and achieve minimal error?
Unfortunately, the double descent regularization only emerges with exceedingly large networks.
The critical threshold to enable double descent is $p^* = O(n \cdot k)$, where k is the number of classes.
Computer Vision
ImageNet : $n = 10^6$ (1.3M images), $k = 10^3$ (1k classes) ⇒ $p^* = 10^9$
ResNet-152 has p = 60.2M ($10^7$) parameters ≪ $10^9$
ViT : $n = 10^9$ (4B images), $k = 10^4$ (30k classes) ⇒ $p^* = 10^{13}$
ViT-22B has p = 22B ($10^{10}$) parameters ≪ $10^{13}$
NLP : $n = 10^{11}$ (300B token data), $k = 10^4$ (35k unique tokens) ⇒ $p^* = 10^{15}$
GPT-3 has p = 175B ($10^{11}$) parameters ≪ $10^{15}$
At present, practitioners use early stopping as their primary regularization technique.
By design, early stopping does not lead to the double descent phenomenon.
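
The order-of-magnitude arithmetic above can be reproduced in a few lines. The numbers are the rounded powers of ten quoted on this slide, and the hidden constant in the O(·) is assumed to be 1 for illustration.

```python
# Rounded values quoted above: n = training set size, k = number of classes,
# p = number of model parameters.
cases = {
    "ImageNet / ResNet-152": {"n": 10**6, "k": 10**3, "p": 10**7},
    "ViT-22B": {"n": 10**9, "k": 10**4, "p": 10**10},
    "GPT-3": {"n": 10**11, "k": 10**4, "p": 10**11},
}

gaps = {}
for name, c in cases.items():
    p_star = c["n"] * c["k"]       # critical threshold p* ~ n * k
    gaps[name] = p_star // c["p"]  # how many times larger p* is than p
    print(f"{name}: p* = {p_star:.0e}, p = {c['p']:.0e}, p*/p = {gaps[name]:.0e}")
```

In all three cases the models are two to four orders of magnitude below the threshold.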


Double descent
Some key observations
The double descent learning mechanism is applicable to both non-linear and linear ML models.
This includes techniques such as decision trees, kernel methods, and deep learning.
The phenomenon is independent of the nature of the datasets involved.
One important ML principle is that more data provides better results.
Both theory and empirical experiments align on this principle.
However, this trend continually increases the critical threshold p* of required network
parameters for the double descent phenomenon to manifest.


Conclusion
Over-fitting and high variance are among the most common issues in machine learning.
Regularization techniques
Stochastic gradient descent : the smaller the batch, the better the generalization, but also the slower the training.
Loss regularization and cross-validation to estimate the right amount of regularization.
Early stopping to terminate optimization before over-fitting.
Double descent with over-parametrized functions.


Questions?
