Lecture 6: Regularization
Xavier Bresson
https://twitter.com/xbresson
https://www.cs.cornell.edu/courses/cs4780/2018fa
https://knmnyn.github.io/cs3244-2210
https://arxiv.org/pdf/1812.11118.pdf
Prof Xavier Bresson, CS6208 NUS, Advanced Topics in Artificial Intelligence, 2023
Outline
Reducing over-fitting
Loss regularization
Cross-validation
Early stopping
Double descent
Conclusion
[Figure (recap): train error decreases with model complexity while test error follows a U-shape curve.]
Reducing over-fitting
How to avoid under-fitting? (easy)
Increase the expressivity of the learner.
How to avoid over-fitting? (difficult)
Use a regularization loss, with cross-validation to estimate the regularization parameter.
Use early stopping with a validation set.
Regularize with stochastic gradient descent (SGD):
SGD not only speeds up gradient descent by computing an approximate gradient with a mini-batch of data points, it also regularizes the predictive function w.r.t. its parameters θ, allowing better generalization performance.
Theoretically, we should use mini-batches of a single data point for the best generalization, but this would be too slow. A mini-batch of e.g. 512 data points is a good trade-off between speed and accuracy (see the sketch below).
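A minimal NumPy sketch (mine, not from the slides) of mini-batch SGD for linear regression, to make the trade-off concrete; the learning rate, batch size, and epoch count are illustrative assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=512, epochs=100, seed=0):
    """Mini-batch SGD on the MSE loss (1/n)||X theta - y||^2 (illustrative sketch)."""
    n, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(n)                       # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # approximate gradient computed on one mini-batch only
            grad = (2.0 / len(b)) * X[b].T @ (X[b] @ theta - y[b])
            theta -= lr * grad
    return theta
```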
Loss regularization
Can we reduce the hypothesis space ℋ₁₀ to ℋ₂?
We have ℋ₁₀ = { f_θ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + ⋯ + θ₁₀x¹⁰ }
and ℋ₂ = { f_θ(x) = θ₀ + θ₁x + θ₂x² }.
[Figure: training data with the ℋ₂ fit and the ℋ₁₀ fit.]
Let us recall the MSE optimization problem for the regression task:
min_θ L_ℋ(θ) = (1/n) Σ_{i=1}^n ( f_ℋ(x^(i)) − y^(i) )²   (unconstrained optimization)
Let us relax the hard constraints θ_j = 0 for j ≥ 3, letting the optimization select the best values of these θ_j:
min_θ L(θ) s.t. Σ_{j≥3} θ_j² ≤ C
Relationship between hypothesis spaces: ℋ₁₀ is larger, and the constraint value C controls the size of the intermediate spaces;
ℋ₂ = { f_θ(x) = θ₀ + θ₁x + θ₂x² } is recovered when C = 0.
Let us consider the general constrained regularized problem:
(1) min_θ L(θ) s.t. Ω(θ) ≤ C   vs.   (2) min_θ L(θ) + λ Ω(θ)
For each value C, there exists a value λ such that (1) is equivalent to (2) (Lagrange multiplier).
Additionally, C ∝ 1/λ: a tight constraint (small C) corresponds to strong regularization (large λ).
Study the influence of the regularization parameter λ on the solution of
min_θ L(θ) + λ θᵀθ ⇔ min_θ L(θ) s.t. θᵀθ ≤ C
Normal equations for linear regression, without and with loss regularization:
∇L = ∂L/∂θ = 0 ⇒ Xᵀ(Xθ − y) = 0 ⇒ θ = (XᵀX)⁻¹ Xᵀy
∇L = ∂L/∂θ = 0 ⇒ (1/n) Xᵀ(Xθ − y) + λθ = 0 ⇒ θ = (XᵀX + λnI)⁻¹ Xᵀy
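The regularized normal equations translate directly into code; a minimal NumPy sketch (the function name and arguments are mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve theta = (X^T X + lam * n * I)^(-1) X^T y from the regularized normal equations."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)   # solve() is more stable than an explicit inverse
```

For λ > 0 the matrix XᵀX + λnI is positive definite, hence always invertible, which is a practical side benefit of the regularizer.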
Understanding the solution of the normal equations:
min_θ L(θ) = L_MSE(θ) s.t. L_REG(θ) ≤ C
with L_MSE(θ) = (1/n) ‖Xθ − y‖² and L_REG(θ) = θᵀθ = ‖θ‖₂²
Let us plot the landscapes of the L_MSE loss and the L_REG loss.
" #
Landscape of LMSE loss : LMSE(%)= C% − D
& LMSE(!)
Quadratic and convex function.
Let us suppose we have two parameters, i.e. %=(θ1, θ2), for visualization.
θ* θ* θ)
Landscape of the L_REG loss: L_REG(θ) = θᵀθ = ‖θ‖₂², with gradient ∇L_REG(θ) = 2θ.
[Figure: circular level sets of L_REG in the (θ₁, θ₂) plane; the constraint set L_REG(θ) ≤ C is the disk of radius √C.]
Minimizer of the total loss:
θ* = argmin_θ L_MSE(θ) s.t. L_REG(θ) ≤ C ⇔ θ* = argmin_θ L_MSE(θ) + λ L_REG(θ)
[Figure: θ* lies where a level set of L_MSE touches the boundary of the L2-ball L_REG(θ) ≤ C.]
The L2-ball regularization can be generalized to the Lp-ball, p ∈ [0, +∞]:
min_θ L_MSE(θ) s.t. ‖θ‖_p^p ≤ C ⇔ min_θ L_MSE(θ) + λ ‖θ‖_p^p, where ‖θ‖_p^p = Σ_{j=1}^d |θ_j|^p
MSE + L2 regularization vs. L1 regularization
Lp regularization, 0 < p ≤ 1
Advantages: very sparse solutions, even sparser than with L1 regularization.
Limitations: non-convex and non-differentiable; the solution depends on the initial condition.
Lp regularization, p = ∞
Never used in practice (not stable).
Lp regularized loss for any predictive task:
min_θ L_Task(θ) s.t. ‖θ‖_p^p ≤ C ⇔ min_θ L_Task(θ) + λ ‖θ‖_p^p
where L_Task(θ) = (1/n) Σ_{i=1}^n ℓ_Task( f_θ(x^(i)), y^(i) )
and ‖θ‖_p^p = Σ_{j=1}^d |θ_j|^p
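As an illustrative sketch (mine, not from the slides), the penalty and its (sub)gradient for the two most common cases, p = 1 and p = 2, which is all a gradient-based solver needs to add the regularizer to any task loss:

```python
import numpy as np

def lp_penalty(theta, p):
    """||theta||_p^p = sum_j |theta_j|^p."""
    return np.sum(np.abs(theta) ** p)

def lp_penalty_grad(theta, p):
    """(Sub)gradient of ||theta||_p^p for the common cases."""
    if p == 2:
        return 2.0 * theta        # gradient of sum theta_j^2
    if p == 1:
        return np.sign(theta)     # a subgradient of sum |theta_j| (0 at theta_j = 0)
    raise NotImplementedError("0 < p < 1 is non-convex and needs specialized solvers")

# One regularized gradient step on a task loss would then be:
# theta -= lr * (task_grad(theta) + lam * lp_penalty_grad(theta, p))
```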
Summary
Adding a regularization loss, a.k.a. regularizer, can reduce over-fitting.
With the right amount of regularization, controlled by the hyper-parameter λ (or C), the regularizer can decrease the complexity of the predictive model, i.e. its variance, without affecting the bias.
With no regularization, the model over-specializes to the training data, i.e. high variance.
With too much regularization, the model becomes too simple, i.e. high bias.
How to choose λ, i.e. the right amount of regularization?
[Figure: train and test error vs. the regularization parameter λ; the test error follows a U-shape curve, whose minimum marks the right model complexity (right-fitting, minimal over-fitting).]
Cross-validation
We need a surrogate of the test set to estimate the regularization parameter λ, which identifies the right model complexity that minimizes the test error.
The simplest approach is to split the training set into two datasets: a smaller training set and a validation set.
Given a training set S of n data points, S is split into
a smaller training set S_train of n − m data points,
a validation set S_val of m data points.
How do we use S_val to estimate the regularization value λ?
Use p hypotheses/values to estimate λ:
We consider p values λ₁, …, λ_p.
For each λ_i, train a learner f_{λ_i} on S_train and compute its validation loss on S_val.
After selecting the best-performing value λ*, learn the final f_{λ*} using all training points of S.
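A hold-out sketch of this selection loop; `fit` and `val_loss` stand for any training routine and validation metric (e.g. the `ridge_fit` above with an MSE loss), and these names are my assumptions:

```python
import numpy as np

def select_lambda(X, y, lambdas, m, fit, val_loss, seed=0):
    """Pick lambda* on a held-out validation set of m points, then refit on all n points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    val, train = idx[:m], idx[m:]                  # m validation, n - m training points
    losses = [val_loss(fit(X[train], y[train], lam), X[val], y[val])
              for lam in lambdas]
    lam_star = lambdas[int(np.argmin(losses))]
    return fit(X, y, lam_star), lam_star           # final f learned on the full set S
```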
How robust is the estimation of the validation loss L_val?
Suppose that
ℓ(f_θ(x), y) is the loss value for the data point (x, y),
E_{(x,y)∼U}[ ℓ(f_θ(x), y) ] = L_test(f), which is the mean error of the predictive function f applied to U, the set of all points unseen by f during training, i.e. S_test and S_val,
Var_{(x,y)∼U}[ ℓ(f_θ(x), y) ] = σ², which corresponds to the variance of the prediction error,
L_val(f) = (1/m) Σ_{k=1}^m ℓ( f(x^(k)), y^(k) ) is the validation loss.
Then, we have
E_{(x,y)∼U}[ L_val(f) ] = (1/m) Σ_{k=1}^m E_{(x,y)∼U}[ ℓ(f(x^(k)), y^(k)) ] = L_test(f)
Var_{(x,y)∼U}[ L_val(f) ] = (1/m²) Σ_{k=1}^m Var_{(x,y)∼U}[ ℓ(f(x^(k)), y^(k)) ] = (1/m²) · m σ² = σ²/m ⇒ Std = σ/√m
Hence L_val = L_test ± O(1/√m).
Consequence: a small validation set does not provide a good estimate of the test error.
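A quick simulation (mine, not from the lecture) checking the σ/√m rate: draw per-point losses with standard deviation σ and measure the spread of their m-point average.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 2.0, 10_000
for m in (10, 100, 1_000):
    # each row is one validation set of m per-point losses; average them
    L_val = rng.normal(loc=0.5, scale=sigma, size=(trials, m)).mean(axis=1)
    print(f"m={m:5d}  empirical std={L_val.std():.4f}  sigma/sqrt(m)={sigma/np.sqrt(m):.4f}")
```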
In practice, we have two situations.
Big datasets
Modern situation: training sets are large, e.g. millions of data points.
We can use a small fraction, e.g. m = 100,000 data points, as the validation set.
The validation set will approximate the test distribution well.
Small datasets
Situation before 2012, or today for datasets that are highly expensive or challenging to collect (e.g. nuclear fusion) or protected (e.g. medical data).
For limited datasets, e.g. n = 1,000 data points, it is not possible to simultaneously get good estimates of both the predictive function and the validation error.
For small datasets, we have two opposite cases.
Recall: model f is trained on the full training set of n data points, and f′ is trained on the n − m remaining training points.
Case #1: small number m of validation data / large number n − m of training data (e.g. n − m = 90 train, m = 10 validation): f′ is a good learner close to f, but L_val is a noisy estimate of the test error (Std = σ/√m).
Case #2 is the opposite: a large m gives a reliable L_val, but f′ is trained on too few points to be a good proxy for f.
k-fold cross-validation technique:
Split the original training set S into k parts, i.e. each fold has n/k data points.
Repeat for all folds: train on k − 1 parts and leave one part out as the validation set.
Advantages
Each data point in the original training set is used once as validation data.
For each fold, we have a large training set to train a good learner, i.e. L_test(f) ≈ L_test(f′).
We also have a good estimate of the validation error by averaging over all folds, i.e. L_test ≈ mean_folds L_val(f′).
[Figure: S split into k = 10 folds of 10% each; in each round, one fold serves as S_val and the remaining folds form S_train.]
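A minimal k-fold sketch, reusing the assumed `fit`/`val_loss` interface from the hold-out example above:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, val_loss, lam, seed=0):
    """Average the validation loss over k folds: L_test ~ mean over folds of L_val(f')."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(val_loss(fit(X[train], y[train], lam), X[val], y[val]))
    return float(np.mean(scores))
```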
Example: model selection using cross-validation
Models: linear and constant models.
Original training set S: n = 3 data points.
Cross-validation values: m = 1 (validation set S_val), n − m = 2 (training set S_train), i.e. k = 3 folds.
For each model, L_CV = (1/3)(ℓ₁ + ℓ₂ + ℓ₃), where ℓ_i is the validation loss on fold i.
Linear model: L_CV = 8.2. Constant model: L_CV = 4.3.
Cross-validation shows that the constant model is a better fit than the linear model for this dataset.
In practice
Hypotheses are an arbitrary set of choices, e.g. choices of predictive models {f_θ}, choices of parameter values {λ}, etc.
Note that parameters such as the number of layers in a neural network are not differentiable, i.e. gradient descent cannot be used to select their optimal value.
For very small datasets, we cannot afford to leave out more than a single training point for validation, so we use k = n folds (i.e. m = 1 validation point), a.k.a. Leave-One-Out Cross-Validation (LOOCV).
Telescopic search is a standard two-step approach to determine parameter values (sketched below).
First step: find the best order of magnitude for λ, e.g. λ = 0.01, 0.1, 1, 10, 100.
Second step: do a fine-grained search around the best λ found in the first step. For example, if 10 is the best-performing value from the first step, then try λ = 3, 6, 10, 30, 60, 90.
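A sketch of the two-step telescopic search on top of the `k_fold_cv` helper above (it assumes `X`, `y`, `fit`, and `val_loss` are in scope; the ×0.3/×0.6/×3/... multipliers are my generalization of the 3, 6, 10, 30, 60, 90 example):

```python
# Step 1: coarse order-of-magnitude search
coarse = [0.01, 0.1, 1.0, 10.0, 100.0]
lam1 = min(coarse, key=lambda lam: k_fold_cv(X, y, 10, fit, val_loss, lam))

# Step 2: fine-grained search around the winner
# (e.g. 3, 6, 10, 30, 60, 90 when lam1 = 10)
fine = [lam1 * f for f in (0.3, 0.6, 1.0, 3.0, 6.0, 9.0)]
lam_star = min(fine, key=lambda lam: k_fold_cv(X, y, 10, fit, val_loss, lam))
```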
Summary
Cross-validation is a sound technique that works well in practice for different hypotheses and different training-set sizes.
The validation set is a surrogate of the test set, but its capacity to represent the test distribution well depends on its size.
Best-case scenario: training and validation sets are both large enough.
Worst-case scenario: either the training or the validation set is small; then k-fold cross-validation is required.
Even in the best-case scenario, the learner must still be fully trained for each hypothesis, which can be time-consuming, e.g. in deep learning.
Can we develop a faster regularization technique to avoid over-fitting?
Early stopping
The fastest regularization technique to avoid over-fitting.
Stop optimization after T gradient steps, when the validation error starts increasing, even if the optimization has not converged yet.
Not really satisfying from an optimization-theory perspective, but it works well in practice.
One of the most common regularization techniques in deep learning to control over-fitting.
[Figure: train error decreases with the number of gradient steps (iterations) while the validation error, a proxy for the test error, follows a U-shape curve; optimization is stopped at step T, the validation minimum (right-fitting, minimal over-fitting).]
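A framework-agnostic sketch of the early-stopping loop; `step_fn`, `val_loss_fn`, and the `patience` rule are my assumptions standing in for one SGD step, the validation metric, and the stopping criterion.

```python
import copy

def train_with_early_stopping(model, step_fn, val_loss_fn, max_steps, patience=10):
    """Keep the best-validated parameters; stop once validation stops improving."""
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for t in range(max_steps):
        step_fn(model)                      # one gradient step on the training loss
        v = val_loss_fn(model)
        if v < best_loss:                   # validation error still decreasing
            best_loss, best_model, since_best = v, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:      # validation error keeps increasing: stop here (T = t)
                break
    return best_model
```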
Double descent
Classical machine learning (ML)
Bias-variance trade-off curve w.r.t. model complexity.
U-shape curve for the test/generalization error.
[Figure: train error decreases with model complexity while test error follows a U-shape curve; bias² decreases and variance increases with model complexity.]
How to interpret the U-shape bias-variance trade-off curve?
In classical ML theory, when model complexity increases, variance and generalization error also increase.
However, (deep learning) practitioners have observed the opposite phenomenon: when model complexity increases, the generalization error decreases!
This empirical result contradicts the conventional theory and shows a significant gap between theory and practice.
To reconcile this inconsistency and better understand the properties of modern large ML models, a new learning mechanism known as "double descent" was introduced in 2018.
Double descent curve w.r.t. the ratio between the model complexity and the dataset size:
[Figure: train and test error vs. the ratio model complexity / dataset size = # parameters / # data, on a log scale from 10⁻² to 10²; the classical ML regime shows the U-shape, and the modern ML regime shows a second descent. Marked points: right-fitting (early stopping), min over-fitting (interpolation point), double descent (min test error).]
Double descent curve for bias and variance:
[Figure: on the same axis, bias² decreases while variance peaks near the interpolation point and then decreases again in the modern ML regime.]
New learning curves
Let p be the number of parameters of the learner and n the number of training data points.
Under-parametrized functions are defined by p ≪ n (classical ML).
Over-parametrized functions are defined by p ≫ n (modern ML).
We also introduce the interpolation point, p = n, the minimal capacity needed to over-fit the training set.
In the classical ML paradigm, the optimal test error is at the minimum of the bias-variance trade-off and is captured in practice with early stopping using a validation set.
Classical ML establishes the existence of a right balance between under-fitting and over-fitting. Beyond this balance point, i.e. in over-fitting, generalization fails.
In the modern ML regime, over-fitting is actually considered beneficial, and over-parametrized functions with high model complexity lead to successful generalization.
Understanding the second descent (new learning mechanism)
When p = n, the model possesses just enough parameters to over-fit all the training data. However, it also exhibits a significant variance, making it unable to generalize (standard result).
When p ≫ n, the model has many more parameters than training data points. In this regime, the learner f_θ(x) continues to over-fit but, critically, the L2 norm of its parameters ‖θ‖₂² is significantly minimized by SGD, effectively reducing the model capacity (regularization effect).
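A small demonstration (mine, under the assumption of linear over-parametrized regression) of this norm-minimization effect: among the infinitely many interpolating solutions when p ≫ n, the pseudo-inverse picks the one with minimal L2 norm, which is also the solution gradient descent from zero converges to for linear least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                       # p >> n: over-parametrized regime
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta = np.linalg.pinv(X) @ y        # minimum-norm interpolating solution
print(np.allclose(X @ theta, y))     # True: the training data is fit exactly
print(np.linalg.norm(theta))         # smallest ||theta||_2 among all interpolators
```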
GD vs. SGD
Observe that the full gradient of the loss is zero in the over-fitting region. Consequently, the parameters cannot change, and the second descent phenomenon cannot emerge when using standard GD.
In contrast, the stochastic mini-batch gradient is never zero in the over-fitting region. The parameters continue to be updated, and the second descent can occur.
SGD is critical in deep learning for several reasons:
It helps escape saddle points in the loss landscape during optimization.
It finds better local (or even global) minima, allowing successful generalization.
It speeds up computation (by updating the parameters more often).
It is necessary for the double descent phenomenon to emerge.
Illustration
[Figure: an under-parametrized function (classical ML) vs. an over-parametrized function (modern ML), with the corresponding test error, bias², and variance curves.]
Farewell to early stopping?
Do we "simply" need over-parametrized functions, train them, and achieve minimal error?
Unfortunately, the double descent regularization only emerges with exceedingly large networks. The critical threshold to enable double descent is p* = O(n·k), where k is the number of classes.
Computer vision
ImageNet: n = 10⁶ (1.3M images), k = 10³ (1k classes) ⇒ p* = 10⁹; ResNet-152 has p = 60.2M (10⁷) parameters ≪ 10⁹.
ViT: n = 10⁹ (4B images), k = 10⁴ (30k classes) ⇒ p* = 10¹³; ViT-22B has p = 22B (10¹⁰) parameters ≪ 10¹³.
NLP
n = 10¹¹ (300B training tokens), k = 10⁴ (35k unique tokens) ⇒ p* = 10¹⁵; GPT-3 has p = 175B (10¹¹) parameters ≪ 10¹⁵.
At present, practitioners use early stopping as their primary regularization technique.
By design, early stopping does not lead to the double descent phenomenon.
Some key observations
The double descent learning mechanism applies to both non-linear and linear ML models, including techniques such as decision trees, kernel methods, and deep learning.
The phenomenon is independent of the nature of the datasets involved.
One important ML principle is that more data provides better results; both theory and empirical experiments align on this principle.
However, this trend continually increases the critical threshold p* of network parameters required for the double descent phenomenon to manifest.
Conclusion
Over-fitting and high variance are among the most common issues in machine learning.
Regularization techniques
Stochastic gradient descent: the smaller the mini-batch, the better the generalization, but the slower the training.
Loss regularization and cross-validation to estimate the right amount of regularization.
Early stopping to terminate optimization before over-fitting.
Double descent with over-parametrized functions.
Questions?