
Learning Systems (DT8008)
Lecture 4.2: Generalization and Regularization

• Overfitting and Generalization
• Regularization

Dr. Mohamed-Rafik Bouguelia
[email protected]
Halmstad University
Quick reminder about overfitting

The problem of overfitting

[Figure: regression fits illustrating underfitting vs. overfitting]
[Figure: examples of overfitting in classification and in regression]
Addressing overfitting

1. Model selection (previous lecture)
   – Try various models (of different complexity), compute the generalization
     error for each (as explained previously), and keep the best model.

2. Reducing the number of features (previous lecture)
   – We are more likely to overfit when the number of features is high
     (relative to the size of the dataset).
   • Manually select which features to keep / remove
   • Or use feature selection algorithms

3. Using an ensemble method (previous lecture)

4. Using regularization (this lecture)
   – Keep all features, but reduce the magnitude / values of the parameters θⱼ
   – Works well when we have a lot of features, and each feature contributes
     a bit to predicting y
Regularization

Regularization - Motivation

h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁²          h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³ + θ₄x₁⁴

In the second model we added more features, e.g. x₁³ and x₁⁴. This model
overfits the data and does not generalize well.

Suppose that we penalize θ₃ and θ₄ to make them really small (≈ 0):

  min_θ  (1/2n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + 1000·θ₃² + 1000·θ₄²

Then the only way to make this new cost function small is if θ₃ and θ₄ are
small. With θ₃ ≈ 0 and θ₄ ≈ 0, we are essentially left with the simpler
quadratic model, which generalizes better.
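To make this concrete, here is a minimal numpy sketch (the toy data is illustrative; the penalty weight 1000 follows the slide) that minimizes exactly this penalized cost in closed form and shows the penalty driving θ₃ and θ₄ toward zero:

```python
import numpy as np

# Toy data: y is roughly quadratic in x1, plus a little noise (illustrative).
rng = np.random.default_rng(0)
x1 = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x1 - 1.5 * x1**2 + rng.normal(scale=0.1, size=x1.size)

# Polynomial features up to degree 4: columns [1, x1, x1^2, x1^3, x1^4].
X = np.vander(x1, N=5, increasing=True)
n = X.shape[0]

# Penalty of 1000 on theta_3 and theta_4 only, as on the slide.
L = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])

# Setting the gradient of (1/2n)*||X @ theta - y||^2 + theta.T @ L @ theta
# to zero gives the linear system (X.T @ X + 2n*L) theta = X.T @ y.
theta_plain = np.linalg.solve(X.T @ X, X.T @ y)
theta_penalized = np.linalg.solve(X.T @ X + 2 * n * L, X.T @ y)

print("unpenalized theta:", np.round(theta_plain, 3))
print("penalized theta:  ", np.round(theta_penalized, 3))  # theta_3, theta_4 ~ 0
```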
Regularization

• Small values for the parameters θ₀, θ₁, …, θ_p
  – imply a simpler hypothesis
  – less prone to overfitting

• So we just modify our cost function as follows:

  E(θ) = (1/2n) [ Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ₌₁ᵖ θⱼ² ]

  (by convention, the intercept θ₀ is not penalized)

λ is the regularization parameter (a hyper-parameter). It controls the
trade-off between two objectives:
• Objective 1: fit the training dataset well
• Objective 2: keep the parameters small
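A minimal numpy sketch of this modified cost (the arrays X and y are hypothetical; X is assumed to carry a leading column of ones):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """E(theta) = (1/2n) * (sum of squared errors + lam * sum_{j>=1} theta_j**2).
    The intercept theta_0 is not penalized, by convention."""
    n = X.shape[0]
    errors = X @ theta - y                   # h_theta(x^(i)) - y^(i) for all i
    penalty = lam * np.sum(theta[1:] ** 2)   # skip theta[0]
    return (errors @ errors + penalty) / (2 * n)
```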
Regularization

What happens if λ is set to zero?
• E(θ) becomes our original cost function, so overfitting can happen.

What happens if λ is set to an extremely large value?
• The algorithm might result in underfitting.
• Example for linear regression. Suppose:

  h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³ + θ₄x₁⁴

  We will end up penalizing θ₁, θ₂, θ₃, θ₄ (their values will be close to 0),
  leaving h_θ(x) ≈ θ₀: a flat line that underfits the data.

So it's good to try several values for λ and estimate the generalization
error each time, keeping the value of λ that gives the lowest estimated
generalization error.
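A minimal sketch of this λ-selection procedure, assuming scikit-learn is available (Ridge's alpha parameter plays the role of λ here; the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative toy data: a noisy quadratic trend.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=30)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    # Degree-4 polynomial features, then ridge regression with strength lam.
    model = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=lam))
    # 5-fold cross-validation estimates the generalization error (here, MSE).
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"lambda = {lam:>6}: estimated generalization MSE = {mse:.4f}")
```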
Regularized Linear Regression

Regularized Linear Regression

We minimize:

  E(θ) = (1/2n) [ Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ₌₁ᵈ θⱼ² ]

where h_θ(x) = θᵀx = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θ_d x_d

• By the way, how can you write E(θ) in a more compact way, using
  vectors/matrices?

  E(θ) = (1/2n) ( ‖Xθ − y‖² + λ ‖θ₁:d‖² )

  where Xθ is the vector of predictions, y is the vector of true outputs,
  and θ₁:d is the vector of parameters θ₁, θ₂, …, θ_d (θ₀ is excluded from
  the penalty).
Regularized Linear Regression
Gradient Descent

Repeat until convergence, updating θ₀, θ₁, …, θ_d simultaneously:

  θ₀ := θ₀ − α (1/n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  θⱼ := θⱼ (1 − α λ/n) − α (1/n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾    for j = 1, …, d

The factor (1 − α λ/n) shrinks θⱼ by some ratio times its current value; the
second term is the same as what we had previously in (unregularized) gradient
descent.
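A minimal numpy sketch of this update rule (the function name and interface are illustrative; X is assumed to carry a leading column of ones):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=0.1, iters=1000):
    """Gradient descent for regularized linear regression.
    X is n-by-(d+1) with a leading column of ones; lam plays the role of lambda."""
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(iters):
        residual = X @ theta - y        # h_theta(x^(i)) - y^(i) for all i
        grad = (X.T @ residual) / n     # same gradient as unregularized GD
        reg = (lam / n) * theta         # regularization part of the gradient...
        reg[0] = 0.0                    # ...except theta_0, which is not penalized
        theta -= alpha * (grad + reg)   # simultaneous update of all theta_j
    return theta
```

Folding the regularization part into the update reproduces the slide's form: θⱼ − α(gradⱼ + (λ/n)θⱼ) = θⱼ(1 − αλ/n) − α·gradⱼ.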
Regularized Linear Regression
Normal equation

• Previously (in the lecture about linear regression), when we computed the
  derivative of the cost function (without the regularization term) and set
  it equal to 0 (to find the optimal θ), we found that the solution is:

  θ = (XᵀX)⁻¹ Xᵀy

• If we do the same while including the regularization term in our cost
  function, then the solution becomes:

  θ = (XᵀX + λM)⁻¹ Xᵀy

  where M is the (d+1)×(d+1) identity matrix with its top-left entry set to
  0, so that θ₀ is not regularized.
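A direct numpy translation of this closed-form solution (a sketch; X is again assumed to contain a leading column of ones):

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    """Solve (X^T X + lam * M) theta = X^T y, where M is the identity
    with M[0, 0] = 0 so that the intercept theta_0 is not penalized."""
    M = np.eye(X.shape[1])
    M[0, 0] = 0.0
    # np.linalg.solve is preferred over explicitly inverting the matrix.
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)
```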
Regularized Logistic Regression
(for classification)

Regularized Logistic Regression

h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁²x₂ + θ₄x₁²x₂² + θ₅x₁²x₂³ + θ₆x₁³x₂ + …)

With many high-order polynomial features like these, the decision boundary
can become overly complex and overfit. As with linear regression, we add a
regularization term, (λ/2n) Σⱼ₌₁ᵖ θⱼ², to the logistic regression cost
function.
Regularized Logistic Regression
Gradient Descent

Simultaneously update all parameters θ₀, θ₁, …, θ_p:

  θ₀ := θ₀ − α (1/n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  θⱼ := θⱼ (1 − α λ/n) − α (1/n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾    for j = 1, …, p

The update looks identical to the one for regularized linear regression, but
here the hypothesis is the sigmoid h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)).
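The same gradient descent sketch adapted to logistic regression (names and interface are again illustrative; X carries a leading column of ones and y holds 0/1 labels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, lam=1.0, alpha=0.1, iters=1000):
    """Gradient descent for regularized logistic regression."""
    n, p1 = X.shape
    theta = np.zeros(p1)
    for _ in range(iters):
        error = sigmoid(X @ theta) - y   # h_theta(x^(i)) - y^(i) for all i
        grad = (X.T @ error) / n
        reg = (lam / n) * theta
        reg[0] = 0.0                     # theta_0 is not penalized
        theta -= alpha * (grad + reg)    # simultaneous update of all theta_j
    return theta
```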
