Lecture 7 - Part A - Multi-Class Classification, Overfitting, and Regularization


Classification, Logistic Regression, Overfit

and Regularization
Mariette Awad

Source for this set of slides: Stanford Intro to ML course


Quote of the Day
Lecture Outcomes

Logistic Regression Learning with Gradient Descent

Logistic Regression – Multi-Class Classification

Overfit and Underfit

Regularization
Logistic Regression – Learning with Gradient Descent
Review - Logistic Regression Model

Hypothesis: hθ(x) = g(θᵀx), with g(z) = 1 / (1 + e^(−z))
Probability interpretation:
hθ(x) = P(y = 1 | x; θ) = estimated probability that y = 1 on input x
Decision Boundary: predict y = 1 when hθ(x) ≥ 0.5, i.e. when θᵀx ≥ 0
Cost Function:
J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

To fit parameters θ: min_θ J(θ)
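To make the formulas above concrete, here is a minimal NumPy sketch of the sigmoid hypothesis and the logistic cost J(θ). It is an illustration only; the tiny data arrays X and y are hypothetical and not part of the slides.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x); X has a leading column of ones for theta_0
    return sigmoid(X @ theta)

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y*log(h) + (1 - y)*log(1 - h) ]
    m = len(y)
    h = hypothesis(theta, X)
    eps = 1e-12  # guard against log(0)
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Hypothetical toy data: first column is the intercept term
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost(theta, X, y))  # at theta = 0 every h = 0.5, so the cost is log(2) ~ 0.693
```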
Review - Logistic Regression Model
Cost function should have:
Zero cost for correct decision
Large (infinite) cost for wrong decision

Test it for all combinations:

- y(true) = y(predicted) = 0: cost ≈ 0
- y(true) = y(predicted) = 1: cost ≈ 0
- y(true) = 0, y(predicted) = 1: cost → ∞
- y(true) = 1, y(predicted) = 0: cost → ∞

[Figure: per-example cost curves, −log(hθ(x)) for y = 1 and −log(1 − hθ(x)) for y = 0.]
Classification and Regression Visually
Gradient Descent

Want min_θ J(θ):
Repeat {
  θj := θj − α ∂J(θ)/∂θj
} (simultaneously update all θj)

Plugging in the derivative of the logistic-regression cost:
Repeat {
  θj := θj − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)
} (simultaneously update all θj)

The algorithm looks identical to linear regression! The difference is hidden in the hypothesis: hθ(x) is now the sigmoid of θᵀx rather than θᵀx itself.

Note that feature scaling is also beneficial for logistic regression.
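A minimal sketch of the batch update above, assuming a design matrix X whose first column is all ones (for θ0) and a hand-picked learning rate α; the toy data is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    Each iteration applies theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
    to all j at once (one vectorized, simultaneous update)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # predictions for all m examples
        grad = X.T @ (h - y) / m        # vector of partial derivatives dJ/dtheta_j
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Hypothetical toy data: one feature plus an intercept column
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta))  # probabilities move toward 0, 0, 1, 1
```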
Advanced Optimization Algorithms
Given θ, we have code that can compute:
- J(θ)
- ∂J(θ)/∂θj (for j = 0, 1, …, n)

One option: gradient descent.
Other advanced options:
• Conjugate gradient (computes the best step size ⍺ at every step, e.g. for steepest descent)
• BFGS (Broyden–Fletcher–Goldfarb–Shanno): determines the descent direction by preconditioning the gradient with curvature information; it does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations
• L-BFGS (Limited-memory BFGS): approximates BFGS using a limited amount of computer memory

Advantages of the advanced options:
- No need to manually pick ⍺
- Often faster than gradient descent
Disadvantages:
- More complex
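As an illustration of handing J(θ) and its gradient to an off-the-shelf optimizer instead of writing your own loop, here is a sketch using SciPy's minimize with the L-BFGS-B method. The helper names and toy data are assumptions for the example, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, which is all the optimizer needs."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12
    J = -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / m
    return J, grad

# Hypothetical toy data
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# L-BFGS picks its own step sizes; there is no learning rate alpha to tune by hand.
res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x, res.fun)  # fitted parameters and final cost
```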
Potential References (based on quick search)
Details of the advanced optimization techniques are out of scope.
Conjugate Gradient:
https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
https://en.wikipedia.org/wiki/Conjugate_gradient_method
BFGS:
http://www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
https://en.wikipedia.org/wiki/Broyden–Fletcher–Goldfarb–Shanno_algorithm
L-BFGS:
https://en.wikipedia.org/wiki/Limited-memory_BFGS
Logistic Regression – Multi-Class Classification
Multiclass classification - Examples

Opinion/Sentiment (Like): 1, 2, 3, 4, or 5 stars (5 classes)

Email tagging: Work, Friends, Family, Hobby (4 classes)

Medical diagnosis: Not ill, Cold, Flu (3 classes)

Weather: Sunny, Cloudy, Rain, Snow (4 classes)


Multiclass classification – Graphical Illustration

[Figure: binary classification (two classes, one decision boundary) vs. multi-class classification (three classes: triangles, squares, and x's) in the x1–x2 plane.]

One-vs-all (one-vs-rest): turn the problem into several binary classification problems, one per class:
- Class 1: triangles vs. the rest
- Class 2: squares vs. the rest
- Class 3: x's vs. the rest
One-vs-all
Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x).
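A minimal one-vs-all sketch under the assumptions above: one binary logistic regression classifier per class, trained here with plain gradient descent, and prediction by taking the class whose classifier reports the highest probability. The helper names and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y_binary, alpha=0.1, n_iters=2000):
    # Plain batch gradient descent for one binary logistic-regression classifier.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y_binary) / m
        theta -= alpha * grad
    return theta

def train_one_vs_all(X, y, classes):
    # One classifier per class c, trained on the binary labels (y == c).
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(thetas, X):
    # Pick the class whose classifier h_theta^(i)(x) is largest.
    labels = np.array(sorted(thetas))
    probs = np.column_stack([sigmoid(X @ thetas[c]) for c in labels])
    return labels[np.argmax(probs, axis=1)]

# Hypothetical toy data with 3 classes and an intercept column
X = np.array([[1, 0.2], [1, 0.4], [1, 2.0], [1, 2.2], [1, 4.0], [1, 4.3]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
thetas = train_one_vs_all(X, y, classes=[0, 1, 2])
print(predict(thetas, X))
```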
Other Multi-Class Classification Approaches
One model that simultaneously outputs a probability prediction for every class.

Choose the class with the highest probability.


Overfitting
What is Overfitting? Example: Linear regression

[Figure: three fits of housing price vs. size, from a simple straight-line fit to an increasingly flexible polynomial fit.]

Which model is: underfit; overfit; high bias; high variance; just right?

Left: underfit, high bias. Middle: just right. Right: overfit, high variance.

Overfitting: the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples (e.g. fail to predict prices of new houses).
Error due to Bias
The difference between the expected (average) prediction of the model and the correct value we are trying to predict.

Bias measures how far off, on average, the model's predictions are from the correct value. High bias means the learning model makes strong, erroneous assumptions and misses relevant relations between features and the target output: underfit.
Error due to Variance
Describes how much a prediction deviates from its average value (the mean of the squared deviation), i.e. the variability of a model's prediction for a given data point (its sensitivity to small fluctuations in the training set).

If the entire model-building process is repeated multiple times, variance is how much the predictions for a given point vary between different realizations of the model.

A high-variance model fits the random noise in the training data rather than the intended outputs: overfit.
Where are High/Low Bias and Variance?
Graphical Representations of Bias and Variance
Another Example with Logistic Regression

[Figure: three decision boundaries in the x1–x2 plane, from a simple linear boundary to an increasingly wiggly one; hθ(x) = g(θᵀx), where g is the sigmoid function.]

Which one is underfit; overfit; high bias; high variance; just right?

Left: underfit, high bias. Middle: just right. Right: overfit, high variance.


Addressing Overfitting (1 of 2):
 For low-dimensional features, plot the data (e.g. price vs. size), identify noisy data, and select the best polynomial visually.
 This may not be feasible in general when many features exist:
size of house
no. of bedrooms
no. of floors
age of house
average income in neighborhood
kitchen size
Addressing overfitting (2 of 2):
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters θj.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
3. Bootstrap, Bagging and Boosting.
Regularization in Cost Function
Intuition

[Figure: housing price vs. size of house, fit with a quadratic model and with a higher-order polynomial model.]

Suppose we want to penalize θ3 and θ4 and make them really small:

min_θ (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ (θ3² + θ4²)

By choosing λ large enough, we force θ3 and θ4 to be small.

Regularization in Cost Function

J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

Here λ is the regularization parameter and λ Σ_{j=1..n} θj² is the regularization term (notice the sum starts at j = 1, so the intercept θ0 is not penalized).
The regularization parameter λ provides a tradeoff between error minimization and generalization.

[Figure: housing price vs. size of house, fit with and without regularization.]
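A short sketch of the regularized cost above, with the penalty sum starting at j = 1 so that θ0 is not penalized; the data and λ values are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2 ].
    theta[0] (the intercept) is excluded from the penalty."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # note: the penalty sum starts at j = 1
    return (np.sum(residuals ** 2) + penalty) / (2 * m)

# Hypothetical example: a larger lambda adds a larger penalty for the same theta
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.5])
for lam in (0.0, 1.0, 100.0):
    print(lam, regularized_cost(theta, X, y, lam))
```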
Underfitting (High Bias) – Regularization Parameter Too Large
In regularized linear regression, we choose θ to minimize

J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

with hθ(x) a polynomial in the house size, e.g. θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴.

What if λ is set to an extremely large value (perhaps far too large for our problem)? Then θ1, …, θ4 are all driven toward 0 and the hypothesis collapses to hθ(x) ≈ θ0: a flat line through the data that underfits.

[Figure: housing price vs. size of house, with a flat fit labeled "underfit".]
Regularized Linear Regression
Regularized linear regression

min_θ J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

The resulting model is called Ridge Regression (L2 penalty).

Gradient descent for Regularized linear regression

Repeat {
  θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) x0^(i)
  θj := θj − α [ (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ]   (j = 1, …, n)
}

Equivalently, θj := θj (1 − α λ/m) − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i).
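A minimal sketch of the update above; note that θ0 is updated without the shrinkage term. The data, α, and λ are illustrative choices.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=0.01, lam=1.0, n_iters=5000):
    """theta_0 := theta_0 - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_i0
       theta_j := theta_j*(1 - alpha*lam/m) - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_ij, j >= 1"""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m     # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]    # add (lam/m)*theta_j for j >= 1 only
        theta -= alpha * grad
    return theta

# Hypothetical data: y is roughly 2*x + 1 with a little noise
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(ridge_gradient_descent(X, y, lam=0.0))   # close to [1, 2]
print(ridge_gradient_descent(X, y, lam=10.0))  # slope shrunk toward 0
```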
Regularization with Normal equations (1 of 2)
Recall the normal equation without regularization:

min_θ J(θ) results in θ = (XᵀX)⁻¹ Xᵀ y

If m ≤ n (#examples ≤ #features), then XᵀX is not invertible, but this can be addressed by regularization.
Regularization with Normal equations (2 of 2)
Suppose m ≤ n (#examples ≤ #features). Then XᵀX is not invertible.

With regularization (λ > 0):

θ = (XᵀX + λ M)⁻¹ Xᵀ y

where M is the (n+1)×(n+1) identity matrix with its top-left entry (the θ0 position) set to 0. The matrix XᵀX + λM is invertible.
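A sketch of the regularized normal equation above, assuming M is the identity matrix with a zero in the θ0 position; the tiny wide data set (more columns than examples) is hypothetical.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * M)^(-1) X^T y,
    where M is the identity with its top-left (theta_0) entry set to 0."""
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0                      # do not penalize the intercept
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Hypothetical case with more columns than examples: X^T X alone is singular,
# but adding lam * M (lam > 0) makes the system solvable.
X = np.array([[1.0, 0.5, 1.5, 2.0],
              [1.0, 1.0, 0.5, 1.0]])   # m = 2 examples, n = 4 columns
y = np.array([1.0, 2.0])
print(ridge_normal_equation(X, y, lam=1.0))
```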
Illustration of Ridge Regression performance
Impact of the penalty λ on RMSE

[Figure: RMSE vs. λ, showing overfit at small λ, underfit at large λ, and the right fit in between.]

 Cross-validation performance with Ridge Regression for different values of λ.
 As λ increases, the error (initially) decreases, but the bias increases.
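One way (not from the slides) to reproduce a curve like this is scikit-learn's RidgeCV, which scores a grid of penalty values (called alphas there) by cross-validation; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic, hypothetical data: a noisy linear trend plus several useless features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=30)

# Try a range of penalty strengths and keep the one with the best cross-validated score
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("best lambda (alpha):", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```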
Lasso Regression (L1 penalty)
Lasso Regression is the model derived by adding the L1 penalty to the SSE error:

SSE_L1 = Σ_{i=1..m} (y_i − ŷ_i)² + λ Σ_{j=1..P} |βj|

where P is the number of regression coefficients βj.

 All previous comments apply:
 The regression coefficients are allowed to be large if they contribute to a reduction in SSE.
 The larger the penalty λ, the smaller the coefficients. Large coefficients indicate overfitting or collinearity (redundancy).
 A tradeoff between bias and variance 

 However, in this case the penalty forces some coefficients to go to zero.


 As a result, Lasso regression can be used for feature selection, or
reduction of attributes
 Many possible optimization solution options
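A short scikit-learn sketch (not from the slides) of the feature-selection effect: with an L1 penalty, many coefficients are driven exactly to zero. The synthetic data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first two of ten features actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.3, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of the penalty lambda
print("coefficients:", np.round(lasso.coef_, 3))
print("selected features:", np.flatnonzero(lasso.coef_))  # most entries are exactly 0
```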
Elastic Net Regression (L1 & L2 penalties)

• Elastic Net Regression is the model derived by adding both L1 and L2 penalties to the SSE error (one common parameterization):

SSE_EN = Σ_{i=1..m} (y_i − ŷ_i)² + λ [ α Σ_{j=1..P} |βj| + (1 − α) Σ_{j=1..P} βj² ]

where P is the number of regression coefficients βj, α balances the L1 and L2 parts, and λ controls the overall penalty strength.

• Combines and generalizes the Ridge and Lasso regression models.

• Requires tuning of both parameters α and λ for best performance.
• Learned by an algorithm called LARS-EN.
• Source: "Regularization and variable selection via the elastic net", https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
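A sketch of the same idea using scikit-learn's ElasticNet rather than the LARS-EN algorithm mentioned above; there, alpha sets the overall penalty strength and l1_ratio balances the L1 and L2 parts. The synthetic data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data with correlated features, where a pure L1 penalty can be unstable
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)   # feature 1 nearly duplicates feature 0
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=100)

# l1_ratio = 1 recovers Lasso, l1_ratio = 0 recovers Ridge; values in between mix the two
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients:", np.round(enet.coef_, 3))
```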
Regularized Logistic Regression
Recall logistic regression without Regularization
Subject to overfitting (a very complicated decision boundary in the x1–x2 plane).

Cost function without regularization:

J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

Regularization adds the term (λ/2m) Σ_{j=1..n} θj² to this cost.

Gradient descent for Regularized Logistic regression

Repeat {
  θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) x0^(i)
  θj := θj − α [ (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ]   (j = 1, …, n)
}

The update rule looks the same as for regularized linear regression, but hθ(x) is now the sigmoid g(θᵀx).
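Finally, a sketch of the regularized logistic update above; as before, θ0 is excluded from the shrinkage term, and the data, α, and λ are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, alpha=0.1, lam=1.0, n_iters=3000):
    """Gradient descent for regularized logistic regression:
    theta_j := theta_j - alpha * [ (1/m)*sum_i (h(x_i) - y_i)*x_ij + (lam/m)*theta_j ]
    for j >= 1, with theta_0 updated without the (lam/m) term."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]    # shrinkage term, skipping the intercept
        theta -= alpha * grad
    return theta

# Hypothetical toy data (intercept column plus one feature)
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(regularized_logistic_gd(X, y, lam=0.0))  # larger weights: fits the training data tightly
print(regularized_logistic_gd(X, y, lam=5.0))  # smaller weights: smoother, more regularized
```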
