Lecture 7 - Part A - Multi-Class Classification, Overfitting, and Regularization


Classification, Logistic Regression, Overfit

and Regularization
Mariette Awad

Source for this set of slides: Stanford Intro to ML course


Quote of the Day
Lecture Outcomes

Logistic Regression Learning with Gradient Descent

Logistic Regression – Multi-Class Classification

Overfit and Underfit

Regularization
Logistic Regression – Learning with Gradient Descent
Review - Logistic Regression Model

Hypothesis: hθ(x) = g(θᵀx), with g(z) = 1 / (1 + e^(−z))
Probability interpretation:
hθ(x) = P(y = 1 | x; θ) = estimated probability that y = 1 on input x
Decision Boundary: predict y = 1 when hθ(x) ≥ 0.5, i.e. when θᵀx ≥ 0
Cost Function:
J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

To fit parameters θ: min_θ J(θ)
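To make the formulas above concrete, here is a minimal NumPy sketch of the sigmoid hypothesis and the logistic cost J(θ). It is an illustration only; the tiny data arrays X and y are hypothetical and not part of the slides.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x); X has a leading column of ones for theta_0
    return sigmoid(X @ theta)

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y*log(h) + (1 - y)*log(1 - h) ]
    m = len(y)
    h = hypothesis(theta, X)
    eps = 1e-12  # guard against log(0)
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Hypothetical toy data: first column is the intercept term
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost(theta, X, y))  # at theta = 0 every h = 0.5, so the cost is log(2) ~ 0.693
```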
Review - Logistic Regression Model
Cost function should have:
Zero cost for correct decision
Large (infinite) cost for wrong decision

Test it for all combinations:

- y(true) = y(predicted) = 0: cost ≈ 0
- y(true) = y(predicted) = 1: cost ≈ 0
- y(true) = 0, y(predicted) = 1: cost → ∞
- y(true) = 1, y(predicted) = 0: cost → ∞

[Figure: per-example cost curves, −log(hθ(x)) for y = 1 and −log(1 − hθ(x)) for y = 0.]
Classification and Regression Visually
Gradient Descent

Want min_θ J(θ):
Repeat {
  θj := θj − α ∂J(θ)/∂θj
} (simultaneously update all θj)

Plugging in the derivative of the logistic-regression cost:
Repeat {
  θj := θj − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)
} (simultaneously update all θj)

The algorithm looks identical to linear regression! The difference is hidden in the hypothesis: hθ(x) is now the sigmoid of θᵀx rather than θᵀx itself.

Note that feature scaling is also beneficial for logistic regression.
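A minimal sketch of the batch update above, assuming a design matrix X whose first column is all ones (for θ0) and a hand-picked learning rate α; the toy data is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    Each iteration applies theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
    to all j at once (one vectorized, simultaneous update)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # predictions for all m examples
        grad = X.T @ (h - y) / m        # vector of partial derivatives dJ/dtheta_j
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Hypothetical toy data: one feature plus an intercept column
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta))  # probabilities move toward 0, 0, 1, 1
```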
Advanced Optimization Algorithms
Given θ, we have code that can compute:
- J(θ)
- ∂J(θ)/∂θj (for j = 0, 1, …, n)

One option: gradient descent.
Other advanced options:
• Conjugate gradient (computes the best step size ⍺ at every step, e.g. for steepest descent)
• BFGS (Broyden–Fletcher–Goldfarb–Shanno): determines the descent direction by preconditioning the gradient with curvature information; it does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations
• L-BFGS (Limited-memory BFGS): approximates BFGS using a limited amount of computer memory

Advantages of the advanced options:
- No need to manually pick ⍺
- Often faster than gradient descent
Disadvantages:
- More complex
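As an illustration of handing J(θ) and its gradient to an off-the-shelf optimizer instead of writing your own loop, here is a sketch using SciPy's minimize with the L-BFGS-B method. The helper names and toy data are assumptions for the example, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, which is all the optimizer needs."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12
    J = -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / m
    return J, grad

# Hypothetical toy data
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# L-BFGS picks its own step sizes; there is no learning rate alpha to tune by hand.
res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x, res.fun)  # fitted parameters and final cost
```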
Potential References (based on quick search)
Details of the advanced optimization techniques are out of scope.
Conjugate Gradient:
https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
https://en.wikipedia.org/wiki/Conjugate_gradient_method
BFGS:
http://www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
https://en.wikipedia.org/wiki/Broyden–Fletcher–Goldfarb–Shanno_algorithm
L-BFGS:
https://en.wikipedia.org/wiki/Limited-memory_BFGS
Logistic Regression – Multi-Class Classification
Multiclass classification - Examples

Opinion/Sentiment (Like): 1, 2, 3, 4, or 5 stars (5 classes)

Email tagging: Work, Friends, Family, Hobby (4 classes)

Medical diagnosis: Not ill, Cold, Flu (3 classes)

Weather: Sunny, Cloudy, Rain, Snow (4 classes)


Multiclass classification – Graphical Illustration

[Figure: binary classification (two classes, one decision boundary) vs. multi-class classification (three classes: triangles, squares, and x's) in the x1–x2 plane.]

One-vs-all (one-vs-rest): turn the problem into several binary classification problems, one per class:
- Class 1: triangles vs. the rest
- Class 2: squares vs. the rest
- Class 3: x's vs. the rest
One-vs-all
Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x).
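A minimal one-vs-all sketch under the assumptions above: one binary logistic regression classifier per class, trained here with plain gradient descent, and prediction by taking the class whose classifier reports the highest probability. The helper names and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y_binary, alpha=0.1, n_iters=2000):
    # Plain batch gradient descent for one binary logistic-regression classifier.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y_binary) / m
        theta -= alpha * grad
    return theta

def train_one_vs_all(X, y, classes):
    # One classifier per class c, trained on the binary labels (y == c).
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(thetas, X):
    # Pick the class whose classifier h_theta^(i)(x) is largest.
    labels = np.array(sorted(thetas))
    probs = np.column_stack([sigmoid(X @ thetas[c]) for c in labels])
    return labels[np.argmax(probs, axis=1)]

# Hypothetical toy data with 3 classes and an intercept column
X = np.array([[1, 0.2], [1, 0.4], [1, 2.0], [1, 2.2], [1, 4.0], [1, 4.3]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
thetas = train_one_vs_all(X, y, classes=[0, 1, 2])
print(predict(thetas, X))
```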
Other Multi-Class Classification Approaches
One model that simultaneously outputs a probability prediction for every class.

Choose the class with the highest probability.


Overfitting
What is Overfitting? Example: Linear regression

[Figure: three fits of housing price vs. size, from a simple straight-line fit to an increasingly flexible polynomial fit.]

Which model is: underfit; overfit; high bias; high variance; just right?

Left: underfit, high bias. Middle: just right. Right: overfit, high variance.

Overfitting: the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples (e.g. fail to predict prices of new houses).
Error due to Bias
The difference between the expected (average) prediction of the model and the correct value we are trying to predict.

Bias measures how far off, on average, the model's predictions are from the correct value. High bias means the learning model makes strong, erroneous assumptions and misses relevant relations between features and the target output: underfit.
Error due to Variance
Describes how much a prediction deviates from its average value (the mean of the squared deviation), i.e. the variability of a model's prediction for a given data point (its sensitivity to small fluctuations in the training set).

If the entire model-building process is repeated multiple times, variance is how much the predictions for a given point vary between different realizations of the model.

A high-variance model fits the random noise in the training data rather than the intended outputs: overfit.
Where are High/Low Bias and Variance?
Graphical Representations of Bias and Variance
Another Example with Logistic Regression

[Figure: three decision boundaries in the x1–x2 plane, from a simple linear boundary to an increasingly wiggly one; hθ(x) = g(θᵀx), where g is the sigmoid function.]

Which one is underfit; overfit; high bias; high variance; just right?

Left: underfit, high bias. Middle: just right. Right: overfit, high variance.


Addressing Overfitting (1 of 2):
 For low-dimensional features, plot the data (e.g. price vs. size), identify noisy data, and select the best polynomial visually.
 This may not be feasible in general when many features exist:
size of house
no. of bedrooms
no. of floors
age of house
average income in neighborhood
kitchen size
Addressing overfitting (2 of 2):
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters θj.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
3. Bootstrap, Bagging and Boosting.
Regularization in Cost Function
Intuition

[Figure: housing price vs. size of house, fit with a quadratic model and with a higher-order polynomial model.]

Suppose we want to penalize θ3 and θ4 and make them really small:

min_θ (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ (θ3² + θ4²)

By choosing λ large enough, we force θ3 and θ4 to be small.

Regularization in Cost Function

J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

Here λ is the regularization parameter and λ Σ_{j=1..n} θj² is the regularization term (notice the sum starts at j = 1, so the intercept θ0 is not penalized).
The regularization parameter λ provides a tradeoff between error minimization and generalization.

[Figure: housing price vs. size of house, fit with and without regularization.]
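A short sketch of the regularized cost above, with the penalty sum starting at j = 1 so that θ0 is not penalized; the data and λ values are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2 ].
    theta[0] (the intercept) is excluded from the penalty."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # note: the penalty sum starts at j = 1
    return (np.sum(residuals ** 2) + penalty) / (2 * m)

# Hypothetical example: a larger lambda adds a larger penalty for the same theta
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.5])
for lam in (0.0, 1.0, 100.0):
    print(lam, regularized_cost(theta, X, y, lam))
```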
Underfitting (High Bias) – Regularization Parameter Too Large
In regularized linear regression, we choose θ to minimize

J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

with hθ(x) a polynomial in the house size, e.g. θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴.

What if λ is set to an extremely large value (perhaps far too large for our problem)? Then θ1, …, θ4 are all driven toward 0 and the hypothesis collapses to hθ(x) ≈ θ0: a flat line through the data that underfits.

[Figure: housing price vs. size of house, with a flat fit labeled "underfit".]
Regularized Linear Regression
Regularized linear regression

min_θ J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

The resulting model is called Ridge Regression (L2 penalty).

Gradient descent for Regularized linear regression

Repeat {
  θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) x0^(i)
  θj := θj − α [ (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ]   (j = 1, …, n)
}

Equivalently, θj := θj (1 − α λ/m) − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i).
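A minimal sketch of the update above; note that θ0 is updated without the shrinkage term. The data, α, and λ are illustrative choices.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=0.01, lam=1.0, n_iters=5000):
    """theta_0 := theta_0 - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_i0
       theta_j := theta_j*(1 - alpha*lam/m) - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_ij, j >= 1"""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m     # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]    # add (lam/m)*theta_j for j >= 1 only
        theta -= alpha * grad
    return theta

# Hypothetical data: y is roughly 2*x + 1 with a little noise
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(ridge_gradient_descent(X, y, lam=0.0))   # close to [1, 2]
print(ridge_gradient_descent(X, y, lam=10.0))  # slope shrunk toward 0
```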
Regularization with Normal equations (1 of 2)
Recall the normal equation without regularization:

min_θ J(θ) results in θ = (XᵀX)⁻¹ Xᵀ y

If m ≤ n (#examples ≤ #features), then XᵀX is not invertible, but this can be addressed by regularization.
Regularization with Normal equations (2 of 2)
Suppose m ≤ n (#examples ≤ #features). Then XᵀX is not invertible.

With regularization (λ > 0):

θ = (XᵀX + λ M)⁻¹ Xᵀ y

where M is the (n+1)×(n+1) identity matrix with its top-left entry (the θ0 position) set to 0. The matrix XᵀX + λM is invertible.
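A sketch of the regularized normal equation above, assuming M is the identity matrix with a zero in the θ0 position; the tiny wide data set (more columns than examples) is hypothetical.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * M)^(-1) X^T y,
    where M is the identity with its top-left (theta_0) entry set to 0."""
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0                      # do not penalize the intercept
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Hypothetical case with more columns than examples: X^T X alone is singular,
# but adding lam * M (lam > 0) makes the system solvable.
X = np.array([[1.0, 0.5, 1.5, 2.0],
              [1.0, 1.0, 0.5, 1.0]])   # m = 2 examples, n = 4 columns
y = np.array([1.0, 2.0])
print(ridge_normal_equation(X, y, lam=1.0))
```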
Illustration of Ridge Regression performance
Impact of the penalty λ on RMSE

[Figure: RMSE vs. λ, showing overfit at small λ, underfit at large λ, and the right fit in between.]

 Cross-validation performance with Ridge Regression for different values of λ.
 As λ increases, the error (initially) decreases, but the bias increases.
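One way (not from the slides) to reproduce a curve like this is scikit-learn's RidgeCV, which scores a grid of penalty values (called alphas there) by cross-validation; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic, hypothetical data: a noisy linear trend plus several useless features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=30)

# Try a range of penalty strengths and keep the one with the best cross-validated score
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("best lambda (alpha):", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```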
Lasso Regression (L1 penalty)
Lasso Regression is the model derived by adding the L1 penalty to the SSE error:

SSE_L1 = Σ_{i=1..m} (y_i − ŷ_i)² + λ Σ_{j=1..P} |βj|

where P is the number of regression coefficients βj.

 All previous comments apply:
 The regression coefficients are allowed to be large if they contribute to a reduction in SSE.
 The larger the penalty λ, the smaller the coefficients. Large coefficients indicate overfitting or collinearity (redundancy).
 A tradeoff between bias and variance 

 However, in this case the penalty forces some coefficients to go to zero.


 As a result, Lasso regression can be used for feature selection, or
reduction of attributes
 Many possible optimization solution options
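A short scikit-learn sketch (not from the slides) of the feature-selection effect: with an L1 penalty, many coefficients are driven exactly to zero. The synthetic data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first two of ten features actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.3, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of the penalty lambda
print("coefficients:", np.round(lasso.coef_, 3))
print("selected features:", np.flatnonzero(lasso.coef_))  # most entries are exactly 0
```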
Elastic Net Regression (L1 & L2 penalties)

• Elastic Net Regression is the model derived by adding both L1 and L2 penalties to the SSE error (one common parameterization):

SSE_EN = Σ_{i=1..m} (y_i − ŷ_i)² + λ [ α Σ_{j=1..P} |βj| + (1 − α) Σ_{j=1..P} βj² ]

where P is the number of regression coefficients βj, α balances the L1 and L2 parts, and λ controls the overall penalty strength.

• Combines and generalizes the Ridge and Lasso regression models.

• Requires tuning of both parameters α and λ for best performance.
• Learned by an algorithm called LARS-EN.
• Source: "Regularization and variable selection via the elastic net", https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
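A sketch of the same idea using scikit-learn's ElasticNet rather than the LARS-EN algorithm mentioned above; there, alpha sets the overall penalty strength and l1_ratio balances the L1 and L2 parts. The synthetic data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data with correlated features, where a pure L1 penalty can be unstable
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)   # feature 1 nearly duplicates feature 0
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=100)

# l1_ratio = 1 recovers Lasso, l1_ratio = 0 recovers Ridge; values in between mix the two
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients:", np.round(enet.coef_, 3))
```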
Regularized Logistic Regression
Recall logistic regression without Regularization
Subject to overfitting (a very complicated decision boundary in the x1–x2 plane).

Cost function without regularization:

J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

Regularization adds the term (λ/2m) Σ_{j=1..n} θj² to this cost.

Gradient descent for Regularized Logistic regression

Repeat {
  θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) x0^(i)
  θj := θj − α [ (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ]   (j = 1, …, n)
}

The update rule looks the same as for regularized linear regression, but hθ(x) is now the sigmoid g(θᵀx).
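Finally, a sketch of the regularized logistic update above; as before, θ0 is excluded from the shrinkage term, and the data, α, and λ are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, alpha=0.1, lam=1.0, n_iters=3000):
    """Gradient descent for regularized logistic regression:
    theta_j := theta_j - alpha * [ (1/m)*sum_i (h(x_i) - y_i)*x_ij + (lam/m)*theta_j ]
    for j >= 1, with theta_0 updated without the (lam/m) term."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]    # shrinkage term, skipping the intercept
        theta -= alpha * grad
    return theta

# Hypothetical toy data (intercept column plus one feature)
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(regularized_logistic_gd(X, y, lam=0.0))  # larger weights: fits the training data tightly
print(regularized_logistic_gd(X, y, lam=5.0))  # smaller weights: smoother, more regularized
```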
