Lecture 7 - Part A - Multi-Class Classification, Overfitting, and Regularization
Mariette Awad
Logistic Regression – Learning with Gradient Descent
Review - Logistic Regression Model
hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))
Probability interpretation:
hθ(x) = P(y = 1 | x; θ) = estimated probability that y = 1 on input x
Decision boundary: hθ(x) = 0.5 (predict y = 1 when hθ(x) ≥ 0.5, y = 0 otherwise)
Cost function:
J(θ) = -(1/m) Σᵢ [ y(i) log hθ(x(i)) + (1 - y(i)) log(1 - hθ(x(i))) ]
To fit parameters θ: minimize J(θ) over θ
Review - Logistic Regression Model
Cost function should have:
Zero cost for correct decision
Large (infinite) cost for wrong decision
Per-example cost: -log(hθ(x)) if y = 1; -log(1 - hθ(x)) if y = 0
[Figure: cost versus hθ(x), one curve for y = 1 and one for y = 0]
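A minimal NumPy sketch of the hypothesis and cost above (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x); X is (m, n+1) with a leading column of ones
    return sigmoid(X @ theta)

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    m = len(y)
    h = hypothesis(theta, X)
    eps = 1e-12  # guard against log(0) for saturated predictions
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```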
Classification and Regression Visually
Gradient Descent
Want min over θ of J(θ):
Repeat {
  θj := θj - α ∂J(θ)/∂θj   (simultaneously update all θj)
}
Writing out the derivative:
Repeat {
  θj := θj - α (1/m) Σᵢ (hθ(x(i)) - y(i)) xj(i)   (simultaneously update all θj)
}
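A sketch of the batch update above in NumPy (learning rate and iteration count are placeholder values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    # X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for all examples
        grad = (1.0 / m) * (X.T @ (h - y))       # (1/m) * sum (h - y) * x_j
        theta -= alpha * grad                    # simultaneous update of all theta_j
    return theta
```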
• Conjugate gradient (computes the best step size α at every iteration, e.g. for steepest descent)
• BFGS (Broyden–Fletcher–Goldfarb–Shanno): BFGS determines the descent direction
by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to
the Hessian matrix of the loss function, obtained only from gradient evaluations
• L-BFGS (Limited-memory BFGS): approximates BFGS using a limited amount of computer memory
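In practice these solvers are usually called through a library rather than implemented by hand. A hedged sketch using SciPy's general-purpose minimizer, with cost and gradient functions assumed to follow the definitions above:

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta, X, y):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12))

def grad(theta, X, y):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) / len(y)

def fit(X, y, method="L-BFGS-B"):
    # method can be "CG", "BFGS", or "L-BFGS-B"; the step size is chosen internally
    theta0 = np.zeros(X.shape[1])
    res = minimize(cost, theta0, args=(X, y), jac=grad, method=method)
    return res.x
```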
Potential References (based on quick search)
Details of the advanced optimization techniques are out of scope.
Conjugate Gradient:
https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
https://en.wikipedia.org/wiki/Conjugate_gradient_method
BFGS:
http://www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
https://en.wikipedia.org/wiki/Broyden–Fletcher–Goldfarb–Shanno_algorithm
L-BFGS:
https://en.wikipedia.org/wiki/Limited-memory_BFGS
Logistic Regression – Multi-class Classification
Multiclass classification - Examples
[Figure: example datasets with three classes plotted in the (x1, x2) plane]
One-vs-all (one-vs-rest):
Turn the problem into three separate binary classification problems:
― Class 1: triangles
― Class 2: squares
― Class 3: x's
[Figure: the three-class data in the (x1, x2) plane, and one binary problem per class in which that class is positive and the remaining classes are grouped as negative]
One-vs-all
Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i.
On a new input x, predict the class i that maximizes hθ(i)(x), as in the sketch below.
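A sketch of one-vs-all training and prediction; the binary trainer passed in (e.g. the gradient-descent helper sketched earlier) is an assumption, not part of the slides:

```python
import numpy as np

def train_one_vs_all(X, y, classes, train_binary):
    # Train one binary classifier per class: class c vs. everything else.
    # train_binary(X, y_binary) should return a parameter vector theta.
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(thetas, X):
    # Pick the class whose classifier reports the highest probability h_theta^(i)(x).
    classes = list(thetas)
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-(X @ thetas[c]))) for c in classes]
    )
    return np.array(classes)[np.argmax(probs, axis=1)]
```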
What is Overfitting? Example: Linear regression
[Figure: housing price vs. size, fitted with a straight line, a quadratic, and a high-order polynomial]
Which model is: Underfit; Overfit; High Bias; High Variance; Just Right?
Overfitting: the learned hypothesis may fit the training set very well (J(θ) ≈ 0), but fail to generalize to new examples (predict prices on new examples).
Error due to Bias
The difference between the expected prediction of the model and the correct value we are trying to predict.
Bias measures how far off, on average, the model's predictions are from the correct value.
High bias means the learning model makes strong, erroneous assumptions and misses relevant relations between features and the target output: underfit.
Error due to Variance
Describes how much a prediction deviates from its average value (the mean of the squared deviation).
Taken as the variability of a model prediction for a given data point (sensitivity to small fluctuations in the training set).
If the entire model-building process is repeated multiple times, variance is how much the predictions for a given point vary between different realizations of the model.
A high-variance model fits random noise in the training data rather than the intended outputs: overfit.
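A small simulation of the "repeat the model-building process" idea above: refit two polynomial models on fresh training samples and compare the bias and the spread (variance) of their predictions at one query point. All data, degrees, and noise levels are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x_query = 0.5  # point at which we compare predictions

def true_f(x):
    # ground-truth function generating the data
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree):
    # One realization of the model-building process: sample data, fit, predict.
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(0, 0.3, 20)   # noisy training targets
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_query)

for degree in (1, 9):  # degree 1: high bias; degree 9: high variance
    preds = np.array([fit_and_predict(degree) for _ in range(200)])
    bias = preds.mean() - true_f(x_query)
    print(f"degree {degree}: bias ~ {bias:+.3f}, variance ~ {preds.var():.3f}")
```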
Where are High/Low Bias and Variance?
Graphical Representations of Bias and Variance
Another Example with Logistic Regression
[Figure: three decision boundaries in the (x1, x2) plane, from a simple linear boundary to a highly complex one]
hθ(x) = g(θᵀx)  (g = sigmoid function)
Which one is Underfit; Overfit; High Bias; High Variance; Just Right?
Addressing overfitting (1 of 2):
[Figure: housing price vs. size with a high-order polynomial fit]
Plotting the hypothesis can help identify noisy data and select the best polynomial.
This may not be feasible in general when many features exist:
― size of house
― no. of bedrooms
― no. of floors
― age of house
― average income in neighborhood
― kitchen size
Addressing overfitting (2 of 2):
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters θj.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
3. Bootstrap, Bagging and Boosting.
Regularization in the Cost Function
Intuition
[Figure: housing price vs. size of house, fit with and without regularization]
The regularization parameter λ provides a tradeoff between error minimization and generalization.
Underfitting (High Bias) - Regularization Parameter Too Large
In regularized linear regression, we choose θ to minimize
J(θ) = (1/2m) [ Σᵢ (hθ(x(i)) - y(i))² + λ Σⱼ θj² ]
with hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴.
What if λ is set to an extremely large value (perhaps too large for our problem, say λ = 10^10)?
All of θ1, ..., θ4 are driven close to zero, leaving hθ(x) ≈ θ0: a flat line that underfits the data.
[Figure: housing price vs. size of house with the nearly constant, underfit hypothesis]
Regularized Linear Regression
Regularized linear regression
Gradient descent:
Repeat {
  θ0 := θ0 - α (1/m) Σᵢ (hθ(x(i)) - y(i)) x0(i)
  θj := θj - α [ (1/m) Σᵢ (hθ(x(i)) - y(i)) xj(i) + (λ/m) θj ]   (j = 1, ..., n)
}
Equivalently, θj := θj (1 - α λ/m) - α (1/m) Σᵢ (hθ(x(i)) - y(i)) xj(i).
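A NumPy sketch of the regularized update above; note that θ0 is not regularized, and the learning rate and λ are placeholder values:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=0.01, iters=1000):
    # X: (m, n+1) with a leading column of ones; theta[0] is the unregularized intercept
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        err = X @ theta - y                 # h_theta(x^(i)) - y^(i)
        grad = (X.T @ err) / m              # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]   # add (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad
    return theta
```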
Regularization with Normal equations (1 of 2)
Recall the normal equation without regularization: minimizing J(θ) results in
θ = (XᵀX)⁻¹ Xᵀ y
If m ≤ n (m = #examples, n = #features), then XᵀX is not invertible,
but this can be addressed by regularization.
Regularization with Normal equations (2 of 2)
Suppose m ≤ n (m = #examples, n = #features).
If λ > 0, with regularization:
θ = (XᵀX + λ·M)⁻¹ Xᵀ y, where M is the (n+1)×(n+1) identity matrix with its top-left entry set to 0 (so θ0 is not regularized); the matrix XᵀX + λ·M is invertible.
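The closed-form solution above as a short sketch (solving the linear system is preferred over forming an explicit inverse):

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    # X: (m, n+1) design matrix with a leading column of ones
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0  # do not regularize theta_0
    # theta = (X^T X + lambda * M)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)
```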
Illustration of Ridge Regression performance
[Figure: impact of the penalty λ on RMSE, ranging from overfit (small λ) to underfit (large λ)]
• Elastic Net Regression is the model derived by adding both L1 and L2 penalties to the SSE error:
J(θ) = Σᵢ (hθ(x(i)) - y(i))² + λ1 Σⱼ |θj| + λ2 Σⱼ θj²
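In practice elastic net is usually fit with a library solver. A hedged sketch with scikit-learn, whose alpha/l1_ratio parameterization combines the two penalties rather than exposing λ1 and λ2 directly; the data here is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.random.rand(100, 6)  # placeholder features (e.g., house attributes)
y = X @ np.array([3.0, 0, 0, 1.5, 0, 0]) + 0.1 * np.random.randn(100)

# alpha scales the overall penalty; l1_ratio splits it between the L1 and L2 terms
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)  # the L1 part tends to drive some weights exactly to zero
```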