ML-Unit I - Logistic Regression
Applying OLS on categorical data
● Consider the new dataset: by adding a new data point (x, y) = (300, 1) to our training set, we get a new fitted line y2 = w2x + b2.
● Due to this, we now have two misclassifications.
● For example, at votes = 17 the fitted line predicts product quality ≈ 0.9.
Step 2:
Pass the value of z to the logistic function:
fw,b(x) = g(w · x + b) = g(z) = 1 / (1 + e^(-z))
Logistic Regression: output interpretation
● To get an output between 0 and 1, we pass z through the sigmoid (logistic) function g(z) = 1 / (1 + e^(-z)).
● The output fw,b(x) is interpreted as the probability that y = 1 given x.
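A minimal sketch of this computation (NumPy assumed; variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) for logistic regression with weights w and bias b."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example with w = [1, 1], b = -3 (the values used later in these slides):
print(predict_proba(np.array([2.0, 2.0]), np.array([1.0, 1.0]), -3.0))  # ~0.73
```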
Decision Boundary: for 1D data
● The decision boundary is where z = w1x1 + b = 0 (here x1 = votes).
● Predict y = 1 when fw,b(x) ≥ 0.5, i.e., when z ≥ 0; predict y = 0 otherwise.
Decision Boundary: for 2D data
● The logistic regression hypothesis for 2-D data is as follows:
fw,b(x) = g(w1x1 + w2x2 + b)
● Let's find the decision boundary for this. Let's consider w1 = 1, w2 = 1, and b = -3.
● As we saw, the decision boundary in logistic regression is at z = 0:
z = x1 + x2 - 3 = 0, i.e., x1 + x2 = 3
● Points with x1 + x2 ≥ 3 are classified as y = 1; points with x1 + x2 < 3 as y = 0.
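A small sketch of this worked example (NumPy assumed; w1 = 1, w2 = 1, b = -3 are the slide's values):

```python
import numpy as np

w = np.array([1.0, 1.0])   # w1 = 1, w2 = 1
b = -3.0

def predict(x):
    """Predict 1 when z = w.x + b >= 0, i.e. when x1 + x2 >= 3."""
    z = np.dot(w, x) + b
    return int(z >= 0)

print(predict(np.array([1.0, 1.0])))  # 0, since 1 + 1 = 2 < 3
print(predict(np.array([2.0, 2.0])))  # 1, since 2 + 2 = 4 >= 3
```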
Loss function for Logistic Regression
Let's define the loss function for Logistic Regression as:
L(fw,b(x(i)), y(i)) = -log(fw,b(x(i)))      if y(i) = 1
L(fw,b(x(i)), y(i)) = -log(1 - fw,b(x(i)))  if y(i) = 0
Case 1: Let's plot the loss for y(i) = 1. Here f is the output of logistic regression and 0 < f < 1, so only the portion of the -log(f) curve with 0 < f < 1 is relevant: the loss is 0 when f = 1 and grows without bound as f → 0.
Case 2: For y(i) = 0, the loss -log(1 - f) is 0 when f = 0 and grows without bound as f → 1.
With this choice of loss, the overall cost function is convex.
Cost function for Logistic Regression
Therefore, the cost function for Logistic Regression is:
J(w, b) = (1/m) Σ(i=1..m) L(fw,b(x(i)), y(i))
where L is the loss defined above. The overall cost function is convex.
Simplified Loss Function for Logistic Regression
The loss function for Logistic Regression can be written as a single expression:
L(fw,b(x(i)), y(i)) = -y(i) log(fw,b(x(i))) - (1 - y(i)) log(1 - fw,b(x(i)))
● Substituting y(i) = 1 in the above loss function, we get: -log(fw,b(x(i)))
● Substituting y(i) = 0 in the above loss function, we get: -log(1 - fw,b(x(i)))
Simplified Cost Function for Logistic Regression
● Simplified version of the loss function for logistic regression:
L(fw,b(x(i)), y(i)) = -y(i) log(fw,b(x(i))) - (1 - y(i)) log(1 - fw,b(x(i)))
● The simplified cost function derived from the above loss function is:
J(w, b) = -(1/m) Σ(i=1..m) [ y(i) log(fw,b(x(i))) + (1 - y(i)) log(1 - fw,b(x(i))) ]
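A minimal NumPy sketch of this cost (names are illustrative; a small epsilon guards the log at 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b, eps=1e-12):
    """Mean cross-entropy cost J(w, b) for logistic regression.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    """
    f = sigmoid(X @ w + b)        # predicted P(y = 1 | x) for every example
    f = np.clip(f, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
```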
Gradient descent for training Logistic Regression
● Find the parameters w, b such that we get the minimum cost. Then, given a new x, output fw,b(x).
● In order to minimize the simplified cost function above, we use the following gradient update rule (simultaneous update):
wj := wj - α (1/m) Σ(i=1..m) (fw,b(x(i)) - y(i)) xj(i)
b := b - α (1/m) Σ(i=1..m) (fw,b(x(i)) - y(i))
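A compact training-loop sketch under these update rules (NumPy assumed; the learning rate and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression.

    X: (m, n) features, y: (m,) labels in {0, 1}.
    Returns the learned weights w and bias b.
    """
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = sigmoid(X @ w + b) - y   # (f(x) - y) for every example
        w -= alpha * (X.T @ err) / m   # simultaneous update of all wj
        b -= alpha * err.mean()
    return w, b
```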
Addressing Overfitting: Training with more examples
● Collecting more training examples is often the simplest way to reduce overfitting.
Addressing Overfitting: Feature tuning
● Many times we are in a situation where the model has many irrelevant or redundant input features.
● We can manually remove a few irrelevant features from the input features to improve generalization.
● One way to do this is to examine how each feature fits into the model (the correlation between the feature and the target).
○ This is quite similar to debugging code line by line.
○ If a feature cannot explain much of the target, we can identify and remove it.
● We can even use a few feature selection heuristics for a good starting point.
Addressing Overfitting: Early stopping
● While the model is training, we can measure how well it performs after each iteration.
● We keep training as long as the iterations improve the model's performance on held-out data.
● After this point, the model overfits the training data, as generalization weakens with each further iteration.
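A minimal early-stopping sketch of this idea (illustrative names; `train_step` and `validation_loss` are assumed placeholders for your own training and evaluation routines):

```python
def fit_with_early_stopping(model, train_step, validation_loss,
                            patience=5, max_iters=1000):
    """Stop training once validation loss has not improved for `patience` iterations."""
    best_loss, best_state, waited = float("inf"), None, 0
    for _ in range(max_iters):
        train_step(model)              # one gradient-descent iteration
        loss = validation_loss(model)  # performance on held-out data
        if loss < best_loss:
            best_loss, best_state, waited = loss, model.copy(), 0
        else:
            waited += 1
            if waited >= patience:     # generalization stopped improving
                break
    return best_state
```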
Addressing Overfitting: Cross validation
● One of the most powerful techniques to avoid or prevent overfitting is cross-validation.
● The idea behind this is to use the initial training data to generate mini train-test splits, and then use these splits to tune your model.
● In standard k-fold cross-validation, the data is partitioned into k subsets, also known as folds.
● After this, the algorithm is trained iteratively on k-1 folds while the remaining fold, also known as the holdout fold, is used as the test set.
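A short sketch using scikit-learn's k-fold utilities (the dataset and k = 5 are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold CV: train on 4 folds, test on the holdout fold, five times over
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```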
Addressing Overfitting: Regularization
● Regularization is done:
○ by penalizing the model in proportion to the magnitude of the parameters wj,
○ which ensures small values of these parameters and hence prevents overfitting,
○ by keeping each feature's contribution small, which reduces the model's variance (though too strong a penalty can introduce bias).
Implementing Regularization
● Intuition: compare a model that just fits the data with one that overfits it.
[Figure: side-by-side plots of a "just fit" model and an "overfit" model]
● In general, we may have 100 features, and it is hard to find which ones are the most important and which ones to penalize.
● Regularization therefore penalizes all features, reducing the effect of every wj, which makes the model less likely to overfit.
Cost function with Regularization
● So the main idea in regularization is to maintain small values for the parameters w1, w2, ⋯, wn (the bias b is usually not penalized), which keeps the hypothesis simple and less prone to overfitting.
● Mathematically, regularization is achieved by modifying the cost function as follows:
J(w, b) = (1/2m) Σ(i=1..m) (fw,b(x(i)) - y(i))² + (λ/2m) Σ(j=1..n) wj²
● The first term fits the data by minimizing the MSE; the second term keeps wj small to control overfitting.
Cost function with Regularization
● Looked at closely, the regularization term means that if the value of wj increases, the cost to be minimized during gradient descent increases as well.
● Gradient descent is therefore driven toward small parameter values, as intended, which prevents overfitting.
Cost function with Regularization
● Case I: If λ = 0, we are not using the regularization term at all, so the model can still overfit.
● Case II: If λ = 10^10, we are placing a very heavy weight on the regularization term.
○ The only way to minimize the cost is to choose all the values of wj very close to 0.
○ Therefore f(x) ≈ b, and the model underfits.
Regularized linear regression
● Mathematically, regularization is achieved by modifying the cost function as shown above.
● To minimize the cost function, we use the following gradient descent update rule (simultaneous update):
wj := wj - α [ (1/m) Σ(i=1..m) (fw,b(x(i)) - y(i)) xj(i) + (λ/m) wj ]
b := b - α (1/m) Σ(i=1..m) (fw,b(x(i)) - y(i))
● Rearranged, the update is wj := wj(1 - αλ/m) - α(1/m) Σ (fw,b(x(i)) - y(i)) xj(i). In the term (1 - αλ/m), the factor is slightly less than 1, so every update shrinks wj a little before applying the usual gradient step.
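A sketch of one regularized update step (NumPy assumed; mirrors the rule above, with the bias left unpenalized):

```python
import numpy as np

def regularized_step(X, y, w, b, alpha=0.1, lam=1.0):
    """One gradient-descent step for regularized linear regression."""
    m = X.shape[0]
    err = X @ w + b - y                 # (f(x) - y) for every example
    w_new = w - alpha * ((X.T @ err) / m + (lam / m) * w)  # shrink + gradient
    b_new = b - alpha * err.mean()      # bias is not regularized
    return w_new, b_new
```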
Multiclass Classification
Multiclass Classification: One-vs-Rest
● For each class k, build a logistic regression classifier that estimates the probability P(Y=k|x) that the observation belongs to that class; for three classes this gives P(Y=0|x), P(Y=1|x), and P(Y=2|x).
● For each data point, predict the class with the highest probability, as sketched below.
● Consider the following dataset: [figure: the dataset and the three one-vs-rest classifiers]
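A brief sketch with scikit-learn's one-vs-rest wrapper (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes: 0, 1, 2

# One binary logistic regression per class; predict the most probable class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict_proba(X[:1]))    # [P(Y=0|x), P(Y=1|x), P(Y=2|x)]
print(ovr.predict(X[:1]))
```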
Multinomial Regression
● Assignment for you.
Classification metrics
● There are many classification metrics available:
○ Accuracy
○ Confusion Matrix
○ Precision
○ Recall
○ F1 score
○ AUC
Classification metrics: Accuracy
● Consider the following dataset:
● Based on the predictions given by LR (logistic regression) and DT (decision tree), which classifier is better?
● How do we decide?
Classification metrics: Accuracy
● Accuracy measures the performance of the classifier as the fraction of correct predictions.
● For example, LR:
○ 1st data point: correct
○ 2nd data point: correct
○ 3rd data point: wrong
○ 4th data point: correct
○ 5th data point: wrong
● Accuracy of LR = 3/5 = 0.6 = 60%
Classification metrics: Accuracy
● For example, DT:
○ 1st data point: correct
○ 2nd data point: correct
○ 3rd data point: correct
○ 4th data point: wrong
○ 5th data point: correct
● Accuracy of DT = 4/5 = 0.8 = 80%
Classification metrics: Accuracy
● Even in a multiclass classification problem, accuracy works in the same way.
● Accuracy of LR = 4/5 = 0.8
● Accuracy of DT = 2/5 = 0.4
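A tiny sketch of the computation (the label vectors are illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

y_true = [1, 0, 1, 1, 0]
print(accuracy(y_true, [1, 0, 0, 1, 1]))  # 3 of 5 correct: 0.6
print(accuracy(y_true, [1, 0, 1, 0, 0]))  # 4 of 5 correct: 0.8
```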
Classification metrics: Accuracy
● How much accuracy is good enough?
○ It depends on the problem.
● Scenario 1: Say we have to predict cancer (Yes/No) based on a chest image.
○ How accurate should the model be?
■ Say your model is 99% accurate.
■ Can you deploy this model?
● No, we can't rely on this model.
● There is a chance that 1 patient out of 100 will be misdiagnosed and could die.
● It's a bad model for this task.
Classification metrics: Accuracy
● How much accuracy is good enough?
○ It depends on the problem.
● Scenario 2: Predict whether your self-driving car should turn left or right.
○ How accurate should the model be?
■ Say your model is 99% accurate.
■ Can you deploy this model?
● No, we can't rely on this model.
● There is a chance that 1 decision out of 100 will be wrong, which could cause an accident.
● It's a bad model for this task.
Classification metrics: Accuracy
● How much accuracy is good enough?
○ It depends on the problem.
● Scenario 3: Predict whether a customer will order food this weekend or not.
○ How accurate should the model be?
■ Say your model is 80% accurate.
■ Can you deploy this model?
● Yes, we can: a wrong prediction here is cheap.
Classification metrics: Accuracy
● The problem with accuracy:
○ The accuracy score gives a single number.
■ It says how good a model is.
■ Or how bad a model is.
○ Say a model is 90% accurate.
■ That also means it is 10% incorrect.
■ But what kind of errors make up that 10%? Accuracy does not explain this.
● For example: the actual label is 0 → the model predicts 1,
or the actual label is 1 → the model predicts 0.
Classification metrics: Confusion Matrix
● The confusion matrix looks like:

                    Predicted: 1            Predicted: 0
Actual: 1           True Positive (TP)      False Negative (FN)
Actual: 0           False Positive (FP)     True Negative (TN)
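A quick sketch with scikit-learn (the label vectors are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

# Rows = actual class, columns = predicted class; label order [1, 0]
# makes the layout match the slide: [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```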
Classification metrics: Confusion Matrix
● Sometimes accuracy is misleading.
● Consider predicting whether a passenger at an airport is a terrorist.
● Say the passengers are distributed as:
○ not terrorist: 9999
○ terrorist: 1
● A model that always predicts "not terrorist" is 99.99% accurate, yet it never catches the one terrorist.
● [Comparison of the confusion matrices of Model A and Model B]
● This is explained by Precision and Recall as follows.
● Confusion matrix of Model A (spam detection):

                    Predicted Spam-A        Predicted Not-spam-A
Actual Spam         100 (TP)                70 (FN)
Actual Not-spam     30 (FP)                 700 (TN)

● Precision = TP / (TP + FP), so Precision_A = 100 / (100 + 30) ≈ 0.77.
● Clearly, Precision_A < Precision_B.
Classification metrics: Confusion Matrix
● Consider two data scientists who develop cancer prediction models, Model A and Model B.
● Recall = TP / (TP + FN):
○ Recall_A = 1000 / (1000 + 200) ≈ 0.83
○ Recall_B = 1000 / (1000 + 500) ≈ 0.67
● Clearly, Recall_B < Recall_A.
Classification metrics: Confusion Matrix
● Sometimes your model is neither precision-focused nor recall-focused; you can say both are equally important.
● Then we use the harmonic mean of precision and recall, which is called the F1-score.
● F1-score: F1 = 2 · (Precision · Recall) / (Precision + Recall)
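A closing sketch computing all three metrics from raw confusion-matrix counts (Model A's spam numbers from above):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Model A from the spam example: TP = 100, FP = 30, FN = 70
print(precision_recall_f1(100, 30, 70))  # (~0.77, ~0.59, ~0.67)
```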