
Machine Learning

Dr. Sunil Saumya


IIIT Dharwad
Logistic Regression
Revisiting the regression model
● Recall the dataset we have seen:

Word Count | Product quality
27         | 52
2          | 6
100        | 42
40         | 38
14         | 30
20         | ??
The new observations
● Consider the new dataset as:

Word Count | Product quality
27         | 52
2          | 6
100        | 42
40         | 38
14         | 30
20         | ??

● Here, the given dataset is a classification dataset.
  ○ The target is either 0 or 1.
● Because there are only two possible labels for each input, this is known as Binary Classification.

How do we build a binary classification model?

Let's apply the linear regression model on the given data and check whether it also works for binary classification.
Applying OLS on categorical data
● Consider the new dataset as:

Word Count | Product quality
27         | 52
2          | 6
100        | 42
40         | 38
14         | 30
20         | ??

● Linear regression predicts not only the values 0 and 1, but also values between 0 and 1, less than 0, and greater than 1.
● But here, we want to predict categories: either 0 or 1.
● One attempt is to put a threshold at 0.5 on the Y-axis.
Applying OLS on categorical data
● For this dataset, it looks like Linear Regression (with a 0.5 threshold) could do a pretty good job.
● Here, we have only one misclassification.
Applying OLS on categorical data
● By adding a new data point (x, y) = (300, 1) to our training set, we get a new line Y2 = w2·x + b2.
● Due to this, we now have two misclassifications.
● However, the best line should not increase the misclassifications as we increase the number of training examples.
● When the best regression line is refit with the new data point, the decision boundary (green dotted line) shifts over to the right (red dotted line).
● Therefore, we can say that linear regression is not a good model for a classification task.
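A small illustration of this fragility (a minimal sketch with a made-up dataset and labels, not the slide's actual numbers), using scikit-learn:

# Minimal sketch: thresholding linear regression is fragile for classification.
# One extreme point shifts its decision boundary, while logistic regression stays put.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[2], [14], [20], [27], [40], [100]])   # word count (toy values)
y = np.array([0, 0, 0, 1, 1, 1])                     # product quality: low/high (toy labels)

X_out = np.vstack([X, [[300]]])                      # add an extreme positive example
y_out = np.append(y, 1)

def linreg_boundary(X, y):
    lr = LinearRegression().fit(X, y)
    # solve w*x + b = 0.5 for x, i.e. where the thresholded prediction flips
    return (0.5 - lr.intercept_) / lr.coef_[0]

print("Linear-regression boundary, original data:", linreg_boundary(X, y))
print("Linear-regression boundary, with outlier :", linreg_boundary(X_out, y_out))

clf = LogisticRegression().fit(X_out, y_out)
print("Logistic-regression boundary             :", -clf.intercept_[0] / clf.coef_[0, 0])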
Logistic Regression
● Let’s fit the logistic regression model on the same dataset. We get the following model:
● For the given S-shaped model, if votes = 17, then product quality = 0.9, which means it is more likely to be 1 (High).
● But the output can only be 0 or 1. How do we achieve that?
Sigmoid or logistic function
● To get the output between 0 and 1, we use the sigmoid (logistic) function: g(z) = 1/(1 + e^-z).

Case I: if z is a very big +ve number, say z → ∞
e^-z ≃ 0
g(z) = 1/(1+0) = 1

Case II: if z is a very big -ve number, say z → -∞
e^-z ≃ ∞
g(z) = 1/(1+∞) = 0

Case III: if z = 0
e^-z = 1
g(z) = 1/(1+1) = 0.5
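These three cases can be checked numerically (a minimal NumPy sketch; the sigmoid helper is our own):

# Minimal sketch: evaluate the sigmoid at large positive, large negative, and zero inputs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(50))    # ~1.0  (large positive z)
print(sigmoid(-50))   # ~0.0  (large negative z)
print(sigmoid(0))     # 0.5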
Let’s use this sigmoid function to build up the
logistic regression.
Logistic Regression
● To get the output between 0 and 1, we use the sigmoid (logistic) function.

Step 1:
Take the linear regression function w·x + b and store it in z:
z = w·x + b

Step 2:
Pass the value of z to the logistic function g.

Therefore we get the logistic regression model as:
fw,b(x) = g(w·x + b) = g(z) = 1/(1 + e^-(w·x + b))
Logistic Regression: output interpretation
● For Votes = 17, the learned logistic regression gives the output f(17) = 0.9.
● It means that there is a 90% probability that the output is 1, i.e., product quality is high.
● The model also tells us that there is a 10% probability that the output is 0, i.e., product quality is low.
● Logistic regression always gives the probability that the class is 1.
Decision Boundary
● The decision boundary lets us output either 1 or 0.
● Logistic regression model: fw,b(x) = g(w·x + b)
● In the above model, if we keep a threshold at 0.5, then:

When is fw,b(x) ≥ 0.5?
g(z) ≥ 0.5
z ≥ 0, i.e., w·x + b ≥ 0  →  predict y = 1
w·x + b < 0               →  predict y = 0

Decision boundary: w·x + b = 0
Decision Boundary: for 1D data
The decision boundary for the given dataset (Votes = x1) is:
⇒ -30.8841 + 1.9292·x1 = 0
⇒ 1.9292·x1 = 30.8841
⇒ x1 = 30.8841/1.9292 = 16.0087
So the boundary lies at approximately x1 = 16.
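A quick numerical check of this boundary, using the learned parameters quoted on the slide (w1 = 1.9292, b = -30.8841):

# Minimal check of the 1-D decision boundary using the slide's learned parameters.
import numpy as np

w1, b = 1.9292, -30.8841

x_boundary = -b / w1                       # solve w1*x1 + b = 0
print(x_boundary)                          # ≈ 16.0087

sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(sigmoid(w1 * x_boundary + b))        # 0.5 exactly at the boundary
print(sigmoid(w1 * 17 + b))                # > 0.5 → predict 1 for Votes = 17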
Decision Boundary: for 2D data
Consider 2-dimensional data with two classes (Y = 1 and Y = 0, shown in the scatter plot).
Let's find the decision boundary for this.
Decision Boundary: for 2D data
The logistic regression hypothesis for the 2-D data is:
fw,b(x) = g(w1·x1 + w2·x2 + b)

Let’s consider w1 = 1, w2 = 1, and b = -3.

As we saw, the decision boundary in logistic regression is at:
z = w1·x1 + w2·x2 + b = 0

Putting in the values w1 = 1, w2 = 1, and b = -3, we get:
z = 1·x1 + 1·x2 - 3 = 0
Therefore, x1 + x2 = 3 is the decision boundary.
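A small sketch evaluating a few points against this boundary (the test points are arbitrary):

# Minimal sketch: with w1 = 1, w2 = 1, b = -3, check which side of the
# boundary x1 + x2 = 3 a few points fall on.
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for x in ([1, 1], [2, 1], [3, 3]):
    z = np.dot(w, x) + b
    print(x, "z =", z, "P(y=1) =", round(sigmoid(z), 3), "→ predict", int(z >= 0))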
Decision Boundary: Non-linear
Consider the logistic regression hypothesis with polynomial (squared) features:
fw,b(x) = g(w1·x1² + w2·x2² + b)

Let's find the decision boundary for this.
Let’s consider w1 = 1, w2 = 1, and b = -1.

As we saw, the decision boundary in logistic regression is at:
z = w1·x1² + w2·x2² + b = 0

Putting in the values w1 = 1, w2 = 1, and b = -1, we get:
x1² + x2² = 1
Therefore, x1² + x2² = 1 (a circle of radius 1) is the decision boundary.
Decision Boundary: Non-linear
● We can have an even more complex decision boundary when we use higher-order polynomial features.
● Consider such a hypothesis with higher-order polynomial terms.
Cost function for Logistic Regression
Consider a multivariate dataset with features x1, …, xn.

The logistic regression model for this dataset is:
fw,b(x) = 1/(1 + e^-(w·x + b))

Given this dataset, how do we choose the parameters w and b?
Cost function for Logistic Regression
Recall linear regression:

Linear regression model: fw,b(x) = w·x + b
Cost function for linear regression: J(w,b) = (1/2m) Σ ( fw,b(x(i)) - y(i) )²
Cost function for Logistic Regression
If we apply the same squared-error cost function to the logistic regression model fw,b(x) = 1/(1 + e^-(w·x + b)), the resulting cost surface is non-convex (many local minima), whereas for linear regression it is convex.

Therefore, for logistic regression, the squared-error cost function is not a good choice.
Loss function for Logistic Regression
Let’s define the loss function for logistic regression as:

L( fw,b(x(i)), y(i) ) = -log( fw,b(x(i)) )       if y(i) = 1
L( fw,b(x(i)), y(i) ) = -log( 1 - fw,b(x(i)) )   if y(i) = 0

Why will this loss function work for logistic regression?
Loss function for Logistic Regression
Let’s plot the loss -log(f) for y(i) = 1, where f is the output of logistic regression and 0 < f < 1. Only the portion of the -log curve with 0 < f < 1 is relevant.
Loss function for Logistic Regression
For a training example with y(i) = 1, the loss is -log(f):

Case 1: If the model predicts f = 1 → loss = -log(1) = 0.
Case 2: If the model predicts f = 0.5 → the loss is a little higher, but not very high.
Case 3: If the model predicts f = 0.1 → the loss is higher; the model gives only a 10% chance of the example being class 1.
Case 4: If the model predicts f ≈ 0 → the loss grows towards infinity; the model gives almost no chance of the example being class 1.
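The four cases can be verified numerically (minimal sketch; 1e-6 is used as a stand-in for a prediction of exactly 0):

# Minimal sketch: the loss -log(f) for a positive example (y = 1)
# at the predictions discussed above.
import numpy as np

for f in (1.0, 0.5, 0.1, 1e-6):
    print(f, "->", -np.log(f))
# 1.0  -> 0.0
# 0.5  -> 0.693
# 0.1  -> 2.303
# 1e-6 -> 13.8  (grows without bound as f → 0)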
Loss function for Logistic Regression
For a training example with y(i) = 0, the loss is -log(1 - f):

Case 1: If the model predicts f = 0 → loss = -log(1) = 0.
Case 2: If the model predicts f = 0.5 → the loss is a little higher, but not very high.
Case 3: If the model predicts f ≈ 1 → the loss grows towards infinity; the model wrongly gives almost a 100% chance of class 1.
With this loss, the overall cost function is convex.

Cost function for Logistic Regression
Therefore the cost function for logistic regression is:

J(w,b) = (1/m) Σ L( fw,b(x(i)), y(i) )

where L is the loss defined above. This overall cost function is convex.
Simplified Loss Function for Logistic Regression
The loss function for logistic regression can be written in a single line:

L( fw,b(x(i)), y(i) ) = -y(i) log( fw,b(x(i)) ) - (1 - y(i)) log( 1 - fw,b(x(i)) )

If we substitute y(i) = 1 in the above loss, we get -log( fw,b(x(i)) ).
If we substitute y(i) = 0, we get -log( 1 - fw,b(x(i)) ).
Both match the piecewise definition above.
Simplified Cost Function for Logistic Regression
The simplified cost function derived from the above loss function is:

J(w,b) = -(1/m) Σ [ y(i) log( fw,b(x(i)) ) + (1 - y(i)) log( 1 - fw,b(x(i)) ) ]

● But why have we chosen this cost function over the tons of other functions available?
  ○ Because it is convex.
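A minimal sketch of this cost function in NumPy, evaluated on hypothetical predictions (the clipping constant is our own safeguard against log(0)):

# Minimal sketch: the simplified logistic-regression cost (binary cross-entropy).
import numpy as np

def cost(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
print(cost(y_true, np.array([0.9, 0.1, 0.8, 0.7, 0.2])))   # small cost: good predictions
print(cost(y_true, np.array([0.1, 0.9, 0.2, 0.3, 0.8])))   # large cost: bad predictions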
Gradient descent for training Logistic Regression
● Find the parameters w, b such that we get the minimum cost. Then, given a new x, output fw,b(x) = 1/(1 + e^-(w·x + b)).
● To minimize the simplified cost function above, we use the following gradient descent update rule (with simultaneous update of all parameters):

repeat {
  wj = wj - α · (1/m) Σ ( fw,b(x(i)) - y(i) ) · xj(i)
  b  = b  - α · (1/m) Σ ( fw,b(x(i)) - y(i) )
}
Gradient descent for training Logistic Regression
● Have you seen this update rule before?
  ○ It is the same update rule we used for linear regression.
● However, the two differ in the definition of fw,b(x): for linear regression fw,b(x) = w·x + b, while for logistic regression fw,b(x) = 1/(1 + e^-(w·x + b)).
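A minimal end-to-end sketch of this training loop on a toy, linearly separable dataset (all data, the learning rate, and the iteration count are our own choices):

# Minimal sketch of batch gradient descent for logistic regression on toy data.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, alpha=0.1, iters=5000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        f = sigmoid(X @ w + b)                 # predictions for all m examples
        dw = (X.T @ (f - y)) / m               # ∂J/∂w
        db = np.mean(f - y)                    # ∂J/∂b
        w -= alpha * dw                        # simultaneous update
        b -= alpha * db
    return w, b

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train(X, y)
print("predictions:", (sigmoid(X @ w + b) >= 0.5).astype(int))   # should match y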
The problem of overfitting: classification
Addressing Overfitting
● There are many ways to overcome the overfitting problem:
○ Training with more data
○ Feature tuning
○ Early stopping
○ Cross Validation
○ Regularization
Addressing Overfitting: Training with more data
● This technique might not work every time, but as discussed in the example above, training with a significantly larger sample of the population helps the model.
● It basically helps the model identify the signal better.
● However, getting more data may not always be possible in practice.
(Figure: overfitting vs. training with more examples.)
Addressing Overfitting: Feature tuning
● Many times we are in a situation where:

Many features + insufficient data → Overfitting

● We can manually remove a few irrelevant features from the input to improve generalization.
● One way to do this is by checking how a feature fits into the model (e.g., the correlation between the feature and the target).
  ○ It is quite similar to debugging code line by line.
  ○ If a feature does not explain anything relevant in the model, we can identify and drop it.
● We can also use feature selection heuristics as a good starting point.
Addressing Overfitting: Early stopping
● While the model is training, we can measure how well it performs after each iteration.
● We keep training only up to the point where additional iterations still improve the model’s performance.
● After this point, the model overfits the training data, as generalization weakens with each further iteration.
Addressing Overfitting: Cross validation
● One of the most powerful techniques to avoid or prevent overfitting is cross-validation.
● The idea is to use the initial training data to generate mini train-test splits, and then use these splits to tune the model.
● In standard k-fold cross-validation, the data is partitioned into k subsets, also known as folds.
● The algorithm is then trained iteratively on k-1 folds while using the remaining fold as the test set, also known as the holdout fold.
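A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score on synthetic data (the data-generating code is purely illustrative):

# Minimal sketch: 5-fold cross-validation of a logistic regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 examples, 4 features (toy data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy binary labels

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # 5 folds
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())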
Addressing Overfitting: Regularization
● Regularization is done:
  ○ by penalizing the algorithm in proportion to the magnitude of the parameters Wj,
  ○ which ensures small values of these parameters and hence prevents overfitting,
  ○ by keeping each feature's contribution small and hence reducing high variance.

Without regularization: f(x) = 28x - 385x² + 39x³ - 174x⁴ + 100
With regularization:    f(x) = 13x - 0.23x² + 0.000014x³ - 0.00014x⁴ + 10
Addressing Overfitting: Regularization
● Regularization encourages the learning algorithm to shrink the values of the parameters without necessarily demanding that any parameter be set to exactly 0.
● It turns out that even if we fit a higher-order polynomial, as long as we get the algorithm to use smaller parameter values w1, w2, w3, w4, we end up with a smoother curve that generalizes much better.
● So regularization lets you keep all of your features, but prevents any feature from having an overly large effect, which is what sometimes causes overfitting.
Implementing Regularization
● Intuition: a quadratic model "just fits" the data, while a fourth-order polynomial overfits it.
● How do we overcome the overfitting?
● Overfitting can be controlled by minimizing the effect of w3 and w4.
● That means we make w3 and w4 really small (close to 0).
● That means, instead of minimizing the usual cost function, we minimize:

J(w,b) = (1/2m) Σ ( fw,b(x(i)) - y(i) )² + 1000·w3² + 1000·w4²
Implementing Regularization
● Minimizing this modified cost drives w3 and w4 to very small values, e.g. w3 = 0.001 and w4 = 0.001, so that
  1000·w3² ≃ 0
  1000·w4² ≃ 0.
● Therefore we get (essentially) the quadratic curve, with little contribution from x³ and x⁴.
Implementing Regularization
● In general, we may have 100 features, and it is hard to know which feature is most important and which ones to penalize.
● So regularization penalizes all parameters, shrinking every Wj, which makes the model less likely to overfit.
Cost function with Regularization
● The main idea in regularization is to maintain small values for the parameters W0, W1, W2, ⋯, Wn, which keeps the hypothesis simple and less prone to overfitting.
● Mathematically, regularization is achieved by modifying the cost function as follows:

J(w,b) = (1/2m) Σ ( fw,b(x(i)) - y(i) )²  +  (λ/2m) Σ wj²

The first term fits the data by minimizing the MSE; the second term keeps Wj small to control the overfitting.
Cost function with Regularization
● Looking closely, the regularization term means that if the value of any Wj increases, the cost to be minimized during gradient descent also increases.
● So it ensures small values of the parameters, as intended, to prevent overfitting.
Cost function with Regularization
● Case I: If λ = 0, we are not using the regularization term at all. Therefore, we end up with the overfitted curve.
● Case II: If λ = 10^10, we are placing a very heavy weight on the regularization term.
  ○ The only way to minimize the cost is to choose all the values of wj very close to 0.
● So, if λ is very large, then to minimize the regularization term, the algorithm will choose w1 ≈ w2 ≈ ⋯ ≈ wn ≈ 0.
● Therefore f(x) ≈ b, we get a horizontal line, and the model underfits.
● Therefore, choosing a balanced value of λ (not too small and not too large) balances both goals well.
Regularized linear regression
● Mathematically, regularization is achieved by modifying the cost function as follows:

J(w,b) = (1/2m) Σ ( fw,b(x(i)) - y(i) )²  +  (λ/2m) Σ wj²

● To minimize the cost function, we use the following gradient descent update rule (simultaneous update):

wj = wj - α [ (1/m) Σ ( fw,b(x(i)) - y(i) )·xj(i)  +  (λ/m)·wj ]
b  = b  - α (1/m) Σ ( fw,b(x(i)) - y(i) )
Regularized linear regression
● Rearranging the update rule for wj:

wj = wj·(1 - α·λ/m) - α·(1/m) Σ ( fw,b(x(i)) - y(i) )·xj(i)

● If α = 0.001, λ = 1, and m = 50, then (1 - α·λ/m) = 1 - 0.001·1/50 = 0.99998.
● Therefore, in every update we shrink wj slightly by multiplying it by a number just below 1 (0.99998) before the usual gradient step.
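A minimal numeric sketch showing that the standard regularized update and the "shrink then step" form are the same (the gradient vector used here is a made-up placeholder):

# Minimal sketch: one regularized gradient-descent step for w, written both as the
# standard update and as the equivalent "shrink then step" form. Values are toy values.
import numpy as np

alpha, lam, m = 0.001, 1.0, 50
w = np.array([0.8, -1.2])
grad_mse = np.array([0.3, -0.1])     # stands in for (1/m) Σ (f(x)-y)·xj

standard = w - alpha * (grad_mse + (lam / m) * w)
shrunk   = w * (1 - alpha * lam / m) - alpha * grad_mse
print(standard, shrunk)              # identical
print(1 - alpha * lam / m)           # 0.99998 shrinkage factor per step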
Regularized Logistic Regression
● Similar to linear regression, we can also regularize logistic regression.
● To limit the effect of (polynomial) features and reduce overfitting, we regularize the wj.
Regularized Logistic Regression
● The cost function of logistic regression:
J(w,b) = -(1/m) Σ [ y(i) log( fw,b(x(i)) ) + (1 - y(i)) log( 1 - fw,b(x(i)) ) ]

● The modified (regularized) cost function:
J(w,b) = -(1/m) Σ [ y(i) log( fw,b(x(i)) ) + (1 - y(i)) log( 1 - fw,b(x(i)) ) ]  +  (λ/2m) Σ wj²
Multiclass Classification
● Logistic regression can be applied to solve multiclass problems.
● Common approaches:
  ○ One-vs-Rest (One-vs-All)
  ○ Softmax Regression (Multinomial Logistic Regression)
Multiclass Classification: One-vs-Rest
● For each class, build a logistic regression model to find the probability that an observation belongs to that class.
● For each data point, predict the class with the highest probability.
● Consider a dataset with three classes: we train one binary classifier per class, giving P(Y=0|x), P(Y=1|x), and P(Y=2|x).
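A minimal one-vs-rest sketch with scikit-learn on synthetic three-class data (the dataset parameters are arbitrary):

# Minimal sketch: one-vs-rest logistic regression on a toy three-class dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))          # class with the highest per-class probability
print(ovr.predict_proba(X[:5]))    # one probability column per class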
Multinomial Regression
● Assignment for you.
Classification metrics
● There are many classification metrics available:
○ Accuracy
○ Confusion Matrix
○ Precision
○ Recall
○ F1 score
○ AUC
Classification metrics: Accuracy
● Consider the dataset shown, with predictions from Logistic Regression (LR) and a Decision Tree (DT).
● Based on the predictions given by LR and DT, which classifier is better?
● How do we find out?
Classification metrics: Accuracy
● Accuracy measures the overall performance of the classifier.
● For example, LR:
  ○ 1st data point: correct
  ○ 2nd data point: correct
  ○ 3rd data point: wrong
  ○ 4th data point: correct
  ○ 5th data point: wrong
● Accuracy of LR = 3/5 = 0.6 = 60%
Classification metrics: Accuracy
● For example, DT:
  ○ 1st data point: correct
  ○ 2nd data point: correct
  ○ 3rd data point: correct
  ○ 4th data point: wrong
  ○ 5th data point: correct
● Accuracy of DT = 4/5 = 0.8 = 80%
Classification metrics: Accuracy
● Even in a multiclass classification problem, accuracy works the same way.
● Accuracy of LR = 4/5 = 0.8
● Accuracy of DT = 2/5 = 0.4
Classification metrics: Accuracy
● How much accuracy is good?
  ○ It depends on the problem.
● Scenario 1: We have to predict cancer (Yes/No) from a chest image.
  ○ How accurate should the model be?
    ■ Say the model is 99% accurate.
    ■ Can you deploy this model?
      ● No, we can't rely on this model.
      ● There is a chance that 1 patient out of 100 is misdiagnosed and could die.
      ● It's a bad model for this task.
Classification metrics: Accuracy
● Scenario 2: Predict whether your self-driving car should turn left or right.
  ○ How accurate should the model be?
    ■ Say the model is 99% accurate.
    ■ Can you deploy this model?
      ● No, we can't rely on this model.
      ● There is a chance that 1 decision out of 100 is wrong, which could cause an accident.
      ● It's a bad model for this task.
Classification metrics: Accuracy
● Scenario 3: Predict whether a customer will order food this weekend or not.
  ○ How accurate should the model be?
    ■ Say the model is 80% accurate.
    ■ Can you deploy this model?
      ● Yes, we can.
Classification metrics: Accuracy
● The problem with accuracy:
  ○ The accuracy score is a single number.
    ■ It says how good a model is.
    ■ Or how bad a model is.
  ○ Say a model is 90% accurate.
    ■ That also means 10% of predictions are incorrect.
    ■ But what kind of incorrect? Accuracy does not explain that.
      ● For example: actual label 0 → model predicts 1, or actual label 1 → model predicts 0.
Classification metrics: Confusion Matrix
● The confusion matrix has one row per actual class and one column per predicted class, with cells TP (true positive), FN (false negative), FP (false positive), and TN (true negative).
● Sometime accuracy is misleading:
● Consider of predicting a passenger as terrorist at airport.
● Say the number of passenger are as:
not terrorist: 9999
terrorist: 1

○ Clearly, there is an data imbalance case.

● We train the model on above dataset and get the


prediction as shown in confusion metric:
● Consider model always predicts as “Not terrorist”
Classification metrics: Confusion Metric
● Sometime accuracy is misleading:
● Consider of predicting a passenger as terrorist at airport.
● Say the number of passenger are as:
not terrorist: 9999
terrorist: 1

○ Clearly, there is an data imbalance case.

● We train the model on above dataset and get the


prediction as shown in confusion metric: Accuracy = 99.99%
● Consider model always predicts as “Not terrorist” is misleading
Classification metrics: Confusion Matrix
● When accuracy is misleading, we use Precision and Recall.
● Consider two data scientists who each develop an email spam classifier:

Model A           | Predicted Spam | Predicted Not-spam
Actual Spam       | 100 (TP)       | 70 (FN)
Actual Not-spam   | 30 (FP)        | 700 (TN)

Model B           | Predicted Spam | Predicted Not-spam
Actual Spam       | 100 (TP)       | 190 (FN)
Actual Not-spam   | 10 (FP)        | 700 (TN)

Which of the two models would you select?
Classification metrics: Confusion Matrix
● The accuracy of the two models is about 80%, so accuracy alone cannot tell us which model to select.
● CASE I: if FP matters most to you, you select Model B, because FP = 10 in Model B < FP = 30 in Model A.
● CASE II: if FN matters most to you, you select Model A, because FN = 70 in Model A < FN = 190 in Model B.
● For spam filtering, you would probably go with CASE I (a false positive means a genuine email lands in the spam folder).
Classification metrics: Confusion Matrix
● This is captured by Precision and Recall.
● Precision: what proportion of the predicted positives is truly positive?
  Precision = TP / (TP + FP)

  Precision of Model A = 100/(100+30) ≈ 0.77
  Precision of Model B = 100/(100+10) ≈ 0.91

  Clearly, Precision(A) < Precision(B), so Model B is better when false positives matter.
Classification metrics: Confusion Matrix
● Now consider two data scientists who each develop a cancer prediction model:

Model A             | Predicted Cancer | Predicted Not-Cancer
Actual Cancer       | 1000 (TP)        | 200 (FN)
Actual Not-Cancer   | 800 (FP)         | 8000 (TN)

Model B             | Predicted Cancer | Predicted Not-Cancer
Actual Cancer       | 1000 (TP)        | 500 (FN)
Actual Not-Cancer   | 500 (FP)         | 8000 (TN)

Which of the two models would you select?
Classification metrics: Confusion Matrix
● The accuracy of both models is 90%, so accuracy alone cannot tell us which model to select.
● CASE I: if FP matters most to you, you select Model B, because FP = 500 in Model B < FP = 800 in Model A.
● CASE II: if FN matters most to you, you select Model A, because FN = 200 in Model A < FN = 500 in Model B.
● For cancer prediction, you would probably go with CASE II (missing a cancer case is far more costly than a false alarm).
Classification metrics: Confusion Matrix
● This is captured by Recall.
● Recall: what proportion of the actual positives is correctly classified?
  Recall = TP / (TP + FN)

  Recall of Model A = 1000/(1000+200) ≈ 0.83
  Recall of Model B = 1000/(1000+500) ≈ 0.67

  Clearly, Recall(B) < Recall(A), so Model A is better when false negatives matter.
Classification metrics: Confusion Matrix
● Sometimes the task is neither purely precision-based nor purely recall-based; both are equally important.
● Then we use the harmonic mean of precision and recall, called the F1-score:
  F1 = 2 · (Precision · Recall) / (Precision + Recall)

● For example, say Precision = 0 and Recall = 100:
  Arithmetic mean: (0 + 100)/2 = 50
  Harmonic mean (F1-score): 2·(0·100)/(0 + 100) = 0

● The F1-score is pulled towards the lower of the two values, so a model cannot hide a terrible precision behind a perfect recall (or vice versa).
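A minimal sketch computing these metrics directly from the confusion-matrix counts used in the cancer example above:

# Minimal sketch: accuracy, precision, recall and F1 from confusion-matrix counts
# (Model A: TP=1000, FN=200, FP=800, TN=8000; Model B: TP=1000, FN=500, FP=500, TN=8000).
def metrics(tp, fn, fp, tn):
    accuracy  = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print("Model A:", metrics(1000, 200, 800, 8000))   # recall ≈ 0.83
print("Model B:", metrics(1000, 500, 500, 8000))   # recall ≈ 0.67, same 90% accuracy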
Regression and Classification: Assignments
1. How do we find the Precision, Recall and F1-score for Multiclass
classification?
2. What is Ridge Regression (L2 regularization)?
3. What is Lasso Regression (L1 regularization)?
4. What is Softmax regression?
