Practical - Logistic Regression

1. Logistic regression is a statistical method used for binary classification problems to predict the probability that an observation falls into one of two categories.
2. It models the probability of an observation being in a particular class based on input features by using the logistic function, which outputs a value between 0 and 1.
3. Logistic regression is commonly used for problems like disease diagnosis, customer churn prediction, and spam detection, with models evaluated using metrics like sensitivity, specificity, precision, and F1-score.

Logistic Regression

Unit 3
• Logistic and Multinomial Regression
• Understanding logistic regression and its use in
binary classification.
• Estimating probabilities using logistic regression.
• Model evaluation metrics: sensitivity, specificity,
precision, F-score.
• Model Performance and Conclusion
• Introduction to ROC curve and AUC.
• Determining the optimal cutoff probability.

Logistic Regression and Multinomial
Logistic Regression
• They are both types of regression analysis
used for binary and multiclass classification,
respectively.
• They are commonly used in machine learning
and statistics to model the relationship
between a set of input features and
categorical target variables.
Logistic Regression

• Logistic Regression is used for binary classification problems where the target variable (dependent) has two classes.
• It models the probability that a given input belongs to
a particular class.
• The logistic function (also known as the sigmoid
function) is used to map the linear combination of
input features to a probability between 0 and 1.
• In logistic regression, the goal is to find the best-fitting
coefficients for the input features that maximize the
likelihood of the observed class labels.
Multinomial Logistic Regression

• Multinomial Logistic Regression is an extension of logistic regression for multiclass classification problems.
• It's used when the target variable has more than two
classes.
• It models the probability of each class relative to a reference class.
• In multinomial logistic regression, a separate binary logistic
regression model is created for each class, using one class
as the reference.
• The model predicts the log-odds of each class, and then the
softmax function is applied to convert the log-odds into
probabilities.
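As a minimal sketch (using the iris dataset purely for illustration; scikit-learn's LogisticRegression applies the multinomial/softmax formulation automatically when the target has more than two classes):

# Multinomial logistic regression sketch (illustrative data)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)        # three classes: 0, 1, 2
clf = LogisticRegression(max_iter=1000)  # multinomial (softmax) for multiclass y
clf.fit(X, y)
print(clf.predict_proba(X[:3]))          # softmax probabilities, one column per class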
In this example
• Age is the independent variable
• Having insurance is the dependent variable
• If we apply simple linear regression here, we can see that younger people tend not to have insurance while older people do
• To predict whether a 47-year-old will buy insurance, we can threshold the fitted value such that:
• Greater than 0.5 = Yes (1)
• Less than 0.5 = No (0)
• And we will get the answer
• But if age is 90, Y is more than 1, and if age is less than 20, Y is negative
• This means the model is not a good fit for this data
• The value of Y should be only 0 or 1, or between 0 and 1
• Hence simple linear regression does not fit this type of data well
• In the Sigmoid function, the value of Y always remains between 0 and 1
• This is also known as the Logistic Function
• It is an ‘S’-shaped curve
• It fits this kind of data very well
• Logistic regression uses Maximum Likelihood Estimation to fit the coefficients, and a cut-off probability is then applied to turn the predicted probability into a 0/1 answer
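As a minimal sketch of what "maximum likelihood" means here (the data and coefficient values below are hypothetical; fitting searches for the coefficients that make this quantity as large as possible):

# Log-likelihood of a logistic regression (hypothetical toy data)
import numpy as np

age = np.array([22, 25, 47, 52, 46, 56, 60])  # hypothetical ages
bought = np.array([0, 0, 1, 1, 1, 1, 1])      # hypothetical 0/1 outcomes

def log_likelihood(b0, b1):
    z = b0 + b1 * age
    p = 1 / (1 + np.exp(-z))                  # sigmoid probabilities
    return np.sum(bought * np.log(p) + (1 - bought) * np.log(1 - p))

# Fitting tries many (b0, b1) pairs and keeps the pair with the highest value:
print(log_likelihood(-5.0, 0.15))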
2. Understanding logistic regression
and its use in binary classification.
Example
• Logistic Regression is a statistical method used
for binary classification, which involves
predicting one of two possible classes or
outcomes.
• It's a fundamental algorithm in machine
learning and statistics.
• Despite its name, logistic regression is
primarily used for classification rather than
regression tasks.
Purpose of Logistic Regression

• Logistic Regression is used to model the probability of a binary outcome based on one or more predictor variables.
• It predicts the probability that an instance belongs
to a particular class (e.g., positive or negative) by
using the logistic function (also known as the
sigmoid function) to map the linear combination
of predictor variables to a value between 0 and 1.
• This probability can then be thresholded to make
a binary classification decision.
Logistic Function (Sigmoid)

• The logistic function is the core of logistic regression. It has an S-shaped curve and is defined as:

• σ(z) = 1 / (1 + e^(−z))

• where z = b0 + b1x1 + b2x2 is the linear combination of predictor variables and their corresponding coefficients.
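A minimal sketch of the sigmoid in Python (NumPy only; the coefficient and feature values are hypothetical):

# Sigmoid (logistic) function sketch
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # maps any real z into (0, 1)

b0, b1, b2 = -1.0, 0.5, 0.25       # hypothetical coefficients
x1, x2 = 2.0, 4.0                  # hypothetical feature values
z = b0 + b1 * x1 + b2 * x2         # linear combination of predictors
print(sigmoid(z))                  # a probability between 0 and 1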
Model Training

• During training, logistic regression estimates the coefficients of the predictor variables that best fit the training data.
• The optimization algorithm (e.g., 'lbfgs',
'newton-cg', 'sag', 'saga') finds the coefficients
that maximize the likelihood of the observed
class labels given the input features.
Decision Boundary

• The decision boundary is a threshold that determines the class prediction.
• It is the point where the logistic function
crosses the threshold (usually 0.5).
• Instances with probabilities above the
threshold are classified as one class, and those
below are classified as the other class.
Evaluation

• Logistic Regression models are evaluated using various metrics such as accuracy, precision, recall, F1-score, ROC curve, and AUC-ROC.
• The choice of evaluation metric depends on
the problem's requirements and the class
distribution.
Use Cases

• Logistic Regression is commonly used in various fields, including:
• - Medical diagnosis (e.g., disease prediction)
• - Customer churn prediction
• - Spam detection
• - Credit risk assessment
• - Sentiment analysis
• - Fraud detection
A few examples of classification problems are as follows:
1. A bank would like to classify the customers based on risk such as
low-risk or high-risk customers.
2. E-commerce providers would like to predict whether a customer is
likely to churn or not. It is a loss of revenue for the company if an
existing and valuable customer churns.
3. Health service providers may classify a patient, based on the
diagnostic results, as positive (presence of disease) or negative.
4. The HR department of a firm may want to predict if an applicant
would accept an offer or not.
5. Predicting outcome of a sporting event, for example, whether India
will win the next world cup cricket tournament
6. Sentiments of customers on a product or service may be classified
as positive, negative, neutral, and sarcastic.
7. Based on the image of a plant, one can predict if the plant is
infected with a specific disease or not.
Limitations

• Logistic Regression assumes a linear relationship between the predictor variables and the log-odds of the outcome.
• It may not perform well with complex interactions
between features.
• Additionally, it's sensitive to outliers and can suffer
from multicollinearity.
• In summary, Logistic Regression is a powerful and
interpretable algorithm used for binary classification
tasks. It models the probability of belonging to a class
and is widely applied in practical machine learning
problems.
3. Logistic Regression is not only used for binary
classification, but it also provides estimated
probabilities for each class.
• These estimated probabilities can be interpreted
as the model's confidence in its predictions.
• When logistic regression is used for binary
classification, it models the probability of an
instance belonging to the positive class (usually
labeled as "1").
• The logistic function (sigmoid) transforms the
linear combination of predictor variables into a
probability score between 0 and 1.
The formula for logistic regression is as
follows:

• P(y = 1 | X) = 1 / (1 + e^(−z))

• Where P(y = 1 | X) is the probability of the positive class given the input features X.
• z is the linear combination of predictor variables and
their coefficients.

• To get the predicted probability for a specific instance, you compute z based on the model's coefficients and the input features, and then apply the logistic function.
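As a minimal sketch (with synthetic data), the manual computation can be checked against scikit-learn's predict_proba:

# Manual probability vs. predict_proba (synthetic data)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

model = LogisticRegression().fit(X, y)
z = model.intercept_ + X @ model.coef_.ravel()   # linear combination z
manual = 1 / (1 + np.exp(-z))                    # apply the logistic function
print(np.allclose(manual, model.predict_proba(X)[:, 1]))  # True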
4. Model evaluation
• Metrics such as sensitivity, specificity,
precision, and F-score are used to assess the
performance of classification models,
particularly in binary classification problems.
• These metrics provide insights into different
aspects of the model's behaviour and are
often used together to provide a
comprehensive understanding of its
performance.
Confusion Matrix
• A confusion matrix is a table that visualizes and summarizes the performance of a classification algorithm.
• tn (True Negatives): The number of instances that
were correctly predicted as the negative class (0).
• fp (False Positives): The number of instances that
were incorrectly predicted as the positive class (1)
when they actually belong to the negative class.
• fn (False Negatives): The number of instances that
were incorrectly predicted as the negative class
(0) when they actually belong to the positive class.
• tp (True Positives): The number of instances that
were correctly predicted as the positive class (1).
1. Sensitivity (True Positive Rate or
Recall)
• Sensitivity measures the ability of the model
to correctly identify positive instances (true
positives) out of all actual positive instances.
• It is the ratio of true positives to the total
number of actual positives.
• Sensitivity = True Positives / (True Positives + False Negatives)
2. Specificity (True Negative Rate)

• Specificity measures the ability of the model to correctly identify negative instances (true negatives) out of all actual negative instances.
• It is the ratio of true negatives to the total
number of actual negatives.
• Specificity = True Negatives / (True Negatives + False Positives)
• FPR = 1 − Specificity
3. Precision (Positive Predictive Value)

• Precision measures the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive.
• It is a measure of the model's accuracy when
predicting positive cases.
• Precision = True Positives / (True Positives + False Positives)
4. F-score (F1-score)

• The F-score is the harmonic mean of precision and recall (sensitivity).
• It provides a balanced measure that takes both false
positives and false negatives into account.
• The F1-score is often used when there is a class
imbalance.
• F-score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
• These metrics are particularly important when the
cost of false positives and false negatives varies or
when you want to understand the trade-off between
different aspects of the model's performance.
• Remember that the choice of which metrics to
focus on depends on the specific problem and
the relative importance of different types of
errors in your application.
5. The Receiver Operating Characteristic (ROC)
curve and the Area Under the Curve (AUC)
• The Receiver Operating Characteristic (ROC)
curve and the Area Under the Curve (AUC) are
widely used tools for evaluating and
visualizing the performance of binary
classification models.
• They provide valuable insights into the
model's ability to distinguish between positive
and negative classes across different threshold
settings.
ROC Curve (Receiver Operating
Characteristic)
• The ROC curve is a graphical representation of the
true positive rate (sensitivity) versus the false positive
rate (1 - specificity) as the classification threshold is
varied.
• It shows how well the model can discriminate
between the positive and negative classes across
different threshold levels.
• Each point on the ROC curve corresponds to a specific
threshold setting.
• The diagonal line (y = x) on the ROC plot represents
random guessing.
• An ideal model would have a ROC curve that hugs the
top-left corner of the plot, indicating high sensitivity
(true positive rate) and low false positive rate across
all threshold values.
AUC (Area Under the Curve)

• The AUC is a scalar value that summarizes the overall performance of a model's ROC curve.
• It quantifies the area under the ROC curve and
ranges between 0 and 1.
• A model with perfect discrimination will have an
AUC of 1, while a model with no discrimination
(random guessing) will have an AUC of 0.5.
• A higher AUC generally indicates better
classification performance, but the interpretation
of AUC can depend on the specific problem and
the trade-offs between sensitivity and specificity.
Interpretation

• AUC = 0.5: Random guessing.
• AUC > 0.5 and < 0.7: Poor to fair discrimination.
• AUC ≥ 0.7 and < 0.8: Acceptable
discrimination.
• AUC ≥ 0.8 and < 0.9: Good discrimination.
• AUC ≥ 0.9: Excellent discrimination.
Advantages of ROC Curve and AUC

1. Threshold Invariance: The ROC curve and AUC provide a comprehensive view of the model's performance across different threshold settings, making them useful when the decision threshold needs to be adjusted.
2. Class Imbalance: They are less affected by class
imbalance compared to metrics like accuracy.
3. Model Comparison: ROC curves and AUC allow for
easy comparison between different models, helping
you choose the best-performing one.
Finding Optimal Classification Cut-off
• While using a logistic regression model, one of the decisions a data scientist has to make is choosing the right classification cut-off probability (Pc).
• The overall accuracy, sensitivity, and
specificity will depend on the chosen cut-off
probability.
The following two methods are used
for selecting the cut-off probability:
1. Youden’s index
2. Cost-based approach
Youden’s Index

• Sensitivity and specificity change when we change the cut-off probability.
• Youden’s index (Youden, 1950) is the classification cut-off probability for which the following function (also known as the J-statistic) is maximized:
• Youden’s Index = J-statistic = max over p [Sensitivity(p) + Specificity(p) − 1] = max(TPR − FPR)
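A minimal sketch of computing this with scikit-learn (the labels and probabilities below are hypothetical; roc_curve returns the candidate cut-offs directly):

# Youden's index: pick the cut-off maximizing TPR - FPR
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                  # hypothetical labels
y_prob = np.array([0.2, 0.4, 0.8, 0.6, 0.3, 0.9, 0.5, 0.7])  # hypothetical probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j = tpr - fpr                                   # J-statistic at each candidate cut-off
print("Optimal cut-off (Youden):", thresholds[np.argmax(j)])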
Cost-Based Approach

• As the cost of false negatives and false positives is not the same, the optimal classification cut-off probability can also be determined using a cost-based approach, which finds the cut-off where the total cost is minimum. In the cost-based approach, we assign a penalty cost for misclassification of positives and negatives and find the total cost for each cut-off probability.
• Assuming the cost of a false positive is C1 and that of a false negative is C2, the total cost will be
• Total cost = FP × C1 + FN × C2
• The optimal cut-off probability is the one which
minimizes the total penalty cost.
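A minimal sketch of the cost-based search (hypothetical labels, probabilities, and costs; each candidate cut-off is scored by its total penalty):

# Cost-based approach: pick the cut-off minimizing total penalty cost
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                  # hypothetical labels
y_prob = np.array([0.2, 0.4, 0.8, 0.6, 0.3, 0.9, 0.5, 0.7])  # hypothetical probabilities
C1, C2 = 1.0, 5.0                  # hypothetical costs: false positive, false negative

cutoffs = np.arange(0.1, 0.95, 0.05)
costs = []
for c in cutoffs:
    y_pred = (y_prob >= c).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    costs.append(fp * C1 + fn * C2)  # total penalty cost at this cut-off
print("Optimal cut-off (cost-based):", cutoffs[int(np.argmin(costs))])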
Example 1 (LR3)
• The example we consider below is a marketing
scenario in which we try to predict the probability that
a customer will renew his subscription to an online
information service.
• The data correspond to a sample of 60 readers, with
the age category, the average number of page views
per week over the last 10 weeks, and the number of
page views during the last week. These readers were
asked to renew their subscription which is due to
expire in two weeks.
• The goal is to understand why some have
re-subscribed while others have not.
Step 1. Import data
# import data
import pandas as pd

# insert the path of your Excel file
df = pd.read_excel('C:/Users/LENOVO/Desktop/RLA/BMS/Sem 3/Introduction to Business Analytics/Practical/Logistic regression examples/LR3.xlsx')
df.head()
Step 2. Import libraries
#Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, roc_auc_score
import statsmodels.api as sm
Step 3- Preparing the variables
#prepare the X and y
y = df.Renewed
X = df.iloc[:,:-1]
X.head()

(We will specify the y variable from the dataset and segregate all the other variables except y in X; make sure the y variable is the last column, with values only 0 or 1.)
Step 4- Splitting Dataset into Training
and Test Sets
• Before building the model, split the dataset
into 80:20 or (70:30) ratio for creating training
and validation datasets.
• The model will be built using the training set
and tested using test set.
Dividing a dataset into training and testing data is a fundamental
practice in machine learning, including logistic regression, to
evaluate how well a model generalizes to new, unseen data. This
division serves several purposes:

• Model Evaluation: The primary reason for splitting the dataset is to evaluate the performance of the trained model. By testing the model on data it has never seen before (testing data), you get an estimate of how well it will perform on real-world data. This evaluation helps you assess whether the model is overfitting or underfitting.
Preventing Overfitting:
• Overfitting occurs when a model learns to
perform very well on the training data but fails
to generalize to new data.
• By evaluating the model on a separate testing
dataset, you can identify if it's overfitting.
• If the model's performance on the testing data
is significantly worse than on the training
data, it's a sign of overfitting.
Real-world Simulation:
• The testing data simulates real-world
scenarios where the model is presented with
new observations that it hasn't encountered
during training.
• This is a critical aspect of assessing a model's
practical utility.
• The typical practice is to split the dataset into
two parts: a larger portion for training (usually
around 70-80% of the data) and a smaller
portion for testing (the remaining 20-30%).
• The training data is used to fit the model's
parameters, while the testing data is used to
evaluate its performance.
• This approach helps ensure that the
evaluation is unbiased and provides a reliable
estimate of the model's generalization ability.
Code

#Training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• X_train and y_train contain the independent
variables and response variable values for the
training dataset respectively.
• Similarly, X_test and y_test contain the
independent variables and response variable
values for the test dataset, respectively.
Step 5- Fitting the model
#Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

(We will fit the model using LogisticRegression() and pass the training sets X_train and y_train as parameters.)
Step 6- Making predictions with the model
• This line of code uses the trained model to predict the target
variable (binary classification) for the test data.
• It takes the test feature data (X_test) as input and returns
predicted labels for each observation in the test dataset.
• After executing this code, y_pred will contain the predicted
labels for the test data based on the logistic regression
model's predictions.
• For binary classification problems like logistic regression,
predicted labels are usually encoded as 0 or 1, indicating the
predicted class for each observation.
• These predicted labels can then be compared with the actual
labels (y_test) to assess the model's performance, calculate
metrics like accuracy, ROC curves, AUC, and more.
Code
# Make predictions on the test data
y_pred = model.predict(X_test)
y_pred
Step 7- Classification Report &
Confusion Matrix
• In classification, the model performance is often
measured using concepts such as sensitivity,
specificity, precision, and F-score.
• The ability of the model to correctly classify
positives and negatives is called sensitivity (also
known as recall or true positive rate) and
specificity (also known as true negative rate),
respectively.
• The terminologies sensitivity and specificity
originated in medical diagnostics
Code
#Classification report
from sklearn.metrics import
classification_report
print(classification_report(y_test, y_pred))
Output
(classification report screenshots: accuracy, F1 score, specificity, and sensitivity)
Confusion matrix code
#Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

Output:
[[4 2]
 [1 5]]
Manual calculation of metrics from the confusion matrix
#Accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Accuracy:", round(accuracy * 100, 1), "%")

#F1_score
f1 = 2 * tp / (2 * tp + fn + fp)
print("F1_score:", round(f1 * 100, 1), "%")

#Specificity
specificity = tn / (tn + fp)
print("Specificity:", round(specificity * 100, 1), "%")

#Sensitivity
sensitivity = tp / (tp + fn)
print("Sensitivity:", round(sensitivity * 100, 1), "%")
Output
Accuracy: 75.0 %
F1_score: 76.9 %
Specificity: 66.7 %
Sensitivity: 83.3 %
Analysis- Accuracy
• An accuracy of 75.0% in a logistic regression
analysis indicates that the model's predictions
are correct for 75% of the observations in the
dataset.
F1 score
• An F1 score of 76.9% in logistic regression signifies a well-balanced model performance in binary classification, considering both precision and recall.
• This score is especially important when dealing with imbalanced datasets, offering a reliable measure of the model's ability to manage false positives and false negatives.
• It indicates that the model is making a good trade-off between minimizing false positives (precision) and false negatives (recall); its value should still be assessed alongside contextual factors and comparative analyses for a comprehensive understanding of the model's efficacy.
Specificity
• A specificity of 66.7% in logistic regression
indicates the model's ability to correctly
identify negative cases among all actual
negatives.
• This metric is crucial for tasks where avoiding
false positives is vital.
• A high specificity suggests the model is adept
at reducing false alarms.
Sensitivity
• A sensitivity of 83.3% in logistic regression
highlights the model's capability to correctly
identify positive cases among all actual positives.
• This metric is vital in scenarios where avoiding
false negatives is critical.
• A high sensitivity indicates the model is proficient
at capturing true positives.
• Nonetheless, striking a balance between
sensitivity and specificity is essential, particularly
when false positives and false negatives have
differing consequences.
Step 8- Display coefficients and
intercept
# Display coefficients and intercept
coef = model.coef_
intercept = model.intercept_
print("Coefficients:", coef)
print("Intercept:", intercept)
Output
Coefficients: [[-0.00820279 0.04094567
0.04561219]]
Intercept: [-1.42380146]
Analysis
• The log-odds that a customer will renew his subscription to the online information service decrease by 0.0082 units if age increases by 1 unit. This means more people will renew the subscription as age decreases.
• The log-odds of renewal increase by 0.0409 units if the average number of page views per week over the last 10 weeks increases by 1 unit. This means more people will renew the subscription as views over the last 10 weeks increase.
• The log-odds of renewal increase by 0.0456 units if the number of page views during the last week increases by 1 unit. This means more people will renew the subscription as views in the last week increase.
• The intercept is the baseline log-odds when all independent variables are zero. In this context, it represents the estimated log-odds of renewal with zero age, no average page views per week over the last 10 weeks, and zero page views during the last week. The intercept is then used in the logistic function to calculate the baseline probability.
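A minimal sketch of turning these log-odds coefficients into odds ratios, which are often easier to interpret (this reuses the model fitted in Step 5):

# Convert coefficients to odds ratios
import numpy as np

odds_ratios = np.exp(model.coef_.ravel())
print("Odds ratios:", odds_ratios)
# e.g., exp(0.0409) ≈ 1.042: one extra average weekly page view multiplies
# the odds of renewal by about 1.042, holding the other variables constant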
Step 9- Obtaining predicted
probabilities
• We use the trained logistic regression model (model) to generate
predicted probabilities for each class (positive and negative) for the test
data (X_test). It computes the probability of each observation belonging to
each class.
• Decision Threshold: By obtaining the predicted probabilities for the
positive class, you can apply a decision threshold (usually 0.5) to convert
these probabilities into binary class predictions. If the predicted
probability is greater than or equal to the threshold, the observation is
classified as the positive class; otherwise, it's classified as the negative
class.
• ROC Curve and AUC: Predicted probabilities are used to create Receiver
Operating Characteristic (ROC) curves and calculate Area Under the Curve
(AUC), which provide insights into the model's performance across various
thresholds.
• In summary, obtaining the predicted probabilities using predict_proba() is crucial for making informed decisions, understanding the model's confidence, tuning threshold-dependent metrics, and assessing the model's overall performance.
Code
y_pred_prob = model.predict_proba(X_test)[:, 1]
y_pred_prob
Output
array([0.82546484, 0.45931361, 0.48723111,
0.52871316, 0.83128135, 0.40167389, 0.5608134 ,
0.99609292, 0.99968847, 0.3439706 , 0.3461162 ,
0.99847422])

Interpretation: Each probability value indicates how likely the corresponding observation is to belong to the positive class. For instance, a probability of 0.825 for an observation implies that the model estimates an 82.5% chance that this observation belongs to the positive class.
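A minimal sketch of applying a custom cut-off to these probabilities instead of the default 0.5 (continuing from y_pred_prob above; the 0.6 value is only illustrative):

# Apply a custom decision threshold to the predicted probabilities
cutoff = 0.6                                      # illustrative; could come from Youden's index
y_pred_custom = (y_pred_prob >= cutoff).astype(int)
print(y_pred_custom)                              # 0/1 predictions at this cut-off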
Step 10- Calculate ROC curve and AUC

• The receiver operating characteristic (ROC) curve can be used to understand the overall performance (worth) of a logistic regression model (and, in general, of classification models) and is used for model selection.
classification models) and used for model selection.
• Given a random pair of positive and negative class
records, ROC gives the proportions of such pairs that
will be correctly classified.
• ROC curve is a plot between sensitivity (true positive
rate) on the vertical axis and 1 – specificity (false
positive rate) on the horizontal axis.
Code
# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Output
(ROC curve plot; AUC = 0.81)
Analysis
• As a thumb rule, AUC of at least 0.7 is required for
practical application of the model.
• AUC greater than 0.9 implies an outstanding model.
• Caution should be exercised while selecting models
based on AUC, especially when the data is imbalanced
(i.e., dataset which has less than 10% positives).
• In case of imbalanced datasets, the AUC may be very
high (greater than 0.9); however, either sensitivity or
specificity values may be poor.
• For this example, the AUC is 0.81, which implies the
model is fairly good.
Example 2 (spam)
• Spam E-mail
• Data Description: The data consist of 4601 email items, of
which 1813 items were identified as spam.
• Format: This data frame contains the following columns:
• crl.tot: total length of words in capitals
• Dollar: number of occurrences of the $ symbol
• Bang: number of occurrences of the ! symbol
• Money: number of occurrences of the word ‘money’
• N000: number of occurrences of the string ‘000’
• Make: number of occurrences of the word ‘make’
• Yesno: outcome variable, a factor with levels n (not spam) and y (spam)
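A minimal sketch of fitting the model to this dataset (the file name and path are hypothetical; the column names follow the description above):

# Sketch: logistic regression on the spam e-mail data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

spam = pd.read_excel('spam.xlsx')                   # hypothetical path
y = spam['yesno'].map({'y': 1, 'n': 0})             # encode outcome as 1/0
X = spam.drop(columns=['yesno'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))                  # test-set accuracy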
Example 3 (LR1)
• Includes customer information such as:
• Age
• How many days since they first visited the
store website
• No. of items in their cart
• The first column is the dependent variable
that indicates whether the customer
purchased on their latest visit
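Since the dependent variable is the first column here (unlike LR3, where it was the last), the X/y preparation flips; a minimal sketch (hypothetical path):

# Sketch: preparing X and y for the LR1 data
import pandas as pd

df = pd.read_excel('LR1.xlsx')   # hypothetical path
y = df.iloc[:, 0]                # first column: purchased on latest visit (0/1)
X = df.iloc[:, 1:]               # age, days since first visit, items in cart
X.head()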
