Logistic Regression
Dr. Rupak Chakraborty
Techno India University, Kolkata
Introduction
Explanation
Logistic Regression is a statistical model used to predict the probability
of a certain event happening. It works by taking input variables and transforming
them into a probability value between 0 and 1, where values near 0 indicate a low
probability and values near 1 indicate a high probability.
For example, imagine you want to predict whether someone will buy a product based
on their age and income. Logistic Regression would take these input variables and use
them to calculate the probability of the person buying the product.
It's called "logistic" because the transformation of the input variables is done using a
mathematical function called the logistic function, which creates an S-shaped curve.
Overall, Logistic Regression is a useful tool for making predictions and understanding
the relationship between variables in a dataset.
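As a minimal sketch of this idea, the snippet below scores one hypothetical customer using hand-picked coefficients; the names and numbers are illustrative assumptions, not fitted values:

```python
import math

# Hand-picked, purely illustrative coefficients (NOT fitted to any data).
b0, b_age, b_income = -5.0, 0.06, 0.00005

def purchase_probability(age, income):
    """Map age and income to a probability in (0, 1) via the logistic function."""
    z = b0 + b_age * age + b_income * income  # linear combination of the inputs
    return 1 / (1 + math.exp(-z))             # logistic (S-shaped) transform

print(purchase_probability(age=35, income=60_000))  # ≈ 0.52
```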
Example
Imagine it has been several years since you last serviced your car, and one day
you start wondering whether the car will break down in the near future or not.
This is a classification problem, since the answer will be either ‘Yes’ or ‘No’.
As we can imagine, when the number of years since the service is small, say 1, 2,
or 3 years, the chance of the car breaking down is very limited, and it grows as
more years pass.
[Figure: probability of breakdown plotted against years since service]
Here, the dependent variable’s output is discrete.
Why not Linear Regression?
Take, for example, the car-breakdown data above, with a straight regression line
fitted through the 0/1 outcomes.
[Figure: linear regression line fitted to the binary breakdown data]
As you can see, the fit doesn’t look right: the straight line predicts values
below 0 and above 1, which cannot be interpreted as probabilities.
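A small numeric sketch of the problem, using numpy on invented breakdown data (the data points are assumptions for illustration):

```python
import numpy as np

# Invented data: years since the last service vs. breakdown (0 = no, 1 = yes).
years = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
broke = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares straight line through the 0/1 outcomes.
slope, intercept = np.polyfit(years, broke, deg=1)

# The line's predictions escape [0, 1], so they cannot be probabilities.
print(intercept + slope * 0.5)  # below 0 for a nearly new car
print(intercept + slope * 15)   # above 1 for a very old car
```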
Odds of Success
To understand Logistic Regression, let’s talk about the odds of success.
The odds of success θ are defined as θ = p / (1 − p), where p is the probability of success.
If p = 0, θ = 0 / (1 − 0) = 0 / 1 = 0
If p = 1, θ = 1 / (1 − 1) = 1 / 0 → ∞
So as p ranges from 0 to 1, the odds θ range from 0 to infinity.
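A quick numeric check of how the odds behave as p moves from 0 toward 1:

```python
# Odds of success for a few probabilities p.
for p in [0.1, 0.5, 0.9, 0.99]:
    odds = p / (1 - p)
    print(f"p = {p}: odds = {odds:.2f}")
# p = 0.1: 0.11   p = 0.5: 1.00   p = 0.9: 9.00   p = 0.99: 99.00
```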
Predicting Odds of Success
The model assumes the log-odds are a linear function of x:

log( p(x) / (1 − p(x)) ) = β0 + β1x    (β0 = constant)

Exponentiating both sides:

e^( ln( p(x) / (1 − p(x)) ) ) = e^(β0 + β1x)

or, p(x) / (1 − p(x)) = e^(β0 + β1x)

Then, writing Y = e^(β0 + β1x):

p(x) / (1 − p(x)) = Y
Predicting Odds of Success
From the previous slide,

p(x) / (1 − p(x)) = Y

or, p(x) = Y (1 − p(x)) = Y − Y·p(x)

or, p(x) (1 + Y) = Y

or, p(x) = Y / (1 + Y)
Predicting Odds of Success
From the previous slide,

p(x) = Y / (1 + Y)

Substituting back Y = e^(β0 + β1x):

p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))    [Sigmoid]

Dividing the numerator and denominator by e^(β0 + β1x):

p(x) = 1 / (1 + e^−(β0 + β1x))
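A quick numeric sanity check that the two forms above agree (β0 and β1 are arbitrary illustrative values):

```python
import math

b0, b1 = -2.0, 0.8  # arbitrary illustrative coefficients

for x in [0.0, 2.5, 5.0]:
    z = b0 + b1 * x
    form1 = math.exp(z) / (1 + math.exp(z))  # e^z / (1 + e^z)
    form2 = 1 / (1 + math.exp(-z))           # 1 / (1 + e^-z)
    print(x, round(form1, 6), round(form2, 6))  # the two columns agree
```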
Sigmoid function curve
[Figure: S-shaped curve of p(x) = 1 / (1 + e^−(β0 + β1x)), flattening out near 0 and 1]
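A minimal sketch that reproduces such a curve with matplotlib (β0 = 0 and β1 = 1 are assumed for simplicity):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
p = 1 / (1 + np.exp(-x))  # sigmoid with β0 = 0, β1 = 1

plt.plot(x, p)
plt.axhline(0, color="grey", linewidth=0.5)  # lower asymptote
plt.axhline(1, color="grey", linewidth=0.5)  # upper asymptote
plt.xlabel("x")
plt.ylabel("p(x)")
plt.title("Sigmoid function curve")
plt.show()
```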
Compare Linear Regression and Logistic Regression

Linear Regression
Example: weather prediction. If we need to predict the temperature of the coming
week, the prediction is a continuous number.

Logistic Regression
Example: weather prediction. If we are going to predict whether it will be raining
tomorrow or not, the prediction is a discrete value: either ‘Yes’ or ‘No’.
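A small scikit-learn sketch of the contrast, on invented weather data (all numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented feature: today's humidity (%); invented targets.
humidity = np.array([[30], [45], [60], [70], [80], [90]])
temp_tomorrow = np.array([31.0, 29.5, 27.0, 25.5, 24.0, 22.5])  # continuous
rain_tomorrow = np.array([0, 0, 0, 1, 1, 1])                    # discrete 0/1

# Linear regression predicts a continuous number (a temperature).
print(LinearRegression().fit(humidity, temp_tomorrow).predict([[65]]))

# Logistic regression predicts a class (rain / no rain) and a probability.
clf = LogisticRegression().fit(humidity, rain_tomorrow)
print(clf.predict([[65]]), clf.predict_proba([[65]]))
```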
Compare Logistic Regression and Classification

Logistic Regression
Logistic regression is a statistical modeling technique used to analyze and model
the relationship between a dependent variable and one or more independent variables.
In logistic regression, the dependent variable is categorical (i.e., it takes on a
limited number of values), but the model's output, a probability, is continuous in
nature.
The goal of logistic regression is to predict the probability of an event occurring
(i.e., the dependent variable taking a certain value) based on the values of the
independent variables.

Classification
Classification, on the other hand, is a machine learning task that involves
assigning an input to one of several predefined categories.
Classification can be thought of as a kind of prediction problem, where the goal is
to predict the class or category of a given input.
Applications of Logistic Regression
Typical applications include spam filtering (spam / not spam), medical diagnosis
(disease present / absent), credit scoring (default / no default), and customer
churn prediction (churn / stay).
Logistic Regression Assumptions
Binary Outcome:
The dependent variable, also known as the outcome variable or response
variable, is binary in nature.
This means that it takes on one of two possible values, typically coded as 0
and 1, or as "success" and "failure", "yes" and "no", "true" and "false", or some
other binary coding.
The logistic regression model is designed to estimate the probability of the
"success" outcome as a function of one or more independent variables, also
known as predictors or covariates.
The logistic function, which transforms a linear combination of the
predictors into a probability between 0 and 1, is used to model the
relationship between the predictors and the outcome.
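As a sketch, fitting such a model with scikit-learn on an invented binary outcome (the dataset and feature names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented predictors (age, income); binary outcome: bought (1) or not (0).
X = np.array([[22, 25_000], [30, 40_000], [35, 60_000],
              [45, 80_000], [50, 95_000], [28, 30_000]])
y = np.array([0, 0, 1, 1, 1, 0])  # exactly two possible values

model = LogisticRegression(max_iter=1_000).fit(X, y)

# The model estimates P(y = 1 | X): a probability between 0 and 1.
print(model.predict_proba([[40, 70_000]])[:, 1])
```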
Logistic Regression Assumptions
No Multicollinearity:
The assumption of no or low multicollinearity among the independent variables is
important in logistic regression. Multicollinearity refers to a situation where two or
more independent variables are highly correlated with each other, which can lead to
problems in the estimation of the model parameters and in the interpretation of the
results.
Multicollinearity can cause unstable and imprecise estimates of the logistic
regression parameters, and may make it difficult to identify which independent
variable(s) are driving the observed effects on the outcome variable. One way to
check for multicollinearity is to calculate the correlation matrix between the
independent variables and look for high correlations (i.e., correlations greater than
0.7 or 0.8).
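One way to run that check with pandas (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical predictor columns.
df = pd.DataFrame({
    "age":            [22, 30, 35, 45, 50, 28],
    "income":         [25, 40, 60, 80, 95, 30],  # in thousands
    "years_employed": [1, 6, 10, 20, 25, 4],
})

corr = df.corr().abs()
print(corr)

# Flag predictor pairs whose absolute correlation exceeds 0.8.
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.8:
            print(f"High correlation: {a} vs {b} = {corr.loc[a, b]:.2f}")
```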
Logistic Regression Assumptions
Large Sample Size:
Sample size is an important consideration in logistic regression. A relatively large
sample size is typically required to ensure stable estimates and adequate
statistical power to detect meaningful effects.
The sample size requirements for logistic regression depend on several factors,
such as the number and complexity of the independent variables, the prevalence
of the outcome in the population, and the desired level of statistical power. As a
general rule of thumb, a sample size of at least 10-15 observations per
independent variable is often recommended.
If the sample size is too small, the logistic regression model may suffer from
issues such as overfitting, where the model fits the noise in the data instead of
the underlying signal, and underpowered statistical tests, where important
effects may be missed due to insufficient sample size.
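The 10-15-observations-per-predictor rule of thumb translates into a trivial check (the predictor count below is illustrative):

```python
# Illustrative numbers: a model with 5 independent variables.
n_predictors = 5
low, high = 10, 15  # rule-of-thumb observations per predictor

print(f"Recommended sample size: {n_predictors * low} to {n_predictors * high}")
# -> Recommended sample size: 50 to 75
```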
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a machine learning
algorithm for classification tasks. It is a square matrix that compares the actual and
predicted values of a classifier.
Let's consider an example of a binary classification problem where we have a dataset
of 100 patients with diabetes, and we want to build a model that can predict whether
a patient has diabetes or not based on their medical data. The model output will be
either "Positive" or "Negative".
By examining the values in the confusion matrix, we can calculate various
performance metrics, such as accuracy, precision, recall, and F1-score, which can help
us evaluate the model's performance. The confusion matrix provides a clear and
concise way of visualizing the model's performance in terms of its ability to correctly
classify positive and negative cases.
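A sketch of building such a matrix with scikit-learn; the label vectors below are synthetic, constructed to reproduce the counts shown on the next slide:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels giving 60 TP, 10 FN, 15 FP, 15 TN (1 = diabetes, 0 = not).
y_true = np.array([1] * 70 + [0] * 30)
y_pred = np.array([1] * 60 + [0] * 10 + [1] * 15 + [0] * 15)

# labels=[1, 0] puts the positive class first, matching the slide's layout
# (rows = actual values, columns = predicted values).
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[60 10]
#  [15 15]]
```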
Confusion Matrix
Suppose the model has made predictions on the test set and we have the following
results:

                  Predicted Positive    Predicted Negative
Actual Positive          60                    10
Actual Negative          15                    15

Here we have a 2x2 matrix, where the rows represent the actual values and the
columns represent the predicted values. The diagonal elements of the matrix
represent the correctly classified cases, and the off-diagonal elements represent
the incorrectly classified cases.

The values in the confusion matrix are as follows:
True Positives (TP): the number of cases that were correctly classified as positive (60 in this case).
False Positives (FP): the number of cases that were incorrectly classified as positive (15 in this case).
True Negatives (TN): the number of cases that were correctly classified as negative (15 in this case).
False Negatives (FN): the number of cases that were incorrectly classified as negative (10 in this case).
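From these four counts the usual metrics follow directly (these are the standard definitions):

```python
TP, FP, TN, FN = 60, 15, 15, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 75 / 100 = 0.75
precision = TP / (TP + FP)                                  # 60 / 75  = 0.80
recall    = TP / (TP + FN)                                  # 60 / 70  ≈ 0.857
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.828

print(accuracy, precision, recall, f1)
```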
Thank you