
UNIT-III

Regression
Covariance
• Covariance is a measure of how much two random variables vary together. It’s similar to
variance, but where variance tells you how a single variable varies, covariance tells you
how two variables vary together.

• A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.

The Covariance Formula:

Cov(X, Y) = Σ (Xi – X̄)(Yi – Ȳ) / (n – 1)

Where:

• Xi – the values of the X-variable

• Yi – the values of the Y-variable

• X̄ – the mean (average) of the X-variable

• Ȳ – the mean (average) of the Y-variable

• n – the number of data points

Correlation
• Correlation is a measure of association between two variables. Correlations can be positive or negative and range between +1 and -1.
• It describes the degree and type of relationship between any two or more quantities (variables) in which they vary together over a period.

• for example, variation in the level of expenditure or savings with variation in the level of
income.

• A positive correlation exists where the high values of one variable are associated with
the high values of the other variable(s).

• A 'negative correlation' means association of high values of one with the low values of
the other(s).

• Values close to +1 indicate a high degree of positive correlation, and values close to -1
indicate a high degree of negative correlation.

• Values close to zero indicate poor correlation of either kind, and 0 indicates no
correlation at all.
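As a quick illustration, covariance and correlation can be computed directly in R; the vectors below are made-up sample data.

# hypothetical paired observations (e.g., income and savings)
income  <- c(20, 35, 50, 65, 80)
savings <- c(2, 4, 7, 9, 12)

cov(income, savings)   # positive covariance: the variables move together
cor(income, savings)   # correlation close to +1: strong positive association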

Regression – Concepts
Introduction:

• The term regression is used to indicate the estimation or prediction of the average value
of one variable for a specified value of another variable.

• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.

Regression Analysis:

• Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable' or 'response variable') and one or more independent variables (often called 'predictors' or 'explanatory variables').

Types of Regression Analysis Techniques:

– Linear Regression

– Logistic Regression

– Ridge Regression

– Lasso Regression

– Polynomial Regression
– Bayesian Linear Regression
1. Linear Regression

• Purpose: Predict a continuous dependent variable based on the linear relationship with
one or more independent variables.

• Key Assumption: There is a linear relationship between the dependent and independent
variables.

• Example: Predicting housing prices based on square footage.

2. Logistic Regression

• Purpose: Used to predict a binary outcome (0 or 1) based on one or more predictor variables.

• Key Assumption: The dependent variable is binary, and the log-odds of the outcome are
linearly related to the independent variables.

• Example: Predicting whether a customer will purchase a product (yes/no).

3. Ridge Regression (L2 Regularization)

• Purpose: A variant of linear regression that adds a regularization term to penalize large
coefficients, reducing overfitting.

• Key Assumption: The predictors may be highly correlated (multicollinearity), and the
model is prone to overfitting.

• Example: Predicting car prices with many correlated variables like mileage, age, and
horsepower.

4. Lasso Regression (L1 Regularization)

• Purpose: Similar to Ridge Regression but capable of driving some coefficients to zero,
performing automatic feature selection.

• Key Assumption: Some predictors may be irrelevant or unimportant, and feature selection is needed.

• Example: Predicting sales with hundreds of potential features, where some are not
useful.
5. Polynomial Regression

• Purpose: Extends linear regression by fitting a polynomial equation (curved line) to the
data to model non-linear relationships.

• Key Assumption: The relationship between the dependent and independent variables is
non-linear.

• Example: Modeling the effect of time on population growth (e.g., quadratic relationship).

6. Bayesian Linear Regression

• Purpose: Uses Bayesian principles to estimate the distribution of regression coefficients, providing probabilistic interpretations of the predictions.

• Key Assumption: Parameters are treated as random variables with prior distributions,
and predictions are made using a posterior distribution.

• Example: Predicting stock prices while incorporating prior beliefs and uncertainties
about the coefficients.

Simple Linear Regression


Simple linear regression is used to predict the value of one variable (the dependent variable) on
the basis of other variables (the independent variables). Simple linear regression is an
approach for predicting a response using a single feature. A line is fitted through the group of
plotted data.

The simple linear regression model is written as:

y = β0 + β1x + ε

Where Variables:

x = Independent Variable (we provide this)

y= Dependent Variable (we observe this)

Parameters:

β0= Intercept

• The y-intercept of a line is the point at which the line crosses the y axis. ( i.e. where the x
value equals 0)
β1= Slope

• Change in the mean of Y for a unit change in X

ε = residuals

• The residual value is a discrepancy between the actual and the predicted value. The
distance of the plotted points from the line gives the residual value.

Positive relation

Negative relation
Constructing a Regression Model for Sample Data

OLS Regression: Linear Regression using Ordinary Least Squares Approximation/Estimation (based on the Gauss-Markov Theorem)

• We can start off by estimating the value for B1 as:

B1 = Σ (xi – mean(x)) * (yi – mean(y)) / Σ (xi – mean(x))²

If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Calculating B1 & B0 using Correlations and Standard Deviations:

B1 = cor(x, y) * stdev(y) / stdev(x)

B0 = mean(y) – B1 * mean(x)

Where

– cor (x,y) is the correlation between x & y

– stdev() is the calculation of the standard deviation for a variable.

The same is calculated in R as follows:

x <- c(1, 2, 4, 3, 5)   # independent variable
y <- c(1, 3, 3, 2, 5)   # dependent variable

B1 <- cor(x, y) * sd(y) / sd(x)
B1
# [1] 0.8

B0 <- mean(y) - B1 * mean(x)
B0
# [1] 0.4
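The same coefficients can be cross-checked with R's built-in lm() function, which fits the line by ordinary least squares:

fit <- lm(y ~ x)
coef(fit)
# (Intercept)           x
#         0.4         0.8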

Applications of Linear Regression


• Evaluating Trends and Sales Estimates

• Analyzing the impact of Price changes

• Assessment of risk in financial services and insurance domain

1. Evaluating Trends and Sales Estimates

• Linear regressions can be used in business to evaluate trends and make estimates or
forecasts.

• For example, if a company's sales have increased steadily every month for the past few years, conducting a linear analysis on the sales data with monthly sales on the y-axis and time on the x-axis would produce a line that depicts the upward trend in sales.

• After creating the trend line, the company could use the slope of the line to forecast sales
in future months.

Analyzing the impact of Price changes

• Linear regression can also be used to analyze the effect of pricing on consumer behavior.

• For example, if a company changes the price on a certain product several times, it can record the quantity it sells for each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable.

• The result would be a line that depicts the extent to which consumers reduce their
consumption of the product as prices increase, which could help guide future pricing
decisions.

Assessment of risk in financial services and insurance domain

• Linear regression can be used to analyze risk.


• For example, a health insurance company might conduct a linear regression plotting the number of claims per customer against age and discover that older customers tend to make more health insurance claims.

• The results of such an analysis might guide important business decisions made to account
for risk.

Assumptions of Linear Regression Model:


• Linear regression is a useful statistical method we can use to understand the relationship
between two variables, x and y.

• However, before we conduct linear regression, we must first make sure that four
assumptions are met:

1. Linear relationship: There exists a linear relationship between the independent variable x and the dependent variable y.

2. Independence: The residuals are independent.

• In particular, there is no correlation between consecutive residuals in the data.

(Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent.)

3. Homoscedasticity: The residuals have constant variance at every level of x.

4. Normality: The residuals of the model are normally distributed.

• If one or more of these assumptions are violated, then the results of our linear regression
may be unreliable or even misleading.

Properties and Assumptions of OLS approximation (BLUE Properties):

– Unbiasedness:

– The bias of an estimator is defined as the difference between its expected value and the true value, i.e., e(y) = y_actual – y_predicted

– If the bias is zero, then the estimator becomes unbiased.

– Unbiasedness is important only when it is combined with small variance


– Least Variance:

– An estimator is best when it has the smallest or least variance

– The least variance property is more important when it is combined with a small bias.

– Efficient estimator:

– An estimator is said to be efficient when it fulfils both conditions:

– The estimator should be unbiased and have the least variance

– Best Linear Unbiased Estimator (BLUE Properties):

• An estimator is said to be BLUE when it fulfils the above properties

• An estimator is BLUE if it is an unbiased, least-variance, linear estimator

– Minimum Mean Square Error (MSE):

• An estimator is said to be an MSE estimator if it has the smallest mean square error.

• That is, there is less difference between the estimated value and the true value.

– Sufficient Estimator:

• An estimator is sufficient if it utilizes all the information of a sample about the True
parameter.

• It must use all the observations of the sample.

Assumptions of OLS Regression

• There are random sampling of observations.

• The conditional mean should be zero

• There is homoscedasticity and no Auto-correlation.

• Error terms should be normally distributed (optional)

• The properties of OLS estimates are based on the simple linear regression equation y = B0 + B1*x + µ (µ -> error term)

• The above equation is based on the following assumptions

– Randomness of µ

– Mean of µ is Zero
– Variance of µ is constant

– µ follows a normal distribution

– Error µ of different observations are independent.

Assessing the fit of regression models:


• A well-fitting regression model results in predicted values close to the observed data
values. The mean model, which uses the mean for every predicted value, generally would
be used if there were no informative predictor variables. The fit of a proposed regression
model should therefore be better than the fit of the mean model.

R Square Method–Goodness of Fit

• R-Squared (R² or the coefficient of determination) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variable. In other words, R-squared shows how well the data fit the regression model (the goodness of fit).
• R-squared is always between 0 and 100%

• 0% indicates that the model explains none of the variability of the response data around
its mean.

• 100% indicates that the model explains all the variability of the response data around its
mean.
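As a small sketch, the R² of a fitted linear model can be read from summary() in R; the vectors below are the same made-up data used earlier:

x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)

fit <- lm(y ~ x)
summary(fit)$r.squared   # proportion of variance in y explained by x
# [1] 0.7272727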

Variable Rationalization:
• The data set may have a large number of attributes. But some of those attributes can be
irrelevant or redundant.

• The goal of Variable Rationalization is to improve the Data Processing in an optimal way
through attribute subset selection.

• This process is to find a minimum set of attributes such that dropping the irrelevant attributes does not much affect the utility of the data, and the cost of data analysis can be reduced.

• Mining on a reduced data set also makes the discovered pattern easier to understand.

• As part of Data processing, we use the below methods of Attribute subset selection

– Stepwise Forward Selection

– Stepwise Backward Elimination

– Combination of Forward Selection and Backward Elimination

– Decision Tree Induction.

• All the above methods are greedy approaches for attribute subset selection.

Stepwise Forward Selection:

• This procedure starts with an empty set of attributes as the minimal set.
• The most relevant attributes are chosen (having minimum p-value) and are added to the
minimal set. In each iteration, one attribute is added to a reduced set.

Stepwise Backward Elimination:

• Here all the attributes are considered in the initial set of attributes.

• In each iteration, one attribute is eliminated from the set of attributes whose p-value is
higher than significance level.

Combination of Forward Selection and Backward Elimination:

• The stepwise forward selection and backward elimination are combined so as to select
the relevant attributes most efficiently.

• This is the most common technique which is generally used for attribute selection.
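A minimal sketch of these selection strategies using R's built-in step() function, assuming a hypothetical data frame df with a response y and candidate predictors x1, x2, x3 (note that step() uses AIC rather than p-values to decide which attribute to add or drop):

# full and intercept-only models on the hypothetical data frame df
full_model <- lm(y ~ x1 + x2 + x3, data = df)
null_model <- lm(y ~ 1, data = df)

# stepwise forward selection: start empty, add one attribute per iteration
step(null_model, scope = formula(full_model), direction = "forward")

# stepwise backward elimination: start full, drop one attribute per iteration
step(full_model, direction = "backward")

# combination of forward selection and backward elimination
step(null_model, scope = formula(full_model), direction = "both")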

Decision Tree Induction:

• This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure having nodes denoting a test on an attribute.

• Each branch corresponds to the outcome of a test, and leaf nodes denote a class prediction.

• The attribute that is not the part of tree is considered irrelevant and hence discarded.

Model Building in Data Analytics


• In data analytics, a model refers to a mathematical or computational representation that
captures the relationships between input variables (features) and an output variable
(target).

• Models are used to make predictions, uncover patterns, and derive insights from data.

Types of Models

Predictive Models:
• Purpose: To predict future outcomes based on historical data.

• Examples:

– Regression Models: Used for continuous outcomes (e.g., linear regression, logistic regression).
– Classification Models: Used for categorical outcomes (e.g., decision trees,
support vector machines, neural networks).

Descriptive Models:

• Purpose: To summarize and understand the underlying structure of the data.

• Examples:

– Clustering Models: Group similar observations (e.g., k-means clustering, hierarchical clustering).

– Association Models: Identify relationships between variables (e.g., market basket analysis using the Apriori algorithm).

Prescriptive Models:

• Purpose: To recommend actions based on predictive analytics.

• Examples:

– Optimization Models: Help in decision-making (e.g., linear programming).

– Simulation Models: Analyze complex systems and their behaviors under various
scenarios.

Causal Models:

• Purpose: To identify cause-and-effect relationships.

• Examples:

– Structural Equation Modeling: Examines relationships between variables.

– Time Series Analysis: Can infer causal relationships in temporal data.

Model Building Process


• Define the Problem: Clearly outline what you want to achieve with the model.

• Data Collection: Gather relevant data from various sources.

• Data Preprocessing: Clean and prepare the data for analysis, including handling missing
values, normalization, and encoding categorical variables.

• Exploratory Data Analysis (EDA): Analyze the data to understand distributions, relationships, and trends.
• Feature Engineering: Create new features or modify existing ones to improve model
performance.

• Model Selection: Choose the appropriate modeling technique based on the problem type.

• Model Training: Fit the model to the training data.

• Model Evaluation: Assess model performance using relevant metrics (e.g., accuracy,
precision, recall, R-squared).

• Model Deployment: Implement the model in a production environment for practical use.

• Monitoring and Maintenance: Continuously track model performance and make adjustments as necessary.

Model Building Life Cycle in Data Analytics


• When we come across a business analytics problem, we often proceed towards execution without acknowledging the stumbling blocks.

• Before realizing the misfortunes, we try to implement and predict the outcomes.

• The problem-solving steps involved in the data science model-building life cycle are outlined below.

• Let’s understand every model building step in-depth,

• The data science model-building life cycle includes some important steps to follow.

• The following are the steps to follow to build a Data Model

– Problem Definition

– Hypothesis Generation

– Data Collection

– Data Exploration/Transformation

– Predictive Modelling

– Model Deployment
1. Problem Definition

• The first step in constructing a model is to understand the industrial problem in a more
comprehensive way.

• To identify the purpose of the problem and the prediction target, we must define the
project objectives appropriately.

• Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.

• Remember, excellent results always depend on a better understanding of the problem.

2. Hypothesis Generation

• Hypothesis generation is the guessing approach through which we derive some essential
data parameters that have a significant correlation with the prediction target.

• Your hypothesis research must be in-depth, taking every perspective of all stakeholders into account.

• We search for every suitable factor that can influence the outcome.

• Hypothesis generation focuses on what you can create rather than what is available in the
dataset.

3. Data Collection

• Data collection is gathering data from relevant sources regarding the analytical problem,
then we extract meaningful insights from the data for prediction.
• The data gathered must have:

– Proficiency in answering the hypothesis questions.

– Capacity to elaborate on every data parameter.

– Effectiveness to justify your research.

– Competency to predict outcomes accurately.

4. Data Exploration/Transformation

• The data you collected may be in unfamiliar shapes and sizes.

• It may contain unnecessary features, null values, unanticipated small values, or immense
values.

• So, before applying any algorithmic model to data, we have to explore it first.

• By inspecting the data, we get to understand the explicit and hidden trends in data.

• We find the relation between data features and the target variable.

• Usually, a data scientist invests 60–70% of project time dealing with data exploration alone.

• There are several sub steps involved in data exploration:

1. Feature Identification:

– You need to analyze which data features are available and which ones are not.

– Identify independent and target variables.


– Identify data types and categories of these variables.

2. Univariate Analysis:

• We inspect each variable one by one.

• This kind of analysis depends on the variable type, whether it is categorical or continuous.

– Continuous variable: We mainly look for statistical trends like mean, median,
standard deviation, skewness, and many more in the dataset.

– Categorical variable: We use a frequency table to understand the spread of data for each category. We can measure the counts and frequency of occurrence of values.

3. Multi-variate Analysis:

• Bi-variate and multi-variate analysis help to discover the relation between two or more variables.

• We can find the correlation in the case of continuous variables; in the case of categorical variables, we look for association and dissociation between them.

4. Filling Null Values:

• Usually, the dataset contains null values, which lower the potential of the model.

• With a continuous variable, we fill these null values using the mean or median of that specific column.

• For the null values present in the categorical column, we replace them with the most
frequently occurred categorical value.

• Remember, don't delete those rows, because you may lose information.
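A minimal sketch of this imputation step in R, assuming a hypothetical data frame df with a continuous column age and a categorical column city:

# continuous column: replace missing values with the column mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# categorical column: replace missing values with the most frequent category
most_frequent <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- most_frequent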

5. Predictive Modeling
• Predictive modeling is a mathematical approach to create a statistical model to forecast
future behavior based on input test data.

Steps involved in predictive modeling:

Algorithm Selection:

• When we have the structured dataset, and we want to estimate the continuous or
categorical outcome then we use supervised machine learning methodologies like
regression and classification techniques.
• When we have unstructured data and want to predict the clusters of items to which a
particular input test sample belongs, we use unsupervised algorithms.

• An actual data scientist applies multiple algorithms to get a more accurate model.

Train Model:

• After assigning the algorithm and getting the data handy, we train our model using the
input data applying the preferred algorithm.

• It is an action to determine the correspondence between independent variables, and the


prediction targets.

Model Prediction:

• We make predictions by giving the input test data to the trained model.

• We measure the accuracy by using a cross-validation strategy or an ROC curve, which performs well to derive model output for test data.
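A brief sketch of the train/predict steps in R, assuming a hypothetical data frame df with a continuous target y:

set.seed(42)

# split the hypothetical data frame into training and test sets
train_idx <- sample(nrow(df), size = 0.7 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# train a regression model on the training data, then predict on the test data
model <- lm(y ~ ., data = train)
predictions <- predict(model, newdata = test)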

6. Model Deployment

• There is nothing better than deploying the model in a real-time environment.

• It helps us to gain analytical insights into the decision-making procedure.

• You constantly need to update the model with additional features for customer
satisfaction.

• To predict business decisions, plan market strategies, and create personalized customer
interests, we integrate the machine learning model into the existing production domain.

• When you go through the Amazon website, you notice product recommendations based entirely on your interests.

• You can experience the increase in the involvement of the customers utilizing these
services.

• That's how a deployed model changes the mindset of the customer and convinces them to purchase the product.
Logistic Regression
• Regression models traditionally work with continuous numeric value data for dependent
and independent variables.

• Logistic regression models can, however, work with dependent variables with binary
values, such as whether a loan is approved (yes or no).

• Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables.

• For example, Logistic regression might be used to predict whether a patient has a given
disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body
mass index, results of blood tests, etc.).

• Logistic regression models use probability scores as the predicted values of the dependent variable.

• Logistic regression takes the natural logarithm of the odds of the dependent variable
being a case (referred to as the logit) to create a continuous criterion as a transformed
version of the dependent variable.

• Thus the logit transformation is used in logistic regression as the dependent variable.

• The net effect is that although the dependent variable in logistic regression is binomial
(or categorical, i.e. has only two possible values), the logit is the continuous function
upon which linear regression is conducted.

Logistic Regression Model Theory


• Logistic regression is a supervised learning algorithm that models the probability of a
binary outcome based on one or more predictor variables.

• It is commonly used in classification problems where the goal is to predict whether an observation belongs to one of two possible categories (e.g., yes/no, success/failure, or 1/0).

• Here's a breakdown of the theory behind the logistic regression model:

1. Logistic Function and Sigmoid Curve

• Logistic Function: The key component in logistic regression is the logistic (or sigmoid) function, which maps any real-valued input to a probability value between 0 and 1. The sigmoid function is given by:

σ(z) = 1 / (1 + e^(−z))

• where z = β0 + β1x1 + β2x2 + ⋯ + βnxn represents a linear combination of the predictor variables (x1, x2, …, xn) and the model parameters (coefficients) β.

• S-Shape: The sigmoid function has an S-shaped curve, which smoothly transforms input
values into a range between 0 and 1.

• This is essential for representing probabilities.
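A one-line sketch of the sigmoid function in R, showing how it squashes real-valued inputs into the (0, 1) range:

sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(c(-5, 0, 5))
# [1] 0.006692851 0.500000000 0.993307149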

2. Log-Odds and Probability

• Logistic regression does not directly predict classes. Instead, it predicts the probability of
belonging to a particular class.

• The probability of the outcome y = 1 (e.g., "success" or "yes") is given by:

P(y = 1 | x) = 1 / (1 + e^(−(β0 + β1x1 + ⋯ + βnxn)))

• Log-Odds Transformation: The model is based on the "log-odds," or "logit," transformation, which linearizes the relationship between the predictors and the log-odds of the outcome:

log(p / (1 − p)) = β0 + β1x1 + β2x2 + ⋯ + βnxn

• where p is the probability that y = 1

3. Decision Boundary and Classification

• To classify observations, we typically set a threshold (often 0.5) on the predicted probability.
• This threshold can be adjusted depending on the specific application and the costs
associated with false positives and false negatives.

Let's understand how Logistic Regression works.


• For Linear Regression, where the output is a linear combination of input feature(s), we
write the equation as:

`Y = β0 + β1X + ε`

• In Logistic Regression, we use the same equation but with some modifications made to
Y.

• Let's reiterate a fact about Logistic Regression: we calculate probabilities. And probabilities always lie between 0 and 1.

In other words, we can say:

1. The response value must be positive.

2. It should be lower than 1.

First, we'll meet the above two criteria. We know the exponential of any value is always a positive number. And, any number divided by that number + 1 will always be lower than 1. Let's implement these two findings:

P(Y=1|X) = exp(β0 + β1X) / (1 + exp(β0 + β1X))

• This is the logistic function.

• Now we are convinced that the probability value will always lie between 0 and 1.

• To determine the link function, follow the algebraic calculations carefully. P(Y=1|X) can
be read as "probability that Y =1 given some value for x."

• Y can take only two values, 1 or 0.

• Rearranging the logistic function and taking the natural logarithm of the odds gives:

ln( P(Y=1|X) / (1 − P(Y=1|X)) ) = β0 + β1X

• As you might recognize, the right side of the equation above depicts the linear combination of independent variables.

• The left side is known as the log-odds or odds ratio or logit function and is the link function for Logistic Regression.

• This link function follows a sigmoid (shown below) function which limits its range of
probabilities between 0 and 1.

• In Multiple Regression, we use the Ordinary Least Square (OLS) method to determine
the best coefficients to attain good model fit.
• In Logistic Regression, we use maximum likelihood method to determine the best
coefficients and eventually a good model fit.

• Maximum likelihood works like this: It tries to find the value of coefficients (βo,β1) such
that the predicted probabilities are as close to the observed probabilities as possible.

• In other words, for a binary classification (1/0), maximum likelihood will try to find
values of βo and β1 such that the resultant probabilities are closest to either 1 or 0.

• The likelihood function is written as:

L(β0, β1) = Π p(xi)^yi * (1 − p(xi))^(1 − yi)

where p(xi) = P(Y = 1 | X = xi) and the product runs over all observations.
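As an illustrative sketch (not necessarily the optimizer R uses internally), maximum likelihood can be carried out by minimizing the negative log-likelihood with optim(); the x and y vectors below are made-up data:

# hypothetical binary data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 1, 0, 1, 1)

# negative log-likelihood of the logistic model
neg_log_lik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit <- optim(c(0, 0), neg_log_lik)
fit$par   # estimated (B0, B1), close to coef(glm(y ~ x, family = binomial))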

How can you evaluate Logistic Regression model fit and accuracy?
• In Linear Regression, we check adjusted R², F Statistics, MAE, and RMSE to evaluate
model fit and accuracy.

• But, Logistic Regression employs all different sets of metrics.

• Here, we deal with probabilities and categorical values.

Following are the evaluation metrics used for Logistic Regression:

1. Akaike Information Criteria (AIC)

• You can look at AIC as counterpart of adjusted r square in multiple regression.

• It's an important indicator of model fit.

• It follows the rule: Smaller the better.

• AIC penalizes an increasing number of coefficients in the model.

• In other words, adding variables that do not improve the fit will increase AIC.

• It helps to avoid overfitting.

• Looking at the AIC metric of one model wouldn't really help.

• It is more useful in comparing models (model selection).


• So, build 2 or 3 Logistic Regression models and compare their AIC.

• The model with the lowest AIC will be relatively better.
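A short sketch of comparing models by AIC in R, assuming a hypothetical data frame df with a binary target y and predictors x1, x2:

model1 <- glm(y ~ x1, data = df, family = binomial)
model2 <- glm(y ~ x1 + x2, data = df, family = binomial)

AIC(model1, model2)   # the model with the lower AIC is relatively better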

2. Null Deviance and Residual Deviance

• Deviance of an observation is computed as -2 times log likelihood of that observation.

• The importance of deviance can be further understood using its types: Null and Residual
Deviance.

• Null deviance is calculated from the model with no features, i.e., only the intercept.

• The null model predicts class via a constant probability.

• Residual deviance is calculated from the model having all the features.

• In comparison with Linear Regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS).

• The larger the difference between null and residual deviance, the better the model.

• Also, you can use these metrics to compare multiple models: whichever model has a lower residual deviance explains the data better and is a better model.

• Also, lower the residual deviance, better the model.

• Practically, AIC is usually given preference over deviance to evaluate model fit.
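In R, both quantities are reported for a fitted glm; a minimal sketch, assuming the hypothetical model2 from the AIC example above:

model2$null.deviance   # deviance of the intercept-only model
model2$deviance        # residual deviance of the fitted model
summary(model2)        # prints both, along with the AIC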

3. Confusion Matrix

• Confusion matrix is the most crucial metric commonly used to evaluate classification
models.

• It's quite confusing but make sure you understand it by heart.

• The skeleton of a confusion matrix looks like this:

                   Predicted: 1        Predicted: 0
  Actual: 1        True Positive       False Negative
  Actual: 0        False Positive      True Negative

• As you can see, the confusion matrix avoids "confusion" by measuring the actual and predicted values in a tabular format.

• In the table above, Positive class = 1 and Negative class = 0.

Following are the metrics we can derive from a confusion matrix:

Accuracy –

• It determines the overall predicted accuracy of the model.

• It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

True Positive Rate (TPR) –

• It indicates how many positive values, out of all the positive values, have been correctly predicted.

• The formula to calculate the true positive rate is TP / (TP + FN). Also, TPR = 1 – False Negative Rate.

• It is also known as Sensitivity or Recall.

False Positive Rate (FPR) –

• It indicates how many negative values, out of all the negative values, have been incorrectly predicted.

• The formula to calculate the false positive rate is FP / (FP + TN).

• Also, FPR = 1 – True Negative Rate.

True Negative Rate (TNR) –

• It indicates how many negative values, out of all the negative values, have been correctly predicted.

• The formula to calculate the true negative rate is TN / (TN + FP).

• It is also known as Specificity.

False Negative Rate (FNR) –

• It indicates how many positive values, out of all the positive values, have been incorrectly predicted.

• The formula to calculate the false negative rate is FN / (FN + TP).


Precision –

• It indicates how many values, out of all the predicted positive values, are actually positive.

• It is formulated as TP / (TP + FP).

F Score –

• The F score is the harmonic mean of precision and recall. It lies between 0 and 1.

• The higher the value, the better the model. It is formulated as 2 * ((precision * recall) / (precision + recall)).
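A minimal sketch of deriving these metrics in R from made-up actual and predicted class vectors:

actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)

cm <- table(Actual = actual, Predicted = predicted)   # confusion matrix
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                   # 0.8
recall    <- TP / (TP + FN)                                    # TPR / Sensitivity: 0.8
precision <- TP / (TP + FP)                                    # 0.8
f_score   <- 2 * (precision * recall) / (precision + recall)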

4. Receiver Operator Characteristic (ROC)

• ROC determines the accuracy of a classification model at a user defined threshold value.

• It determines the model's accuracy using Area Under Curve (AUC).

• The area under the curve (AUC), also referred to as index of accuracy (A) or concordant
index, represents the performance of the ROC curve.

• Higher the area, better the model.

• ROC is plotted between True Positive Rate (Y axis) and False Positive Rate (X Axis).

• In this plot, our aim is to push the ROC curve toward the top-left corner and maximize the area under the curve.

• The higher the curve, the better the model.

• The reference line represents the ROC curve at the 0.5 threshold; at this point, sensitivity = specificity.
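A small sketch of computing ROC/AUC in R, assuming the pROC package is installed and using made-up actual labels and predicted probabilities:

library(pROC)

actual <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35)

roc_obj <- roc(actual, prob)   # builds the ROC curve
auc(roc_obj)                   # area under the curve
plot(roc_obj)                  # plots the ROC curve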


Steps in Logistic Regression Model Construction
Define the Problem:

• Clearly specify the binary outcome you want to predict.

Data Collection:

• Gather and prepare the relevant dataset that includes the target variable and the predictor
variables.

Data Preprocessing:

• Cleaning: Handle missing values and outliers.

• Encoding: Convert categorical variables into a suitable format (e.g., one-hot encoding).

• Feature Scaling: Although not always necessary for logistic regression, scaling can help
in interpreting coefficients.

Exploratory Data Analysis (EDA):

• Analyze the data to understand distributions, relationships, and potential multicollinearity among predictors.

Model Training:

• Split the data into training and testing sets.

• Fit the logistic regression model using the training data.

Model Evaluation:

• Evaluate the model using appropriate metrics:

– Confusion Matrix: To visualize true vs. predicted classifications.

– Accuracy: The proportion of correct predictions.

– Precision, Recall, F1-Score: Especially important in imbalanced datasets.

– ROC Curve and AUC: To assess the trade-off between true positive rate and
false positive rate.

Model Interpretation:

• Interpret the coefficients to understand the impact of each predictor on the odds of the
outcome.
• A positive coefficient increases the odds, while a negative coefficient decreases them.

Model Validation:

• Use cross-validation techniques to ensure the model's robustness.

Deployment:

• Deploy the model for real-time predictions or batch processing, integrating it into
existing systems.

Monitoring and Maintenance:

• Regularly monitor the model’s performance and update it as needed to ensure it remains
accurate over time.
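A condensed sketch of these construction steps in R, assuming a hypothetical data frame df with a binary target purchased and several predictor columns:

set.seed(42)

# split into training and test sets
train_idx <- sample(nrow(df), size = 0.7 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# fit the logistic regression model on the training data
model <- glm(purchased ~ ., data = train, family = binomial)
summary(model)   # coefficients, deviance, AIC

# predict probabilities on the test set and apply a 0.5 threshold
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# evaluate with a confusion matrix
table(Actual = test$purchased, Predicted = pred)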

Analytics applications to various Business Domains


• Logistic regression is widely used for classification tasks where the goal is to predict a binary or multi-class outcome.

• Here are some key applications of logistic regression across various domains:

1. Healthcare and Medical Diagnosis

2. Finance and Credit Scoring

3. Marketing and Customer Analytics

4. Human Resources (HR) Analytics

5. Telecommunications

6. Retail and E-commerce

7. Social Media and Sentiment Analysis

8. Political and Social Sciences

9. Insurance Industry

10. Cyber security


1. Healthcare and Medical Diagnosis

• Disease Prediction: Logistic regression is often used to assess the likelihood of diseases,
such as heart disease, diabetes, or cancer, based on patient data. For example, logistic
regression can analyze factors like age, blood pressure, cholesterol levels, and more to
predict if a patient is at risk.

• Patient Outcome Prediction: It can predict patient outcomes like survival rates or the
probability of hospital readmission based on previous history, demographics, and clinical
variables.

2. Finance and Credit Scoring

• Loan Default Prediction: Banks and financial institutions use logistic regression to
predict the likelihood of a customer defaulting on a loan. This model can use features like
credit score, income level, and employment history to estimate risk.

• Credit Scoring: Logistic regression helps classify individuals into different credit risk
categories (e.g., low-risk or high-risk borrowers) based on financial behavior and
historical data.

• Fraud Detection: Logistic regression is used to identify fraudulent transactions by assessing the probability that a transaction is legitimate or fraudulent based on features such as transaction amount, location, and time.

3. Marketing and Customer Analytics

• Customer Churn Prediction: Logistic regression helps companies predict if a customer will stop using a product or service based on usage patterns, customer demographics, and interactions with the company.

• Customer Segmentation: In marketing, logistic regression is used to categorize customers by their likelihood of responding to promotional offers, converting to a subscription, or purchasing a product. This helps target the right customers with tailored marketing efforts.

• Predicting Click-Through Rates (CTR): Logistic regression can predict the probability
of a user clicking on an ad based on features like time, device type, and past behavior.
This enables effective ad placements and marketing spend optimization.

4. Human Resources (HR) Analytics

• Employee Retention: Logistic regression is used to predict the likelihood of employees leaving a company based on factors such as job satisfaction, years of experience, compensation, and work environment.

• Recruitment Screening: It can help classify job applicants as suitable or not for a specific role based on factors like skills, experience, and past performance.

5. Telecommunications

• Churn Prediction: In the telecommunications industry, logistic regression is commonly applied to predict customer churn. Telecom companies analyze user data such as usage frequency, call patterns, and support interactions to determine the likelihood of customer churn.

• Service Upgrade Prediction: Telecoms can use logistic regression to predict the
likelihood of a customer opting for service upgrades or additional features, allowing for
targeted offers and personalized recommendations.

6. Retail and E-commerce

• Purchase Probability Prediction: Logistic regression is used to estimate the likelihood of a customer purchasing a product after browsing through an e-commerce site. This helps improve recommendation engines and targeted advertising.

• Product Return Prediction: By analyzing purchase patterns, reviews, and historical return behavior, logistic regression can help predict if a product is likely to be returned, allowing companies to identify potential issues and improve quality control.

• Customer Segmentation: Classifying customers based on purchasing behavior, loyalty, and other features can help tailor marketing strategies and personalize customer experiences.

7. Social Media and Sentiment Analysis

• Sentiment Classification: Logistic regression is used to classify text data (like social
media posts or customer reviews) into sentiment categories, such as positive, neutral, or
negative. This is valuable for brand sentiment analysis.

• Fake News Detection: Logistic regression can classify articles as real or fake based on
language patterns, sources, and other textual features.

• User Behavior Prediction: Predicting the likelihood of a user engaging with or sharing
certain content can be helpful for optimizing content strategy and social media targeting.

8. Political and Social Sciences

• Election Outcome Prediction: Logistic regression is used to predict the likelihood of a candidate winning based on demographic, socioeconomic, and voting history data.

• Public Opinion Analysis: Logistic regression can help classify responses in surveys (e.g., "for" or "against" a policy) based on demographic and other features to understand public opinion.

9. Insurance Industry

• Claim Prediction: Logistic regression is used to predict the likelihood of policyholders filing an insurance claim, helping insurers identify potential risks.

• Risk Classification: By analyzing policyholder data, logistic regression can classify customers into different risk categories, helping with underwriting decisions and premium calculation.

10. Cyber security

• Intrusion Detection: Logistic regression helps identify whether network activity is legitimate or malicious, based on traffic patterns and user behavior.

• Spam Detection: Email providers use logistic regression to classify emails as spam or
not based on features like sender details, frequency of specific keywords, and attachment
types.

Pros and Cons of Logistic Regression


• Many of the pros and cons of the linear regression model also apply to the logistic
regression model.

• Although Logistic Regression is used widely by many people for solving various types of problems, it can fail to hold up its performance due to its various limitations, and other predictive models may provide better predictive results.

Pros

• The logistic regression model not only acts as a classification model, but also gives you
probabilities.

• This is a big advantage over other models where they can only provide the final
classification.

• Knowing that an instance has a 99% probability for a class compared to 51% makes a big
difference.

• Logistic Regression performs well when the dataset is linearly separable.


• Logistic Regression not only gives a measure of how relevant a predictor (coefficient
size) is, but also its direction of association (positive or negative).

• We see that Logistic Regression is easy to implement and interpret, and very efficient to train.

Cons

• Logistic regression can suffer from complete separation.

• If there is a feature that would perfectly separate the two classes, the logistic regression
model can no longer be trained.

• This is because the weight for that feature would not converge, because the optimal
weight would be infinite.

• This is really a bit unfortunate, because such a feature is really very useful.

• But you do not need machine learning if you have a simple rule that separates both
classes.

• The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of the weights.

• Logistic regression is less prone to overfitting, but it can overfit in high-dimensional datasets; in that case, regularization techniques should be considered to avoid overfitting in such scenarios.
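As a brief sketch of such regularization, the glmnet package (assuming it is installed) fits penalized logistic regression; x_matrix and y below stand for a hypothetical numeric predictor matrix and binary target:

library(glmnet)

# x_matrix: numeric matrix of predictors (e.g., built with model.matrix), y: binary target
ridge_fit <- glmnet(x_matrix, y, family = "binomial", alpha = 0)   # L2 (ridge) penalty
lasso_fit <- glmnet(x_matrix, y, family = "binomial", alpha = 1)   # L1 (lasso) penalty

# choose the penalty strength lambda by cross-validation
cv_fit <- cv.glmnet(x_matrix, y, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")   # coefficients at the best lambda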
