SOL Study Material
Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education under the
aegis of Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007
Printed by:
School of Open Learning, University of Delhi
Lesson 2: Predictive Analytics
2.1 Learning Objectives
2.2 Introduction
2.3 Classical Linear Regression Model
2.4 Multiple Linear Regression Models
2.5 Practical Exercise using R/Python Programming:
2.6 Summary
3.11 AUC
3.12 Summary
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to relevant statistical software packages
1.9 Summary
1.10 Glossary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Reading
1.5 TYPES OF BUSINESS ANALYTICS
Business analytics can be divided into four primary categories of increasing complexity. Each brings an organisation a step closer to applying scenario-based insights to present and future decisions. Each of these categories is described below.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive Analytics: It summarises the data an organisation already has in order to understand what has happened in the past or is happening right now. Descriptive analytics is the simplest type of analytics and relies on data aggregation and data mining techniques. It increases the availability of data to an organisation's stakeholders, including shareholders, marketing executives, and sales managers. It can help discover strengths and weaknesses and give information about customer behaviour, which aids in developing targeted marketing strategies.
2. Diagnostic Analytics: This kind of analytics helps shift attention from past performance to present occurrences and identifies the variables influencing trends. Drill-down, data mining, and other techniques are used to find the underlying cause of occurrences. Diagnostic analytics uses probabilities and likelihoods to determine the most likely causes of the outcomes observed.
1.6 INTRODUCTION TO THE CONCEPTS OF BIG DATA
ANALYTICS
Big data consists of enormous volumes of information that cannot be processed or stored using conventional data processing or storage methods. It generally comes in three distinct forms.
Structured data, as the name implies, has a clear structure and follows a regular
sequence. A person or machine may readily access and utilise this type of
information since it has been intended to be user-friendly. Structured data is
typically kept in databases, especially relational database management
systems, or RDBMS, and tables with clearly defined rows and columns, such
as spreadsheets.
While semi-structured data displays some of the same characteristics as
structured data, for the most part it lacks a clear structure and cannot adhere to
the formal specifications of data models like an RDBMS.
Unstructured data does not adhere to the formal structural norms of traditional data models and lacks a consistent structure across its different forms. In some cases, it may carry limited metadata, such as date and time information.
1.6.1 Large-scale Data Management Traits
According to traditional definitions of the term, big data is typically linked to three
essential traits:
Volume: This characteristic refers to the massive amounts of information produced every second by social media, mobile devices, automobiles, transactions, connected sensors, photos, video, and text. These volumes, measured in terabytes, petabytes, or even zettabytes, can only be handled by big data technologies.
Variety: Information in the form of images, audio streams, video, and many other formats now adds a diversity of data types, around 80% of which are completely unstructured, to the existing landscape of transactional and demographic data such as phone numbers and addresses.
Velocity: This attribute refers to the phenomenal rate at which information accumulates and flows into data repositories. It also describes how quickly massive data can be analysed and processed to draw out the insights and patterns it contains; increasingly, that speed is real-time.
1.6.2 Services for Big Data Management
Organisations can pick from a wide range of big data management options when it comes to technology. Big data management solutions can be standalone or multi-featured, and many businesses employ several of them. The following are some of the most popular kinds of big data management capabilities:
Data cleansing: finding and resolving problems in data sets.
Data integration: merging data from several sources.
Data preparation: preparing data for use in analytics or other applications.
Data enrichment: enhancing data by adding new data sets, fixing minor errors, or extrapolating new information from raw data.
Data migration: moving data from one environment to another, such as from internal data centres to the cloud.
Data analytics: analysing data using a variety of techniques in order to gain insights.
1.9 SUMMARY
The disciplines of management, business, and computer science are all combined in
business analytics. The commercial component requires knowledge of the industry at
a high level as well as awareness of current practical constraints. An understanding of
data, statistics, and computer science is required for the analytical portion. Business
analysts can close the gap between management and technology thanks to this
confluence of disciplines. Business analytics also includes effective problem-solving
and communication to translate data insights into information that is understandable
to executives. A related field called business intelligence likewise uses data to better understand and inform businesses. What distinguishes business analytics is its forward-looking emphasis on predicting outcomes and prescribing actions, whereas business intelligence focuses primarily on describing and reporting what has already happened.
1.13 REFERENCES
LESSON 2
PREDICTIVE ANALYTICS
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
(University of Delhi)
[email protected]
STRUCTURE
2.2 INTRODUCTION
In this chapter, we will explore the field of predictive analytics, focusing on two fundamental techniques: Simple Linear Regression and Multiple Linear Regression. Predictive analytics is a powerful tool for analysing data and making predictions about future outcomes. We will also learn how to estimate, interpret, and validate these models through practical exercises in R and Python.
2.3.1. Introduction
Predictive analytics is the use of statistical techniques, machine learning algorithms,
and other tools to identify patterns and relationships in historical data and use them to
make predictions about future events. These predictions can be used to inform
decision-making in a wide variety of areas, such as business, marketing, healthcare,
and finance.
Linear regression is a traditional statistical technique used to model the relationship between one or more independent variables and a dependent variable.
Linear regression involving only two variables is called simple linear regression. Let us consider two variables, 'x' and 'y'. Here 'x' represents the independent (explanatory) variable and 'y' represents the dependent (response) variable. The dependent variable must be a ratio variable, whereas the independent variable can be a ratio or a categorical variable. A regression model can be built for cross-sectional data or for time-series data. In a time-series regression model, time is taken as the independent variable, which makes the model very useful for forecasting. Before we develop a regression model, it is a good exercise to ensure that the two variables are linearly related. For this, plotting a scatter diagram is really helpful: a linear pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework used to
analyse the relationship between a dependent variable and one or more independent
variables. It is a widely used method in econometrics and other fields to study and
understand the nature of this relationship, make predictions, and test hypotheses.
Regression analysis aims to examine how changes in the independent variable(s)
affect the dependent variable. The CLRM assumes a linear relationship between the
dependent variable (Y) and the independent variable(s) (X), allowing us to estimate
the parameters of this relationship and make predictions.
The regression equation in the CLRM is expressed as:
Yi = α + βXi + μi
where α is the intercept, β is the slope coefficient, and μi is the random error (disturbance) term for observation i.
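As a practical illustration of estimating α and β, the following minimal sketch fits a simple linear regression by OLS with Python's statsmodels library; the variable names and data values are hypothetical and only serve to show the mechanics.
Using Python:
import pandas as pd
import statsmodels.api as sm

# Hypothetical cross-sectional data: advertising spend (x) and sales (y)
data = pd.DataFrame({
    "x": [10, 12, 15, 18, 20, 22, 25, 28, 30, 33],
    "y": [25, 28, 33, 39, 42, 46, 52, 57, 60, 66],
})

# Add the intercept term so the fitted model is Y = alpha + beta*X + error
X = sm.add_constant(data["x"])
model = sm.OLS(data["y"], X).fit()

print(model.params)     # estimated alpha (const) and beta (x)
print(model.summary())  # standard errors, t-statistics, p-values, R-squared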
Impact of multicollinearity:
Unbiasedness: The Ordinary Least Squares (OLS) estimators remain unbiased.
Precision: OLS estimators have large variances and covariances, making precise
estimation difficult and leading to wider confidence intervals. Statistically insignificant
coefficients may be observed.
High R-squared: The R-squared value can still be high, even with statistically
insignificant coefficients.
Sensitivity: OLS estimators and their standard errors are sensitive to small changes
in the data.
Efficiency: Despite increased variance, OLS estimators are still efficient, meaning
they have minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient estimates and can
lead to unreliable statistical inference. While the OLS estimators remain unbiased, they
become imprecise, resulting in wider confidence intervals and potential insignificance
of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation Factor (VIF)
and explore strategies to address this issue, ensuring the accuracy and interpretability
of the regression model.
VIF stands for Variance Inflation Factor, a measure used to assess multicollinearity in a multiple regression model. VIF quantifies how much the variance of an estimated regression coefficient is increased due to multicollinearity. It measures how much the variance of one independent variable's estimated coefficient is inflated by the presence of the other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 − Rj²)
where Rj² represents the coefficient of determination (R-squared) from a regression model that regresses Xj on all the other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between Xj and the
other independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity, and it is
generally considered problematic.
When assessing multicollinearity, it is common to examine the VIF values for all
independent variables in the model. If any variables have high VIF values, it indicates
that they are highly correlated with the other variables, which may affect the reliability
and interpretation of the regression coefficients.
If high multicollinearity is detected (e.g., VIF greater than 5), some steps can be taken
to address it:
Remove one or more of the highly correlated independent variables from the model.
Combine or transform the correlated variables into a single variable.
Obtain more data to reduce the correlation among the independent variables.
By addressing multicollinearity, the stability and interpretability of the regression model
can be improved, allowing for more reliable inferences about the relationships between
the independent variables and the dependent variable.
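As a rough sketch of how the VIF can be computed in practice, the snippet below uses the variance_inflation_factor helper from statsmodels on a small synthetic dataset in which one predictor is deliberately constructed from the others; the data and variable names are illustrative only.
Using Python:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is built from x1 and x2, so it is highly collinear
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF(Xj) = 1 / (1 - Rj^2), computed for each predictor (the constant is skipped)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)  # x1 and x3 should show very large VIFs, flagging multicollinearity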
HOW TO DETECT MULTICOLLINEARITY
To detect multicollinearity in your regression model, you can examine the pairwise correlation matrix of the independent variables, compute the Variance Inflation Factor (VIF) for each independent variable, and run auxiliary regressions of each independent variable on the others. These steps are demonstrated in the practical exercise later in this lesson.
HOW TO DETECT AUTOCORRELATION
Autocorrelation in the residuals of a regression model can be detected using the following methods:
Graphical Method
Durbin-Watson test
Breusch-Godfrey test
1. Graphical Method
Autocorrelation can be detected using graphical methods. Here are a few graphical techniques to identify autocorrelation:
Figure 1.2: Autocorrelation and partial autocorrelation function (ACF and PACF) plots, prior to differencing (A and B) and after differencing (C and D)
In both the PACF and ACF plots, significance can be determined by comparing the
correlation values against the confidence intervals. If the correlation values fall outside
the confidence intervals, it suggests the presence of autocorrelation.
It's important to note that these graphical methods provide indications of autocorrelation, but further statistical tests, such as the Durbin-Watson test or Ljung-Box test, should be conducted to confirm and quantify the autocorrelation in the model.
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation in the
residuals of a regression model. It is specifically designed for detecting first-order
autocorrelation, which is the correlation between adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ (e_i − e_(i−1))² / Σ e_i²
where:
· e_i is the residual for observation i, and
· e_(i−1) is the residual for the previous observation (i−1).
The numerator is summed over i = 2, …, n and the denominator over i = 1, …, n.
The test statistic is then compared to critical values to determine the presence of
autocorrelation. The critical values depend on the sample size, the number of
independent variables in the regression model, and the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The test statistic
is calculated based on the residuals of the regression model and is interpreted as
follows:
A value of d close to 2 indicates no significant autocorrelation. It suggests that the
residuals are independent and do not exhibit a systematic relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests that there is a
positive relationship between adjacent residuals, meaning that if one residual is high,
the next one is likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests that there is
a negative relationship between adjacent residuals, meaning that if one residual is
high, the next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation, and the
closer it is to 4, the greater is the evidence of negative autocorrelation. If d is about 2,
there is no evidence of positive or negative (first-) order autocorrelation.
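The following minimal Python sketch shows the Durbin-Watson statistic computed both directly from its formula and with the durbin_watson helper in statsmodels; the residual values below are hypothetical.
Using Python:
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.5, 0.6, 0.4, 0.7, 0.3, -0.2, -0.4, -0.3, -0.5, -0.6])

# d = sum((e_i - e_(i-1))^2) / sum(e_i^2)
d_manual = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(d_manual)                 # well below 2, suggesting positive autocorrelation

# The same statistic via statsmodels
print(durbin_watson(residuals))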
Using R:
# Load the necessary libraries
library(dplyr)

# Read the dataset
data <- read.csv("your_dataset.csv")

# Perform the OLS regression
model <- lm(Y ~ X, data = data)

# Print the summary of the regression results
summary(model)
p-values: The regression results also provide p-values for the
coefficients. These p-values indicate the statistical significance of the coefficients.
Generally, a p-value less than a significance level (e.g., 0.05) suggests that the
coefficient is statistically significant, implying a relationship between the independent
variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the proportion of the
variance in the dependent variable that can be explained by the independent
variable(s). It ranges from 0 to 1, with higher values indicating a better fit of the
regression model to the data. R-squared can be interpreted as the percentage of the
dependent variable's variation explained by the independent variable(s).
Residuals: The regression results also include information about the residuals, which
are the differences between the observed values of the dependent variable and the
predicted values from the regression model. Residuals should ideally follow a normal
distribution with a mean of zero, and their distribution can provide insights into the
model's goodness of fit and potential violations of the regression assumptions.
It's important to note that interpretation may vary depending on the specific context and
dataset. Therefore, it's essential to consider the characteristics of your data and the
objectives of your analysis while interpreting the results of an OLS regression.
Exercise 2. Test the assumptions of OLS (multicollinearity, autocorrelation, normality
etc.) on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity, autocorrelation, and
normality, you can use various diagnostic tests in R or Python. Here are the steps and
some commonly used tests for each assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent variables using the cor() function in R or the corrcoef() function in Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent variable using the vif() function from the "car" package in R or the variance_inflation_factor() function from the "statsmodels" library in Python. VIF values greater than 10 indicate high multicollinearity.
Step 3: Perform auxiliary regressions by regressing each independent variable against the remaining independent variables to identify highly collinear variables.
Autocorrelation:
This dataset consists of three columns: y represents the dependent variable, and x1
and x2 are the independent variables. Each row corresponds to an observation in the
dataset.
We can use this dataset to run the provided code and perform diagnostic tests on the
OLS regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
# Generate a random dataset with two independent variables
# (the data-generating values below are illustrative; the originals are not
# reproduced in this extract)
np.random.seed(0)
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
y = 2 + 1.5 * x1 - 0.8 * x2 + np.random.normal(size=100)

# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})

# Fit the OLS regression model
X = sm.add_constant(data[['x1', 'x2']])
results = sm.OLS(data['y'], X).fit()

# Diagnostic tests
print("Multicollinearity:")
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

print("\nAutocorrelation:")
residuals = results.resid
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()

print("Durbin-Watson test:")
dw_statistic = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw_statistic:.3f}")
print("\nNormality of Residuals:")
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
shapiro_test = sm.stats.shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test[1]}")
In this example, we generated a random dataset with two independent variables (x1
and x2) and a dependent variable (y). We fit an OLS regression model using the
statsmodels library. Then, we perform diagnostic tests for multicollinearity,
autocorrelation, and normality of residuals.
The code calculates the VIF for each independent variable, plots the residuals against
the fitted values, performs the Durbin-Watson test for autocorrelation, and plots a
histogram of the residuals. Additionally, the Shapiro-Wilk test is conducted to check
the normality of residuals.
We can run this code in a Python environment to see the results and interpretations
for each diagnostic test based on the random dataset provided.
3. Perform regression analysis with categorical/dummy/qualitative variables on
R/Python.
import pandas as pd
import statsmodels.api as sm
# Create a DataFrame with the data
data = {
    'y': [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032, 5.56,
          5.3514, 5.8457],
    # the 'x1', 'x2' and 'category' columns of the original dataset are not
    # reproduced in this extract
}
df = pd.DataFrame(data)
In this example, we have created a DataFrame df with the y, x1, x2, and category
variables. The category variable is converted into dummy variables using the
get_dummies function, and the category A column is dropped to avoid multicollinearity.
We then define the dependent variable y and the independent variables X, including
the dummy variable category_B. A constant term is added to the independent variables
using sm.add_constant. Finally, we fit the OLS model using sm.OLS and print the
summary of the regression results using model.summary(). The regression analysis
provides the estimated coefficients, standard errors, t-statistics, and p-values for each
independent variable, including the dummy variable category B.
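The following self-contained sketch reproduces the procedure just described on a small illustrative dataset; the values and column names below are hypothetical stand-ins, not the full dataset used in the exercise above.
Using Python:
import pandas as pd
import statsmodels.api as sm

# Hypothetical data with a two-level categorical predictor
df = pd.DataFrame({
    "y":        [3.4, 5.6, 8.2, 2.5, 3.9, 5.5, 6.4, 8.1, 5.6, 5.4],
    "x1":       [1.2, 2.3, 3.1, 0.8, 1.5, 2.2, 2.9, 3.4, 2.1, 2.0],
    "category": ["A", "B", "B", "A", "A", "B", "B", "B", "A", "B"],
})

# Convert the categorical variable into a dummy; drop one level ("A") to
# avoid the dummy-variable trap (perfect multicollinearity)
dummies = pd.get_dummies(df["category"], prefix="category", drop_first=True, dtype=float)

X = sm.add_constant(pd.concat([df[["x1"]], dummies], axis=1))
model = sm.OLS(df["y"], X).fit()

# The coefficient on category_B measures the shift in y for category B
# relative to the baseline category A, holding x1 constant
print(model.summary())
The drop_first=True argument plays the role of dropping the category A column, so category A becomes the baseline against which category_B is measured.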
IN-TEXT QUESTIONS AND ANSWERS
2.6 SUMMARY
2.8 REFERENCES
1. Kumar, U. D. (2017). Business Analytics: The Science of Data-Driven Decision Making (1st ed.). Wiley India.
LESSON 3
LOGISTIC AND MULTINOMIAL REGRESSION
Anurag Goel
Assistant Professor, CSE Dept.
Delhi Technological University, New Delhi
Email-Id: [email protected]
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer-Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Glossary
3.14 Answers to In-Text Questions
3.15 Self-Assessment Questions
3.16 References
3.17 Suggested Readings
The Omnibus test statistic is the difference between the deviances of the two models:
χ² = Dr − Df
where Dr represents the deviance of the reduced model (without predictors) and Df represents the deviance of the full model (with predictors).
The Omnibus test statistic approximately follows chi-square distribution with degrees
of freedom given by the difference in the number of predictors between the full and
reduced models. By comparing the test statistic to the chi-square distribution and
calculating the associated p-value, we can calculate the collective statistical
significance of the predictor variables.
When the calculated p-value is lower than a predefined significance level (e.g., 0.05),
we reject the null hypothesis, indicating that the group of predictor variables collectively
has a statistically significant influence on the dependent variable. On the other hand,
if the p-value exceeds the significance level, we fail to reject the null hypothesis,
suggesting that the predictors may not have a significant collective effect.
The Omnibus test provides a comprehensive assessment of the overall significance of
the predictor variables within a regression model, aiding in the understanding of how
these predictors jointly contribute to explaining the variation in the dependent variable.
Using statistical software, we obtain the estimated coefficients and the deviances of the reduced and full models:
Deviance_reduced = 15.924
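As a sketch of how the Omnibus (likelihood-ratio) test can be obtained in practice, statsmodels reports the statistic and its p-value directly when a logistic regression is fitted; the data below are randomly generated purely for illustration.
Using Python:
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two predictors and a binary outcome
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Omnibus (likelihood-ratio) test: deviance of the reduced model minus
# deviance of the full model, compared to a chi-square distribution
print("LR chi-square:", model.llr)        # 2 * (llf_full - llf_null)
print("degrees of freedom:", model.df_model)
print("p-value:", model.llr_pvalue)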
The Wald test statistic is computed as:
W = (β − β₀)² / Var(β)
where β is the estimated coefficient for the predictor variable of interest, β₀ is the hypothesized value of the coefficient under the null hypothesis (typically 0 for testing whether the coefficient is zero), and Var(β) is the estimated variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the degrees of
freedom are set to 1 (since we are testing a single parameter) to obtain the associated
p-value. Rejecting the null hypothesis occurs when the calculated p-value falls below
a predetermined significance level (e.g., 0.05), indicating that the predictor variable
has a statistically significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of predictor variables
by testing whether their coefficients significantly deviate from zero. It is a valuable tool
for identifying which variables have a meaningful impact on the outcome of interest in
a regression model.
Let's consider an example where we have a logistic regression model with two predictor
variables (X1 and X2) and a binary outcome variable (Y). We want to assess the
significance of the coefficient for each predictor using the Wald test.
Here is a sample dataset with the predictor variables and the binary outcome variable:
X1   X2   Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
For X1:
W₁ = (0.921 − 0)² / (0.512)² ≈ 3.236
For X2:
W₂ = (0.372 − 0)² / (0.295)² ≈ 1.590
Step 3: Conduct the Hypothesis Test
To assess the statistical significance of each predictor, we compare the Wald test
statistic for each variable to the chi-square distribution with 1 degree of freedom
(since we are testing a single parameter).
By referring to the chi-square distribution table or using statistical software, we determine the p-value associated with each Wald test statistic: for W₁ ≈ 3.236 the p-value is approximately 0.072, and for W₂ ≈ 1.590 it is approximately 0.207.
Step 4: Interpret the Results
For X1, since the p-value (approximately 0.072) is larger than the predetermined significance level (e.g., 0.05), we fail to reject the null hypothesis. This suggests that the coefficient for X1 is not statistically significantly different from zero, indicating that X1 may not have a significant effect on the binary outcome variable Y.
Similarly, for X2, since the p-value (approximately 0.207) is larger than the significance level, we fail to reject the null hypothesis. This suggests that the coefficient for X2 is not statistically significantly different from zero, indicating that X2 may not have a significant effect on the binary outcome variable Y.
In summary, based on the Wald tests, we do not have sufficient evidence to conclude that either X1 or X2 has a significant impact on the binary outcome variable in the logistic regression model.
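A minimal sketch of the same calculation in Python, fitting a logistic regression to the small dataset above with statsmodels and forming the Wald statistic from each estimated coefficient and its standard error; note that the estimates obtained from these ten observations need not equal the illustrative figures quoted in the worked example.
Using Python:
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# The sample dataset from the example above
df = pd.DataFrame({
    "X1": [2.5, 3.2, 1.8, 2.9, 3.5, 2.1, 2.7, 3.9, 2.4, 2.8],
    "X2": [6, 4, 5, 7, 5, 6, 7, 4, 5, 6],
    "Y":  [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

X = sm.add_constant(df[["X1", "X2"]])
model = sm.Logit(df["Y"], X).fit(disp=0)

# Wald statistic for each coefficient: W = (beta - 0)^2 / Var(beta)
wald = (model.params / model.bse) ** 2
p_values = stats.chi2.sf(wald, df=1)

for name, w, p in zip(model.params.index, wald, p_values):
    print(f"{name}: W = {w:.3f}, p-value = {p:.3f}")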
IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to, in order to obtain the associated p-value?
   a) The F-distribution
   b) The t-distribution
   c) The normal distribution
   d) The chi-square distribution
Bin: [0.3-0.5]
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases (sum of the predicted probabilities in the bin): 0.40 + 0.35 + 0.30 + 0.28 = 1.33
Bin: [0.5-0.7]
Total cases in bin: 3
Observed cases (Y = 1): 2
Expected cases (sum of the predicted probabilities in the bin): 0.45 + 0.60 + … (the third predicted probability is not shown in this extract)
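The Hosmer-Lemeshow calculation can be sketched by hand as below: observations are sorted by predicted probability, split into groups, and observed event counts are compared with the sums of predicted probabilities; the probabilities and outcomes used here are hypothetical.
Using Python:
import numpy as np
from scipy import stats

def hosmer_lemeshow(y_true, y_prob, n_bins=10):
    """Hosmer-Lemeshow goodness-of-fit statistic for a logistic model."""
    order = np.argsort(y_prob)
    y_true, y_prob = np.asarray(y_true)[order], np.asarray(y_prob)[order]
    groups = np.array_split(np.arange(len(y_prob)), n_bins)  # roughly equal groups
    chi2 = 0.0
    for idx in groups:
        observed = y_true[idx].sum()    # observed events in the group
        expected = y_prob[idx].sum()    # sum of predicted probabilities
        n = len(idx)
        # contributions from both the "event" and "non-event" cells
        chi2 += (observed - expected) ** 2 / expected
        chi2 += ((n - observed) - (n - expected)) ** 2 / (n - expected)
    p_value = stats.chi2.sf(chi2, df=n_bins - 2)
    return chi2, p_value

# Hypothetical predicted probabilities and observed outcomes
y_prob = np.array([0.05, 0.12, 0.18, 0.28, 0.30, 0.35, 0.40, 0.45, 0.60, 0.82])
y_true = np.array([0,    0,    0,    1,    0,    0,    1,    1,    1,    1])
print(hosmer_lemeshow(y_true, y_prob, n_bins=5))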
where ℒ_model is the log-likelihood of the full model, ℒ_null is the log-likelihood of the null model (a model with only an intercept term), and ℒ_max is the log-likelihood of a model with perfect prediction (a hypothetical model that perfectly predicts all outcomes).
Nagelkerke's R-squared ranges from 0 to 1, with 0 indicating that the predictors have
no explanatory power, and 1 suggesting a perfect fit of the model. However, it is
important to note that Nagelkerke's R-squared is an adjusted measure and should not
be interpreted in the same way as R-squared in linear regression.
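A minimal sketch of computing pseudo R-squared measures from a fitted statsmodels logistic regression: McFadden's value is reported directly, while the Cox-Snell and Nagelkerke versions are derived from the model and null log-likelihoods; the data are randomly generated for illustration.
Using Python:
import numpy as np
import statsmodels.api as sm

# Hypothetical binary-outcome data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(0.3 + 1.0 * X[:, 0] - 0.7 * X[:, 1])))
y = rng.binomial(1, p)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
n = len(y)

# McFadden's pseudo R-squared, reported directly by statsmodels
print("McFadden:", result.prsquared)

# Cox-Snell and Nagelkerke's (adjusted) pseudo R-squared from the
# log-likelihoods of the fitted model (llf) and the null model (llnull)
cox_snell = 1 - np.exp(2 * (result.llnull - result.llf) / n)
nagelkerke = cox_snell / (1 - np.exp(2 * result.llnull / n))
print("Nagelkerke:", nagelkerke)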
X1   X2   Y
2.5   6   0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
                                Actual
                        Cancerous      Non-Cancerous
Predicted  Cancerous      TP = 5          FP = 15
           Non-Cancerous  FN = 5          TN = 75
Fig 3.2: Classification Matrix
3.8.1 Sensitivity
Sensitivity, also referred to as the True Positive Rate or Recall, is calculated as the ratio of correctly predicted cancerous cells to the total number of cancerous cells in the ground truth. To compute sensitivity, you can use the following formula:
Sensitivity = TP / (TP + FN)
3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total number of cells. To compute accuracy, you can use the following formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3.8.4 Precision
Precision is calculated as the ratio of correctly predicted cancerous cells to the total number of cells predicted as cancerous by the model. To compute precision, you can use the following formula:
Precision = TP / (TP + FP)
3.8.5 F-score
The F1-score is calculated as the harmonic mean of Precision and Recall. To compute the F1-score, you can use the following formula:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
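Putting these formulas together, the short sketch below computes the metrics for the classification matrix in Fig 3.2 (TP = 5, FP = 15, FN = 5, TN = 75).
Using Python:
# Metrics computed from the classification matrix in Fig 3.2
TP, FP, FN, TN = 5, 15, 5, 75

sensitivity = TP / (TP + FN)                 # recall / true positive rate
specificity = TN / (TN + FP)
accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
f1_score    = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Sensitivity: {sensitivity:.2f}")   # 0.50
print(f"Specificity: {specificity:.2f}")   # 0.83
print(f"Accuracy:    {accuracy:.2f}")      # 0.80
print(f"Precision:   {precision:.2f}")     # 0.25
print(f"F1-score:    {f1_score:.2f}")      # 0.33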
IN-TEXT QUESTIONS
3. For the model X results on the given dataset of 100 cells, the precision of the model is
   a) 0     b) 0.25     c) 0.5     d) 1
4. For the model X results on the given dataset of 100 cells, the recall of the model is
   a) 0     b) 0.25     c) 0.5     d) 1
3.10 ROC
The performance of a binary classification model, particularly in logistic regression or machine learning, is assessed using a graphical representation called the Receiver Operating Characteristic (ROC) curve. It demonstrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across various classification thresholds.
The ROC curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The formulas for TPR and FPR are as follows:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
The ROC curve allows us to evaluate the model's capacity to distinguish between positive and negative examples at various classification thresholds. A perfect classifier would have a ROC curve that reaches the top left corner of the plot, with a TPR of 1 and an FPR of 0. The closer the ROC curve lies to the top left corner, the greater the model's discriminatory power.
3.11 AUC
The Area Under the Curve (AUC) is a statistic used, together with the Receiver Operating Characteristic (ROC) curve, to assess the effectiveness of a binary classification model. It represents the likelihood that a randomly selected positive instance will be assigned a higher predicted probability than a randomly selected negative instance.
The AUC is calculated by integrating the ROC curve. Because it is the area under a curve, it has no simple closed-form formula and is commonly computed numerically (for example, with the trapezoidal rule) or by software.
The AUC value ranges between 0 and 1. An AUC of 0.5 indicates a random classifier, whose predictive power is no better than chance. An AUC closer to 1 indicates a more accurate classifier that is better able to distinguish between positive and negative cases. Conversely, an AUC closer to 0 suggests poor performance, with the model performing worse than random guessing.
The AUC is a commonly used statistic in binary classification tasks because it offers a succinct assessment of the model's performance across different classification thresholds. It is especially useful when the dataset is imbalanced, i.e., when the numbers of positive and negative instances differ significantly.
In conclusion, the AUC evaluates a binary classification model's overall discriminatory power by delivering a single value that summarises the model's capacity to rank cases correctly. Higher AUC values indicate better classification performance, while lower values indicate worse performance.
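As an illustration of both the ROC curve and the AUC, the sketch below uses scikit-learn's roc_curve and roc_auc_score together with matplotlib; scikit-learn is not used elsewhere in this material, so treat it as one convenient option rather than the prescribed tool. The labels and predicted probabilities are hypothetical.
Using Python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities from a classifier
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.3, 0.35, 0.8, 0.2, 0.7, 0.45, 0.6, 0.9, 0.5])

# ROC curve: TPR against FPR at every classification threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()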
3.12 SUMMARY
Logistic regression is used to solve classification problems by producing probability values within the range of 0 and 1. It uses the logistic (sigmoid) function. Multinomial regression is the generalisation of logistic regression to multiclass problems. The Omnibus test is a statistical test used to assess the significance of several model parameters at once, while the Wald test is used to assess the significance of individual predictor variables in a regression model. The Hosmer-Lemeshow test is employed to assess the adequacy (goodness of fit) of a logistic regression model. Pseudo R-square is a measure of the proportion of variation in the dependent variable explained by the predictor variables. Finally, various classification metrics, namely Sensitivity, Specificity, Accuracy, Precision, F-score, Gini Coefficient, ROC and AUC, are used to evaluate the performance of a classifier model.
ROC curve: Demonstrates the balance between the true positive rate and the false positive rate across various classification thresholds.
Gini Coefficient: A metric used to measure inequality.
3.16 REFERENCES
LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395–2399.
Wright, R. E. (1995). Logistic regression.
Chatterjee, S., & Simonoff, J. S. (2013). Handbook of Regression Analysis. John Wiley & Sons.
Kleinbaum, D. G., & Klein, M. (2002). Logistic Regression. New York: Springer-Verlag.
DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 956–968.
Osborne, J. W. (2014). Best Practices in Logistic Regression. Sage Publications.
Bonaccorso, G. (2017). Machine Learning Algorithms. Packt Publishing Ltd.
STRUCTURE
Age   Cholesterol   Disease
50    210           No
55    190           Yes
60    220           No
65    230           Yes
70    200           No
Using the CART algorithm, we can build a decision tree to make predictions. The
decision tree may look like this:
Fig 4.2: Predicting Disease based on Age and Cholesterol Levels
The decision tree in this illustration begins at the root node at the top, which evaluates the condition "Age ≤ 55". If a patient is aged 55 or under, we proceed to the left branch and examine the condition "Cholesterol ≤ 200". The prediction is "No Disease" if the patient's cholesterol level is less than or equal to 200, and "Yes Disease" if the cholesterol level is more than 200.
However, if the patient is older than 55, we switch to the right branch, where "No Disease" is predicted regardless of the cholesterol level.
4.3 CHAID
CHAID (Chi-squared Automatic Interaction Detection) builds a tree by repeatedly applying the chi-square test of independence, χ² = Σ (O − E)² / E, where O represents the observed frequencies of a contingency table and E represents the expected frequencies under the assumption of independence between variables.
Fig 4.3: Determining Customer Satisfaction Levels
This flowchart shows how CHAID gradually divides the dataset into subsets according to the most important predictor factors, resulting in a hierarchical structure. It enables us to visualise clearly and systematically the links between the variables and their effects on the target variable (Customer Satisfaction).
Age Group is the first variable on the flowchart, and it has two branches: "Young" and
"Middle-aged." We further examine the Gender variable within the "Young" branch,
resulting in branches for "Male" and "Female." The Purchase Frequency variable is
next examined for each gender subgroup, yielding three branches: "Low," "Medium,"
and "High." We arrive at the leaf nodes, which represent the customer satisfaction
outcome and are either "Satisfied" or "Not Satisfied."
4.3.2 Bonferroni Correction
The Bonferroni correction is a statistical method used to adjust the significance levels
(p values) when conducting multiple hypothesis tests at the same time. It helps
control the overall chance of falsely claiming a significant result by making the criteria
for significance more strict.
To apply the Bonferroni correction, we divide the desired significance level (usually
denoted as α) by the number of tests being performed (denoted as m). This adjusted
significance level, denoted as α' or α_B, becomes the new threshold for determining
statistical significance.
Mathematically, the Bonferroni correction can be represented as:
α_B = α / m
For example, suppose we are conducting 10 hypothesis tests and we want a significance level of 0.05 (α = 0.05). By applying the Bonferroni correction, we divide α by 10, resulting in an adjusted significance level of:
α_B = 0.05 / 10 = 0.005
Based on the Bonferroni correction, we conclude that Test 1, Test 3, and Test 10 show
statistically significant results, as their p-values are less than or equal to the adjusted
significance level. The remaining tests are not considered statistically significant.
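A small sketch of the adjustment in Python; the ten p-values below are hypothetical stand-ins for the table that is not reproduced here, chosen so that Tests 1, 3 and 10 fall at or below the adjusted threshold, matching the conclusion above.
Using Python:
import numpy as np

# Hypothetical p-values from 10 simultaneous hypothesis tests
p_values = np.array([0.001, 0.20, 0.004, 0.30, 0.45, 0.08, 0.60, 0.12, 0.85, 0.003])

alpha = 0.05
m = len(p_values)
alpha_adjusted = alpha / m          # Bonferroni-adjusted significance level: 0.005

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= alpha_adjusted else "not significant"
    print(f"Test {i}: p = {p:.3f} -> {verdict}")
The multipletests function in statsmodels.stats.multitest offers the same correction (method='bonferroni') without writing the loop by hand.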