
Department of Distance

and Continuing Education


University of Delhi

Bachelor of Management Studies Course


Credit - 4
Semester-III
Discipline Specific Core (DSC-7)

As per the UGCF - 2022 and National Education Policy 2020

DSC-7: Introduction to Business Analytics


Editors
Dr. Rishi Rajan Sahay
Assistant Professor, Shaheed Sukhdev College of
Business Studies, University of Delhi
Dr. Sanjay Kumar
Assistant Professor, Delhi Technological University

Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar

Academic Coordinator
Mr. Deekshant Awasthi

© Department of Distance and Continuing Education


ISBN: 978-81-19169-84-9
1st edition: 2023
E-mail: [email protected]
[email protected]

Published by:
Department of Distance and Continuing Education under the
aegis of Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi

© Department of Distance & Continuing Education, Campus of Open Learning, School of Open Learning, University of Delhi
Attention

Corrections/Modifications/Suggestions proposed by Statutory Body, DU/Stakeholder(s) in the Self Learning Material (SLM) will be incorporated in the next edition. However, these corrections/modifications/suggestions will be uploaded on the website https://sol.du.ac.in. Any feedback or suggestions can be sent to the email: [email protected]

DSC 7: Introduction to Business Analytics

INDEX

Lesson 1: Introduction to Business Analytics and Descriptive Analytics ... 01
1.1 Learning Objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the Concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to Relevant Statistical Software Packages
1.9 Summary

Lesson 2: Predictive Analytics ... 20
2.1 Learning Objectives
2.2 Introduction
2.3 Classical Linear Regression Model
2.4 Multiple Linear Regression Model
2.5 Practical Exercise Using R/Python Programming
2.6 Summary

Lesson 3: Logistic and Multinomial Regression ... 48
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer-Lemeshow Test
3.7 Pseudo R-Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary

Lesson 4: Decision Tree and Clustering ... 69
4.1 Learning Objectives
4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.5 Impurity Measures
4.6 Ensemble Methods
4.7 Clustering
4.8 Summary

LESSON 1


INTRODUCTION TO BUSINESS ANALYTICS AND
DESCRIPTIVE ANALYTICS
Dr. Abhishek Kumar Singh
Assistant Professor
University of Delhi-19
[email protected]

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Introduction to Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to relevant statistical software packages
1.9 Summary
1.10 Glossary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Reading

1.1 LEARNING OBJECTIVES

After studying the lesson, you will be able to:


Define Business Analytics
State the Role of Analytics for Data-Driven Decision Making
Mention the types of Business Analytics
Classify the concepts of Big Data Analytics
Describe Machine Learning Algorithms
Identify relevant statistical software packages.

1.2 INTRODUCTION


Business analytics (BA) consists of using data to gain valuable insights and make
informed decisions in a business setting. It involves analysing and interpreting data to
uncover patterns, trends, and correlations that can help organizations improve their
operations, better understand their customers, and make strategic decisions. Business
analytics (BA) places a focus on statistical analysis. In addition to statistical analysis,
business analytics also focuses on various other aspects, such as data mining,
predictive modelling, data visualization, machine learning, and data-driven decision
making.
Companies committed to making data-driven decisions employ business analytics. The
study of data through statistical and operational analysis, the creation of predictive
models, the use of optimisation techniques, and the communication of these results to
clients, business partners, and company executives are all considered components of
business analytics. It relies on quantitative methodologies, and the data used to build
specific business models and reach profitable conclusions must be supported by
evidence. As a result, business analytics relies heavily on Big Data. Business analytics
is the process of analysing data on past outcomes and problems in order to create an
effective future plan.
Big Data, meaning very large volumes of data, is used to generate answers. The economy,
and the sectors that prosper within it, depend on this way of conducting business and this
outlook on creating and maintaining a business. Over the past ten or so years, the word
analytics has gained popularity, and analytics has become incredibly important with the
growth of the internet and information technology. In this lesson we are going to learn
about business analytics, an area that integrates data, information technology, statistical
analysis, and quantitative techniques with computer-based models. All of these factors
work together to present decision-makers with every possibility that can arise, allowing
them to make well-informed choices. The computer-based models ensure that
decision-makers can examine how their choices would work out in various scenarios.



Fig. 1

1.3 BUSINESS ANALYTICS


1.3.1 Meaning: Business analytics (BA) utilizes data analysis, statistical models, and
various quantitative techniques as a comprehensive discipline and technological
approach. It involves a systematic and iterative examination of organizational data,
with a specific emphasis on statistical analysis, to facilitate informed decision-making.
Business analytics primarily entails a combination of the following: discovering novel
patterns and relationships using data mining; developing business models using
quantitative and statistical analysis; conducting A/B and multi-variable testing based
on findings; forecasting future business needs, performance, and industry trends using
predictive modelling; and reporting your findings to co-workers, management, and
clients in simple-to-understand reports.
1.3.2 Definition
Business analytics (BA) involves utilizing knowledge, tools, and procedures to
analyse past business performance in order to gain insight and inform present and
future business strategy.
Business analytics is the process of transforming data into insights to improve business
choices. It is based on data and statistical approaches to provide new insights and
understanding of business performance. Some of the methods used to extract insights
from


data include data management, data visualisation, predictive modelling, data mining, forecasting, simulation, and optimisation.
1.3.3 Business analytics evolution
Business analytics has been around for a very long time and has developed as more and
better technology has become available. It has its roots in operations research, which was
widely applied during World War II.
Operations research was initially designed as a methodical strategy to analyse data in
military operations. Over time, this strategy began to be applied in the business domain
as well, and the study of operations gradually evolved into management science. The
fundamental elements of management science, such as decision-making models, were
the same as those of operations research.
Analytics has been employed in business ever since Frederick Winslow Taylor
implemented management exercises in the late 19th century; Henry Ford's newly
constructed assembly line involved timing each component.
However, when computers were deployed in decision support systems in the late
1960s, analytics started to garner greater attention. Since then, enterprise resource
planning (ERP) systems, data warehouses, and a huge range of other software tools
and procedures have all modified and shaped analytics.
With the advance of computing, business analytics has grown rapidly in recent years. This
shift has elevated analytics to entirely new heights and opened up a world of opportunity.
Given how far the discipline has come, many people would never guess that analytics
began in the early 1900s with Mr Ford himself. Business intelligence, decision support
systems, and PC software all developed out of management science.

1.4 ROLE OF ANALYTICS FOR DATA-DRIVEN DECISION MAKING
1.4.1 Applications and uses for business analytics are numerous. It can be applied to
descriptive analysis, which makes use of facts to comprehend the past and present.
This form of descriptive analysis is employed to evaluate the company's present
position in the market and the success of earlier business decisions.
Predictive analysis, which frequently uses past business performance to estimate what is
likely to happen next, is used alongside it. Prescriptive analysis, which is used to develop
optimisation strategies for better
corporate performance, is another application of business analytics. Business
analytics, for instance, is used to base price decisions for various products in a
department store on historical and current data.
1.4.2 Workings of business analytics: Several fundamental procedures are carried out in
BA before any data analysis is done:
Identify the business objective of the analysis.
Choose an analytical approach.
Gather business data, often from multiple systems and sources, to support the study.
Cleanse and integrate all the data into one location, such as a data warehouse or data mart.
1.4.3 Need/Importance of Business Analytics
Business analytics serves as an approach to help in making informed business
decisions.
As a result, it affects how the entire organisation functions. Business analytics can
therefore help a company become more profitable, grow its market share and revenue,
and give shareholders a higher return. It entails improved primary and secondary data
interpretation, which again affects the operational effectiveness of several
departments. Moreover, it provides a competitive advantage to the organization. The
flow of information is nearly equal among all actors in this digital age. The
competitiveness of the company is determined by how this information is used.
Corporate analytics improves corporate decisions by combining readily available data
with numerous carefully considered models.
1.4.4 Transforms data into insightful knowledge
Business analytics serves as a resource that helps a firm make informed decisions. These
choices will probably affect your entire business, because they will help you expand
market share, boost profitability, and give potential investors a higher return.
While some businesses struggle with how to use massive volumes of data, business
analytics aims to combine this data with useful insights to enhance the decisions your
organisation makes.
In essence, business analytics is significant across all industries for the following four
reasons:
Enhances performance by providing your company with a clear picture of what
is and what isn't working
Facilitates quicker and more precise decision-making


Reduces risks by assisting a company in making wise decisions on consumer
behaviour, trends, and performance.
By providing information on the consumer, it encourages innovation and change.

IN-TEXT QUESTIONS 1.1

1. Define Business Analytics.
2. What do you understand by the term business analytics evolution?
3. State two points of importance of Business Analytics.
1.5 TYPES OF BUSINESS ANALYTICS

Business analytics can be divided into four primary categories, each more complex than
the last. Each brings us one step closer to insight-driven applications for present and
future scenarios. Below is a description of each of these categories.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive analytics: In order to understand what has occurred in the past or is
happening right now, it summarises the data that an organisation currently has.
The simplest type of analytics is descriptive analytics, which uses data
aggregation and mining techniques. It increases the availability of data to an
organization's stakeholders, including shareholders, marketing executives, and
sales managers. It can aid in discovering strengths and weaknesses and give
information about customer behaviour. This aids in developing focused marketing
strategies.
2. Diagnostic Analytics: This kind of analytics aids in refocusing attention from past
performance to present occurrences and identifies the variables impacting
trends. Drill-down, data mining, and other techniques are used to find the
underlying cause of occurrences. Probabilities and likelihoods are used in
diagnostic analytics to

comprehend the potential causes of events. For classification and regression,
methods like sensitivity analysis and training algorithms are used.
3. Predictive Analytics: With the aid of statistical models and machine learning
approaches, this type of analytics is used to predict the likelihood of future events. It
builds on the output of descriptive analytics to create models that estimate how likely
particular outcomes are. Predictive analyses are usually carried out by machine
learning specialists and can be more accurate than results obtained from business
intelligence alone. Sentiment analysis is among its most popular uses: existing social
media data is used to construct a complete picture of a user's viewpoint, and this data
is evaluated to forecast their attitude (positive, neutral, or negative).
4. Prescriptive Analytics: It offers suggestions for the next best course of action, going
beyond predictive analytics. It predicts the benefits of each possible course of action
and also provides the precise steps required to produce the most desirable outcome.
It primarily depends on a robust feedback system and ongoing iterative analysis, and
it learns the connection between actions and their results. The development of
recommendation systems is a typical use of this kind of analytics.

Fig. No. 2
1.6 INTRODUCTION TO THE CONCEPTS OF BIG DATA ANALYTICS
Big Data Analytics deals with enormous volumes of information that cannot be processed
or stored using conventional data processing or storage methods. This data typically
comes in three distinct forms.
Structured data, as the name implies, has a clear structure and follows a regular
sequence. A person or machine may readily access and utilise this type of
information since it has been intended to be user-friendly. Structured data is
typically kept in databases, especially relational database management
systems, or RDBMS, and tables with clearly defined rows and columns, such
as spreadsheets.
While semi-structured data displays some of the same characteristics as
structured data, for the most part it lacks a clear structure and cannot adhere to
the formal specifications of data models like an RDBMS.
Unstructured data does not adhere to the formal structural norms of traditional
data models and lacks a consistent structure across all of its different forms. In
a very small number of cases, it might contain information on the date and time.
1.6.1 Large-scale Data Management Traits
According to traditional definitions of the term, big data is typically linked to three
essential traits:
Volume: The massive amounts of information produced every second by social
media, mobile devices, automobiles, transactions, connected sensors, photos,
video, and text are referred to by this characteristic. Only big data technologies
can handle enormous volumes, which come in petabyte, terabyte, or even
zettabyte sizes.
Variety: Information in the form of images, audio streams, video, and many other
formats now adds a diversity of data kinds, around 80% of which are completely
unstructured, to the existing landscape of transactional and demographic data like
phone numbers and addresses.
Velocity: This attribute relates to the velocity of data accumulation and refers to
the phenomenal rate at which information is flooding into data repositories. It
also describes how quickly massive data can be analysed and processed to
draw out the insights and patterns it contains. Now, that speed is frequently
real-time. Current

definitions of big data management also contain two additional features in
addition to the three Vs, namely:
Veracity: The level of dependability and truth that big data can provide in terms
of its applicability, rigour, and correctness.
Value: This feature examines whether information and analytics will eventually
be beneficial or detrimental as the main goal of big data collection and analysis
is to uncover insights that can guide decision-making and other activities.

Fig: 3
1.6.2 Services for Big Data Management
Organisations can pick from a wide range of big data management options when it
comes to technology. Big data management solutions can be standalone or multi-
featured, and many businesses employ several of them. The following are some of the
most popular kinds of big data management capabilities:
Data cleansing: finding and resolving problems in data sets.
Data integration: merging data from several sources.
Data preparation: preparing data for use in analytics or other applications.
Data enrichment: enhancing data by adding new data sets, fixing minor errors, or extrapolating new information from raw data.
Data migration: moving data from one environment to another, such as from internal data centres to the cloud.

Data analytics: analysing data using a variety of techniques in order to gain insights.

1.7 OVERVIEW OF MACHINE LEARNING ALGORITHMS

1.7.1 Machine Learning:


Machine Learning (ML) is the study of computer algorithms that can get better on their
own over time and with the help of data. It is thought to be a component of artificial
intelligence. Without being expressly taught to do so, machine learning algorithms
create a model using sample data, also referred to as training data, in order to make
predictions or judgements. In a wide range of fields where it is challenging or
impractical to design traditional algorithms, such as medicine, email filtering, speech
recognition, and computer vision, machine learning algorithms are applied.
Computational statistics, which focuses on making predictions with computers, is
closely related to a subset of machine learning, but not all machine learning is
statistical learning. The field of machine learning benefits from the tools, theory, and
application fields that come from the study of mathematical optimisation. Data mining
is a related area of study that focuses on unsupervised learning for exploratory data
analysis.
Some machine learning applications employ data and neural networks in a way that
closely resembles how a biological brain functions. When applied to solving business
problems, machine learning is also known as predictive analytics.
How does machine learning operate?
1. The Making of a Decision: Typically, machine learning algorithms are employed
to produce a forecast or classify something. Your algorithm will generate an
estimate about a pattern in the supplied data, which may be tagged or
unlabelled.
2. An Error Function: A model's prediction can be assessed using an error function.
In order to evaluate the model's correctness when there are known examples,
an error function can compare the results.
3. A Model Optimisation Process: Weights are adjusted to reduce the discrepancy
between the known examples and the model's predictions so that the model fits the
training data more closely. The algorithm repeats this evaluate-and-optimise cycle,
updating the weights on its own each time, until the desired accuracy level is reached.
A minimal sketch of this loop follows below.
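The three steps above can be pictured with a minimal sketch, not taken from the text, of gradient descent on a single-weight model. It assumes NumPy and uses simulated data; the learning rate and step count are arbitrary illustration values.

# Illustrative sketch: the decision / error / optimisation loop of machine
# learning, shown as gradient descent on a single-weight model (simulated data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)             # input data
y = 3.0 * x + rng.normal(0, 1, 100)     # labelled outcomes (true weight is 3)

w = 0.0      # initial weight
lr = 0.01    # learning rate

for step in range(200):
    y_hat = w * x                            # 1. decision: make a prediction
    mse = np.mean((y - y_hat) ** 2)          # 2. error function: mean squared error
    grad = -2 * np.mean((y - y_hat) * x)     # gradient of the error with respect to w
    w = w - lr * grad                        # 3. optimisation: update the weight

print("learned weight:", round(w, 2), "final MSE:", round(mse, 3))

Each pass through the loop evaluates the current weight and nudges it to reduce the error, which is exactly the iterative evaluate-and-optimise behaviour described above.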
1.7.2 Machine learning methods: Machine learning classifiers fall into three primary
categories
1. Supervised machine learning: Supervised learning, commonly referred to as
supervised machine learning, is the use of labelled datasets to train algorithms that
can reliably classify data or predict outcomes. The model receives input data and
adjusts its weights until it is properly fitted. This happens as part of the
cross-validation process to make sure the model does not overfit or underfit.
Supervised learning assists organisations in finding scalable solutions to a range of
real-world issues, such as classifying
spam in a different folder from your email. Neural networks, naive Bayes, linear
regression, logistic regression, random forest, support vector machine (SVM),
and other techniques are used in supervised learning.
2. Unsupervised Machine learning: Unsupervised learning, commonly referred to
as unsupervised machine learning, analyses and groups un-labelled datasets
using machine learning algorithms. These algorithms identify hidden patterns
or data clusters without the assistance of a human. It is the appropriate solution
for exploratory data analysis, cross-selling tactics, consumer segmentation, and
picture and pattern recognition because of its capacity to find similarities and
differences in information. Through the process of dimensionality reduction, it
is also used to lower the number of features in a model; principal component
analysis (PCA) and singular value decomposition (SVD) are two popular
methods for this. The use of neural networks, k-means clustering, probabilistic
clustering techniques, and other algorithms is also common in unsupervised
learning.
3. Semi-supervised learning: A satisfying middle ground between supervised and
unsupervised learning is provided by semi-supervised learning. It employs a smaller,
labelled data set during training to direct feature extraction and classification from a
larger, unlabelled data set. If you do not have enough labelled data, or cannot afford
to label enough data, to train a supervised learning system, semi-supervised learning
can help. (A short code sketch contrasting supervised and unsupervised learning
follows this list.)
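The following minimal sketch, assuming the scikit-learn library and synthetic data from make_blobs (neither is prescribed by the text), contrasts a supervised classifier, which learns from labels, with an unsupervised clustering algorithm, which groups the same observations without labels.

# Sketch: supervised vs. unsupervised learning on the same synthetic data
# (assumes scikit-learn; the data are generated, not real).
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the labels y are used to train the classifier.
clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: the same features are grouped without any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("first ten cluster assignments:", km.labels_[:10])

A semi-supervised approach would sit between the two, using a small labelled subset to guide the grouping of the larger unlabelled remainder.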
1.7.3 Reinforcement learning with computers: Although the algorithm isn't trained on
sample data, reinforcement machine learning is a behavioural machine learning
model that is similar to supervised learning. Trial and error are used by this model to
learn as it goes. The optimal suggestion or strategy for a specific problem will be created
by reinforcing a string of successful outcomes.
A subset of artificial intelligence called "machine learning" employs computer
algorithms to enable autonomous learning from data and knowledge. In machine
learning, computers can change and enhance their algorithms without needing to be
explicitly programmed.
Computers can now interact with people, drive themselves, write and publish sport
match reports, and even identify terrorism suspects thanks to machine learning
algorithms.

IN-TEXT QUESTIONS 1.2

1. What are the types of Business Analytics?


2. What is Big Data Analytics?
3. Name the three essential traits of big data.

1.8 INTRODUCTION TO RELEVANT STATISTICAL SOFTWARE PACKAGES
A statistical package is essentially a group of software programmes that share a
common user interface and were created to make it easier to do statistical analysis
and related duties like data management.
What is Statistical Software?
Software for doing complex statistical analysis is known as statistical software. They
serve as tools for organising, interpreting, and presenting particular data sets in order
to provide scientific insights on patterns and trends. To support data science work,
statistical software applies statistical theory and procedures such as regression analysis
and time series analysis.
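As a rough illustration of the routine work such packages automate, the short Python sketch below computes basic descriptive statistics and a group-wise summary. It assumes the pandas library, and the regions and sales figures are invented example data.

# Sketch: descriptive statistics of the kind a statistical package produces
# (assumes pandas; the sales figures are made-up example data).
import pandas as pd

data = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North", "South"],
    "sales":  [120, 95, 130, 110, 150, 90],
})

print(data["sales"].describe())                 # count, mean, std, quartiles
print(data.groupby("region")["sales"].mean())   # average sales by region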
Benefits of Statistical Software:
Increases productivity and accuracy in data management and analysis, requiring less time.
Simple personalization.
Access to a sizable database, which reduces sampling error and enables data-driven decision making.


Relevant statistical software packages:
1. SPSS (Statistical Package for Social Sciences)
The most popular and effective programme for analysing complex
statistical data is called SPSS.
To make the results easy to discuss, it quickly generates descriptive
statistics, parametric and non-parametric analysis, and delivers graphs
and presentation ready reports.
Here, estimation and discovery of missing values in the data sets lead to
more accurate reports.
For the analysis of quantitative data, SPSS is utilised.
2. Stata
Stata is another commonly used programme that makes it possible to
manage, save, produce, and visualise data graphically. It does not require
any coding expertise to use.
Its use is more intuitive because it has both a command line and a
graphical user interface.
3. R
Free statistical software known as "R" offers graphical and statistical
tools, including linear and non-linear modelling.
Toolboxes, which are effective plugins, are available for a wide range of
applications. Coding expertise is necessary here.
It offers interactive reports and apps, makes extensive use of data, and
complies with security guidelines.
R is used to analyse quantitative data.
4. Python
Another freely available software
Extensive libraries and frameworks

A popular choice for machine learning tasks.


Simplicity and Readability
5. SAS (Statistical Analysis Software)
It is a cloud-based platform that offers ready-to-use applications for
manipulating data, storing information, and retrieving it.
Its processes employ several threads, executing several tasks at once.
Business analysts, statisticians, data scientists, researchers, and
engineers utilise it largely for statistical modelling, spotting trends and
patterns in data, and assisting in decision-making.
For someone unfamiliar with this method, coding can be challenging.
It is utilised for the analysis of numerical data.
6. MATLAB (MATrix LABoratory)
The initials MATLAB stand for Matrix Laboratory.
Software called MATLAB offers both an analytical platform and a
programming language.
It expresses matrix and array mathematics, function and data charting,
algorithm implementation, and user interface development.
A script that combines code, output, and formatted text into an executable
notebook is produced by Live Editor, which is also provided.
Engineers and scientists utilise it a lot.
For the analysis of quantitative data, MATLAB is employed.
7. Epi-data
Epi-data is a widely used, free data programme created to help
epidemiologists, public health researchers, and others enter, organise, and
analyse data while working on the ground.
It manages all of the data and produces graphs and elementary statistical
analysis. Here, users can design their own databases and forms.
Epi-data is a tool for analysing quantitative data.
8. Epi-info
It is a public domain software suite created by the Centres for Disease
Control and Prevention (CDC) for researchers and public health professionals
worldwide.



For those who might not have a background in information technology, it
offers simple data entry forms, database development, and data analytics
including epidemiology statistics, maps, and graphs.
Investigations into disease outbreaks, the creation of small to medium-
sized disease monitoring systems, and the analysis, visualisation, and
reporting (AVR) elements of bigger systems all make use of it.
It is utilised for the analysis of numerical data.
9. NVivo
It is a piece of software that enables the organisation and archiving of
qualitative data for analysis.
The analysis of unstructured text, audio, video, and image data, such as
that from interviews, focus groups (FGD), surveys, social media, and journal
articles, is done using NVivo.
You can import Word documents, PDFs, audio, video, and photos here.
It facilitates users' more effective organisation, analysis, and discovery of
insights from structured or qualitative data.
The user-friendly layout makes it instantly familiar and intuitive for the user.
It contains a free version as well as automated transcribing and auto coding.
Research using mixed methods and qualitative data is conducted using NVivo.
10. Minitab
Minitab provides both fundamental and moderately sophisticated statistical
analysis capabilities.
It has the ability to analyse a variety of data sets, automate statistical
calculations, and provide attractive visualisations.
Minitab allows users to concentrate more on data analysis by letting them
examine both current and historical data to spot trends, patterns, and hidden
links between variables.
It makes it easier to understand the data's insights.
For the examination of quantitative data, Minitab is employed.
11. Dedoose
Dedoose, a tool for qualitative and quantitative data analysis, is entirely
web based.

This low-cost programme is user-friendly and team-oriented, and it makes it simple to import both text and visual data.
It has access to cutting-edge data security equipment.
12. ATLAS.ti
It is a pioneer in qualitative analysis software and has incorporated AI as
it has developed.
It is best suited to research organisations, businesses, and academic
institutions, given the cost of conducting individual studies.
With sentiment analysis and auto coding, it is more potent.
It gives users the option to use any language or character set.
13. MAXQDA 12
It is expert software for analysing data using quantitative, qualitative, and
mixed methods.
It imports the data, reviews it in a single spot, and categorises any
unstructured data with ease.
With this software, a literature review may also be created.
It costs money and is not always easy to collaborate with others in a team.

IN-TEXT QUESTIONS 1.3

1. Name three relevant statistical software packages.

2. Name the machine learning methods.

1.9 SUMMARY
The disciplines of management, business, and computer science are all combined in
business analytics. The commercial component requires knowledge of the industry at
a high level as well as awareness of current practical constraints. An understanding of
data, statistics, and computer science is required for the analytical portion. Business
analysts can close the gap between management and technology thanks to this
confluence of disciplines. Business analytics also includes effective problem-solving
and communication to translate data insights into information that is understandable
to executives. A related field called business intelligence likewise uses data to better
understand and inform businesses. What distinguishes

business analytics from business intelligence in terms of objectives? Although both areas
rely on data to provide answers, the goal of business intelligence is to understand how an
organisation arrived at its current position, which includes the measurement and
monitoring of key performance indicators (KPIs).
analytics, on the other hand, is to support business improvements by utilizing
predictive models that offer insight into the results of suggested adjustments. Big data,
statistical analysis, and data visualization are all used in business analytics to
implement organizational changes. This work includes predictive analytics, which is
crucial since it uses data that is already accessible to build statistical models. These
models can be applied to decision-making and result prediction. Business analytics
can provide specific recommendations to fix issues and enhance enterprises by
learning from the data already available.
1.10 GLOSSARY
Term: Business Analytics
Meaning: Business analytics consists of using data analysis and statistical methods to gain insights, make informed decisions, and drive strategic actions in a business or organizational context.

Term: Big Data
Meaning: Big data refers to large and complex datasets. It is characterized by the volume, velocity, and variety of data, often generated from various sources such as social media, sensors, devices, and business transactions.

1.11 ANSWERS TO IN-TEXT QUESTIONS

IN-TEXT QUESTIONS 1.1


1. Business analytics (BA) refers to the knowledge, tools, and procedures used for
ongoing, iterative analysis and investigation of previous business performance
in order to generate knowledge and inform future business strategy.
2. Business analytics evolution refers to how business analytics has developed over a
very long time, as more and better technology has become available.


3. Two points of importance of business analytics are:


1. Gives businesses a competitive advantage
2. Transforms accessible data into insightful knowledge
INTEXT QUESTIONS 1.2
1. There are four types of Business analytics.
Descriptive analytics,
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
2. Big data is made up of enormous volumes of information that cannot be
processed or stored using conventional data processing or storage methods.
3. The three essential traits are volume, variety, and velocity.
INTEXT QUESTIONS 1.3
1. Three relevant statistical software packages are SPSS, Stata, and SAS.
2. The machine learning methods are: supervised machine learning, unsupervised
machine learning, and semi-supervised learning.

1.12 TERMINAL QUESTIONS

1. What is Business Analysis?


2. Why is a Business Analyst needed in an organization?
3. What is SaaS?
4. What are considered to be the four types of Business analytics? Explain them
in your own words.
5. Explain the importance of Business Analytics.
6. Explain any three relevant statistical software packages.

7. How does the machine learning method work?
8. Explain the difference between any two software packages.

1.13 REFERENCES

Evans, J. R. (2021), Business Analytics: Methods, Models and Decisions, Pearson India.
Kumar, U. D. (2021), Business Analytics: The Science of Data-Driven Decision Making, Wiley India.
Larose, D. T. (2022), Data Mining and Predictive Analytics, Wiley India.
Shmueli, G. (2021), Data Mining and Business Analytics, Wiley India.

1.14 SUGGESTED READING

Cadle, Paul, and Turner (2014), Business Analysis Techniques: 99 Essential Tools for Success, BCS, Swindon.
Kimi Ziemski, Richard Vander Horst, and Kathleen B. Hass (2008), Business Analyst Management Concepts: Elevating the Role of the Analyst, ISBN 1-56726-213-9, p. 94: "As business analysis becomes a more professionalised discipline".


LESSON 2
PREDICTIVE ANALYTICS
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
(University of Delhi)
[email protected]

STRUCTURE

2.1 Learning Objectives


2.2 Introduction
2.3 Classical Linear Regression Model (CLRM)
2.4 Multiple Linear Regression Model
2.5 Practical Exercises Using R/Python Programming
2.6 Summary
2.7 Self-Assessment Questions
2.8 References
2.9 Suggested Readings

2.1 LEARNING OBJECTIVES

● To understand the basic concept of linear regression and where to apply it.
● To develop a linear relationship between two or more variables.
● To predict the value of the dependent variable, given the value of the independent
variable, using the regression line.
● To be familiar with the different metrics used in regression.
● To use R and Python for regression implementation.

2.2 INTRODUCTION

In this chapter, we will explore the field of predictive analytics, focusing on two
fundamental techniques: Simple Linear Regression and Multiple Linear Regression.
Predictive analytics is a powerful tool for analysing data and making predictions about
future outcomes. We will
cover various aspects of regression models, including parameter estimation, model
validation, coefficient of determination, significance tests, residual analysis, and
confidence and prediction intervals. Additionally, we will provide practical exercises to
reinforce your understanding of these concepts, using R or Python for implementation.

2.3 CLASSICAL LINEAR REGRESSION MODEL (CLRM)

2.3.1. Introduction
Predictive analytics is the use of statistical techniques, machine learning algorithms,
and other tools to identify patterns and relationships in historical data and use them to
make predictions about future events. These predictions can be used to inform
decision-making in a wide variety of areas, such as business, marketing, healthcare,
and finance.
Linear regression is the traditional statistical technique used to model the relationship
between one or more independent variables and a dependent variable.
Linear regression involving only two variables is called simple linear regression. Let us
consider two variables, 'x' and 'y'. Here 'x' represents the independent (explanatory)
variable and 'y' represents the dependent (response) variable. The dependent variable
must be a ratio variable, whereas the independent variable can be a ratio or categorical
variable. A regression model can be built for cross-sectional data or for time series data;
in a time series regression model, time is taken as the independent variable, which is very
useful for predicting the future. Before we develop a regression model, it is a good
exercise to ensure that the two variables are linearly related. For this, plotting a scatter
diagram is very helpful, as a linear pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework used to
analyse the relationship between a dependent variable and one or more independent
variables. It is a widely used method in econometrics and other fields to study and
understand the nature of this relationship, make predictions, and test hypotheses.
Regression analysis aims to examine how changes in the independent variable(s)
affect the dependent variable. The CLRM assumes a linear relationship between the
dependent variable (Y) and the independent variable(s) (X), allowing us to estimate
the parameters of this relationship and make predictions.
The regression equation in the CLRM is expressed as:
Yi = α + βxi + μi

Here, Yi represents the dependent variable,


xi represents the independent variable,
α represents the intercept,
β represents the coefficient or slope that quantifies the effect of xi on Yi, and μi
represents the error term or residual.
The error term captures the unobserved factors and random variations that affect the
dependent variable but are not explicitly included in the model.
The CLRM considers the population regression function (PRF), which is the true
underlying relationship between the variables in the population. The PRF is
expressed as:
Yi = α + βxi + μi
The difference between the regression equation and the PRF is the inclusion of the
error term (μi) in the PRF. The error term represents the discrepancy between the
observed value of Yi and the predicted value based on the regression equation.
In practice, we estimate the parameters of the PRF using sample data and derive the
sample regression function (SRF), which is an approximation of the PRF. The SRF is
represented as:
Yi = α̂ + β̂xi + ûi
In the SRF, α̂ and β̂ are the estimated intercept and coefficient, respectively, obtained
through statistical methods such as ordinary least squares (OLS). The estimated error
term ûi captures the residuals or discrepancies between the observed and predicted
values based on the estimated parameters.
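A hedged sketch of how α̂ and β̂ might be obtained by OLS in Python follows; it assumes the statsmodels and NumPy libraries and uses simulated data rather than any dataset from the text.

# Sketch: estimating the sample regression function by ordinary least squares
# (assumes statsmodels and NumPy; the data are simulated).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 50)                 # independent variable
y = 5 + 0.8 * x + rng.normal(0, 5, 50)      # dependent variable with noise

X = sm.add_constant(x)                      # adds the column for the intercept (alpha)
model = sm.OLS(y, X).fit()                  # ordinary least squares estimation

print(model.params)                         # estimated alpha-hat and beta-hat
print(model.summary())                      # t-tests, p-values, R-squared, etc.

The fitted object also exposes the residuals (model.resid), which estimate the error term ûi.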
2.3.2. Assumptions
To ensure reliable and meaningful results, the CLRM relies on several key
assumptions. Let's discuss these assumptions one by one:
Linearity: The regression model must be linear in its parameters. Linearity refers to
the linearity of the parameters (α and β), not necessarily the linearity of the
variables themselves. For example, even if the variable xi is not linear, the model
can still be considered linear if the parameters (α and β) are linear.
Variation in Independent Variables: There should be sufficient variation in the
independent variable(s) to be qualified as an explanatory variable. In other words,
if there



is little or no variation in the independent variable, it cannot effectively explain the
differences in the dependent variable.
For example, suppose we want to model the consumption level taking income as the
independent variable. If everyone in the sample has an income of Rs 10,000, then
there is no variation in Xi. Hence, the difference in their consumption levels cannot
be explained by Xi.
Hence, we assume that there is enough variation in Xi. Otherwise, we cannot include
it as an explanatory variable in the model.
Zero Mean and Normal Distribution of Error Term: The error term (μi) should have a
mean of zero. This means that, on average, the errors do not systematically
overestimate or underestimate the dependent variable. Additionally, the error term
is assumed to follow a normal distribution, allowing for statistical inference and
hypothesis testing.
Fixed Values of Independent Variables: The values of the independent variable(s)
are considered fixed over repeated sampling. This assumption implies that the
independent variables are not subject to random fluctuations or changes during the
sampling process.
No Endogeneity: Endogeneity refers to the situation where there is a correlation
between the independent variables and the error term. In other words, the
independent variables are not independent of the error term. To ensure valid
results, it is crucial to address endogeneity issues, as violating this assumption can
lead to biased and inconsistent parameter estimates.
Number of Parameters vs. Sample Size: The number of parameters to be estimated
(k) from the model should be significantly smaller than the total number of
observations in the sample (n). In general, it is recommended that the sample size
(n) should be at least 20 times greater than the number of parameters (k) to obtain
reliable and stable estimates.
Correct Model Specification: The econometric model should be correctly specified,
meaning that it reflects the true relationship between the variables in the population.
Model misspecification can occur in two ways: improper functional form and
inclusion/exclusion of relevant variables. Improper functional form refers to using a
linear model when the true relationship is nonlinear, leading to biased parameter
estimates. The inclusion of irrelevant variables or exclusion of relevant variables
can also lead to biased and inefficient estimates.


Homoskedasticity: Homoskedasticity assumes that the variance


of the error term is constant across all levels of the independent variables. It means
that the spread or dispersion of the errors does not change systematically with the
values of the independent variable(s). This assumption is important for obtaining
efficient and unbiased estimates of the parameters.
To understand homoskedasticity visually, let's consider a scatter plot with a regression
line. In a homoskedastic scenario, the spread of the residuals around the regression
line will be relatively constant across different values of the independent variable(s).
Homoskedasticity means that the variance of the error term is constant:
Yi = α + βxi + µi
Var(µi) = σ² (a constant) for all i

Fig 2.1: Scatter Plot


Even at higher levels of Xi, the variance of the error term remains constant.
In a homoskedastic scenario, the spread of the residuals (green lines) remains
relatively constant across different values of the independent variable. This means that
the variability of the dependent variable is consistent across the range of the
independent variable.

Homoskedasticity is an important assumption in the CLRM because violations of this
assumption can lead to biased and inefficient estimators, affecting the reliability of the
regression analysis. If heteroskedasticity is present (where the spread of the residuals
varies across the range of the independent variable), it can indicate that the model is
not adequately capturing the relationship between the variables, leading to unreliable
inference and misleading results.
To detect heteroskedasticity, you can visually inspect the scatter plot of the residuals
or employ statistical tests specifically designed to assess the presence of
heteroskedasticity, such as the Breusch-Pagan test or the White test.
If heteroskedasticity is detected, various techniques can be employed to address it,
such as transforming the variables, using weighted least squares (WLS) regression,
or employing heteroskedasticity-consistent standard errors.
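A hedged sketch of such a check in Python is shown below; it assumes statsmodels and reuses a fitted OLS results object and design matrix like those in the earlier estimation sketch.

# Sketch: Breusch-Pagan test for heteroskedasticity and a robust-SE remedy
# (assumes statsmodels; `model` and `X` come from the earlier OLS sketch).
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)   # a small p-value suggests heteroskedasticity

# One remedy: heteroskedasticity-consistent (robust) standard errors.
robust = model.get_robustcov_results(cov_type="HC1")
print(robust.bse)                            # robust standard errors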
No Autocorrelation: Autocorrelation, also known as serial correlation, refers to the
correlation between error terms of different observations. In the case of cross-
sectional data, autocorrelation occurs when the error terms of different individuals
or units are correlated. In time series data, autocorrelation occurs when the error
terms of consecutive time periods are correlated. Autocorrelation violates the
assumption of independent and identically distributed errors, and it can lead to
biased and inefficient estimates.
This means that the covariance between µi and µi-1 should be zero. If that is not the
case, then it is a situation of autocorrelation.

Yi = α + βxi + µi
Yj = α + βxj + µj

Cov(µi, µj) ≠ 0 : spatial autocorrelation
Cov(µt, µt+1) ≠ 0 : autocorrelation


In cross sectional data, if two error terms do not have zero covariance, then it is a
situation of SPATIAL CORRELATION. In time series data, if two error terms for
consecutive time periods do not have zero covariance, then it is a situation of
AUTOCORRELATION OR SERIAL CORRELATION.
No Multicollinearity: Multicollinearity occurs when there is a high degree of correlation
between two or more independent variables in the regression model. This can pose
a problem because it becomes challenging to separate the individual effects of the
correlated variables. Multicollinearity can lead to imprecise and


unstable parameter estimates.
By adhering to these assumptions, the CLRM exhibits desirable properties such as
efficiency, unbiasedness, and consistency. Efficiency refers to obtaining parameter
estimates with the minimum possible variance, allowing for precise estimation.
Unbiasedness means that, on average, the estimated parameters are not
systematically over or underestimating the true population parameters. Consistency
implies that as the sample size increases, the estimated parameters converge to the
true population parameters.
In conclusion, the Classical Linear Regression Model (CLRM) is a widely used
statistical framework for analysing the relationship between a dependent variable and
one or more independent variables. By estimating the parameters of the regression
equation, we can make predictions, test hypotheses, and gain insights into the factors
influencing the dependent variable. However, it is crucial to ensure that the
assumptions of the CLRM are met to obtain reliable and meaningful results. Violating
these assumptions can lead to biased and inconsistent parameter estimates,
compromising the validity of the analysis.
2.3.3 Simple Linear Regression
2.3.3.1. Estimation of Parameters
Simple Linear Regression involves estimating the parameters of a linear equation that
best fits the relationship between a single independent variable and a dependent
variable. We will discuss the methods used to estimate these parameters and interpret
their meaning in the context of the problem at hand using R/Python programming.
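One way to see the estimation concretely is through the ordinary least squares formulas β̂ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² and α̂ = ȳ - β̂x̄. The sketch below applies them directly; it assumes NumPy, and the five data points are invented for illustration.

# Sketch: closed-form OLS estimates for a simple linear regression
# (assumes NumPy; the data are made-up illustration values).
import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)   # independent variable
y = np.array([25, 45, 62, 85, 101], dtype=float)  # dependent variable

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

print("slope:", round(beta_hat, 3), "intercept:", round(alpha_hat, 3))
print("fitted values:", alpha_hat + beta_hat * x)

Library routines such as statsmodels' OLS or scikit-learn's LinearRegression produce the same estimates, and the R/Python practical exercises can rely on them.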
2.3.3.2 Model Validation
Validating the simple linear regression model is crucial to ensure its reliability. We will
cover various techniques, such as hypothesis testing, to assess the significance of the
model and evaluate its performance. Additionally, we will examine residual analysis to
understand the differences between the observed and predicted values and identify
potential issues with the model.
Validation of a simple linear regression model involves assessing the model's
performance and determining how well it fits the data. Here are some common
techniques for validating a simple linear regression model:



Residual Analysis: Residuals are the differences between the observed values and the
predicted values of the dependent variable. By analysing the residuals, you can
evaluate the model's performance. Some key aspects to consider are:
Checking for randomness: Plotting the residuals against the predicted values or the
independent variable can help identify any patterns or non-random behaviour.
Assessing normality: Plotting a histogram or a Q-Q plot of the residuals can indicate
whether they follow a normal distribution. Departures from normality might suggest
violations of the assumptions.
Checking for homoscedasticity: Plotting the residuals against the predicted values
or the independent variable can reveal any patterns indicating non-constant
variance. The spread of the residuals should be consistent across all levels of the
independent variable.
R-squared (Coefficient of Determination): R-squared measures the proportion of the
total variation in the dependent variable that is explained by the linear regression
model. A higher R-squared value indicates a better fit. However, R-squared alone does
not provide a complete picture of model performance and should be interpreted along
with other validation metrics.
Adjusted R-squared: Adjusted R-squared takes into account the number of
independent variables in the model. It penalizes the addition of irrelevant variables and
provides a more reliable measure of model fit when comparing models with different
numbers of predictors.
F-statistic: The F-statistic assesses the overall significance of the linear regression
model. It compares the fit of the model with a null model (no predictors) and provides
a p-value indicating whether the model significantly improves upon the null model.
Outlier Analysis: Identify potential outliers in the data that may have a substantial
impact on the model's fit. Outliers can skew the regression line and affect the estimated
coefficients. It is important to investigate and understand the reasons behind any
outliers and assess their influence on the model.
Cross-Validation: Splitting the dataset into training and testing subsets allows you to
assess the model's performance on unseen data. The model is trained on the training
set and then evaluated on the testing set. Metrics such as mean squared error (MSE),
or root mean squared error (RMSE) can be calculated to quantify the model's
predictive accuracy.
By employing these validation techniques, you can gain insights into the model's
performance, evaluate its assumptions, and make informed decisions about its
reliability and usefulness for predicting the dependent variable.
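The sketch below is one hedged illustration of such a validation on simulated data; it assumes scikit-learn and NumPy, fits the model on a training subset, and reports RMSE, R-squared, and the mean residual on the held-out test subset.

# Sketch: validating a simple linear regression with a train/test split
# (assumes scikit-learn and NumPy; the data are simulated).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(2)
x = rng.uniform(0, 50, 200).reshape(-1, 1)
y = 10 + 2.5 * x.ravel() + rng.normal(0, 8, 200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

reg = LinearRegression().fit(x_train, y_train)
pred = reg.predict(x_test)

rmse = mean_squared_error(y_test, pred) ** 0.5          # root mean squared error
print("test RMSE:", round(rmse, 2))
print("test R-squared:", round(r2_score(y_test, pred), 3))

residuals = y_test - pred                               # inspect for patterns or outliers
print("mean residual (should be near zero):", round(residuals.mean(), 3))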

2.3.4. Coefficient of Determination:


The coefficient of determination, commonly known as R-squared, quantifies the
proportion of variance in the dependent variable that can be explained by the
independent variable in a simple linear regression model. We will delve into the
calculation and interpretation of this important metric.
Introduction:
The overall goodness of fit of the regression model is measured by the coefficient of determination, r². It tells us what proportion of the variation in the dependent variable (the regressand) is explained by the explanatory variable (the regressor). This r² lies between 0 and 1; the closer it is to 1, the better the fit.
Let TSS denote the TOTAL SUM OF SQUARES, i.e., the total variation of the actual Y values about their sample mean:
TSS = Σ (yᵢ - ȳ)²
TSS can further be split into two variations; explained sum of square (ESS) and
residual sum of squares (RSS).
Explained sum of square (ESS) or Regression sum of squares or Model sum of squares
is a statistical quantity used in modelling of a process. ESS gives an estimate of how
well a model explains the observed data for the process.
ESS = Σ (ŷᵢ - ȳ)²
The residual sum of squares (RSS) measures the amount of variation in the data that is not explained by the regression model; it is the variation in the residuals, or error term.
RSS = Σ (yᵢ - ŷᵢ)²
Since TSS = ESS + RSS,
1 = ESS/TSS + RSS/TSS
Since ESS/TSS is the proportion of variability in Y explained by the regression model,
r² = ESS/TSS
Alternatively, from the above, r² = 1 - RSS/TSS
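These identities can be verified numerically. The following short Python sketch (with hypothetical data and a quick OLS fit used only to obtain fitted values) computes TSS, ESS, RSS and r²:

import numpy as np
import statsmodels.api as sm

# Hypothetical data and a simple OLS fit, used only to obtain fitted values
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 50)
y = 4 + 2 * x + rng.normal(0, 1, 50)
y_hat = sm.OLS(y, sm.add_constant(x)).fit().fittedvalues

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares

print("TSS and ESS + RSS:", tss, ess + rss)                       # the identity holds
print("r2 = ESS/TSS =", ess / tss, "= 1 - RSS/TSS =", 1 - rss / tss)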

2.3.5 Significance Tests:


To determine the significance of the simple linear regression model and its coefficients,
we will explore statistical tests such as t-tests and p-values in the practical exercise.
These tests help assess the statistical significance of the relationships between
variables and make informed conclusions.
2.3.6 Residual Analysis
Residual analysis is a critical step in evaluating the adequacy of a simple linear
regression model. Using practical examples, we will discuss how to interpret and
analyse residuals, which represent the differences between the observed and
predicted values. Residual analysis provides insights into the model's assumptions
and potential areas for improvement.
2.3.7 Confidence and Prediction Intervals
Confidence and prediction intervals are essential in understanding the uncertainty
associated with the predictions made by a simple linear regression model. We will
cover the calculation and interpretation of these intervals, allowing us to estimate the
range within which future observations are expected to fall in the practical exercises.

2.4 MULTIPLE LINEAR REGRESSION MODEL

Multiple regression is a statistical analysis technique used to examine the relationship


between a dependent variable and two or more independent variables. It builds upon
the concept of simple linear regression, which analyses the relationship between a
dependent variable and a single independent variable.
The multiple regression model equation looks like this:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
In this equation:
Y represents the dependent variable that we want to predict or
explain. X1, X2, ..., Xn are the independent variables.
β0 is the y-intercept or constant term.
β1, β2, ..., βn are the coefficients or regression weights that represent the change in
the dependent variable associated with a one-unit change in the corresponding
independent variable.

ε is the error term or residual, representing the unexplained variation in the dependent variable.
2.4.1 Interpretation of Partial Regression Coefficients:
Multiple Linear Regression extends the simple linear regression framework to include
multiple independent variables. We will explore the interpretation of partial regression
coefficients, which quantify the relationship between each independent variable and
the dependent variable while holding other variables constant.
2.4.2 Working with Categorical Variables:
Categorical variables require special treatment in regression analysis. We will discuss how to handle categorical (qualitative) variables by creating dummy variables. The interpretation of their coefficients will be explained to understand the impact of categorical variables on the dependent variable.
2.4.3 Multicollinearity and VIF:
Multicollinearity refers to the presence of a high correlation between independent variables in a multiple linear regression model. One of the assumptions of the CLRM is that there is no exact linear relationship among the independent variables (regressors). If there are one or more such relationships among the regressors, we call it multicollinearity.
There are two types of multicollinearity:
1. Perfect collinearity
2. Imperfect collinearity
Perfect multicollinearity occurs when two or more independent variables in a regression model exhibit a deterministic (perfectly predictable, containing no randomness) linear relationship. With imperfect multicollinearity, an independent variable is a strong, but not perfect, linear function of one or more other independent variables; other factors in the model also affect that independent variable.
Multicollinearity occurs when there is a high correlation between independent variables
in a regression model. It can cause issues with the estimation of coefficients and affect
the reliability of statistical inference.


The causes of multicollinearity are as follows:
1) Data collection method: If we sample over a limited range of values taken by the
regressors in the population, it can lead to multicollinearity
2) Model specification: If we introduce polynomial terms into the model, especially
when the values of the explanatory variables are small; it can lead to
multicollinearity.
3) Constraint on the model or in the population: For example, if we try to regress
electricity expenditure on house size and income, it may suffer from multicollinearity
as there is a constraint in the population. People with higher incomes typically have
bigger houses.
4) Overdetermined model: If we have more explanatory variables than the number of observations, it can lead to multicollinearity. This often happens in medical research, where a large amount of information is collected about a limited number of patients.

Impact of multicollinearity:
Unbiasedness: The Ordinary Least Squares (OLS) estimators remain unbiased.
Precision: OLS estimators have large variances and covariances, making precise
estimation difficult and leading to wider confidence intervals. Statistically insignificant
coefficients may be observed.
High R-squared: The R-squared value can still be high, even with statistically
insignificant coefficients.
Sensitivity: OLS estimators and their standard errors are sensitive to small changes
in the data.
Efficiency: Despite increased variance, OLS estimators are still efficient, meaning
they have minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient estimates and can
lead to unreliable statistical inference. While the OLS estimators remain unbiased, they
become imprecise, resulting in wider confidence intervals and potential insignificance
of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation Factor (VIF)
and explore strategies to address this issue, ensuring the accuracy and interpretability
of the regression model.


VIF stands for Variance Inflation Factor, a measure used to assess multicollinearity in a multiple regression model. VIF quantifies how much the variance of an estimated regression coefficient is increased due to multicollinearity. It measures how much the variance of one independent variable's estimated coefficient is inflated by the presence of the other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 - rj²)
where rj² represents the coefficient of determination (R-squared) from a regression model that regresses Xj on all other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between Xj and the
other independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity, and it is
generally considered problematic.
When assessing multicollinearity, it is common to examine the VIF values for all
independent variables in the model. If any variables have high VIF values, it indicates
that they are highly correlated with the other variables, which may affect the reliability
and interpretation of the regression coefficients.
If high multicollinearity is detected (e.g., VIF greater than 5), some steps can be taken
to address it:
Remove one or more of the highly correlated independent variables from the
model. Combine or transform the correlated variables into a single variable.
Obtain more data to reduce the correlation among the independent variables.
By addressing multicollinearity, the stability and interpretability of the regression model
can be improved, allowing for more reliable inferences about the relationships between
the independent variables and the dependent variable.
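A short Python sketch (using statsmodels on hypothetical, deliberately correlated regressors) that computes VIF values is given below:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical regressors; x2 is constructed to be highly correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 100)
x2 = 0.9 * x1 + rng.normal(0, 0.3, 100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

# One VIF per column of the design matrix (the constant's VIF can be ignored)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # large VIF values for x1 and x2 signal multicollinearity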
HOW TO DETECT MULTICOLLINEARITY
To detect multicollinearity in your regression model, you can use several methods:



Pairwise Correlation: Calculate the pairwise correlation coefficients between each pair
of explanatory variables. If the correlation coefficient is very high (typically greater than
0.8), it indicates potential multicollinearity. However, low pairwise correlations do not
guarantee the absence of multicollinearity.
Variance Inflation Factor (VIF) and Tolerance: VIF measures the extent to which the
variance of the estimated regression coefficient is increased due to multicollinearity.
High VIF values (greater than 10) suggest multicollinearity. Tolerance, which is the
reciprocal of VIF, measures the proportion of variance in the predictor variable that is
not explained by other predictors. Low tolerance values (close to zero) indicate high
multicollinearity.
Insignificance of Individual Variables: If many of the explanatory variables in the model are individually insignificant (i.e., their t-statistics are statistically insignificant) despite a high R-squared value, it suggests the presence of multicollinearity.
Auxiliary Regressions: Conduct auxiliary regressions where each independent
variable is regressed against the remaining independent variables. Check the overall
significance of these regressions using the F-test. If any of the auxiliary regressions
show significant F values, it indicates collinearity with other variables in the model.
HOW TO FIX MULTICOLLINEARITY
To address multicollinearity, you can consider the following approaches:
Increase Sample Size: By collecting a larger sample, you can potentially reduce
the severity of multicollinearity. With a larger sample, you can include
individuals with different characteristics, reducing the correlation between
variables. Increasing the sample size leads to more efficient estimators and
mitigates the multicollinearity problem.
Drop Non-Essential Variables: If you have variables that are highly correlated
with each other, consider excluding non-essential variables from the model. For
example, if both father's and mother's education are highly correlated, you can
choose to include only one of them. However, be cautious when dropping
variables as it may result in model misspecification if the excluded variable is
theoretically important.
Detecting and addressing multicollinearity is crucial for obtaining reliable regression
results. By understanding the signs of multicollinearity and applying appropriate
remedies, you can improve the accuracy and interpretability of your regression model.


2.4.4 Outlier Analysis


Outliers can significantly influence the results of a regression model. We will discuss
techniques for identifying and handling outliers effectively, enabling us to build more
robust and reliable models.
2.4.5 Autocorrelation
Autocorrelation, also known as serial correlation, refers to the correlation between observations in a time series data set or within a regression model. It arises when there is a systematic relationship between the current observation and one or more past observations. Autocorrelation occurs when the residuals of a regression model exhibit
a pattern, indicating a potential violation of the model's assumptions. We will cover
methods for detecting and addressing autocorrelation, ensuring the independence of
residuals and the validity of our model.
Consequences of Autocorrelation

I. OLS estimators are still unbiased and consistent.


II. They are still normally distributed in large samples.
III. But they are no longer efficient; that is, they are no longer BLUE (best linear unbiased estimators). In the presence of autocorrelation, standard errors are UNDERESTIMATED, which means that the t values are OVERESTIMATED. Hence, variables that may not be statistically significant erroneously appear to be statistically significant, with high t-values.
IV. Hypothesis testing procedure is not reliable as standard errors are
erroneous, even with large samples. Therefore, the F and T tests may not
be valid.
Autocorrelation can be detected by following methods:

Graphical Method
Durbin Watson test
Breusch-Godfrey test
1. Graphical Method
Autocorrelation can be detected using graphical methods. Here are a few graphical
techniques to identify autocorrelation:




Residual Plot: Plot the residuals of the regression model against the corresponding
time or observation index. If there is no autocorrelation, the residuals should appear
random and evenly scattered around zero. However, if autocorrelation is present, you
may observe patterns or clustering of residuals above or below zero, indicating a
systematic relationship.
Partial Autocorrelation Function (PACF) Plot: The PACF plot displays the correlation
between the residuals at different lags, while accounting for the intermediate lags. In
the absence of autocorrelation, the PACF values should be close to zero for all lags
beyond the first. If there is significant autocorrelation, you may observe spikes or
significant values beyond the first lag.
Autocorrelation Function (ACF) Plot: The ACF plot shows the correlation between the
residuals at different lags, without accounting for the intermediate lags. Similar to the
PACF plot, significant values beyond the first lag in the ACF plot indicate the presence
of autocorrelation.

Figure 1.2: Autocorrelation and partial autocorrelation function (ACF and PACF) plots, prior to differencing (panels A and B) and after differencing (panels C and D)
In both the PACF and ACF plots, significance can be determined by comparing the
correlation values against the confidence intervals. If the correlation values fall outside
the confidence intervals, it suggests the presence of autocorrelation.
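As an illustration, the Python sketch below (using statsmodels plotting utilities on a hypothetical residual series with built-in first-order autocorrelation) produces the ACF and PACF plots described above:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical residual series with first-order (AR(1)) autocorrelation built in
rng = np.random.default_rng(1)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(e, lags=20, ax=axes[0])    # slowly decaying spikes suggest autocorrelation
plot_pacf(e, lags=20, ax=axes[1])   # a single large spike at lag 1 suggests an AR(1) pattern
plt.show()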

It is important to note that these graphical methods provide indications of autocorrelation, but formal statistical tests, such as the Durbin-Watson test or the Ljung-Box test, should be conducted to confirm and quantify the autocorrelation in the model.
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation in the
residuals of a regression model. It is specifically designed for detecting first-order
autocorrelation, which is the correlation between adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ (eᵢ - eᵢ₋₁)² / Σ eᵢ²
where:
· eᵢ is the residual for observation i.
· eᵢ₋₁ is the residual for the previous observation (i - 1).
The test statistic is then compared to critical values to determine the presence of
autocorrelation. The critical values depend on the sample size, the number of
independent variables in the regression model, and the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The test statistic
is calculated based on the residuals of the regression model and is interpreted as
follows:
A value of d close to 2 indicates no significant autocorrelation. It suggests that the
residuals are independent and do not exhibit a systematic relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests that there is a
positive relationship between adjacent residuals, meaning that if one residual is high,
the next one is likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests that there is
a negative relationship between adjacent residuals, meaning that if one residual is
high, the next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation, and the
closer it is to 4, the greater is the evidence of negative autocorrelation. If d is about 2,
there is no evidence of positive or negative (first-) order autocorrelation.
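A minimal Python sketch (fitting a hypothetical OLS model with statsmodels and then computing the Durbin-Watson statistic from its residuals) is shown below:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical regression fit, used only to obtain residuals
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)
results = sm.OLS(y, sm.add_constant(x)).fit()

dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.3f}")  # values near 2 suggest no first-order autocorrelation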


3. The Breusch-Godfrey Test
The Breusch-Godfrey test, also known as the LM test for autocorrelation, is a
statistical test used to detect autocorrelation in the residuals of a regression model.
Unlike the Durbin-Watson test, which is primarily designed for detecting first-order
autocorrelation, the Breusch-Godfrey test can detect higher-order autocorrelation.
The test is based on the idea of regressing the residuals of the original regression
model on lagged values of the residuals. It tests whether the lagged residuals are
statistically significant in explaining the current residuals, indicating the presence of
autocorrelation.
The general steps for performing the Breusch-Godfrey test are as follows:
1. Estimate the initial regression model and obtain the residuals.
2. Extend the initial regression model by including lagged values of the residuals as
additional independent variables.
3. Estimate the extended regression model and obtain the residuals from this model.
4. Perform a hypothesis test on whether the lagged residuals are jointly significant in
explaining the current residuals.
The test statistic for the Breusch-Godfrey test follows a chi-square distribution and is
calculated based on the residual sum of squares (RSS) from the extended regression
model. The test statistic is compared to the critical values from the chi-square
distribution to determine the presence of autocorrelation.
The interpretation of the Breusch-Godfrey test involves the following steps:
1. Set up the null hypothesis (H0): There is no autocorrelation in the residuals
(autocorrelation is absent).

2. Set up the alternative hypothesis (Ha): There is autocorrelation in the residuals


(autocorrelation is present).
3. Conduct the Breusch-Godfrey test and calculate the test statistic.
4. Compare the test statistic to the critical value(s) from the chi-square distribution.
5. If the test statistic is greater than the critical value, reject the null hypothesis and conclude that there is evidence of autocorrelation. If the test statistic is less than the critical value, fail to reject the null hypothesis and conclude that there is no significant evidence of autocorrelation.
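These steps do not have to be carried out by hand; statsmodels provides the test directly. A minimal Python sketch on a hypothetical OLS fit is shown below:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Hypothetical OLS fit; the Breusch-Godfrey test is then applied to its residuals
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 120)
y = 1 + 2 * x + rng.normal(0, 1, 120)
results = sm.OLS(y, sm.add_constant(x)).fit()

# Test for autocorrelation up to lag 2
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(results, nlags=2)
print(f"LM statistic: {lm_stat:.3f}, p-value: {lm_pvalue:.3f}")
# A p-value below 0.05 would indicate autocorrelation up to the chosen lag order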

2.4.6 Transformation of Variables


Transforming variables can enhance the fit and performance of a regression model. We will explore techniques such as logarithmic and power transformations in practical examples, which can help improve the linearity, normality, and homoscedasticity assumptions.
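As a brief illustration, the Python sketch below (with hypothetical data that grows multiplicatively) fits a log-log specification, where the slope can be read as an elasticity:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data in which y grows multiplicatively with x
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 150)
y = np.exp(0.5 + 0.8 * np.log(x) + rng.normal(0, 0.2, 150))
data = pd.DataFrame({"x": x, "y": y})

# Log-log specification: both variables are log-transformed before fitting
log_model = sm.OLS(np.log(data["y"]), sm.add_constant(np.log(data["x"]))).fit()
print(log_model.params)  # the coefficient on log(x) is interpreted as an elasticity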
2.4.7 Variable Selection in Regression Model Building:
Building an optimal regression model involves selecting the most relevant independent
variables. We will discuss various techniques for variable selection, including stepwise
regression and regularization methods like Lasso and Ridge regression.
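A minimal Python sketch of regularized variable selection (using scikit-learn on hypothetical data in which only two of eight candidate predictors matter) is shown below:

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Hypothetical design matrix with 8 candidate predictors, only 2 of which are relevant
rng = np.random.default_rng(5)
X = rng.normal(0, 1, (200, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)

X_std = StandardScaler().fit_transform(X)  # regularized models work best on scaled inputs

lasso = LassoCV(cv=5).fit(X_std, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_std, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))  # irrelevant predictors shrink towards 0
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # coefficients shrink but rarely reach 0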

2.5 PRACTICAL EXERCISES USING R/PYTHON PROGRAMMING


To reinforce the concepts covered in this chapter, practical exercises using R/Python programming are shown below. These exercises involve implementing simple OLS regression in R or Python, interpreting the results obtained, and conducting assumption tests such as checking for multicollinearity, autocorrelation, and normality. Furthermore, regression analysis with categorical/dummy/qualitative variables will be performed to understand their impact on the dependent variable.
Exercise 1: Perform simple OLS regression in R/Python and interpret the results obtained.
Sol. Here is an example of how to perform a simple Ordinary Least Squares (OLS) regression in both R and Python, along with an interpretation of the results. Let's assume you have a dataset with a dependent variable (Y) and an independent variable (X). We will use this dataset to demonstrate the OLS regression.

Using R:
# Load the necessary libraries
library(dplyr)
# Read the dataset
data <- read.csv("your_dataset.csv")
# Perform the OLS regression
model <- lm(Y ~ X, data = data)
# Print the summary of the regression results
summary(model)

Using Python (using the statsmodels library):
# Import the necessary libraries
import pandas as pd
import statsmodels.api as sm
# Read the dataset
data = pd.read_csv("your_dataset.csv")
# Perform the OLS regression
model = sm.OLS(data['Y'], sm.add_constant(data['X']))
# Fit the model
results = model.fit()
# Print the summary of the regression results
print(results.summary())
In both R and Python, we first load the necessary libraries (e.g., dplyr in R and
pandas and statsmodels in Python). Then, we read the dataset containing the
variables Y and X.
Next, we perform the OLS regression by specifying the formula in R (Y ~ X) and using
the lm function. In Python, we create an OLS model object using sm.OLS and provide
the dependent variable (Y) and independent variable (X) as arguments. We also add
a constant term using sm.add_constant to account for the intercept in the regression.
After fitting the model, we can print the summary of the regression results using summary(model) in R and print(results.summary()) in Python. The summary provides various statistical measures and information about the regression model.
Interpreting the results:
Coefficients: The regression results will include the estimated coefficients for the
intercept and the independent variable. These coefficients represent the average
change in the dependent variable for a one-unit increase in the independent variable.
For example, if the coefficient for X is 0.5, it suggests that, on average, Y increases by
0.5 units for every one
unit increase in X.


p-values: The regression results also provide p-values for the
coefficients. These p-values indicate the statistical significance of the coefficients.
Generally, a p-value less than a significance level (e.g., 0.05) suggests that the
coefficient is statistically significant, implying a relationship between the independent
variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the proportion of the
variance in the dependent variable that can be explained by the independent
variable(s). It ranges from 0 to 1, with higher values indicating a better fit of the
regression model to the data. R-squared can be interpreted as the percentage of the
dependent variable's variation explained by the independent variable(s).
Residuals: The regression results also include information about the residuals, which
are the differences between the observed values of the dependent variable and the
predicted values from the regression model. Residuals should ideally follow a normal
distribution with a mean of zero, and their distribution can provide insights into the
model's goodness of fit and potential violations of the regression assumptions.
It's important to note that interpretation may vary depending on the specific context and
dataset. Therefore, it's essential to consider the characteristics of your data and the
objectives of your analysis while interpreting the results of an OLS regression.
Exercise 2. Test the assumptions of OLS (multicollinearity, autocorrelation, normality
etc.) on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity, autocorrelation, and
normality, you can use various diagnostic tests in R or Python. Here are the steps and
some commonly used tests for each assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent variables using the cor() function in R or the corrcoef() function in Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent variable using the vif() function from the "car" package in R or the variance_inflation_factor() function from the "statsmodels" library in Python. VIF values greater than 10 indicate high multicollinearity.
Step 3: Perform auxiliary regressions by regressing each independent variable
against the remaining independent variables to identify highly collinear variables.
Autocorrelation:

Step 1: Plot the residuals against the predicted values (fitted values) from the regression model. In R, you can use the plot() function with the residuals() and fitted() functions. In Python, you can use the scatter() function from matplotlib.
Step 2: Conduct the Durbin-Watson test using the dwtest() function from the "lmtest" package in R or the durbin_watson() function from the "statsmodels.stats.stattools" module in Python. A value close to 2 indicates no autocorrelation; values significantly smaller than 2 suggest positive autocorrelation, and values significantly greater than 2 suggest negative autocorrelation.
Normality of Residuals:
Step 1: Plot a histogram or a kernel density plot of the residuals. In R, you can use the hist() or density() functions. In Python, you can use the histplot() or kdeplot() functions from the seaborn library.
Step 2: Perform a normality test such as the Shapiro-Wilk test using the shapiro.test() function in R or the shapiro() function from the "scipy.stats" module in Python. A p-value greater than 0.05 indicates no significant departure of the residuals from normality.
It's important to note that these tests provide diagnostic information, but they may not
be definitive. It's also advisable to consider the context and assumptions of the specific
regression model being used.
Here is the random data set to perform the regression code in either R or
Python.

This dataset consists of three columns: y represents the dependent variable, and x1
and x2 are the independent variables. Each row corresponds to an observation in the
dataset.
We can use this dataset to run the provided code and perform diagnostic tests on the
OLS regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

# Set random seed for reproducibility
np.random.seed(123)

# Generate random data
n = 100  # Number of observations
x1 = np.random.normal(0, 1, n)  # Independent variable 1
x2 = np.random.normal(0, 1, n)  # Independent variable 2
epsilon = np.random.normal(0, 1, n)  # Error term


# Generate dependent variable
y = 1 + 2*x1 + 3*x2 + epsilon

# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})

# Fit OLS regression model
X = sm.add_constant(data[['x1', 'x2']])  # Add constant term
model = sm.OLS(data['y'], X)
results = model.fit()

# Diagnostic tests
print("Multicollinearity:")
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

print("\nAutocorrelation:")
residuals = results.resid
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()

print("Durbin-Watson test:")
dw_statistic = durbin_watson(residuals)

BMS print (f"Durbin-Watson statistic: {dw_statistic}")

print("\nNormality of Residuals:")
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

shapiro_test = sm.stats.shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test[1]}")

In this example, we generated a random dataset with two independent variables (x1
and x2) and a dependent variable (y). We fit an OLS regression model using the
statsmodels library. Then, we perform diagnostic tests for multicollinearity,
autocorrelation, and normality of residuals.
The code calculates the VIF for each independent variable, plots the residuals against
the fitted values, performs the Durbin-Watson test for autocorrelation, and plots a
histogram of the residuals. Additionally, the Shapiro-Wilk test is conducted to check
the normality of residuals.
We can run this code in a Python environment to see the results and interpretations
for each diagnostic test based on the random dataset provided.
Exercise 3: Perform regression analysis with categorical/dummy/qualitative variables in R/Python.
import pandas as pd
import statsmodels.api as sm
# Create a DataFrame with the data
data = {
'y': [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032, 5.56,
5.3514, 5.8457],

'x1': [-1.085631, 0.997345, 0.282978, -1.506295, -0.5786, 1.651437, -2.426679, -0.428913, -0.86674, 0.742045, 2.312265],

44 | P a g e

© Department of Distance & Continuing Education, Campus of Open


Learning, School of Open Learning, University of Delhi

DSC 7: Introduction to Business Analytics


'x2': [-0.076047, 0.352978, -2.242685, 1.487477, 1.058969, -0.37557, -0.600516, 0.955434, -0.151318, -0.10322, 0.410598],
'category': ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B']
}

df = pd.DataFrame(data)

# Convert the categorical variable to dummy variables


df = pd.get_dummies(df, columns=['category'], drop_first=True)

# Define the dependent and independent variables


X = df[['x1', 'x2', 'category_B']]
y = df['y']

# Add a constant term to the independent variables


X = sm.add_constant(X)

# Fit the OLS model


model = sm.OLS(y, X).fit()

# Print the summary of the regression results


print(model.summary())

In this example, we have created a DataFrame df with the y, x1, x2, and category variables. The category variable is converted into dummy variables using the get_dummies function, and the category_A column is dropped to avoid multicollinearity. We then define the dependent variable y and the independent variables X, including the dummy variable category_B. A constant term is added to the independent variables using sm.add_constant. Finally, we fit the OLS model using sm.OLS and print the summary of the regression results using model.summary(). The regression analysis provides the estimated coefficients, standard errors, t-statistics, and p-values for each independent variable, including the dummy variable category_B.

IN-TEXT QUESTIONS AND ANSWERS

1. What is the main objective of simple linear regression?


Answer: The main objective of simple linear regression is to establish a
linear relationship between a dependent variable and a single independent
variable and use it to predict the value of the dependent variable based on
the value of the independent variable.

2. What are the key assumptions of multiple linear regression?


Answer: The key assumptions of multiple linear regression are linearity,
independence of errors, homoscedasticity, absence of multicollinearity,
and normality of residuals.

3. What is the interpretation of the coefficient of determination (R-squared)?
Answer: The coefficient of determination (R-squared) represents the proportion of the variance in the dependent variable that can be explained by the independent variables in the regression model. It ranges from 0 to 1, where 0 indicates no explanatory power, and 1 indicates that all the variability in the dependent variable is accounted for by the independent variables.

4. How is multicollinearity detected in multiple linear regression?


Answer: Multicollinearity in multiple linear regression can be detected
through methods such as examining pairwise correlations among the
independent variables, calculating variance inflation factor (VIF) values,
and performing auxiliary regressions.

2.6 SUMMARY

This chapter provides a comprehensive understanding of predictive analytics techniques, with a specific focus on simple linear regression and multiple linear regression. It supplies the knowledge and practical skills necessary to apply these techniques using R or Python, enabling one to make informed predictions and interpretations in the context of regression analysis.

2.7 SELF-ASSESSMENT QUESTIONS
1. What is the purpose of residual analysis in regression?
2. How do you interpret the p-value in regression analysis?
3. What is the purpose of stepwise regression?
4. What is the difference between simple linear regression and multiple linear
regression?
5. What is the purpose of interaction terms in multiple linear regression?
6. How can you assess the goodness of fit in regression analysis?

2.8 REFERENCES

1. Business Analytics: The Science of Data Driven Decision Making, First Edition
(2017), U Dinesh Kumar, Wiley, India.

2.9 SUGGESTED READINGS

1. Introduction to Machine Learning with Python, Andreas C. Mueller and Sarah


Guido, O'Reilly Media, Inc.
2. Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python. Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. Wiley.


LESSON 3
LOGISTIC AND MULTINOMIAL REGRESSION
Anurag Goel
Assistant Professor, CSE Dept.
Delhi Technological University, New Delhi
Email-Id: [email protected]

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Glossary
3.14 Answers to In-Text Questions
3.15 Self-Assessment Questions
3.16 References
3.17 Suggested Readings

3.1 LEARNING OBJECTIVES


At the end of the chapter, the students will be able to:
● become familiar with the concepts of logistic regression and multinomial logistic regression.
● understand the various evaluation metrics used to evaluate a logistic regression model.
● analyse the scenarios where a logistic regression model is relevant.
● apply the logistic regression model to nominal and ordinal outcomes.



3.2 INTRODUCTION
In machine learning, we often are required to determine if a particular variable belongs
to a given class. In such cases, one can use logistic regression. Logistic Regression,
a popular supervised learning technique, is commonly employed when the desired
outcome is a categorical variable such as binary decisions (e.g., 0 or 1, yes or no, true
or false). It finds extensive applications in various domains, including fake news
detection and cancerous cell identification.
Some examples of logistic regression applications are as follows:
To detect whether a given news item is fake or not.
To detect whether a given cell is cancerous or not.
In essence, logistic regression can be understood as the probability of belonging to a
class given a particular input variable. Since it’s probabilistic in nature, the logistic
regression output values lie in the range of 0 and 1.
Generally, when we think about regression from a strictly statistical perspective, the output value is not restricted to a particular interval. To achieve this restriction in logistic regression, we use the logistic function. Intuitively, logistic regression can be thought of as a simple regression model whose output is passed through a logistic function, so that the final output is restricted to the range defined above.
Generally, logistic regression results work well when the output is of binary type, that
is, it either belongs to a specific category or it does not. This, however, is not always
the case in real-life problem statements. We may encounter a lot of scenarios where
we have a dependent variable having multiple classes or categories. In such cases,
Multinomial Regression emerges as a valuable extension of logistic regression,
specifically designed to handle multiclass problems. Multinomial Regression is the
generalization of logistic regression to multiclass problems. For example, based on the
results of some analysis, predicting the engineering branch students will choose for
their graduation is a multinomial regression problem since the output categories of
engineering branches are multiple. In this multinomial regression problem, the
engineering branch will be the dependent variable predicted by the multinomial
regression model while the independent variables are student’s marks in XII board
examination, student’s score in engineering entrance exam, student’s interest
areas/courses etc. These independent variables are used by the multinomial
regression model to predict the outcome i.e. engineering branch the student may opt
for.


To better understand the application of multinomial regression,
consider the example of predicting a person's blood group based on the results of
various diagnostic tests. Unlike binary classification problems that involve two
categories, blood group prediction involves multiple possible outcomes. In this case,
the output categories are the different blood groups, and predicting the correct blood
group for an individual becomes a multinomial regression problem. The multinomial
regression model aims to estimate the probabilities associated with each class or
category, allowing us to assign an input sample to the most likely category.
Now, let us understand this better by doing a simple walkthrough of how a multinomial
logistic regression model might work on the above example. For simplicity, let us
assume we have a well-balanced, cleaned, pre-processed and labelled dataset
available with us which has an input variable (or feature) and a corresponding output
blood group. During training, our multinomial logistic regression model will try to learn
the underlying patterns and relationships between the input features and the
corresponding class labels (from training data). Once trained, the model can utilise
these learned patterns and relationships on new (or novel) input variable to assign a
probability of the input variable to belonging to each output class using the logistic
function. Model can then simply select the class which has the highest probability as
the predicted output of our overall model.
Thus, multinomial regression serves as a powerful extension of logistic regression,
enabling the handling of multiclass classification problems. By estimating the
probabilities associated with each class using the logistic function, it provides a
practical and effective approach for assigning input samples to their most likely
categories. Applications of multinomial regression encompass a wide range of
domains, including medical diagnosis, sentiment analysis, and object recognition,
where classification tasks involve more than two possible outcomes.
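A minimal Python sketch of such a multiclass model (using scikit-learn on hypothetical data with two numeric features and three outcome classes; the feature values and class labels are illustrative only) is shown below:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical three-class problem with two numeric features
rng = np.random.default_rng(6)
X = rng.normal(0, 1, (300, 2))
y = rng.integers(0, 3, 300)  # class labels 0, 1, 2 (e.g. three possible categories)

# With more than two classes and the default lbfgs solver, scikit-learn fits a multinomial model
clf = LogisticRegression(max_iter=1000).fit(X, y)

new_obs = np.array([[0.5, -1.2]])
print(clf.predict_proba(new_obs))  # one probability per class, summing to 1
print(clf.predict(new_obs))        # the class with the highest probability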

3.3 Logistic Function


3.3.1 Logistic function (Sigmoid function)
The sigmoid function is represented as follows:
f(x) = 1 / (1 + e^(-x))




It is a mathematical function that assigns values between 0 and 1 based on the input
variable. It is characterized by its S-shaped curve and is commonly used in statistics,
machine learning, and neural networks to model non-linear relationships and provide
probabilistic interpretations.
3.3.2 Estimation of probability using logistic function
The logistic function is often used for estimating probabilities in various fields. By
applying the logistic function to a linear combination of input variables, such as in
logistic regression, it transforms the output into a probability value between 0 and 1.
This allows for the prediction and classification of events based on their likelihoods.
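A small Python sketch of the logistic (sigmoid) function, applied to a few illustrative values of a linear combination z, is shown below:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# z could be a linear combination of inputs, e.g. z = b0 + b1*x (values here are illustrative)
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(z))  # approximately [0.018, 0.269, 0.5, 0.731, 0.982]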

3.4 OMNIBUS TEST


The Omnibus test is a statistical test used to assess the significance of several model parameters at once. It examines whether the combined effect of the predictors is statistically significant.
The Omnibus statistic is calculated from the difference in deviance between the full model (with predictors) and the reduced model (without predictors):
Omnibus = (Dr - Df) / Dr
where Dr represents the deviance of the reduced model (without predictors) and Df represents the deviance of the full model (with predictors).
The Omnibus test statistic approximately follows a chi-square distribution with degrees of freedom given by the difference in the number of predictors between the full and reduced models. By comparing the test statistic to the chi-square distribution and calculating the associated p-value, we can assess the collective statistical significance of the predictor variables.
When the calculated p-value is lower than a predefined significance level (e.g., 0.05),
we reject the null hypothesis, indicating that the group of predictor variables collectively
has a statistically significant influence on the dependent variable. On the other hand,
if the p-value exceeds the significance level, we fail to reject the null hypothesis,
suggesting that the predictors may not have a significant collective effect.
The Omnibus test provides a comprehensive assessment of the overall significance of
the predictor variables within a regression model, aiding in the understanding of how
these predictors jointly contribute to explaining the variation in the dependent variable.
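In practice, this joint test is usually obtained directly from the fitted model. The Python sketch below (using statsmodels Logit on hypothetical data) reports the closely related likelihood-ratio chi-square test of all predictors against the intercept-only model:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical binary-outcome data with three predictors
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(0, 1, (200, 3)), columns=["X1", "X2", "X3"])
y = (0.8 * X["X1"] - 0.5 * X["X3"] + rng.normal(0, 1, 200) > 0).astype(int)

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Likelihood-ratio (omnibus-style) test of all predictors jointly
print(f"LR chi-square: {full.llr:.3f}, df: {int(full.df_model)}, p-value: {full.llr_pvalue:.4f}")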


Let's consider an example where we have a regression model with
three predictor variables (X1, X2, X3) and a continuous dependent variable (Y). We
want to assess the overall significance of these predictors using the Omnibus test.
Here is a sample dataset with the predictor variables and the dependent variable:
X1    X2    X3    Y
2.5 6 8 10.2
3.2 4 7 12.1
1.8 5 6 9.5
2.9 7 9 11.3
3.5 5 8 13.2
2.1 6 7 10.8
2.7 7 6 9.7
3.9 4 9 12.9
2.4 5 8 10.1
2.8 6 7 11.5

Step 1: Fit the Full Model
We start by fitting the full regression model that includes all three predictor variables:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃

By using statistical software, we obtain the estimated coefficients and the deviance of
the full model:

β₀ = 8.463, β₁ = 0.643, β₂ = 0.245, β₃ = 0.812


Deviance_full = 5.274

Step 2: Fit the Reduced Model


Next, we fit the reduced model, which only includes the intercept term:
Y = β₀

Similarly, we obtain the deviance of the reduced model:

Deviance_reduced = 15.924

Step 3: Calculate the Omnibus Test Statistic


Using the deviance values obtained from the full and reduced models, we can
calculate the Omnibus test statistic:

Omnibus = (Deviance_reduced - Deviance_full) / Deviance_reduced


= (15.924 - 5.274) / 15.924
= 0.668

Step 4: Conduct the Hypothesis Test


To assess the statistical significance of the predictors, we compare the Omnibus test
statistic to the chi-square distribution with degrees of freedom equal to the difference
in the number of predictors between the full and reduced models. In this case, the
difference is 3 (since we have 3 predictor variables).

By referring to the chi-square distribution table or using statistical software, we


determine the p-value associated with the Omnibus test statistic. Let's assume the p-
value is 0.022.

Step 5: Interpret the Results


Since the p-value (0.022) is smaller than the predetermined significance level (e.g.,
0.05), we reject the null hypothesis. This indicates that the set of predictor variables
(X1, X2, X3) collectively has a statistically significant impact on the dependent variable
(Y). In other words, the predictors significantly contribute.

3.5 WALD TEST


The Wald test is a statistical test utilized to assess the significance of individual predictor variables in a regression model. It examines whether the estimated coefficient for a specific predictor is significantly different from zero, indicating its importance in predicting the dependent variable.

The formula for the Wald test statistic is as follows:
W = (β - β₀)² / Var(β)
53 | P a g e

© Department of Distance & Continuing Education, Campus of Open


Learning, School of Open Learning, University of Delhi

where β is the estimated coefficient for the predictor variable of interest, β₀ is the
hypothesized value of the coefficient under the null hypothesis (typically 0 for testing if
the coefficient is zero) and Var(β) is the estimated variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the degrees of
freedom are set to 1 (since we are testing a single parameter) to obtain the associated
p-value. Rejecting the null hypothesis occurs when the calculated p-value falls below
a predetermined significance level (e.g., 0.05), indicating that the predictor variable
has a statistically significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of predictor variables
by testing whether their coefficients significantly deviate from zero. It is a valuable tool
for identifying which variables have a meaningful impact on the outcome of interest in
a regression model.
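A minimal Python sketch (using statsmodels Logit on hypothetical data) that computes the Wald statistic for each coefficient as (estimate / standard error)² is shown below:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical binary-outcome data with two predictors
rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(0, 1, (150, 2)), columns=["X1", "X2"])
y = (1.2 * X["X1"] + rng.normal(0, 1, 150) > 0).astype(int)

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Wald statistic per coefficient, compared against a chi-square distribution with 1 df
wald = (res.params / res.bse) ** 2
print(pd.DataFrame({"coef": res.params, "Wald": wald, "p-value": res.pvalues}))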
Let's consider an example where we have a logistic regression model with two predictor
variables (X1 and X2) and a binary outcome variable (Y). We want to assess the
significance of the coefficient for each predictor using the Wald test.
Here is a sample dataset with the predictor variables and the binary outcome variable:
X1    X2    Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1

54 | P a g e
© Department of Distance & Continuing Education, Campus of Open
Learning, School of Open Learning, University of Delhi

Step 1: Fit the Logistic Regression Model
We start by fitting the logistic regression model with the predictor variables X1 and X2:
logit(p) = β₀ + β₁X₁ + β₂X₂

By using statistical software, we obtain the estimated coefficients and their standard errors:
β₀ = -1.613, β₁ = 0.921, β₂ = 0.372
SE(β₀) = 0.833, SE(β₁) = 0.512, SE(β₂) = 0.295


Step 2: Calculate the Wald Test Statistic
Next, we calculate the Wald test statistic for each predictor variable using the
formula: W = (β - β₀ )² / Var(β)

For X1:
W₁ = (0.921 - 0)² / (0.512)² = 1.790
For X2:
W₂ = (0.372 - 0)² / (0.295)² = 1.608
Step 3: Conduct the Hypothesis Test
To assess the statistical significance of each predictor, we compare the Wald test
statistic for each variable to the chi-square distribution with 1 degree of freedom
(since we are testing a single parameter).
By referring to the chi-square distribution table or using statistical software, we
determine the p-value associated with each Wald test statistic. Let's assume the p-
value for X1 is 0.183 and the p-value for X2 is 0.205.
Step 4: Interpret the Results
For X1, since the p-value (0.183) is larger than the predetermined significance
level (e.g., 0.05), we fail to reject the null hypothesis. This suggests that the
coefficient for X1 is not statistically significantly different from zero, indicating that
X1 may not have a significant effect on the binary outcome variable Y.

Similarly, for X2, since the p-value (0.205) is larger than the significance level, we
fail to reject the null hypothesis. This suggests that the coefficient for X2 is not
statistically


significantly different from zero, indicating that X2 may not have a
significant effect on the binary outcome variable Y.
In summary, based on the Wald tests, we do not have sufficient evidence to
conclude that either X1 or X2 has a significant impact on the binary outcome
variable in the logistic regression model.

IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to obtain the associated p-value?
a) The F-distribution
b) The t-distribution
c) The normal distribution
d) The chi-square distribution

2. What does the Omnibus test assess in a regression model?


a) The individual significance of predictor variables
b) The collinearity between predictor variables
c) The overall significance of predictor variables collectively
d) The goodness-of-fit of the regression model

3.6 HOSMER LEMESHOW TEST


The Hosmer-Lemeshow test is a statistical test used to evaluate the goodness-of-fit of
a logistic regression model. It assesses how well the predicted probabilities from the
model align with the observed outcomes.
The Hosmer-Lemeshow test is based on dividing the observations into groups or "bins" based on the predicted probabilities of the logistic regression model. The formula for the Hosmer-Lemeshow test statistic is as follows:
H = Σᵢ Σⱼ (Oᵢⱼ - Eᵢⱼ)² / Eᵢⱼ

56 | P a g e

© Department of Distance & Continuing Education, Campus of Open


Learning, School of Open Learning, University of Delhi

DSC 7: Introduction to Business Analytics


Where Oij is the observed number of outcomes (events or non-events) in the ith bin
and jth outcome category, Eij is the expected number of outcomes (events or non-
events) in the ith bin and jth outcome category, calculated as the sum of predicted
probabilities in the bin for the jth outcome category.
The test statistic H follows an approximate chi-square distribution with degrees of freedom equal to the number of bins minus 2. A smaller p-value obtained by comparing the test statistic to the chi-square distribution suggests a poorer fit of the model to the data, indicating a lack of goodness-of-fit.
By conducting the Hosmer-Lemeshow test, we can determine whether the logistic
regression model adequately fits the observed data. A non-significant result (p > 0.05)
indicates that the model fits well, suggesting that the predicted probabilities align
closely with the observed outcomes. Conversely, a significant result (p < 0.05)
suggests a lack of fit, indicating that the model may not accurately represent the data.
The Hosmer-Lemeshow test is a valuable tool in assessing the goodness-of-fit of
logistic regression models, allowing us to evaluate the model's performance in
predicting outcomes based on observed and predicted probabilities.
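The test is not built into statsmodels, so here is a minimal Python sketch (with a hypothetical helper function and simulated probabilities) that groups observations by predicted probability and compares observed with expected counts:

import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_bins=5):
    # Group observations by predicted probability and compare observed vs. expected counts
    df = pd.DataFrame({"y": y_true, "p": y_prob})
    df["bin"] = pd.qcut(df["p"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)
    obs_events = grouped["y"].sum()
    exp_events = grouped["p"].sum()
    obs_non = grouped["y"].count() - obs_events
    exp_non = grouped["p"].count() - exp_events
    h = ((obs_events - exp_events) ** 2 / exp_events).sum() + \
        ((obs_non - exp_non) ** 2 / exp_non).sum()
    dof = grouped.ngroups - 2
    return h, chi2.sf(h, dof)

# Hypothetical predicted probabilities and outcomes generated to be consistent with them
rng = np.random.default_rng(9)
p = rng.uniform(0.05, 0.95, 200)
y = (rng.uniform(0, 1, 200) < p).astype(int)
h_stat, p_value = hosmer_lemeshow(y, p, n_bins=5)
print(f"H = {h_stat:.3f}, p-value = {p_value:.3f}")  # a large p-value suggests adequate fit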
Let's consider the example again with the logistic regression model predicting the
probability of a disease (Y) based on a single predictor variable (X). We will divide the
predicted probabilities into three bins and calculate the observed and expected
frequencies in each bin.
Y X Predicted Probability
0 2.5 0.25
1 3.2 0.40
0 1.8 0.15
1 2.9 0.35
1 3.5 0.45
0 2.1 0.20
1 2.7 0.30
0 3.9 0.60
0 2.4 0.18
1 2.8 0.28


Step 1: Fit the Logistic Regression Model


By fitting the logistic regression model, we obtain the predicted probabilities for each
observation based on the predictor variable X.

Step 2: Divide the Predicted Probabilities into Bins


Let's divide the predicted probabilities into three bins: [0.1-0.3], [0.3-0.5], and [0.5-0.7].

Step 3: Calculate Observed and Expected Frequencies in Each Bin


Now, we calculate the observed and expected frequencies in each bin.
Bin: [0.1-0.3]
Total cases in bin: 3
Observed cases (Y = 1): 1
Expected cases: (0.25 + 0.20 + 0.28) * 3 = 1.23

Bin: [0.3-0.5]
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases: (0.40 + 0.35 + 0.30 + 0.28) * 4 = 3.52

Bin: [0.5-0.7]
Total cases in bin: 3
Observed cases (Y = 1): 2
Expected cases: (0.45 + 0.60) * 3 = 3.15

Step 4: Calculate the Hosmer-Lemeshow Test Statistic


We calculate the Hosmer-Lemeshow test statistic by summing the contributions from
each bin:

HL = ((O₁ - E₁ )² / E₁ ) + ((O₂ - E₂ )² / E₂ ) + ((O₃ - E₃ )² / E₃ )

HL = ((1 - 1.23)² / 1.23) + ((2 - 3.52)² / 3.52) + ((2 - 3.15)² / 3.15)


= (0.032) + (0.670) + (0.224)
= 0.926

58 | P a g e

© Department of Distance & Continuing Education, Campus of Open


Learning, School of Open Learning, University of Delhi

Step 5: Conduct the Hypothesis Test
We compare the Hosmer-Lemeshow test statistic (HL) to the chi-square distribution
with 1 degree of freedom (number of bins - 2).
By referring to the chi-square distribution table or using statistical software, let's
assume that the critical value for a significance level of 0.05 is 3.841.
Since the calculated test statistic (0.926) is less than the critical value (3.841), we fail
to reject the null hypothesis. This suggests that the logistic regression model fits the
data well.
Step 6: Interpret the Results
Based on the Hosmer-Lemeshow test, there is no evidence to suggest lack of fit for the
logistic regression model. The calculated test statistic (0.926) is below the critical value,
indicating good fit between the observed and expected frequencies in the different bins.
In summary, the Hosmer-Lemeshow test assesses the goodness of fit of a logistic
regression model by comparing the observed and expected frequencies in different
bins of predicted probabilities. In this example, the test result indicates that the model
fits the data well.

3.7 PSEUDO R SQUARE


Pseudo R-square is a measure used in regression analysis, particularly in logistic
regression, to assess the proportion of variance in the dependent variable explained
by the predictor variables. It is called "pseudo" because it is not directly comparable to
the R-squared used in linear regression.
There are various methods to calculate Pseudo R-squared, and one commonly used
method is Nagelkerke's R-squared. In terms of log-likelihoods, Nagelkerke's R-squared
can be written as:

R²_N = [1 - exp((2/n)(ℒnull - ℒmodel))] / [1 - exp((2/n)(ℒnull - ℒmax))]

where ℒmodel is the log-likelihood of the full model, ℒnull is the log-likelihood of the
null model (a model with only an intercept term), ℒmax is the log-likelihood of a
model with perfect prediction (for binary data, ℒmax = 0), and n is the number of
observations.
Nagelkerke's R-squared ranges from 0 to 1, with 0 indicating that the predictors have
no explanatory power, and 1 suggesting a perfect fit of the model. However, it is
important to note that Nagelkerke's R-squared is an adjusted measure and should not
be interpreted in the same way as R-squared in linear regression.
Pseudo R-squared provides an indication of how well the predictor
variables explain the variance in the dependent variable in logistic regression. While it
does not have a direct interpretation as the proportion of variance explained, it serves
as a relative measure to compare the goodness-of-fit of different models or assess the
improvement of a model compared to a null model.
One commonly used pseudo R-squared measure is the Cox and Snell R-squared. Let's
calculate the Cox and Snell R-squared using the given example of a logistic regression
model with two predictor variables.
X1    X2    Y
2.5   6     0
3.2   4     1
1.8   5     0
2.9   7     1
3.5   5     1
2.1   6     0
2.7   7     1
3.9   4     0
2.4   5     0
2.8   6     1

Step 1: Fit the Logistic Regression Model


By fitting the logistic regression model using the predictor variables X1 and X2, we
obtain the estimated coefficients for each predictor.
Step 2: Calculate the Null Log-Likelihood (LL0)
To calculate the null log-likelihood, we fit a null model with only an intercept term.
Since 5 of the 10 observations have Y = 1, the intercept-only model predicts p = 0.5 for
every case, so LL0 = 10 × ln(0.5) ≈ -6.931.

Step 3: Calculate the Full Log-Likelihood (LLF)
The full log-likelihood is the maximised log-likelihood of the fitted logistic regression
model; it always lies between LL0 and 0. For illustration, let's assume that the fitted
model attains LLF = -4.50.

Step 4: Calculate the Cox and Snell R-Squared


Using the formula R²_CS = 1 - exp[(2/n) × (LL0 - LLF)], we can calculate the Cox and
Snell R-squared.
Given:
LL0 = -6.931
LLF = -4.50
n = 10 (number of observations)

R²_CS = 1 - exp[(2/10) × (-6.931 - (-4.50))]

      = 1 - exp(-0.486)
      = 1 - 0.615
      = 0.385

Step 5: Interpret the Results


The calculated Cox and Snell R-squared is approximately 0.385. This indicates that the
logistic regression model offers a moderate improvement over the intercept-only model
in explaining the binary outcome. (Note that the Cox and Snell measure cannot reach 1;
with these data its maximum possible value is 1 - exp(2 × LL0 / n) = 0.75, which is why
Nagelkerke's rescaled version is often reported alongside it.)
In summary, based on the calculations, the Cox and Snell R-squared for the logistic
regression model with X1 and X2 as predictors is approximately 0.385, suggesting that a
moderate amount of the variation in the outcome is accounted for by the model.
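As a hedged illustration of where these log-likelihoods come from in practice, the sketch below fits the model with statsmodels and computes the Cox and Snell and Nagelkerke measures from the fitted (llf) and null (llnull) log-likelihoods; with such a tiny dataset the exact numbers are not meaningful, and the data frame simply reproduces the table above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "X1": [2.5, 3.2, 1.8, 2.9, 3.5, 2.1, 2.7, 3.9, 2.4, 2.8],
    "X2": [6, 4, 5, 7, 5, 6, 7, 4, 5, 6],
    "Y":  [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})
X   = sm.add_constant(df[["X1", "X2"]])       # add the intercept term
fit = sm.Logit(df["Y"], X).fit(disp=0)        # fitted logistic regression

n    = len(df)
ll_f = fit.llf                                # log-likelihood of the full model (LLF)
ll_0 = fit.llnull                             # log-likelihood of the null model (LL0)

r2_cs = 1 - np.exp((2 / n) * (ll_0 - ll_f))   # Cox and Snell R-squared
r2_n  = r2_cs / (1 - np.exp((2 / n) * ll_0))  # Nagelkerke's rescaled version
print(round(r2_cs, 3), round(r2_n, 3))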

3.8 CLASSIFICATION TABLE


To understand the classification table, let's consider a binary classification problem of
detecting whether an input cell is cancerous or not. Consider a logistic regression
model X implemented for the given classification problem on a dataset of 100 random
cells, in which 10 cells are cancerous and 90 cells are non-cancerous. Suppose the
model X
outputs 20 input cells as cancerous and the rest 80 as non-cancerous cells. Out of the
total predicted cancerous cells, only 5 input cells are actually cancerous as per the
ground truth, while the remaining 15 cells are non-cancerous. On the other hand, out of
the total predicted non-cancerous cells, 75 cells are also non-cancerous in the ground
truth but 5 cells are cancerous. Here, the cancerous cell is considered as the positive
class while the non-cancerous cell is considered as the negative class for the given
classification problem. Now, we define the four primary building blocks of the various
evaluation metrics of classification models as follows:
True Positive (TP): The number of input cells for which the classification model X
correctly predicts that they are cancerous cells is referred to as True Positive. For
example, for the model X, TP = 5.
True Negative (TN): The number of input cells for which the classification model X
correctly predicts that they are non-cancerous cells is referred to as True Negative. For
example, for the model X, TN = 75.
False Positive (FP): The number of input cells for which the classification model X
incorrectly predicts that they are cancerous cells is referred to as False Positive. For
example, for the model X, FP = 15.
False Negative (FN): The number of input cells for which the classification model X
incorrectly predicts that they are non-cancerous cells is referred to as False Negative.
For example, for the model X, FN = 5.

                                Actual
                                Cancerous        Non-Cancerous
Predicted    Cancerous          TP = 5           FP = 15
             Non-Cancerous      FN = 5           TN = 75
Fig 3.2: Classification Matrix

3.8.1 Sensitivity
Sensitivity, also referred to as True Positive Rate or Recall, is calculated as the ratio of
correctly predicted cancerous cells to the total number of cancerous cells in the ground
truth. To compute sensitivity, you can use the following formula:

Sensitivity = TP / (TP + FN)

For the model X, Sensitivity = 5 / (5 + 5) = 0.5.

3.8.2 Specificity
Specificity is defined as the ratio of the number of input cells that are correctly predicted
as non-cancerous to the total number of non-cancerous cells in the ground truth.
Specificity is also known as True Negative Rate. To compute specificity, we can use
the following formula:

Specificity = TN / (TN + FP)

For the model X, Specificity = 75 / (75 + 15) ≈ 0.83.
3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total number of
cells. To compute accuracy, you can use the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For the model X, Accuracy = (5 + 75) / 100 = 0.80.
3.8.4 Precision
Precision is calculated as the ratio of the correctly predicted cancerous cells to the total
number of cells predicted as cancerous by the model. To compute precision, you can
use the following formula:

Precision = TP / (TP + FP)

For the model X, Precision = 5 / (5 + 15) = 0.25.

3.8.5 F1-score
The F1-score is calculated as the harmonic mean of Precision and Recall. To compute
the F1-score, you can use the following formula:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

For the model X, F1-score = 2 × (0.25 × 0.5) / (0.25 + 0.5) ≈ 0.33.
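The following short Python check, an illustrative sketch rather than part of the lesson's derivation, verifies all five metrics directly from the confusion-matrix counts of model X.

TP, FP, FN, TN = 5, 15, 5, 75                  # counts from Fig 3.2

sensitivity = TP / (TP + FN)                   # recall / true positive rate
specificity = TN / (TN + FP)                   # true negative rate
accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, round(specificity, 3), accuracy, precision, round(f1, 3))
# 0.5 0.833 0.8 0.25 0.333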

IN-TEXT QUESTIONS

3. For the model X results on the given dataset of 100 cells, the precision of the model is
   a) 0        b) 0.25        c) 0.5        d) 1
4. For the model X results on the given dataset of 100 cells, the recall of the model is
   a) 0        b) 0.25        c) 0.5        d) 1

3.9 GINI COEFFICIENT


The Gini coefficient, also referred to as the Gini index, is a metric used to assess
inequality. For a classification model, its value lies between 0 and 1, and the
performance of the model improves as the Gini coefficient increases. The Gini
coefficient can be computed from the AUC of the ROC curve using the formula:

Gini coefficient = (2 × AUC) - 1
3.10 ROC
The performance of a binary classification model, particularly in logistic regression or
other machine learning techniques, is assessed using a graphical representation called
the Receiver Operating Characteristic (ROC) curve. It demonstrates the trade-off
between the true positive rate (sensitivity) and the false positive rate (1 - specificity)
across various classification thresholds.

The ROC curve is obtained by plotting the true positive rate (TPR) against the false
positive rate (FPR) at various classification thresholds. The formulas for TPR and FPR
are as follows:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

We may evaluate the model's capacity to distinguish between positive and negative
examples at various classification thresholds using the ROC curve. A perfect classifier,
with a TPR of 1 and an FPR of 0, would have a ROC curve that reaches the top left
corner of the plot. The closer the ROC curve lies to the top left corner, the greater the
model's discriminatory power.
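As an illustrative sketch (the labels and scores below simply reuse the small example from the Hosmer-Lemeshow discussion above), scikit-learn's roc_curve returns the FPR and TPR at each threshold, which is exactly what is plotted as the ROC curve.

from sklearn.metrics import roc_curve

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]                      # ground-truth labels
y_score = [0.25, 0.40, 0.15, 0.35, 0.45, 0.20, 0.30, 0.60,    # predicted
           0.18, 0.28]                                         # probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")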

3.11 AUC
The Area Under the Curve (AUC) is a statistic used to assess the effectiveness of a
binary classification model from its Receiver Operating Characteristic (ROC) curve. The
AUC represents the probability that a randomly selected positive instance will receive a
higher predicted probability than a randomly selected negative instance.
The AUC is calculated by integrating the ROC curve. However, it is important to note
that the AUC does not have a specific formula since it involves calculating the area
under a curve. Instead, it is commonly calculated using numerical methods or software.
The AUC value ranges between 0 and 1. A model with an AUC of 0.5 indicates a
random classifier, whose predictive power is no better than chance. An AUC value
closer to 1 indicates a more accurate classifier that is better able to distinguish between
positive and negative instances. Conversely, an AUC value closer to 0 suggests poor
performance, with the model performing worse than random guessing.
In binary classification tasks, the AUC is a commonly utilized statistic since it offers a
succinct assessment of the model's performance across different classification
thresholds. It is especially useful when the dataset is imbalanced, i.e., when the
numbers of positive and negative instances differ significantly.
In conclusion, the AUC measure evaluates a binary classification model's total
discriminatory power by delivering a single value that encapsulates the model's
capacity to rank cases properly. Better classification performance is shown by higher
AUC values, whilst worse performance is indicated by lower values.
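A minimal sketch of computing the AUC numerically with scikit-learn, and the Gini coefficient derived from it via Gini = (2 × AUC) - 1, again reusing the small illustrative data set from above.

from sklearn.metrics import roc_auc_score

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.25, 0.40, 0.15, 0.35, 0.45, 0.20, 0.30, 0.60, 0.18, 0.28]

auc  = roc_auc_score(y_true, y_score)          # area under the ROC curve
gini = 2 * auc - 1                             # Gini coefficient from the AUC
print(auc, gini)                               # 0.8 and 0.6 for these data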

IN-TEXT QUESTIONS
5. Which of the following illustrates trade-off between True Positive Rate
and False Positive Rate?
a) Gini Coefficient b) F1-score
c) ROC d) AUC
6. Which of the following values of AUC indicates a more accurate classifier?
   a) 0.01        b) 0.25        c) 0.5        d) 0.99
7. What is the range of values for the Gini coefficient?
a) -1 to 1
b) 0 to 1
c) 0 to infinity
d) -infinity to infinity
8. How can the Gini coefficient be computed?
a) By calculating the area under the precision-recall curve
b) By calculating the area under the receiver operating characteristic
(ROC) curve
c) By calculating the ratio of true positives to true negatives.
d) By calculating the ratio of false positives to false negatives.

3.12 SUMMARY
Logistic regression is used to solve the classification problems by producing the
probabilistic values within the range of 0 and 1. Logistic regression uses Logistic
function i.e. sigmoid function. Multinomial Regression is the generalization of logistic
regression to multiclass problems. Omnibus test is a statistical test utilized to test the
significance of several model parameters at once. Wald test is a statistical test used
to assess the significance of individual predictor variables in a regression model.
Hosmer-Lemeshow test is a statistical test employed to assess the adequacy of a
logistic regression model. Pseudo R-square is a measure to assess the proportion of
variance in the dependent variable explained by the predictor variables. There are
various classification metrics namely Sensitivity, Specificity, Accuracy, Precision, F-
score, Gini Coefficient, ROC and AUC, which are utilized to evaluate the performance
of a classifier model.

3.13 GLOSSARY


Terms                   Definition
Omnibus test            A statistical test used to test the significance of multiple model parameters simultaneously.
Wald test               A statistical test used to evaluate the significance of each individual predictor variable within a regression model.
Hosmer-Lemeshow test    A statistical test utilized to assess the adequacy of fit for a logistic regression model.
Pseudo R-square         A metric used to evaluate the portion of variability in the dependent variable that can be accounted for by the predictor variables.
F1-score                The harmonic mean of Precision and Recall.
ROC curve               Demonstrates the balance between the true positive rate and the false positive rate across various classification thresholds.
Gini Coefficient        A metric used to measure inequality.

3.14 ANSWERS TO INTEXT QUESTIONS


1. (d) The chi-square distribution
2. (c) The overall significance of predictor variables collectively
3. (b) 0.25
4. (c) 0.5
5. (c) ROC
6. (d) 0.99
7. (b) 0 to 1
8. (b) By calculating the area under the receiver operating characteristic (ROC) curve

3.15 SELF-ASSESSMENT QUESTIONS


1. Differentiate between Linear Regression and Logistic Regression.
2. Differentiate between Sensitivity and Specificity.
3. Define True Positive Rate and False Positive Rate.


4. Consider a logistic regression model X that is applied to the problem of classifying
whether a statement is hateful or not. Consider a dataset D of 100 statements containing
an equal number of hateful and non-hateful statements. Suppose the model X classifies
all the input statements as hateful. Comment on the precision and recall values of the
model X.
5. Define F-score and Gini Index.
6. Explain the use of ROC curve and AUC of a ROC curve.

3.16 REFERENCES
LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395-2399.
Wright, R. E. (1995). Logistic regression.
Chatterjee, S., & Simonoff, J. S. (2013). Handbook of regression analysis. John Wiley & Sons.
Kleinbaum, D. G., Dietz, K., Gail, M., & Klein, M. (2002). Logistic regression. New York: Springer-Verlag.
DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 956-968.
Osborne, J. W. (2014). Best practices in logistic regression. Sage Publications.
Bonaccorso, G. (2017). Machine learning algorithms. Packt Publishing Ltd.

3.17 SUGGESTED READINGS


Huang, F. L. (2022). Alternatives to logistic regression models in experimental
studies. The Journal of Experimental Education, 90(1), 213-228.
https://towardsdatascience.com/logistic-regression-in-real-life-building-a-daily-productivity-classification-model-a0fc2c70584e

LESSON 4


DECISION TREE AND CLUSTERING
Dr. Sanjay Kumar
Dept. of Computer Science and Engineering,
Delhi Technological University,
Email-Id: [email protected]

STRUCTURE

4.1 Learning Objectives


4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.4 Impurity Measures
4.5 Ensemble Methods
4.6 Clustering
4.7 Summary
4.8 Glossary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings

4.1 LEARNING OBJECTIVES

At the end of the chapter, the students will be able to:


● Exploring the concept of decision tree and its components
● Evaluating attribute selection measures
● Understanding ensemble methods
● Comprehending the random forest algorithm
● Exploring the concept of clustering and its types
● Comprehending distance and similarity measures
● Evaluating cluster quality
4.2 INTRODUCTION


Decision Tree is a popular machine learning approach for classification and regression
tasks. Its structure is similar to a flowchart, where internal nodes represent features or
attributes, branches depict decision rules, and leaf nodes signify outcomes or
predicted values. The data are divided recursively according to feature values by the
decision tree algorithm to create the tree. It chooses the best feature for data
partitioning at each stage by analysing parameters such as information gain or Gini
impurity. The goal is to divide the data into homogeneous subsets within each branch
to increase the tree's capacity for prediction.
Fig 4.1: Decision Tree for classification scenario of a mammal
Once the tree has been constructed, it can be used to generate predictions on fresh,
unseen data by choosing a path through the tree based on feature values. Figure 4.1
shows a decision tree that helps classify an animal based
on a series of questions. The flowchart begins with the question, "Is it a mammal?" If
the answer is "Yes," we follow the branch on the left. The next question asks, "Does it
have spots?" If the answer is "Yes," we conclude that it is a leopard. If the answer is
"No," we determine it is a cheetah.



If the answer to the initial question, "Is it a mammal?" is "No," we follow the branch on
the right, which asks, "Is it a bird?" If the answer is "Yes," we classify it as a parrot. If
the answer is "No," we classify it as a fish.
Thus, the decision tree demonstrates a classification scenario where we aim to determine
the type of animal based on specific attributes. By following the flowchart, we can
systematically navigate through the questions to reach a final classification.
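A minimal Python sketch of the same flowchart as nested if/else rules; the boolean attribute names (is_mammal, has_spots, is_bird) are illustrative and not taken from the figure.

def classify_animal(is_mammal: bool, has_spots: bool, is_bird: bool) -> str:
    """Mirror the decision path of Fig 4.1."""
    if is_mammal:
        return "Leopard" if has_spots else "Cheetah"
    return "Parrot" if is_bird else "Fish"

print(classify_animal(is_mammal=True, has_spots=False, is_bird=False))   # Cheetah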
4.3 CLASSIFICATION AND REGRESSION TREE
A popular machine learning approach for classification and regression tasks is called
the Classification and Regression Tree (CART). It is a decision tree-based model that
divides the data into subsets according to the values of the input features and then
predicts the target variable using the tree structure.
CART is especially popular because of how easy it is to interpret. Each internal node
represents a test on a specific feature, and each leaf node represents a class label or
a predicted value, forming a binary tree structure. The method divides the data
iteratively according to the features, with the goal of producing homogeneous subsets
with respect to the target variable.
In classification tasks, CART measures the impurity or disorder within each node using
a criterion like Gini impurity or entropy. Selecting the best feature and split point at
each node aims to reduce this impurity. The outcome is a tree that correctly
categorises new instances according to their feature values. In regression problems,
CART measures the quality of each split using a metric called mean squared error
(MSE). In order to build a tree that can forecast the continuous target variable, it
searches for the feature and split point that minimises the MSE.
Example: Suppose we have a dataset of patients and we want to predict whether
they have heart disease based on their age and cholesterol level. The dataset
contains the following information:
Age    Cholesterol    Disease
45     180            Yes
50     210            No
55     190            Yes
60     220            No
65     230            Yes
70     200            No

Using the CART algorithm, we can build a decision tree to make predictions. The
decision tree may look like this:
Fig 4.2: Predicting Disease based on Age and Cholesterol Levels
The decision tree in this illustration begins at the node at the top, which evaluates the
condition "Age ≤ 55". If the patient's age is less than or equal to 55, we proceed to the
left branch and examine the condition "Cholesterol ≤ 200". The prediction is "Yes
Disease" if the patient's cholesterol level is less than or equal to 200, and "No Disease"
if the cholesterol level is more than 200, which matches the younger patients in the
table above. However, if the patient is older than 55, we switch to the right branch,
where "No Disease" is predicted regardless of the cholesterol level.

4.3 CHAID

4.3.1 Chi-Square Automatic Interaction Detection


CHAID (Chi-Square Automatic Interaction Detection) is a statistical method used to
analyze the interaction between different categories of variables. It is particularly useful
when working with data that involves categorical variables, which represent different groups
or categories. The CHAID algorithm aims to identify meaningful patterns by dividing
the data into groups based on various categories of variables. This is achieved through
the application of statistical tests, particularly the chi-square test. The chi-square test
helps determine if there is a significant relationship between the categories of a
variable and the outcome of interest.
The algorithm divides the data into smaller groups and repeats this procedure for each
of these smaller groups in order to find other categories that might be significantly
related to the outcome. The leaves of the tree indicate the expected outcomes, and each
branch represents a distinct category.
At each candidate split, CHAID calculates the Chi-Square statistic (χ²):

χ² = Σ (O - E)² / E     (1.1)

where O represents the observed frequencies in each category or cell of a contingency
table, and E represents the expected frequencies under the assumption of independence
between variables.
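As an illustrative sketch (the contingency table below is hypothetical, not taken from the lesson), SciPy's chi2_contingency computes the statistic of equation (1.1) together with its p-value and the expected frequencies under independence.

from scipy.stats import chi2_contingency

# rows: Age Group (Young, Middle-aged); columns: Satisfied, Not Satisfied
observed = [[30, 20],
            [10, 40]]

# correction=False disables the Yates adjustment, so the result matches
# the plain Pearson statistic of equation (1.1)
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2_stat, 3), round(p_value, 4), dof)
print(expected)                                # expected frequencies E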
Fig 4.3: Determining Satisfaction Levels of customer

This flowchart (Fig 4.3) shows how CHAID gradually divides the dataset into subsets
according to the most important predictor variables, resulting in a hierarchical
structure. It enables us to clearly visualise the links between the variables and their
effects on the target variable (Customer Satisfaction).
Age Group is the first variable on the flowchart, and it has two branches: "Young" and
"Middle-aged." We further examine the Gender variable within the "Young" branch,
resulting in branches for "Male" and "Female." The Purchase Frequency variable is
next examined for each gender subgroup, yielding three branches: "Low," "Medium,"
and "High." We arrive at the leaf nodes, which represent the customer satisfaction
outcome and are either "Satisfied" or "Not Satisfied."
4.3.2 Bonferroni Correction
The Bonferroni correction is a statistical method used to adjust the significance levels
(p values) when conducting multiple hypothesis tests at the same time. It helps
control the overall chance of falsely claiming a significant result by making the criteria
for significance more strict.
To apply the Bonferroni correction, we divide the desired significance level (usually
denoted as α) by the number of tests being performed (denoted as m). This adjusted
significance level, denoted as α' or α_B, becomes the new threshold for determining
statistical significance.
Mathematically, the Bonferroni correction can be represented as:

α' = α / m     (1.2)

For example, suppose we are conducting 10 hypothesis tests, and we want a
significance level of 0.05 (α = 0.05). By applying the Bonferroni correction, we divide
α by 10, resulting in an adjusted significance level of:

α' = 0.05 / 10 = 0.005     (1.3)

Now, when we assess the p-values obtained from each
test, we compare them against the adjusted significance level (α') instead of the
original α. If a p-value is less than or equal to α', we consider the result to be statistically
significant.
Let's consider an example. Suppose we have conducted 10 independent hypothesis
tests, and we obtain p-values of 0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05,
and 0.02. Using the Bonferroni correction with α of 0.05 and m = 10, the adjusted
significance level becomes α' = 0.05 / 10 = 0.005.
Comparing each p-value to the adjusted significance level (α' = 0.005), we find:

- Hypothesis test 1: p-value (0.02) > α' (0.005) - Not statistically significant
- Hypothesis test 2: p-value (0.07) > α' (0.005) - Not statistically significant
- Hypothesis test 3: p-value (0.01) > α' (0.005) - Not statistically significant
- Hypothesis test 4: p-value (0.03) > α' (0.005) - Not statistically significant
- Hypothesis test 5: p-value (0.04) > α' (0.005) - Not statistically significant
- Hypothesis test 6: p-value (0.09) > α' (0.005) - Not statistically significant
- Hypothesis test 7: p-value (0.06) > α' (0.005) - Not statistically significant
- Hypothesis test 8: p-value (0.08) > α' (0.005) - Not statistically significant
- Hypothesis test 9: p-value (0.05) > α' (0.005) - Not statistically significant
- Hypothesis test 10: p-value (0.02) > α' (0.005) - Not statistically significant

Based on the Bonferroni correction, none of the ten tests is statistically significant,
because every p-value exceeds the adjusted significance level of 0.005, even though
several of them (for example, 0.01 and 0.02) would have been significant at the
unadjusted level of 0.05. This illustrates how the Bonferroni correction makes the
criterion for significance stricter in order to control the overall chance of false positives.
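A minimal sketch of the same correction in Python, both by direct comparison with α / m and via statsmodels' multipletests helper; the variable names are illustrative.

from statsmodels.stats.multitest import multipletests

p_values = [0.02, 0.07, 0.01, 0.03, 0.04, 0.09, 0.06, 0.08, 0.05, 0.02]
alpha    = 0.05
alpha_b  = alpha / len(p_values)               # adjusted threshold: 0.05 / 10 = 0.005

manual = [p <= alpha_b for p in p_values]      # direct comparison with alpha'
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="bonferroni")

print(manual)                                  # all False: no test survives the correction
print(list(reject))                            # statsmodels agrees with the manual check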

4.4 IMPURITY MEASURES

4.4.1 Gini Impurity Index


Gini impurity index is a measure used in decision tree algorithms to evaluate the
impurity or disorder within a set of class labels. It quantifies the likelihood of a randomly
selected element being misclassified based on the distribution of class labels in a given
node. The Gini impurity index ranges from 0 to 1, where 0 represents a perfectly pure
node with all elements belonging to the same class, and 1 represents a completely
impure node with an equal distribution of elements across different classes.
To calculate the Gini impurity index, we first compute the probability of each class label
within the node by dividing the count of elements belonging to that class by the total
number of elements. Then, we square each probability and sum up the squared
probabilities for all classes. Finally, we subtract the sum from 1 to obtain the Gini
impurity index.
Mathematically, the formula for the Gini impurity index is as follows:

Gini = 1 - Σᵢ pᵢ²     (1.4)

where pᵢ is the proportion of elements in the node that belong to class i.
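A minimal Python sketch of equation (1.4), computing the Gini impurity of a node directly from its class labels; the example labels are illustrative.

from collections import Counter

def gini_impurity(labels):
    """Gini = 1 minus the sum of squared class proportions (equation 1.4)."""
    counts = Counter(labels)
    total  = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

print(gini_impurity(["Yes", "Yes", "Yes"]))        # 0.0 -> perfectly pure node
print(gini_impurity(["Yes", "No", "Yes", "No"]))   # 0.5 -> evenly mixed two-class node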