BBABB602 Study Material and Syllabus

The document outlines the course objectives and outcomes for Advanced Data Analytics, focusing on the role of data analytics in business decision-making and the principles of information analysis. It covers various statistical methods such as linear regression, logistic regression, factor analysis, and cluster analysis, along with their applications and assumptions. Additionally, it emphasizes the importance of ethical, social, and security considerations in data analytics systems.

Uploaded by

vsetthupati

Advanced Data Analytics

STUDY MATERIAL
COURSE OBJECTIVES:

1. To describe the role of data analytics and decision support systems in business and record the current
issues with those of the firm to solve business problems.
2. To introduce the fundamental principles of computer-based information analysis and design and
develop an understanding of the principles and techniques used.
3. To enable students to understand the various knowledge representation methods and different expert
system structures as strategic weapons to counter the threats to business and make business more
competitive.
4. To enable students to use data analysis to assess the impact of technology on electronic
commerce and electronic business, and to understand the specific threats and vulnerabilities of
computer systems.

COURSE OUTCOMES:

CO1: The students will be able to relate the basic concepts and technologies used in the field of data
analytics.
CO2: The students will be able to compare the processes of developing and implementing data analytics
algorithms.
CO3: The students will be able to examine the role of the ethical, social, and security issues of data
analytics systems.
CO4: The students will be able to investigate and translate the role of data analytics in organizations,
and the strategic management processes, with the implications for the management.

Course Content:

Module 1
• Simple Linear Regression: Introduction – Overview – Importance – Least Squares Method – Normal Equations – Calculation of Regression Coefficients – Properties of the Regression Line – Uses of Regression
• Multiple Linear Regression: Overview – Importance – Least Squares Method – Normal Equations – Calculation of Regression Coefficients – Properties – Testing the Relevance of an Additional Explanatory Variable

Module 2
• Basic concept of Logistic Regression – Assessing the Model – log-likelihood statistic – deviance statistic – R and R² – Wald statistic – odds ratio – Sources of Bias and Common Problems – Interpreting Binary Logistic Regression

Module 3
• Basic concept of Factor Analysis, Factor Analysis Model, Statistics Associated with Factor Analysis, Factor Analysis Process – Formulate the Problem – Construct the Correlation Matrix – Determine the Method of Factor Analysis – Determine the Number of Factors – Factor Extraction (eigenvalues and scree plot) – Factor Rotation – Interpret Factors – Calculate Factor Scores – Determine Model Fit

Module 4
• Basic concept of Cluster Analysis, Statistics Associated with Cluster Analysis, Cluster Analysis Process – Formulate the Problem – Select a Distance Measure – Select a Clustering Procedure – Decide on the Number of Clusters – Interpret and Profile Clusters – Assess the Reliability and Validity

Module 1
Topic: Linear Regression Analysis
Sub-topics:
• Simple Linear Regression: Introduction – Overview – Importance – Least Squares Method – Normal Equations – Calculation of Regression Coefficients – Properties of the Regression Line – Uses of Regression
• Multiple Linear Regression: Overview – Importance – Least Squares Method – Normal Equations – Calculation of Regression Coefficients – Properties – Testing the Relevance of an Additional Explanatory Variable
Mapping with Industry and International Standard:
International Academia: https://ocw.mit.edu/courses/18-s096-topics-in-mathematics-with-applications-in-finance-fall-2013/resources/lecture-6-regression-analysis/
Industry Mapping: Creating a predictive model
Lecture Hours: 12
Corresponding Assignment: Simple Linear Regression: Introduction – Overview – Importance – Least Squares Method – Normal Equations – Calculation of Regression Coefficients – Properties of the Regression Line – Uses of Regression; Multiple Linear Regression: Overview – Importance – Least Squares Method – Normal Equations

Module 2
Topic: Binary Logistic Regression
Sub-topics:
• Basic concept of Logistic Regression – Assessing the Model – log-likelihood statistic – deviance statistic – R and R² – Wald statistic – odds ratio – Sources of Bias and Common Problems – Interpreting Binary Logistic Regression
Mapping with Industry and International Standard:
International Academia: https://ocw.mit.edu/courses/15-071-the-analytics-edge-spring-2017/pages/logistic-regression/
Industry Mapping: Predictive model creation
Lecture Hours: 12
Corresponding Assignment: Basic concept of Logistic Regression – Assessing the Model – log-likelihood statistic – deviance statistic – R and R²

Module 3
Topic: Factor Analysis
Sub-topics:
• Basic concept of Factor Analysis, Factor Analysis Model, Statistics Associated with Factor Analysis, Factor Analysis Process – Formulate the Problem – Construct the Correlation Matrix – Determine the Method of Factor Analysis – Determine the Number of Factors – Factor Extraction (eigenvalues and scree plot) – Factor Rotation – Interpret Factors – Calculate Factor Scores – Determine Model Fit
Mapping with Industry and International Standard:
International Academia: https://ocw.mit.edu/courses/18-s096-topics-in-mathematics-with-applications-in-finance-fall-2013/resources/lecture-15-factor-modeling/
Industry Mapping: Predictive model creation
Lecture Hours: 12
Corresponding Assignment: Basic concept of Factor Analysis, Factor Analysis Model, Statistics Associated with Factor Analysis, Factor Analysis Process – Formulate the Problem – Construct the Correlation Matrix – Determine the Method of Factor Analysis

Module 4
Topic: Cluster Analysis
Sub-topics:
• Basic concept of Cluster Analysis, Statistics Associated with Cluster Analysis, Cluster Analysis Process – Formulate the Problem – Select a Distance Measure – Select a Clustering Procedure – Decide on the Number of Clusters – Interpret and Profile Clusters – Assess the Reliability and Validity
Mapping with Industry and International Standard:
International Academia: https://ocw.mit.edu/courses/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/resources/lecture-12-clustering/
Industry Mapping: Predictive model creation
Lecture Hours: 12
Corresponding Assignment: Basic concept of Cluster Analysis, Statistics Associated with Cluster Analysis, Cluster Analysis Process

Learning Resources:

Text Book:

References:

CO-PO Mapping:

CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8

BBABA602CO1 2 3 1 2

BBABA602CO2 2 3 2 2

BBABA602CO3 1 1 1 2

BBABA602CO4 1 1 3 3

1=Low(Slight) 2=Moderate(Medium) 3=Substantial (High)

MODULE – 1

Multiple Linear Regression

Exercise

1. Derive the normal equations of multiple linear regression using the matrix method. (BL: 6, Create)

2. (BL: 6, Create)

3. Find an expression for R-squared. (BL: 5, Evaluate)

4. Prove that the OLS estimator is unbiased. (BL: 5, Evaluate)

5. Prove that the OLS estimator is a minimum variance estimator. (BL: 5, Evaluate)

6. Explain the assumptions of the multiple linear regression model using matrix notation.

7. How can you test the overall significance of a regression model? (BL: 5, Evaluate)

MODULE – 2

ADVANCED DATA ANALYTICS

In multiple regression, in which there are several predictors, a similar equation is derived in which
each predictor has its own coefficient. As such, Y is predicted from a combination of each predictor
variable multiplied by its respective regression coefficient.
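
As a minimal sketch of the statement above, the prediction for one observation can be computed as an intercept plus the sum of each predictor multiplied by its coefficient. The function name `predict` and the numeric values are hypothetical and chosen only for illustration; they do not come from the course material.

```python
def predict(intercept, coefficients, predictors):
    """Return b0 + b1*x1 + ... + bn*xn for a single observation."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical model with two predictors (illustrative values only).
y_hat = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```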

USES OF LOGISTIC REGRESSION

Logistic regression is commonly used for prediction and classification problems. Some of these
use cases include:

• Fraud detection: Logistic regression models can help teams identify data anomalies,
which are predictive of fraud. Certain behaviors or characteristics may have a higher
association with fraudulent activities, which is particularly helpful to banking and other
financial institutions in protecting their clients. SaaS-based companies have also started to
adopt these practices to eliminate fake user accounts from their datasets when conducting
data analysis around business performance.

• Disease prediction: In medicine, this analytics approach can be used to predict the
likelihood of disease or illness for a given population. Healthcare organizations can set up
preventative care for individuals that show higher propensity for specific illnesses.

• Churn prediction: Specific behaviors may be indicative of churn in different functions of
an organization. For example, human resources and management teams may want to know
if there are high performers within the company who are at risk of leaving the organization;
this type of insight can prompt conversations to understand problem areas within the
company, such as culture or compensation. Alternatively, the sales organization may want
to learn which of their clients are at risk of taking their business elsewhere. This can prompt
teams to set up a retention strategy to avoid lost revenue.

ASSUMPTIONS OF LOGISTIC REGRESSION

Logistic regression does not make many of the key assumptions of linear regression and general
linear models that are based on ordinary least squares algorithms – particularly regarding linearity,
normality, homoscedasticity, and measurement level.

Firstly, it does not need a linear relationship between the dependent and independent variables.
Logistic regression can handle many kinds of relationships, because it applies a non-linear log
transformation to the predicted odds. Secondly, the independent variables do not need to be
multivariate normal, although multivariate normality yields a more stable solution. Also, the error
terms (the residuals) do not need to be normally distributed. Thirdly,
homoscedasticity is not needed: logistic regression does not need the variances to be equal
for each level of the independent variables. Lastly, it can handle ordinal and nominal data as
independent variables; the independent variables do not need to be metric (interval or ratio
scaled). However, some other assumptions still apply.
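
The log transformation of the odds mentioned above can be sketched in a few lines. The helper names `log_odds` and `probability` are hypothetical; the sketch only shows that the logit maps a probability to the log odds and that the inverse logit (sigmoid) maps it back.

```python
import math


def log_odds(p):
    """Logit: map a probability in (0, 1) to the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def probability(z):
    """Inverse logit (sigmoid): map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Equal steps in probability are not equal steps in log odds (non-linear mapping).
for p in (0.1, 0.5, 0.9):
    z = log_odds(p)
    print(p, round(z, 3), round(probability(z), 3))
```

A probability of 0.5 corresponds to odds of 1 and therefore to log odds of 0; probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds.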

Firstly, binary logistic regression requires the dependent variable to be binary, and ordinal logistic
regression requires the dependent variable to be ordinal. Reducing an ordinal or even metric
variable to a dichotomous level loses a lot of information, which makes binary logistic regression
inferior to ordinal logistic regression in these cases.

Secondly, since logistic regression assumes that P(Y=1) is the probability of the event occurring,
it is necessary that the dependent variable is coded accordingly. That is, for a binary regression,
the factor level 1 of the dependent variable should represent the desired outcome.

Thirdly, the model should be fitted correctly: neither overfitting nor underfitting should occur.
That is, only meaningful variables should be included, and all meaningful variables should
be included. A good approach to ensure this is to use a stepwise method to estimate the logistic
regression.

Fourthly, the error terms need to be independent. Logistic regression requires each observation to
be independent; that is, the data points should not come from a dependent-samples design, e.g.,
before-after measurements or matched pairings. Also, the model should have little or no
multicollinearity; that is, the independent variables should not be strongly correlated with each other.
However, there is the option to include interaction effects of categorical variables in the analysis
and the model. If multicollinearity is present, centering the variables (i.e., subtracting the mean of
each variable) might resolve the issue. If this does not lower the multicollinearity, a factor analysis
with orthogonally rotated factors should be done before the logistic regression is estimated.
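
The effect of centering can be illustrated with a small sketch. It assumes a single made-up predictor and a term built from it (here its square, standing in for an interaction or polynomial term); the helpers `pearson` and `center` are hypothetical names. Centering helps mainly with this kind of collinearity and does not, in general, remove correlation between two distinct predictors.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Made-up predictor values 1..10 (illustrative only).
x = [float(v) for v in range(1, 11)]
r_raw = pearson(x, [v * v for v in x])           # x versus x squared
x_c = center(x)
r_centered = pearson(x_c, [v * v for v in x_c])  # centered x versus its square

print(round(r_raw, 3), round(abs(r_centered), 3))
```

For these symmetric values the correlation between the predictor and its square is very high before centering and essentially zero afterwards.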

Fifthly, logistic regression assumes linearity between the independent variables and the log odds. Whilst it does
not require the dependent and independent variables to be related linearly, it requires that the
independent variables are linearly related to the log odds. Otherwise, the test underestimates the
strength of the relationship and dismisses it too easily; that is, it reports the relationship as not
significant (fails to reject the null hypothesis) where it should be significant. One solution to this
problem is to categorize the independent variables, that is, to transform metric variables to ordinal level
and then include them in the model. Another approach is to use discriminant analysis, if
the assumptions of homoscedasticity, multivariate normality, and absence of multicollinearity are
met.

Lastly, logistic regression requires quite large sample sizes, because maximum likelihood (ML)
estimates are less powerful than ordinary least squares (OLS) estimates (e.g., simple linear
regression, multiple linear regression). As a rule of thumb, OLS needs at least 5 cases per
independent variable, whereas ML needs at least 10 cases per independent variable; some
statisticians recommend at least 30 cases for each parameter to be estimated.
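
The rules of thumb above reduce to simple arithmetic, sketched below. The function name `minimum_cases` and the choice of four predictors are hypothetical; the per-variable ratios (5, 10 and 30) are the ones quoted in the text, and the stricter rule is applied per independent variable here for simplicity.

```python
def minimum_cases(n_predictors, cases_per_predictor=10):
    """Rule-of-thumb minimum sample size for a given number of predictors."""
    if n_predictors < 1:
        raise ValueError("need at least one predictor")
    return n_predictors * cases_per_predictor


# Example: a model with 4 independent variables (illustrative only).
print(minimum_cases(4, 5))   # OLS rule of thumb: 20
print(minimum_cases(4, 10))  # ML (logistic regression) rule of thumb: 40
print(minimum_cases(4, 30))  # stricter recommendation: 120
```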

TYPES OF LOGISTIC REGRESSION

There are three types of logistic regression models, which are defined based on categorical
response.

• Binary logistic regression: In this approach, the response or dependent variable is
dichotomous in nature, i.e. it has only two possible outcomes (e.g. 0 or 1). Some popular
examples of its use include predicting if an e-mail is spam or not spam or if a tumor is
malignant or not malignant. Within logistic regression, this is the most commonly used
approach, and more generally, it is one of the most common classifiers for binary
classification.
• Multinomial logistic regression: In this type of logistic regression model, the dependent
variable has three or more possible outcomes; however, these values have no specified
order. For example, movie studios want to predict what genre of film a moviegoer is likely
to see to market films more effectively. A multinomial logistic regression model can help
the studio to determine the strength of influence a person's age, gender, and dating status
may have on the type of film that they prefer. The studio can then orient an advertising
campaign of a specific movie toward a group of people likely to go see it.
• Ordinal logistic regression: This type of logistic regression model is leveraged when the
response variable has three or more possible outcomes, but in this case, these values do have
a defined order. Examples of ordinal responses include grading scales from A to F or rating
scales from 1 to 5.
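
The binary case described above can be sketched with a small pure-Python fit. This is an illustrative sketch under stated assumptions, not the course's method: the data (hours studied and pass/fail) are made up, the names `sigmoid` and `fit_binary_logistic` are hypothetical, and the coefficients are estimated by plain gradient ascent on the log-likelihood rather than by the routines a statistics package would use.

```python
import math


def sigmoid(z):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_binary_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(Y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Residuals: observed outcome minus predicted probability.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Made-up data: hours studied (x) and pass = 1 / fail = 0 (y).
# The two classes overlap, so the estimates stay finite.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_binary_logistic(hours, passed)
p_2h = sigmoid(b0 + b1 * 2.0)   # predicted probability of passing after 2 hours
p_4h = sigmoid(b0 + b1 * 4.0)   # predicted probability of passing after 4 hours
odds_ratio = math.exp(b1)       # multiplicative change in odds per extra hour

print(round(p_2h, 2), round(p_4h, 2), round(odds_ratio, 2))
```

A positive slope `b1` means the predicted probability of the outcome coded 1 rises with the predictor, and `exp(b1)` is the odds ratio for a one-unit increase in that predictor.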

Exercises:

1. Explain the probability value of logistic regression. (BL: 4, Analyze)

2. What do you mean by Wald Statistic? (BL: 4, Analyze)

3. Analyze the concept of odds ratio. (BL: 4, Analyze)

4. Discuss the uses of logistic regression. (BL: 5, Evaluate)

MODULE – 3

ADVANCED DATA ANALYTICS

Exercises

1. Why is it useful to rotate the factors? Which is the most common method of rotation? (BL: 5,
Evaluate)

2. What guidelines are available for interpreting the factors? (BL: 4, Analyze)

3. What is the major difference between principal components analysis and common factor
analysis? (BL: 4, Analyze)

4. What hypothesis is examined by Bartlett’s test of sphericity? For what purpose is this test
used? (BL: 5, Evaluate)

5. For what purpose is the Kaiser–Meyer–Olkin measure of sampling adequacy used? (BL: 5,
Evaluate)

MODULE – 4

ADVANCED DATA ANALYTICS

Exercise:

1. Why is the average linkage method usually preferred to single linkage and complete linkage? (BL:
5, Evaluate)

2. What guidelines are available for deciding the number of clusters? (BL: 4, Analyze)

3. Upon what basis may a researcher decide which variables should be selected to formulate a
clustering problem? (BL: 5, Evaluate)

4. What are some of the uses of cluster analysis in marketing? (BL: 4, Analyze)

5. Compare different clustering procedures. (BL: 4, Analyze)
