Data Analytics: Relation Analysis

This document discusses relationship analysis and correlation analysis techniques in data analytics. It presents an example of wage data to analyze the relationship between wage and age, wage and year, and wage and education level. The interpretations show that wage increases with age until 60 years and then declines, increases slowly with calendar years from 2010 to 2016, and also increases with level of education. The document then discusses measures of correlation, positive correlation, negative correlation, and zero correlation. It emphasizes that correlation analysis only makes sense when the variables have a cause-effect relationship.

Uploaded by

2d Hoehoe

Data Analytics

(CS40003)
Lecture #7
Relation Analysis

Dr. Debasis Samanta

Associate Professor
Department of Computer Science & Engineering
Quote of the day..

Nothing great was ever achieved without

enthusiasm.
 RALPH WALDO EMERSON, American philosopher

CS 40003: Data Analytics 2

This presentation includes…

Introduction

 Measures of Relationship
 Correlation Analysis
 χ2-Test
 Spearman’s Correlation Analysis
 Pearson’s Correlation Analysis
 Regression Analysis
 Simple Linear Regression
 Multiple Linear Regression
 Non-Linear Regression Analysis
 Auto-Regression Analysis


Hypothesis Testing Strategies

 There are two types of tests of hypotheses

Parametric tests (also called standard test of hypotheses).

 Non-parametric tests (also called distribution-free test of hypotheses)


Parametric Tests : Applications
 Usually assume certain properties of the population from
which we draw samples.

• Observations come from a normal population

• Sample size is small

• Assumptions about population parameters like mean, variance, etc. hold good.

• Require measurement equivalent to interval-scaled data.

Hypothesis Testing : Non-Parametric Test
Non-Parametric tests
o Do not rest on any assumption about the population
o Assume only nominal or ordinal data

Note: Non-parametric tests need the entire population (or a very large sample size)

Relationship Analysis
Example: Wage Data

A large data set regarding the wages of a group of employees from the eastern
region of India is given.

In particular, we wish to understand the following relationships:

 Employee’s age and wage: How do wages vary with age?

 Calendar year and wage: How do wages vary with time?

 Employee’s education and wage: Are wages in any way related to
employees’ education levels?

Relationship Analysis
 Example: Wage Data

 Case I. Wage versus Age

 From the data set, we have a graphical representation, which is as follows:

[Figure: plot of wage versus age]

How do wages vary with age?

Relationship Analysis
 Example: Wage Data
 Employee’s age and wage: How do wages vary with age?

Interpretation: On average, wage increases with age until about 60 years of age, at
which point it begins to decline.

Relationship Analysis
 Example: Wage Data

 Case II. Wage versus Year

 From the data set, we have a graphical representation, which is as follows:

[Figure: plot of wage versus year]

How do wages vary with time?

Relationship Analysis
 Example: Wage Data
 Wage and calendar year: How do wages vary with years?

Interpretation: There is a slow but steady increase in the average wage between 2010 and
2016.

Relationship Analysis
 Example: Wage Data

 Case III. Wage versus Education

 From the data set, we have a graphical representation, which is as follows:

[Figure: plot of wage versus education level]

Are wages related to education?

Relationship Analysis
 Example: Wage Data
 Wage and education level: Do wages vary with employees’ education levels?

Interpretation: On average, wage increases with the level of education.

Relationship Analysis
Given an employee’s wage, can we predict his or her age?

Does wage have any association with both year and education
level?

etc.

An Open Challenge!


Suppose there are countably infinite points in the XY plane. We need a huge memory to store all
such points.

Is there any way to store this information with the least amount of memory?
Say, with two values only.

Yahoo!
y=ax+b

Just decide the values of a and b

(as if storing one point’s data only!)

Note: Here, the trick was to find a relationship among all the points.

Measures of Relationship
 Univariate population: The population consisting of only one variable.

Here, statistical measures suffice to find a relationship.

 Bivariate population: Here, the data happen to be on two variables.

Measures of Relationship
 Multivariate population: If the data happen to be on more than two variables.

[Figure: three-dimensional plot with axes Pressure, Temperature, and Volume]

 What if we add another variable, say viscosity, in addition to Pressure, Volume, and Temperature?

Measures of Relationship
In case of bivariate and multivariate populations, usually, we have to answer two
types of questions:

Q1: Does there exist correlation (i.e., association) between two (or more) variables?
If yes, of what degree?

Q2: Is there any cause and effect relationship between the two variables (in case of
bivariate population) or one variable in one side and two or more variables on the
other side (in case of multivariate population)?
If yes, of what degree and in which direction?

To find solutions to the above questions, two approaches are known.

 Correlation Analysis
 Regression Analysis


Correlation Analysis


Correlation Analysis
 In statistics, the word correlation is used to denote some form of
association between two variables.
 Example: Weight is correlated with height

The correlation may be positive, negative or zero.

 Positive correlation: If the value of the attribute A increases with the increase
in the value of the attribute B and vice-versa.

 Negative correlation: If the value of the attribute A decreases with the
increase in the value of the attribute B and vice-versa.

 Zero correlation: When the values of attribute A vary at random with B and
vice-versa.

Correlation Analysis
 In order to measure the degree of correlation between two attributes, a scatter plot can be used.

[Figure: scatter plot; x-axis: Hours of study (1 to 7); y-axis ticks from 10 to 100]

Correlation Analysis
 Do you find any correlation between X and Y as shown in the table?

[Table: # CD versus # Cigarette]

Note:
In data analytics, correlation analysis makes sense only when the relationship makes sense.
There should be a cause-effect relationship.

Correlation Analysis
[Figure: three scatter plots titled Positive correlation, Negative correlation, and Zero correlation]

Correlation Coefficient
 Correlation coefficient is used to measure the degree of association.

 It is usually denoted by r.

 The value of r lies between +1 and -1.

 Positive values of r indicate positive correlation between two variables,
whereas negative values of r indicate negative correlation.

 A value of r nearer to +1 or -1 indicates a high degree of correlation between
the two variables.

 r = 0 implies that there is no correlation

Correlation Coefficient
[Figure: four scatter plots titled High Positive Correlation, Low Positive Correlation, High Negative Correlation, and Low Negative Correlation]

Correlation Coefficient
[Figure: example scatter plots labelled R = +0.60, R = +0.80, R = +0.80, and R = +0.40]

Measuring Correlation Coefficients
 There are three methods known to measure the correlation coefficients

 Karl Pearson’s coefficient of correlation

 This method is applicable to find correlation coefficient between two numerical
attributes

 Charles Spearman’s coefficient of correlation

 This method is applicable to find correlation coefficient between two ordinal attributes

 Chi-square coefficient of correlation

 This method is applicable to find correlation coefficient between two categorical
attributes


Pearson’s Correlation Coefficient


Karl Pearson’s Correlation Coefficient
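 This is also called Pearson’s Product Moment Correlation

Definition 7.1: Karl Pearson’s correlation coefficient

Let us consider two attributes X and Y, observed as n pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$.

The Karl Pearson’s coefficient of correlation is denoted by $r^*$ and is defined
as

$$ r^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $$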
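
As a minimal sketch, Pearson's product-moment coefficient can be computed directly from the deviations of each attribute from its mean. The function name `pearson_r` and the sample data below are hypothetical and are not taken from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: Y is an exact increasing linear function of X
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 6, 8, 10]
r = pearson_r(hours, marks)
```

With perfectly linear, increasing data such as this, the coefficient comes out at +1; reversing one of the sequences gives -1.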

Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 A small study is conducted involving 17 infants to investigate the association between
gestational age at birth, measured in weeks, and birth weight, measured in grams.


Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 We wish to estimate the association between gestational age and infant birth weight.

 In this example, birth weight is the dependent variable and gestational age is the
independent variable. Thus Y = birth weight and X = gestational age.

 The data are displayed in a scatter diagram in the figure below.


Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 For the given data, it can be shown that

$r^* = 0.82$

Conclusion: The sample’s correlation coefficient indicates a strong positive correlation
between Gestational Age and Birth Weight.

Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight

 Significance Test
 To test whether the association is merely apparent and might have arisen by chance, use the t-test in the following
calculation

$$ t = r \sqrt{\frac{n - 2}{1 - r^2}} $$

 The number of pairs of observations is 17. Hence,

$$ t = 0.82 \sqrt{\frac{17 - 2}{1 - 0.82^2}} \approx 5.55 $$

 Consulting the t-test table, at 15 degrees of freedom and for α = 0.05 (one-tailed), we find the critical value t = 1.753. Since the computed value exceeds this critical value, the value of Pearson’s
correlation coefficient in this case may be regarded as highly significant.
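
The t statistic for a correlation coefficient is a one-line computation. The sketch below, with the hypothetical helper name `correlation_t_statistic`, evaluates it for r = 0.82 and n = 17 as in Example 7.1.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a sample correlation r from n pairs differs from zero."""
    # Degrees of freedom are n - 2
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = correlation_t_statistic(0.82, 17)
print(round(t, 2))  # 5.55
```

The result is then compared with the tabulated critical value for n - 2 degrees of freedom at the chosen significance level.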

Rank Correlation Coefficient


Charles Spearman’s Correlation Coefficient
 This correlation measurement is also called Rank correlation.
 This technique is applicable to determine the degree of correlation between two
variables in case of ordinal data.

 We can assign rank to the different values of a variable with ordinal data type.

[Figure: example of ranks assigned to ordinal values]

Charles Spearman’s Correlation Coefficient

Definition 7.2: Charles Spearman’s correlation coefficient

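The rank correlation can be defined as

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} $$

where $d_i$ is the difference between the ranks of the i-th pair of values and n is the number of pairs.

 The Spearman’s coefficient is often used as a statistical method to aid in either proving or disproving a hypothesis.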
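
A minimal sketch of the rank correlation computation follows. It assumes the usual formula based on squared rank differences, assigns rank 1 to the largest value, and gives tied values their mean rank (with ties the formula is only approximate). The function names and the width/depth figures are hypothetical.

```python
def average_ranks(values):
    """Rank values so that the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run over all values tied with the one at position `pos`
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical measurements, not the lecture's river data
width = [10, 20, 30, 40, 50]
depth = [1.0, 1.5, 1.4, 2.5, 3.0]
rs = spearman_rs(width, depth)
```

Here only two neighbouring depth values are out of order, so the sum of squared rank differences is 2 and the coefficient is 0.9.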

Charles Spearman’s Coefficient of Correlation
Example 7.2: The hypothesis is that the depth of a river does not progressively increase with the
width of the river.

A sample of size 10 is collected to test the hypothesis, using Spearman’s correlation coefficient.

Charles Spearman’s Coefficient of Correlation
Step 1: Assign a rank to each data value. It is customary to assign rank 1 to the largest value, 2 to the
next largest, and so on.
Note: If there are two or more samples with the same value, the mean rank should be
used.

Charles Spearman’s Coefficient of Correlation
Step 2: The contingency table will look like

$r_s = 0.9757$

Charles Spearman’s Coefficient of Correlation
Step 3: To see if this value is significant, the Spearman’s rank significance table (or
graph) must be consulted.

[Figure: Spearman’s rank significance graph; y-axis: Spearman’s rank correlation coefficient (0.1 to 1.0); x-axis ticks from 2 to 10; curves for the 5%, 1%, and 0.1% significance levels]

Charles Spearman’s Coefficient of Correlation
Step 4: Final conclusion
From the graph, we see that $r_s = 0.9757$ lies above the line at 8 degrees of freedom and 0.1%
significance level. Hence, there is a greater than 99.9% chance that the
relationship is significant (i.e., not random) and hence the hypothesis
should be rejected.

Thus, we can reject the hypothesis and conclude that in this case, the depth of
a river progressively increases with the width of the river.

χ2-Correlation Analysis


Chi-Squared Test of Correlation
 This method is alternatively termed Pearson’s χ2-test or simply χ2-test
 This method is applicable to categorical (discrete) data only.

 Suppose there are two attributes A and B with categorical values

$A = \{a_1, a_2, a_3, \ldots, a_m\}$ and
$B = \{b_1, b_2, b_3, \ldots, b_n\}$
having m and n distinct values, respectively,

between which we are to find the correlation relationship.

χ2–Test Methodology
Contingency Table
Given a data set, it is customary to draw a contingency table, whose structure is given
below.

χ2–Test Methodology
Entry into Contingency Table: Observed Frequency
In the contingency table, an entry Oij denotes the observed frequency of the event that attribute A takes on value ai and
attribute B takes on value bj (i.e., A = ai, B = bj).

χ2–Test Methodology
Entry into Contingency Table: Expected Frequency
In the contingency table, an entry eij denotes the expected frequency, which can be calculated
as

$$ e_{ij} = \frac{Count(A = a_i) \times Count(B = b_j)}{Grand\ Total} = \frac{A_i \times B_j}{N} $$

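χ2–Test
Definition 7.3: χ2-Value

The χ2 value (also known as the Pearson’s χ2 test) can be computed as

$$ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$

where $o_{ij}$ is the observed frequency and $e_{ij}$ is the expected frequency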

χ2–Test
 The cells that contribute the most to the χ2 value are those whose
actual count is very different from the expected count.

 The χ2 statistic tests the hypothesis that A and B are independent.

 The test is based on a significance level, with (n-1) × (m-1) degrees
of freedom for a contingency table of size n × m

 If the hypothesis can be rejected, then we say that A and B are
statistically related or associated.
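
The computation of the χ2 value from a contingency table of observed counts can be sketched as follows. Expected frequencies are obtained from the row totals, column totals, and grand total. The function name `chi_square` and the 2 × 2 table are hypothetical and are not the survey data of the lecture's examples.

```python
def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency under independence of the two attributes
            e_ij = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Hypothetical observed counts
table = [[20, 30],
         [30, 20]]
chi2, dof = chi_square(table)
```

For this table every expected count is 25, so each of the four cells contributes 1 and the statistic is 4.0 with 1 degree of freedom. The value would then be compared with the tabulated χ2 value at the chosen significance level.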

χ2–Test
Example 7.3: Survey on Gender versus Hobby.
 Suppose a survey was conducted among a population of size 1500. In this survey, the gender
of each person and their hobby as either “book” or “computer” was noted. The survey
result is obtained in a table like the following.

 We have to find whether there is any association between the Gender and Hobby of people, that is,
we are to test whether “gender” and “hobby” are correlated.

χ2–Test
Example 7.3: Survey on Gender versus Hobby.
 From the survey table, the observed frequencies are counted and entered into the
contingency table, which is shown below.

[Table: observed frequencies; rows: HOBBY (Book, Computer, Total); columns: GENDER (Male, Female, Total)]

χ2–Test
Example 7.3: Survey on Gender versus Hobby.
 From the survey table, the expected frequencies are calculated and entered into the
contingency table, which is shown below.

[Table: expected frequencies; rows: HOBBY (Book, Computer, Total); columns: GENDER (Male, Female, Total)]

χ2–Test
 Using the equation for χ2 computation, we get

$$ \chi^2 = \frac{(o_{11} - e_{11})^2}{e_{11}} + \frac{(o_{12} - e_{12})^2}{e_{12}} + \frac{(o_{21} - e_{21})^2}{e_{21}} + \frac{(o_{22} - e_{22})^2}{e_{22}} $$

 This value needs to be compared with the tabulated value of χ2 (available in any
standard book on statistics) with 1 degree of freedom (for a table of m × n, the
degrees of freedom is (m-1) × (n-1); here m = 2, n = 2).

 For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the 0.001
significance level is 10.828. Since our computed value is above this, we reject the
hypothesis that “Gender” and “Hobby” are independent and hence conclude that the
two attributes are strongly correlated for the given group of people.

χ2–Test
Example 7.4: Hypothesis on “accident proneness” versus “driver’s handedness”.
 Consider the following data on car accidents among left- and right-handed drivers, with sample size
175.

 The hypothesis is that “fatality of accidents is independent of driver’s handedness”

[Table: rows: FATALITY (Non-Fatal, Fatal, Total); columns: HANDEDNESS (Left-Handed, Right-Handed, Total)]

 Find the correlation between Fatality and Handedness and test the significance of the
correlation with significance level 0.1%.

Regression Analysis


Regression Analysis
 The regression analysis is a statistical method to deal with the formulation of
mathematical model depicting relationship amongst variables, which can be used
for the purpose of prediction of the values of dependent variable, given the values
of independent variables.
 Classification of Regression Analysis Models
 Linear regression models
1. Simple linear regression
2. Multiple linear regression
 Non-linear regression models

[Figure: three plots illustrating simple linear regression (Y versus X), multiple linear regression (Y versus X and Z), and non-linear regression (Y versus X)]

Simple Linear Regression Model
 In simple linear regression, we have only two variables:
 Dependent variable (also called Response), usually denoted as Y.
 Independent variable (alternatively called Regressor), usually denoted as x.
 A reasonable form of a relationship between the Response Y and the Regressor x is the linear
relationship, that is, in the form

Y=α+βx

[Figure: line Y=α+βx with slope β=tan(θ)]

Note:
 There are an infinite number of lines (and hence values of α and β)

 The concept of regression analysis deals with finding the best relationship between Y and x
(and hence the best fitted values of α and β) and quantifying the strength of that relationship.

Regression Analysis


Given the set of data involving n pairs of $(x_i, y_i)$ values, our objective is to find the “true” or population regression
line such that

$$ Y = \alpha + \beta x + \epsilon $$

Here, ε is a random variable with E(ε) = 0 and Var(ε) = σ². The quantity σ² is often called the error variance.
Note:
 E(ε) = 0 implies that at a specific x, the y values are distributed around the “true” regression line Y = α + βx (i.e., the
positive and negative errors around the true line are reasonable).
 α and β are called regression coefficients.

 The values of α and β are to be estimated from the data.

True versus Fitted Regression Line
 The task in regression analysis is to estimate the regression coefficients α and β.
 Suppose we denote the estimates a for α and b for β. Then the fitted regression line is

$$ \hat{Y} = a + b x $$

where Ŷ is the predicted or fitted value.

[Figure: fitted line Ŷ=a+bx plotted together with the true line Y=α+βx]

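Least Square Method to estimate α and β
 This method uses the concept of residual. A residual is essentially an error in the fit of the model
Ŷ = a + bx. Thus, the i-th residual is

$$ e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n $$

[Figure: a data point with its residual e_i measured from the fitted line Ŷ=a+bx and its random error ε_i measured from the true line Y=α+βx]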

Least Square method
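 The residual sum of squares is often called the sum of squares of the errors about the fitted line and is
denoted as SSE

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 $$

 We are to minimize the value of SSE and hence to determine the parameters a and b.

 Differentiating SSE with respect to a and b, we have

$$ \frac{\partial (SSE)}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) $$

$$ \frac{\partial (SSE)}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a - b x_i)\, x_i $$

For the minimum value of SSE, $\frac{\partial (SSE)}{\partial a} = 0$ and $\frac{\partial (SSE)}{\partial b} = 0$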

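Least Square method to estimate α and β
 Thus we set

$$ n a + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i $$

$$ a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i $$

These two equations can be solved to determine the values of a and b, and it can be
calculated that

$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b \bar{x} $$

where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and $y_i$ values.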
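
The least squares estimates of the intercept a and slope b can be sketched in a few lines. The function name `fit_line` and the data are hypothetical; the data are generated from the exact line y = 1 + 2x so that the expected estimates are easy to verify.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for the fitted line y_hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b

# Hypothetical noise-free data lying on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
```

With noise-free data the estimates reproduce the generating line (a = 1, b = 2); with real data they give the line that minimizes the sum of squared residuals.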

R2: Measure of Quality of Fit
 A quantity R2, called the coefficient of determination, is used to measure the proportion of
variability explained by the fitted model.

 We have

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

 It signifies the variability due to error.

 Now, let us define the total corrected sum of squares, defined as

$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

 SST represents the variation in the response values. The R2 is

$$ R^2 = 1 - \frac{SSE}{SST} $$

Note:
 If the fit is perfect, all residuals are zero and thus R2 = 1.0 (very good fit)

 If SSE is only slightly smaller than SST, then R2 ≈ 0 (very poor fit)
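
A short sketch of the coefficient of determination, computed as one minus the ratio of the residual sum of squares to the total corrected sum of squares. The function name `r_squared` and the values are hypothetical.

```python
def r_squared(ys, fitted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # variability due to error
    sst = sum((y - mean_y) ** 2 for y in ys)             # total variation in the response
    return 1 - sse / sst

ys = [1, 2, 3, 4]
perfect = r_squared(ys, [1, 2, 3, 4])       # fitted values equal the observations
poor = r_squared(ys, [2.5, 2.5, 2.5, 2.5])  # fitted values are just the mean of ys
```

The two calls illustrate the extreme cases from the note above: a perfect fit gives 1.0, while a fit no better than the mean of the responses gives 0.0.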

R2: Measure of Quality of Fit

[Figure: two scatter plots with a fitted line Ŷ, one with R2 ≈ 1.0 (Very good fit) and one with R2 ≈ 0 (Very poor fit)]

Multiple Linear Regression
 When more than one variable is an independent variable, the regression can
be estimated as a multiple regression model

 When this model is linear in coefficients, it is called multiple linear regression
model

 If k independent variables $x_1, x_2, \ldots, x_k$ are associated, the multiple linear
regression model is given by

$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon $$

 And the estimated response is obtained as

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k $$

Multiple Linear Regression
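 Estimating the coefficients
Let the data points given to us be
$(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i)$, $i = 1, 2, \ldots, n$,

where $y_i$ is the observed response to the values $x_{1i}, x_{2i}, \ldots, x_{ki}$ of the k independent variables $x_1, x_2, \ldots, x_k$.

Thus,
$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i $$
and
$$ y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + e_i $$

where $\epsilon_i$ and $e_i$ are the random error and residual error, respectively, associated with the true
response and the fitted response.

Using the concept of the Least Square Method to estimate $b_0, b_1, \ldots, b_k$, we minimize the expression

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \cdots - b_k x_{ki})^2 $$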

Multiple Linear Regression
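 Differentiating SSE in turn with respect to $b_0, b_1, \ldots, b_k$ and equating to zero, we generate the set of
(k+1) normal estimation equations for multiple linear regression.

$$ n b_0 + b_1 \sum_{i=1}^{n} x_{1i} + b_2 \sum_{i=1}^{n} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{ki} = \sum_{i=1}^{n} y_i $$
$$ b_0 \sum_{i=1}^{n} x_{1i} + b_1 \sum_{i=1}^{n} x_{1i}^2 + b_2 \sum_{i=1}^{n} x_{1i} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{1i} x_{ki} = \sum_{i=1}^{n} x_{1i} y_i $$
… … … … … …
$$ b_0 \sum_{i=1}^{n} x_{ki} + b_1 \sum_{i=1}^{n} x_{ki} x_{1i} + b_2 \sum_{i=1}^{n} x_{ki} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{ki}^2 = \sum_{i=1}^{n} x_{ki} y_i $$

 The system of linear equations can be solved for $b_0, b_1, \ldots, b_k$ by any appropriate method for solving
a system of linear equations.

 Hence, the multiple linear regression model can be built.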
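
As one possible way to solve the normal equations, the sketch below builds them from the data and solves the resulting linear system by Gaussian elimination with partial pivoting. The helper names `solve_linear_system` and `fit_multiple_linear` and the data are hypothetical; any other linear solver would serve equally well.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_multiple_linear(rows, ys):
    """rows[i] = [x_1i, ..., x_ki]; returns the estimates [b0, b1, ..., bk]."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept b0
    m = len(X[0])
    # Normal equations: entry (p, q) is the sum over observations of x_p * x_q
    XtX = [[sum(x[p] * x[q] for x in X) for q in range(m)] for p in range(m)]
    Xty = [sum(x[p] * y for x, y in zip(X, ys)) for p in range(m)]
    return solve_linear_system(XtX, Xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
rows = [[1, 1], [2, 1], [1, 2], [2, 2], [3, 1]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = fit_multiple_linear(rows, ys)
```

Because the data are noise-free, the estimates recover the generating coefficients 1, 2, and 3 up to floating-point rounding.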

Non Linear Regression Model
 When the regression equation is in terms of r-degree, r>1, then it is called a nonlinear
regression model. When there is more than one independent variable, it is
called a Multiple Nonlinear Regression model. It is also alternatively termed a
polynomial regression model. In general, it takes the form

$$ Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_r x^r + \epsilon $$

 The estimated response is obtained as

$$ \hat{y} = b_0 + b_1 x + b_2 x^2 + \cdots + b_r x^r $$

Solving for Polynomial Regression Model
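 Given that $(x_i, y_i)$; i = 1,2,…,n are n pairs of observations. Each observation would satisfy the
equations:
$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_r x_i^r + \epsilon_i $$
and
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + \cdots + b_r x_i^r + e_i $$
where r is the degree of the polynomial,
$\epsilon_i$ is the random error, and
$e_i$ is the residual error

Note: The number of observations, n, must be at least as large as r+1, the number of
parameters to be estimated.

The polynomial model can be transformed into a general linear regression model by setting $x_1 = x$, $x_2 = x^2$, …,
$x_r = x^r$. Thus, the equation assumes the form:
$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r + \epsilon $$
$$ y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_r x_r + e $$

This model can then be solved using the procedure followed for the multiple linear regression
model.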
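
The transformation can be sketched as follows: each observation x is expanded into the regressors x, x^2, ..., x^r, and the resulting linear model is solved through its normal equations. The helper names and the data are hypothetical, and Gaussian elimination is just one possible solver.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_polynomial(xs, ys, degree):
    """Return [b0, b1, ..., br] for a polynomial of the given degree r."""
    if len(xs) < degree + 1:
        raise ValueError("need at least degree + 1 observations")
    # Transformed regressors: column p holds x**p (column 0 is the constant 1)
    X = [[x ** p for p in range(degree + 1)] for x in xs]
    m = degree + 1
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(m)] for p in range(m)]
    Xty = [sum(row[p] * y for row, y in zip(X, ys)) for p in range(m)]
    return solve_linear_system(XtX, Xty)

# Hypothetical noise-free data generated from y = 1 - x + 2x^2
xs = [-2, -1, 0, 1, 2]
ys = [1 - x + 2 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
```

The check on the number of observations mirrors the note above: at least r + 1 observations are needed to estimate r + 1 parameters.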

Auto-Regression Analysis


Auto Regression Analysis
 Regression analysis for time-ordered data is known as Auto-Regression
Analysis

 Time series data are data collected on the same observational unit at multiple
time periods

Example: Indian rate of price inflation


Auto Regression Analysis
 Examples: Which of the following are time-series data?

 Aggregate consumption and GDP for a country (for example, 20 years of quarterly
observations = 80 observations)

 Yen/$, pound/$ and Euro/$ exchange rates (daily data for 1 year = 365
observations)

 Cigarette consumption per capita in a state, by years

 Rainfall data over a year

 Sales of tea from a tea shop in a season


Auto Regression Analysis
 Examples: Which of the following graphs is due to time-series data?

Use of Time Series Data
 To develop forecast model

 What will the rate of inflation be next year?

 To estimate dynamic causal effects

 If the rate of interest is increased now, what will be the effect on the rates of
inflation and unemployment in 3 months? in 12 months?

 What is the effect over time on electronics goods consumption of a hike in the excise duty?

 Time dependent analysis

 Rates of inflation and unemployment in the country can be observed only over time!

Modeling with Time Series Data
 Correlation over time

 Serial correlation, also called autocorrelation

 Calculating standard error

 To estimate dynamic causal effects

 Under which dynamic effects can be estimated?

 How to estimate?

 Forecasting model

 Forecasting model built on regression model

Auto-Regression Model for Forecasting

 Can we predict the trend at a time, say 2017?

Some Notations and Concepts
 Yt = Value of Y in a period t

 Data set [Y1, Y2, … YT-1, YT]: T observations on the time series random variable Y

 Assumptions
 We consider only consecutive, evenly spaced observations
 For example, monthly, 2000-2015, no missing months

 A time series Yt is stationary if its probability distribution does not change over
time, that is, if the joint distribution of (Yi+1, Yi+2, …, Yi+T) does not depend on i.

 Stationary property implies that history is relevant. In other words, stationarity requires the future
to be like the past (in a probabilistic sense).

 Auto Regression analysis assumes that Yt is stationary.

Some Notations and Concepts
 There are four ways to have the time series data for AutoRegression analysis

 Lag: The first lag of Yt is Yt-1, its j-th lag is Yt-j

 Difference: The first difference of a series Yt is its change between period t-1 and period t,
that is, yt = Yt - Yt-1

 Log difference: yt = log(Yt) - log(Yt-1)

 Percentage: yt = 100 × (Yt - Yt-1) / Yt-1
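
The four transformations can be sketched as small list operations. The function names and the three-value series are hypothetical; the percentage change is taken here as 100 times the relative change from the previous period, which is an assumption about the formula intended on the slide.

```python
import math

def lag(series, j=1):
    """j-th lag: entry t holds Y_{t-j}; the first j entries are undefined (None)."""
    if j <= 0:
        return list(series)
    return [None] * j + list(series[:-j])

def first_difference(series):
    """y_t = Y_t - Y_{t-1}"""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def log_difference(series):
    """y_t = log(Y_t) - log(Y_{t-1})"""
    return [math.log(series[t]) - math.log(series[t - 1]) for t in range(1, len(series))]

def percentage_change(series):
    """y_t = 100 * (Y_t - Y_{t-1}) / Y_{t-1}"""
    return [100.0 * (series[t] - series[t - 1]) / series[t - 1]
            for t in range(1, len(series))]

# Hypothetical series growing by 10% per period
y = [100.0, 110.0, 121.0]
lagged = lag(y)
diffs = first_difference(y)
log_diffs = log_difference(y)
pct = percentage_change(y)
```

For small changes, 100 times the log difference is close to the percentage change, which is why the two are often used interchangeably.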

Some Notations and Concepts
 Autocorrelation

 The correlation of a series with its own lagged values is called autocorrelation
(also called serial correlation)

Definition 7.4: j-th Autocorrelation

The j-th autocorrelation, denoted by ρj, is defined as

$$ \rho_j = \frac{cov(Y_t, Y_{t-j})}{\sqrt{var(Y_t)\, var(Y_{t-j})}} $$
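
In practice the j-th autocorrelation is estimated from a sample. The sketch below uses one common estimator (an assumption, since the lecture does not specify one): the sum of lagged cross-deviations from the overall mean, divided by the sum of squared deviations. The function name and the series are hypothetical.

```python
def autocorrelation(series, j):
    """Sample j-th autocorrelation of a series, using the overall mean for both terms."""
    T = len(series)
    mean = sum(series) / T
    denom = sum((y - mean) ** 2 for y in series)
    # Pair each Y_t with its j-th lag Y_{t-j}
    num = sum((series[t] - mean) * (series[t - j] - mean) for t in range(j, T))
    return num / denom

# Hypothetical series
y = [1, 2, 3, 4, 5]
r0 = autocorrelation(y, 0)
r1 = autocorrelation(y, 1)
```

By construction the lag-0 value is 1; for this short series the lag-1 value is 0.4.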

Some Notations and Concepts

 For the given data, say ρ1 = 0.84

 This implies that the Dollars per Pound is highly serially correlated

 Similarly, we can determine ρ2 , ρ3 …. etc., and hence different regression analyses


Auto-Regression Model for Forecasting
 A natural starting point for forecasting model is to use past values of Y, that
is, Yt-1, Yt-2, … to predict Yt

 An autoregression is a regression model in which Yt is regressed against its

own lagged values.

 The number of lags used as regressors is called the order of autoregression

 In first order autoregression (denoted as AR(1)), Yt is regressed against Yt-1

 In p-th order autoregression (denoted as AR(p)), Yt is regressed against, Yt-1, Yt-2,

…,Yt-p


p-th Order AutoRegression Model
Definition 7.5: p-th AutoRegression Model

In general, the p-th order autoregression model is defined as

$$ Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \epsilon_t $$

Here, $\beta_i$ (i = 0, 1, …, p) are called autoregression coefficients and $\epsilon_t$ is the noise term or residue; in
practice it is assumed to be Gaussian white noise

 For example, AR(1) is

$$ Y_t = \beta_0 + \beta_1 Y_{t-1} + \epsilon_t $$

 The task in AR analysis is to derive the "best" values for $\beta_i$, i = 0, 1, …, p, given
a time series Yt.
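
A first-order autoregression can be fitted as an ordinary simple linear regression of each value on its first lag. The sketch below does this by least squares and then makes a one-step forecast. The function names are hypothetical, and the series is generated without noise from Y_t = 2 + 0.5 Y_{t-1} purely for illustration.

```python
def fit_ar1(series):
    """Least squares estimates (b0, b1) for Y_t = b0 + b1 * Y_{t-1}."""
    xs = series[:-1]  # lagged values Y_{t-1}
    ys = series[1:]   # current values Y_t
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def forecast_next(series, b0, b1):
    """One-step-ahead forecast from the last observed value."""
    return b0 + b1 * series[-1]

# Hypothetical noise-free series following Y_t = 2 + 0.5 * Y_{t-1}
y = [10.0]
for _ in range(5):
    y.append(2 + 0.5 * y[-1])

b0, b1 = fit_ar1(y)
next_value = forecast_next(y, b0, b1)
```

With real, noisy data the estimates would only approximate the underlying coefficients, and higher-order models AR(p) would regress on p lags instead of one.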

Computing AR Coefficients
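 A number of techniques are known for computing the AR coefficients

 The most common method is called Least Squares Method (LSM)

 The LSM is based upon the Yule-Walker equations

$$ \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix} = \begin{bmatrix} 1 & r_1 & \cdots & r_{p-1} \\ r_1 & 1 & \cdots & r_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} $$

 Here, ri (i = 1, 2, 3, …, p) denotes the i-th autocorrelation coefficient.

 β0 can be chosen empirically, usually taken as zero.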
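
For p = 2 the Yule-Walker system has only two unknowns and can be solved in closed form. The sketch below assumes the standard form of the equations, r1 = β1 + r1 β2 and r2 = r1 β1 + β2; the function name and the input values are hypothetical.

```python
def yule_walker_ar2(r1, r2):
    """Solve [[1, r1], [r1, 1]] [b1, b2]^T = [r1, r2]^T for the AR(2) coefficients."""
    det = 1 - r1 ** 2  # determinant of the 2 x 2 autocorrelation matrix
    if abs(det) < 1e-12:
        raise ValueError("autocorrelation matrix is singular")
    b1 = r1 * (1 - r2) / det
    b2 = (r2 - r1 ** 2) / det
    return b1, b2

# Hypothetical autocorrelations with r2 = r1^2
b1, b2 = yule_walker_ar2(0.5, 0.25)
```

When r2 equals r1 squared, as here, the second coefficient vanishes and the model reduces to AR(1) with β1 = r1. For larger p the same system would be solved with a general linear solver.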

Reference

The detailed material related to this lecture can be found in

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edn.), Trevor Hastie, Robert Tibshirani, Jerome
Friedman, Springer, 2014.

Any question?

You may post your question(s) at the “Discussion Forum”

maintained in the course Web page!


Questions of the day…
1. For a given sample data, the correlation coefficient according to Karl
Pearson’s correlation analysis is found to be r = 0.79 with degree of freedom 69.
Further, with the significance test, the t-value is calculated as t = 2.36. From the t-test
table, it is found that with degree of freedom 69, the t-value at 5%
confidence level is 3.61. What is the inference that you can have in this case?

2. For a given degree of freedom, if α, the value of confidence level, increases, then
the t-value increases. Is the statement correct? If not, what is the correct
statement? Justify your answer. You can refer to the following figure in your
explanation.

Questions of the day…
3. Is Spearman’s correlation coefficient analysis applicable to
numeric data? If so, how?

4. Can χ2–analysis be applied to ordinal data or numeric data? Justify your answer.

5. Briefly explain the following with reference to the correlation analysis.

a) Contingency table
b) Observed frequency
c) Expected frequency
d) Expression for χ2–value calculation
e) Hypothesis to be tested
f) Degree of freedom of sample data

Correlation Coefficient
2 pages
Week 5
No ratings yet
Week 5
4 pages
Stats Unit 2
No ratings yet
Stats Unit 2
24 pages
DADM-Correlation and Regression
No ratings yet
DADM-Correlation and Regression
138 pages
Statistical Analysis of Relationships the Basics
No ratings yet
Statistical Analysis of Relationships the Basics
24 pages