Data Analytics: Relation Analysis
(CS40003)
Lecture #7
Relation Analysis
Introduction
Measures of Relationship
Correlation Analysis
χ²-Test
Spearman’s Correlation Analysis
Pearson’s Correlation Analysis
Regression Analysis
Simple Linear Regression
Multiple Linear Regression
Non-Linear Regression Analysis
Auto-Regression Analysis
Note: Non-parametric tests need entire population (or very large sample size)
A large data set on the wages of a group of employees from the eastern
region of India is given.
Employee’s age and education: Are wages in any way related to
employees’ education levels?
How do wages vary with age?
Interpretation: On the average, wage increases with age until about 60 years of age, at
which point it begins to decline.
How do wages vary with time?
Interpretation: There is a slow but steady increase in the average wage between 2010 and
2016.
CS 40003: Data Analytics 11
Relationship Analysis
Example: Wage Data
Are wages related to education?
Does wage have any association with both year and education
level?
etc.
Suppose there are countably infinitely many points in the XY plane. Storing all
such points would need a huge amount of memory.
Is there any way out to store this information with a least amount of memory?
Say, with two values only.
Note: Here, the trick is to find a relationship among all the points.
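As a rough illustration of the idea, the sketch below assumes (this is an assumption, not stated in the slide) that all the points lie exactly on one straight line y = a + b·x. In that case the two stored values are the intercept a and the slope b; the function name `line_through` is hypothetical.

```python
# Assumption: every stored point lies exactly on one straight line y = a + b*x.
def line_through(p, q):
    """Return (a, b) for the line y = a + b*x through two distinct points."""
    (x1, y1), (x2, y2) = p, q
    b = (y2 - y1) / (x2 - x1)  # slope
    a = y1 - b * x1            # intercept
    return a, b

# Two known points are enough to recover the two values.
a, b = line_through((0.0, 1.0), (2.0, 5.0))

# Any other point on the line can be regenerated from (a, b) alone.
y_at_10 = a + b * 10.0
```

Only `a` and `b` need to be kept in memory; the points themselves are recomputed on demand.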
[Figure: plot with axes labelled Volume, Temperature and Pressure]
Q1: Does there exist correlation (i.e., association) between two (or more) variables?
If yes, of what degree?
Q2: Is there any cause and effect relationship between the two variables (in case of
bivariate population) or one variable in one side and two or more variables on the
other side (in case of multivariate population)?
If yes, of what degree and in which direction?
Example:
Zero correlation: When the values of attribute A vary at random with B and
vice-versa.
Correlation Analysis
Correlation analysis is used to measure the degree of correlation between two attributes.
[Figure: scatter plots; recoverable axis labels: Hours of study, # CD, # Cigarette]
Note:
In data analytics, correlation analysis makes sense only when the relationship itself makes sense.
There should be a cause-effect relationship.
[Figure: scatter plots illustrating different degrees of correlation, including R = +0.80 and R = +0.40]
In this example, birth weight is the dependent variable and gestational age is the
independent variable. Thus Y = birth weight and X = gestational age.
r = 0.82
The significance of r is tested with the statistic
$t = r\sqrt{\frac{n-2}{1-r^2}}$
The number of pairs of observations is 17. Hence,
$t = 0.82\sqrt{\frac{17-2}{1-0.82^2}} \approx 5.55$
Consulting the t-test table, at 15 degrees of freedom and for $\alpha = 0.05$, we find that t = 1.753. Since the computed
value exceeds this, the value of Pearson’s correlation coefficient in this case may be regarded as highly significant.
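The arithmetic of this test can be checked with a few lines of Python. This is only a sketch: the function name `t_statistic` is hypothetical, and the values r = 0.82, n = 17 and the tabulated 1.753 are taken from the example. Evaluating the formula gives a value of about 5.55.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)) for a correlation r from n pairs."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_statistic(0.82, 17)

# Tabulated value quoted in the example for 15 degrees of freedom.
critical = 1.753
significant = t > critical
```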
We can assign rank to the different values of a variable with ordinal data type.
Example:
Rank assigned
Spearman’s coefficient is often used as a statistical method to aid in either proving or disproving a hypothesis.
A sample of size 10 is collected to test the hypothesis, using Spearman’s correlation coefficient.
$r_s = 0.9757$
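A minimal sketch of how such a coefficient can be computed is given below. It uses the standard no-ties formula $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$, where $d_i$ is the difference between the two ranks of observation i. The sample values are hypothetical (the original sample is not reproduced here), and the helper names `ranks` and `spearman_rs` are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rs(x, y):
    """Spearman's rank correlation coefficient (no-ties formula)."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical sample of size 10 (illustrative values only).
width = [0, 50, 150, 200, 250, 320, 350, 370, 400, 420]
depth = [0, 10, 28, 42, 59, 51, 73, 85, 104, 96]
rs = spearman_rs(width, depth)
```

For this hypothetical sample the sum of squared rank differences is 4, so `rs` is 1 - 24/990, roughly 0.976.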
Charles Spearman’s Coefficient of Correlation
Step 3: To see if this value is significant, the Spearman’s rank significance table (or
graph) must be consulted.
Note:
[Figure: significance graph for Spearman’s rank correlation coefficient, with curves for the 5%, 1% and 0.1% significance levels]
Thus, we can reject the hypothesis and conclude that, in this case, the depth of
a river progressively increases with the width of the river.
We have to find whether there is any association between the Gender and Hobby of people, that is,
we are to test whether “gender” and “hobby” are correlated.
χ²-Test
Example 7.3: Survey on Gender versus Hobby.
From the survey table, the observed frequencies are counted and entered into the
contingency table, which is shown below.
              GENDER
HOBBY       Male   Female   Total
Book
Computer
Total

              GENDER
HOBBY       Male   Female   Total
Book
Computer
Total
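$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$
where $o_{ij}$ and $e_{ij}$ are the observed and expected frequencies of cell (i, j); the sum has one term for each of the four cells.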
This value needs to be compared with the tabulated value of χ² (available in any
standard book on statistics) with 1 degree of freedom (for a table of m × n, the
degrees of freedom is (m − 1)(n − 1); here m = 2, n = 2).
For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001
significance level is 10.828. Since our computed value is above this, we reject the
hypothesis that “Gender” and “Hobby” are independent and hence conclude that the
two attributes are strongly correlated for the given group of people.
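The whole procedure can be sketched in Python as follows. The counts are hypothetical and are not the survey values; the function name `chi_square` is illustrative. Expected frequencies are computed as (row total × column total) / grand total, and the result is compared with the tabulated value 10.828 for 1 degree of freedom.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows are Book, Computer; columns are Male, Female.
observed = [[30, 70],
            [60, 40]]
chi2, dof = chi_square(observed)

# Tabulated critical value for 1 degree of freedom quoted in the text.
reject_independence = chi2 > 10.828
```

With these made-up counts the statistic is about 18.18, so independence would be rejected.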
Fatal
Total
Find the correlation between Fatality and Handedness and test the significance of the
correlation with significance level 0.1%.
[Figure: three panels: simple linear regression (Y against X), multiple linear regression (Y against X and Z), non-linear regression (Y against X)]
Y = α + βx, where β = tan(θ) is the slope of the line and θ is the angle the line makes with the X-axis.
Note:
There are an infinite number of possible lines (and hence of values of α and β).
The concept of regression analysis deals with finding the best relationship between Y and X
(and hence the best fitted values of α and β) and quantifying the strength of that relationship.
Given the set of data involving n pairs of $(x_i, y_i)$ values, our objective is to find the “true” or population regression
line such that
$Y = \alpha + \beta x + \epsilon$
Here, $\epsilon$ is a random variable with $E(\epsilon) = 0$ and $var(\epsilon) = \sigma^2$. The quantity $\sigma^2$ is often called the error variance.
Note:
$E(\epsilon) = 0$ implies that at a specific x, the y values are distributed around the “true” regression line (i.e., the
positive and negative errors around the true line are balanced).
$\alpha$ and $\beta$ are called regression coefficients.
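[Figure: the fitted line Ŷ = a + bx and the true regression line Y = α + βx, showing the residual $e_i$ and the random error $\epsilon_i$ for a data point]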
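$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
We are to minimize the value of SSE and hence determine the parameters a and b.
Differentiating SSE with respect to a and b and setting the derivatives to zero gives
$n a + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$
These two equations can be solved to determine the values of a and b, and it can be
calculated that
$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, $a = \bar{y} - b\bar{x}$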
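A minimal sketch of the least-squares estimates in Python is shown below. It uses the standard closed-form solution of the two normal equations, b = S_xy / S_xx and a = ȳ − b·x̄; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the fitted line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Hypothetical observations, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = fit_line(xs, ys)
```

For these values the estimates are approximately a = 0.09 and b = 1.97.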
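We have
$R^2 = 1 - \frac{SSE}{SST}$, where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Note:
If the fit is perfect, all residuals are zero and thus $R^2 = 1.0$ (very good fit)
If SSE is only slightly smaller than SST, then $R^2 \approx 0$ (very poor fit)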
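The two extreme cases can be reproduced with a short sketch. It assumes the usual definition R² = 1 − SSE/SST with SST the total sum of squares about the mean of y; the function name `r_squared` and both data sets are hypothetical.

```python
def r_squared(xs, ys, a, b):
    """Coefficient of determination for the fitted line y_hat = a + b*x."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Perfect fit: the points lie exactly on y = 1 + 2x, so every residual is zero.
perfect = r_squared([1, 2, 3], [3, 5, 7], a=1.0, b=2.0)

# Very poor fit: the least-squares line for these points is the horizontal
# line y = 3 (slope 0), so SSE equals SST.
poor = r_squared([1, 2, 3, 4], [2, 4, 4, 2], a=3.0, b=0.0)
```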
[Figure: two fitted lines, one with $R^2 \approx 1.0$ (very good fit) and one with $R^2 \approx 0$ (very poor fit)]
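$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r$
$\hat{Y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_r x_r$
Thus,
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_r x_{ri} + \epsilon_i$
and $y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_r x_{ri} + e_i$
where $\epsilon_i$ and $e_i$ are the random error and residual error, respectively, associated with the true
response $y_i$ and the fitted response $\hat{y}_i$.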
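Using the Least Squares Method to estimate $b_0, b_1, \ldots, b_r$, we minimize the expression
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \cdots - b_r x_{ri})^2$
Differentiating SSE with respect to each of $b_0, b_1, \ldots, b_r$ and equating to zero gives the normal equations:
$n b_0 + b_1 \sum x_{1i} + b_2 \sum x_{2i} + \cdots + b_r \sum x_{ri} = \sum y_i$
$b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 + b_2 \sum x_{1i} x_{2i} + \cdots + b_r \sum x_{1i} x_{ri} = \sum x_{1i} y_i$
… … … … … …
$b_0 \sum x_{ri} + b_1 \sum x_{ri} x_{1i} + b_2 \sum x_{ri} x_{2i} + \cdots + b_r \sum x_{ri}^2 = \sum x_{ri} y_i$
This system of linear equations can be solved for $b_0, b_1, \ldots, b_r$ by any appropriate method for solving a
system of linear equations.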
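As an illustration, the sketch below builds the least-squares normal equations for a multiple linear regression and solves them by Gauss-Jordan elimination, which is one possible "appropriate method". The helper names `solve` and `fit_multiple` are hypothetical, and the data are made up so that they follow y = 1 + 2·x1 + 3·x2 exactly.

```python
def solve(A, rhs):
    """Solve A*b = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_multiple(X, y):
    """X holds one list of regressor values per observation; returns [b0, b1, ...]."""
    Z = [[1.0] + list(row) for row in X]  # leading 1 for the intercept b0
    p = len(Z[0])
    # Normal equations: (sum of z_j * z_k) * b = (sum of z_j * y).
    A = [[sum(z[j] * z[k] for z in Z) for k in range(p)] for j in range(p)]
    rhs = [sum(z[j] * yi for z, yi in zip(Z, y)) for j in range(p)]
    return solve(A, rhs)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
b0, b1, b2 = fit_multiple(X, y)
```

Because the data contain no noise, the estimates recover the generating coefficients up to floating-point rounding.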
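$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_r x^r$
$\hat{Y} = b_0 + b_1 x + b_2 x^2 + \cdots + b_r x^r$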
Note: The number of observations, n, must be at least as large as r+1, the number of
parameters to be estimated.
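The polynomial model can be transformed into a general linear regression model by setting $x_1 = x$, $x_2 = x^2$, …,
$x_r = x^r$. Thus, the equation assumes the form:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r + \epsilon$
$\hat{Y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_r x_r$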
This model then can be solved using the procedure followed for multiple linear regression
model.
Time series data are data collected on the same observational unit at multiple
time periods
Aggregate consumption and GDP for a country (for example, 20 years of quarterly
observations = 80 observations)
Yen/$, pound/$ and Euro/$ exchange rates (daily data for 1 year = 365
observations)
If the interest rate is increased now, what will be the effect on the rates of
inflation and unemployment in 3 months? In 12 months?
What is the effect over time on the consumption of electronic goods of a hike in the excise duty?
Rates of inflation and unemployment in the country can be observed only over time!
How to estimate?
Forecasting model
Data set [Y1, Y2, … YT-1, YT]: T observations on the time series random variable Y
Assumptions
We consider only consecutive, evenly spaced observations
For example, monthly, 2000-2015, no missing months
A time series Yt is stationary if its probability distribution does not change over
time, that is, if the joint distribution of (Yi+1, Yi+2, …, Yi+T) does not depend on i.
The stationarity property implies that history is relevant. In other words, stationarity requires the future
to be like the past (in a probabilistic sense).
Difference: The first difference of a series $Y_t$ is its change between periods t-1 and t,
that is, $y_t = Y_t - Y_{t-1}$
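Percentage: The percentage change of a series between periods t-1 and t is $100 \times \frac{Y_t - Y_{t-1}}{Y_{t-1}}$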
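Both transformations are easy to sketch in Python. The percentage change below is computed as 100 × (Y_t − Y_{t−1}) / Y_{t−1}, which is an assumption about the intended definition; the function names and the series are hypothetical.

```python
def first_difference(series):
    """y_t = Y_t - Y_{t-1} for t = 2, ..., T."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def percentage_change(series):
    """100 * (Y_t - Y_{t-1}) / Y_{t-1} for t = 2, ..., T (assumed definition)."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(series, series[1:])]

# Hypothetical evenly spaced observations.
Y = [100.0, 110.0, 121.0, 110.0]
dY = first_difference(Y)
pct = percentage_change(Y)
```

Each output list has one element fewer than the input, because the first observation has no predecessor.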
The correlation of a series with its own lagged values is called autocorrelation
(also called serial correlation)
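A small sketch of a sample autocorrelation at a given lag follows. It uses one common estimator (the lagged cross-products of deviations from the overall mean, divided by the total sum of squared deviations); other normalisations exist, and the function name and data are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with its own values `lag` periods back."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    den = sum((y - mean) ** 2 for y in series)
    return num / den

# Hypothetical series with a steady upward trend.
series = [1, 2, 3, 4, 5]
rho_0 = autocorrelation(series, 0)  # a series is perfectly correlated with itself
rho_1 = autocorrelation(series, 1)
```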
For example, AR(1) is $Y_t = \beta_0 + \beta_1 Y_{t-1} + \epsilon_t$.
The task in AR analysis is to derive the "best" values for $\beta_i$, i = 0, 1, …, p, given
a time series $Y_t$.
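For the first-order case, one way to obtain such values is ordinary least squares on the pairs (Y_{t-1}, Y_t), treating the lagged value as the regressor in a model of the form Y_t = β0 + β1·Y_{t-1} + ε_t. The sketch below assumes that approach; the function name `fit_ar1` is hypothetical, and the series is generated without noise from Y_t = 2 + 0.5·Y_{t-1}.

```python
def fit_ar1(series):
    """Least-squares estimates (b0, b1) for Y_t = b0 + b1 * Y_{t-1}."""
    x = series[:-1]  # lagged values Y_{t-1}
    y = series[1:]   # current values Y_t
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))
    sxx = sum((u - x_bar) ** 2 for u in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical noise-free series: Y_t = 2 + 0.5 * Y_{t-1}, starting at 0.
Y = [0.0]
for _ in range(6):
    Y.append(2 + 0.5 * Y[-1])

b0, b1 = fit_ar1(Y)
```

With real, noisy data the estimates would only approximate the underlying coefficients.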
2. For a given degree of freedom, if α, the value of the confidence level, increases, then
the t-value increases. Is the statement correct? If not, what is the correct
statement? Justify your answer. You can refer to the following figure in your
explanation.
4. Can χ²-analysis be applied to ordinal data or numeric data? Justify your answer.