4. Correlation and Regression Analysis

Correlation & Regression Analysis

The primary objective of correlation analysis is to measure the strength or degree of relationship between two or more variables; for example, the amount of fertilizer used and rice production, or the height and weight of a group of people.

If an increase in one variable corresponds to an increase in the other, the correlation is said to be positive. If an increase in one variable corresponds to a decrease in the other, the correlation is said to be negative. If two variables vary in such a way that their ratio is always constant, the correlation is said to be perfect.

Types of Correlation:
i. Positive or negative
ii. Simple or multiple
iii. Linear or non-linear

Methods of Estimating Correlation:

The correlation coefficient (r) provides a quantitative measure of the direction and strength of the relationship between two or more variables.

The following are the important methods of ascertaining simple linear correlation:

i. Scatter Diagram Method


ii. Karl Pearson’s Coefficient of Correlation

Scatter Diagram Method

[Scatter diagrams: perfect positive correlation, perfect negative correlation, high degree of positive correlation, high degree of negative correlation, no correlation (two patterns)]

Karl Pearson’s Coefficient of Correlation

If X and Y are two variables under study, then the degree of relationship is measured by

r = Σ(X − X̄)(Y − Ȳ) / (√Σ(X − X̄)² · √Σ(Y − Ȳ)²)

where X̄ and Ȳ are the respective means of X and Y.

The above formula can be written as

r = Σxy / (√Σx² · √Σy²)

where x = (X − X̄) and y = (Y − Ȳ).

Interpretation of r:
The values of the correlation coefficient lie between −1 and +1.

A value of r = +1 indicates that X and Y are perfectly related in a positive linear sense.
A value of r = −1 indicates that X and Y are perfectly related in a negative linear sense.
Values of r close to +1 indicate a strong linear relationship with positive slope; values of r close to −1 indicate a strong linear relationship with negative slope.
Values of r close to 0 from the positive side indicate a weak linear relationship with positive slope; values of r close to 0 from the negative side indicate a weak linear relationship with negative slope.
A value of r = 0 does not mean that X and Y are unrelated; it only means there is no linear relationship (the variables may still be related non-linearly).

[Scatter diagrams illustrating r = +1, r = −1, r close to +1, r close to −1, and r = 0, including a curved pattern where the variables are related but r = 0]
Example: Find Karl Pearson’s correlation coefficient between the sales and expenses from
the data given below and interpret its value:

Firm:              1   2   3   4   5   6   7   8   9  10
Sales (Lakhs):    50  50  55  60  65  65  65  60  60  50
Expenses (Lakhs): 11  13  14  16  16  15  15  14  13  13
Solution:
X (Sales)  x = X − X̄  x²   Y (Expenses)  y = Y − Ȳ  y²   xy
50          −8         64   11            −3         9    +24
50          −8         64   13            −1         1    +8
55          −3          9   14             0         0     0
60          +2          4   16            +2         4    +4
65          +7         49   16            +2         4    +14
65          +7         49   15            +1         1    +7
65          +7         49   15            +1         1    +7
60          +2          4   14             0         0     0
60          +2          4   13            −1         1    −2
50          −8         64   13            −1         1    +8
ΣX = 580   Σx = 0   Σx² = 360   ΣY = 140   Σy = 0   Σy² = 22   Σxy = 70

X̄ = ΣX/n = 580/10 = 58,  Ȳ = ΣY/n = 140/10 = 14

r = Σxy / (√Σx² · √Σy²) = 70 / (√360 × √22) = 70/88.99 ≈ +0.79

Since r is close to +1, there is a strong linear relationship with positive slope. Hence, there is a high degree of positive correlation between the two variables, i.e., as sales go up, expenses also go up.
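The computation above can be cross-checked with a short script. A minimal sketch in Python (the helper name `pearson_r` is ours, not part of the notes):

```python
from math import sqrt

def pearson_r(xs, ys):
    # r = sum(xy) / (sqrt(sum x^2) * sqrt(sum y^2)),
    # where x and y are deviations from the respective means
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
r = pearson_r(sales, expenses)
print(round(r, 3))  # 70 / (sqrt(360) * sqrt(22)) ≈ 0.787
```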

Example: Find Karl Pearson’s correlation coefficient between the sales and expenses from
the data given below and interpret its value:

Advertising Expenses (Lakhs): 10 12 15 23 20
Sales (Lakhs):                14 17 23 25 21
ANS: r = +0.865

Example: Show that the coefficient of correlation lies between -1 and +1.

Ans:

r = Σ(X − X̄)(Y − Ȳ) / (√Σ(X − X̄)² · √Σ(Y − Ȳ)²)

Let aᵢ = (Xᵢ − X̄) / √Σ(X − X̄)²  and  bᵢ = (Yᵢ − Ȳ) / √Σ(Y − Ȳ)²,

so that Σaᵢ² = 1, Σbᵢ² = 1 and r = Σaᵢbᵢ.

Now

Σ(aᵢ + bᵢ)² ≥ 0  ⟹  Σaᵢ² + Σbᵢ² + 2Σaᵢbᵢ ≥ 0  ⟹  2 + 2r ≥ 0,

i.e. r ≥ −1.

Similarly, using Σ(aᵢ − bᵢ)² ≥ 0, we get r ≤ +1.

Which concludes −1 ≤ r ≤ +1.
Spearman’s Rank Correlation Coefficient


The rank correlation method is applied when rank-order data are available or when each variable can be ranked in some order. The measure based on this method is known as the rank correlation coefficient. It is applied in situations in which exact numerical measurements are not available; for instance, the sincerity or honesty of each employee.

Spearman’s rank correlation coefficient is denoted by r_s and defined as

r_s = 1 − 6ΣD² / (N(N² − 1)) = 1 − 6ΣD² / (N³ − N)

where D refers to the difference of ranks between paired items in the two series and N is the number of pairs.

[Proof: see Nurul Islam Page 272]

Example: Two managers are asked to rank a group of employees in order of potential for eventually becoming top managers. The rankings are as follows:

Employee A B C D E F G H I J
Ranking by Manager 1 10 2 1 4 3 6 5 8 7 9
Ranking by Manager 2 9 4 2 3 1 5 6 8 7 10
Compute the coefficient of rank correlation and comment on the value.
Solution:
Employee  Rank by Manager 1 (R₁)  Rank by Manager 2 (R₂)  D² = (R₁ − R₂)²
A 10 9 1
B 2 4 4
C 1 2 1
D 4 3 1
E 3 1 4
F 6 5 1
G 5 6 1
H 8 8 0
I 7 7 0
J 9 10 1
N = 10, ΣD² = 14

r_s = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 14)/990 = 1 − 0.085 ≈ +0.92

Thus we find that there is a high degree of positive correlation in the ranks assigned by the
two managers.
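The rank computation can be verified directly from the two ranking lists; a small sketch (the variable names are ours):

```python
# Rankings of the ten employees by the two managers (from the table above)
manager1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]

n = len(manager1)
d2_sum = sum((r1 - r2) ** 2 for r1, r2 in zip(manager1, manager2))  # sum of D^2
rs = 1 - 6 * d2_sum / (n ** 3 - n)  # Spearman's formula
print(d2_sum, round(rs, 3))  # 14 and 1 - 84/990 ≈ 0.915
```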

Example: Compute the rank correlation coefficient for the following data of two tests given to candidates for a critical job and comment on the value.

Preliminary test 92 89 87 86 83 77 71 63 53 50
Final test 86 83 91 77 68 85 52 82 37 57

Solution:
Preliminary test  Rank R₁  Final test  Rank R₂  D² = (R₁ − R₂)²
92 10 86 9 1
89 9 83 7 4
87 8 91 10 4
86 7 77 5 4
83 6 68 4 4
77 5 85 8 9
71 4 52 2 4
63 3 82 6 9
53 2 37 1 1
50 1 57 3 4
N = 10, ΣD² = 44

r_s = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 44)/990 = 1 − 0.267 ≈ +0.73

Thus we find that there is a high degree of positive correlation between the preliminary test and the final test scores.
Tie in Ranks:
An adjustment of the above formula is made when ranks are tied:

r_s = 1 − 6{ΣD² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12 + …} / (N³ − N)

where each mᵢ is the number of times a particular value is repeated.

Example: An examination of eight applicants for a post was conducted by a firm. From the marks obtained by the applicants in the Bangla and English papers, compute the rank correlation coefficient:

Applicant A B C D E F G H
Marks in Bangla 15 20 28 12 40 60 20 80
Marks in English 40 30 50 30 20 10 30 60

Solution:

Marks in Bangla  Rank R₁  Marks in English  Rank R₂  D² = (R₁ − R₂)²
15 2 40 6 16
20 3.5 30 4 0.25
28 5 50 7 4
12 1 30 4 9
40 6 20 2 16
60 7 10 1 36
20 3.5 30 4 0.25
80 8 60 8 0
N = 8, ΣD² = 81.5

The mark 20 is repeated 2 times in Bangla, hence m₁ = 2, and the mark 30 is repeated 3 times in English, hence m₂ = 3.

r_s = 1 − 6{ΣD² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12} / (N³ − N)
    = 1 − 6{81.5 + (8 − 2)/12 + (27 − 3)/12} / (512 − 8)
    = 1 − 6{81.5 + 0.5 + 2} / 504
    = 1 − 504/504 = 0

Hence there is no rank correlation between the marks in Bangla and English.
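The tie-corrected formula can be sketched in Python as follows (the function name and its arguments are our own choices, not from the notes):

```python
def spearman_tied(d2_sum, n, tie_sizes):
    # r_s = 1 - 6 * {sum D^2 + sum (m^3 - m)/12} / (N^3 - N),
    # where each m is the number of times a value is repeated
    correction = sum((m ** 3 - m) / 12 for m in tie_sizes)
    return 1 - 6 * (d2_sum + correction) / (n ** 3 - n)

# Mark 20 repeated twice (m=2), mark 30 repeated three times (m=3)
rs = spearman_tied(81.5, 8, [2, 3])
print(rs)  # 1 - 6*(81.5 + 0.5 + 2)/504 = 0.0
```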

Regression Analysis
The regression analysis is a technique of studying the dependence of one variable (called
dependent variable), on one or more variables (called independent variables), with a view to
estimating or predicting the average value of the dependent variable in terms of the known or
fixed values of the independent variables.

Example: Variable 1: Distance to transmitter: X


Variable 2: Wireless signal strength: Y

Let’s assume a linear relationship between Y and X is reasonable, say Y ≈ a + bX.

 If the relationship between Y and X is exact, it is called a deterministic relationship.

 If the relationship between Y and X is not exact, it is called a non-deterministic relationship. In real-life applications there are many sources of randomness; randomness means the same value of X does not always give the same value of Y.

[Scatter plots: a deterministic relationship vs. a non-deterministic relationship]

Primary Objectives of Regression Analysis:


i. To estimate the relationship that exists, on the average, between the dependent variable and the independent variables.
ii. To determine the effect of each independent variable on the dependent variable, controlling for the effects of the other independent variables.
iii. To predict the value of the dependent variable for a given value of the explanatory
variables.

Simple Linear Regression Model

The simplest form of the regression model that displays the relation between X and Y is a straight line, which appears as follows:

ŷ = a + bX

where ŷ denotes the predicted value of Y, a is the intercept and b is the slope of the straight line. In regression terminology, b is the regression coefficient of Y on X. This straight line is called the fitted line of Y.

In practice, the observed value of Y almost always deviates from this expectation. If this discrepancy is denoted by a quantity e, then

Y = a + bX + e

The Least-Squares Method:
The least-squares method is a technique for minimizing the sum of the squares of the differences between the observed and estimated values of the dependent variable. That is, the least-squares line is the line that minimizes

SSE = Σe² = Σ(Y − ŷ)² = Σ(Y − a − bX)²

Here ŷ = a + bX, and Σe² is called the sum of squares of errors (SSE).

To minimize SSE with respect to a and b, from calculus we know that the partial derivatives of SSE with respect to a and b must be 0. Then

∂SSE/∂a = −2Σ(Y − a − bX) = 0
∂SSE/∂b = −2ΣX(Y − a − bX) = 0

which concludes

ΣY = na + bΣX  and  ΣXY = aΣX + bΣX²

From the above equations we get

b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)  and  a = Ȳ − bX̄

Example: The following table show distance to transmitter (X) and corresponding wireless
signal strength (Y).

Distance to transmitter (m):    13    1   17   19   14   15   15    8   13    3
Wireless signal strength (dB): 34.4 38.4 30.4 29.7 30.1 33.9 32.8 35.2 34.9 36.8
i. Find the regression line of Y on X.
ii. Predict what the signal strength would be if the distance was 10 meters.

Solution:

X    Y      X²    XY
13   34.4   169   447.2
1    38.4     1    38.4
17   30.4   289   516.8
19   29.7   361   564.3
14   30.1   196   421.4
15   33.9   225   508.5
15   32.8   225   492.0
8    35.2    64   281.6
13   34.9   169   453.7
3    36.8     9   110.4
ΣX = 118   ΣY = 336.6   ΣX² = 1708   ΣXY = 3834.3

i.
b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²) = (10 × 3834.3 − 118 × 336.6) / (10 × 1708 − (118)²)
  = (38343 − 39718.8) / (17080 − 13924) = −1375.8/3156 ≈ −0.436

a = Ȳ − bX̄ = 33.66 − (−0.436)(11.8) = 33.66 + 5.14 = 38.80

The regression line of Y on X is

ŷ = 38.80 − 0.436X

ii. For X = 10, ŷ = 38.80 − 0.436 × 10 = 34.44 dB
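The fitted line can be reproduced with the closed-form least-squares formulas; a minimal sketch (the helper name `fit_line` is ours):

```python
def fit_line(xs, ys):
    # b = (n*ΣXY - ΣX*ΣY) / (n*ΣX² - (ΣX)²),  a = Ȳ - b*X̄
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

distance = [13, 1, 17, 19, 14, 15, 15, 8, 13, 3]
signal = [34.4, 38.4, 30.4, 29.7, 30.1, 33.9, 32.8, 35.2, 34.9, 36.8]
a, b = fit_line(distance, signal)
print(round(a, 2), round(b, 3))  # intercept ≈ 38.8, slope ≈ -0.436
print(round(a + b * 10, 2))      # predicted strength at 10 m ≈ 34.44
```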

In the case when X is assumed to be the independent variable and Y the dependent variable, the regression is said to be the regression of Y on X, and the estimated regression line is ŷ = a + bX. When X acts as the dependent variable and Y as the independent variable, the regression is said to be the regression of X on Y, and the estimated regression line is x̂ = c + dY, with

d = (nΣXY − ΣXΣY) / (nΣY² − (ΣY)²)  and  c = X̄ − dȲ

Goodness of Fit in Regression and Coefficient of Determination:

(Y − Ȳ) is called the total deviation, and the corresponding Σ(Y − Ȳ)² is called the total sum of squares (SST); (ŷ − Ȳ) is called the explained deviation, and the corresponding Σ(ŷ − Ȳ)² is called the sum of squares for regression (SSR); (Y − ŷ) is called the unexplained deviation, and the corresponding Σ(Y − ŷ)² is called the sum of squares for error (SSE). The relation among SSE, SST and SSR is

SST = SSE + SSR

Symbolically,

Σ(Y − Ȳ)² = Σ(Y − ŷ)² + Σ(ŷ − Ȳ)²

In a deterministic relationship SSE = 0, i.e. for a perfectly fitting estimated line SST = SSR, and hence SSR/SST = 1. In the worst case SSR = 0, i.e. SSE = SST, and hence SSR/SST = 0.

So the ratio SSR/SST evaluates how good the estimated regression line is; values of this ratio closer to 1 imply a better-fitting estimated line. Thus the ratio SSR/SST is known as the coefficient of determination:

R² = SSR/SST = Σ(ŷ − Ȳ)² / Σ(Y − Ȳ)²

R² is a non-negative value and its limits are 0 ≤ R² ≤ 1. Verbally, R² measures the proportion of the total variation in the dependent variable explained by the regression model.
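Using the distance/signal data from the earlier example, the decomposition and R² can be checked numerically. A sketch under the same fitted line ŷ = a + bX (all names are ours):

```python
distance = [13, 1, 17, 19, 14, 15, 15, 8, 13, 3]
signal = [34.4, 38.4, 30.4, 29.7, 30.1, 33.9, 32.8, 35.2, 34.9, 36.8]

n = len(distance)
xbar, ybar = sum(distance) / n, sum(signal) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(distance, signal))
sxx = sum((x - xbar) ** 2 for x in distance)
b = sxy / sxx            # slope of the least-squares line
a = ybar - b * xbar      # intercept
yhat = [a + b * x for x in distance]

sst = sum((y - ybar) ** 2 for y in signal)               # total sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(signal, yhat))  # error sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)               # regression sum of squares
print(round(ssr / sst, 3))  # R² ≈ 0.783, and SST = SSE + SSR holds
```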

Example: The following table shows the hardness (X) and tensile strength (Y) of 5
samples of metal:
X 146 152 158 164 170
Y 75 78 77 89 82
Find the regression line of Y on X. Is this linear model adequate for the given data set? Justify your answer.

Difference between Regression and Correlation Analysis:


i. In regression analysis, there is an asymmetry in the way the dependent and independent variables are treated, whereas correlation analysis treats the two variables symmetrically.
ii. Regression analysis provides an overall measure of the extent to which the variation in one variable determines the variation in the other, and can be used for prediction.

Multiple Regression Model


If there is more than one independent variable in the regression model, it is called a multiple regression model. The fitted plane of Y is denoted as

ŷ = a + b₁X₁ + b₂X₂

Using the least-squares method we get the error sum of squares as

SSE = Σe² = Σ(Y − ŷ)² = Σ(Y − a − b₁X₁ − b₂X₂)²

Taking partial derivatives with respect to a, b₁ and b₂, and setting them to 0, we get the normal equations

ΣY = na + b₁ΣX₁ + b₂ΣX₂
ΣX₁Y = aΣX₁ + b₁ΣX₁² + b₂ΣX₁X₂
ΣX₂Y = aΣX₂ + b₁ΣX₁X₂ + b₂ΣX₂²

After simplifying the above equations we get

b₁ = [S(X₁Y) S(X₂X₂) − S(X₂Y) S(X₁X₂)] / [S(X₁X₁) S(X₂X₂) − {S(X₁X₂)}²]
b₂ = [S(X₂Y) S(X₁X₁) − S(X₁Y) S(X₁X₂)] / [S(X₁X₁) S(X₂X₂) − {S(X₁X₂)}²]
a = Ȳ − b₁X̄₁ − b₂X̄₂

Here S(XᵢXᵢ) = ΣXᵢ² − (ΣXᵢ)²/n, S(X₁X₂) = ΣX₁X₂ − (ΣX₁)(ΣX₂)/n and S(XᵢY) = ΣXᵢY − (ΣXᵢ)(ΣY)/n.

Example: A researcher is interested in predicting the average value of rice production (Y) in a field on the basis of two predictor variables: the average rainfall per day (X₁) and the average urea used per square foot (X₂). Data for 10 observations were recorded as in the table below:
Sl. no.:  1   2   3   4   5   6   7   8   9  10
X₁:      24   0  25   0   5  18  20   0  15   6
X₂:      53  47  50  52  40  44  46  45  56  40
Y:       11  22   7  26  22  15   9  23  15  24
Estimate the regression of Y on and .
Solution:
Sl. no.  X₁²    X₂²    X₁Y   X₁X₂   X₂Y
1        576   2809    264   1272    583
2          0   2209      0      0   1034
3        625   2500    175   1250    350
4          0   2704      0      0   1352
5         25   1600    110    200    880
6        324   1936    270    792    660
7        400   2116    180    920    414
8          0   2025      0      0   1035
9        225   3136    225    840    840
10        36   1600    144    240    960
ΣX₁² = 2211   ΣX₂² = 22635   ΣX₁Y = 1368   ΣX₁X₂ = 5514   ΣX₂Y = 8108

Also ΣX₁ = 113, ΣX₂ = 473, ΣY = 174 and n = 10, so X̄₁ = 11.3, X̄₂ = 47.3, Ȳ = 17.4.

S(X₁X₁) = ΣX₁² − (ΣX₁)²/n = 2211 − (113)²/10 = 934.1
S(X₂X₂) = ΣX₂² − (ΣX₂)²/n = 22635 − (473)²/10 = 262.1
S(X₁X₂) = ΣX₁X₂ − (ΣX₁)(ΣX₂)/n = 5514 − (113 × 473)/10 = 169.1
S(X₁Y) = ΣX₁Y − (ΣX₁)(ΣY)/n = 1368 − (113 × 174)/10 = −598.2
S(X₂Y) = ΣX₂Y − (ΣX₂)(ΣY)/n = 8108 − (473 × 174)/10 = −122.2

b₁ = [S(X₁Y) S(X₂X₂) − S(X₂Y) S(X₁X₂)] / [S(X₁X₁) S(X₂X₂) − {S(X₁X₂)}²]
   = [(−598.2)(262.1) − (−122.2)(169.1)] / [(934.1)(262.1) − (169.1)²]
   = −136124.2 / 216232.8 ≈ −0.630

b₂ = [S(X₂Y) S(X₁X₁) − S(X₁Y) S(X₁X₂)] / [S(X₁X₁) S(X₂X₂) − {S(X₁X₂)}²]
   = [(−122.2)(934.1) − (−598.2)(169.1)] / 216232.8
   = −12991.4 / 216232.8 ≈ −0.060

a = Ȳ − b₁X̄₁ − b₂X̄₂ = 17.4 + 0.630 × 11.3 + 0.060 × 47.3 ≈ 27.36

The required regression plane of Y is

ŷ = 27.36 − 0.630X₁ − 0.060X₂
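The normal equations above are exactly what a general least-squares solver handles; a sketch using NumPy to cross-check the hand computation (the array names are ours):

```python
import numpy as np

x1 = np.array([24, 0, 25, 0, 5, 18, 20, 0, 15, 6], dtype=float)
x2 = np.array([53, 47, 50, 52, 40, 44, 46, 45, 56, 40], dtype=float)
y = np.array([11, 22, 7, 26, 22, 15, 9, 23, 15, 24], dtype=float)

# Design matrix [1, X1, X2]; lstsq minimizes SSE, i.e. solves the normal equations
A = np.column_stack([np.ones_like(x1), x1, x2])
a, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
print(round(a, 2), round(b1, 3), round(b2, 3))  # ≈ 27.36, -0.63, -0.06
```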

Polynomial Regression Model

ŷ = a + b₁X + b₂X²

Treat X as X₁ and X² as X₂ in the multiple regression model. Using the least-squares method we get the normal equations

ΣY = na + b₁ΣX + b₂ΣX²
ΣXY = aΣX + b₁ΣX² + b₂ΣX³
ΣX²Y = aΣX² + b₁ΣX³ + b₂ΣX⁴

We get

b₁ = [S(XY) S(X²X²) − S(X²Y) S(XX²)] / [S(XX) S(X²X²) − {S(XX²)}²]
b₂ = [S(X²Y) S(XX) − S(XY) S(XX²)] / [S(XX) S(X²X²) − {S(XX²)}²]
a = Ȳ − b₁X̄ − b₂(ΣX²/n)

where S(XX) = ΣX² − (ΣX)²/n, S(XX²) = ΣX³ − (ΣX)(ΣX²)/n, S(X²X²) = ΣX⁴ − (ΣX²)²/n, S(XY) = ΣXY − (ΣX)(ΣY)/n and S(X²Y) = ΣX²Y − (ΣX²)(ΣY)/n.

Example: A test was made of different doses of nitrogen (X) on a rice field to observe rice production. The following data were recorded:

Nitrogen dose (X):    0   1   2   3   4   5
Rice production (Y): 15  25  40  55  52  43

Fit a second-degree polynomial to the data.
Solution:
X   Y    X²   X⁴    X³   XY   X²Y
0   15    0     0    0    0     0
1   25    1     1    1   25    25
2   40    4    16    8   80   160
3   55    9    81   27  165   495
4   52   16   256   64  208   832
5   43   25   625  125  215  1075
ΣX = 15   ΣY = 230   ΣX² = 55   ΣX⁴ = 979   ΣX³ = 225   ΣXY = 693   ΣX²Y = 2587

With n = 6, X̄ = 2.5 and Ȳ = 38.33:

S(XX) = 55 − (15)²/6 = 17.5;  S(X²X²) = 979 − (55)²/6 = 474.83;  S(XX²) = 225 − (15 × 55)/6 = 87.5;
S(XY) = 693 − (15 × 230)/6 = 118;  S(X²Y) = 2587 − (55 × 230)/6 = 478.67

b₁ = [118 × 474.83 − 478.67 × 87.5] / [17.5 × 474.83 − (87.5)²] = 14147.0/653.33 ≈ 21.65
b₂ = [478.67 × 17.5 − 118 × 87.5] / 653.33 = −1948.3/653.33 ≈ −2.98
a = Ȳ − b₁X̄ − b₂(ΣX²/n) = 38.33 − 21.65 × 2.5 + 2.98 × 9.17 ≈ 11.54

The estimated polynomial regression is

ŷ = 11.54 + 21.65X − 2.98X²
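The same quadratic can be recovered by treating X and X² as two regressors in a least-squares solver, exactly as the notes suggest; a NumPy sketch as a cross-check (the names are ours):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([15, 25, 40, 55, 52, 43], dtype=float)

# Columns [1, X, X^2]: the polynomial model written as a multiple regression
A = np.column_stack([np.ones_like(x), x, x ** 2])
a, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
print(round(a, 2), round(b1, 2), round(b2, 2))  # ≈ 11.54, 21.65, -2.98
```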