
UNIT 4 CORRELATION ANALYSIS

Objectives
After going through this unit, you will be able to:
 Illustrate correlation analysis
 Explain misconceptions with correlation
 Describe correlation terminologies
Structure
4.1 Introduction
4.2 Covariance and Correlation in Projects
4.3 Correlation Analysis using Scatter Plots
4.4 Karl Pearson’s Coefficient of Correlation
4.5 Spearman’s Rank Correlation Coefficient
4.6 Keywords
4.7 Summary

4.1 INTRODUCTION

In general, correlation exists when two variables have a linear relationship beyond what is expected
by chance alone. The most common measure of correlation is called the “Pearson Product-Moment
Correlation Coefficient”. It is computed from two quantities: the covariance between the two variables, cov(x, y), and the standard deviation of each, σx and σy. This measure can
range from -1 to 1, inclusive. A value of -1 represents a “perfect negative correlation”, while a value
of 1 represents a “perfect positive correlation”. The closer a correlation measure is to these extremes,
the “stronger” the correlation between the two variables. A value of zero means that no correlation
is observed. It is important to note that a correlation measure of zero does not necessarily mean that
there is no relationship between the two variables, just that there is no linear relationship present in
the data that is being analyzed. It is also sometimes difficult to judge whether a correlation measure
is “high” or “low”.

There are certain situations where a correlation measure of 0.3, for example, may be considered
negligible. In other circumstances, such as in the social sciences, a 0.3 correlation measure may
suggest that further examination is needed. As with all data analysis, the context of the data must be
understood in order to evaluate any results.
4.2 COVARIANCE AND CORRELATION IN PROJECTS

It often arises in the course of executing projects that one or more random variables, or events, appear
to bear on the same project problem. For instance, fixed costs that accumulate period by period and
the overall project schedule duration are two random variables with obvious dependencies. Two
statistical terms come into play when two or more variables are in the same project space: covariance
and correlation.
Covariance
Covariance is a measure of how much one random variable depends on another. Typically, we
think in terms of “if X gets larger, does Y also get larger or does Y get smaller?” The covariance will
be negative for the latter and positive for the former. The value of the covariance is not particularly
meaningful since it will be large or small depending on whether X and Y are large or small.
Covariance is defined simply as:

Cov( X , Y ) = E( X * Y ) - E( X ) * E( Y )

If X and Y are independent, then E( X * Y ) = E( X ) * E( Y ), and COV( X , Y ) = 0.
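The definition above can be sketched in a few lines of Python. The data values here are hypothetical, chosen only to illustrate a positive dependency such as cost growing with schedule:

```python
# Hypothetical sample data: project fixed cost (X) and schedule duration (Y).
X = [10.0, 12.0, 14.0, 16.0, 18.0]
Y = [3.0, 4.0, 4.5, 6.0, 7.0]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Covariance via the definition: Cov(X, Y) = E(X * Y) - E(X) * E(Y)
e_xy = sum(x * y for x, y in zip(X, Y)) / n
cov_xy = e_xy - mean_x * mean_y

print(cov_xy)  # positive: Y tends to grow with X
```

A positive result here says only that Y tends to grow with X; as noted above, the magnitude by itself is not particularly meaningful.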


If the covariance of two random variables is not 0, then the variance of the sum of X and Y
becomes:
VAR( X + Y ) = VAR( X ) + VAR( Y ) + 2 * COV( X , Y )
The covariance of a sum becomes a governing equation for the project management problem of
shared resources, particularly people. If the random variable X describes the availability need for
a resource and Y for another resource, then the total variance of the availability need of the
combined resources is given by the equation above.

Correlation
Covariance does not directly measure the strength of the “sensitivity” of X on Y; judging the
strength is the job of correlation. Sensitivity will tell us how much the cost changes if the schedule
is extended a month or compressed a month. In other words, sensitivity is always a ratio, also
called a density, as in this example: $cost change/month change. But if cost and time are random
variables, what does the ratio of any single outcome among all the possible outcomes forecast for
the future? Correlation is a statistical estimate of the effects of sensitivity, measured on a scale of
-1 to +1.
The Greek letter rho (ρ), used on populations of data, and “r”, used with samples of data, stand for the correlation between two random variables: r( X , Y ). The usual way of referring to “r” or “ρ” is as the “correlation coefficient.” As such, their values can range from -1 to +1. A value of 0 means no correlation, whereas -1 means perfectly correlated but moving in opposite directions, and +1 means perfectly correlated and moving in the same direction.

The correlation function is defined as the covariance normalized by the product of the standard
deviations:
r( X , Y ) = COV( X , Y )/( σ X * σ Y )
We can now rewrite the variance equation:
VAR( X + Y ) = VAR( X ) + VAR( Y ) + 2 * ρ * σ X * σ Y
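The identity VAR( X + Y ) = VAR( X ) + VAR( Y ) + 2 * ρ * σ X * σ Y can be checked numerically. This is a minimal sketch with hypothetical data values:

```python
import math

# Hypothetical paired samples
X = [2.0, 4.0, 6.0, 8.0]
Y = [1.0, 3.0, 2.0, 6.0]

def mean(v):
    return sum(v) / len(v)

def var(v):
    # Population variance: average squared deviation from the mean
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Covariance and the correlation coefficient rho = COV / (sigma_x * sigma_y)
cov = mean([x * y for x, y in zip(X, Y)]) - mean(X) * mean(Y)
rho = cov / (math.sqrt(var(X)) * math.sqrt(var(Y)))

lhs = var([x + y for x, y in zip(X, Y)])
rhs = var(X) + var(Y) + 2 * rho * math.sqrt(var(X)) * math.sqrt(var(Y))
print(abs(lhs - rhs) < 1e-9)  # True: the identity holds
```

Because 2 * ρ * σX * σY collapses back to 2 * COV( X , Y ), the two forms of the variance equation agree exactly.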

4.3 CORRELATION ANALYSIS USING SCATTER PLOTS

In statistics, correlation (often measured as a correlation coefficient) indicates the strength and
direction of a linear relationship between two random variables. In general statistical usage,
correlation or co-relation refers to the departure of two variables from independence. In this broad
sense there are several coefficients, measuring the degree of correlation, adapted to the nature of
data. A number of different coefficients are used for different situations. The best known is the
Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the
two variables by the product of their standard deviations. Despite its name, it was first introduced by
Francis Galton. Several authors have offered guidelines for the interpretation of a correlation
coefficient. Cohen (1988), for example, has suggested the following interpretations for correlations in
psychological research, in the table below.

Correlation   Negative        Positive
Small         −0.3 to −0.1    0.1 to 0.3
Medium        −0.5 to −0.3    0.3 to 0.5
Large         −1.0 to −0.5    0.5 to 1.0

As Cohen himself has observed, however, all such criteria are in some ways arbitrary and should not
be observed too strictly. This is because the interpretation of a correlation coefficient depends on the
context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-
quality instruments, but may be regarded as very high in the social sciences where there may be a
greater contribution from complicating factors.
Along this vein, it is important to remember that “large” and “small” should not be taken as synonyms
for “good” and “bad” in terms of determining that a correlation is of a certain size. For example, a
correlation of 1.0 or −1.0 indicates that the two variables analyzed are equivalent modulo scaling.
Scientifically, this more frequently indicates a trivial result than an earth-shattering one. For example,
consider discovering a correlation of 1.0 between how many feet tall a group of people are and the
number of inches from the bottom of their feet to the top of their heads.

A scatter plot is a graph that represents bivariate data as points on a two-dimensional Cartesian plane. The following set of data values were observed for the height h (in cm) and weight w (in kg) of nine Year 10 students.

Plot a scatter plot for this set of data.


Solution:

The scatter plot is obtained by plotting w against h, as shown above.


We use the scatter plot to look for patterns that might indicate that the variables are related.
Then, if the variables are related, we can visualise what kind of line (or curve), or equation,
describes the relationship. Association (or relationship) between two variables will be described
as strong, weak or none; and the direction of the association may be positive, negative or none.
In the previous example, w increases as h increases. We say that a strong positive association
exists between the variables h and w.
Consider the following scatter plot:
It is clear from the scatter plot that y decreases as x increases. We say that a strong negative
association exists between the variables x and y.
Consider the following scatter plot:

We observe that y increases as x increases, and the points do not lie on a straight line. We say that
a weak positive association exists between the variables x and y.
Consider the following scatter plot:

We observe that y decreases as x increases, and the points do not lie on a straight line. We say
that a weak negative association exists between the variables x and y.
Consider the following scatter plot:
It is clear from the scatter plot that as x increases, there is no apparent effect on y. In such a case, we say that no association exists between the variables x and y.
Consider the following scatter plot:

If a data value does not fit the trend of the data, then it is said to be an outlier. In the above
scatter plot, it is easy to identify the outliers. There are two outliers in the set of data values.
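The effect of outliers on a measured association can be demonstrated numerically with the Pearson correlation coefficient (treated formally in the next section). This sketch uses hypothetical points: nine on a perfect upward trend plus two outliers:

```python
import math

def pearson_r(xs, ys):
    # Pearson r: covariance normalized by the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Nine points on a straight upward trend, then two outliers.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 9]
y = [2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 1]

r_trend = pearson_r(x[:9], y[:9])
r_all = pearson_r(x, y)
print(r_trend)          # ≈ 1.0: the trend points lie on a straight line
print(r_all < r_trend)  # True: the outliers weaken the measured association
```

Points that do not fit the trend pull the coefficient away from ±1, which is why outliers are worth identifying on the scatter plot before interpreting r.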

4.4 KARL PEARSON’S COEFFICIENT OF CORRELATION

Sum of Squares

We introduced a notation earlier in the course called the sum of squares. This notation was the SS notation, and it will make these formulas much easier to work with:

SS(x) = Σ( x − x̄ )²
SS(y) = Σ( y − ȳ )²
SS(xy) = Σ( x − x̄ )( y − ȳ )

Notice these are all the same pattern; SS(x), for example, could be written as

SS(x) = Σ( x − x̄ )( x − x̄ )

Also, note that each sum of squares has a computational shortcut:

SS(x) = Σx² − ( Σx )²/n
SS(y) = Σy² − ( Σy )²/n
SS(xy) = Σxy − ( Σx )( Σy )/n
Pearson's Correlation Coefficient


Pearson's correlation coefficient is a measure of linear correlation. The population parameter is denoted by the Greek letter rho (ρ) and the sample statistic is denoted by the Roman letter r. Here are some properties of r:
 r only measures the strength of a linear relationship. There are other kinds of relationships
besides linear.
 r is always between -1 and 1 inclusive. -1 means perfect negative linear correlation and +1
means perfect positive linear correlation
 r has the same sign as the slope of the regression (best fit) line
 r does not change if the independent (x) and dependent (y) variables are interchanged
 r does not change if the scale on either variable is changed. You may add or subtract a value to/from all the x-values or y-values, or multiply or divide them by a positive value, without changing the value of r.
 the test statistic computed from r has a Student's t distribution

Here is the definitional formula for r. Don't worry about it, we won't be finding it this way. This formula can be simplified through some simple algebra and then some substitutions using the SS notation discussed earlier:

r = [ n Σxy − ( Σx )( Σy ) ] / ( √[ n Σx² − ( Σx )² ] * √[ n Σy² − ( Σy )² ] )

If you divide the numerator and denominator by n, then you get something which is starting to hopefully look familiar. Each of these values has been seen before in the Sum of Squares notation section. So, the linear correlation coefficient can be written in terms of sums of squares:

r = SS(xy) / √[ SS(x) * SS(y) ]
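The sum-of-squares form can be sketched directly in Python. The paired data here are hypothetical, chosen only to keep the arithmetic easy to follow:

```python
import math

# Hypothetical paired data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Computational shortcuts: SS(x) = Σx² − (Σx)²/n, and likewise for y and xy
ss_x = sum(v ** 2 for v in x) - sum(x) ** 2 / n
ss_y = sum(v ** 2 for v in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# r = SS(xy) / sqrt(SS(x) * SS(y))
r = ss_xy / math.sqrt(ss_x * ss_y)
print(round(r, 4))  # 0.7746
```

Each SS term needs only running totals of x, y, x², y², and xy, which is why this form is preferred for hand or spreadsheet calculation.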
Application of Correlation in Hypothesis Testing
The claim we will be testing is “There is significant linear correlation”.
The population counterpart of the sample statistic r is the Greek letter rho (ρ), so the parameter used for linear correlation is rho:
H0: rho = 0
H1: rho ≠ 0
r has a t distribution with n − 2 degrees of freedom, and the test statistic is given by:

t = r * √[ ( n − 2 ) / ( 1 − r² ) ]

Note that there are n − 2 degrees of freedom this time, a difference from before. As an over-simplification, you subtract one degree of freedom for each variable, and since there are two variables, the degrees of freedom are n − 2.
At first glance this doesn't look like our usual test-statistic pattern of (observed statistic − hypothesized parameter) / standard error. But if you take the standard error for r to be

sr = √[ ( 1 − r² ) / ( n − 2 ) ]

then the test statistic becomes t = ( r − 0 ) / sr, which does look like the pattern we're looking for.

Remember that hypothesis testing is always done under the assumption that the null hypothesis is true.
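As a sketch, the test statistic can be computed from a hypothetical sample value of r (here r = 0.7746 from n = 5 pairs; both numbers are illustrative):

```python
import math

# Hypothetical sample correlation from n = 5 data pairs
r = 0.7746
n = 5
df = n - 2  # degrees of freedom: one lost per variable

# Test statistic: t = r * sqrt((n - 2) / (1 - r^2))
t = r * math.sqrt(df / (1 - r ** 2))
print(round(t, 3))  # ≈ 2.121, compared against the t distribution with df = 3
```

The resulting t value would then be compared with the critical value from a Student's t table at the chosen significance level with n − 2 degrees of freedom.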

4.5 SPEARMAN’S RANK CORRELATION COEFFICIENT

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles
Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of
correlation – that is, it assesses how well an arbitrary monotonic function could describe the
relationship between two variables, without making any other assumptions about the particular
nature of the relationship between the variables. Certain other measures of correlation are parametric
in the sense of being based on possible relationships of a parameterised form, such as a linear
relationship.

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which two sets of
data Xi and Yi are converted to rankings xi and yi before calculating the coefficient. In practice,
however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks,
and the differences, di between the ranks of each observation on the two variables are calculated.

If there are no tied ranks, then ρ is given by:

ρ = 1 − [ 6 Σdi² / ( n( n² − 1 ) ) ]

where

di = xi − yi = the difference between the ranks of corresponding values Xi and Yi, and

n = the number of values in each data set (same for both sets).

If tied ranks exist, the classic Pearson correlation coefficient computed on the ranks has to be used instead of this formula.
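A minimal sketch of the rank-based calculation, assuming no tied ranks (the data are hypothetical and chosen to show a monotonic but non-linear relationship):

```python
def spearman_rho(xs, ys):
    """Spearman's rho via the no-ties formula: 1 - 6*Σd² / (n(n² - 1))."""
    def ranks(v):
        # Rank 1 for the smallest value, rank n for the largest (no ties assumed)
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relationship still scores a perfect 1.0.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x³
print(spearman_rho(x, y))  # 1.0
```

Because only the ranks enter the formula, any strictly increasing transformation of the data leaves ρ unchanged, which is exactly the non-parametric property described above.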

4.6 KEYWORDS


 Scatter Plot – Graph representing bivariate data as points on a two-dimensional Cartesian
plane.
 Pearson's correlation coefficient – Measure of linear correlation obtained by dividing the covariance of two variables by the product of their standard deviations.
 Spearman's rank correlation coefficient – A non-parametric measure of correlation.

4.7 SUMMARY
Correlational studies involve comparing two variables (e.g., income and crime, smoking habits and health) in order to see if there might be some connection and perhaps even a suggestion of cause. As a cigarette smoking habit rises, do health problems also rise? As income decreases, does the frequency of crime increase? As people grow older, do they become less or more tolerant of others?
Correlation is an extremely important analytical tool which enables us to begin to sort out claims about
important connections, which may or may not be true: the amount of smoking and the incidence of
lung cancer, HIV infection and the onset of AIDS, the age of a car and its value, television programming
of playoff games and attendance at lectures, poverty and crime, IQ tests and income levels,
intelligence and heredity, age and mechanical skills, and so on. People make claims about such matters
all the time. The principle of correlation enables us to investigate such claims in order to understand
whether they are true or not and, if true, just what the strength of that relationship might be.
