Unit 4
Objectives
After going through this unit, you will be able to:
Illustrate correlation analysis
Explain common misconceptions about correlation
Describe correlation terminology
Structure
4.1 Introduction
4.2 Covariance and Correlation in Projects
4.3 Correlation Analysis using Scatter Plots
4.4 Karl Pearson’s Coefficient of Correlation
4.5 Spearman’s Rank Correlation Coefficient
4.6 Keywords
4.7 Summary
4.1 INTRODUCTION
In general, correlation exists when two variables have a linear relationship beyond what is expected
by chance alone. The most common measure of correlation is called the “Pearson Product-Moment
Correlation Coefficient”. It is computed from two quantities derived from the two variables: the covariance
between them, cov(x, y), and the standard deviation of each, σx and σy. This measure can
range from -1 to 1, inclusive. A value of -1 represents a “perfect negative correlation”, while a value
of 1 represents a “perfect positive correlation”. The closer a correlation measure is to these extremes,
the “stronger” the correlation between the two variables. A value of zero means that no correlation
is observed. It is important to note that a correlation measure of zero does not necessarily mean that
there is no relationship between the two variables, just that there is no linear relationship present in
the data that is being analyzed. It is also sometimes difficult to judge whether a correlation measure
is “high” or “low”.
There are certain situations where a correlation measure of 0.3, for example, may be considered
negligible. In other circumstances, such as in the social sciences, a 0.3 correlation measure may
suggest that further examination is needed. As with all data analysis, the context of the data must be
understood in order to evaluate any results.
4.2 COVARIANCE AND CORRELATION IN PROJECTS
It often arises in the course of executing projects that one or more random variables, or events, appear
to bear on the same project problem. For instance, fixed costs that accumulate period by period and
the overall project schedule duration are two random variables with obvious dependencies. Two
statistical terms come into play when two or more variables are in the same project space: covariance
and correlation.
Covariance
Covariance is a measure of how much one random variable depends on another. Typically, we
think in terms of “if X gets larger, does Y also get larger or does Y get smaller?” The covariance will
be negative for the latter and positive for the former. The value of the covariance is not particularly
meaningful since it will be large or small depending on whether X and Y are large or small.
Covariance is defined simply as:
Cov( X , Y ) = E( X * Y ) - E( X ) * E( Y )
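As a rough illustration of this definition, here is a minimal Python sketch that estimates Cov(X, Y) from paired sample data by computing E(X*Y) - E(X)*E(Y); the data values are hypothetical and chosen only to show the arithmetic.

# Sketch of the definition Cov(X, Y) = E(X*Y) - E(X)*E(Y),
# estimated from paired sample data (hypothetical values).
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n                               # E(X)
    mean_y = sum(y) / n                               # E(Y)
    mean_xy = sum(a * b for a, b in zip(x, y)) / n    # E(X*Y)
    return mean_xy - mean_x * mean_y

# As x grows, y tends to grow, so the covariance comes out positive.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(covariance(x, y))   # 1.6 for these hypothetical values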
Correlation
Covariance does not directly measure the strength of the “sensitivity” of X on Y; judging the
strength is the job of correlation. Sensitivity will tell us how much the cost changes if the schedule
is extended a month or compressed a month. In other words, sensitivity is always a ratio, also
called a density, as in this example: $cost change/month change. But if cost and time are random
variables, what does the ratio of any single outcome among all the possible outcomes forecast for
the future? Correlation is a statistical estimate of the effects of sensitivity, measured on a scale of
-1 to +1.
The Greek letter rho (ρ), used for populations of data, and “r”, used with samples of data, stand
for the correlation between two random variables: r(X, Y). The usual way of referring to “r” or
“ρ” is as the “correlation coefficient.” Their values can range from -1 to +1. A value of 0 means
no correlation, -1 means perfectly correlated but moving in opposite directions, and +1 means
perfectly correlated and moving in the same direction.
The correlation function is defined as the covariance normalized by the product of the standard
deviations:
r( X , Y ) = COV( X , Y )/( σ X * σ Y )
We can now rewrite the variance equation:
VAR( X + Y ) = VAR( X ) + VAR( Y ) + 2 * ρ * σ X * σ Y
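To make the normalisation concrete, the following self-contained Python sketch computes r(X, Y) = COV(X, Y)/(σX * σY) and checks the variance identity above numerically; the sample data are hypothetical and population (divide-by-n) formulas are assumed.

import math

# Population covariance; covariance(x, x) is the variance of x.
def covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def correlation(x, y):
    sx = math.sqrt(covariance(x, x))   # sigma_X
    sy = math.sqrt(covariance(y, y))   # sigma_Y
    return covariance(x, y) / (sx * sy)

x = [1, 2, 3, 4, 5]            # hypothetical schedule durations
y = [10, 12, 11, 15, 14]       # hypothetical period costs
r = correlation(x, y)
print(r)                        # a value between -1 and +1

# Check: VAR(X + Y) = VAR(X) + VAR(Y) + 2 * r * sigma_X * sigma_Y
s = [a + b for a, b in zip(x, y)]
lhs = covariance(s, s)
rhs = covariance(x, x) + covariance(y, y) + 2 * r * math.sqrt(covariance(x, x)) * math.sqrt(covariance(y, y))
print(abs(lhs - rhs) < 1e-9)    # True: the identity holds numerically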
In statistics, correlation (often measured as a correlation coefficient) indicates the strength and
direction of a linear relationship between two random variables. In general statistical usage,
correlation or co-relation refers to the departure of two variables from independence. In this broad
sense there are several coefficients, measuring the degree of correlation, adapted to the nature of
data. A number of different coefficients are used for different situations. The best known is the
Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the
two variables by the product of their standard deviations. Despite its name, it was first introduced by
Francis Galton. Several authors have offered guidelines for the interpretation of a correlation
coefficient. Cohen (1988), for example, has suggested the following interpretations for correlations in
psychological research, in the table below.

Correlation   Negative            Positive
Small         -0.29 to -0.10      0.10 to 0.29
Medium        -0.49 to -0.30      0.30 to 0.49
Large         -1.00 to -0.50      0.50 to 1.00
As Cohen himself has observed, however, all such criteria are in some ways arbitrary and should not
be observed too strictly. This is because the interpretation of a correlation coefficient depends on the
context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-
quality instruments, but may be regarded as very high in the social sciences where there may be a
greater contribution from complicating factors.
Along this vein, it is important to remember that “large” and “small” should not be taken as synonyms
for “good” and “bad” in terms of determining that a correlation is of a certain size. For example, a
correlation of 1.0 or −1.0 indicates that the two variables analyzed are equivalent modulo scaling.
Scientifically, this more frequently indicates a trivial result than an earth-shattering one. For example,
consider discovering a correlation of 1.0 between how many feet tall a group of people are and the
number of inches from the bottom of their feet to the top of their heads.
4.3 CORRELATION ANALYSIS USING SCATTER PLOTS
A scatter plot is a graph that represents bivariate data as points on a two-dimensional Cartesian plane.
Suppose, for example, that the height h (in cm) and weight w (in kg) of nine Year 10 students are
observed and plotted, with height on the horizontal (x) axis and weight on the vertical (y) axis.
We observe that y increases as x increases, and the points do not lie on a straight line. We say that
a weak positive association exists between the variables x and y.
Consider the following scatter plot:
We observe that y decreases as x increases, and the points do not lie on a straight line. We say
that a weak negative association exists between the variables x and y.
Consider the following scatter plot:
It is clear from the scatter plot that as x increases, there is no apparent effect on y. In such a
case, we say that no association exists between the variables x and y.
Consider the following scatter plot:
If a data value does not fit the trend of the data, then it is said to be an outlier. In the above
scatter plot, it is easy to identify the outliers. There are two outliers in the set of data values.
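As an illustration of reading a scatter plot, the short sketch below plots hypothetical height and weight values for nine students (the original data table is not reproduced here) so that the direction of the association can be judged by eye; it assumes the matplotlib library is available.

import matplotlib.pyplot as plt

height = [152, 155, 158, 160, 163, 165, 168, 170, 173]   # hypothetical heights in cm
weight = [45, 50, 48, 52, 55, 54, 58, 60, 62]            # hypothetical weights in kg

plt.scatter(height, weight)                # each point is one student
plt.xlabel("Height h (cm)")
plt.ylabel("Weight w (kg)")
plt.title("Weight versus height: points trend upward, a positive association")
plt.show()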
4.4 KARL PEARSON'S COEFFICIENT OF CORRELATION
Sum of Squares
We introduced a notation earlier in the course called the sum of squares. This SS notation will make
these formulas much easier to work with.
Here is the defining formula for r. Don't worry about it; we won't be finding it this way:
r = [n Σxy − (Σx)(Σy)] / sqrt{ [n Σx² − (Σx)²] [n Σy² − (Σy)²] }
This formula can be simplified through some simple algebra and then some substitutions using the SS
notation discussed earlier. If you divide the numerator and denominator by n, you get quantities that
should look familiar; each of them has been seen before in the Sum of Squares notation section:
SS(x) = Σx² − (Σx)²/n,  SS(y) = Σy² − (Σy)²/n,  SS(xy) = Σxy − (Σx)(Σy)/n
So, the linear correlation coefficient can be written in terms of sums of squares:
r = SS(xy) / sqrt[ SS(x) · SS(y) ]
Application of Correlation in Hypothesis Testing
The claim we will be testing is “There is significant linear correlation”.
The Greek letter for r is rho (ρ), so the parameter used for linear correlation is rho.
H0: ρ = 0
H1: ρ ≠ 0
r has a t distribution with n − 2 degrees of freedom, and the test statistic is given by:
t = r / sqrt[ (1 − r²) / (n − 2) ]
Now, there are n-2 degrees of freedom this time. This is a difference from before. As an over-
simplification, you subtract one degree of freedom for each variable, and since there are 2
variables, the degrees of freedom are n-2.
This doesn't look like our usual test statistic of the form (observed − expected) / standard error, but
if it is rewritten as
t = (r − ρ) / sqrt[ (1 − r²) / (n − 2) ]
with ρ = 0 under the null hypothesis, the formula for the test statistic does look like the pattern
we're looking for.
Remember that hypothesis testing is always done under the assumption that the null hypothesis is
true.
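As a sketch of how this test might be carried out in practice, the snippet below computes r, the t statistic with n − 2 degrees of freedom, and a two-tailed p-value; it assumes the SciPy library is available, and the paired data values are hypothetical.

import math
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 5, 8, 9]
n = len(x)

r, _ = stats.pearsonr(x, y)                    # sample linear correlation coefficient
t = r / math.sqrt((1 - r ** 2) / (n - 2))      # test statistic with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p-value

print(r, t, p_value)
# Reject H0: rho = 0 at the 5% level when p_value < 0.05.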
4.5 SPEARMAN'S RANK CORRELATION COEFFICIENT
In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles
Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of
correlation – that is, it assesses how well an arbitrary monotonic function could describe the
relationship between two variables, without making any other assumptions about the particular
nature of the relationship between the variables. Certain other measures of correlation are parametric
in the sense of being based on possible relationships of a parameterised form, such as a linear
relationship.
In principle, ρ is simply a special case of the Pearson product-moment coefficient in which two sets of
data Xi and Yi are converted to rankings xi and yi before calculating the coefficient. In practice,
however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks,
and the differences, di, between the ranks of each observation on the two variables are calculated.
Then ρ is given by:
ρ = 1 − [6 Σ di²] / [n (n² − 1)]
where
di = xi − yi = the difference between the ranks of corresponding values Xi and Yi, and
n = the number of values in each data set (same for both sets).
If tied ranks exist, the classic Pearson's correlation coefficient between ranks has to be used instead of
this formula.
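The following sketch applies the di formula directly, under the assumption that there are no tied ranks; the data values are hypothetical.

# Spearman's rho via rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no ties.
def ranks(values):
    # Rank 1 for the smallest value, rank n for the largest (no ties assumed).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]   # hypothetical raw scores
y = [20, 28, 27, 50, 29, 7, 17, 6, 12, 2]             # hypothetical raw scores
print(spearman_rho(x, y))   # a value between -1 and +1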
4.7 SUMMARY
These sorts of studies involve comparing two variables (e.g., income and crime, smoking habits and
health) in order to see if there might be some connection and perhaps even a suggestion of cause. As
a cigarette smoking habit rises, do health problems also rise? As income decreases, does the frequency
of crime increase? As people grow older, do they become less or more tolerant of others?
Correlation is an extremely important analytical tool which enables us to begin to sort out claims about
important connections, which may or may not be true: the amount of smoking and the incidence of
lung cancer, HIV infection and the onset of AIDS, the age of a car and its value, television programming
of playoff games and attendance at lectures, poverty and crime, IQ tests and income levels,
intelligence and heredity, age and mechanical skills, and so on. People make claims about such matters
all the time. The principle of correlation enables us to investigate such claims in order to understand
whether they are true or not and, if true, just what the strength of that relationship might be.