Introduction to Correlation
Correlation is based on the ideas of variance and covariance: how much a variable deviates from
its own mean, and how much two variables deviate together in a similar or dissimilar fashion.
While the covariance gives us some indication of the relationship, it is heavily affected
by the scales the variables are measured on; we will therefore want to standardize it.
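As a quick illustration (with made-up height and weight numbers), the covariance changes with the units of measurement, while the standardized version does not:

```r
# Hypothetical data: heights in meters and the same heights in centimeters.
height_m  <- c(1.75, 1.70, 1.80, 1.78, 1.68)
weight_kg <- c(60, 70, 75, 80, 69)

cov(height_m, weight_kg)        # small number
cov(height_m * 100, weight_kg)  # same data in cm: covariance is 100x larger

# Standardizing (dividing by both standard deviations) removes the scale:
cov(height_m, weight_kg) / (sd(height_m) * sd(weight_kg))
cor(height_m, weight_kg)        # identical value, regardless of units
```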
Standardization
The covariance gives us some indication of the effect, but without standardization the
number itself is not very diagnostic.
Pearson's correlation coefficient, also known as "r"
Performing this calculation, we will always end up with a value between -1 and 1.
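A toy illustration of those bounds: perfectly linear data produces exactly r = 1 or r = -1, and nothing can fall outside that range.

```r
x <- c(1, 2, 3, 4, 5)

cor(x, 2 * x + 3)   # perfect positive linear relationship: r = 1
cor(x, -x)          # perfect negative linear relationship: r = -1
```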
Correlation and Causation
If we have a strong correlation between A and B, can we make the conclusion that
A causes B?
While correlation will tell us something about the relationship between two
variables, it does not imply causality.
1. The third-variable problem: an unmeasured third variable may be driving both A and B.
2. A correlation does not tell us directionality (e.g., do people find us attractive
because we are successful, or are we successful because we are attractive?).
Types of Correlation
1. Data are interval, so Pearson's r can accurately measure the linear
relationship between two variables.
2. The sampling distribution should be normally distributed.
3. Variables should be normally distributed.
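One common way to check the normality assumption in R is a Shapiro-Wilk test together with a Q-Q plot (the variable `x` here is simulated purely for illustration):

```r
# Simulated variable, for illustration only.
set.seed(1)
x <- rnorm(100)

shapiro.test(x)   # Shapiro-Wilk: a large p-value gives no evidence against normality

# Visual check: points should fall close to the reference line.
qqnorm(x)
qqline(x)
```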
Correlation Application
Instructor: Weikang Kao, Ph.D.
Bivariate Correlation in R
Participant     1     2     3     4     5
Height        175   170   180   178   168
Weight         60    70    75    80    69
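The table above can be entered into R and the default (Pearson) correlation computed as a sketch:

```r
# The five participants from the table above.
Data <- data.frame(
  Height = c(175, 170, 180, 178, 168),
  Weight = c(60, 70, 75, 80, 69)
)

cor(Data$Height, Data$Weight)       # Pearson's r (the default method), r ~ .43
cor.test(Data$Height, Data$Weight)  # r with a t-test and confidence interval
```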
Kendall's Tau (τ): preferred for a small dataset with a large number of tied ranks.
cor(Data$Height, Data$Weight, method = "kendall")
tau: 0.4
cor.test(Data$Height, Data$Weight, method = "kendall")
p-value = .48
cor.test(Data$Height, Data$Weight, alternative = "less", method = "kendall")
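Putting the calls above together into one runnable snippet (re-entering the same five participants), the exact test reproduces τ = 0.4 and the two-sided p-value of about .48:

```r
Data <- data.frame(
  Height = c(175, 170, 180, 178, 168),
  Weight = c(60, 70, 75, 80, 69)
)

# Two-sided exact test (the default for small n): tau = 0.4, p ~ .48
cor.test(Data$Height, Data$Weight, method = "kendall")

# One-sided test in the "less" direction: tests whether tau is negative
cor.test(Data$Height, Data$Weight, alternative = "less", method = "kendall")
```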
Two more non-parametric tests
[Diagram: variance in heart rate accounted for by weight, after the overlap with height is partialled out of heart rate.]
Partial-Correlation Application
Application
We find a significant positive relationship (r = .40, t(97) = 4.29, p < .001) between
what people earn and what they save: the more they earn, the greater the percentage of money
they appear to save (and vice versa; directionality isn't known from this).
We can now calculate R-squared to see how much variance we can explain.
cor(data)^2
R2 = .16
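As a check, the reported statistics are consistent with one another: with n = 99 (so df = 97), r = .40 reproduces the quoted t value and R² (a sketch, assuming only the numbers quoted above):

```r
r <- 0.40
n <- 99                             # df = n - 2 = 97, matching t(97)

t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                              # ~ 4.30, matching the reported t
2 * pt(-abs(t_stat), df = n - 2)    # two-sided p-value, well below .001

r^2                                 # R-squared = .16: 16% of variance explained
```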
What’s next?
We also want to control for potential third variables, to make sure we are measuring
the relationship we are actually aiming for.
data <- data[, c("Income", "PI", "MS", "Age")]
cor(data)
Income PI MS Age
Income 1.00000000 0.39901305 -0.06075949 0.99536381
PI 0.39901305 1.00000000 0.00978232 0.48535636
MS -0.06075949 0.00978232 1.00000000 -0.05690868
Age 0.99536381 0.48535636 -0.05690868 1.00000000
Given the high probability that much of the income effect on savings is shared
with age, we should run a partial correlation.
Based on our research question, we can now determine which variables we want to
look at and which variables we would like to control.
For example, we can examine Income and MS while controlling for Age, using pcor() from the ggm package:
library(ggm)
pc <- pcor(c("Income", "MS", "Age"), var(data))
pc
pc^2
pcor.test(pc, 1, 99)
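The value pcor() returns can also be reproduced by hand from the bivariate correlation matrix above, using the standard first-order partial-correlation formula r_xy.z = (r_xy − r_xz·r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)):

```r
# First-order partial correlation of Income and MS, controlling for Age,
# computed from the bivariate correlations printed in the matrix above.
r_im <- -0.06075949   # Income-MS
r_ia <-  0.99536381   # Income-Age
r_ma <- -0.05690868   # MS-Age

pc <- (r_im - r_ia * r_ma) / sqrt((1 - r_ia^2) * (1 - r_ma^2))
pc      # ~ -0.043: almost nothing is left once Age is controlled
pc^2    # proportion of variance uniquely shared
```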
Summary Write-up