
Introduction to Correlation

Instructor: Weikang Kao, Ph.D.


Relationships
 There are two primary uses of statistics: Determining differences and
determining relationships (hopefully predictive ones).
 In the first part of this course we talked about common ways to detect whether
differences exist.
 We looked at things like t-tests to make comparisons between two groups, and
ANOVA to look at differences between three or more groups or where we have
an interaction or a confounding blocking variable we wish to consider.
 Often, we are not interested in whether some categorically defined groups differ.
We are instead interested in whether one variable is related to another variable.
For example, as fat or sugar intake increases, does weight also increase?
Relationships

Before we get to relationships, a quick review of variance.


 Variance: how participants' scores deviate from the mean in a sample.
How do we find a relationship between two or more items?
 Graphs, plots, and looking at covariance.
Relationship?
 When one variable deviates from its mean, we expect the other variable to deviate
from its mean in the same or the directly opposite way.
Relationships

Correlation plots: see if we can find some patterns.


Relationships

How do we know about the relationship?


 Recall: Variance.
Covariance (see the sketch below):
1. First, get the cross-product deviations: for each observation, multiply its deviation from
the mean of x by its deviation from the mean of y.
2. Then divide the sum of these values by the number of observations (n - 1 for a sample).
This average of the combined deviations is the covariance.
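A minimal sketch of these two steps in R, using hypothetical example vectors x and y (not from the slides):

# Step 1: cross-product deviations
x <- c(175, 170, 180, 178, 168)
y <- c(60, 70, 75, 80, 69)
cross_products <- (x - mean(x)) * (y - mean(y))
# Step 2: average the combined deviations (n - 1 for a sample)
covariance <- sum(cross_products) / (length(x) - 1)
covariance    # same value as cov(x, y)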
Finding a Relationship

Correlation is based on the ideas of variance and covariance: how much a variable differs from
its own mean, and how much two variables deviate together in a similar or dissimilar fashion.

The covariance can be any number from -∞ to +∞.

While the covariance gives us some indication of the relationship, it is affected heavily
by the scale the variables are on; thus, we will want to standardize it.
Standardization

With the covariance we have some indication of the effect, but without standardization the
number we get is not very diagnostic.
Standardizing the covariance by the two standard deviations gives Pearson's correlation
coefficient, also known as "r".

Performing this calculation, we will always end up with a value between -1 and 1.
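A minimal sketch of this standardization in R, reusing the hypothetical x and y vectors from the earlier sketch:

# Pearson's r is the covariance divided by the product of the standard deviations
r <- cov(x, y) / (sd(x) * sd(y))
r    # identical to cor(x, y)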
Standardization

How do we interpret Pearson's r?


 An r close to 1 indicates a strong positive relationship (as variable x increases, so too does
y), while an r close to -1 indicates a strong negative relationship (as x increases, y decreases),
and of course an r close to 0 would indicate no robust relationship between x and y.
How do we know if the r is weak, moderate, or strong?
 Common benchmarks: ± 0.1 (weak), ± 0.3 (moderate), and ± 0.5 (strong).
Remember, r does not tell us significance.
Significance

Before we test for significance, we need to state the hypotheses.


Null: there is no relationship between the two variables (r = 0).
Alternative: there is a relationship between the two variables (r ≠ 0).
The data for Pearson's r need to be interval: the spaces between units indicate the
same amount (e.g., Age, Weight, Income, etc.).
There are two ways we can go about testing the hypothesis (see the sketch below):
1. Z-score.
2. t-statistic.
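A minimal sketch of the t-statistic approach in R, assuming the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom (this is what cor.test() computes for us), reusing the hypothetical x and y vectors from above:

r <- cor(x, y)
n <- length(x)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)    # t-statistic for H0: r = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-tailed p-value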
Causality?

If we have a strong correlation between A and B, can we conclude that
A causes B?
 While correlation tells us something about the relationship between two
variables, it does not imply causality.
1. Third-variable issue: an unmeasured third variable may drive both A and B.
2. A correlation does not tell us directionality (e.g., do people find us attractive
because we are successful, or are we successful because we are attractive?).
Types of Correlation

There are two types of correlation: bivariate correlation and partial
correlation.
 Bivariate correlation: the relationship between two variables; it shows how much X changes
when there is a change in Y.
 Partial correlation: looks at the relationship between two variables
while controlling for the effect of one or more additional variables.
Assumptions of Pearson's r

1. Data are interval, so Pearson's r can accurately measure the linear
relationship between two variables.
2. The sampling distribution should be normally distributed.
3. Variables should be normally distributed.
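A minimal sketch of checking the normality assumption in R, using the hypothetical vector x from the earlier sketches:

hist(x)             # visual check of the distribution
shapiro.test(x)     # Shapiro-Wilk test of normality (a non-significant p suggests normality)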
Correlation Application
Instructor: Weikang Kao, Ph.D.
Bivariate Correlation in R

 In the current study, we want to know the relationship between height and


weight.
 We randomly select some participants and measure their weight and height.

Participant 1 2 3 4 5
Height 175 170 180 178 168
Weight 60 70 75 80 69
Bivariate Correlation in R

 Check the dataset.


Height <- c(175, 170, 180, 178, 168)
Weight <- c(60, 70, 75, 80, 69)
Data <- data.frame(Height, Weight)
library(car)    # scatterplot() comes from the car package
scatterplot(Data$Height, Data$Weight)

cor(Data$Height, Data$Weight, use = "complete.obs", method = "pearson")


Correlation: 0.426687
Bivariate Correlation in R

cor.test(Data$Height, Data$Weight, method = "pearson", conf.level = .99)


 t = 0.81716, df = 3, p-value = 0.4737
99 percent confidence interval:
-0.8776734  0.9791785
95 percent confidence interval:
-0.7306240 0.9509622
Bivariate Correlation in R

 What if we have more than two variables?


HeartRate <- c(60, 70, 75, 73, 71)
Data$HeartRate <- HeartRate    # add the new variable to the data frame
Data <- Data[, c("Height", "Weight", "HeartRate")]

cor(Data)
Height Weight HeartRate
Height 1.0000000 0.4266870 0.2204325
Weight 0.4266870 1.0000000 0.8932405
HeartRate 0.2204325 0.8932405 1.0000000
Bivariate Correlation in R

More than two variables.


install.packages("Hmisc")
library(Hmisc)
DataMatrix <- as.matrix(Data[, c("Height", "Weight", "HeartRate")])
rcorr(DataMatrix)
P-values:
          Height  Weight  HeartRate
Height            0.4737   0.7216
Weight    0.4737           0.0412
HeartRate 0.7216  0.0412
Using R2 for interpretation

 Although conclusions cannot be made in terms of causality and direction,
we can use R square to interpret our result.
 R2, the coefficient of determination, is a measure of the amount of variability
in one variable that is shared by another one.
For example, the correlation between height and weight is about .43, which means the R2
is about .18.
cor(Data)^2
cor(Data)^2*100
Non-Parametric
Non-Parametric

 Recall: when do we need a Non-Parametric test?


For example, when the assumption of normality is violated (e.g., lots of skew and potential
outliers), or when one (or both) of the variables we wish to examine is ordinal
(ranked data).
 Pearson's r requires interval or ratio data.
 The most well-known and widely used tests are Spearman's correlation coefficient
and Kendall's Tau (τ).
Spearman’s Correlation Coefficient

 Spearman’s correlation coefficient: rs


cor(Data$Height, Data$Weight, method = "spearman")
rho: 0.6
cor.test(Data$Height, Data$Weight, method = "spearman")
p-value = .35
cor.test(Data$Height, Data$Weight, alternative = "less", method = "spearman")
Kendall's Tau (τ)

Kendall's Tau (τ): preferred for a small dataset with a large number of tied ranks.
cor(Data$Height, Data$Weight, method = "kendall")
tau: 0.4
cor.test(Data$Height, Data$Weight, method = "kendall")
p-value = .48
cor.test(Data$Height, Data$Weight, alternative = "less", method = "kendall")
Two more non-parametric tests

Bootstrapping: simple, easy to use, and applicable in
most situations; however, the sample size must increase
if we want better power.
Biserial and point-biserial correlation: used when one of the
two variables is dichotomous, for example gender, alive or
dead, or pregnant or not (see the sketch below).
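A minimal sketch of both ideas in R, using the hypothetical Height/Weight data from earlier; the 0/1 Group variable is invented for illustration (a point-biserial correlation is just Pearson's r with one dichotomous variable):

# Bootstrapping the correlation with the boot package
library(boot)
boot_r <- function(d, i) cor(d$Height[i], d$Weight[i])    # statistic recomputed on each resample
boot(Data, statistic = boot_r, R = 2000)

# Point-biserial correlation with a hypothetical dichotomous variable
Group <- c(0, 1, 1, 0, 1)    # e.g., 0 = group A, 1 = group B
cor(Group, Data$Weight, method = "pearson")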
In Class Practice

 Check the Pearson's r between each pair of variables.


Height: 175, 170, 180, 178, 168, 181, 190, 185, 177, 162
Weight: 60, 70, 75, 80, 69, 78, 82, 84, 72, 53
Heart Rate: 60, 70, 75, 73, 71, 73, 76, 80, 68, 64
 What if we violate the assumptions? What should we do?
Partial-Correlation
Partial Correlation

 When we look at the correlation between two variables, what if there is a
third-variable effect? Would that influence our result?
 Partial correlation helps us examine the relationship between two
variables while the effect of a third variable is held constant.
 For example, can we conduct a partial correlation between weight and
height while controlling for the effect of heart rate?
What does that mean?
What does that mean?
Partial Correlation

[Venn diagram of Weight, Height, and Heart Rate: the regions show the variance accounted
for by weight and heart rate while controlling height, the variance accounted for by weight
and height while controlling heart rate, the variance accounted for by height and heart rate
while controlling weight, and the variance shared by weight, height, and heart rate.]
Partial Correlation

Take a look at our data.


library(ggm)
pc <- pcor(c("Variable 1", "Variable 2", "Variable Controlled"), var(Data))
r = 0.52
pc^2
R2 = 0.27
pcor.test(pc, 1, 5)    # arguments: the model we created, number of control variables, sample size
p = 0.48
Semi-Partial Correlation

 When we do a partial correlation, we control for the effect of the third
variable, which has an effect on both variables.
 When we do a semi-partial correlation, we control for the effect of the
third variable on only one of the two variables.
 Semi-partial correlation is useful when trying to explain the variance
in one particular variable from a set of predictor variables.
Semi-Partial Correlation

[Venn diagram of Weight, Height, and Heart Rate: one region shows the variance accounted
for by height and heart rate on weight; another shows the variance accounted for by weight
and heart rate while controlling height from heart rate.]
Partial-Correlation Application
Application

In our data file for today (simplerelationships.xlsx) we have three columns of


data:
1. Income – how much a person earns per year.
2. Percent Income Saved (PI) – the percent of income carried over to the next
year; it can be negative if more money was spent than came in.
3. Score on a savings motivation metric (MS) – an ordinal scale.
In this case, we would like to focus on the relationship between Income and
PI.
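A minimal sketch of loading the file in R, assuming the readxl package and that the file sits in the working directory (the course may use a different reader):

library(readxl)
data <- read_excel("simplerelationships.xlsx")    # columns used below: Income, PI, MS (plus Age, used later)
head(data)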
Application

Before we get started, we always want to take a look at the data.


scatterplot(data$Income, data$PI)
Examine the correlation
cor(data$Income, data$PI, method = "pearson")
 r = 0.40

cor.test(data$Income, data$PI, method = "pearson", conf.level = .95)


 t = 4.29, df = 97, p < .001
 95 percent confidence interval: 0.22 to 0.55
Application

We find a significant positive relationship (r = .40, t(97) = 4.29, p < .001) between
income and percent income saved – the more people earn, the greater the percentage of money
they appear to save (and vice versa – directionality isn't known from this).
We can now calculate the R square to see how much variance we can explain.
cor(data)^2
R2 = .16
What’s next?
Application

We also want to control for potential third-variable effects to make sure we measure
what we are aiming for.
data <- data[, c("Income", "PI", "MS", "Age")]
cor(data)
Income PI MS Age
Income 1.00000000 0.39901305 -0.06075949 0.99536381
PI 0.39901305 1.00000000 0.00978232 0.48535636
MS -0.06075949 0.00978232 1.00000000 -0.05690868
Age 0.99536381 0.48535636 -0.05690868 1.00000000
Application

 Given the high probability that a lot of the income effect on savings is shared
with age, we should run a partial correlation.
 Now, based on our study, we can determine which variables we want to
look at and which variables we would like to control for.
 For example, we study income and MS while controlling for Age.
pc <- pcor(c("Income", "MS", "Age"), var(data))
pc
pc^2
pcor.test(pc, 1, 99)
Summary Write up

 There is a significant relationship between income and percent income saved, r


= .40, p < .001.
 There is a significant relationship between income and age, r = .99, p < .001.
 The relationship between income and savings motivation while controlling for age is
not significant, r = -.043, p = .68.
Weekly Lab

 When do we want to use correlation rather than ANOVA?


Please provide an example.
 Please explain why partial correlation is useful and provide an
example.
