Special Correlation
RESEARCH & STATISTICS
Psychologist Amit Panwar
8013666663
Correlation & its Special Types
Correlation: Karl Pearson’s Product Moment Correlation
The correlation coefficient is a statistic that measures the relationship between two variables. Once two variables have been found to be related, it also indicates the strength of that relationship.
Pearson’s correlation (also called Pearson’s r) is a correlation coefficient that shows the linear relationship between two sets of data. Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r” for a sample.
Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return a value between -1 and 1, where:
• 1 indicates a strong positive relationship.
• -1 indicates a strong negative relationship.
• A result of zero indicates no linear relationship at all.
Meaning
• A correlation coefficient of 1 means that for every increase in one variable, there is an increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
• A correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.
• Zero means that an increase in one variable is not accompanied by any consistent increase or decrease in the other. The two are not linearly related.
The absolute value of the correlation coefficient gives us the relationship strength.
The larger the number, the stronger the relationship. For example, |-.75| = .75, which
has a stronger relationship than .65.
Correlation: Pearson’s r
Use the Pearson correlation coefficient when the relationship is linear and both variables are quantitative, normally distributed, and free of outliers.
It is important to determine whether or not a correlation coefficient is statistically significant. A statistically significant correlation coefficient suggests that the correlation reflects a real-world relationship and is less likely to be attributable to chance or sampling error.
You can determine whether a correlation is statistically significant by looking at the p-value. Generally speaking, a p-value less than 0.05 indicates a statistically significant correlation. This means that, if there were no correlation in the population, a sample correlation at least this strong would occur less than 5% of the time.
SYMMETRY OF CORRELATIONS:
The correlation coefficient is symmetric: its value is the same whether it is computed between X and Y or between Y and X.
DEGREES OF FREEDOM: The degrees of freedom for a correlation is the total number of score pairs (N) minus two, i.e. df = N - 2.
LIMITATION:
The PPMC cannot distinguish between dependent and independent variables. For example, if you are trying to find the correlation between a high-calorie diet and diabetes, you might find a high correlation of .8. However, you would get the same result with the variables switched around.
The method gives no information about the slope of the line; it only describes the strength and direction of the linear relationship between the two variables.
The Pearson correlation coefficient is easily misinterpreted, especially when the data are homogeneous.
Correlation: Pearson’s r
Scatter Plot:
A scatterplot is a graph of ordered pairs showing the relationship between two sets of data. A scatterplot requires bivariate data: two variables, measured on the same cases, that are compared to find relationships.
Each point on the graph is an ordered pair, two numbers that indicate a location on the coordinate plane. The first number is the location on the x-axis, and the second number is the location on the y-axis. To create a scatterplot, first create ordered pairs from the two variables. Put the independent variable on the x-axis and the dependent variable on the y-axis.
Next, plot each point on the graph. This shows whether there is a correlation between the two variables. If the points move in the same direction and lie close together, the variables are likely correlated.
Correlation: Pearson’s r
Population Correlation Coefficient (symbol: ρ): an index expressing the degree of association between two continuously measured variables for a complete population of interest. For example, a researcher could obtain income and education information for all families in a town and calculate a population correlation coefficient for the entire town.
In contrast, the sample correlation coefficient indexes the association for a specific subset of those cases (e.g., every fourth family from a list of all those in the town). The sample correlation coefficient of two variables is their sample covariance divided by the product of their sample standard deviations.
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022
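The column sums in the table are all that the raw-score form of Pearson's r needs. As a minimal sketch (the function name `pearson_r_from_sums` is illustrative and not from the slides), the standard computational formula r = [nΣxy − ΣxΣy] / √([nΣx² − (Σx)²][nΣy² − (Σy)²]) can be applied to the six age and glucose pairs:

```python
import math

# Age (x) and glucose level (y) for the six subjects in the table.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]


def pearson_r_from_sums(x, y):
    """Pearson's r via the raw-score (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r_from_sums(ages, glucose)
print(round(r, 4))  # 0.5298
```

Swapping the two lists returns the same value, which illustrates the symmetry of the coefficient.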
Correlation: Pearson’s r
Question: Test the significance of the correlation coefficient r = 0.565 using the critical values for the Pearson Product Moment Correlation (PPMC) test at α = 0.01 for a sample size of 9.
DEGREES OF FREEDOM: The degrees of freedom for a correlation is the total number of score pairs (N) minus two, i.e. df = N - 2.
Step 1: Subtract two from the sample size to get df, the degrees of freedom.
9 – 2 = 7
Step 2: Look the values up in the PPMC table. With df = 7 and α = 0.01, the table value is 0.798.
Step 3: Draw a graph, so you can more easily see the relationship.
r = 0.565 does not fall into the rejection region (above 0.798), so there isn’t enough evidence to state that a significant linear relationship exists in the data.
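The decision rule in these steps can be sketched in a few lines. The critical r of 0.798 is the table value quoted in Step 2. The critical t of 3.499 (two-tailed, df = 7, α = 0.01) is an assumption taken from a standard t table; it is included only to show where the tabled r comes from, through r_crit = t_crit / √(t_crit² + df).

```python
import math

r = 0.565      # sample correlation from the question
n = 9          # sample size
alpha = 0.01   # significance level (two-tailed)

df = n - 2     # degrees of freedom: 9 - 2 = 7

# Table value quoted in Step 2 for df = 7 and alpha = 0.01.
r_critical_table = 0.798

# Assumed two-tailed critical t for df = 7, alpha = 0.01 (standard t table).
t_critical = 3.499
r_critical_from_t = t_critical / math.sqrt(t_critical ** 2 + df)

# t statistic that corresponds to the observed r.
t_observed = r * math.sqrt(df / (1 - r ** 2))

# Reject the null hypothesis only if |r| exceeds the critical value.
significant = abs(r) > r_critical_table
print(df, round(r_critical_from_t, 3), round(t_observed, 2), significant)
# 7 0.798 1.81 False
```

Both routes lead to the same decision: the observed r is below the critical r, and the observed t is below the critical t.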
Correlation: Coefficient of Determination
Coefficient of Determination
The coefficient of determination, or the correlation coefficient of determination, measures how much of the variability in one quantity is explained by change in another quantity. In other words, this statistic enables us to estimate how well the change in one quantity determines the change in another quantity.
r² is the coefficient of determination when there is only one predictor variable.
R² is the coefficient of determination when there is more than one predictor variable.
Coefficient of determination and coefficient of correlation: The coefficient of determination r² is simply the square of the coefficient of correlation, r.
The range of r² is 0 to 1. The coefficient of determination is also calculated to determine how much variability in the outcome variable can be explained by changes in the predictor variable. As the value of the coefficient of determination approaches 1, the predictive power of the model approaches 100%.
The coefficient of determination allows us to gauge the predictive power of the derived model. Statisticians often calculate it to determine the predictive power of the mathematical model they build to model a real-life situation. One example of its use is weather prediction, where many models are created but only the mathematical models that provide a high coefficient of determination are chosen for further consideration.
The closer the value of r is to -1 or 1, the stronger the power of determination between the two variables.
For example, if the two data sets X and Y have a positive correlation with r = .54, then the coefficient of determination is r² = .2916, meaning that X explains 29% of the variability in Y. This also means that 71% of the variability in the values of Y is not explained by the values of X.
r² = 0 means that the predictor variable completely fails to predict the outcome variable.
r² = .25 means that the predictor variable explains 25% of the variability in the outcome variable.
r² = .50 means that the predictor variable explains 50% of the variability in the outcome variable.
r² = .75 means that the predictor variable explains 75% of the variability in the outcome variable.
r² = 1 means that the predictor variable explains 100% of the variability in the outcome variable.
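The r = .54 example reduces to squaring the coefficient. A minimal sketch, in which `explained_variance` is an illustrative helper name rather than anything from the slides:

```python
def explained_variance(r):
    """Return (r squared, unexplained share) for a correlation coefficient r."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared


r_squared, unexplained = explained_variance(0.54)
print(round(r_squared, 4), round(unexplained, 4))  # 0.2916 0.7084
```

Because r is squared, a correlation of -.54 explains the same share of variability as a correlation of .54.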
Special Correlations
SPECIAL TYPES OF CORRELATION:
• Biserial Correlation
  First variable: Continuous level (artificially dichotomous/ordinal)
  Second variable: Continuous level (interval/ratio)
  Note: The ordinal data must have an underlying continuity but be measured as a dichotomous variable, such as anxiety and depression.
• Point Bi-serial Correlation
  First variable: Nominal level (truly dichotomous)
  Second variable: Continuous level (interval/ratio)
  Note: A special type of Pearson Product Moment Correlation.
• Phi Coefficient
  First variable: Nominal level (truly dichotomous)
  Second variable: Nominal level (truly dichotomous)
  Note: Used when we have only two unique values in our nominal variables, such as Gender (Male and Female) or BP (Low and High).
• Cramer’s V
  First variable: Nominal level
  Second variable: Nominal level
  Note: Used when we have more than two unique values in our nominal variables, such as Gender (Male, Female & LGBTQ+) or Favorite Music Genre (Classical, Pop, HipHop).
• Tetrachoric Correlation
  First variable: Nominal level (artificially dichotomous)
  Second variable: Nominal level (artificially dichotomous)
  Note: There is a latent continuous scale underneath your binary data. In other words, the trait you are measuring should be continuous and not discrete.
• Spearman Rank Order
  First variable: Ordinal level
  Second variable: Ordinal level
  Note: Assumes a monotonic relationship between variables.
For example, if we want to describe the correlation between height and gender, we should use the point-biserial correlation coefficient. The variable gender is assigned arbitrary values like 1 and 2 to denote males and females respectively.
The point-biserial correlation coefficient, 𝒓𝒑𝒃𝒊, is a special case of Pearson’s correlation coefficient. It measures the relationship between two variables:
• One continuous variable.
• One naturally binary variable (a truly dichotomous variable that is dummy coded). The calculations simplify because the values 1 (presence) and 0 (absence) are typically used for the dichotomous variable.
Many different situations call for analyzing a link between a binary variable and a continuous variable. For example:
• Does Drug A or Drug B do more to improve depression?
• Are women or men likely to earn more as Psychology Faculty?
Limitation:
If you intentionally force data to become binary so that you can run a point-biserial correlation, perhaps by splitting a continuous ratio variable into two segments, your results become less reliable. There are exceptions to this rule of thumb. For example, you could separate test scores or GPAs into pass/fail, creating a logical binary variable. An example of unnaturally forcing a scale into a binary variable: saying that people under 5’9″ are “short” and those over 5’9″ are “tall.”
The values of the dummy variable are arbitrary, and reversing the coding reverses the sign of the point-biserial correlation while leaving its magnitude unchanged. The sign therefore carries no meaning on its own and is interpreted only in light of the coding used.
Correlation: Special Types of Pearson’s Correlation Coefficients Point-Biserial Correlation
$r_{pbi} = \frac{M_1 - M_0}{S_n} \sqrt{pq}$
M1 = mean (for the entire test) of the group that received the positive binary variable (i.e. the “1”).
M0 = mean (for the entire test) of the group that received the negative binary variable (i.e. the “0”).
Sn = standard deviation for the entire test.
p = proportion of cases in the “0” group.
q = proportion of cases in the “1” group.
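As a check on these definitions, the sketch below computes the point-biserial coefficient two ways: from the group means using r_pb = (M1 − M0) / Sn × √(pq), and as an ordinary Pearson correlation on the 0/1-coded data. The scores and group codes are hypothetical, and Sn is assumed to be the population standard deviation (dividing by n), which is what makes the two results agree exactly.

```python
import math
import statistics

# Hypothetical data: test scores and a dummy-coded group (0 or 1).
scores = [10, 12, 14, 16, 18, 20]
group = [0, 0, 0, 1, 1, 1]


def point_biserial(values, codes):
    """r_pb = (M1 - M0) / Sn * sqrt(p * q), with Sn the population SD."""
    ones = [v for v, c in zip(values, codes) if c == 1]
    zeros = [v for v, c in zip(values, codes) if c == 0]
    m1, m0 = statistics.mean(ones), statistics.mean(zeros)
    s_n = statistics.pstdev(values)
    p = len(zeros) / len(values)   # proportion of cases in the "0" group
    q = len(ones) / len(values)    # proportion of cases in the "1" group
    return (m1 - m0) / s_n * math.sqrt(p * q)


def pearson_r(x, y):
    """Plain Pearson correlation computed from deviation scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_pb = point_biserial(scores, group)
r_pearson = pearson_r(scores, group)
print(round(r_pb, 4), round(r_pearson, 4))  # 0.8783 0.8783
```

Reversing the coding (swapping the 0s and 1s) returns the same magnitude with the opposite sign.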
Question: In which of the following circumstances is the point-biserial correlation used?
a. In the same circumstances in which a repeated-measures t-test would be used.
b. In the same circumstances in which an independent-measures t-test would be used.
c. When both x and y are measured on an ordinal scale.
d. When both x and y are measured on a continuous scale.
Ans. b
Explanation: An independent-measures t-test compares the means of one continuous variable across two separate groups, which is the same data structure as a truly dichotomous variable paired with a continuous variable. The two analyses are equivalent: with df = N - 2, $r_{pbi}^2 = t^2 / (t^2 + df)$.
Dummy Variable: A dummy variable is a variable that takes values of 0 and 1, where the
values indicate the presence or absence of something (e.g., a 0 may indicate a placebo and 1
may indicate a drug). Numeric variables can also be dummy coded to explore nonlinear
effects.
Dummy variables are the main way that categorical variables are included as predictors
in statistical and machine-learning models.
Correlation: Special Types of Non-Pearson’s Correlation Coefficients Biserial Correlation
The biserial correlation is a correlation coefficient between two continuous variables (X and Y), one of which (X) is measured dichotomously/ordinally.
The biserial correlation coefficient, termed 𝒓𝒃, is similar to the point-biserial, but it relates quantitative data to ordinal data. Both variables must be normally distributed. The ordinal data must have an underlying continuity but be measured as a dichotomous variable, such as anxiety or depression.
An example might be test performance vs. anxiety, where anxiety is designated as either high or low. Presumably, anxiety can take on any value in between, perhaps beyond, but it may be difficult to measure. We further assume that anxiety is normally distributed.
Anxiety and depression have an underlying continuity since they can be scored on a continuum, but they can also be treated as dichotomous, such as Low Anxiety/Depression (0) and High Anxiety/Depression (1).
The biserial correlation measures the strength of the relationship between a binary and a continuous variable, where the binary variable has an underlying continuous distribution but is measured as binary.
The biserial can be obtained from the point-biserial as $r_b = r_{pbi} \sqrt{pq} / y$, where y is the height of the standard normal curve at the point that divides it into the proportions p and q. Since the factor involving p, q, and the height is always greater than 1, the biserial is always greater in magnitude than the point-biserial.
Kindly Note: Always check for normality of the continuous outcome and the ordinal outcome when conducting biserial correlations.
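The claim that the conversion factor always exceeds 1 is easy to check numerically. The sketch below assumes the standard conversion r_b = r_pb × √(pq) / y, with y the ordinate of the standard normal curve at the cut point; the function name and the sample proportions are illustrative.

```python
import math
from statistics import NormalDist


def biserial_factor(p):
    """Return sqrt(p * q) / y, the multiplier that turns r_pb into r_b."""
    q = 1 - p
    z_cut = NormalDist().inv_cdf(p)   # point splitting the curve into p and q
    y = NormalDist().pdf(z_cut)       # height of the normal curve at that point
    return math.sqrt(p * q) / y


# The factor stays above 1 across a range of splits.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(biserial_factor(p), 3))

# Example: a point-biserial of 0.40 with an even split (p = 0.5).
r_pb = 0.40
r_b = r_pb * biserial_factor(0.5)
print(round(r_b, 3))  # 0.501
```

The factor is smallest for an even split and grows as the split becomes more lopsided, so the gap between the two coefficients widens for unbalanced groups.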
The rank-biserial correlation measures the strength of the relationship between a binary variable and a ranked (ordinal) variable.
The phi coefficient (φ) is used when both variables are truly dichotomous. For example, let us say that you have to compute the correlation between gender and ownership of property. Gender takes two levels, male and female. Ownership of property can be measured as either the person owns property or the person does not own property.
The significance can be tested by using the Chi-Square distribution.
Caution while interpreting sign: If we assign 0 to females and 1 to males instead of the reverse, we will get the same value of correlation with the opposite sign.
Formula for Phi Correlation: $\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$, where a, b, c, and d are the cell frequencies of the 2 × 2 table (a and b in the first row, c and d in the second).
Relationship between Phi coefficient and Chi-Square statistic: $\chi^2 = N\phi^2$, where N is the total number of cases.
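The link between phi and chi-square can be verified on a small table. The sketch below uses hypothetical counts for the gender and property example, computes φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)) from the four cell frequencies, and separately computes chi-square from observed and expected counts so that χ² = Nφ² is checked rather than assumed.

```python
import math

# Hypothetical 2 x 2 table of counts:
#                 owns property   does not own
# male (1)            a = 20         b = 10
# female (0)          c = 5          d = 25
a, b, c, d = 20, 10, 5, 25
n = a + b + c + d


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square from observed and expected cell counts."""
    total = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


phi = phi_coefficient(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
print(round(phi, 4), round(chi2, 4), round(n * phi ** 2, 4))
# 0.5071 15.4286 15.4286
```

Swapping the two rows, which corresponds to reversing the 0/1 coding of gender, gives the same magnitude of phi with the opposite sign.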
Correlation: Special Types of Non-Pearson’s Correlation Coefficients Tetrachoric
Tetrachoric correlation, 𝒓𝒕𝒆𝒕, is a measure of the correlation between two dichotomous variables that have underlying continuous distributions. As measured, the variables are binary, that is, they can only take on two values like “yes” and “no” or “good” and “bad.”
This type of correlation is often used in surveys and personality tests in which the questions being asked have only two possible response values.
For example, attitude towards females and attitude towards liberalization are two variables to be correlated. Now, we simply measure them as having a positive or negative attitude. So we have 0 (negative attitude) and 1 (positive attitude) scores available on both variables.
The underlying variables come from a normal distribution.
There is a latent continuous scale underneath your binary data. In other words, the trait you are measuring should be continuous and not discrete.
Formula: 𝒓𝒕𝒆𝒕 = cos θ
So the tetrachoric correlation between attitude towards liberalization and attitude towards women is positive.
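The definition of θ is not given in the text here, so the sketch below assumes the commonly taught cosine approximation, θ = 180° / (1 + √(ad/bc)), where a and d are the cells in which the two variables agree and b and c the cells in which they disagree. The counts are hypothetical, and this approximation is not the exact maximum-likelihood tetrachoric estimate.

```python
import math


def tetrachoric_cos_approx(a, b, c, d):
    """Cosine approximation: r_tet = cos(pi / (1 + sqrt(a*d / (b*c)))).

    a, d: counts where both variables agree (both 1, or both 0).
    b, c: counts where the two variables disagree.
    Requires b * c > 0.
    """
    theta = math.pi / (1 + math.sqrt((a * d) / (b * c)))
    return math.cos(theta)


# Hypothetical counts for two attitude items scored 0/1.
r_tet = tetrachoric_cos_approx(a=40, b=10, c=10, d=40)
print(round(r_tet, 3))  # 0.809
```

When the agreeing cells outweigh the disagreeing ones, θ falls below 90° and the coefficient is positive; when all four cells are equal, θ is exactly 90° and the coefficient is zero.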
Correlation: Special Types of RANK ORDER Correlation Coefficients Spearman’s rho
A well-known psychologist and intelligence theorist, Charles Spearman (1904), developed a correlation procedure that is called, in his honor, Spearman’s rank-order correlation or Spearman’s rho. It is typically denoted either with the Greek letter rho (ρ) or rs. Like all correlation coefficients, Spearman’s rho measures the strength of association between two variables.
Spearman correlation does not require continuous-level data (interval or ratio), because it uses ranks instead of assumptions about the distributions of the two variables. This allows us to analyze the association between variables of ordinal measurement levels. Moreover, the Spearman correlation does not assume that the variables are normally distributed.
The purpose of using a correlation analysis for ranked data is to determine whether or not the rankings of one variable are related to how another variable is ranked. For example, a teacher might take the student rankings from a statistics course and a social studies course and conduct a correlation to determine if there is a relationship between how the students are ranked in both classes.
Pearson's correlation is typically used when there is a linear relationship between variables. Spearman's correlation is used when there is a monotonic relationship between variables.
The basic formula for Spearman’s rank correlation coefficient assumes that no two units share the same rank.
Comparison of Pearson and Spearman coefficients
• The fundamental difference between the two correlation coefficients is that the Pearson coefficient works with a linear relationship between the two variables, whereas the Spearman coefficient also works with monotonic relationships.
• One more difference is that Pearson works with the raw data values of the variables whereas Spearman works with rank-ordered variables.
• Spearman can also be used with continuous data when Pearson’s assumptions are not satisfied.
Now, if a scatterplot visually suggests a “might be monotonic, might be linear” relationship, our best bet is to apply Spearman and not Pearson.
Spearman’s correlation is appropriate when:
1. The variables being analyzed are ranked or ordinal variables, or variables in which the values are arranged in some specific order. For example, the final exam grades from a statistics course can be ranked from lowest to highest.
2. The data follow a monotonic trend when graphed. Monotonic data move in one direction on a graph, but not necessarily in a linear fashion. To illustrate this trend, consider the following:
Monotonic trend: The data points move in the same direction. However, the rate at which the values increase varies, creating a line that is somewhat wavy or curvy.
Linear trend: The data points increase at the same rate, which creates a relatively straight line on the graph.
In a monotonic relationship, as the value of one variable increases, the value of the other variable consistently increases or consistently decreases, but not necessarily at a constant rate, whereas in a linear relationship the rate of increase/decrease is constant.
If a scatterplot indicates a relationship that cannot be expressed by a linear or monotonic function, then neither Pearson nor Spearman should be used to determine the strength of the relationship between the variables.
[Figure: example of a graph that is not monotonic.]
Correlation: Special Types of RANK ORDER Correlation Coefficients Spearman’s rho
Question: A teacher takes five students and ranks their grades in a statistics course and a psychology course.
ρ = Spearman's rho, or the correlation coefficient
d = difference between ranks
n = the number of people

Statistics grade   Psychology grade   Stats rank   Psychology rank   Difference   Squared Difference
98                 91                 1            3                 -2           4
67                 65                 5            5                 0            0
93                 92                 2            2                 0            0
90                 93                 3            1                 2            4
83                 85                 4            4                 0            0
TOTAL                                                                             8

$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(8)}{5(5^2 - 1)} = 1 - \frac{48}{120} = 1 - 0.4 = 0.6$
Since this value is positive, it indicates a positive correlation. Based on this information, the teacher can conclude that there is a positive correlation between the rankings of
students in the statistics and psychology courses.
Notice that for this example, the rankings were used rather than the actual class grades or raw scores. This is true when using Spearman's correlation in other contexts; the
actual rankings are analyzed instead of the raw values.
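The calculation in this example can be reproduced in a few lines. This is a minimal sketch using the simple no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)); the function names are illustrative.

```python
# Grades for the five students in the example.
stats_grades = [98, 67, 93, 90, 83]
psych_grades = [91, 65, 92, 93, 85]


def rank_descending(values):
    """Rank values so the highest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    rx, ry = rank_descending(x), rank_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho = spearman_rho(stats_grades, psych_grades)
print(rank_descending(stats_grades))  # [1, 5, 2, 3, 4]
print(rank_descending(psych_grades))  # [3, 5, 2, 1, 4]
print(round(rho, 2))                  # 0.6
```

With Σd² = 8 and n = 5, the fraction 6Σd² / (n(n² − 1)) equals 0.4, and subtracting it from 1 gives ρ = 0.6.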
TIED RANKS: When two or more subjects have the same score on a variable, their ranks are tied. Because distinct ranks cannot be assigned to them, the tied observations are each given the average of the ranks they would otherwise have received. A different formula is then used to calculate the correlation coefficient.
Applying the simple Spearman’s rho formula to tied ranks usually gives a value larger than the actual Spearman’s rho, so a correction is required in order to obtain the correct value.
The easier correction procedure applies Pearson’s formula to the ranks. The formula is as follows:
$r_s = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}$
Where, rs = Spearman’s rho.
X = ranks on variable X
Y = ranks on variable Y
n = number of pairs
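The tie-handling procedure can be sketched as two steps: assign average ranks, then apply Pearson's raw-score formula to those ranks. The scores below are hypothetical, and the helper names are illustrative.

```python
import math


def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first rank position held by v
        count = ordered.count(v)                # number of observations tied at v
        ranks.append(first + (count - 1) / 2)   # mean of the tied positions
    return ranks


def spearman_with_ties(x, y):
    """Spearman's rho as Pearson's r computed on (average) ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    sum_x, sum_y = sum(rx), sum(ry)
    sum_xy = sum(a * b for a, b in zip(rx, ry))
    sum_x2 = sum(a * a for a in rx)
    sum_y2 = sum(b * b for b in ry)
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den


# Hypothetical scores with one tie on each variable.
x_scores = [10, 20, 20, 30, 40]
y_scores = [5, 7, 6, 7, 9]
print(average_ranks(x_scores))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(average_ranks(y_scores))  # [1.0, 3.5, 2.0, 3.5, 5.0]
print(round(spearman_with_ties(x_scores, y_scores), 3))  # 0.921
```

When there are no ties, this approach gives the same value as the simple difference-based formula, so it can be used as the default.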
Correlation: Special Types of RANK ORDER Correlation Coefficients Goodman and Kruskal's gamma (G or γ)
The Gamma statistic (G or γ) was first proposed in a series of papers from 1954 to 1972 by Leo Goodman and William Kruskal. Gamma can be calculated for continuous ordered data such as height, time, or age. It can also be used for discrete ordered data like good, better, and best.
It is a nonparametric measure of the strength and direction of association that exists between two variables measured on an ordinal scale.
Whilst it is possible to analyze such data using Spearman's rank-order correlation or Kendall's tau-b, Goodman and Kruskal's gamma is recommended when your data has many tied ranks.
Assumption #1: Your two variables should be measured on an ordinal scale. Examples of ordinal variables include Likert scales (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 5-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
Assumption #2: There needs to be a monotonic relationship between the two variables.
The calculation of Gamma is based on two quantities:
Nc, the number of pairs that are ranked in the same order on both variables. These are called concordant pairs.
Nd, the number of pairs that are ranked in reverse order on the two variables. These are called discordant pairs.
Ties, where the values of paired data are equal, are ignored.
$G = \frac{N_c - N_d}{N_c + N_d}$
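Counting concordant and discordant pairs is mechanical, so a short sketch may help. It assumes the standard definition G = (Nc − Nd) / (Nc + Nd) and uses hypothetical ordinal ratings; any pair tied on either variable is skipped, as described above.

```python
from itertools import combinations


def concordant_discordant(x, y):
    """Count concordant and discordant pairs; pairs tied on x or y are skipped."""
    nc = nd = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            nc += 1   # same order on both variables
        elif product < 0:
            nd += 1   # reverse order on the two variables
    return nc, nd


def goodman_kruskal_gamma(x, y):
    """Gamma = (Nc - Nd) / (Nc + Nd)."""
    nc, nd = concordant_discordant(x, y)
    return (nc - nd) / (nc + nd)


# Hypothetical ordinal ratings (1 = low, 3 = high) with many ties.
x = [1, 1, 2, 2, 3, 3]
y = [2, 1, 2, 3, 1, 3]
print(concordant_discordant(x, y))            # (6, 3)
print(round(goodman_kruskal_gamma(x, y), 3))  # 0.333
```

Of the 15 possible pairs in this data, 6 are tied on at least one variable and drop out, which is why gamma suits data with many ties.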
Correlation: Special Types of RANK ORDER Correlation Coefficients Kendall’s Tau
This correlation procedure was developed by Kendall (1938). Kendall’s tau is based on an analysis of two sets of ranks, X and Y. Kendall’s tau is symbolized as τ, which is the lowercase Greek letter tau. It is less commonly used.
Kendall’s tau is said to be a better alternative to Spearman’s rho under conditions of tied ranks. Tau is also supposed to do better than Pearson’s r under conditions of extreme non-normality, although this holds true only in very extreme cases.
The parameter (population value) is symbolized as τ and the statistics computed on the sample is symbolized as
Tau is based on concordance and discordance among two sets of ranks. If the sign or direction of RX(A) – RX(B) for subjects A and B is the same as the sign or direction of RY(A) – RY(B), then the pair of ranks is said to be concordant (i.e., in agreement); if the signs differ, the pair is discordant.
In the case of subjects A and B, RX(A) – RX(B) is (1 – 2 = –1) and RY(A) – RY(B) is (1 – 3 = –2). The signs agree, so the pair A and B is called a concordant pair.
Continuous with Outliers or Ordinal: Your variables of interest must be either continuous or ordinal. Continuous means that your variables of interest can basically take on any value, such as heart rate, height, or weight. Kendall’s tau is often used on continuous data when the data have outliers.
Total number of concordant and discordant pairs = n (n – 1)/2
For n = 4 subjects, n (n – 1)/2 = (4 × 3)/2 = 6, so there are six pairs (AB, AC, AD, BC, BD, and CD).
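The pair counting for four subjects can be sketched as follows. The ranks for A and B follow the example in the text; the ranks for C and D are hypothetical, chosen only to complete the illustration. The coefficient is computed as (Nc − Nd) divided by the total number of pairs, which is the tau-a form and is an assumption here, since the formula itself does not appear in the text.

```python
from itertools import combinations

# Ranks on X and Y for four subjects. A and B follow the example in the text;
# C and D are hypothetical values added for illustration.
ranks = {"A": (1, 1), "B": (2, 3), "C": (3, 2), "D": (4, 4)}


def kendall_tau(rank_pairs):
    """Kendall's tau-a: (Nc - Nd) / (n * (n - 1) / 2), assuming no ties."""
    nc = nd = 0
    for (x1, y1), (x2, y2) in combinations(rank_pairs.values(), 2):
        if (x1 - x2) * (y1 - y2) > 0:
            nc += 1   # signs of the X and Y differences agree
        else:
            nd += 1   # signs of the X and Y differences disagree
    n = len(rank_pairs)
    total_pairs = n * (n - 1) // 2
    return nc, nd, (nc - nd) / total_pairs


nc, nd, tau = kendall_tau(ranks)
print(nc, nd, round(tau, 3))  # 5 1 0.667
```

Only the pair BC is discordant in this data, so five of the six pairs agree and tau is positive.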