Partial Correlation
ETS Example
What data do you have?
Hours of prep
GPA
SAT score
What kinds of predictions might you make about the relationship between hours of preparation and SAT score? How can you examine the relationship(s)?
Simple Correlation
Goal: determine the relationship between two variables (e.g., y and x1). $r^2_{yx_1}$ is the shared variance between y and x1.
[Venn diagram: the overlap of Y and X1 represents $r^2_{yx_1}$]
ETS Example
Can look at simple correlation between each pair of variables
prep hours & SAT
prep hours & GPA
GPA & SAT
ETS Example
[Scatterplot: Prep hours × SAT score]
ETS Example
[Scatterplot: GPA × SAT score]
ETS Example
[Scatterplot: GPA × prep hours]
ETS Example
GPA & SAT: not surprising
GPA & Prep hours: huh? People with lower GPAs prep more (why?)
The GPA & Prep hours relationship could explain the observed correlation between prep hours and SAT
Partial Correlation
Find the correlation between two variables with the third held constant in BOTH. That is, we remove the effect of x2 from both y and x1. $r^2_{yx_1 \cdot x_2}$ is the shared variance of y & x1 with x2 removed.
[Venn diagram: $r^2_{yx_1 \cdot x_2}$ is the part of the Y–X1 overlap that lies outside X2]
Partial Correlation
The partial correlation is the simple correlation between y-without-x2 and x1-without-x2 (the residuals). We can put this in terms of simple correlation coefficients:

$$r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{(1 - r^2_{yx_2})(1 - r^2_{x_1 x_2})}}$$

The terms under the square root represent all the variance without the partialled-out relationships.
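As a quick illustration, here is a minimal Python sketch of this formula (the function name is ours, and NumPy is assumed to be available):

```python
import numpy as np

def partial_corr(r_yx1, r_yx2, r_x1x2):
    """Partial correlation r_{yx1.x2}: x2 is removed from both y and x1."""
    return (r_yx1 - r_yx2 * r_x1x2) / np.sqrt((1 - r_yx2**2) * (1 - r_x1x2**2))

# Using the ETS example's simple correlations (see the following slides)
print(partial_corr(r_yx1=-0.21, r_yx2=0.71, r_x1x2=-0.54))  # about 0.28-0.29, depending on rounding
```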
Partial Correlation
$H_0$: $\rho_{yx_1 \cdot x_2} = 0$ (no relationship)
$H_1$: $\rho_{yx_1 \cdot x_2} \neq 0$ (either positive or negative correlation)

$$t(N-3) = \frac{r_{yx_1 \cdot x_2} - \rho_{yx_1 \cdot x_2}}{\sqrt{(1 - r^2_{yx_1 \cdot x_2})/(N-3)}}$$

$1 - r^2_{yx_1 \cdot x_2}$ is the unexplained variance
$N - 3$ = degrees of freedom (three variables)
$\sqrt{(1 - r^2_{yx_1 \cdot x_2})/(N-3)}$ = standard error of $r_{yx_1 \cdot x_2}$
ETS Example
Correlation between prep hours and SAT score with GPA partialled out:
$$r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{(1 - r^2_{yx_2})(1 - r^2_{x_1 x_2})}} = \frac{-0.21 - (0.71)(-0.54)}{\sqrt{(1 - 0.71^2)(1 - (-0.54)^2)}} = 0.28$$
ETS Example
The partial correlation between prep hours and SAT score with the effect of GPA removed: $r_{yx_1 \cdot x_2} = 0.28$, $r^2_{yx_1 \cdot x_2} = 0.08$

$$t(N-3) = t(17) = \frac{r_{yx_1 \cdot x_2}}{\sqrt{(1 - r^2_{yx_1 \cdot x_2})/(N-3)}} = \frac{0.28}{\sqrt{(1 - 0.08)/17}} \approx 1.20$$
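To finish the test, a minimal sketch (assuming N = 20 cases, so df = 17, with SciPy available) that turns the partial r into a t and a two-tailed p-value:

```python
from scipy import stats

r, df = 0.28, 17                     # partial correlation and N - 3 degrees of freedom
t = r / ((1 - r**2) / df) ** 0.5
p = 2 * stats.t.sf(abs(t), df)       # two-tailed p-value
print(t, p)                          # t is about 1.20, p is about 0.25 (not significant at .05)
```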
Semi-Partial Correlation
Find the correlation between two variables with the third held constant in only ONE of the variables. That is, we remove the effect of x2 from x1. $r^2_{y(x_1 \cdot x_2)}$ is the shared variance of y & x1 with x2 removed from x1.
[Venn diagram: $r^2_{y(x_1 \cdot x_2)}$ is the Y–X1 overlap, with x2 removed from x1 only]
Semi-Partial Correlation
Why semi-partial? Generally used with multiple regression to remove the effect of one predictor from another predictor without removing that variability from the predicted variable. NOT typically reported as the only analysis.
Semi-Partial Correlation
The semi-partial correlation is the simple correlation between y and x1-without-x2 (the residuals of x1). Put in terms of simple correlation coefficients:

$$r_{y(x_1 \cdot x_2)} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{1 - r^2_{x_1 x_2}}}$$
Semi-Partial Correlation
Which will be larger, the partial or the semi-partial correlation?

partial: $$r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{(1 - r^2_{yx_2})(1 - r^2_{x_1 x_2})}}$$

semi-partial: $$r_{y(x_1 \cdot x_2)} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{1 - r^2_{x_1 x_2}}}$$

(The numerators are identical, but the partial correlation has the smaller denominator, so it will be at least as large in magnitude as the semi-partial.)
ETS Example
Going back to the SAT example, suppose we partial GPA out of hours of prep only:

$$r_{y(x_1 \cdot x_2)} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{1 - r^2_{x_1 x_2}}} = \frac{-0.21 - (0.71)(-0.54)}{\sqrt{1 - (-0.54)^2}} = 0.20$$
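A companion sketch to the earlier partial-correlation function (again, the function name is ours):

```python
import numpy as np

def semipartial_corr(r_yx1, r_yx2, r_x1x2):
    """Semi-partial correlation r_{y(x1.x2)}: x2 is removed from x1 only."""
    return (r_yx1 - r_yx2 * r_x1x2) / np.sqrt(1 - r_x1x2**2)

# The ETS example values from the slides
print(semipartial_corr(r_yx1=-0.21, r_yx2=0.71, r_x1x2=-0.54))  # about 0.20
```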
Significance of Semi-partial
Same as for the partial correlation; just substitute $r_{y(x_1 \cdot x_2)}$. df = N - 3

$$t(N-3) = \frac{r_{y(x_1 \cdot x_2)}}{\sqrt{(1 - r^2_{y(x_1 \cdot x_2)})/(N-3)}}$$
ETS Example
The semi-partial correlation between prep hours and SAT score with the effect of GPA removed: $r_{y(x_1 \cdot x_2)} = 0.20$, $r^2_{y(x_1 \cdot x_2)} = 0.04$

$$t(N-3) = t(17) = \frac{r_{y(x_1 \cdot x_2)}}{\sqrt{(1 - r^2_{y(x_1 \cdot x_2)})/(N-3)}} = \frac{0.20}{\sqrt{(1 - 0.04)/17}}$$
Multiple Regression
Simple regression: $y' = a + bx$. Multiple regression: the General Linear Model.

$y' = a + b_1 x_1 + b_2 x_2$ (2 predictors). Therefore, the general formula: $y' = a + b_1 x_1 + \dots + b_k x_k$ (k predictors)

The problem is to solve for k + 1 coefficients:
k predictors (regressors) + the intercept. We are most concerned with the predictors.
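A minimal least-squares sketch of fitting such a model (the data here are made up purely for illustration; NumPy is assumed):

```python
import numpy as np

# Made-up data: columns are prep hours (x1) and GPA (x2); y is SAT score
X = np.array([[8.0, 2.8], [5.0, 3.75], [9.0, 2.6], [4.0, 3.8], [7.0, 3.0]])
y = np.array([1050.0, 1280.0, 1010.0, 1300.0, 1120.0])

# Prepend a column of ones so the intercept a is estimated along with b1..bk
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a, b = coeffs[0], coeffs[1:]
print(a, b)   # intercept and the k regression weights
```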
ETS Example
Prep hours (x1), GPA (x2), & SAT (y)
Use Prep hours and GPA to predict SAT score
Simple regressions
$y' = -4.79 x_1 + 1353$
$y' = 300 x_2 + 355$
ETS Example
Use both prep hours and GPA to predict SAT score. Now find the equation for the 3-D relationship (a plane rather than a line).

The standardized regression weights (βs) are built from the simple correlations; notice that their formulas look like the semi-partial correlation. The βs are then converted to raw-score regression weights, e.g. $b_2 = \beta_2 \left( \frac{s_{\text{est } y}}{s_{\text{est } x_2}} \right)$, and the intercept is

$$a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2$$
ETS Example
Use the rs to get the βs:

$r_{x_1 x_2} = -0.54$, $r_{x_1 y} = -0.22$, $r_{x_2 y} = 0.72$

$$\beta_1 = \frac{r_{x_1 y} - r_{x_2 y}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \qquad \beta_2 = \frac{r_{x_2 y} - r_{x_1 y}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} = 0.84$$

$$b_2 = \beta_2 \left( \frac{s_{\text{est } y}}{s_{\text{est } x_2}} \right) = 353$$

$$a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 = 110$$
Therefore, we can solve for the $B_j$: $B_j = R_{ij}^{-1} R_{jy}$ (in matrix form, this is really easy!). Don't worry about actually calculating these, but be sure you understand the equation!
The same principle for obtaining the intercept in simple regression applies as well:
$$a = \bar{y} - \sum_j b_j \bar{x}_j$$
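A minimal NumPy sketch of the matrix form (made-up data again; the correlation matrices play the roles of $R_{ij}$ and $R_{jy}$, and the βs are rescaled to raw-score weights):

```python
import numpy as np

# Made-up raw data: rows are cases, columns are the k predictors
X = np.array([[8.0, 2.8], [5.0, 3.75], [9.0, 2.6], [4.0, 3.8], [7.0, 3.0]])
y = np.array([1050.0, 1280.0, 1010.0, 1300.0, 1120.0])

R_xx = np.corrcoef(X, rowvar=False)                                           # R_ij
R_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])   # R_jy

betas = np.linalg.inv(R_xx) @ R_xy                   # B_j = R_ij^{-1} R_jy
b = betas * y.std(ddof=1) / X.std(axis=0, ddof=1)    # raw-score weights
a = y.mean() - b @ X.mean(axis=0)                    # a = y_bar - sum_j b_j * x_j_bar
print(betas, b, a)
```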
How far are the points in 3-D space from the plane defined by the equation?
Explained Variance
In addition to simple ($r_{xy}$), partial ($r_{yx_1 \cdot x_2}$), & semi-partial ($r_{y(x_1 \cdot x_2)}$) correlation coefficients, we can have a multiple correlation coefficient ($R_{y \cdot x_1 x_2}$). $R_{y \cdot x_1 x_2}$ = the correlation between the observed value of y and the predicted value of y.

It can be expressed in terms of beta weights and simple correlation coefficients:

$$R_{y \cdot x_1 x_2} = \sqrt{\beta_1 r_{yx_1} + \beta_2 r_{yx_2}} \quad \text{OR} \quad R^2_{y \cdot x_1 x_2} = \beta_1 r_{yx_1} + \beta_2 r_{yx_2}$$
Explained Variance
$R^2_{y \cdot x_1 x_2} = \beta_1 r_{yx_1} + \beta_2 r_{yx_2}$. Any $\beta_i$ represents the contribution of variable $x_i$ to predicting y. The more general version of this equation is simply $R^2 = \sum_j \beta_j r_{yx_j}$, or in matrix form, $R^2 = B_j R_{jy}$.
(Just add up the products of the βs and the rs.)
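A one-line sketch of that sum; the $\beta_1$ value here is computed from the slide's correlations and should be treated as illustrative:

```python
import numpy as np

def r_squared_from_betas(betas, r_yx):
    """R^2 = sum_j beta_j * r_{y x_j}: add up the products of the betas and the simple rs."""
    return float(np.dot(betas, r_yx))

betas = np.array([0.24, 0.84])     # beta_1 (illustrative), beta_2 from the slides
r_yx = np.array([-0.22, 0.72])     # r_{x1 y}, r_{x2 y} from the slides
print(r_squared_from_betas(betas, r_yx))   # about 0.55
```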
Explained Variance
$R^2_{y \cdot x_1 x_2} = \beta_1 r_{yx_1} + \beta_2 r_{yx_2}$
If x1 and x2 are uncorrelated: $\beta_1 = r_{yx_1}$ and $\beta_2 = r_{yx_2}$
[Venn diagram: X1 and X2 each overlap Y but not each other]
Explained Variance
$R^2_{y \cdot x_1 x_2} = \beta_1 r_{yx_1} + \beta_2 r_{yx_2}$
If x1 and x2 are correlated: the βs are corrected so that the overlap is not counted twice
[Venn diagram: X1 and X2 overlap each other as well as Y]
Adjusted R2
$R^2$ is a biased estimate of the population $R^2$ value. If you want to estimate the population value, use Adjusted $R^2$.
Most stats packages calculate both $R^2$ and Adjusted $R^2$. If not, the value can be obtained from $R^2$:

$$\text{Adj } R^2 = R^2 - \frac{k(1 - R^2)}{N - k - 1}$$
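A small sketch of the adjustment (the example numbers are assumed, not taken from the slides):

```python
def adjusted_r2(r2, n, k):
    """Adj R^2 = R^2 - k(1 - R^2)/(N - k - 1), with k predictors and n cases."""
    return r2 - k * (1 - r2) / (n - k - 1)

# Assumed example: R^2 = 0.55 with k = 2 predictors and N = 20 cases
print(adjusted_r2(0.55, 20, 2))   # about 0.497
```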
Significance Tests
In multiple regression, there are 3 different statistical tests that are of interest
Significance of R2
Is the fit of the regression model significant?
Partitioning Variance
$$\sum (y - \bar{y})^2 = \sum (y - y')^2 + \sum (y' - \bar{y})^2$$

Total variance in y (aka $SS_{\text{total}}$) = Unexplained variance (aka $SS_{\text{res}}$) + Explained variance (aka $SS_{\text{reg}}$)

Same as in simple regression! The only difference is that $y'$ is generated by a linear function of several independent variables (k predictors). Note: $SS_{\text{total}} = SS_{\text{regression}} + SS_{\text{residual}}$
Significance of R2
Need a ratio of variances (F value)
$$F = \frac{MS_{\text{Reg}}}{MS_{\text{Res}}} = \frac{SS_{\text{Reg}}/df_{\text{Reg}}}{SS_{\text{Res}}/df_{\text{Res}}}$$

Where do these values come from?
$SS_{\text{Reg}} = \sum (y' - \bar{y})^2$; $df_{\text{Reg}} = k$ (the number of regressors)
$SS_{\text{Res}} = \sum (y - y')^2$; $df_{\text{Res}} = N - k - 1$ (number of observations minus number of regressors minus 1)
F for the overall model reflects this ratio.
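A minimal sketch of the overall-model F test, assuming the observed y and the model's predicted y' are already in hand (SciPy supplies the F distribution):

```python
import numpy as np
from scipy import stats

def overall_f(y, y_pred, k):
    """Overall model F = (SS_reg / df_reg) / (SS_res / df_res), df_reg = k, df_res = N - k - 1."""
    n = len(y)
    ss_reg = np.sum((y_pred - y.mean()) ** 2)
    ss_res = np.sum((y - y_pred) ** 2)
    f = (ss_reg / k) / (ss_res / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)       # right-tail p-value
    return f, p
```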
Significant Increments to R2
As variables (predictors) are added to the regression, R2 can:
stay the same (the additional variable has NO contribution)
increase (the additional variable has some contribution)
Significant Increments to R2
Use an F test for the change in R2 ($F_{\Delta R^2}$), where L denotes the larger model and S the smaller (nested) model:

$$F_{\Delta R^2} = \frac{(R^2_L - R^2_S)/(k_L - k_S)}{(1 - R^2_L)/(N - k_L - 1)}$$
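A sketch of that test (the example values at the bottom are assumed purely for illustration):

```python
from scipy import stats

def r2_change_f(r2_large, r2_small, k_large, k_small, n):
    """F test for a significant increment to R^2 when predictors are added to a nested model."""
    f = ((r2_large - r2_small) / (k_large - k_small)) / ((1 - r2_large) / (n - k_large - 1))
    p = stats.f.sf(f, k_large - k_small, n - k_large - 1)
    return f, p

# Assumed example: adding one predictor raises R^2 from 0.52 to 0.55 with N = 20
print(r2_change_f(0.55, 0.52, k_large=2, k_small=1, n=20))
```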
Significance of Coefficients
Think about $b_j$ in t terms: $b_j / s_{\text{est } b_j}$ is distributed as a t with N - k - 1 degrees of freedom, where...

$$s_{\text{est } b_j} = \sqrt{\frac{SS_{\text{Res}}/(N - k - 1)}{SS_j (1 - R_j^2)}}$$

$SS_j$ = sum of squares for variable $x_j$
$R_j^2$ = squared multiple correlation for predicting $x_j$ from the remaining k - 1 predictors (treating $x_j$ as the predicted variable)
Significance of Coefficients
$$s_{\text{est } b_j} = \sqrt{\frac{SS_{\text{Res}}/(N - k - 1)}{SS_j (1 - R_j^2)}}$$
As $R_j^2$ increases, the denominator of the standard-error formula approaches 0; that is, $s_{\text{est } b_j}$ becomes larger. As the remaining xs account for $x_j$, $b_j$ is less likely to reach significance.
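A sketch of this t test for a single coefficient, computing $R_j^2$ by regressing $x_j$ on the other predictors (the function name and layout are ours; NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

def coef_t_test(X, y, j):
    """t test for the j-th regression weight: SE = sqrt[(SS_res/(N-k-1)) / (SS_j (1 - R_j^2))]."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ss_res = np.sum((y - X1 @ coeffs) ** 2)

    # R_j^2: how well the remaining predictors account for x_j
    O1 = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    cj, *_ = np.linalg.lstsq(O1, X[:, j], rcond=None)
    ss_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    r2_j = 1 - np.sum((X[:, j] - O1 @ cj) ** 2) / ss_j

    se = np.sqrt((ss_res / (n - k - 1)) / (ss_j * (1 - r2_j)))
    t = coeffs[j + 1] / se
    return t, 2 * stats.t.sf(abs(t), n - k - 1)    # two-tailed p-value
```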
Use different types of R2 values (different measures) to assess the importance of the IVs
Recommended for exploratory analyses of very large data sets (> 30 predictors)
With lots of predictors, keeping all but one constant may make it difficult to find anything significant
These procedures capitalize on chance to find the meaningful variables
Importance of Regressors
βs primarily serve to help define the equation for predicting y. The squared semi-partial correlation ($sr^2$) is more appropriate for assessing practical importance.
Put in terms of variance explained by each regressor: compare how much variance each regressor explains.
When the IVs are correlated, the $sr_j^2$ values for all of the $x_j$s will not sum to the $R^2$ for the full model.
Potential Problems
Several assumptions
(see Berry & Feldman pp. 10-11 in book)
Multicollinearity
Perfect collinearity: when one independent variable is perfectly linearly related to one or more of the other regressors
$x_1 = 2.3x_2 + 4$: x1 is perfectly predicted by x2
$x_1 = 4.1x_3 + 0.45x_4 + 11.32$: x1 is perfectly predicted by a linear combination of x3 and x4
Any case where there is an R2 value of 1.00 among the regressors (NOT including y)
Why might this be a problem?
Multicollinearity
Perfect collinearity (simplest case)
One variable is a linear function of another. We'd be thrilled (and skeptical) to see this in a simple regression. However...
Multicollinearity
Perfect collinearity (simplest case)
Problem in multiple regression: the y values will line up in a single plane rather than varying about a plane
Multicollinearity
Perfect collinearity (simplest case)
No way to determine the plane that fits the y values best; many possible planes fit equally well
Multicollinearity
In practice
perfect collinearity violates the assumptions of regression
less-than-perfect collinearity is more common
not an all-or-nothing situation: there can be varying degrees of multicollinearity
dealing with multicollinearity depends on what you want to know
Multicollinearity
Consequences
If the only goal is prediction, not a problem
plugging in known numbers will give you the unknown value
although the specific regression weights may vary, the final outcome will not
Multicollinearity
Detecting collinearities
Some clues
the full model is significant but none of the individual regressors reach significance
instability of weights across multiple samples
look at simple regression coefficients for all pairs
cumbersome way: regress each independent variable on all other independent variables to see if any R2 values are close to 1 (see the sketch below)
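A small sketch of that "cumbersome way" (the made-up third predictor is deliberately built as a near-combination of the other two):

```python
import numpy as np

def collinearity_r2(X):
    """For each predictor x_j, the R^2 from regressing x_j on all the other predictors.
    Values near 1 (tolerance = 1 - R^2 near 0) flag collinearity."""
    n, k = X.shape
    r2 = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2[j] = 1 - resid.var() / X[:, j].var()
    return r2

# Made-up predictors; the third column is nearly a combination of the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.column_stack([X, X[:, 0] + 0.5 * X[:, 1] + 0.01 * rng.normal(size=50)])
print(collinearity_r2(X))   # values near 1 flag the collinearity
```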
Multicollinearity
What can you do about it?
Increase the sample size
reduce the error
offset the effects of multicollinearity
If you know the relationship, you can use that information to offset the effect (yeah, right!)
Delete one of the variables causing the problem
which one? The one predicted by a group of the others? Is there a logical rationale? Presumably, the variables were there for theoretical reasons.
Multicollinearity
Detecting collinearities
SPSS: Collinearity diagnostics & follow-up
Tolerance: $1 - R^2$ for the regression of each IV against the remaining regressors
Collinearity: tolerance close to 0. Use this to locate the collinearity
To locate the collinearity, try removing the variable with the lowest tolerance
Suppression
Special case of multicollinearity. Suppressor variables are variables that increase the value of R2 by virtue of their correlations with other predictors and NOT with the dependent variable. The best way to explain this is by way of an example...
Suppression Example
Predicting course grade in a multivariate statistics course with GRE verbal and quantitative
The multiple correlation R was 0.62 (reasonable, right?). However, the βs were 0.58 for GRE-Q and -0.24 for GRE-V. Does this mean that higher GRE-V scores were associated with lower course performance? Not exactly.
Suppression Example
Why was the β for GRE-V negative?
The GRE-V alone actually had a small positive correlation with course grade.
The GRE-V and GRE-Q are highly correlated with each other.
The regression weights indicate that, for a given score on the GRE-Q, the lower a person scores on the GRE-V, the higher the predicted course grade.
Suppression Example
Another way to put it...
The GRE-Q is a good predictor of course grade, but part of the performance on GRE-Q is determined by GRE-V, so it favors people of high verbal ability. Suppose we have 2 people who score equally on GRE-Q but differently on GRE-V:
Bob scores 500 on GRE-Q and 600 on GRE-V
Jane scores 500 on GRE-Q and 500 on GRE-V
What happens to the predictions about course grade?
Suppression Example
Another way to put it...
Bob: 500 on GRE-Q and 600 on GRE-V
Jane: 500 on GRE-Q and 500 on GRE-V
Based on the verbal scores, we would predict that Bob should have better quantitative skills than Jane, but he does not score better. Thus, Bob must actually have LESS quantitative knowledge than Jane, so we would predict his course grade to be lower. This is equivalent to giving GRE-V a negative regression weight, despite its positive simple correlation.
Suppression
More generally...
If x2 is a better measure of the source of errors in x1 than in y, then giving x2 a negative regression weight will improve our predictions of y. x2 subtracts out/corrects for (i.e., suppresses) sources of error in x1. Suppression seems counterintuitive, but it actually improves the model.
Suppression
More generally...
Suppressor variables are usually considered bad: they can cause misinterpretation (the GRE example). However, careful exploration can:
enlighten our understanding of the interplay of variables
improve our prediction of y
Easy to identify:
significant regression weights where b/β (regression) & r (simple correlation) have opposite signs
Practical Issues
Number of cases
Must exceed the number of predictors (N > k). The acceptable N/k ratio depends on:
the reliability of the data
the researcher's goals
Practical Issues
Outliers
Correlation is extremely sensitive to outliers. Easiest to show with simple correlation: [scatterplots comparing $r_{xy} = +0.59$ and $r_{xy} = -0.03$]
Outliers should be assessed for the DV and all IVs. Ideally, we would identify multivariate outliers, but this is not practical.
Practical Issues
Linearity
Multiple regression assumes a linear relationship between the DV and each IV. If relationships are non-linear, multiple regression may not be appropriate. Transformations may rectify non-linearity:
logs
reciprocals
Practical Issues
Normality
Normally distributed relationship between y' and the residuals (y - y'). Violation affects statistical conclusions, but not the validity of the model.
Homoscedasticity
The multivariate version of homogeneity of variance. Violation affects statistical conclusions, but not the validity of the model.
[Residual plots: residuals (y - y') against predicted y' values for four cases: Assumptions Met, Normality Violated, Linearity Violated, Homoscedasticity Violated]