Collinearity
Two independent variables, X1 and X2, are collinear when they are correlated with each other.
In a multiple regression study, we assume that the X-variables are independent of each other. We also assume that each X-variable contains a unique piece of information about Y.

Y = β0 + β1X1 + β2X2 + ε

We only need Y to depend on the X's.
In the multiple linear regression model

Y = β0 + β1X1 + β2X2 + ε

we interpret

β1 = the change in Y for a 1-unit change in X1, while X2 is held constant

and

β2 = the change in Y for a 1-unit change in X2, while X1 is held constant.

However, if X1 and X2 are correlated, then

β1 ≠ the change in Y for a 1-unit change in X1, with X2 held constant

and

β2 ≠ the change in Y for a 1-unit change in X2, with X1 held constant.
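To make this concrete, here is a minimal simulation (the data, coefficients, and seed are all hypothetical; numpy only) in which X2 is generated to be strongly correlated with X1. The fitted coefficient on X1 changes markedly depending on whether X2 is in the model, so b1 cannot be read as "the change in Y per unit change in X1 with X2 held constant":

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # X2 strongly correlated with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full model: regress Y on X1 and X2
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Reduced model: regress Y on X1 alone
X_red = np.column_stack([np.ones(n), x1])
b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)

print("b1 with X2 in the model:", b_full[1])  # close to the true 2.0
print("b1 with X2 dropped:     ", b_red[1])   # absorbs X2's effect (about 2 + 3*0.9)
```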
Effects of multicollinearity
1. The variances (and the standard errors) of the regression coefficient estimators (i.e., the bi) are inflated. This means that var(bi) is too large (a simulation illustrating this appears after the list).
2. The magnitudes of the bi may be different from what we expect.
3. The signs of the bi may be the opposite of what we expect.
4. Adding or removing any of the X-variables produces large changes in the values of the remaining bi or in their signs.
5. Sometimes removing a data point causes large changes in the values of the bi or in their signs.
6. In some cases, the overall F-test is significant, but the individual t-tests on the bi are not.
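The following sketch (simulated data; the sample size, coefficients, and correlation are illustrative assumptions) demonstrates effect 1: across repeated samples, the variance of b1 is roughly ten times larger when corr(X1, X2) = 0.95 than when the X's are uncorrelated, matching 1/(1 − 0.95²) ≈ 10.3:

```python
import numpy as np

def sampled_var_b1(rho, reps=2000, n=50, seed=1):
    """Sampling variance of b1 when corr(X1, X2) = rho."""
    rng = np.random.default_rng(seed)
    b1s = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        b1s.append(b[1])
    return np.var(b1s)

print("var(b1), rho = 0:   ", sampled_var_b1(0.0))
print("var(b1), rho = 0.95:", sampled_var_b1(0.95))
```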
Tests for multicollinearity
1. Calculate the correlation coefficient (r) for each pair of the X-variables. If any of the r-values is significantly different from zero, then the independent variables involved may be collinear. The significance of r can be tested with

t = r√(n − 2) / √(1 − r²),

which has a t distribution with n − 2 degrees of freedom under H0: ρ = 0.

Caveat: even if the r for any two X-variables is small, three independent variables X1, X2, and X3 may still be highly correlated as a group.
2. Check whether the variance inflation factor (VIF) is too high.

Rule of thumb: collinearity exists if VIF > 5. A VIF of 10, for example, means that var(bi) is 10 times what it would be if no collinearity existed (with no collinearity, VIF = 1).

VIF is a more rigorous check for collinearity than the correlation coefficient.
In the regression model Y = β0 + β1X1 + β2X2 + β3X3 + ε, the VIF for Xi is

VIFi = 1 / (1 − R²i),

where R²i is the coefficient of determination from regressing Xi on the other X-variables. For example, R²1 is obtained from

X1 = β0 + β2X2 + β3X3 + ε

Similarly,

X2 = β0 + β1X1 + β3X3 + ε => to obtain R²2
X3 = β0 + β1X1 + β2X2 + ε => to obtain R²3

A code sketch of both checks follows.
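Both checks can be sketched in a few lines of numpy; r_t_stat and vif below are illustrative helper functions (not from any particular library), and the simulated X matrix is a hypothetical stand-in for real data:

```python
import numpy as np

def r_t_stat(x, u):
    """Pairwise r and its t statistic for H0: rho = 0 (df = n - 2)."""
    n = len(x)
    r = np.corrcoef(x, u)[0, 1]
    return r, r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

def vif(X, j):
    """VIFj = 1 / (1 - R²j), with R²j from regressing column j on the rest."""
    n = X.shape[0]
    xj = X[:, j]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    resid = xj - others @ coef
    r2 = 1 - resid.var() / xj.var()
    return 1.0 / (1.0 - r2)

# Hypothetical usage: X2 built to be nearly collinear with X1, X3 independent
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     0.9 * x1 + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])
print("r, t for (X1, X2):", r_t_stat(X[:, 0], X[:, 1]))
print("VIFs:", [round(vif(X, j), 2) for j in range(X.shape[1])])
```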
Solutions for multicollinearity
1. Drop the variables causing the problem.

If a large number of X-variables is in use, a stepwise regression could be used to determine which of the variables to drop.

Removing collinear X-variables is the simplest method of solving the multicollinearity problem.
2. If all the X-variables are retained, then avoid making inferences about the individual β parameters. Also, restrict inferences about the mean value of Y to values of the X's that lie within the experimental region.
3. Re-code the form of the independent variables.

Example: if X1 and X2 are collinear, you might try using X1 and the ratio X2/X1 instead.
4. Ridge regression: an alternative estimation procedure to ordinary least squares (a brief sketch follows).
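As a brief sketch of ridge regression (assuming the X-columns are standardized and Y is centered so that no intercept needs penalizing; the ridge constant k = 1 is purely illustrative), the estimator solves (X'X + kI)b = X'y instead of the ordinary normal equations:

```python
import numpy as np

def ridge(X, y, k):
    """Ridge estimator: solve (X'X + kI) b = X'y for b."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Simulated, nearly collinear predictors (hypothetical data)
rng = np.random.default_rng(3)
x1 = rng.normal(size=60)
X = np.column_stack([x1, 0.95 * x1 + 0.05 * rng.normal(size=60)])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
y = X @ np.array([2.0, 3.0]) + rng.normal(size=60)
y = y - y.mean()                            # center the response

print("OLS   (k = 0):", ridge(X, y, 0.0))  # unstable when X'X is near-singular
print("Ridge (k = 1):", ridge(X, y, 1.0))  # shrunken, more stable
```

Larger k shrinks the coefficients further; in practice k is often chosen from a ridge trace or by cross-validation.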
Example
It is believed that increases in the tar and nicotine content of a cigarette are accompanied by an increase in the carbon monoxide emitted in the cigarette smoke. We wish to model carbon monoxide content, Y, as a function of tar (X1), nicotine (X2), and weight (X3). The following model is specified:
Y= β0 + β1X1 + β2X2 + β3X3 + ε
Y = carbon monoxide
X1 = tar (milligram)
X2 = nicotine (milligram)
X3 = weight (gram)
A cross-sectional dataset of 25 observations representing 25 brands of
cigarettes is presented
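A sketch of how the specified model could be fit and screened for collinearity with statsmodels; the file name cigarettes.csv and the column names tar, nicotine, weight, and co are hypothetical stand-ins for the 25-brand dataset described above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("cigarettes.csv")                  # hypothetical file
X = sm.add_constant(df[["tar", "nicotine", "weight"]])
model = sm.OLS(df["co"], X).fit()
print(model.summary())   # compare the overall F-test with the individual t-tests

# VIF for each X-variable (skipping the constant column)
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))
```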