
STATS 330: Lecture 8

Collinearity

6.08.2014
R-hint(s) of the day
Random numbers and variables
> rnorm(5, mean = 2,sd = 1)
[1] 1.9199447 3.8475595 2.8962234 2.6015305 0.8656212
> runif(5, min = 2, max = 5)
[1] 4.409922 4.232709 3.444322 3.192482 4.263457
> sample(2:5, size = 5, replace = T)
[1] 2 2 4 2 5
> rpois(5, lambda = 3)
[1] 2 2 3 5 4

Writing functions
> mymean <- function(x){sum(x)/length(x)}
> test <- rnorm(20)
> mymean(test)-mean(test)
[1] 0
R-hint(s) of the day
Manipulating functions (e.g., pairs20x)
> pairs20x
function (x, ...)
{
    panel.hist <- function(x, ...) {
        usr <- par("usr")
        on.exit(par(usr))
        par(usr = c(usr[1:2], 0, 1.5))
        h <- hist(x, plot = FALSE)
        breaks <- h$breaks
        nB <- length(breaks)
        y <- h$counts
        y <- y/max(y)
        rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
    }
    ...
    pairs(x, upper.panel = panel.smooth,
          lower.panel = panel.cor, diag.panel = panel.hist, ...)
}
Aims of today's lecture

- To explain the idea of collinearity and its connection with estimating regression coefficients
- To discuss added variable plots, a graphical method for deciding if a variable should be added to a regression
Variance of regression coefficients

- We saw in Lecture 6 how the standard errors of the regression coefficients depend on the error variance σ²: the bigger σ², the bigger the standard errors.
- We also suggested that the standard errors depend on the arrangement of the x's.
- In today's lecture, we explore this idea a bit further.


Example

- Suppose we have a regression relationship of the form

  $$Y = 1 + 2x_1 - x_2 + \varepsilon$$

  between the response variable Y and two covariates x1 and x2.

- Consider two data sets, A and B, each following the model above but...
Relationship between x1 and x2

Dataset A, correlation 0.035        Dataset B, correlation 0.989

[Scatter plots of x2 against x1 for data sets A and B; both axes run from -4 to 4.]
Fitted planes for data sets A and B

Dataset A, correlation 0.035 Dataset B, correlation 0.989

[Perspective plots of the fitted regression planes over the (x1, x2) plane for data sets A and B.]
Conclusion

- The greater the correlation between the covariates, the more variable the fitted plane.
- In fact, for the coefficient β1 of x1 we have

  $$\mathrm{Var}(b_1) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_1)\,(1 - r^2)}$$

  where r is the correlation between x1 and x2.
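
To see this variance formula in action, here is a small simulation sketch (not from the slides; the helper make.data and the chosen correlations are just for illustration). It generates data sets like A and B from the model above and compares the standard errors of b1 reported by lm:

> make.data <- function(n = 50, rho = 0) {
+   # covariates with correlation about rho, response from Y = 1 + 2*x1 - x2 + error
+   x1 <- rnorm(n)
+   x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
+   y <- 1 + 2 * x1 - x2 + rnorm(n)
+   data.frame(y, x1, x2)
+ }
> set.seed(330)
> A <- make.data(rho = 0.03)   # little correlation
> B <- make.data(rho = 0.99)   # high collinearity
> summary(lm(y ~ x1 + x2, data = A))$coefficients["x1", "Std. Error"]
> summary(lm(y ~ x1 + x2, data = B))$coefficients["x1", "Std. Error"]

With the same n and error variance, the second standard error should be far larger, in line with the 1/(1 - r²) inflation.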
Generalisation

If we have k explanatory variables, then the variance of the j-th estimated coefficient is

$$\mathrm{Var}(b_j) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_j)\,(1 - R_j^2)}$$

where Rj² is the R² value from regressing the j-th explanatory variable on the other covariates.
Best case

If the j-th variable is orthogonal to (perpendicular to, uncorrelated with) the other explanatory variables, then Rj² is equal to zero and the variance is the smallest possible, i.e.

$$\mathrm{Var}(b_j) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_j)}$$
Variance inflation factor

The factor

$$\frac{1}{1 - R_j^2}$$

represents the increase in variance caused by correlation between the explanatory variables and is called the variance inflation factor (VIF).
Calculating the VIF: Theory

To calculate the VIF for the j-th explanatory variable, use the relationship

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{RSS}_j/\mathrm{TSS}_j} = \frac{\mathrm{TSS}_j}{\mathrm{RSS}_j} = \frac{\mathrm{Var}(\text{variable } j)}{\mathrm{Var}(\text{residuals})}$$

using the residuals from regressing the j-th explanatory variable on the other covariates.
Calculating the VIF: Example

For the petrol data, calculate the VIF for t.vp


> attach(vapour.df)
> tvp.reg <- lm(t.vp~t.temp+p.temp+p.vp)
> var(t.vp)/var(residuals(tvp.reg))
[1] 66.13817
Correlation with the other covariates inflates the variance of this coefficient by a factor of about 66.
Calculating the VIF: The quick method

A useful mathematical relationship: suppose we calculate the inverse of the correlation matrix of the explanatory variables. Then the VIFs are the diagonal elements.
> vapour.covariates <- vapour.df[-5] # remove response hc
> VIF <- diag(solve(cor(vapour.covariates)))
> VIF
t.temp p.temp t.vp p.vp
11.927292 5.615662 66.138172 60.938695
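
If the car package happens to be installed, the same numbers can also be obtained directly from a fitted model with its vif() function (an alternative to the matrix-inverse trick, not used in these slides):

> vapour.lm <- lm(hc ~ ., data = vapour.df)
> library(car)       # assumes the car package is available
> vif(vapour.lm)     # should match the diagonal of solve(cor(vapour.covariates))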
Pairs plot
[pairs20x plot of the vapour data (hc, t.temp, p.temp, t.vp, p.vp). Lower-panel correlations: t.temp-hc 0.81; p.temp-hc 0.88, p.temp-t.temp 0.81; t.vp-hc 0.85, t.vp-t.temp 0.94, t.vp-p.temp 0.77; p.vp-hc 0.91, p.vp-t.temp 0.93, p.vp-p.temp 0.83, p.vp-t.vp 0.98.]
Collinearity

- If one or more variables in a regression have big VIFs, the regression is said to be collinear.
- Caused by one or more variables being almost linear combinations of the others.
- Sometimes indicated by high correlations between the supposedly independent variables.
- Results in imprecise estimation of regression coefficients.
- Standard errors are high, so t-statistics are small and variables are often non-significant (the data are insufficient to detect a difference).
Non-significance

- If a variable has a non-significant t, then either
  - the variable is not related to the response, or
  - the variable is related to the response, but it is not required in the regression because it is strongly related to a third variable that is in the regression, so we do not need both.
- First case: small t-value, small VIF, small correlation with response.
- Second case: small t-value, big VIF, high correlation with response.
Remedy

- The usual remedy is to drop one or more variables from the model.
- This breaks the linear relationship between the variables.
- This leads to the problem of subset selection: which subset to choose?
Example: Cement data

- Measurements on batches of cement
- Response variable y: Heat (emitted)
- Explanatory variables:
  - x1: amount of tricalcium aluminate (Ca3Al2O6, %)
  - x2: amount of tricalcium silicate (Ca3SiO5, %)
  - x3: amount of tetracalcium aluminoferrite (Ca2(Al,Fe)2O5, %)
  - x4: amount of dicalcium silicate (Ca2SiO4, %)
Cement data: Model

> cement.lm <- lm(y~x1+x2+x3+x4,data=cement)


> summary(cement.lm)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.4054 70.0710 0.891 0.3991
x1 1.5511 0.7448 2.083 0.0708 .
x2 0.5102 0.7238 0.705 0.5009
x3 0.1019 0.7547 0.135 0.8959
x4 -0.1441 0.7091 -0.203 0.8441
...
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
Cement data: Correlation

> round(cor(cement),2)
x1 x2 x3 x4 y
x1 1.00 0.23 -0.82 -0.25 0.73
x2 0.23 1.00 -0.14 -0.97 0.82
x3 -0.82 -0.14 1.00 0.03 -0.53
x4 -0.25 -0.97 0.03 1.00 -0.82
y 0.73 0.82 -0.53 -0.82 1.00
Cement data: VIF

> diag(solve(cor(cement[,-5])))
       x1        x2        x3        x4
 38.49621 254.42317  46.86839 282.51286

> apply(cement[,-5],1,sum)
 1  2  3  4  5  6  7  8  9 10 11 12 13
99 97 95 97 98 97 97 98 96 98 98 98 98

The four ingredients make up nearly 100% of every batch, so x1 + x2 + x3 + x4 is almost constant: an almost exact linear relationship that explains the huge VIFs.
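
A further (illustrative, not on the slide) check of this near-exact linear relationship is to regress one ingredient on the others; its R² is the Rj² behind the VIF of 282.5 for x4:

> summary(lm(x4 ~ x1 + x2 + x3, data = cement))$r.squared
> 1 - 1/282.51286   # the R^2 implied by the VIF, about 0.996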
Cement data: Drop x4

> diag(solve(cor(cement[,-c(4,5)])))
x1 x2 x3
3.251068 1.063575 3.142125
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.19363 3.91330 12.315 6.17e-07 ***
x1 1.69589 0.20458 8.290 1.66e-05 ***
x2 0.65691 0.04423 14.851 1.23e-07 ***
x3 0.25002 0.18471 1.354 0.209
...
Residual standard error: 2.312 on 9 degrees of freedom
Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08
Collinearity

- If covariates are strongly correlated, their coefficients may not be significantly different from 0.
- Use the VIF to see how much such correlation inflates the variance of a coefficient.
- Stepwise removal of variables can solve the problem (see the sketch below).
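
A rough sketch of that idea (my own illustration, not R330 code; the cutoff of 10 is only a common rule of thumb): repeatedly drop the covariate with the largest VIF until all remaining VIFs are acceptable.

> drop.high.vif <- function(X, cutoff = 10) {
+   # X: data frame of explanatory variables only
+   repeat {
+     vifs <- diag(solve(cor(X)))
+     if (max(vifs) < cutoff || ncol(X) <= 2) break   # stop when VIFs are acceptable or few variables remain
+     X <- X[, -which.max(vifs), drop = FALSE]        # drop the worst offender
+   }
+   X
+ }
> names(drop.high.vif(cement[, -5]))   # for the cement data this keeps x1, x2, x3

For the cement data this reproduces the choice made on the previous slide: x4 has the largest VIF, and once it is removed all remaining VIFs are small.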


Added Variable Plots (AVPs)

- To see if a variable, x say, is needed in a regression:
  - Step 1: Calculate the residuals from regressing the response on all the explanatory variables except x;
  - Step 2: Calculate the residuals from regressing x on the other covariates;
  - Step 3: Plot the first set of residuals versus the second set.
- Also called partial regression plots in some books.

Rationale

- The first set of residuals represents the variation in y not explained by the other explanatory variables.
- The second set of residuals represents the part of x not explained by the other explanatory variables.
- If there is a relationship between the two sets, there is a relationship between x and the response y that is not accounted for by the other explanatory variables.
- Thus, if we see a relationship in the plot, x is needed in the regression.
Example: Petrol data

> rest.lm <- lm(hc~t.temp+p.temp+p.vp,data=vapour.df)
> y.res <- residuals(rest.lm)
> tvp.lm <- lm(t.vp~t.temp+p.temp+p.vp,data=vapour.df)
> tvp.res <- residuals(tvp.lm)
> plot(tvp.res,y.res,xlab="Tank vapour pressure",
+      ylab="Hydrocarbon emission",pch=16,col="steelblue")
> abline(lm(y.res~tvp.res),lwd=2)
> abline(h=0,lty=2,col="gray",lwd=2)
AVP for Tank vapour pressure








[Added variable plot: hydrocarbon emission residuals (y.res) plotted against tank vapour pressure residuals (tvp.res, roughly -0.4 to 0.4), with the fitted least squares line and a dashed horizontal line at zero.]


Shortcut in R

The R330 function added.variable.plots draws AVPs automatically.
> library(R330)
> data(vapour.df)
> vapour.lm <- lm(hc~.,data=vapour.df)
> par(mfrow=c(2,2)) # 2x2 array of plots
> added.variable.plots(vapour.lm)
AVP for Tank vapour pressure
[Output of added.variable.plots(vapour.lm): a 2x2 array of added variable plots labelled "Partial plot of t.temp", "Partial plot of p.temp", "Partial plot of t.vp" and "Partial plot of p.vp", each showing residuals plotted against residuals.]
Some curious facts about AVPs

- Since residuals always have zero mean, a line fitted through the plot by least squares will always go through the origin.
- The slope of this line is b_k, the estimate of the regression coefficient β_k in the full regression using x1, ..., xk.
- The amount of scatter about the least squares line reflects how important xk is as a predictor.
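
The second fact is easy to verify with the petrol objects created earlier (y.res and tvp.res from the AVP example, vapour.lm from the shortcut slide):

> coef(lm(y.res ~ tvp.res))["tvp.res"]   # slope of the added variable plot
> coef(vapour.lm)["t.vp"]                # coefficient of t.vp in the full regression; the two agree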
http://doonesbury.washingtonpost.com/strip/archive/2014/06/08
