
STATS 330: Lecture 8

Collinearity

6.08.2014
R-hint(s) of the day
Random numbers and variables
> rnorm(5, mean = 2,sd = 1)
[1] 1.9199447 3.8475595 2.8962234 2.6015305 0.8656212
> runif(5, min = 2, max = 5)
[1] 4.409922 4.232709 3.444322 3.192482 4.263457
> sample(2:5, size = 5, replace = T)
[1] 2 2 4 2 5
> rpois(5, lambda = 3)
[1] 2 2 3 5 4

Writing functions
> mymean <- function(x){sum(x)/length(x)}
> test <- rnorm(20)
> mymean(test)-mean(test)
[1] 0
R-hint(s) of the day
Manipulating functions (e.g., pairs20x)
> pairs20x
function (x, ...)
{
    panel.hist <- function(x, ...) {
        usr <- par("usr")
        on.exit(par(usr))
        par(usr = c(usr[1:2], 0, 1.5))
        h <- hist(x, plot = FALSE)
        breaks <- h$breaks
        nB <- length(breaks)
        y <- h$counts
        y <- y/max(y)
        rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
    }
    ...
    pairs(x, upper.panel = panel.smooth,
          lower.panel = panel.cor, diag.panel = panel.hist, ...)
}
Aims of today's lecture

- To explain the idea of collinearity and its connection with estimating regression coefficients
- To discuss added variable plots, a graphical method for deciding if a variable should be added to a regression
Variance of regression coefficients

- We saw in Lecture 6 how the standard errors of the regression coefficients depend on the error variance σ²: the bigger σ², the bigger the standard errors.
- We also suggested that the standard errors depend on the arrangement of the x's.
- In today's lecture, we explore this idea a bit further.


Example

- Suppose we have a regression relationship of the form

  $$Y = 1 + 2x_1 - x_2 + \varepsilon$$

  between the response variable Y and two covariates x1 and x2.

- Consider two data sets, A and B, each following the model above but...
Relationship between x1 and x2

Dataset A, correlation 0.035        Dataset B, correlation 0.989

[Scatter plots of x2 against x1 for data sets A and B; both axes run from -4 to 4.]
Fitted planes for data sets A and B

Dataset A, correlation 0.035 Dataset B, correlation 0.989

[Perspective plots of the fitted regression planes over the (x1, x2) plane for data sets A and B.]
Conclusion

- The greater the correlation between the covariates, the more variable the fitted plane.
- In fact, for the coefficient β1 of x1 we have

  $$\mathrm{Var}(b_1) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_1)\,(1 - r^2)}$$

  where r is the correlation between x1 and x2.
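
To see this variance formula in action, here is a small simulation sketch (not from the slides; the helper make.data and the chosen correlations are just for illustration). It generates data sets like A and B from the model above and compares the standard errors of b1 reported by lm:

> make.data <- function(n = 50, rho = 0) {
+   # covariates with correlation about rho, response from Y = 1 + 2*x1 - x2 + error
+   x1 <- rnorm(n)
+   x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
+   y <- 1 + 2 * x1 - x2 + rnorm(n)
+   data.frame(y, x1, x2)
+ }
> set.seed(330)
> A <- make.data(rho = 0.03)   # little correlation
> B <- make.data(rho = 0.99)   # high collinearity
> summary(lm(y ~ x1 + x2, data = A))$coefficients["x1", "Std. Error"]
> summary(lm(y ~ x1 + x2, data = B))$coefficients["x1", "Std. Error"]

With the same n and error variance, the second standard error should be far larger, in line with the 1/(1 - r²) inflation.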
Generalisation

If we have k explanatory variables, then the variance of the j-th estimated coefficient is

$$\mathrm{Var}(b_j) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_j)\,(1 - R_j^2)}$$

where Rj² is the R² value from regressing the j-th explanatory variable on the other covariates.
Best case

If the j-th variable is orthogonal to (perpendicular to, uncorrelated with) the other explanatory variables, then Rj² is equal to zero and the variance is the smallest possible, i.e.

$$\mathrm{Var}(b_j) = \frac{\sigma^2/(n-1)}{\mathrm{Var}(x_j)}$$
Variance inflation factor

The factor

$$\frac{1}{1 - R_j^2}$$

represents the increase in variance caused by correlation between the explanatory variables and is called the variance inflation factor (VIF).
Calculating the VIF: Theory

To calculate the VIF for the j-th explanatory variable, use the relationship

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{RSS}_j/\mathrm{TSS}_j} = \frac{\mathrm{TSS}_j}{\mathrm{RSS}_j} = \frac{\mathrm{Var}(\text{variable } j)}{\mathrm{Var}(\text{residuals})}$$

using the residuals from regressing the j-th explanatory variable on the other covariates.
Calculating the VIF: Example

For the petrol data, calculate the VIF for t.vp


> attach(vapour.df)
> tvp.reg <- lm(t.vp~t.temp+p.temp+p.vp)
> var(t.vp)/var(residuals(tvp.reg))
[1] 66.13817
Correlation with the other covariates inflates the variance of this coefficient by a factor of about 66.
Calculating the VIF: The quick method

A useful mathematical relationship: suppose we calculate the inverse of the correlation matrix of the explanatory variables. Then the VIFs are the diagonal elements.
> vapour.covariates <- vapour.df[-5] # remove response hc
> VIF <- diag(solve(cor(vapour.covariates)))
> VIF
t.temp p.temp t.vp p.vp
11.927292 5.615662 66.138172 60.938695
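
If the car package happens to be installed, the same numbers can also be obtained directly from a fitted model with its vif() function (an alternative to the matrix-inverse trick, not used in these slides):

> vapour.lm <- lm(hc ~ ., data = vapour.df)
> library(car)       # assumes the car package is available
> vif(vapour.lm)     # should match the diagonal of solve(cor(vapour.covariates))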
Pairs plot
[pairs20x plot of the vapour data (hc, t.temp, p.temp, t.vp, p.vp). Lower-panel correlations: t.temp-hc 0.81; p.temp-hc 0.88, p.temp-t.temp 0.81; t.vp-hc 0.85, t.vp-t.temp 0.94, t.vp-p.temp 0.77; p.vp-hc 0.91, p.vp-t.temp 0.93, p.vp-p.temp 0.83, p.vp-t.vp 0.98.]
Collinearity

- If one or more variables in a regression have big VIFs, the regression is said to be collinear.
- Caused by one or more variables being almost linear combinations of the others.
- Sometimes indicated by high correlations between the supposedly independent variables.
- Results in imprecise estimation of regression coefficients.
- Standard errors are high, so t-statistics are small and variables are often non-significant (the data are insufficient to detect a difference).
Non-significance

- If a variable has a non-significant t, then either
  - the variable is not related to the response, or
  - the variable is related to the response, but it is not required in the regression because it is strongly related to a third variable that is in the regression, so we do not need both.
- First case: small t-value, small VIF, small correlation with response.
- Second case: small t-value, big VIF, high correlation with response.
Remedy

- The usual remedy is to drop one or more variables from the model.
- This breaks the linear relationship between the variables.
- This leads to the problem of subset selection: which subset to choose?
Example: Cement data

- Measurements on batches of cement
- Response variable y: Heat (emitted)
- Explanatory variables:
  - x1: amount of tricalcium aluminate (Ca3Al2O6, %)
  - x2: amount of tricalcium silicate (Ca3SiO5, %)
  - x3: amount of tetracalcium aluminoferrite (Ca2(Al,Fe)2O5, %)
  - x4: amount of dicalcium silicate (Ca2SiO4, %)
Cement data: Model

> cement.lm <- lm(y~x1+x2+x3+x4,data=cement)


> summary(cement.lm)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.4054 70.0710 0.891 0.3991
x1 1.5511 0.7448 2.083 0.0708 .
x2 0.5102 0.7238 0.705 0.5009
x3 0.1019 0.7547 0.135 0.8959
x4 -0.1441 0.7091 -0.203 0.8441
...
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
Cement data: Correlation

> round(cor(cement),2)
x1 x2 x3 x4 y
x1 1.00 0.23 -0.82 -0.25 0.73
x2 0.23 1.00 -0.14 -0.97 0.82
x3 -0.82 -0.14 1.00 0.03 -0.53
x4 -0.25 -0.97 0.03 1.00 -0.82
y 0.73 0.82 -0.53 -0.82 1.00
Cement data: VIF

> diag(solve(cor(cement[,-5])))
       x1        x2        x3        x4
 38.49621 254.42317  46.86839 282.51286

> apply(cement[,-5],1,sum)
 1  2  3  4  5  6  7  8  9 10 11 12 13
99 97 95 97 98 97 97 98 96 98 98 98 98

The four ingredients make up nearly 100% of every batch, so x1 + x2 + x3 + x4 is almost constant: an almost exact linear relationship that explains the huge VIFs.
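
A further (illustrative, not on the slide) check of this near-exact linear relationship is to regress one ingredient on the others; its R² is the Rj² behind the VIF of 282.5 for x4:

> summary(lm(x4 ~ x1 + x2 + x3, data = cement))$r.squared
> 1 - 1/282.51286   # the R^2 implied by the VIF, about 0.996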
Cement data: Drop x4

> diag(solve(cor(cement[,-c(4,5)])))
x1 x2 x3
3.251068 1.063575 3.142125
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.19363 3.91330 12.315 6.17e-07 ***
x1 1.69589 0.20458 8.290 1.66e-05 ***
x2 0.65691 0.04423 14.851 1.23e-07 ***
x3 0.25002 0.18471 1.354 0.209
...
Residual standard error: 2.312 on 9 degrees of freedom
Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08
Collinearity

- If covariates are strongly correlated, their coefficients may not be significantly different from 0.
- Use the VIF to see how much such correlation inflates the variance of a coefficient.
- Stepwise removal of variables can solve the problem (see the sketch below).
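
A rough sketch of that idea (my own illustration, not R330 code; the cutoff of 10 is only a common rule of thumb): repeatedly drop the covariate with the largest VIF until all remaining VIFs are acceptable.

> drop.high.vif <- function(X, cutoff = 10) {
+   # X: data frame of explanatory variables only
+   repeat {
+     vifs <- diag(solve(cor(X)))
+     if (max(vifs) < cutoff || ncol(X) <= 2) break   # stop when VIFs are acceptable or few variables remain
+     X <- X[, -which.max(vifs), drop = FALSE]        # drop the worst offender
+   }
+   X
+ }
> names(drop.high.vif(cement[, -5]))   # for the cement data this keeps x1, x2, x3

For the cement data this reproduces the choice made on the previous slide: x4 has the largest VIF, and once it is removed all remaining VIFs are small.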


Added Variable Plots (AVPs)

- To see if a variable, x say, is needed in a regression:
  - Step 1: Calculate the residuals from regressing the response on all the explanatory variables except x;
  - Step 2: Calculate the residuals from regressing x on the other covariates;
  - Step 3: Plot the first set of residuals versus the second set.
- Also called partial regression plots in some books.

Rationale

- The first set of residuals represents the variation in y not explained by the other explanatory variables.
- The second set of residuals represents the part of x not explained by the other explanatory variables.
- If there is a relationship between the two sets, there is a relationship between x and the response y that is not accounted for by the other explanatory variables.
- Thus, if we see a relationship in the plot, x is needed in the regression.
Example: Petrol data

> rest.lm <- lm(hc~t.temp+p.temp+p.vp,data=vapour.df)
> y.res <- residuals(rest.lm)
> tvp.lm <- lm(t.vp~t.temp+p.temp+p.vp,data=vapour.df)
> tvp.res <- residuals(tvp.lm)
> plot(tvp.res,y.res,xlab="Tank vapour pressure",
+      ylab="Hydrocarbon emission",pch=16,col="steelblue")
> abline(lm(y.res~tvp.res),lwd=2)
> abline(h=0,lty=2,col="gray",lwd=2)
AVP for Tank vapour pressure








[Added variable plot: hydrocarbon emission residuals (y.res) plotted against tank vapour pressure residuals (tvp.res, roughly -0.4 to 0.4), with the fitted least squares line and a dashed horizontal line at zero.]


Shortcut in R

The R330 function added.variable.plots draws AVPs automatically.
> library(R330)
> data(vapour.df)
> vapour.lm <- lm(hc~.,data=vapour.df)
> par(mfrow=c(2,2)) # 2x2 array of plots
> added.variable.plots(vapour.lm)
AVP for Tank vapour pressure
[Output of added.variable.plots(vapour.lm): a 2x2 array of added variable plots labelled "Partial plot of t.temp", "Partial plot of p.temp", "Partial plot of t.vp" and "Partial plot of p.vp", each showing residuals plotted against residuals.]
Some curious facts about AVPs

- Since residuals always have zero mean, a line fitted through the plot by least squares will always go through the origin.
- The slope of this line is b_k, the estimate of the regression coefficient β_k in the full regression using x1, ..., xk.
- The amount of scatter about the least squares line reflects how important xk is as a predictor.
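
The second fact is easy to verify with the petrol objects created earlier (y.res and tvp.res from the AVP example, vapour.lm from the shortcut slide):

> coef(lm(y.res ~ tvp.res))["tvp.res"]   # slope of the added variable plot
> coef(vapour.lm)["t.vp"]                # coefficient of t.vp in the full regression; the two agree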
http://doonesbury.washingtonpost.com/strip/archive/2014/06/08
