Supplement 5 - Multiple Regression
M. Bremer
Note: We will reserve the term multiple regression for models with two or
more predictors and one response. There are also regression models with two or
more response variables. These models are usually called multivariate regression models.
In this chapter, we will introduce a new (linear algebra based) method for computing
the parameter estimates of multiple regression models. This more compact method
is convenient for models for which the number of unknown parameters is large.
Example: A multiple linear regression model with k predictor variables $X_1, X_2, \ldots, X_k$ and a response $Y$ can be written as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon.$$
As before, the $\epsilon$ are the residual terms of the model and the distribution assumption we place on the residuals will allow us later to do inference on the remaining model parameters. Interpret the meaning of the regression coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ in this model.
More complex models may include higher powers of one or more predictor variables, e.g.,
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \qquad (1)$$
or interaction effects of two or more variables, e.g.,
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon \qquad (2)$$
Note: Models of this type can be called linear regression models as they can be written as linear combinations of the $\beta$-parameters in the model. The $x$-terms are the weights, and it does not matter that they may be non-linear in $x$. Confusingly, models of type (1) are also sometimes called non-linear regression models or polynomial regression models, as the regression curve is not a line. Models of type (2) are usually called linear models with interaction terms.
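As a concrete illustration (not part of the original notes), models of types (1) and (2) could be specified in R roughly as follows, assuming a data frame dat with columns y, x1 and x2 (the names are assumptions):

    # Hypothetical data frame 'dat' with columns y, x1, x2 (names are assumptions).
    fit_poly  <- lm(y ~ x1 + I(x1^2), data = dat)      # type (1): quadratic in x1
    fit_inter <- lm(y ~ x1 + x2 + x1:x2, data = dat)   # type (2): interaction term
    # The shorthand x1 * x2 expands to x1 + x2 + x1:x2.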
It helps to develop a little geometric intuition when working with regression models.
Models with two predictor variables (say x1 and x2 ) and a response variable y can
be understood as a two-dimensional surface in space. The shape of this surface
depends on the structure of the model. The observations are points in space and
the surface is fitted to best approximate the observations.
Example: The simplest multiple regression model for two predictor variables is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$
The surface that corresponds to the model
$$y = 50 + 10 x_1 + 7 x_2$$
looks like this. It is a plane in $\mathbb{R}^3$ with different slopes in the $x_1$ and $x_2$ directions.
[Figure: surface plot of the plane $y = 50 + 10x_1 + 7x_2$.]
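For reference, a surface like the one in the figure can be drawn in R with persp(); this is only a sketch, and the plotting range is chosen arbitrarily for illustration:

    # Sketch: drawing the plane y = 50 + 10*x1 + 7*x2 (range chosen for illustration).
    x1 <- seq(-10, 10, length.out = 30)
    x2 <- seq(-10, 10, length.out = 30)
    y  <- outer(x1, x2, function(a, b) 50 + 10 * a + 7 * b)
    persp(x1, x2, y, theta = 30, phi = 20, xlab = "x1", ylab = "x2", zlab = "y")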
Example: For a simple linear model with two predictor variables and an interaction
term, the surface is no longer flat but curved.
y = 10 + x1 + x2 + x1 x2
[Figure: surface plot of $y = 10 + x_1 + x_2 + x_1 x_2$; the interaction term makes the surface curved.]
Example: Polynomial regression models with two predictor variables and interaction terms are quadratic forms. Their surfaces can have many different shapes depending on the values of the model parameters, with the contour lines being either parallel lines, parabolas or ellipses.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon$$
[Figures: surface plots of quadratic models for different choices of the parameters, illustrating the variety of possible shapes.]
As in simple linear regression, the parameters are estimated from $n$ observations. In observation form the model reads
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, \ldots, n$$
You can think of the observations as points in $(k+1)$-dimensional space if you like. Our goal in least-squares regression is to fit a hyper-plane into $(k+1)$-dimensional space that minimizes the sum of squared residuals
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2$$
Minimizing this expression with respect to $\beta_0, \beta_1, \ldots, \beta_k$ leads to the least-squares normal equations
$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_{i1} + \hat\beta_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$$
$$\hat\beta_0 \sum_{i=1}^{n} x_{i1} + \hat\beta_1 \sum_{i=1}^{n} x_{i1}^2 + \hat\beta_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{i1} x_{ik} = \sum_{i=1}^{n} x_{i1} y_i$$
$$\vdots$$
$$\hat\beta_0 \sum_{i=1}^{n} x_{ik} + \hat\beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \hat\beta_2 \sum_{i=1}^{n} x_{ik} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i$$
These equations are much more conveniently formulated with the help of vectors
and matrices.
Note: Bold-faced lower case letters will now denote vectors and bold-faced upper case letters will denote matrices. Greek letters cannot be bold-faced in LaTeX.
Whether a Greek letter denotes a random variable or a vector of random variables
should be clear from the context, hopefully.
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
With this compact notation, the linear regression model can be written in the form
$$\mathbf{y} = \mathbf{X}\beta + \epsilon$$
In linear algebra terms, the least-squares parameter estimates are the vectors $\hat\beta$ that minimize
$$\sum_{i=1}^{n} \epsilon_i^2 = \epsilon'\epsilon = (\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta)$$
[Figure: the fitted vector $\hat{\mathbf{y}}$ as the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.]
The $\hat{y}$ are the predicted values in our regression model that all lie on the regression hyper-plane. Suppose further that $\hat\beta$ satisfies the equation above. Then the residuals $\mathbf{y} - \hat{\mathbf{y}}$ are orthogonal to the columns of $\mathbf{X}$ (by the Orthogonal Decomposition Theorem) and thus
$$\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat\beta) = 0$$
$$\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\hat\beta = 0$$
$$\mathbf{X}'\mathbf{X}\hat\beta = \mathbf{X}'\mathbf{y}$$
These vector normal equations are the same normal equations that one could obtain from taking derivatives. To solve the normal equations (i.e., to find the parameter estimates $\hat\beta$), multiply both sides by the inverse of $\mathbf{X}'\mathbf{X}$. Thus, the least-squares estimator of $\beta$ is (in vector form)
$$\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
This of course works only if the inverse exists. If the inverse does not exist, the normal equations can still be solved, but the solution may not be unique. The inverse of $\mathbf{X}'\mathbf{X}$ exists if the columns of $\mathbf{X}$ are linearly independent, that is, if no column can be written as a linear combination of the other columns.
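As a sketch of this formula (using an illustrative response vector y and predictors x1, x2 that are not part of the original notes), the estimates can be computed directly from the normal equations and compared with R's built-in least-squares fit:

    # Illustrative names: y, x1, x2 are assumed numeric vectors of equal length.
    X        <- cbind(1, x1, x2)                  # design matrix with a column of ones
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # solves (X'X) beta = X'y
    coef(lm(y ~ x1 + x2))                         # same estimates from lm()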
The vector of fitted values $\hat{\mathbf{y}}$ in a linear regression model can be expressed as
$$\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}$$
The $n \times n$ matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is often called the hat-matrix. It maps the vector of observed values $\mathbf{y}$ onto the vector of fitted values $\hat{\mathbf{y}}$ that lie on the regression hyper-plane. The regression residuals can be written in different ways as
$$\hat\epsilon = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat\beta = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
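Continuing the sketch above (same assumed X and y), the hat matrix, fitted values and residuals can be computed directly:

    H     <- X %*% solve(t(X) %*% X) %*% t(X)     # n x n hat matrix
    y_hat <- H %*% y                              # fitted values H y
    e_hat <- (diag(nrow(X)) - H) %*% y            # residuals (I - H) y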
[Figure: scatterplot matrix of the Delivery Time data with the variables Time, Cases, and Distance.]
Look at the panels that describe the relationship between the response (here time)
and the predictors. Make sure that the pattern is somewhat linear (look for obvious
curves in which case the simple linear model without powers or interaction terms
would not be a good fit).
Caution: Do not rely too much on a panel of scatterplots to judge how well a multiple linear regression really works; it can be very hard to see. A perfectly fitting model can look like a random confetti plot if the predictor variables are themselves correlated.
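A scatterplot matrix like the one above can be produced with pairs(); the data frame name delivery and its column names time, cases and distance are assumptions used for illustration in these sketches:

    # Assumed data frame 'delivery' with columns time, cases, distance.
    pairs(delivery[, c("time", "cases", "distance")])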
If a regression model has only two predictor variables, it is also possible to create a
three-dimensional plot of the observations.
[Figure: three-dimensional scatterplot of the Delivery Time data, with Time plotted against Cases and Distance.]
Thus,
$$\hat\beta = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} 2.34123115 \\ 1.61590721 \\ 0.01438483 \end{pmatrix}$$
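A sketch of the corresponding fit in R, again assuming a data frame delivery with columns time, cases and distance (the names are illustrative, not taken from the original output):

    fit <- lm(time ~ cases + distance, data = delivery)
    coef(fit)
    # should reproduce the estimates above:
    # (Intercept) = 2.34123115, cases = 1.61590721, distance = 0.01438483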
Example: Read off the estimated residual variance from the output shown above.
Example: Find the covariance matrix of the least squares estimate vector $\hat\beta$.
The estimate of the residual variance can still be found via the residual sum of squares $SS_{Res}$, which has the same definition as in the simple linear regression case:
$$SS_{Res} = \sum_{i=1}^{n} \hat\epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
If the multiple regression model contains $k$ predictors, then the residual sum of squares has $n - k - 1$ degrees of freedom (we lose one degree of freedom for the estimation of each slope and the intercept). Thus
$$MS_{Res} = \frac{SS_{Res}}{n - k - 1} = \hat\sigma^2$$
The residual variance is model dependent: its estimate changes if additional predictor variables are included in the model or if predictors are removed. It is hard to say which one is the correct residual variance. We will learn later how to compare different models with each other. In general, a smaller residual variance is preferred in a model.
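Continuing the sketch of the delivery time fit, the residual variance estimate could be obtained as follows:

    n      <- nrow(delivery)
    k      <- 2                                   # number of predictors
    SS_Res <- sum(residuals(fit)^2)               # residual sum of squares
    MS_Res <- SS_Res / (n - k - 1)                # estimated residual variance
    summary(fit)$sigma^2                          # the same quantity from summary()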
Thus,
$$\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \qquad \hat\sigma^2 = \frac{(\mathbf{y} - \mathbf{X}\hat\beta)'(\mathbf{y} - \mathbf{X}\hat\beta)}{n}$$
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \mathbf{y}'\mathbf{y} - \hat\beta'\mathbf{X}'\mathbf{y}, \qquad SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \mathbf{y}'\mathbf{y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
The test statistic for the significance-of-regression test ($H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$) is
$$F = \frac{SS_R/k}{SS_{Res}/(n-k-1)} = \frac{MS_R}{MS_{Res}} \sim F_{k,\, n-k-1}$$
If the value of this test statistic is large, then the regression works well and at
least one predictor in the model is relevant for the response. The F -test statistic
and p-value are reported in the regression ANOVA table (columns F value and
Pr(>F)).
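In the sketch of the delivery time fit, the overall F-statistic and its degrees of freedom can be read from the model summary:

    summary(fit)              # last line reports the F-statistic and its p-value
    summary(fit)$fstatistic   # F value with numerator and denominator df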
Example: Read off and interpret the result of the F-test for significance of regression in the Delivery Time Example.
Assessing Model Adequacy: There are several ways in which to judge how well a specific model fits. We have already seen that, in general, a smaller residual variance is desirable. Other quantities that describe the goodness of fit of the model are $R^2$ and adjusted $R^2$. Recall that in the simple linear regression model, $R^2$ was simply the square of the correlation coefficient between the predictor and the response. This is no longer true in the multiple regression model. But there is another interpretation for $R^2$: in general, $R^2$ is the proportion of variation in the response that is explained through the regression on all the predictors in the model. Including more predictors in a multiple regression model will never decrease (and typically increases) the value of $R^2$, but using more predictors is not necessarily better. To weigh the proportion of variation explained against the number of predictors, we can use the adjusted $R^2$.
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-k-1)}{SS_T/(n-1)}$$
Here, $k$ is the number of predictors in the current model, and $SS_{Res}/(n-k-1)$ is actually the estimated residual variance of the model with $k$ predictors. The adjusted $R^2$ does not automatically increase when more predictors are added to the model, and it can be used as one tool in the arsenal of finding the best model for a given data set. A higher adjusted $R^2$ indicates a better fitting model.
Example: For the Delivery Time data, find R2 and the adjusted R2 for the model
with both predictor variables in the R-output.
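A sketch of how these quantities could be read off (or recomputed) for the delivery time fit from the earlier sketches:

    summary(fit)$r.squared                            # R-squared
    summary(fit)$adj.r.squared                        # adjusted R-squared
    SS_T <- sum((delivery$time - mean(delivery$time))^2)
    1 - (SS_Res / (n - k - 1)) / (SS_T / (n - 1))     # adjusted R-squared by hand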
Testing Individual Regression Coefficients: As in the simple linear regression model, we can formulate individual hypothesis tests for each slope (or even the intercept) in the model. For instance,
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_A: \beta_j \neq 0$$
tests whether the slope associated with the $j$th predictor is significantly different from zero. The test statistic for this test is
$$t = \frac{\hat\beta_j}{se(\hat\beta_j)} \sim t(df = n - k - 1)$$
Here, $se(\hat\beta_j)$ is the square root of the $j$th diagonal entry of the estimated covariance matrix $\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ of $\hat\beta$. This test is a marginal test.
Note: As we've seen before, every two-sided hypothesis test for a regression slope can also be reformulated as a confidence interval for the same slope. The 95% confidence intervals for the slopes can also be computed by R (command confint()).
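For the delivery time sketch, the marginal t-tests and the corresponding confidence intervals are obtained as:

    summary(fit)$coefficients    # estimates, standard errors, t values, Pr(>|t|)
    confint(fit, level = 0.95)   # 95% confidence intervals for intercept and slopes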
To test whether a subset of the slopes is zero simultaneously, partition the vector of regression coefficients as
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$$
where $\beta_1$ contains the intercept and the slopes for the first $k - p$ predictors and $\beta_2$ contains the remaining $p$ slopes. We want to test
$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_A: \beta_2 \neq 0$$
We will compare two alternative regression models to each other:
$$\text{(Full Model)} \quad \mathbf{y} = \mathbf{X}\beta + \epsilon \quad \text{with } SS_R(\beta) = \hat\beta'\mathbf{X}'\mathbf{y} \ \text{($k$ degrees of freedom)}$$
$$\text{(Reduced Model)} \quad \mathbf{y} = \mathbf{X}_1\beta_1 + \epsilon \quad \text{with } SS_R(\beta_1) = \hat\beta_1'\mathbf{X}_1'\mathbf{y} \ \text{($k - p$ degrees of freedom)}$$
With this notation, the regression sum of squares that describes the contribution of the slopes in $\beta_2$, given that $\beta_1$ is already in the model, becomes
$$SS_R(\beta_2 | \beta_1) = SS_R(\beta_1, \beta_2) - SS_R(\beta_1)$$
The test statistic that tests the hypotheses described above is
$$F = \frac{SS_R(\beta_2|\beta_1)/p}{MS_{Res}} \sim F_{p,\, n-k-1}$$
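A sketch of such a partial F-test in R, using the delivery time fit and (as an illustrative choice) letting the distance slope play the role of $\beta_2$:

    reduced <- lm(time ~ cases, data = delivery)             # model with beta_1 only
    full    <- lm(time ~ cases + distance, data = delivery)  # full model
    anova(reduced, full)   # extra-sum-of-squares F-test with p = 1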
More generally, an F-test of a linear hypothesis that imposes $r$ independent restrictions on the regression coefficients uses the test statistic
$$F = \frac{SS_H / r}{SS_{Res}/(n-k-1)} \sim F_{r,\, n-k-1}$$
where $SS_H$ is the sum of squares due to the hypothesis.
[Figure: joint (elliptical) confidence region for an intercept and slope.]
Other methods for constructing simultaneous confidence intervals include the Bonferroni method, which effectively splits the significance level $\alpha$ into as many equal portions as confidence intervals need to be computed (say $p$) and then computes each interval individually at level $1 - \alpha/p$.
A $(1-\alpha)\cdot 100\%$ prediction interval for a new observation at $\mathbf{x}_0$ is
$$\hat{y}_0 \pm t_{\alpha/2,\, n-k-1} \sqrt{\hat\sigma^2 \left( 1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 \right)}$$
Example: For the Delivery Time data, calculate a 95% prediction interval for the
time it takes to restock a vending machine with x1 = 8 cases if the driver has to
walk x2 = 275 feet.
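A sketch of this computation with predict(), using the assumed column names from the earlier delivery time sketches:

    new_obs <- data.frame(cases = 8, distance = 275)
    predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)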
Note: In an introductory regression class, you may have learned that it is dangerous to predict new observations outside of the range of data you have collected. For
instance, if you have data on the ages and heights of young girls, all between age 2
and 12, it would not be a good idea to use that linear regression model to predict
the height of a 25-year-old woman. This concept of being "outside the range" of the data has to be extended in multiple linear regression.
[Figure: an elliptical region containing the original data in the $(x_1, x_2)$ plane, together with a point $(x, y)$ that lies outside the ellipse but within the individual ranges of $x_1$ and $x_2$.]
Consider a regression problem with two predictor variables in which the collected data all fall within the ellipse in the picture shown on the right. The point $(x, y)$ has coordinates that are each within the ranges of the observed variables individually, but it would still not be a good idea to predict the value of the response at this point, because we have no data to check the validity of the model in the vicinity of the point.
Standardized regression coefficients: Two common ways to scale the data are unit normal scaling,
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad y_i^* = \frac{y_i - \bar{y}}{s_y}, \qquad i = 1, \ldots, n,$$
and unit length scaling,
$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \qquad y_i^0 = \frac{y_i - \bar{y}}{\sqrt{SS_T}}, \qquad i = 1, \ldots, n,$$
where $S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the corrected sum of squares for regressor $x_j$. In this case the regression model becomes
$$y_i^0 = b_1 w_{i1} + b_2 w_{i2} + \cdots + b_k w_{ik} + \epsilon_i, \qquad i = 1, \ldots, n$$
and the vector of scaled least-squares regression coefficients is $\hat{\mathbf{b}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y}^0$. The $\mathbf{W}'\mathbf{W}$ matrix is the correlation matrix for the $k$ predictor variables, i.e., $(\mathbf{W}'\mathbf{W})_{ij}$ is simply the correlation between $x_i$ and $x_j$.
The matrices $\mathbf{Z}$ in unit normal scaling and $\mathbf{W}$ in unit length scaling are closely related, and both methods will produce the exact same standardized regression coefficients $\hat{\mathbf{b}}$. The relationship between the original and scaled coefficients is
$$\hat\beta_j = \hat{b}_j \left( \frac{SS_T}{S_{jj}} \right)^{1/2}, \qquad j = 1, 2, \ldots, k$$
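A sketch of unit length scaling applied to the (assumed) delivery data, showing that $\mathbf{W}'\mathbf{W}$ is the predictor correlation matrix and recovering the standardized slopes:

    center <- function(v) v - mean(v)
    W  <- apply(delivery[, c("cases", "distance")], 2,
                function(v) center(v) / sqrt(sum(center(v)^2)))
    y0 <- center(delivery$time) / sqrt(sum(center(delivery$time)^2))
    t(W) %*% W                                 # correlation matrix of the predictors
    b_hat <- solve(t(W) %*% W, t(W) %*% y0)    # standardized (scaled) slopes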
Multicollinearity
In theory, one would like to have predictors in a multiple regression model that each have a different influence on the response and are independent of each other. In practice, the predictor variables are often correlated among themselves. Multicollinearity is the prevalence of near-linear dependence among the regressors.
If one regressor were a linear combination of the other regressors, then the matrix $\mathbf{X}$ (whose columns are the regressors) would have linearly dependent columns, which would make the matrix $\mathbf{X}'\mathbf{X}$ singular (non-invertible). In practice, it would mean
that the predictor that can be expressed through the other predictors cannot contribute any new information about the response. But, worse than that, the linear
dependence of the predictors makes the estimated slopes in the regression model
arbitrary.
Example: Consider a regression model in which somebody's height (in inches) is expressed as a function of arm-span (in inches). Suppose the true regression equation is
$$y = 12 + 1.1x$$
Now, suppose further that when measuring the arm span, two people took independent measurements in inches ($x_1$) and in centimeters ($x_2$) of the same subjects and both variables have erroneously been included in the same linear regression model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$
We know that in this case $x_2 = 0.394 x_1$, and thus we should have $\beta_1 + 0.394\beta_2 = 1.1$ in theory. But since this is a single equation with two unknowns, there are infinitely many possible solutions, some quite nonsensical. For instance, we could have $\beta_1 = -2.7$ and $\beta_2 = 9.645$. Of course, these slopes are not interpretable in the context of the original problem. The computer used to fit the data and to compute parameter estimates cannot distinguish between sensible and nonsensical estimates.
How can you tell whether you have multicollinearity in your data? Suppose your data have been standardized, so that $\mathbf{X}'\mathbf{X}$ is the correlation matrix for the $k$ predictors in the model. The main diagonal elements of the inverse of the predictor correlation matrix are called the variance inflation factors (VIF). The larger these factors are, the more you should worry about multicollinearity in your model. At the other extreme, VIFs of 1 mean that the predictors are all orthogonal.
In general, the variance inflation factor for the $j$th regressor coefficient can be computed as
$$VIF_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination obtained from regressing $x_j$ on the remaining $k - 1$ predictors.
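A sketch of computing the VIFs for the (assumed) delivery data directly from the predictor correlation matrix; with the car package installed, car::vif(fit) reports the same values:

    R_x <- cor(delivery[, c("cases", "distance")])   # predictor correlation matrix
    diag(solve(R_x))                                 # VIF_j = j-th diagonal of its inverse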