Linear Regression

David J. Olive
Department of Mathematics
Southern Illinois University
Carbondale, IL, USA
Y ⊥ x | β^T x.
Multiple linear regression and many experimental design models are special cases of the linear regression model, and the models can be presented compactly by defining the population model in terms of the sufficient predictor SP = β^T x and the estimated model in terms of the estimated sufficient predictor ESP = β̂^T x. In particular, the response plot or estimated sufficient summary plot of the ESP versus Y is used to visualize the conditional distribution Y | β^T x. The residual plot of the ESP versus the residuals is used to visualize the conditional distribution of the residuals given the ESP.
The literature on multiple linear regression is enormous. See Stigler (1986)
and Harter (1974a,b, 1975a,b,c, 1976) for history. Draper (2002) is a good
source for more recent literature. Some texts that were standard at one time
include Wright (1884), Johnson (1892), Bartlett (1900), Merriman (1907),
Weld (1916), Leland (1921), Ezekiel (1930), Bennett and Franklin (1954),
Ezekiel and Fox (1959), and Brownlee (1965). Recent reprints of several of
these texts are available from www.amazon.com.
Draper and Smith (1966) was a breakthrough because it popularized the
use of residual plots, making the earlier texts obsolete. Excellent texts include
Chatterjee and Hadi (2012), Draper and Smith (1998), Fox (2015), Hamil-
ton (1992), Kutner et al. (2005), Montgomery et al. (2012), Mosteller and
Tukey (1977), Ryan (2009), Sheather (2009), and Weisberg (2014). Cook and
Weisberg (1999a) was a breakthrough because of its use of response plots.
Multivariate linear regression and MANOVA models are special cases. Recent
results from Kakizawa (2009), Su and Cook (2012), Olive et al. (2015), and
Olive (2016b) make the multivariate linear regression model (Chapter 12)
easy to learn after the student has mastered the multiple linear regression
model (Chapters 2 and 3). For the multivariate linear regression model, it is
assumed that the iid zero mean error vectors have fourth moments.
Fourth, recent literature on plots for goodness and lack of fit, bootstrapping, outlier detection, response transformations, prediction intervals, prediction regions, and variable selection has been incorporated into the text. See Olive (2004b, 2007, 2013a,b, 2016a,b,c) and Olive and Hawkins (2005).
Chapter 1 reviews the material to be covered in the text and can be
skimmed and then referred to as needed. Chapters 2 and 3 cover multiple lin-
ear regression, Chapter 4 considers generalized least squares, and Chapters 5
through 9 consider experimental design models. Chapters 10 and 11 cover lin-
ear model theory and the multivariate normal distribution. These chapters
are needed for the multivariate linear regression model covered in Chapter 12.
Chapter 13 covers generalized linear models (GLMs) and generalized additive
models (GAMs).
The text also uses recent literature to provide answers to the following
important questions:
How can the conditional distribution Y | β^T x be visualized?
How can β be estimated?
How can variable selection be performed efficiently?
How can Y be predicted?
The text emphasizes prediction and visualizing the models. Some of the
applications in this text using this research are listed below.
1) It is shown how to use the response plot to detect outliers and to assess
the adequacy of linear models for multiple linear regression and experimental
design.
2) A graphical method for selecting a response transformation for linear
models is given. Linear models include multiple linear regression and many
experimental design models. This method is also useful for multivariate linear
regression.
3) A graphical method for assessing variable selection for the multiple
linear regression model is described. It is shown that for submodels I with
k predictors, the widely used screen Cp(I) ≤ k is too narrow. More good submodels are considered if the screen Cp(I) ≤ min(2k, p) is used. Variable
selection methods originally meant for multiple linear regression can be ex-
tended to GLMs. See Chapter 13. Similar ideas from Olive and Hawkins
(2005) have been incorporated in Agresti (2013). Section 3.4.1 shows how to
bootstrap the variable selection estimator.
4) Asymptotically optimal prediction intervals for a future response Yf
are given for models of the form Y = β^T x + e where the errors are iid,
can be used to download the R functions and data sets into R. Type ls(). Over
65 R functions from lregpack.txt should appear. In R, enter the command q().
A window asking Save workspace image? will appear. Click on No to remove
the functions from the computer (clicking on Yes saves the functions on R,
but the functions and data are easily obtained with the source commands).
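For example, an R session along the lines of the following sketch can be used; the URLs are illustrative placeholders for the addresses given on the text's website, and lregdata.txt is assumed to hold the data sets.

source("http://parker.ad.siu.edu/Olive/lregpack.txt")  # R functions (illustrative URL)
source("http://parker.ad.siu.edu/Olive/lregdata.txt")  # data sets (illustrative URL)
ls()   # over 65 lregpack functions should now be listed
q()    # quit R; click "No" so the downloaded functions are not saved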
Chapters 2–7 can be used for a one-semester course in regression and experimental design. For a course in generalized linear models, replace some of the design chapters by Chapter 13. Design chapters could also be replaced by Chapters 12 and 13. A more theoretical course would cover Chapters 1, 10, 11, and 12.
Acknowledgments
This work has been partially supported by NSF grants DMS 0202922 and
DMS 0600933. Collaborations with Douglas M. Hawkins and R. Dennis Cook
were extremely valuable. I am grateful to the developers of useful mathemat-
ical and statistical techniques and to the developers of computer software
and hardware (including R Core Team (2016)). Cook (1998) and Cook and
Weisberg (1999a) influenced this book. Teaching material from this text has
been invaluable. Some of the material in this text has been used in a Math
583 regression graphics course, a Math 583 experimental design course, and
a Math 583 robust statistics course. In 2009 and 2016, Chapters 2 to 7 were
used in Math 484, a course on multiple linear regression and experimental
design. Chapters 11 and 12 were used in a 2014 Math 583 theory of linear
models course. Chapter 12 was also used in a 2012 Math 583 multivariate
analysis course. Chapter 13 was used for a categorical data analysis course.
Thanks also goes to Springer, to Springer's associate editor Donna Chernyk, and to several reviewers.
Contents

1 Introduction
1.1 Some Regression Models
1.2 Multiple Linear Regression
1.3 Variable Selection
1.4 Other Issues
1.5 Complements
1.6 Problems

References
Index
Chapter 1
Introduction
This chapter provides a preview of the book but is presented in a rather abstract setting and will be easier to follow after reading the rest of the book. The reader may omit this chapter on first reading and refer back to it as necessary. Chapters 2 to 9 consider multiple linear regression and experimental design models fit with least squares. Chapter 1 is useful for extending several techniques, such as response plots and plots for response transformations used in those chapters, to alternative fitting methods and to alternative regression models. Chapter 13 illustrates some of these extensions for the generalized linear model (GLM) and the generalized additive model (GAM).
Response variables are the variables of interest, and are predicted with a p × 1 vector of predictor variables x = (x1, . . . , xp)^T where x^T is the transpose of x. A multivariate regression model has m > 1 response variables.
For example, predict Y1 = systolic blood pressure and Y2 = diastolic blood
pressure using a constant x1 , x2 = age, x3 = weight, and x4 = dosage amount
of blood pressure medicine. The multivariate location and dispersion model
of Chapter 10 is a special case of the multivariate linear regression model of
Chapter 12.
A univariate regression model has one response variable Y. Suppose Y is independent of the predictor variables x given a function h(x), written Y ⊥ x | h(x), where h : R^p → R^d and the integer d is as small as possible. Then Y follows a dD regression model, where d ≤ p since Y ⊥ x | x. If Y ⊥ x, then Y follows a 0D regression model. Then there are 0D, 1D, . . . , pD regression models, and all univariate regression models are dD regression models for some integer 0 ≤ d ≤ p. Cook (1998, p. 49) and Cook and Weisberg (1999a, p. 414) use similar notation with h(x) = (x^T β1, . . . , x^T βd)^T.
The remainder of this chapter considers 1D regression models, where h : R^p → R is a real function. The additive error regression model Y = m(x) + e is an important special case with h(x) = m(x). See Section 13.2. An important special case of the additive error model is the linear regression model Y = x^T β + e = x1 β1 + · · · + xp βp + e. Multiple linear regression and many experimental design models are special cases of the linear regression model.
The multiple linear regression model has at least one predictor xi that takes on many values. Chapter 2 fits this model with least squares and Chapter 3 considers variable selection models such as forward selection. There are many other methods for fitting the multiple linear regression model, including lasso, ridge regression, partial least squares (PLS), and principal component regression (PCR). See James et al. (2013), Olive (2017), and Pelawa Watagoda and Olive (2017). Chapters 2 and 3 consider response plots, plots for response transformations, and prediction intervals for the multiple linear regression model fit by least squares. All of these techniques can be extended to alternative fitting methods.
Notation. Often the index i will be suppressed. For example, the linear regression model

Yi = α + β^T xi + ei    (1.2)

for i = 1, . . . , n where β is a p × 1 unknown vector of parameters, and ei is a random error. This model could be written Y = α + β^T x + e. More accurately, Y | x = α + β^T x + e, but the conditioning on x will often be suppressed. Often the errors e1, . . . , en are iid (independent and identically distributed) from a distribution that is known except for a scale parameter. For example, the ei's might be iid from a normal (Gaussian) distribution with mean 0 and unknown standard deviation σ. For this Gaussian model, estimation of α, β, and σ is important for inference and for predicting a new value of the response variable Yf given a new vector of predictors xf.
The class of 1D regression models is very rich, and many of the most
used statistical models, including GLMs and GAMs, are 1D regression mod-
els. Nonlinear regression, nonparametric regression, and linear regression are
special cases of the additive error regression model
Y = h(x) + e = SP + e. (1.3)
Z = t^{-1}(α + β^T x + e)    (1.5)

Y = t(Z) = α + β^T x + e.    (1.6)
Sections 3.2 and 5.4 show how to choose the response transformation t(Z) graphically, and these techniques are easy to extend to the additive error regression model Y = h(x) + e. Then the response transformation model is Y = tλ(Z) = hλ(x) + e, and the graphical method for selecting the response transformation is to plot ĥλi(x) versus tλi(Z) for several values of λi, choosing the value of λ = λ0 where the plotted points follow the identity line with unit slope and zero intercept. For the multiple linear regression model, ĥλi(x) = x^T β̂λi where β̂λi can be found using the desired fitting method, e.g. lasso.
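A minimal R sketch of this graphical method for the MLR case fit by OLS is given below (the lregpack functions implement more refined versions); it assumes Z > 0 is the response and x is a matrix of predictors.

tplots <- function(x, Z, lams = c(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)) {
  for (lam in lams) {
    tZ <- if (lam == 0) log(Z) else Z^lam   # power transformation ladder
    out <- lm(tZ ~ x)                       # OLS fit for this value of lambda
    plot(fitted(out), tZ, main = paste("lambda =", lam))
    abline(0, 1)                            # points should follow the identity line
    readline("hit return for the next plot ")
  }
}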
Box (1979) warns that all models are wrong, but some are useful. For example, the function g in equation (1.4) or the error distribution could be misspecified. Diagnostics are used to check whether model assumptions such as the form of g and the proposed error distribution are reasonable. Often diagnostics use residuals ri. For example, the additive error regression model (1.3) uses

ri = Yi − ĥ(xi)

where ĥ(x) is an estimate of h(x).
Z = t^{-1}(α + β^T x + e),    (1.7)

Y = t(Z) = α + β^T x + e    (1.8)

able Y and the predictors x. The response plot is used to visualize the conditional distribution of Y | x, Y | SP, and Y | (α + β^T x) if SP = α + β^T x.
iii) Check for lack of fit of the model with a residual plot of the ESP versus the residuals.
iv) Fit the model and find ĥ(x). If SP = α + β^T x, estimate α and β, e.g., using maximum likelihood estimators.
v) Estimate the mean function E(Yi | xi) = μ(xi) = di τ(xi) or estimate τ(xi) where the di are known constants.
vii) Check for overdispersion with an OD plot. See Section 13.8.
viii) Check whether Y is independent of x, that is, check whether the nontrivial predictors x are needed in the model. Check whether SP = h(x) ≡ c where the constant c does not depend on the xi. If SP = α + β^T x, check whether β = 0, for example, test Ho: β = 0.
ix) Check whether a reduced model can be used instead of the full model. If SP = α + β^T x = α + βR^T xR + βO^T xO where the r × 1 vector xR consists of the nontrivial predictors in the reduced model, test Ho: βO = 0.
x) Use variable selection to find a good submodel.
xi) Predict Yi given xi.
The field of statistics known as regression graphics gives useful results for examining the 1D regression model (1.1) even when the model is unknown or misspecified. The following section shows that the sufficient summary plot is useful for explaining the given 1D model while the response plot can often be used to visualize the conditional distribution of Y | SP. Also see Chapter 13 and Olive (2013b).
Suppose that the response variable Y is quantitative and that at least one pre-
dictor variable xi is quantitative. Then the multiple linear regression (MLR)
model is often a very useful model. For the MLR model,
[Figure residue: sufficient summary plot of SP versus Y and response plots of ESP versus Y for the artificial data.]
β^T x and the estimated conditional mean function is μ̂(ESP) = ESP. The estimated or fitted value of Yi is equal to Ŷi = α̂ + β̂^T xi. Now the vertical deviation of Yi from the identity line is equal to the residual ri = Yi − (α̂ + β̂^T xi). The interpretation of the ESSP is almost the same as that of the SSP, but now the mean SP is estimated by the estimated sufficient predictor (ESP). This plot is also called the response plot and is used as a goodness of fit diagnostic. The residual plot is a plot of the ESP versus ri and is used as a lack of fit diagnostic. These two plots should be made immediately after fitting the MLR model and before performing inference. Figures 1.2 and 1.3 show the response plot and residual plot for the artificial data.
The response plot is also a useful visual aid for describing the ANOVA F test (see Section 2.4) which tests whether β = 0, that is, whether the nontrivial predictors x are needed in the model. If the predictors are not needed in the model, then Yi and E(Yi | xi) should be estimated by the sample mean Ȳ. If the predictors are needed, then Yi and E(Yi | xi) should be estimated by the ESP Ŷi = α̂ + β̂^T xi. If the identity line clearly fits the data better than the horizontal line Y = Ȳ, then the ANOVA F test should have a small pvalue and reject the null hypothesis Ho that the predictors x are not needed in the MLR model. Figure 1.2 shows that the identity line fits the data better than any horizontal line. Figure 1.4 shows the response plot for the artificial data when only X4 and X5 are used as predictors with the identity line and the line Y = Ȳ added as visual aids. In this plot the horizontal line fits the data about as well as the identity line which was expected since Y is independent of X4 and X5.
It is easy to find data sets where the response plot looks like Figure 1.4, but the pvalue for the ANOVA F test is very small. In this case, the MLR model is statistically significant, but the investigator needs to decide whether the MLR model is practically significant.
SP = α + β^T x = α + βS^T xS + βE^T xE = α + βS^T xS.    (1.11)

The extraneous terms that can be eliminated given that the subset S is in the model have zero coefficients: βE = 0.
Now suppose that I is a candidate subset of predictors, that S ⊆ I and that O is the set of predictors not in I. Then

SP = α + β^T x = α + βS^T xS = α + βS^T xS + β(I/S)^T xI/S + 0^T xO = α + βI^T xI,

where xI/S denotes the predictors in I that are not in S. Since this is true regardless of the values of the predictors, βO = 0 if S ⊆ I. Hence for any subset I that includes all relevant predictors, the population correlation

corr(α + β^T xi, α + βI^T xI,i) = 1.    (1.12)
and consideration. To make this advice more specific, use the rule of thumb that a candidate subset of predictors I is worth considering if the sample correlation of ESP and ESP(I) satisfies

corr(α̂ + β̂^T xi, α̂I + β̂I^T xI,i) = corr(β̂^T xi, β̂I^T xI,i) ≥ 0.95.    (1.13)
The difficulty with this approach is that fitting large numbers of possible submodels involves substantial computation. Fortunately, (ordinary) least squares (OLS) frequently gives a useful ESP, and methods originally meant for multiple linear regression using the Mallows Cp criterion (see Jones 1946 and Mallows 1973) also work for more general 1D regression models. As a rule of thumb, the OLS ESP is useful if |corr(OLS ESP, ESP)| ≥ 0.95 where ESP is the standard ESP (e.g., for generalized linear models, the ESP is α̂ + β̂^T x where (α̂, β̂) is the maximum likelihood estimator of (α, β)), or if the OLS response plot suggests that the OLS ESP is good. Variable selection will be discussed in much greater detail in Chapters 3 and 13, but the following methods are useful for a large class of 1D regression models.
Backward elimination starts with the full model. All models contain a constant = U0. Hence the full model contains U0, X1, . . . , Xp−1. We will also say that the full model contains U0, U1, . . . , Up−1 where Ui need not equal Xi for i ≥ 1.
Step 1) k = p − 1: fit each model with p − 1 predictors including a constant. Delete the predictor Up−1, say, that corresponds to the model with the smallest Cp. Keep U0, . . . , Up−2.
All subsets variable selection examines all subsets and keeps track of several (up to three, say) subsets with the smallest Cp(I) for each group of submodels containing k predictors including a constant. This method can be used for p ≤ 30 by using the efficient leaps and bounds algorithms when OLS and Cp is used (see Furnival and Wilson 1974).
Rule of thumb for variable selection (assuming that the cost of each predictor is the same): find the submodel Im with the minimum Cp. If Im uses km predictors including a constant, do not use any submodel that has more than km predictors. Since the minimum Cp submodel often has too many predictors, also look at the submodel Io with the smallest value of k, say ko, such that Cp ≤ 2k. This submodel may have too few predictors. So look at the predictors in Im but not in Io and see if they can be deleted or not. (If Im = Io, then it is a good candidate for the best submodel.)
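As an illustration, a minimal sketch of this rule of thumb using OLS, Cp, and the leaps and bounds algorithm is given below; it assumes x is a matrix of nontrivial predictors, Y is the response, and the R package leaps is installed.

library(leaps)                         # all subsets OLS variable selection
out <- summary(regsubsets(x = x, y = Y, nvmax = ncol(x)))
k <- rowSums(out$which)                # predictors in each submodel, counting the constant
cbind(k, Cp = out$cp)                  # compare Cp(I) with k and with 2k
out$which[which.min(out$cp), ]         # submodel Im minimizing Cp
out$which[out$cp <= 2 * k, ]           # submodels with Cp(I) <= 2k; Io has the smallest such k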
where SSE is the residual sum of squares from the full model and SSE(I) is the residual sum of squares from the candidate submodel. Then

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(FI − 1) + k    (1.14)

where MSE is the residual mean square for the full model. Let ESP(I) = α̂I + β̂I^T x be the ESP for the submodel and let VI = Y − ESP(I) so that VI,i = Yi − (α̂I + β̂I^T xi). Let ESP and V denote the corresponding quantities for the full model. Then Olive and Hawkins (2005) show that corr(VI, V) → 1 forces corr(OLS ESP, OLS ESP(I)) → 1 and that

corr(V, VI) = sqrt[SSE/SSE(I)] = sqrt[(n − p)/(Cp(I) + n − 2k)] = sqrt[(n − p)/((p − k)FI + n − p)].

Notice that the submodel Ik that minimizes Cp(I) also maximizes corr(V, VI) among all submodels I with k predictors including a constant. If Cp(I) ≤ 2k and n ≥ 10p, then 0.948 ≤ corr(V, V(I)), and both corr(V, V(I)) → 1.0 and corr(OLS ESP, OLS ESP(I)) → 1.0 as n → ∞.
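To see where the 0.948 bound comes from, note that Cp(I) ≤ 2k gives Cp(I) + n − 2k ≤ n, so the displayed equation yields corr(V, VI) ≥ sqrt[(n − p)/n] = sqrt(1 − p/n) ≥ sqrt(0.9) ≈ 0.948 when n ≥ 10p.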
Suppose that the OLS ESP and the standard ESP are highly correlated: |corr(ESP, OLS ESP)| ≥ 0.95. Then often OLS variable selection can be used for the 1D data, and using the pvalues from OLS output seems to be a useful benchmark. To see this, suppose that n ≥ 5p and first consider the model Ii that deletes the predictor Xi. Then the model has k = p − 1 predictors including the constant, and the test statistic is ti where

ti^2 = FIi,

or

Cp(Ii) = Cp(Ifull) + (ti^2 − 2).

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor Xi should not be deleted if

|ti| > sqrt(2) ≈ 1.414.

If |ti| < sqrt(2), then the predictor can probably be deleted since Cp decreases.
More generally, for the partial F test, notice that by (1.14), Cp(I) ≤ 2k iff (p − k)FI − p + 2k ≤ 2k iff (p − k)FI ≤ p iff

FI ≤ p/(p − k).
Now k is the number of terms in the model including a constant while p − k is the number of terms set to 0. As k → 0, the partial F test will reject Ho (i.e., say that the full model should be used instead of the submodel I) unless FI is not much larger than 1. If p is very large and p − k is very small, then the partial F test will tend to suggest that there is a model I that is about as good as the full model even though model I deletes p − k predictors.
The Cp(I) ≤ k screen tends to overfit. An additive error single index model is Y = m(α + x^T β) + e. We simulated multiple linear regression and single index model data sets with p = 8 and n = 50, 100, 1000, and 10000. The true model S satisfied Cp(S) ≤ k for about 60% of the simulated data sets, but S satisfied Cp(S) ≤ 2k for about 97% of the data sets.
The 1D regression models offer a unifying framework for many of the most used regression models. By writing the model in terms of the sufficient predictor SP = h(x), many important topics valid for all 1D regression models can be explained compactly. For example, the previous section presented variable selection, and equation (1.14) can be used to motivate the test for whether the reduced model can be used instead of the full model. Similarly, the sufficient predictor can be used to unify the interpretation of coefficients and to explain models that contain interactions and factors.
Interpretation of Coefficients
One interpretation of the coefficients in a 1D model (1.11) is that βi is the rate of change in the SP associated with a unit increase in xi when all other predictor variables x1, . . . , xi−1, xi+1, . . . , xp are held fixed. Denote a model by SP = α + β^T x = α + β1 x1 + · · · + βp xp. Then

βi = ∂ SP / ∂ xi for i = 1, . . . , p.
Of course, holding all other variables fixed while changing xi may not be possible. For example, if x1 = x, x2 = x^2 and SP = α + β1 x + β2 x^2, then x2 cannot be held fixed when x1 increases by one unit, but

d SP / dx = β1 + 2β2 x.
The interpretation of βi changes with the model in two ways. First, the interpretation changes as terms are added and deleted from the SP. Hence the interpretation of β1 differs for models SP = α + β1 x1 and

E(Y | SP) = ρ(SP) = exp(SP) / (1 + exp(SP)),

and the change in the conditional expectation associated with a one unit increase in xi is more complex.
Interactions
Suppose X1 is quantitative and X2 is qualitative with 2 levels and X2 = 1 for level a2 and X2 = 0 for level a1. Then a first order model with interaction is SP = α + β1 x1 + β2 x2 + β3 x1 x2. This model yields two unrelated lines in the sufficient predictor depending on the value of x2: SP = α + β2 + (β1 + β3)x1 if x2 = 1 and SP = α + β1 x1 if x2 = 0. If β3 = 0, then there are two parallel lines: SP = α + β2 + β1 x1 if x2 = 1 and SP = α + β1 x1 if x2 = 0. If β2 = β3 = 0, then the two lines are coincident: SP = α + β1 x1 for both values of x2. If β2 = 0, then the two lines have the same intercept: SP = α + (β1 + β3)x1 if x2 = 1 and SP = α + β1 x1 if x2 = 0. In general, as factors have more levels and interactions have more terms, e.g. x1 x2 x3 x4, the interpretation of the model rapidly becomes very complex.
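For the MLR special case, such a first order model with interaction can be fit in R as sketched below; Y, x1, and x2 are assumed to be available, with x2 coded 0 or 1.

out <- lm(Y ~ x1 + x2 + x1:x2)    # SP = alpha + b1 x1 + b2 x2 + b3 x1 x2
coef(out)                         # an estimate of b3 near 0 suggests two parallel lines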
1.5 Complements
the plotted points will fall about some line with slope β and intercept α if the SLR model holds, but in a plot of SP = α + β^T xi versus Yi, the plotted points will fall about the identity line with unit slope and zero intercept if the multiple linear regression model holds. If there are more than two nontrivial predictors, then we generally cannot find a sufficient summary plot and need to use an estimated sufficient summary plot.
Important theoretical results for the additive error single index model Y = m(α + β^T x) + e were given by Brillinger (1977, 1983) and Aldrin et al. (1993). Li and Duan (1989) extended these results to models of the form

Y = g(α + β^T x, e)    (1.15)

where g is a bivariate inverse link function. Olive and Hawkins (2005) discuss variable selection while Chang (2006) and Chang and Olive (2007, 2010) discuss (ordinary) least squares (OLS) tests. Severini (1998) discusses when OLS output is relevant for the Gaussian additive error single index model.
1.6 Problems
R Problem
Chapter 2
Multiple Linear Regression

This chapter introduces the multiple linear regression model, the response plot for checking goodness of fit, the residual plot for checking lack of fit, the ANOVA F test, the partial F test, the t tests, and least squares. The problems use software R, SAS, Minitab, and Arc.
Definition 2.1. The response variable is the variable that you want to predict. The predictor variables are the variables used to predict the response variable.
Notation. In this text the response variable will usually be denoted by
Y and the p predictor variables will often be denoted by x1 , . . . , xp . The
response variable is also called the dependent variable while the predictor
variables are also called independent variables, explanatory variables, carri-
ers, or covariates. Often the predictor variables will be collected in a vector x.
Then xT is the transpose of x.
Definition 2.2. Regression is the study of the conditional distribu-
tion Y |x of the response variable Y given the vector of predictors x =
(x1 , . . . , xp )T .
Definition 2.3. A quantitative variable takes on numerical values
while a qualitative variable takes on categorical values.
Example 2.1. Archeologists and crime scene investigators sometimes
want to predict the height of a person from partial skeletal remains. A model
for prediction can be built from nearly complete skeletons or from living
humans, depending on the population of interest (e.g., ancient Egyptians
or modern US citizens). The response variable Y is height and the predic-
tor variables might be x1 ≡ 1, x2 = femur length, and x3 = ulna length.
Definition 2.4. Suppose that the response variable Y and at least one predictor variable xi are quantitative. Then the multiple linear regression (MLR) model is

Yi = xi,1 β1 + xi,2 β2 + · · · + xi,p βp + ei = xi^T β + ei    (2.1)

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the ith error. Suppressing the subscript i, the model is Y = x^T β + e.
Y = Xβ + e,    (2.2)
If the predictor variables are random variables, then the above MLR model
is conditional on the observed values of the xi . That is, observe the xi and
then act as if the observed xi are fixed.
Definition 2.6. The unimodal MLR model has the same assumptions as the constant variance MLR model, as well as the assumption that the zero mean constant variance errors e1, . . . , en are iid from a unimodal distribution that is not highly skewed. Note that E(ei) = 0 and V(ei) = σ^2 < ∞.
Definition 2.7. The normal MLR model or Gaussian MLR model has the same assumptions as the unimodal MLR model but adds the assumption that the errors e1, . . . , en are iid N(0, σ^2) random variables. That is, the ei are iid normal random variables with zero mean and variance σ^2.
The unknown coefficients for the above 3 models are usually estimated using (ordinary) least squares (OLS).
Definition 2.9. The ordinary least squares (OLS) estimator β̂OLS minimizes

QOLS(b) = Σ_{i=1}^n ri^2(b),    (2.4)
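As a minimal sketch (not the text's lregpack code), the OLS criterion can be minimized in R either by solving the normal equations directly or with lm(); here Y is assumed to be the response vector and X an n × p design matrix whose first column is the constant.

bhat <- solve(t(X) %*% X, t(X) %*% Y)   # OLS estimator from the normal equations
r <- Y - X %*% bhat                     # residuals; sum(r^2) = Q_OLS(bhat)
out <- lm(Y ~ X - 1)                    # same fit from lm(); "-1" since X already contains the constant column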
There are many statistical models besides the MLR model, and you should
learn how to quickly recognize an MLR model. A regression model has a
response variable Y and the conditional distribution of Y given the predic-
tors x = (x1 , . . . , xp )T is of interest. Regression models are used to predict Y
and to summarize the relationship between Y and x. If a constant xi,1 ≡ 1 (this notation means that xi,1 = 1 for i = 1, . . . , n) is in the model, then xi,1 is often called the trivial predictor, and the MLR model is said to have
Notation: For MLR, the residual plot will often mean the residual plot of Ŷi versus ri, and the response plot will often mean the plot of Ŷi versus Yi.
Remark 2.1. For any MLR analysis, always make the response plot and the residual plot of Ŷi versus Yi and ri, respectively.
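In R these two plots can be made as sketched below, assuming the response Y is available and the model has been fit with out <- lm(Y ~ x) for a predictor matrix x.

FIT <- fitted(out); RES <- resid(out)
plot(FIT, Y); abline(0, 1)      # response plot: fitted values versus Y with the identity line
plot(FIT, RES); abline(h = 0)   # residual plot: fitted values versus residuals with the r = 0 line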
Fig. 2.1 Residual and Response Plots for the Tremearne Data (cases 3, 44, and 63 are labeled in both plots)
Definition 2.13. An outlier is an observation that lies far away from the
bulk of the data.
Remark 2.2. For MLR, the response plot is important because MLR is the study of the conditional distribution of Y | x^T β, and the response plot is used to visualize the conditional distribution of Y | x^T β since Ŷ = x^T β̂ is a good estimator of x^T β if β̂ is a good estimator of β.
If the MLR model is useful, then the plotted points in the response plot should be linear and scatter about the identity line with no gross outliers. Suppose the fitted values range in value from wL to wH with no outliers. Fix the fit = w in this range and mentally add a narrow vertical strip centered at w to the response plot. The plotted points in the vertical strip should have a mean near w since they scatter about the identity line. Hence Y | (fit = w) is like a sample from a distribution with mean w. The following example helps illustrate this remark.
The response plot may look good while the residual plot suggests that the unimodal MLR model can be improved. Examining plots to find model violations is called checking for lack of fit. Again assume that n ≥ 5p.
The unimodal MLR model often provides a useful model for the data, but
the following assumptions do need to be checked.
i) Is the MLR model appropriate?
ii) Are outliers present?
iii) Is the error variance constant or nonconstant? The constant variance assumption VAR(ei) ≡ σ^2 is known as homoscedasticity. The nonconstant variance assumption VAR(ei) = σi^2 is known as heteroscedasticity.
iv) Are any important predictors left out of the model?
v) Are the errors e1 , . . . , en iid?
vi) Are the errors ei independent of the predictors xi ?
Make the response plot and the residual plot to check i), ii), and iii). An
MLR model is reasonable if the plots look like Figures 1.2, 1.3, 1.4, and 2.1.
A response plot that looks like Figure 13.7 suggests that the model is not
linear. If the plotted points in the residual plot do not scatter about the
r = 0 line with no other pattern (i.e., if the cloud of points is not ellipsoidal or
rectangular with zero slope), then the unimodal MLR model is not sustained.
The ith residual ri is an estimator of the ith error ei. The constant variance assumption may have been violated if the variability of the point cloud in the residual plot depends on the value of Ŷ. Often the variability of the residuals increases as Ŷ increases, resulting in a right opening megaphone shape. (Figure 4.1b has this shape.) Often the variability of the residuals decreases as Ŷ increases, resulting in a left opening megaphone shape. Sometimes the variability decreases then increases again, and sometimes the variability increases then decreases again (like a stretched or compressed football).
Remark 2.3. Residual plots magnify departures from the model while the response plot emphasizes how well the MLR model fits the data.
Since the residuals ri = êi are estimators of the errors, the residual plot is used to visualize the conditional distribution e | SP of the errors given the sufficient predictor SP = x^T β, where SP is estimated by Ŷ = x^T β̂. For the unimodal MLR model, there should not be any pattern in the residual plot: as a narrow vertical strip is moved from left to right, the behavior of the residuals within the strip should show little change.
Notation. A rule of thumb is a rule that often but not always works well
in practice.
Rule of thumb 2.1. If the residual plot would look good after several
points have been deleted, and if these deleted points were not gross outliers
(points far from the point cloud formed by the bulk of the data), then the
residual plot is probably good. Beginners often find too many things wrong
with a good model. For practice, use the lregpack function MLRsim to generate
several MLR data sets, and make the response and residual plots for these
data sets: type MLRsim(nruns=10) in R and right click Stop for each plot
(20 times) to generate 10 pairs of response and residual plots. This exercise
will help show that the plots can have considerable variability even when the
MLR model is good. See Problem 2.30.
Rule of thumb 2.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)
The residual plot of Ŷ versus r should always be made. It is also a good idea to plot each nontrivial predictor xj versus r and to plot potential predictors wj versus r. If the predictor is quantitative, then the residual plot of xj versus r should look like the residual plot of Ŷ versus r. If the predictor is qualitative, e.g. gender, then interpreting the residual plot is much more difficult; however, if each category contains many observations, then the plotted points for each category should form a vertical line centered at r = 0 with roughly the same variability (spread or range).
Rule of thumb 2.3. Suppose that the MLR model uses predictors xj and that data has been collected on variables wj that are not included in the MLR model. To check whether important predictors have been left out, make residual plots of xj and wj versus r. If these plots scatter about the r = 0 line with no other pattern, then there is no evidence that xj^2 or wj are needed in the model. If the plotted points scatter about a parabolic curve, try adding xj^2 or wj and wj^2 to the MLR model. If the plot of the potential predictor wj versus r has a linear trend, try adding wj to the MLR model. The additive error regression model and EE plot in Section 13.7 can also be used to check whether important predictors have been left out.
Rule of thumb 2.4. To check that the errors are independent of the pre-
dictors, make residual plots of xj versus r. If the plot of xj versus r scatters
about the r = 0 line with no other pattern, then there is no evidence that the
errors depend on xj . If the variability of the residuals changes with the value
of xj , e.g. if the plot resembles a left or right opening megaphone, the errors
may depend on xj . Some remedies for nonconstant variance are considered
in Chapter 4.
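A minimal sketch of these checks in R is given below, assuming x is a matrix of quantitative predictors (or potential predictors) and r holds the residuals from the MLR fit.

for (j in 1:ncol(x)) {
  plot(x[, j], r); abline(h = 0)   # should scatter about the r = 0 line with no other pattern
  readline("hit return for the next plot ")
}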
To study residual plots, some notation and properties of the least squares estimator are needed. MLR is the study of the conditional distribution of Yi | xi^T β, and the MLR model is Y = Xβ + e where X is an n × p matrix of full rank p. Hence the number of predictors p ≤ n. The ith row of X is xi^T = (xi,1, . . . , xi,p) where xi,k is the value of the ith observation on the kth predictor xk. We will denote the jth column of X by Xj ≡ v_j which corresponds to the jth variable or predictor xj.
Example 2.4. If Y is brain weight in grams, x1 ≡ 1, x2 is age, and x3 is the size of the head in (mm)^3, then for the Gladstone (1905) data

Y = (3738, 4261, . . . , 3306)^T,
X = [v_1 v_2 v_3] with rows (1, 39, 149.5), (1, 35, 152.5), . . . , (1, 19, 141).

Hence the first person had brain weight = 3738, age = 39, and size = 149.5. After deleting observations with missing values, there were n = 267 cases (people measured on brain weight, age, and size), and x267 = (1, 19, 141)^T. The second predictor x2 = age corresponds to the 2nd column of X and is X2 = v_2 = (39, 35, . . . , 19)^T. Notice that X1 ≡ v_1 = 1 = (1, . . . , 1)^T corresponds to the constant x1.
The results in the following proposition are properties of least squares (OLS), not of the underlying MLR model. See Chapter 11 for more linear model theory. Definitions 2.8 and 2.9 define the hat matrix H, vector of fitted values Ŷ, and vector of residuals r. Parts f) and g) make residual plots useful. If the plotted points are linear with roughly constant variance and the correlation is zero, then the plotted points scatter about the r = 0 line with no other pattern. If the plotted points in a residual plot of w versus r do show a pattern such as a curve or a right opening megaphone, zero correlation will usually force symmetry about either the r = 0 line or the w = median(w) line. Hence departures from the ideal plot of random scatter about the r = 0 line are often easy to detect.
H^T = (X^T)^T [(X^T X)^{-1}]^T X^T = X(X^T X)^{-1} X^T = H.
A = Σ_{i=1}^n Ŷi ri − Ȳ Σ_{i=1}^n ri = Σ_{i=1}^n Ŷi ri

by d) again. But Σ_{i=1}^n Ŷi ri = r^T Ŷ = 0 by e).
g) Following the argument in f), the result follows if A = Σ_{i=1}^n (xi,j − x̄j)(ri − r̄) = 0 where x̄j = Σ_{i=1}^n xi,j / n is the sample mean of the jth predictor. Now r̄ = Σ_{i=1}^n ri / n = 0 by d), and thus

A = Σ_{i=1}^n xi,j ri − x̄j Σ_{i=1}^n ri = Σ_{i=1}^n xi,j ri

by d) again. But Σ_{i=1}^n xi,j ri = Xj^T r = v_j^T r = 0 by c).
Without loss of generality, E(e) = 0 for the unimodal MLR model with a constant, in that if E(e) = μ ≠ 0, then the MLR model can always be written as Y = x^T β̃ + ẽ where E(ẽ) = 0 and E(Y) ≡ E(Y | x) = x^T β̃. To see this claim notice that

Y = β1 + x2 β2 + · · · + xp βp + e = β1 + E(e) + x2 β2 + · · · + xp βp + e − E(e)
  = β̃1 + x2 β2 + · · · + xp βp + ẽ

where β̃1 = β1 + E(e) and ẽ = e − E(e). For example, if the errors ei are iid exponential(λ) with E(ei) = λ, use ẽi = ei − λ.
For least squares, it is crucial that σ^2 exists. For example, if the ei are iid Cauchy(0,1), then σ^2 does not exist and the least squares estimators tend to perform very poorly.
The performance of least squares is analogous to the performance of Ȳ. The sample mean Ȳ is a very good estimator of the population mean μ if the Yi are iid N(μ, σ^2), and Ȳ is a good estimator of μ if the sample size is large and the Yi are iid with mean μ and variance σ^2. This result follows from the central limit theorem (CLT), but how large is large depends on the underlying distribution. The n > 30 rule tends to hold for distributions that are close to normal in that they take on many values and σ^2 is not huge. Error distributions that are highly nonnormal with tiny σ^2 often need n >> 30. For example, if Y1, . . . , Yn are iid Gamma(1/m, 1), then n > 25m may be needed. Another example is distributions that take on one value with very high probability, e.g. a Poisson random variable with very small variance. Bimodal and multimodal distributions and highly skewed distributions with large variances also need larger n. Chihara and Hesterberg (2011, p. 177) suggest using n > 5000 for moderately skewed distributions.
There are central limit type theorems for the least squares estimators that depend on the error distribution of the iid errors ei. See Theorems 2.8, 11.25, and 12.7. We always assume that the ei are continuous random variables with a probability density function. Error distributions that are close to normal may give good results for moderate n if n ≥ 10p and n − p ≥ 30 where p is the number of predictors. Error distributions that need large n for the CLT to apply for ē will tend to need large n for the limit theorems for least squares to apply (to give good approximations).
Checking whether the errors are iid is often difficult. The iid assumption is often reasonable if measurements are taken on different objects, e.g. people. In industry often several measurements are taken on a batch of material. For example a batch of cement is mixed and then several small cylinders of concrete are made from the batch. Then the cylinders are tested for strength. Experience from such experiments suggests that objects (e.g., cylinders) from different batches are independent, but objects from the same batch are not independent.
One check on independence can also be made if the time order of the
observations is known. Let r[t] be the residual where [t] is the time order of
the trial. Hence [1] was the 1st and [n] was the last trial. Plot the time order
t versus r[t] if the time order is known. Again, trends and outliers suggest
that the model could be improved. A box shaped plot with no trend suggests
that the MLR model is good. A plot similar to the Durbin Watson test plots
r[t−1] versus r[t] for t = 2, . . . , n. Linear trend suggests serial correlation while random scatter suggests that there is no lag 1 autocorrelation. As a rule of thumb, if the OLS slope b is computed for the plotted points, b > 0.25 gives some evidence that there is positive correlation between r[t−1] and r[t]. Time
series plots, such as the ACF or PACF of the residuals, may be useful.
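A minimal R sketch of these checks is given below; it assumes the residuals r are already stored in time order.

n <- length(r)
plot(1:n, r)                 # time order t versus r[t]; look for trends and outliers
plot(r[-n], r[-1])           # r[t-1] versus r[t] for t = 2, ..., n
abline(lm(r[-1] ~ r[-n]))    # OLS line; slope b > 0.25 suggests positive lag 1 correlation
acf(r)                       # autocorrelation function of the residuals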
After fitting least squares and checking the response and residual plots to see that an MLR model is reasonable, the next step is to check whether there is an MLR relationship between Y and the nontrivial predictors x2, . . . , xp. If at least one of these predictors is useful, then the OLS fitted values Ŷi should be used. If none of the nontrivial predictors is useful, then Ȳ will give as good predictions as Ŷi. Here the sample mean
Ȳ = (1/n) Σ_{i=1}^n Yi.    (2.5)
In the definition below, SSE is the sum of squared residuals and a residual ri = êi = errorhat. In the literature errorhat is often rather misleadingly abbreviated as error.

SSTO = Σ_{i=1}^n (Yi − Ȳ)^2.    (2.6)

SSR = Σ_{i=1}^n (Ŷi − Ȳ)^2.    (2.7)

SSE = Σ_{i=1}^n (Yi − Ŷi)^2 = Σ_{i=1}^n ri^2.    (2.8)
Proof.

SSTO = Σ_{i=1}^n (Yi − Ŷi + Ŷi − Ȳ)^2 = SSE + SSR + 2 Σ_{i=1}^n (Yi − Ŷi)(Ŷi − Ȳ).

But

A = Σ_{i=1}^n ri Ŷi − Ȳ Σ_{i=1}^n ri = 0
Definition 2.15. Assume that a constant is in the MLR model and that SSTO ≠ 0. The coefficient of multiple determination

R^2 = [corr(Yi, Ŷi)]^2 = SSR/SSTO = 1 − SSE/SSTO
The following 2 propositions suggest that R^2 does not behave well when many predictors that are not needed in the model are included in the model. Such a variable is sometimes called a noise variable and the MLR model is fitting noise. Proposition 2.5 appears, for example, in Cramer (1946, pp. 414–415), and suggests that R^2 should be considerably larger than p/n if the predictors are useful. Note that if n = 10p and p ≥ 2, then under the conditions of Proposition 2.5, E(R^2) ≈ 0.1.
Notice that each SS/n estimates the variability of some quantity. SSTO/n ≈ S_Y^2, SSE/n ≈ S_e^2 = σ̂^2, and SSR/n ≈ S_Ŷ^2.
Seber and Lee (2003, pp. 44–47) show that when the MLR model holds, MSE is often a good estimator of σ^2. Under regularity conditions, the MSE is one of the best unbiased quadratic estimators of σ^2. For the normal MLR model, MSE is the uniformly minimum variance unbiased estimator of σ^2. Seber and Lee also give the following theorem that shows that the MSE is an unbiased estimator of σ^2 under very weak assumptions if the MLR model is appropriate. From Theorem 12.7, MSE is a √n consistent estimator of σ^2.
Theorem 2.6. If Y = Xβ + e where X is an n × p matrix of full rank p, if the ei are independent with E(ei) = 0, and VAR(ei) = σ^2, then σ̂^2 = MSE is an unbiased estimator of σ^2.
The ANOVA F test tests whether any of the nontrivial predictors x2, . . . , xp are needed in the OLS MLR model, that is, whether Yi should be predicted by the OLS fit Ŷi = β̂1 + xi,2 β̂2 + · · · + xi,p β̂p or with the sample mean Ȳ. ANOVA stands for analysis of variance, and the computer output needed to perform the test is contained in the ANOVA table. Below is an ANOVA table given in symbols. Sometimes Regression is replaced by Model and Residual by Error.

Source      df    SS    MS    F               p-value
Regression  p−1   SSR   MSR   Fo = MSR/MSE    for Ho: β2 = · · · = βp = 0
Residual    n−p   SSE   MSE
Remark 2.4. Recall that for a 4 step test of hypotheses, the pvalue is the probability of getting a test statistic as extreme as the test statistic actually observed and that Ho is rejected if the pvalue < δ. As a benchmark for this textbook, use δ = 0.05 if δ is not given. The 4th step is the nontechnical conclusion which is crucial for presenting your results to people who are not familiar with MLR. Replace Y and x2, . . . , xp by the actual variables used in the MLR model. Follow Example 2.5.

pval − pvalue →^P 0

(converges to 0 in probability, so pval is a consistent estimator of the pvalue) as the sample size n → ∞. See Theorem 11.25, Section 11.6, and Chang and Olive (2010). Then the computer output pval is a good estimator of the unknown pvalue. We will use Ho ≡ H0 and Ha ≡ HA ≡ H1.
P(Fp−1,n−p > Fo).
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is an MLR relationship between Y and the predictors x2 , . . . , xp . If
you fail to reject Ho, conclude that there is not an MLR relationship between
Y and the predictors x2 , . . . , xp . (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)
Example 2.5. For the Gladstone (1905) data, the response variable Y =
brain weight, x1 ≡ 1, x2 = size of head, x3 = sex, x4 = breadth of head,
x5 = circumference of head. Assume that the response and residual plots
look good and test whether at least one of the nontrivial predictors is needed
in the model using the output shown below.
Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 4 5396942. 1349235. 196.24 0.0000
Residual 262 1801333. 6875.32
Solution: i) Ho: β2 = · · · = β5 = 0  Ha: not Ho
ii) Fo = 196.24 from output.
iii) The pval = 0.0 from output.
iv) The pval < δ (= 0.05 since δ was not given). So reject Ho. Hence there is an MLR relationship between brain weight and the predictors size, sex, breadth, and circumference.
Remark 2.5. There is a close relationship between the response plot and the ANOVA F test. If n ≥ 10p and n − p ≥ 30 and if the plotted points follow the identity line, typically Ho will be rejected if the identity line fits the plotted points better than any horizontal line (in particular, the line Ŷ = Ȳ). If a horizontal line fits the plotted points about as well as the identity line, as in Figure 1.4, this graphical diagnostic is inconclusive (sometimes the ANOVA F test will reject Ho and sometimes fail to reject Ho), but the MLR relationship is at best weak. In Figures 1.2 and 2.1, the ANOVA F test should reject Ho since the identity line fits the plotted points better than any horizontal line. Under the above conditions, a graphical ANOVA F test rejects Ho if the response plot is not similar to the residual plot. The graphical test is inconclusive if the response plot looks similar to the residual plot. The graphical test is also useful for multiple linear regression methods other than least squares, such as M-estimators and other robust regression estimators.
Remark 2.6. If the RR plot of the residuals Yi − Ȳ versus the OLS residuals ri = Yi − Ŷi shows tight clustering about the identity line, then the MLR relationship is weak: Ȳ fits the data about as well as the OLS fit.
Example 2.6. Cook and Weisberg (1999a, pp. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The response Y was the fraction of the drug recovered from
the rat's liver. The three predictors were the body weight of the rat, the dose
of the drug, and the liver weight. A constant was also used. The experimenter
expected the response to be independent of the predictors, and 19 cases
were used. However, the ANOVA F test suggested that the predictors were
important. The third case was an outlier and easily detected in the response
and residual plots (not shown). After deleting the outlier, the response and
residual plots looked ok and the following output was obtained.
Fig. 2.2 RR Plot With Outlier Deleted, Submodel Uses Only the Trivial Predictor with Ŷ = Ȳ
Some assumptions are needed on the ANOVA F test. Assume that both the response and residual plots look good. It is crucial that there are no outliers. Then a rule of thumb is that if n − p is large, then the ANOVA F test pvalue is approximately correct. An analogy can be made with the central limit theorem: Ȳ is a good estimator for μ if the Yi are iid N(μ, σ^2) and also a good estimator for μ if the data are iid with mean μ and variance σ^2 if n is large enough. Also see Theorem 11.25. More on the robustness and lack of robustness of the ANOVA F test can be found in Wilcox (2012).
If all of the xi are different (no replication) and if the number of predictors p = n, then the OLS fit Ŷi = Yi and R^2 = 1. Notice that Ho is rejected if the statistic Fo is large. More precisely, reject Ho if

Fo > Fp−1,n−p,1−δ

where

P(F ≤ Fp−1,n−p,1−δ) = 1 − δ

when F ~ Fp−1,n−p. Since R^2 increases to 1 while (n − p)/(p − 1) decreases to 0 as p increases to n, Theorem 2.7a below implies that if p is large then the Fo statistic may be small even if some of the predictors are very good. It is a good idea to use n ≥ 10p or at least n ≥ 5p if possible. Theorem 11.25 can be used to show that pval is a consistent estimator of the pvalue under reasonable conditions.
Remark 2.7. When a constant is not contained in the model (i.e., xi,1 is not equal to 1 for all i), then the computer output still produces an ANOVA table with the test statistic and pvalue, and nearly the same 4 step test of hypotheses can be used. The hypotheses are now Ho: β1 = · · · = βp = 0  Ha: not Ho, and you are testing whether or not there is an MLR relationship between Y and x1, . . . , xp. An MLR model without a constant (no intercept) is sometimes called a regression through the origin. See Section 2.10.
2.5 Prediction
(Y1 , x1 ), . . . , (Yn , xn )
well, but when new test data is collected, a very different MLR model is needed to fit the new data well. In particular, the MLR model seems to fit the data (Yi, xi) well for i = 1, . . . , n, but when the researcher tries to predict Yf for a new vector of predictors xf, the prediction is very poor in that Ŷf is not close to the Yf actually observed. Wait until after the MLR model has been shown to make good predictions before claiming that the model gives good predictions!
There are several reasons why the MLR model may not fit new data well.
i) The model building process is usually iterative. Data Z, w1, . . . , wr is collected. If the model is not linear, then functions of Z are used as a potential response variable and functions of the wi as potential predictors. After trial and error, the functions are chosen, resulting in a final MLR model using Y and x1, . . . , xp. Since the same data set was used during the model building process, biases are introduced and the MLR model fits the training data better than it fits new test data. Suppose that Y, x1, . . . , xp are specified before collecting data and that the residual and response plots from the resulting MLR model look good. Then predictions from the prespecified model will often be better for predicting new data than a model built from an iterative process.
ii) If (Yf, xf) come from a different population than the population of (Y1, x1), . . . , (Yn, xn), then prediction for Yf can be arbitrarily bad.
iii) Even a good MLR model may not provide good predictions for an xf that is far from the xi (extrapolation).
iv) The MLR model may be missing important predictors (underfitting).
v) The MLR model may contain unnecessary predictors (overfitting).
The following theorem is analogous to the central limit theorem and the theory for the t-interval for μ based on Ȳ and the sample standard deviation (SD) SY. If the data Y1, . . . , Yn are iid with mean μ and variance σ^2, then Ȳ is asymptotically normal and the t-interval will perform well if the sample size is large enough. The result below suggests that the OLS estimators Ŷi and β̂ are good if the sample size is large enough. The condition max hi → 0 in probability usually holds if the researcher picked the design matrix X or if the xi are iid random vectors from a well-behaved population. Outliers can cause the condition to fail. Convergence in distribution, Zn →^D Np(0, Σ), means the multivariate normal approximation can be used for probability
X^T X / n → W^{-1}.

Equivalently,

(X^T X)^{1/2} (β̂ − β) →^D Np(0, σ^2 I_p).    (2.13)
where the inequality follows from Chebyshev's inequality. Hence the asymp-
totic coverage of the nominal 95% PI is at least 73.9%. The 95% PI (2.14)
was often quite accurate in that the asymptotic coverage was close to 95% for
a wide variety of error distributions. The 99% and 90% PIs did not perform
as well.
Example 2.8. For the Buxton (1920) data suppose that the response Y
= height and the predictors were a constant, head length, nasal height, bigo-
nal breadth, and cephalic index. Five outliers were deleted leaving 82 cases.
Figure 2.3 shows a response plot of the fitted values versus the response Y
with the identity line added as a visual aid. The plot suggests that the model
is good since the plotted points scatter about the identity line in an evenly
populated band although the relationship is rather weak since the correlation
of the plotted points is not very high. The triangles represent the upper and
lower limits of the semiparametric 95% PI (2.17). For this example, 79 (or
96%) of the Yi fell within their corresponding PI while 3 Yi did not. A plot
using the classical PI (2.14) would be very similar for this data. The plot was
made with the following R commands, using the lregpack function piplot.
x <- buxx[-c(61,62,63,64,65),]
Y <- buxy[-c(61,62,63,64,65)]
piplot(x,Y)
Given output showing β̂i and given xf, se(pred), and se(Ŷf), Example 2.9 shows how to find Ŷf, a CI for E(Yf | xf), and the classical PI (2.14) for Yf.
Example 2.9. The Rouncefield (1995) data povc.lsp are female and male life expectancies from n = 91 countries where 6 cases with missing GNP were deleted. Suppose that it is desired to predict female life expectancy Y from male life expectancy X. Suppose that if Xf = 60, then se(pred) = 2.1285, and se(Ŷf) = 0.2241. Below is some output.
a) Find Ŷf if Xf = 60.
Solution: In this example, xf = (1, Xf)^T since a constant is in the output above. Thus Ŷf = β̂1 + β̂2 Xf = −2.93739 + 1.12359(60) = 64.478.
b) If Xf = 60, find a 90% confidence interval for E(Y) ≡ E(Yf | xf).
Solution: The CI is Ŷf ± tn−2,1−δ/2 se(Ŷf) = 64.478 ± 1.645(0.2241) = 64.478 ± 0.3686 = [64.1094, 64.8466]. To use the t-table on the last page of Chapter 14, use the 2nd to last row marked by Z since d = df = n − 2 = 89 > 30. In the last row find CI = 90% and intersect the 90% column and the Z row to get the value of t89,0.95 ≈ z0.95 = 1.645.
c) If Xf = 60, find a 90% prediction interval for Yf.
Solution: The PI is Ŷf ± tn−2,1−δ/2 se(pred) = 64.478 ± 1.645(2.1285) = 64.478 ± 3.5014 = [60.9766, 67.9794].
Two more PIs will be defined and then the 4 PIs (2.14), (2.17), (2.18), and (2.20) will be compared via simulation. An asymptotically conservative (ac) 100(1 − δ)% PI has asymptotic coverage 1 − δ1 ≥ 1 − δ. We used the (ac) 100(1 − δ)% PI

Ŷf ± max(|ξ̂δ/2|, |ξ̂1−δ/2|) sqrt[(n/(n − p))(1 + hf)]    (2.18)

In the simulations described below, ξ̂δ will be the sample percentile for the PIs (2.17) and (2.18). A PI is asymptotically optimal if it has the shortest asymptotic length that gives the desired asymptotic coverage. If the error distribution is unimodal, an asymptotically optimal PI can be created by applying the shorth(c) estimator to the residuals where c = ⌈n(1 − δ)⌉ and ⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8. That is, let r(1), . . . , r(n) be the order statistics of the residuals. Compute r(c) − r(1), r(c+1) − r(2), . . . , r(n) − r(n−c+1). Let [r(d), r(d+c−1)] = [ξ̂δ1, ξ̂1−δ2] correspond to the interval with the smallest distance. Then the large sample asymptotically optimal 100(1 − δ)% PI for Yf is

[Ŷf + an ξ̂δ1, Ŷf + an ξ̂1−δ2]    (2.20)

where an is given by (2.16).
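A minimal R sketch of the shorth(c) computation on the residuals is given below (the lregpack functions implement the full PI); it assumes r holds the residuals, delta is the nominal error rate, and Yfhat and an are available from the fit and from (2.16).

n <- length(r)
cc <- ceiling(n * (1 - delta))          # c = smallest integer >= n(1 - delta)
rs <- sort(r)                           # order statistics r(1), ..., r(n)
len <- rs[cc:n] - rs[1:(n - cc + 1)]    # r(c) - r(1), ..., r(n) - r(n-c+1)
d <- which.min(len)                     # shortest interval is [r(d), r(d+cc-1)]
lo <- Yfhat + an * rs[d]                # lower limit of PI (2.20)
hi <- Yfhat + an * rs[d + cc - 1]       # upper limit of PI (2.20)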
Remark 2.8. We recommend using the asymptotically optimal PI (2.20)
instead of the classical PI (2.14). The lregpack function pisim can be used to
recreate the simulation described below. See Problem 2.29.
A small simulation study compares the PI lengths and coverages for sample sizes n = 50, 100, and 1000 for several error distributions. The value n = ∞ gives the asymptotic coverages and lengths. The MLR model with E(Yi) = 1 + xi2 + · · · + xi8 was used. The vectors (x2, . . . , x8)^T were iid N7(0, I_7). The error distributions were N(0,1), t3, and exponential(1) − 1. Also, a small sensitivity study to examine the effects of changing (1 + 15/n) to (1 + k/n) on the 99% PIs (2.17) and (2.20) was performed. For n = 50 and k between 10 and 20, the coverage increased by roughly 0.001 as k increased by 1.
Tables 2.1–2.3 show the results of the simulations for the 3 error distri-
butions. The letters c, s, a, and o refer to intervals (2.14), (2.17), (2.18),
and (2.20), respectively. For the normal errors, the coverages were about
right and the semiparametric interval tended to be rather long for n = 50
and 100. The classical PI asymptotic coverage 1 − δ tended to be fairly close
to the nominal coverage 1 − α for all 3 distributions and α = 0.01, 0.05, and
0.1. The asymptotically optimal PI tended to have short length and simulated
coverage close to the nominal coverage.
The partial F test is used to test whether the reduced model is good in
that it can be used instead of the full model. It is crucial that the reduced
model be selected before looking at the data. If the reduced model is selected
after looking at output and discarding the worst variables, then the p-value
for the partial F test will be too high. For (ordinary) least squares, usually
a constant is used, and we are assuming that both the full model and the
reduced model contain a constant. The partial F test has null hypothesis
Ho: βq+1 = ··· = βp = 0, and alternative hypothesis HA: at least one of the
βj ≠ 0 for j > q. The null hypothesis is equivalent to Ho: the reduced model
is good. Since only the full model and reduced model are being compared,
the alternative hypothesis is equivalent to HA: the reduced model is not as
good as the full model, so use the full model, or more simply, HA: use the
full model.
To perform the partial F test, fit the full model and the reduced model
and obtain the ANOVA table for each model. The quantities dfF, SSE(F)
and MSE(F) are for the full model and the corresponding quantities from
the reduced model use an R instead of an F . Hence SSE(F) and SSE(R) are
the residual sums of squares for the full and reduced models, respectively.
Shown below is output only using symbols.
Full model: Residual df = dfF = n − p, SSE(F), MSE(F)
Reduced model: Residual df = dfR = n − q, SSE(R), MSE(R)
The 4 step partial F test of hypotheses follows.
i) State the hypotheses: Ho: the reduced model is good  HA: use the full model.
ii) Find the test statistic FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F).
iii) Find the pval = P(F_{dfR−dfF, dfF} > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = the number of parameters set to 0, and dfF = n − p,
while pval is the estimated p-value.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
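A small R sketch of steps ii) and iii) is given below, assuming the two residual sums of squares and residual degrees of freedom have been read off the ANOVA tables; the numbers used are those of Example 2.11 later in this section.

```r
# Partial F statistic and p-value from the two ANOVA tables (Example 2.11 values).
sse.r <- 35030.1; df.r <- 72      # reduced model
sse.f <- 34267.4; df.f <- 69      # full model
mse.f <- sse.f/df.f
fr <- ((sse.r - sse.f)/(df.r - df.f))/mse.f
pval <- 1 - pf(fr, df1 = df.r - df.f, df2 = df.f)
fr; pval                          # about 0.512 and 0.675
```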
Six plots are useful diagnostics for the partial F test: the RR plot with
the full model residuals on the vertical axis and the reduced model residuals
on the horizontal axis, the FF plot with the full model fitted values on the
vertical axis, and always make the response and residual plots for the full
and reduced models. Suppose that the full model is a useful MLR model. If
the reduced model is good, then the response plots from the full and reduced
models should be very similar, visually. Similarly, the residual plots (of the
fitted values versus the residuals) from the full and reduced models should be
very similar, visually. Finally, the correlation of the plotted points in the RR
and FF plots should be high, ≥ 0.95, say, and the plotted points in the RR
and FF plots should cluster tightly about the identity line. Add the identity
line to both the RR and FF plots as a visual aid. Also add the OLS line from
regressing r on rR to the RR plot (the OLS line is the identity line in the FF
plot). If the reduced model is good, then the OLS line should nearly coincide
with the identity line in that it should be difficult to see that the two lines
intersect at the origin, as in Figure 2.2. If the FF plot looks good but the
RR plot does not, the reduced model may be good if the main goal of the
analysis is to predict Y.
In Chapter 3, Example 3.8 describes the Gladstone (1905) data. Let the
reduced model use a constant, (size)1/3 , sex, and age. Then Figure 3.7 shows
the response and residual plots for the full and reduced models, and Figure 3.9
shows the RR and FF plots.
Example 2.10. For the Buxton (1920) data, n = 76 after 5 outliers and
6 cases with missing values are removed. Assume that the response variable
Y is height, and the explanatory variables are x2 = bigonal breadth, x3 =
cephalic index, x4 = finger to ground, x5 = head length, x6 = nasal height,
and x7 = sternal height. Suppose that the full model uses all 6 predictors plus
a constant (x1) while the reduced model uses the constant, cephalic index,
and finger to ground. Test whether the reduced model can be used instead of
the full model using the output below.
Solution: The 4 step partial F test follows.
i) Ho: the reduced model is good  HA: use the full model
ii) FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F) = 41588.9/496.629 = 83.742.
iii) pval = P(F_{4,69} > 83.742) = 0.00.
iv) The pval < δ (δ = 0.05, since δ was not given), so reject Ho. The full model
should be used instead of the reduced model. (Bigonal breadth, head length,
nasal height, and sternal height are needed in the MLR for height given that
cephalic index and finger to ground are in the model.)
Using a computer to get the pval makes sense, but for exams you may need
to use a table. In ARC, you can use the Calculate probability option from the
ARC menu, enter 83.742 as the value of the statistic, 4 and 69 as the degrees
of freedom, and select the F distribution. To use the table near the end of
Chapter 14, use the bottom row since the denominator degrees of freedom 69
> 30. Intersect with the column corresponding to k = 4 numerator degrees of
freedom. The cutoff value is 2.37. If the FR statistic was 2.37, then the pval
would be 0.05. Since 83.742 > 2.37, the pval < 0.05, and since 83.742 >> 2.37,
we can say that the pval ≈ 0.0.
Example 2.11. Now assume that the reduced model uses the constant,
sternal height, finger to ground, and head length. Using the output below, test
whether the reduced model is good.
Summary Analysis of Variance Table for Reduced Model
Source df SS MS F p-value
Regression 3 259704. 86568. 177.93 0.0000
Residual 72 35030.1 486.528
Solution: The 4 step partial F test follows.
i) Ho: the reduced model is good Ha: use the full model
ii) FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F)
= [(35030.1 − 34267.4)/(72 − 69)]/496.629 = 254.2333/496.629 = 0.512.
iii) The pval = P (F3,69 > 0.512) = 0.675.
iv) The pval > δ, so fail to reject Ho. The reduced model is good.
To use the F table near the end of Chapter 14, use the bottom row since
the denominator degrees of freedom 69 > 30. Intersect with the column cor-
responding to k = 3 numerator degrees of freedom. The cutoff value is 2.61.
Since 0.512 < 2.61, pval > 0.05, and this is enough information to fail to
reject Ho.
Some R commands and output to do the above problem are shown below.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T,dimnames=
list( c(), c("indx", "ht", "sternal", "finger",
"hdlen","nasal","bigonal", "cephalic")))
#copy and paste the data set cyp.lsp then press enter
cyp <- cyp[,-1]; cyp <- as.data.frame(cyp)
full <- lm(ht~.,data=cyp)
red <- lm(ht~sternal+finger+hdlen,data=cyp)
anova(red,full)
Model 1: ht ~ sternal + finger + hdlen
Model 2: ht ~ sternal + finger + hdlen + nasal
+ bigonal + cephalic
Res.Df RSS Df Sum of Sq F Pr(>F)
1 72 35030
2 69 34267 3 762.67 0.5119 0.6754
The 4 step Wald t test of hypotheses for Ho: βk = 0 follows.
i) State the hypotheses Ho: βk = 0  HA: βk ≠ 0.
ii) Find the test statistic to,k = β̂k/se(β̂k) or obtain it from output.
iii) Find the pval from output or use the t table: pval = 2P(t_{n−p} < −|to,k|).
Use the normal table or the d = Z line in the t table if the degrees of freedom
d = n − p ≥ 30. Again pval is the estimated p-value.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem.
The added variable plot (also called a partial regression plot) is used to
give information about the test Ho: βk = 0. The points in the plot cluster
about a line through the origin with slope = β̂k. An interesting fact is that the
residuals from this line, i.e. the residuals from regressing r(Y|x(k)) on r(xk|x(k)),
are exactly the same as the usual residuals from regressing Y on x. The range
of the horizontal axis gives information about the collinearity of xk with the
other predictors. Small range implies that xk is well explained by the other
predictors. The r(xk|x(k)) represent the part of xk that is not explained by
the remaining variables while the r(Y|x(k)) represent the part of Y that is not
explained by the remaining variables.
An added variable plot with a clearly nonzero slope and tight clustering
about a line implies that xk is needed in the MLR for Y given that the other
predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model. Slope near zero in the
added variable plot implies that xk may not be needed in the MLR for Y
given that all other predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model.
If the zero line with 0 slope and 0 intercept and the OLS line are added to
the added variable plot, the variable is probably needed if it is clear that the
two lines intersect at the origin. Then the point cloud should be tilted away
from the zero line. The variable is probably not needed if the two lines nearly
coincide near the origin in that you cannot clearly tell that they intersect at
the origin.
Shown below is output only using symbols, and the following example shows
how to use output to perform the Wald t test.
Response = Y
Coefficient Estimates
Example 2.12. The output above was collected from 26 districts in Prus-
sia in 1843. See Hebbler (1847). The goal is to study the relationship between
Y = the number of women married to civilians in the district with the predic-
tors x2 = the population of the district, and x3 = military women = number
of women married to husbands in the military.
a) Find a 95% confidence interval for β2 corresponding to population.
The CI is β̂k ± t_{n−p,1−α/2} se(β̂k). Since n = 26, df = n − p = 26 − 3 = 23.
From the t table at the end of Chapter 14, intersect the df = 23 row with
the column that is labelled by 95% in the CI row near the bottom of the
table. Then t_{n−p,1−α/2} = 2.069. Using the output shows that the 95% CI is
0.180225 ± 2.069(0.00503871) = [0.16980, 0.19065].
b) Perform a 4 step test for Ho: β2 = 0 corresponding to population.
i) Ho: β2 = 0  HA: β2 ≠ 0
ii) to2 = 35.768
iii) pval = 0.0
iv) Reject Ho, the population is needed in the MLR model for the number
of women married to civilians if the number of military women is in the
model.
c) Perform a 4 step test for Ho: β3 = 0 corresponding to military women.
i) Ho: β3 = 0  HA: β3 ≠ 0
ii) to3 = 0.713
iii) pval = 0.4883
iv) Fail to reject Ho, the number of military women is not needed in the
MLR model for the number of women married to civilians if population is in
the model.
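The Wald t statistics and the confidence interval of part a) can also be obtained directly in R. A minimal sketch follows, assuming the marry data from lregdata.txt have been loaded as in the Figure 2.4 commands below, with the same column choices (column 1 = population, column 5 = military women, column 3 = women married to civilians).

```r
# Wald t tests and 95% CIs for the Hebbler (1847) data of Example 2.12.
source("G:/lregdata.txt")                    # loads marry, as in the commands below
x2 <- marry[,1]; x3 <- marry[,5]; y <- marry[,3]
out <- lm(y ~ x2 + x3)
summary(out)$coefficients                    # estimates, SEs, t statistics, p-values
confint(out, level = 0.95)                   # CIs; the x2 row corresponds to part a)
```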
Figure 2.4, made with the commands shown below, shows the added vari-
able plots for x2 and x3 . The plot for x2 strongly suggests that x2 is needed
in the MLR model while the plot for x3 indicates that x3 does not seem to
be very important. The slope of the OLS line in a) is 0.1802 while the slope
of the line in b) is 1.894.
source("G:/lregdata.txt")
x2 <- marry[,1]
x3 <- marry[,5]
y <- marry[,3]
#par(mfrow=c(1,2),pty="s")
#square plots look nice but have too much white space
par(mfrow=c(1,2))
resy2 <- residuals(lm(y~x3))
resx2 <- residuals(lm(x2~x3))
plot(resx2,resy2)
abline(lsfit(resx2,resy2)$coef)
title("a) Added Variable Plot for x2")
resy3 <- residuals(lm(y~x2))
resx3 <- residuals(lm(x3~x2))
plot(resx3,resy3)
abline(lsfit(resx3,resy3)$coef)
title("b) Added Variable Plot for x3")
[Figure 2.4: added variable plots a) for x2 (vertical axis resy2) and b) for x3 (vertical axis resy3).]
Fig. 2.5 The OLS Fit Minimizes the Sum of Squared Residuals
The OLS estimator β̂ minimizes the OLS criterion
Q_OLS(η) = Σ_{i=1}^n r_i²(η)
where the residual r_i(η) = Y_i − x_i^T η. In other words, let r_i = r_i(β̂) be the
OLS residuals. Then Σ_{i=1}^n r_i² ≤ Σ_{i=1}^n r_i²(η) for any p × 1 vector η, and the
equality holds iff η = β̂ if the n × p design matrix X is of full rank p ≤ n. In
particular, if X has full rank p, then Σ_{i=1}^n r_i² < Σ_{i=1}^n r_i²(β) = Σ_{i=1}^n e_i² even
if the MLR model Y = Xβ + e is a good approximation to the data.
In a response plot of x^T η versus Y, the vertical deviations from the identity
line are the residuals r_i(η). For this data, the OLS
estimator β̂ = (498.726, 1.597, 30.462, 0.696)^T. Figure 2.5b shows the re-
sponse plot using the ESP x^T η where η = (498.726, 1.597, 30.462, 0.796)^T.
Hence only the coefficient for x4 was changed; however, the residuals r_i(η) in
the resulting plot are much larger in magnitude on average than the residuals
in the OLS response plot. With slightly larger changes in the OLS ESP, the
resulting η will be such that the squared residuals are massive.
Proof: Seber and Lee (2003, pp. 36–37). Recall that the hat matrix H =
X(X^T X)^{-1} X^T and notice that (I − H)^T = I − H, that (I − H)H = 0,
and that HX = X. Let η be any p × 1 vector. Then
(Y − Xβ̂)^T (Xβ̂ − Xη) = (Y − HY)^T (HY − HXη) =
Y^T (I − H)H(Y − Xη) = 0.
Thus Q_OLS(η) = ‖Y − Xη‖² = ‖Y − Xβ̂ + Xβ̂ − Xη‖² =
‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖² + 2(Y − Xβ̂)^T (Xβ̂ − Xη).
Hence
‖Y − Xη‖² = ‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖².   (2.21)
So
‖Y − Xη‖² ≥ ‖Y − Xβ̂‖²
with equality iff
X(β̂ − η) = 0
iff β̂ = η since X is of full rank. Alternatively, write the OLS criterion as
Q_OLS(η) = Σ_{i=1}^n (Y_i − x_{i,1} η_1 − x_{i,2} η_2 − ··· − x_{i,p} η_p)²,
so that
∂Q_OLS(η)/∂η_j = −2 Σ_{i=1}^n x_{i,j}(Y_i − x_{i,1} η_1 − x_{i,2} η_2 − ··· − x_{i,p} η_p) = −2(v_j)^T (Y − Xη)
where v_j is the jth column of X. Combining the p partial derivatives and setting them
equal to 0 gives
X^T Y − X^T Xη = 0,
or
X^T X β̂ = X^T Y.   (2.22)
Equation (2.22) is known as the normal equations. If X has full rank,
then β̂ = (X^T X)^{-1} X^T Y. To show that β̂ is the global minimizer of the
OLS criterion, use the argument following Equation (2.21).
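A short R sketch illustrating that solving the normal equations (2.22) reproduces the lm() coefficients; the data below are simulated, so the numbers are arbitrary.

```r
# Solving the normal equations X^T X beta = X^T Y directly.
set.seed(1)
n <- 100; p <- 4
X <- cbind(1, matrix(rnorm(n*(p - 1)), n, p - 1))     # design matrix with a constant
Y <- X %*% c(1, 2, -1, 0.5) + rnorm(n)
bhat <- solve(crossprod(X), crossprod(X, Y))          # (X^T X)^{-1} X^T Y
cbind(normal.eqns = bhat, lm = coef(lm(Y ~ X[,-1])))  # the two agree up to rounding
```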
Y ⊥ x | x^T β.
For the location model, the OLS criterion is
Q_OLS(η) = Σ_{i=1}^n (Y_i − η)²  and  dQ_OLS(η)/dη = −2 Σ_{i=1}^n (Y_i − η).
Setting the derivative equal to 0 and calling the solution η̂ gives Σ_{i=1}^n Y_i = nη̂,
or η̂ = Ȳ. The second derivative
d²Q_OLS(η)/dη² = 2n > 0,
hence η̂ is the global minimizer.
Y_i = β_1 + β_2 X_i + e_i = α + β X_i + e_i.
The least squares criterion is Q(η_1, η_2) = Σ_{i=1}^n (Y_i − η_1 − η_2 X_i)². Then
∂Q/∂η_1 = −2 Σ_{i=1}^n (Y_i − η_1 − η_2 X_i)
and
∂²Q/∂η_1² = 2n.
Similarly,
∂Q/∂η_2 = −2 Σ_{i=1}^n X_i (Y_i − η_1 − η_2 X_i)
and
∂²Q/∂η_2² = 2 Σ_{i=1}^n X_i².
Setting the first partial derivatives to zero and calling the solutions β̂_1 and
β̂_2 shows that the OLS estimators β̂_1 and β̂_2 satisfy the normal equations:
Σ_{i=1}^n Y_i = nβ̂_1 + β̂_2 Σ_{i=1}^n X_i  and
Σ_{i=1}^n X_i Y_i = β̂_1 Σ_{i=1}^n X_i + β̂_2 Σ_{i=1}^n X_i².
It can be shown that β̂_2 = Σ_{i=1}^n k_i Y_i where
k_i = (X_i − X̄) / Σ_{j=1}^n (X_j − X̄)².   (2.24)
The no intercept MLR model, also known as regression through the origin, is
still Y = Xβ + e, but there is no intercept in the model, so X does not contain
a column of ones 1. Hence the intercept term β1 = β1(1) is replaced by β1 xi,1.
Software gives output for this model if the no intercept or intercept = F
option is selected. For the no intercept model, the assumption E(e) = 0 is
important, and this assumption is rather strong.
Many of the usual MLR results still hold: β̂_OLS = (X^T X)^{-1} X^T Y, the
vector of predicted fitted values Ŷ = Xβ̂_OLS = HY where the hat matrix
H = X(X^T X)^{-1} X^T provided the inverse exists, and the vector of residuals
is r = Y − Ŷ. The response plot and residual plot are made in the same way
and should be made before performing inference.
The main difference in the output is the ANOVA table. The ANOVA F
test in Section 2.4 tests Ho: β2 = ··· = βp = 0. The test in this section tests
Ho: β1 = ··· = βp = 0, i.e. Ho: β = 0. The following definition and test
follows Guttman (1982, p. 147) closely.
a) The total sum of squares
SST = Σ_{i=1}^n Y_i².   (2.25)
b) The model sum of squares
SSM = Σ_{i=1}^n Ŷ_i².   (2.26)
c) The residual sum of squares, or error sum of squares, is
SSE = Σ_{i=1}^n (Y_i − Ŷ_i)² = Σ_{i=1}^n r_i².   (2.27)
d) The degrees of freedom (df) for SSM is p, the df for SSE is n − p, and
the df for SST is n. The mean squares are MSE = SSE/(n − p) and MSM =
SSM/p.
Source    df     SS    MS    F             p-value
Model     p      SSM   MSM   Fo=MSM/MSE    for Ho: β = 0
Residual  n-p    SSE   MSE
The ANOVA F test can also be found with the no intercept model by
adding a column of ones to the R matrix x and then performing the partial
F test with the full model and the reduced model that only uses the column
of ones. Notice that the intercept = F option needs to be used to fit both
models. The residual standard error = RSE = √MSE. Thus SSE = (n −
k)(RSE)² where n − k is the denominator degrees of freedom for the F test.
> ls.print(lsfit(x[,1],y,intercept=F))
Residual Standard Error=164.5028
F-statistic (df=1, 266)=15744.48
((266*(164.5028)^2 - 262*(82.9175)^2)/4)/(82.9175)^2
[1] 196.2435
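A minimal R sketch of the Ho: β = 0 test for a no intercept model is given below, assuming the response y and the predictor matrix x are already in R; for a fit without an intercept, the F statistic reported by summary() tests all of the coefficients, matching the ANOVA table above. The second pair of lines sketches the partial F approach described above, where a column of ones plays the role of the constant.

```r
# Test of Ho: beta = 0 in the no intercept model Y = X beta + e.
nofit <- lm(y ~ x - 1)                 # "- 1" removes the intercept
summary(nofit)$fstatistic              # Fo = MSM/MSE with df = (p, n - p)

# Partial F test using an added column of ones, with both models fit
# without an intercept, as described in the text.
ones <- rep(1, length(y))
anova(lm(y ~ ones - 1), lm(y ~ ones + x - 1))
```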
2.11 Summary
1) The response variable is the variable that you want to predict. The pre-
dictor variables are the variables used to predict the response variable.
2) Regression is the study of the conditional distribution Y |x.
3) The MLR model is Yi = β1 xi,1 + β2 xi,2 + ··· + βp xi,p + ei = xi^T β + ei
for i = 1, . . . , n. Here n is the sample size and the random variable ei is the ith
error. Assume that the errors are iid with E(ei) = 0 and VAR(ei) = σ² < ∞.
Assume that the errors are independent of the predictor variables xi. The
unimodal MLR model assumes that the ei are iid from a unimodal distribution
that is not highly skewed. Usually xi,1 ≡ 1.
In matrix form, the MLR model is Y = Xβ + e, and the residual sum of
squares is SSE = Σ_{i=1}^n r_i².
9) If the MLR model contains a constant, then
R² = [corr(Yi, Ŷi)]² = SSR/SSTO = 1 − SSE/SSTO.
Source      df     SS    MS    F             p-value
Regression  p-1    SSR   MSR   Fo=MSR/MSE    for Ho: β2 = ··· = βp = 0
Residual    n-p    SSE   MSE
11) The large sample 100(1 − α)% CI for E(Yf|xf) = xf^T β = E(Ŷf) is
Ŷf ± t_{n−p,1−α/2} se(Ŷf) where P(T ≤ t_{n−p,δ}) = δ if T has a t distribution with
n − p degrees of freedom.
12) The classical 100(1 − α)% PI for Yf is Ŷf ± t_{n−p,1−α/2} se(pred), but
should be replaced with the asymptotically optimal PI (2.20).
Full model: Residual df = dfF = n − p, SSE(F), MSE(F)
Reduced model: Residual df = dfR = n − q, SSE(R), MSE(R)
The 4 step partial F test of hypotheses follows.
i) State the hypotheses: Ho: the reduced model is good  HA: use the full model.
ii) Find the test statistic FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F).
iii) Find the pval = P(F_{dfR−dfF, dfF} > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = the number of parameters set to 0, and dfF = n − p,
while pval is the estimated p-value.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
Use the normal table or the d = Z line in the t table if the degrees of freedom
d = n − p ≥ 30.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem. If Ho is
rejected, then conclude that xk is needed in the MLR model for Y given that
the other predictors are in the model. If you fail to reject Ho, then conclude
that xk is not needed in the MLR model for Y given that the other predictors
are in the model.
16) Given Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ), Σ_{i=1}^n (Xi − X̄)², X̄, and Ȳ, find the
least squares line Ŷ = β̂1 + β̂2 X where
β̂2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
and β̂1 = Ȳ − β̂2 X̄.
17) Given ρ̂, sX, sY, X̄, and Ȳ, find the least squares line Ŷ = β̂1 + β̂2 X
where β̂2 = ρ̂ sY/sX and β̂1 = Ȳ − β̂2 X̄.
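A small R sketch of items 16) and 17); the summary statistics below are made-up numbers used only for illustration (ρ̂ = 0.496 echoes Problem 2.8, but sX, sY, X̄, and Ȳ are invented).

```r
# Least squares line from summary statistics (summary items 16 and 17).
Sxy <- 322.5; Sxx <- 215.0        # sum (Xi - Xbar)(Yi - Ybar) and sum (Xi - Xbar)^2 (made up)
xbar <- 58.2; ybar <- 65.4
b2 <- Sxy/Sxx; b1 <- ybar - b2*xbar        # item 16
c(b1, b2)
rho <- 0.496; sx <- 7.7; sy <- 9.1         # item 17 (sx, sy made up)
b2 <- rho*sy/sx; b1 <- ybar - b2*xbar
c(b1, b2)
```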
2.12 Complements
The Least Squares Central Limit Theorem 2.8 is often a good approximation
if n ≥ 10p and the error distribution has light tails, i.e. the probability of
an outlier is nearly 0 and the tails go to zero at an exponential rate or faster.
For error distributions with heavier tails, much larger samples are needed,
and the assumption that the variance σ² exists is crucial, e.g. Cauchy errors
are not allowed. Norman and Streiner (1986, p. 63) recommend n ≥ 5p.
The classical MLR prediction interval does not work well and should be re-
placed by the Olive (2007) asymptotically optimal PI (2.20). Lei and Wasser-
man (2014) provide an alternative: use the Lei et al. (2013) PI [rL, rU] on the
residuals; then the PI for Yf is
[Ŷf + rL, Ŷf + rU].   (2.28)
Bootstrap PIs need more theory, and instead of using B = 1000 samples, use
B = max(1000, n). See Olive (2014, pp. 279–285).
For the additive error regression model Y = m(x) + e, the response plot
of Ŷ = m̂(x) versus Y, with the identity line added as a visual aid, is used
like the MLR response plot. We want n ≥ 10 df where df is the degrees of
freedom from fitting m̂. Olive (2013a) provides PIs for this model, including
the location model. These PIs are large sample PIs provided that the sample
quantiles of the residuals are consistent estimators of the population quantiles
of the errors. The response plot and PIs could also be used for methods
described in James et al. (2013) such as ridge regression, lasso, principal
components regression, and partial least squares. See Pelawa Watagoda and
Olive (2017) if n is not large compared to p.
In addition to large sample theory, we want the PIs to work well on a
single data set as future observations are gathered, but only have the training
data (x1 , Y1 ), . . . , (xn , Yn ). Much like k-fold cross validation for discriminant
analysis, randomly divide the data set into k = 5 groups of approximately
equal size. Compute the model from 4 groups and use the 5th group as a
validation set: compute the PI for xf = xj for each j in the 5th group.
Repeat so each of the 5 groups is used as a validation set. Compute the
proportion of times Yi was in its PI for i = 1, . . . , n as well as the average
length of the n PIs. We want the proportion near the nominal proportion
and short average length if two or more models or PIs are being considered.
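A minimal R sketch of this 5-fold check follows, assuming a predictor matrix x, a response Y, and a function pifun(xtrain, ytrain, xf, alpha) that returns the lower and upper PI limits for one new case (for example, the shorth PI sketch given earlier in this chapter); pifun is a hypothetical name introduced only for this illustration.

```r
# 5-fold check of PI coverage and average length on the training data.
pi.check <- function(x, Y, pifun, alpha = 0.05, k = 5) {
  n <- length(Y)
  fold <- sample(rep(1:k, length.out = n))    # random split into k groups
  cover <- len <- numeric(n)
  for (j in 1:k) {
    val <- which(fold == j)                   # jth group used as the validation set
    for (i in val) {
      lims <- pifun(x[-val, , drop = FALSE], Y[-val], x[i, ], alpha)
      cover[i] <- (Y[i] >= lims[1]) & (Y[i] <= lims[2])
      len[i] <- lims[2] - lims[1]
    }
  }
  c(coverage = mean(cover), avg.length = mean(len))
}
```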
Following Chapter 11, under the regularity conditions, much of the infer-
ence that is valid for the normal MLR model is approximately valid for the
unimodal MLR model when the sample size is large. For example, confidence
intervals for βi are asymptotically correct, as are t tests for βi = 0 (see Li
and Duan (1989, p. 1035)), the MSE is an estimator of σ² by Theorems 2.6
and 2.7, and variable selection procedures perform well (see Chapter 3 and
Olive and Hawkins 2005).
Algorithms for OLS are described in Datta (1995), Dongarra et al. (1979),
and Golub and Van Loan (1989). See Harter (1974a,b, 1975a,b,c, 1976) for
a historical account of multiple linear regression. Draper (2002) provides a
bibliography of more recent references.
Cook and Weisberg (1997, 1999a: ch. 17) call a plot that emphasizes model
agreement a model checking plot. Anscombe (1961) and Anscombe and Tukey
(1963) suggested graphical methods for checking multiple linear regression
and experimental design methods that were the state of the art at the
time.
The rules of thumb given in this chapter for residual plots are not perfect.
Cook (1998, pp. 4–6) gives an example of a residual plot that looks like a
right opening megaphone, but the MLR assumption that was violated was
linearity, not constant variance. Ghosh (1987) gives an example where the
residual plot shows no pattern even though the constant variance assumption
is violated. Searle (1988) shows that residual plots will have parallel lines if
several cases take on each of the possible values of the response variable, e.g.
if the response is a count.
Several authors have suggested using the response plot to visualize the co-
efficient of determination R² in multiple linear regression. See, for example,
Chambers et al. (1983, p. 280). Anderson-Sprecher (1994) provides an ex-
cellent discussion about R². Kachigan (1982, pp. 174–177) also gives a good
explanation of R². Also see Kvalseth (1985) and Freedman (1983).
Hoaglin and Welsh (1978) discuss the hat matrix H, and Brooks et al.
(1988) recommend using hf ≤ max hi for valid predictions. Simultaneous
R Squared: R²
Sigma hat: √MSE
Number of cases: n
Degrees of Freedom: n − p

Source      df     SS    MS    F             p-value
Regression  p-1    SSR   MSR   Fo=MSR/MSE    for Ho: β2 = ··· = βp = 0
Residual    n-p    SSE   MSE
The typical relevant OLS output has the form given above, but occa-
sionally software also includes output for a lack of fit test as shown below.
Source         df     SS     MS     Fo
Regression     p-1    SSR    MSR    Fo=MSR/MSE
Residual       n-p    SSE    MSE
 lack of fit   c-p    SSLF   MSLF   FLF = MSLF/MSPE
 pure error    n-c    SSPE   MSPE
The lack of fit test assumes that
Yi = m(xi) + ei   (2.29)
where E(Yi|xi) = m(xi), m is some possibly nonlinear function, and that
the ei are iid N(0, σ²). Notice that the MLR model is the special case with
m(xi) = xi^T β. The lack of fit test needs at least one replicate: 2 or more Ys
with the same value of predictors x. Then there are c replicate groups with
nj observations in the jth group. Each group has the vector of predictors xj,
say, and at least one nj > 1. Also, Σ_{j=1}^c nj = n. Denote the Ys in the jth
group by Yij, and let the sample mean of the Ys in the jth group be Ȳj.
Then the sample variance of the Ys in the jth group is
(1/(nj − 1)) Σ_{i=1}^{nj} (Yij − Ȳj)²,
and the pure error sum of squares is
SSPE = Σ_{j=1}^c Σ_{i=1}^{nj} (Yij − Ȳj)².
Although the lack of fit test seems clever, examining the response plot and
residual plot is a much more effective method for examining whether or not
the MLR model fits the data well provided that n ≥ 10p. A graphical version
of the lack of fit test would compute the Ȳj and see whether they scatter
about the identity line in the response plot. When there are no replicates,
the range of Ŷ could be divided into several narrow nonoverlapping intervals
called slices. Then the mean Ȳj of each slice could be computed and a step
function with step height Ȳj at the jth slice could be plotted. If the step
function follows the identity line, then there is no evidence of lack of fit.
However, it is easier to check whether the Yi are scattered about the identity
line. Examining the residual plot is useful because it magnifies deviations
from the identity line that may be difficult to see until the linear trend is
removed. The lack of fit test may be sensitive to the assumption that the
errors are iid N(0, σ²).
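A small R sketch of this slicing idea is given below, assuming an lm() fit and its response are available; it divides the fitted values into slices, computes the mean response in each slice, and adds the resulting step function and the identity line to the response plot.

```r
# Graphical lack of fit check: slice means added to the response plot.
slice.plot <- function(fit, y, nslices = 10) {
  fv <- fitted(fit)
  plot(fv, y, xlab = "FIT", ylab = "Y")                 # response plot
  abline(0, 1)                                          # identity line
  br <- quantile(fv, probs = seq(0, 1, length.out = nslices + 1))
  slice <- cut(fv, breaks = unique(br), include.lowest = TRUE)
  means <- tapply(y, slice, mean)                       # Ybar_j for each slice
  mids <- tapply(fv, slice, mean)
  lines(mids, means, type = "s", lwd = 2)               # step function of slice means
}
```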
When Y ⊥ x | x^T β, then the response plot of the estimated sufficient pre-
dictor (ESP) x^T β̂ versus Y is used to visualize the conditional distribution of
Y | x^T β, and will often greatly outperform the corresponding lack of fit test.
When the response plot can be combined with a good lack of fit plot such as
a residual plot, using a one number summary of lack of fit such as the test
statistic FLF makes little sense.
Nevertheless, the literature for lack of fit tests for various statistical meth-
ods is enormous. See Joglekar et al. (1989), Pena and Slate (2006), and Su
and Yang (2006) for references.
For the following homework problems, Cody and Smith (2006) is useful
for SAS, while Cook and Weisberg (1999a) is useful for Arc. Becker et al.
(1988) and Crawley (2013) are useful for R.
2.13 Problems
2.1. Assume that the response variable Y is height, and the explanatory
variables are X2 = sternal height, X3 = cephalic index, X4 = finger to ground,
X5 = head length, X6 = nasal height, and X7 = bigonal breadth. Suppose that
the full model uses all 6 predictors plus a constant (= X1) while the reduced
model uses the constant and sternal height. Test whether the reduced model
can be used instead of the full model using the output above. The data set
had 74 cases.
2.2. The above output, starting on the previous page, comes from the
Johnson (1996) STATLIB data set bodyfat after several outliers are deleted.
It is believed that Y = β1 + β2 X2 + β3 X2² + e where Y is the person's bodyfat
and X2 is the person's density. Measurements on 245 people were taken. In
addition to X2 and X2², 7 additional measurements X4, . . . , X10 were taken.
Both the full and reduced models contain a constant X1 ≡ 1.
b) Test whether the reduced model can be used instead of the full model.
2.3. The above output was produced from the file mussels.lsp in Arc. See
Cook and Weisberg (1999a). Let Y = log(M) where M is the muscle mass
of a mussel. Let X1 ≡ 1, X2 = log(H) where H is the height of the shell,
and let X3 = log(S) where S is the shell mass. Suppose that it is desired to
predict Yf if log(H) = 4 and log(S) = 5, so that xf^T = (1, 4, 5). Assume that
se(Ŷf) = 0.410715 and that se(pred) = 0.467664.
2.4. The above output, starting on the previous page, is from the multiple
linear regression of the response Y = height on the two nontrivial predictors
sternal height = height at shoulder, and finger to ground = distance from the
tip of a person's middle finger to the ground.
a) Consider the plot with Yi on the vertical axis and the least squares
fitted values Ŷi on the horizontal axis. Sketch how this plot should look if the
multiple linear regression model is appropriate.
b) Sketch how the residual plot should look if the residuals ri are on the
vertical axis and the fitted values Ŷi are on the horizontal axis.
c) From the output, are sternal height and nger to ground useful for
predicting height? (Perform the ANOVA F test.)
2.5. Suppose that it is desired to predict the weight of the brain (in
grams) from the cephalic index measurement. The output below uses data
from 267 people.
predictor coef Std. Error t-value p-value
Constant 865.001 274.252 3.154 0.0018
cephalic 5.05961 3.48212 1.453 0.1474
Do a 4 step test for Ho: β2 = 0.
2.7. Suppose that the 95% confidence interval for β2 is [−17.457, −15.832].
In the simple linear regression model, is X a useful linear predictor for Y? If
your answer is no, could X be a useful predictor for Y? Explain.
2.8. Suppose it is desired to predict the yearly return from the stock
market from the return in January. Assume that the correlation ρ̂ = 0.496.
Using the table below, find the least squares line Ŷ = β̂1 + β̂2 X.
xi   yi
56   63
59   70
64   72
74   84
2.10. In the above table, xi is the length of the femur and yi is the length
of the humerus taken from five dinosaur fossils (Archaeopteryx) that preserved
both bones. See Moore (2000, p. 99).
a) Complete the table and find the least squares estimators β̂1 and β̂2.
a) What is E(Yi )?
c) Show that your estimator is the global minimizer of the least squares criterion
Q by showing that the second derivative d²Q(η)/dη² > 0 for all values of η.
2.12. The location model is Yi = μ + ei for i = 1, . . . , n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ². The least squares
estimator μ̂ of μ minimizes the least squares criterion Q(η) = Σ_{i=1}^n (Yi − η)².
To find the least squares estimator, perform the following steps.
a) Find the derivative dQ(η)/dη, set the derivative equal to zero, and solve for
η. Call the solution μ̂.
b) To show that the solution was indeed the global minimizer of Q, show
that d²Q(η)/dη² > 0 for all real η. (Then the solution μ̂ is a local min and Q is
convex, so μ̂ is the global min.)
2.13. The normal error model for simple linear regression through the
origin is
Yi = β Xi + ei
for i = 1, . . . , n where e1, . . . , en are iid N(0, σ²) random variables.
b) Find E(β̂).
c) Find VAR(β̂).
(Hint: Note that β̂ = Σ_{i=1}^n ki Yi where the ki depend on the Xi which are
treated as constants.)
2.14. Suppose that the regression model is Yi = 10 + 2Xi2 + β3 Xi3 + ei for
i = 1, . . . , n where the ei are iid N(0, σ²) random variables. The least squares
criterion is Q(β3) = Σ_{i=1}^n (Yi − 10 − 2Xi2 − β3 Xi3)². Find the least squares es-
timator β̂3 of β3 by setting the first derivative dQ(β3)/dβ3 equal to zero. Show
that your β̂3 is the global minimizer of the least squares criterion Q by show-
ing that the second derivative d²Q(β3)/dβ3² > 0 for all values of β3.
Minitab Problems
Double click means press the rightmost mouse button twice in rapid
succession. Drag means hold the mouse button down. This technique is
used to select menu options.
After your computer is on, get into Minitab, often by searching programs
and then double clicking on the icon marked Student Minitab.
i) In a few seconds, the Minitab session and worksheet windows fill the screen.
At the top of the screen there is a menu. The upper left corner has the menu
option File. Move your cursor to File and drag down the option Open
Worksheet. A window will appear. Double click on the icon Student. This
will display a large number of data sets.
ii) In the middle of the screen there is a scroll bar, a gray line with left and
right arrow keys. Use the right arrow key to make the data file Prof.mtw
appear. Double click on Prof.mtw. A window will appear. Click on OK.
iii) The worksheet window will now be filled with data. The top of the screen
has a menu. Go to Stat and drag down Regression. Another window will
appear: drag down Regression (write this as Stat>Regression>Regression).
iv) A window will appear with variables to the left and the response variable
and predictors (explanatory variables) to the right. Double click on instrucr
to make it the response. Double click on manner to make it the (pre-
dictor) explanatory variable. Then click on OK.
v) The required output will appear in the session window. You can view the
output by using the vertical scroll bar on the right of the screen.
vi) Copy and paste the output into Word, or to print your single page of
output, go to File, and drag down the option Print Session Window. A
window will appear. Click on ok. Then get your output from the printer.
Use the F3 key to clear entries from a dialog window if you make a mistake
or want a new plot.
To get out of Minitab, move your cursor to the x in the upper right
corner of the screen. When asked whether to save changes, click on no.
2.15. (Minitab problem.) See the above instructions for using Minitab.
Get the data set prof.mtw. Assign the response variable to be instrucr (the
instructor rating from course evaluations) and the explanatory variable (pre-
dictor) to be manner (the manner of the instructor). Run a regression on
these variables.
d) To get residual and response plots you need to store the residuals and
fitted values. Use the menu commands Stat>Regression>Regression to get
the regression window. Put instrucr in the Response and manner in the
Predictors boxes. Then click on Storage. From the resulting window click
on Fits and Residuals. Then click on OK twice.
To get a response plot, use the commands Graph>Plot, (double click)
place instrucr in the Y box, and Fits1 in the X box. Then click on OK. Print
the plot by clicking on the graph and then clicking on the printer icon.
e) To make a residual plot, use the menu commands Graph>Plot to get
a window. Place Resi1 in the Y box and Fits1 in the X box. Then click
on OK. Print the plot by clicking on the graph and then clicking on the
printer icon.
f) To save your Minitab data on your flash drive, use the menu commands
File>Save Current Worksheet as. In the resulting dialog window, the top
box says Save in and there is an arrow icon to the right of the top box.
Click several times on the arrow icon until the Save in box reads My computer,
then click on Removable Disk (J:). In the File name box, enter
H2d16.mtw. Then click on OK.
SAS Problems
Copy and paste the SAS programs for problems 2.17 and 2.18
from (http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter
the SAS program in Notepad or Word.
SAS is a statistical software package widely used in industry. You will need
a flash drive. Referring to the program for Problem 2.17, the semicolon ;
is used to end SAS commands and the options ls = 70; command makes
the output readable. (An * can be used to insert comments into the SAS
program. Try putting an * before the options command and see what it does
to the output.) The next step is to get the data into SAS. The command data
wcdata; gives the name wcdata to the data set. The command input x
y; says the first entry is variable x and the 2nd variable y. The command
cards; means that the data is entered below. Then the data is entered
and the isolated semicolon indicates that the last case has been entered. The
command proc print; prints out the data. The command proc corr; will
give the correlation between x and y. The commands proc plot; plot y*x;
makes a scatterplot of x and y. The commands proc reg; model y=x; output
out = a p =pred r =resid; tells SAS to perform a simple linear regression
with y as the response variable. The output data set is called a and contains
the fitted values and residuals. The command proc plot data = a; tells SAS
to make plots from data set a rather than data set wcdata. The command
plot resid*(pred x); will make a residual plot of the fitted values versus the
residuals and a residual plot of x versus the residuals. The next plot command
makes a response plot.
To use SAS on windows (PC), use the following steps.
i) Get into SAS, often by double clicking on an icon for programs such as a
Math Progs icon and then double clicking on a SAS icon. If your computer
does not have SAS, go to another computer.
ii) A window should appear with 3 icons. Double click on The SAS System
for . . . .
iii) Like Minitab, a window with a split screen will open. The top screen
says Log-(Untitled) while the bottom screen says Editor-Untitled1. Press the
spacebar and an asterisk appears: Editor-Untitled1*.
2.17. a) Copy and paste the program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter the SAS
program in Notepad or Word. The ls stands for linesize so l is a lowercase L,
not the number one.
When you are done entering the program, you may want to save the pro-
gram as h2d17.sas on your flash drive (J: drive, say). (On the top menu of
the editor, use the commands File > Save as. A window will appear. Use
the upper right arrow to locate Removable Disk (J:) and then type the file
name in the bottom box. Click on OK.)
b) Get back into SAS, and from the top menu, use the File> Open
command. A window will open. Use the arrow in the upper right corner
of the window to navigate to Removable Disk (J:). (As you click on the
arrow, you should see My Documents, C: etc, then Removable Disk (J:).)
Double click on h2d17.sas. (Alternatively cut and paste the program into the
SAS editor window.) To execute the program, use the top menu commands
Run>Submit. An output window will appear if successful.
If you were not successful, look at the log window for hints on errors.
A single typo can cause failure. Reopen your file in Word or Notepad and
make corrections. Occasionally you cannot find your error. Then find your
instructor or wait a few hours and reenter the program.
c) To copy and paste relevant output into Word or Notepad, click on the
output window and use the top menu commands Edit>Select All and then
the menu commands Edit>Copy.
In Notepad use the commands Edit>Paste. Then use the mouse to high-
light the relevant output. Then use the commands Edit>Copy.
Finally, in Word, use the command Paste. You can also cut output from
Word and paste it into Notepad.
You may want to save your SAS output as the file HW2d17.doc on your
flash drive.
d) To save your output on your flash drive, use the Word menu commands
File > Save as. In the Save in box select Removable Disk (J:) and in
the File name box enter HW2d17.doc. To get a Word printout, click on the
printer icon or use the menu commands File>Print.
Save the output giving the least squares coefficients in Word.
e) Predict Y if X = 40.
f) What is the residual when X = 40?
2.18. This problem shows how to use SAS for MLR. The data are from
Kutner et al. (2005, problem 6.5). The response is brand liking, a measure-
ment for whether the consumer liked the brand. The variable X1 is moisture
content and the variable X2 is sweetness. Copy and paste the program for
this problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Execute the SAS program and copy the output file into Notepad. Scroll
down the output that is now in Notepad until you find the regression coeffi-
cients and ANOVA table. Then cut and paste this output into Word.
b) Do the 4 step ANOVA F test.
You should scroll through your SAS output to see how it made the re-
sponse plot and various residual plots, but cutting and pasting these plots is
tedious. So we will use Minitab to get these plots. Find the program for this
problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt). Then
copy and paste the numbers (between cards; and the semicolon ;) into
Minitab. Use the mouse commands Edit>Paste Cells. This should enter
the data in the Worksheet (bottom part of Minitab). Under C1 enter Y and
under C2 enter X1 under C3 enter X2. Use the menu commands
Stat>Regression>Regression to get a dialog window. Enter Y as the re-
sponse variable and X1 and X2 as the predictor variables. Click on Storage
then on Fits, Residuals, and OK OK.
c) To make a response plot, enter the menu commands Graph>Plot
and place Y in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
d) Based on the response plot, does a linear model seem reasonable?
e) To make a residual plot, enter the menu commands Graph>Plot and
place RESI 1 in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
f) Based on the residual plot does a linear model seem reasonable?
Problems using ARC
To quit Arc, move the cursor to the x in the upper right corner and click.
Warning: Some of the following problems use data from the
book's webpage (http://lagrange.math.siu.edu/Olive/lregbk.htm).
Save the data files on a flash drive G, say. Get in Arc and use the menu
commands File > Load and a window with a Look in box will appear. Click
on the black triangle and then on Removable Disk (G:). Then click twice on
the data set name.
In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the graph into the Word document.
a) Cut and paste the output (from Coefficient Estimates to Sigma hat)
into Word. Write down the least squares equation Ŷ = β̂1 + β̂2 x.
c) Make a residual plot of the fitted values versus the residuals. Use
the commands Graph&Fit > Plot of and put L1:Fit-values in H and
L1:Residuals in V. Put sex in the Mark by box. Move the OLS bar to 1.
Put the plot into Word. Does the plot look ellipsoidal with zero mean?
2.21. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. This data set is from Cook and Weisberg (1999a).
The response variable Y is the mussel muscle mass M, and the explanatory
variables are X2 = S = shell mass, X3 = H = shell height, X4 = L = shell
length, and X5 = W = shell width.
Enter the menu commands Graph&Fit>Fit linear LS and t the model:
enter S, H, L, W in the Terms/Predictors box, M in the Response box
and click on OK.
2.22. Get cyp.lsp as described above Problem 2.19. You can open the
file in Notepad and then save it on a flash drive G, say, using the Notepad
menu commands File>Save As and clicking the top checklist then click
Removable Disk (G:). You could also save the file on the desktop, load it
in Arc from the desktop, and then delete the file (sending it to the Recycle
Bin).
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cyp.lsp. This data set consists of various measurements
taken on men from Cyprus around 1920. Let the response Y = height and
X = cephalic index = 100(head breadth)/(head length). Use Arc to get the
least squares output and include the relevant output in Word.
i) The Arc menu L1 should have been created for the regression. Use
the menu commands L1>Prediction to open a dialog window. Enter 1400
650 in the box and click on OK. Include the resulting output in Word.
j) Let Xf,2 = 1400 and Xf,3 = 650 and use the output from i) to find a
95% CI for E(Yf). Use the last line of the output, that is, se = S(Ŷf).
a) Include the ANOVA tables for the full and reduced models in Word.
d) Both plots should cluster tightly about the identity line if the reduced
model is about as good as the full model. Is the reduced model good?
e) Perform the 4 step partial F test (of Ho: the reduced model is good)
using the 2 ANOVA tables from part a).
2.26. a) Activate the cbrain.lsp data set in ARC. Fit least squares with
age, sex, size1/3 , and headht as terms and brnweight as the response. See
Problem 2.20. Assume that the multiple linear regression model is appro-
priate. (This may be a reasonable assumption, 5 infants appear as outliers
but the data set has hardly any cases that are babies. If age was uniformly
represented, the babies might not be outliers anymore.) Assuming that ARC
makes the menu L1 for this regression, select AVP-All 2D. A window will
appear. Move the OLS slider bar to 1 and click on the zero line box. The
window will show the added variable plots for age, sex, size1/3 , and headht
as you move along the slider bar that is below case deletions. Include all 4
added variable plots in Word.
b) What information do the 4 plots give? For example, which variables do
not seem to be needed?
(If it is clear that the zero and OLS lines intersect at the origin, then the
variable is probably needed, and the point cloud should be tilted away from
the zero line. If it is difficult to see where the two lines intersect since they
nearly coincide near the origin, then the variable may not be needed, and the
point cloud may not tilt away from the zero line.)
R Problems
plot(zred$fit,buxy)
abline(0,1)
e) Use the following command to make the residual plot for the reduced
model. Include the plot in Word.
plot(zred$fit,zred$resid)
f) The plots look bad because of 5 massive outliers. The following com-
mands remove the outliers. Include the output in Word.
plot(zred$fit,zbux[,5])
abline(0,1)
i) Use the following command to make the residual plot for the reduced
model without the outliers. Include the plot in Word.
plot(zred$fit,zred$resid)
2.28. Get the R commands for this problem. The data is such that Y =
2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) −
2]. Hence the residual and response plots should show high skew. Note that
β = (2, 1, 1, 1)^T. The R code uses 3 nontrivial predictors and a constant, and
the sample size n = 1000.
a) Copy and paste the commands for part a) of this problem into R. Include
the response plot in Word. Is the lowess curve fairly close to the identity line?
b) Copy and paste the commands for part b) of this problem into R.
Include the residual plot in Word: press the Ctrl and c keys at the same time.
Then use the menu command Paste in Word. Is the lowess curve fairly
close to the r = 0 line?
c) The output out$coef gives β̂. Write down β̂. Is β̂ close to β?
c) Repeat b) using the command pisim(n=100, type = 3). Now the er-
rors are EXP(1) - 1.
e) The infants are in the lower left corner of the plot. Do the PIs seem to
be better for the infants or the bulk of the data? Explain briefly.
Building a multiple linear regression (MLR) model from data is one of the
most challenging regression problems. The final full model will have re-
sponse variable Y = t(Z), a constant x1, and predictor variables x2 =
t2(w2, . . . , wr), . . . , xp = tp(w2, . . . , wr) where the initial data consists of
Z, w2, . . . , wr. Choosing t, t2, . . . , tp so that the final full model is a useful
MLR approximation to the data can be difficult.
Model building is an iterative process. Given the problem and data but
no model, the model building process can often be aided by graphs that
help visualize the relationships between the different variables in the data.
Then a statistical model can be proposed. This model can be fit and inference
performed. Then diagnostics from the fit can be used to check the assumptions
of the model. If the assumptions are not met, then an alternative model can
be selected. The fit from the new model is obtained, and the cycle is repeated.
This chapter provides some tools for building a good full model.
Warning: Researchers often have a single data set and tend to expect
statistics to provide far more information from the single data set than is
reasonable. MLR is an extremely useful tool, but MLR is at its best when the
final full model is known before collecting and examining the data. However,
it is very common for researchers to build their final full model by using
the iterative process until the final model fits the data well. Researchers
should not expect that all or even many of their research questions can be
answered from such a full model. If the final MLR full model is built from
a single data set in order to fit that data set well, then typically inference
from that model will not be valid. The model may be useful for describing
the data, but may perform very poorly for prediction of a future response.
The model may suggest that some predictors are much more important than
others, but a model that is chosen prior to collecting and examining the data
is generally much more useful for prediction and inference. A single data
set is often not large enough both to build the final full model and to provide
valid inference for that model.
Often a final full model is built after collecting and examining the data.
This procedure is called data snooping, and such models cannot be ex-
pected to be reliable. If possible, spend about 1/8 of the budget to collect
data and build an initial MLR model. Spend another 1/8 of the budget to
collect more data to check the initial MLR model. If changes are necessary,
continue this process until no changes from the previous step are needed,
resulting in a tentative MLR model. Then spend between 1/2 and 3/4 of the
budget to collect data assuming that the tentative model will be useful.
Alternatively, if the data set is large enough, use a training set of a
random sample of k of the n cases to build a model where 10p ≤ n/2 ≤ k ≤
0.9n. Then use the validation set of the other n − k cases to confirm that the
model built with the training set is good. This technique may help reduce
biases, but needs n ≥ 20p.
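A minimal R sketch of such a split follows, assuming the data are in a data frame dat with response column Y (hypothetical names); the training size k can be any value satisfying the inequalities above.

```r
# Random training/validation split for model building and confirmation.
set.seed(1)
n <- nrow(dat); p <- ncol(dat)            # want roughly n >= 20p for this approach
k <- floor(0.75*n)                        # any k with 10p <= n/2 <= k <= 0.9n
train <- sample(1:n, k)
fit <- lm(Y ~ ., data = dat[train, ])     # build the model on the training set
val <- dat[-train, ]
pred <- predict(fit, newdata = val)       # check the model on the validation set
plot(pred, val$Y); abline(0, 1)           # validation response plot
```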
Rule of thumb 3.1. If the MLR model is built using the variable se-
lection methods from Section 3.4, then the final submodel can be used for
description. If the full model was found after collecting the data, the model
may not be useful for inference and prediction. If the full model was selected
before collecting the data, then the prediction region method of bootstrap-
ping the variable selection model, described in Section 3.4.1, may be useful.
Predictor transformations can be used for general regression problems, not just
for multiple linear regression. A power transformation has the form x = tλ(w) = w^λ
for λ ≠ 0 and x = t0(w) = log(w) for λ = 0. Often λ ∈ ΛL where
ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1} is a coarse ladder of powers.
There are several rules of thumb that are useful for visually selecting a
power transformation to remove nonlinearities from the predictors. Let a
plot of X1 versus X2 have X2 on the vertical axis and X1 on the horizontal
axis.
Rule of thumb 3.2. a) If strong nonlinearities are apparent in the scat-
terplot matrix of the predictors w2 , . . . , wp , it is often useful to remove the
nonlinearities by transforming the predictors using power transformations.
c) Suppose the plot of X1 versus X2 is nonlinear. The unit rule says that
if X1 and X2 have the same units, then try the same transformation for both
X1 and X2 .
Assume that all values of X1 and X2 are positive. Then the following six
rules are often used.
d) The log rule states that a positive predictor that has the ratio between
the largest and smallest values greater than ten should be transformed to logs.
So X > 0 and max(X)/ min(X) > 10 suggests using log(X).
e) The range rule states that a positive predictor that has the ratio be-
tween the largest and smallest values less than two should not be transformed.
So X > 0 and max(X)/ min(X) < 2 suggests keeping X.
f) The bulging rule states that changes to the power of X2 and the power
of X1 can be determined by the direction that the bulging side of the curve
points. If the curve is hollow up (the bulge points down), decrease the power
of X2 . If the curve is hollow down (the bulge points up), increase the power
of X2 . If the curve bulges towards large values of X1 increase the power of
X1 . If the curve bulges towards small values of X1 decrease the power of X1 .
See Tukey (1977, pp. 173–176).
i) The cube root rule says that if X is a volume measurement, then the cube
root transformation X^(1/3) may be useful.
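Rules d) and e) above are easy to check numerically; a small R sketch follows, assuming the positive predictors are the columns of a matrix or data frame w.

```r
# Ratios max(X)/min(X) for each positive predictor: > 10 suggests log(X)
# (log rule), < 2 suggests leaving X alone (range rule).
ratios <- apply(w, 2, function(v) max(v)/min(v))
ratios
wlog <- w
wlog[, ratios > 10] <- log(w[, ratios > 10])   # transform only the flagged columns
pairs(wlog)                                    # recheck the scatterplot matrix
```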
In the literature, it is sometimes stated that predictor transformations
that are made without looking at the response are free. The reasoning is
that the conditional distribution of Y |(x2 = a2, . . . , xp = ap) is the same
as the conditional distribution of Y |[t2(x2) = t2(a2), . . . , tp(xp) = tp(ap)]:
there is simply a change of labelling. Certainly if Y |x = 9 ∼ N(0, 1), then
Y |√x = 3 ∼ N(0, 1). To see that Rule of thumb 3.2a does not always work,
suppose that Y = β1 + β2 x2 + ··· + βp xp + e where the xi are iid lognormal(0,1)
random variables. Then wi = log(xi) ∼ N(0, 1) for i = 2, . . . , p and the
scatterplot matrix of the wi will be linear while the scatterplot matrix of the
xi will show strong nonlinearities if the sample size is large. However, there is
an MLR relationship between Y and the xi while the relationship between Y
and the wi is nonlinear: Y = β1 + β2 e^{w2} + ··· + βp e^{wp} + e ≠ β^T w + e. Given
Y and the wi with no information of the relationship, it would be difficult to
find the exponential transformation and to estimate the βi. The moral is that
predictor transformations, especially the log transformation, can and often
do greatly simplify the MLR analysis, but predictor transformations can turn
a simple MLR analysis into a very complex nonlinear analysis.
To spread small values of the variable, make λi smaller. To spread large values
of the variable, make λi larger.
For example, if both variables are right skewed, then there will be many
more cases in the lower left of the plot than in the upper right. Hence small
values of both variables need spreading. Figures 13.3 b) and 13.16 have this
shape.
[Figure 3.1: four scatterplots a)–d) of w (horizontal axis) versus x (vertical axis), illustrating which values of the variables need spreading.]
vertical variable need spreading. Hence in Figure 3.1d, small values of x need
spreading. Notice that the plotted points bulge down towards large values of
the horizontal variable.
Example 3.2: Mussel Data. Cook and Weisberg (1999a, pp. 351,
433, 447) gave a data set on 82 mussels sampled off the coast of New
Zealand. The response is muscle mass M in grams, and the predictors are
a constant, the length L and height H of the shell in mm, the shell width
W , and the shell mass S. Figure 3.2 shows the scatterplot matrix of the
predictors L, W , H, and S. Examine the variable length. Length is on the
vertical axis on the three top plots and the right of the scatterplot matrix
labels this axis from 150 to 300. Length is on the horizontal axis on the
three leftmost marginal plots, and this axis is labelled from 150 to 300 on the
bottom of the scatterplot matrix. The marginal plot in the bottom left corner
has length on the horizontal and shell on the vertical axis. The marginal
plot that is second from the top and second from the right has height on the
horizontal and width on the vertical axis. If the data is stored in x, the plot
can be made with the following command in R.
pairs(x,labels=c("length","width","height","shell"))
[Figure 3.2: scatterplot matrix of the mussel data predictors length, width, height, and shell.]
Since max(S)/min(S) ≈ 350/10 = 35 > 10, the log rule suggests that log S may be useful. If log S
replaces S in the scatterplot matrix, then there may be some nonlinearity
present in the plot of log S versus W with small values of W needing spread-
ing. Hence the ladder rule suggests reducing λ from 1, and we tried log(W).
Figure 3.3 shows that taking the log transformations of W and S results in
a scatterplot matrix that is much more linear than the scatterplot matrix of
Figure 3.2. Notice that the plot of W versus L and the plot of log(W ) versus
L both appear linear. This plot can be made with the following commands.
z <- x; z[,2] <- log(z[,2]); z[,4] <- log(z[,4])
pairs(z,labels=c("length","Log W","height","Log S"))
The plot of shell versus height in Figure 3.2 is nonlinear, and small values
of shell need spreading since if the plotted points were projected on the
horizontal axis, there would be too many points at values of shell near 0.
Similarly, large values of height need spreading.
[Figure 3.3: scatterplot matrix of the transformed mussel data predictors length, log W, height, and log S.]
Definition 3.3. Assume that all of the values of the response Zi are
positive. Then the modified power transformation family
tλ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ   (3.3)
for λ ≠ 0 and Zi^(0) = log(Zi). Generally λ ∈ Λ where Λ is some interval such
as [−1, 1] or a coarse subset such as ΛL. This family is a special case of the
response transformations considered by Tukey (1957).
Warning: The Rule of thumb 3.2 does not always work. For example, the
log rule may fail. If the relationships in the scatterplot matrix are already lin-
ear or if taking the transformation does not increase the linearity (especially
in the row containing the response), then no transformation may be better
than taking a transformation. For the Arc data set evaporat.lsp, the log
rule suggests transforming the response variable Evap, but no transformation
works better.
There are several reasons to use a coarse grid of powers. First, several of the
powers correspond to simple transformations such as the log, square root, and
cube root. These powers are easier to interpret than λ = 0.28, for example.
[Figure 3.4: transformation plots of TZHAT versus Z^(λ) for λ = 1, 1/2, 0, and −1.]
According to Mosteller and Tukey (1977, p. 91), the most commonly used
power transformations are the λ = 0 (log), λ = 1/2, λ = −1, and λ = 1/3
transformations in decreasing frequency of use. Secondly, if the estimator λ̂n
can only take values in ΛL, then sometimes λ̂n will converge (e.g., in prob-
ability) to λ* ∈ ΛL. Thirdly, Tukey (1957) showed that neighboring power
transformations are often very similar, so restricting the possible powers to
a coarse grid is reasonable. Note that powers can always be added to the
grid ΛL. Useful powers are ±1/4, 2/3, 2, and 3. Powers from numerical
methods can also be added.
If more than one value of λ ∈ ΛL gives a linear plot, take the simplest or
most reasonable transformation or the transformation that makes the most
sense to subject matter experts. Also check that the corresponding residual
plots of Ŵ versus W − Ŵ look reasonable. The values of λ in decreasing order
of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be
chosen over the cube root transformation if both transformation plots look
equally good.
The essential point of the next example is that observations that influence
the choice of the usual Box–Cox numerical power transformation are often
easily identified in the transformation plots. The transformation plots are
especially useful if the bivariate relationships of the predictors, as seen in the
scatterplot matrix of the predictors, are linear.
Example 3.4: Mussel Data. Consider the mussel data of Example 3.2
where the response is muscle mass M in grams, and the predictors are the
length L and height H of the shell in mm, the logarithm log W of the shell
width W, the logarithm log S of the shell mass S, and a constant. With this
starting point, we might expect a log transformation of M to be needed
Fig. 3.5 Transformation Plots for the Mussel Data: a) λ = 1, b) λ = 0, c) λ = 0.28, d) λ = −1 (Z^(λ) versus TZHAT); cases 8 and 48 stand out.
because M and S are both mass measurements and log S is being used as
a predictor. Using log M would essentially reduce all measurements to the
scale of length. The Box–Cox likelihood method gave λ̂₀ = 0.28 with ap-
proximate 95 percent confidence interval 0.15 to 0.4. The log transformation
is excluded under this inference, leading to the possibility of using different
transformations of the two mass measurements.
Shown in Figure 3.5 are transformation plots for four values of λ. A striking
feature of these plots is the two points that stand out in three of the four
plots (cases 8 and 48). The Box–Cox estimate λ̂ = 0.28 is evidently influenced
by the two outlying points and, judging deviations from the identity line in
Figure 3.5c, the mean function for the remaining points is curved. In other
words, the Box–Cox estimate is allowing some visually evident curvature
in the bulk of the data so it can accommodate the two outlying points.
Recomputing the estimate of λ₀ without the highlighted points gives λ̂₀ =
0.02, which is in good agreement with the log transformation anticipated
at the outset. Reconstruction of the transformation plots indicated that now
the information for the transformation is consistent throughout the data on
the horizontal axis of the plot.
Note that in addition to helping visualize λ̂ against the data, the transfor-
mation plots can also be used to show the curvature and heteroscedasticity in
the competing models indexed by λ ∈ Λ_L. Example 3.4 shows that the plot
can also be used as a diagnostic to assess the success of numerical methods
such as the Box–Cox procedure for estimating λ₀.
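In R, the Box–Cox estimate can be computed with the boxcox function from the MASS package; a minimal sketch, assuming the mussel data are stored in a data frame called mussels with columns M, L, H, W, and S (hypothetical names):

library(MASS)
out <- boxcox(M ~ L + H + log(W) + log(S), data = mussels,
              lambda = seq(-1, 1, 0.01))
out$x[which.max(out$y)]    # lambda maximizing the profile likelihood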
Example 3.5: Mussel Data Again. Return to the mussel data, this
time considering the regression of M on a constant and the four untrans-
formed predictors L, H, W , and S. Figure 3.2 shows the scatterplot matrix
of the predictors L, H, W , and S. Again nonlinearity is present. Figure 3.3
shows that taking the log transformations of W and S results in a linear
scatterplot matrix for the new set of predictors L, H, log W , and log S. Then
the search for the response transformation can be done as in Example 3.4.
Definition 3.5. Suppose that the explanatory variables have the form
x_2, ..., x_k, x_{jj} = x_j^2, x_{ij} = x_i x_j, x_{234} = x_2 x_3 x_4, et cetera. Then the variables
x_2, ..., x_k are main effects. A product of two or more different main effects
is an interaction. A variable such as x_2^2 or x_3^7 is a power. An x_2 x_3 interaction
will sometimes also be denoted as x_2:x_3 or x_2 ∗ x_3.
Rule of thumb 3.3. Suppose that the MLR model contains at least one
power or interaction. Then the corresponding main effects that make up the
powers and interactions should also be in the MLR model.
Rule of thumb 3.3 suggests that if x_3^2 and x_2 x_7 x_9 are in the MLR model,
then x_2, x_3, x_7, and x_9 should also be in the MLR model. A quick way to check
whether a term like x_3^2 is needed in the model is to fit the main effects model
and then make a scatterplot matrix of the predictors and the residuals, where
the residuals r are on the top row. Then the top row shows plots of x_k versus
r, and if a plot is parabolic, then x_k^2 should be added to the model. Potential
predictors w_j could also be added to the scatterplot matrix. If the plot of
w_j versus r shows a positive or negative linear trend, add w_j to the model.
If the plot is quadratic, add w_j and w_j^2 to the model. This technique is for
quantitative variables x_k and w_j.
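A minimal sketch of this check, assuming a hypothetical data frame dat with response y and quantitative predictors x2 and x3:

fit <- lm(y ~ x2 + x3, data = dat)     # main effects model
pairs(cbind(resid = residuals(fit), dat[, c("x2", "x3")]))
# a parabolic plot of a predictor versus resid suggests adding its square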
Example 3.6. Two varieties of cement that replace sand with coal waste
products were compared to a standard cement mix. The response Y was the
compressive strength of the cement measured after 7, 28, 60, 90, or 180 days
of curing time = x_2. This cement was intended for sidewalks and barriers but
not for construction. The data is likely from small batches of cement prepared
in the lab, and is likely correlated; however, MLR can be used for exploratory
and descriptive purposes. Actually using the different cement mixtures in the
field (e.g., as sidewalks) would be very expensive. The factor mixture had 3
levels: 2 for the standard cement, and 0 and 1 for the coal based cements.
A plot of x_2 versus Y (not shown but see Problem 3.15) resembled the left
half of a quadratic Y = c(x_2 − 180)^2. Hence x_2 and x_2^2 were added to the
model.
Figure 3.6 shows the response plot and residual plots from this model.
The standard cement mix uses the symbol + while the coal based mixes
use an inverted triangle and square. OLS lines based on each mix are added
as visual aids. The lines from the two coal based mixes do not intersect,
suggesting that there may not be an interaction between these two mixes.
There is an interaction between the standard mix and the two coal mixes
since these lines do intersect. All three types of cement become stronger with
time, but the standard mix has the greater strength at early curing times
while the coal based cements become stronger than the standard mix at the
later times. Notice that the interaction is more apparent in the residual plot.
Problem 3.15 adds a factor F = x_3 based on mix as well as the x_2 ∗ x_3 and
x_2^2 ∗ x_3 interactions. The resulting model is an improvement, but there is
still some curvature in the residual plot, and one case is not fit very well.
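A sketch of the kind of model fit in this example, assuming a hypothetical data frame cement with columns strength, time, and mix:

cement$mix <- factor(cement$mix)
fit <- lm(strength ~ (time + I(time^2)) * mix, data = cement)
plot(fitted(fit), cement$strength); abline(0, 1)    # response plot
plot(fitted(fit), residuals(fit)); abline(h = 0)    # residual plot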
Variable selection, also called subset or model selection, is the search for a
subset of predictor variables that can be deleted without important loss of
information. A model for variable selection in multiple linear regression can
be described by

$$Y = x^T\beta + e = x_S^T\beta_S + x_E^T\beta_E + e = x_S^T\beta_S + e \qquad (3.4)$$

where x_S contains the necessary predictors, x_E contains the extraneous predictors, and β_E = 0. If S ⊆ I, then Y = x_I^T β_I + e = x_S^T β_S + x_{I/S}^T β_{I/S} + e,
where x_{I/S} denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β_{I/S} = 0 and the sample correlation
corr(x_i^T β, x_{I,i}^T β_I) = 1.0 for the population model if S ⊆ I.
All too often, variable selection is performed and then the researcher tries
to use the final submodel for inference as if the submodel was selected before
gathering data. At the other extreme, it could be suggested that variable se-
lection should not be done because classical inferences after variable selection
are not valid. Neither of these two extremes is useful.
Ideally the model is known before collecting the data. After the data is
collected, the MLR assumptions are checked and then the model is used
for inference. Alternatively, a preliminary study can be used to collect data.
Then the predictors and response can be transformed until a full model is
built that seems to be a useful MLR approximation of the data. Then variable
selection can be performed, suggesting a final model. Then this final model is
the known model used before collecting data for the main part of the study.
See the two paragraphs above the paragraph above Rule of thumb 3.1. If
the full model is known, inference with the bootstrap prediction region
method and prediction intervals of Section 3.4.1 may be useful.
In practice, the researcher often has one data set, builds the full model,
and performs variable selection to obtain a final submodel. In other words, an
extreme amount of data snooping was used to build the final model. A major
problem with the final MLR model (chosen after variable selection or data
snooping) is that it is not valid for inference in that the p-values for the OLS
t-tests and ANOVA F test are likely to be too small, while the p-value for the
partial F test that uses the final model as the reduced model is likely to be
too high. Similarly, the actual coverage of the nominal 100(1 − δ)% prediction
intervals tends to be too small and unknown (e.g., the nominal 95% PIs may
only contain 83% of the future responses Y_f). Thus the model is likely to fit
the data set from which it was built much better than future observations.
Call the data set from which the MLR model was built the training data,
consisting of cases (Y_i, x_i) for i = 1, ..., n. Then the future predictions tend
to be poor in that |Ŷ_f − Y_f| tends to be larger on average than |Ŷ_i − Y_i|.
To summarize, a final MLR model selected after variable selection can be
useful for description and exploratory analysis: the tests and intervals can
be used for exploratory purposes, but the final model is usually not valid for
inference.
Generally the research paper should state that the model was built with
one data set, and is useful for description and exploratory purposes, but
should not be used for inference. The research paper should only suggest
that the model is useful for inference if the model has been shown to be
useful on data collected after the model was built. For example, if
the researcher can collect new data and show that the model produces valid
inferences (e.g., 97 out of 100 95% prediction intervals contained the future
response Yf ), then the researcher can perhaps claim to have found a model
that is useful for inference.
Other problems exist even if the full MLR model Y = x^T β + e is good.
Let I ⊆ {1, ..., p} and let x_I be the final vector of predictors. If x_I is missing
important predictors contained in the full model, sometimes called underfit-
ting, then the final model Y = x_I^T β_I + e may be a very poor approximation
to the data, in particular the full model may be linear while the final model
may be nonlinear. Similarly the full model may satisfy V(e_i) = σ² while the
constant variance assumption is violated by the submodel: V(e_i) = σ_i². These
two problems are less severe if the joint distribution of (Y, x^T)^T is multivari-
ate normal, since then Y = x_I^T β_I + e satisfies the constant variance MLR
model regardless of the subset I used. See Problem 10.10.
In spite of these problems, if the researcher has a single data set with
many predictors, then usually variable selection must be done. Let p − 1 be
the number of nontrivial predictors and assume that the model also contains
a constant. Also assume that n ≥ 10p. If the MLR model found after variable
selection has good response and residual plots, then the model may be very
useful for descriptive and exploratory purposes.
Simpler models are easier to explain and use than more complicated mod-
els, and there are several other important reasons to perform variable selec-
tion. First, an MLR model with unnecessary predictors has a mean square
error for prediction that is too large. Let xS contain the necessary predictors,
let x be the full model, and let x_I be a submodel. If (3.4) holds and S ⊆ I,
then E(Y | x_I) = x_I^T β_I = x_S^T β_S = x^T β. Hence OLS applied to Y and x_I
yields an unbiased estimator β̂_I of β_I. If (3.4) holds, S ⊆ I, β_S is a k × 1
vector, and β_I is a j × 1 vector with j > k, then

$$\frac{1}{n}\sum_{i=1}^{n} V(\hat{Y}_{I,i}) = \frac{\sigma^2 j}{n} > \frac{\sigma^2 k}{n} = \frac{1}{n}\sum_{i=1}^{n} V(\hat{Y}_{S,i}). \qquad (3.7)$$

In particular, for the full model,

$$\frac{1}{n}\sum_{i=1}^{n} V(\hat{Y}_i) = \frac{1}{n}\,\mathrm{tr}(\sigma^2 H) = \frac{\sigma^2}{n}\,\mathrm{tr}\big((X^T X)^{-1} X^T X\big) = \frac{\sigma^2\, p}{n}.$$
model or exclude all of the indicator variables from the model. If the model
contains powers or interactions, also include all main effects in the model
(see Section 3.3).

Next we suggest methods for finding a good submodel. We make the sim-
plifying assumptions that the full model is good, that all predictors have the
same cost, that each submodel contains a constant, and that there is no the-
ory requiring that a particular predictor must be in the model. Also assume
that n ≥ 10p, and that the response and residual plots of the full model
are good. Rule of thumb 3.5 should be used for the full model and for all
submodels.
The basic idea is to obtain fitted values from the full model and the can-
didate submodel. If the candidate model is good, then the plotted points in
a plot of the submodel fitted values versus the full model fitted values should
follow the identity line. In addition, a similar plot should be made using the
residuals.
A problem with this idea is how to select the candidate submodel from
the nearly 2^p potential submodels. One possibility would be to try to order
the predictors in importance, say x1 , . . . , xp . Then let the kth model contain
the predictors x1 , x2 , . . . , xk for k = 1, . . . , p. If the predicted values from the
submodel are highly correlated with the predicted values from the full model,
then the submodel is good. All subsets selection, forward selection, and
backward elimination can be used (see Section 1.3), but criteria to separate
good submodels from bad are needed.
Two important summaries for submodel I are R²(I), the proportion of
the variability of Y explained by the nontrivial predictors in the model, and
MSE(I) = σ̂²_I, the estimated error variance. See Definitions 2.15 and 2.16.
Suppose that model I contains k predictors, including a constant. Since
adding predictors does not decrease R², the adjusted R²_A(I) is often used,
where

$$R^2_A(I) = 1 - (1 - R^2(I))\,\frac{n}{n-k} = 1 - MSE(I)\,\frac{n}{SST}.$$

See Seber and Lee (2003, pp. 400–401). Hence the model with the maximum
R²_A(I) is also the model with the minimum MSE(I).
The partial F statistic for testing the submodel I against the full model is

$$F_I = \frac{[SSE(I) - SSE]/(p - k)}{SSE/(n - p)} = \frac{SSE(I) - SSE}{(p - k)\,MSE},$$

where SSE is the error sum of squares from the full model, and SSE(I) is the
error sum of squares from the candidate submodel. An extremely important
criterion for variable selection is the Cp criterion.
Definition 3.8.

$$C_p(I) = \frac{SSE(I)}{MSE} + 2k - n = (p - k)(F_I - 1) + k$$

where MSE is the error mean square for the full model.
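The Cp(I) statistic is easily computed from the full model and submodel fits; a sketch assuming a hypothetical data frame dat with response y and a candidate submodel using x1 and x2:

full <- lm(y ~ ., data = dat)
sub  <- lm(y ~ x1 + x2, data = dat)    # candidate submodel I
MSE  <- summary(full)$sigma^2
k <- length(coef(sub)); n <- nrow(dat)
CpI <- sum(residuals(sub)^2)/MSE + 2*k - n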
From Section 1.3, recall that all subsets selection, forward selection, and
backward elimination produce one or more submodels of interest for k =
2, ..., p where the submodel contains k predictors including a constant. The
following proposition helps explain why Cp is a useful criterion and suggests
that for subsets I with k terms, submodels with Cp(I) ≤ min(2k, p) are
especially interesting. Olive and Hawkins (2005) show that this interpretation
of Cp can be generalized to 1D regression models with a linear predictor β^T x,
such as generalized linear models. Denote the residuals and fitted values from
the full model by r_i = Y_i − x_i^T β̂ = Y_i − Ŷ_i and Ŷ_i = x_i^T β̂, respectively.
Similarly, let β̂_I be the estimate of β_I obtained from the regression of Y on x_I
and denote the corresponding residuals and fitted values by r_{I,i} = Y_i − x_{I,i}^T β̂_I
and Ŷ_{I,i} = x_{I,i}^T β̂_I where i = 1, ..., n.
Proposition 3.1. Suppose that a numerical variable selection method
suggests several submodels with k predictors, including a constant, where
2 ≤ k ≤ p.
a) The model I that minimizes Cp(I) maximizes corr(r, r_I).
b) Cp(I) ≤ 2k implies that corr(r, r_I) ≥ √(1 − p/n).
c) As corr(r, r_I) → 1, corr(x^T β̂, x_I^T β̂_I) = corr(ESP, ESP(I)) = corr(Ŷ, Ŷ_I) → 1.

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor x_i should
not be deleted if |t_i| > √2 ≈ 1.414. If |t_i| < √2, then the predictor can
probably be deleted since Cp decreases. The literature suggests using the
Cp(I) ≤ k screen, but this screen eliminates too many potentially useful
submodels.
Six graphs will be used to compare the full model and the candidate sub-
model. Let β̂ be the estimate of β obtained from the regression of Y on all
of the terms x. Many numerical methods such as forward selection, back-
ward elimination, stepwise, and all subsets methods using the Cp(I) criterion
(Jones 1946; Mallows 1973) have been suggested for variable selection. We
will use the FF plot, RR plot, the response plots from the full and submodel,
and the residual plots (of the fitted values versus the residuals) from the full
and submodel. These six plots will contain a great deal of information about
the candidate subset provided that Equation (3.4) holds and that a good
estimator (such as OLS) for β̂ and β̂_I is used.
For these plots to be useful, it is crucial to verify that a multiple linear
regression (MLR) model is appropriate for the full model. Both the re-
sponse plot and the residual plot for the full model need to be
used to check this assumption. The plotted points in the response plot
should cluster about the identity line (that passes through the origin with
unit slope) while the plotted points in the residual plot should cluster about
the horizontal axis (the line r = 0). Any nonlinear patterns or outliers in
either plot suggest that an MLR relationship does not hold. Similarly, be-
fore accepting the candidate model, use the response plot and the residual
plot from the candidate model to verify that an MLR relationship holds for
the response Y and the predictors xI . If the submodel is good, then the
residual and response plots of the submodel should be nearly identical to the
corresponding plots of the full model. Assume that all submodels contain a
constant.
With the OLS line and the identity line added to the RR plot as visual aids,
the subset I is good if the plotted points cluster tightly about the identity
line in both plots. In particular, the OLS line and the identity line should
nearly coincide so that it is difficult to tell that the two lines intersect at the
origin in the RR plot.
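A sketch of the FF and RR plots with the identity and OLS lines added, assuming a hypothetical data frame dat with response y and a candidate submodel using x1 and x3:

full <- lm(y ~ ., data = dat)
sub  <- lm(y ~ x1 + x3, data = dat)     # candidate submodel
par(mfrow = c(1, 2))
plot(fitted(sub), fitted(full)); abline(0, 1)          # FF plot
plot(residuals(sub), residuals(full)); abline(0, 1)    # RR plot
abline(lm(residuals(full) ~ residuals(sub)), lty = 2)  # OLS line in the RR plot
par(mfrow = c(1, 1))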
To verify that the six plots are useful for assessing variable selection,
the following notation will be useful. Suppose that all submodels include
a constant and that X is the full rank n × p design matrix for the full
model. Let the corresponding vectors of OLS fitted values and residuals
be Ŷ = X(X^T X)^{-1} X^T Y = HY and r = (I − H)Y, respectively.
Suppose that X_I is the n × k design matrix for the candidate submodel
and that the corresponding vectors of OLS fitted values and residuals are
Ŷ_I = X_I(X_I^T X_I)^{-1} X_I^T Y = H_I Y and r_I = (I − H_I)Y, respectively.
A plot can be very useful if the OLS line can be compared to a reference
line and if the OLS slope is related to some quantity of interest. Suppose that
a plot of w versus z places w on the horizontal axis and z on the vertical axis.
Then denote the OLS line by z = a + bw. The following proposition shows
that the plotted points in the FF, RR, and response plots will cluster about
the identity line. Notice that the proposition is a property of OLS and holds
even if the data does not follow an MLR model. Let corr(x, y) denote the
correlation between x and y.
Also recall that the OLS line passes through the means of the two variables
(w, z).
(*) Notice that the OLS slope from regressing z on w is equal to one if
and only if the OLS slope from regressing w on z is equal to [corr(z, w)]2 .
i) The slope b = 1 if Σ Ŷ_{I,i} Ŷ_i = Σ Ŷ²_{I,i}. This equality holds since Ŷ_I^T Ŷ =
Y^T H_I Y = Y^T H_I H_I Y = Ŷ_I^T Ŷ_I. Since b = 1, a = Ȳ − Ȳ = 0.
v) The OLS line passes through the origin. Hence a = 0. The slope b =
r^T r_I / r^T r. Since r^T r_I = Y^T (I − H)(I − H_I) Y and (I − H)(I − H_I) =
I − H, the numerator r^T r_I = r^T r and b = 1.
vi) Again a = 0 since the OLS line passes through the origin. From v),

$$1 = \sqrt{\frac{SSE(I)}{SSE}}\,[\mathrm{corr}(r, r_I)].$$
Hence

$$\mathrm{corr}(r, r_I) = \sqrt{\frac{SSE}{SSE(I)}}$$

and the slope

$$b = \sqrt{\frac{SSE}{SSE(I)}}\,[\mathrm{corr}(r, r_I)] = [\mathrm{corr}(r, r_I)]^2.$$
Remark 3.2. Daniel and Wood (1980, p. 85) suggest using Mallows'
graphical method for screening subsets by plotting k versus Cp(I) for models
close to or under the Cp = k line. Proposition 3.2 vi) implies that if Cp(I) ≤ k
or F_I < 1, then corr(r, r_I) and corr(ESP, ESP(I)) both go to 1.0 as n → ∞.
Hence models I that satisfy the Cp(I) ≤ k screen will contain the true model
S with high probability when n is large. This result does not guarantee that
the true model S will satisfy the screen, but overfit is likely. Let d be a lower
bound on corr(r, r_I). Proposition 3.2 vi) implies that if

$$C_p(I) \le 2k + n\left(\frac{1}{d^2} - 1\right) - \frac{p}{d^2},$$

then corr(r, r_I) ≥ d.
Rule of thumb 3.7. Assume that the full model has good response and
residual plots and that n ≥ 10p. Let subset I have k predictors, including a
constant. Know how to find good models from output. The following rules of
thumb (roughly in order of decreasing importance) may be useful. It is often
not possible to have all 10 rules of thumb hold simultaneously. Let I_min be
the minimum Cp model and let I_I be the model with the fewest predictors
satisfying Cp(I_I) ≤ Cp(I_min) + 1. Do not use more predictors than model I_I
to avoid overfitting. Then the submodel I is good if
i) the response and residual plots for the submodel look like the response
and residual plots for the full model,
ii) corr(ESP, ESP(I)) = corr(Ŷ, Ŷ_I) ≥ 0.95.
iii) The plotted points in the FF plot (= EE plot for MLR) cluster tightly
about the identity line.
iv) Want the p-value ≥ 0.01 for the partial F test that uses I as the reduced
model.
v) The plotted points in the RR plot cluster tightly about the identity line.
vi) Want R²(I) > 0.9R² and R²(I) > R² − 0.07 (recall that R²(I) ≤ R² =
R²(full) since adding predictors to I does not decrease R²(I)).
vii) Want Cp(I_min) ≤ Cp(I) ≤ min(2k, p) with no big jumps in Cp (the
increase should be less than four) as variables are deleted.
viii) Want hardly any predictors with p-values > 0.05.
ix) Want few predictors with p-values between 0.01 and 0.05.
x) Want MSE(I) to be smaller than or not much larger than the MSE from
the full model.
(If n ≥ 5p, use the above rules, but we want n ≥ 10k.)
estimated sufficient predictors that are highly correlated with the full model
ESP (the correlation should be at least 0.9 and preferably greater than 0.95).
Similarly, make a scatterplot matrix of the residuals for M1, M2, M3, M4,
and M5.
To summarize, the final submodel should have few predictors, few variables
with large OLS t test p-values (0.01 to 0.05 is borderline), good response and
residual plots, and an FF plot (= EE plot) that clusters tightly about the
identity line. If a factor has c − 1 indicator variables, either keep all c − 1
indicator variables or delete all c − 1 indicator variables; do not delete some
of the indicator variables.
Example 3.7. The pollution data of McDonald and Schwing (1973) can
be obtained from STATLIB or the text's website. The response Y = mort
is the mortality rate, and most of the independent variables were related
to pollution. A scatterplot matrix of the first 9 predictors and Y was made
and then a scatterplot matrix of the remaining predictors with Y. The log
rule suggested making the log transformation with 4 of the variables. The
summary output is shown below. The response and residual plots were good.
Notice that p = 16 and n = 60 < 5p. Also many p-values are too high.
Response = MORT
Label Estimate Std. Error t-value p-value
Constant 1881.11 442.628 4.250 0.0001
DENS 0.00296 0.00397 0.747 0.4588
EDUC -19.6669 10.7005 -1.838 0.0728
log[HC] -31.0112 15.5615 -1.993 0.0525
HOUS -0.40107 1.64372 -0.244 0.8084
HUMID -0.44540 1.06762 -0.417 0.6786
JANT -3.58522 1.05355 -3.403 0.0014
JULT -3.84292 2.12079 -1.812 0.0768
log[NONW] 27.2397 10.1340 2.688 0.0101
log[NOX] 57.3041 15.4764 3.703 0.0006
OVR65 -15.9444 8.08160 -1.973 0.0548
POOR 3.41434 2.74753 1.243 0.2206
POPN -131.823 69.1908 -1.905 0.0633
PREC 3.67138 0.77814 4.718 0.0000
log[SO] -10.2973 7.38198 -1.395 0.1700
WWDRK 0.88254 1.50954 0.585 0.5618
Taking the minimum Cp model from backward elimination gives the out-
put shown below. The response and residual plots were OK although the
correlation in the RR and FF plots was not real high. The R² in the sub-
model decreased from about 0.79 to 0.74 while σ̂ = √MSE was 33.22 for
the full model and 33.31 for the submodel. Removing nonlinearities from the
predictors by using two scatterplot matrices and the log rule, and then using
backward elimination and forward selection, seems to be very effective for
finding the important predictors for this data set. See Problem 13.17 in order
to reproduce this example with the essential plots.
Response = MORT
Label Estimate Std. Error t-value p-value
Constant 943.934 82.2254 11.480 0.0000
EDUC -15.7263 6.17683 -2.546 0.0138
JANT -1.86899 0.48357 -3.865 0.0003
log[NONW] 33.5514 5.93658 5.652 0.0000
log[NOX] 21.7931 4.29248 5.077 0.0000
PREC 2.92801 0.59011 4.962 0.0000
death was acute, 3 if the cause of death was chronic, and coded as 2 oth-
erwise. A variable ageclass was coded as 0 if the age was under 20, 1 if the
age was between 20 and 45, and as 3 if the age was over 45. Head size, the
product of the head length, head breadth, and head height, is a volume mea-
surement, hence (size)^{1/3} was also used as a predictor with the same physical
dimensions as the other lengths. Thus there are 11 nontrivial predictors and
one response, and all models will also contain a constant. Nine cases were
deleted because of missing values, leaving 267 cases.
Figure 3.7 shows the response plots and residual plots for the full model
and the final submodel that used a constant, (size)^{1/3}, age, and sex. The five
cases separated from the bulk of the data in each of the four plots correspond
to five infants. These may be outliers, but the visual separation reflects the
small number of infants and toddlers in the data. A purely numerical variable
selection procedure would miss this interesting feature of the data. We will
first perform variable selection with the entire data set, and then examine the
effect of deleting the five cases. Using forward selection and the Cp statistic
on the Gladstone data suggests the subset I_5 containing a constant, (size)^{1/3},
age, sex, breadth, and cause with Cp(I_5) = 3.199. The p-values for breadth
and cause were 0.03 and 0.04, respectively. The subset I_4 that deletes cause
has Cp(I_4) = 5.374 and the p-value for breadth was 0.05. Figure 3.8d shows
the RR plot for the subset I_4. Note that the correlation of the plotted points
is very high and that the OLS and identity lines nearly coincide.
Fig. 3.7 Gladstone data: comparison of the full model and the submodel.
Let Ŷ_{I,i} denote the fitted values from the submodel. Then the RR plot of r_{I,i} versus r_i, and the FF plot of Ŷ_{I,i} versus Ŷ_i, were
constructed.
Fig. 3.8 Gladstone data: submodels added (size)^{1/3}, sex, age, and finally breadth.
For this model, the correlation in the FF plot (Figure 3.8b) was very
high, but in the RR plot the OLS line did not coincide with the identity
line (Figure 3.8a). Next sex was added to I, but again the OLS and identity
lines did not coincide in the RR plot (Figure 3.8c). Hence age was added
to I. Figure 3.9a shows the RR plot with the OLS and identity lines added.
These two lines now nearly coincide, suggesting that a constant plus (size)^{1/3},
sex, and age contains the relevant predictor information. This subset has
Cp(I) = 7.372, R²(I) = 0.80, and σ̂_I = 74.05. The full model which used
11 predictors and a constant has R² = 0.81 and σ̂ = 73.58. Since the Cp
criterion suggests adding breadth and cause, the Cp criterion may be leading
to an overfit.
Figure 3.9b shows the FF plot. The five cases in the southwest corner cor-
respond to five infants. Deleting them leads to almost the same conclusions,
although the full model now has R² = 0.66 and σ̂ = 73.48 while the submodel
has R²(I) = 0.64 and σ̂_I = 73.89.
Fig. 3.9 Gladstone data with predictors (size)^{1/3}, sex, and age: a) RR plot, b) FF plot.
Example 3.9. Cook and Weisberg (1999a, pp. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The data set is included as the file rat.lsp in the Arc software
The point of this example is that a subset of outlying cases can cause
numeric second-moment criteria such as Cp to find structure that does not
exist. The FF and RR plots can sometimes detect these outlying cases, allow-
ing the experimenter to run the analysis without the influential cases. The
example also illustrates that global numeric criteria can suggest a model with
one or more nontrivial terms when in fact the response is independent of the
predictors.
Numerical variable selection methods for MLR are very sensitive to in-
fluential cases such as outliers. Olive and Hawkins (2005) show that a plot
of the residuals versus Cook's distances (see Section 3.5) can be used to de-
tect influential cases. Such cases can also often be detected from response,
residual, RR, and FF plots.
Warning: deleting influential cases and outliers will often lead to
better plots and summary statistics, but the cleaned data may no
longer represent the actual population. In particular, the resulting
model may be very poor for both prediction and description.
Multiple linear regression data sets with cases that influence numerical
variable selection methods are common. Table 3.1 shows results for seven
interesting data sets. The first two rows correspond to the Ashworth (1842)
data, the next 2 rows correspond to the Gladstone data in Example 3.8, and
the next 2 rows correspond to the Gladstone data with the 5 infants deleted.
Rows 7 and 8 are for the Buxton (1920) data, while rows 9 and 10 are for
the Tremearne (1911) data. These data sets are available from the book's
website. Results from the final two data sets are given in the last 4 rows. The
last 2 rows correspond to the rat data described in Example 3.9. Rows 11
and 12 correspond to the ais data that comes with Arc (Cook and Weisberg
1999a).
The full model used p predictors, including a constant. The final submodel
I also included a constant, and the nontrivial predictors are listed in the
second column of Table 3.1. For a candidate submodel I, let Cp(I, c) denote
the value of the Cp statistic for the clean data that omits influential cases and
outliers. The third column lists p, Cp(I), and Cp(I, c) while the first column
gives the set of influential cases. Two rows are presented for each data set. The
second row gives the response variable and any predictor transformations.
For example, for the Gladstone data p = 10 since there were 9 nontrivial
predictors plus a constant. Only the predictor size was transformed, and the
final submodel is the one given in Example 3.8. For the rat data, the final
submodel is the one given in Example 3.9: none of the 3 nontrivial predictors
was used.
Table 3.1 and simulations suggest that if the subset I has k predictors,
then using the Cp(I) ≤ min(2k, p) screen is better than using the conven-
tional Cp(I) ≤ k screen. The major and ais data sets show that deleting the
influential cases may increase the Cp statistic. Thus interesting models from
the entire data set and from the clean data set should be examined.
Example 3.10. Conjugated linoleic acid (CLA) occurs in beef and dairy
products and appears to have many human health benefits. Joanne Numrich
provided four data sets where the response was the amount of CLA (or re-
lated compounds), and the explanatory variables were feed components from
the cattle diet. The data was to be used for descriptive and exploratory pur-
poses. Several data sets had outliers with unusually high levels of CLA. These
outliers were due to one researcher and may be the most promising cases in
the data set. However, to describe the bulk of the data with OLS MLR,
the outliers were omitted. In one of the data sets there are 33 cases and 25
The bootstrap will be described and then applied to variable selection. Sup-
pose there is data w_1, ..., w_n collected from a distribution with cdf F into
an n × p matrix W. The empirical distribution, with cdf F_n, gives each ob-
served data case w_i probability 1/n. Let the statistic T_n = t(W) = t(F_n)
be computed from the data. Suppose the statistic estimates θ = t(F). Let
t(W*) = t(F_n*) = T_n* indicate that t was computed from an iid sample from
the empirical distribution F_n: a sample of size n was drawn with replacement
from the observed sample w_1, ..., w_n.
Some notation is needed to give the Olive (2013a) prediction region used
to bootstrap a hypothesis test. Suppose w_1, ..., w_n are iid p × 1 random
vectors with mean μ and nonsingular covariance matrix Σ_w. Let a future
test observation w_f be independent of the w_i but from the same distribution.
Let (w̄, S) be the sample mean and sample covariance matrix where

$$\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i \quad \text{and} \quad S = S_w = \frac{1}{n-1}\sum_{i=1}^{n}(w_i - \bar{w})(w_i - \bar{w})^T. \qquad (3.8)$$

Let the squared Mahalanobis distance of w be D_w^2 = D_w^2(w̄, S) = (w − w̄)^T S^{-1}(w − w̄),
and let D_i^2 = D_{w_i}^2 for each observation w_i. Let D_{(c)} be the cth order statistic of
D_1, ..., D_n. Consider the hyperellipsoid

$$A_n = \{w : D_w^2(\bar{w}, S) \le D_{(c)}^2\} = \{w : D_w(\bar{w}, S) \le D_{(c)}\}. \qquad (3.10)$$
Hence if there were an iid sample T_{1,n}, ..., T_{B,n} of the statistic, the Olive
(2013a) large sample 100(1 − δ)% prediction region {w : D_w^2(T̄, S_T) ≤ D_{(c)}^2}
for T_{f,n} contains E(T_n) = θ with asymptotic coverage 1 − δ. To make the
asymptotic coverage equal to 1 − δ, use the large sample 100(1 − δ)% confidence
region {w : D_w^2(T_{1,n}, S_T) ≤ D_{(c)}^2}. The prediction region method bootstraps
this procedure by using a bootstrap sample of the statistic T*_{1,n}, ..., T*_{B,n}.
Centering the region at T*_{1,n} instead of T̄* is not needed since the bootstrap
sample is centered near T_n: the distribution of √n(T_n − θ) is approximated
by the distribution of √n(T* − T_n) or by the distribution of √n(T* − T̄*).
If H0 is true and E[Aβ̂] = c, then θ = 0. Let D_0^2 = T̄*^T [S_{T*}]^{-1} T̄*, and fail to re-
ject H0 if D_0 ≤ D_{(U_B)} and reject H0 if D_0 > D_{(U_B)}. This percentile method
is equivalent to computing the prediction region (3.10) on the w_i = T*_i and
checking whether 0 is in the prediction region.
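A simplified sketch of the cutoff computation, using c = ⌈B(1 − δ)⌉ and ignoring any small-sample corrections used by the text's predreg function; Tstar is assumed to be a B × r matrix of bootstrap statistics:

predcut <- function(Tstar, delta = 0.05) {
  Tbar <- colMeans(Tstar); ST <- cov(Tstar)
  D <- sqrt(mahalanobis(Tstar, Tbar, ST))
  sort(D)[ceiling(nrow(Tstar) * (1 - delta))]    # cutoff D_(U_B)
}
# fail to reject H0: theta = 0 if D0 <= predcut(Tstar), where
# D0 <- sqrt(mahalanobis(rep(0, ncol(Tstar)), colMeans(Tstar), cov(Tstar)))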
Methods for bootstrapping the multiple linear regression model are well
known. The estimated covariance matrix of the (ordinary) least squares esti-
mator is

$$\widehat{\mathrm{Cov}}(\hat{\beta}_{OLS}) = MSE\,(X^T X)^{-1}.$$
The residual bootstrap computes the least squares estimator and obtains the
n residuals and fitted values r_1, ..., r_n and Ŷ_1, ..., Ŷ_n. Then a sample of size
n is selected with replacement from the residuals, resulting in r*_{11}, ..., r*_{n1}.
Hence the empirical distribution of the residuals is used. Then a vector Y*_1 =
(Y*_{11}, ..., Y*_{n1})^T is formed where Y*_{i1} = Ŷ_i + r*_{i1}. Then Y*_1 is regressed on X,
resulting in the estimator β̂*_1. This process is repeated B times, resulting in
the estimators β̂*_1, ..., β̂*_B. This method should have n ≥ 10p so that the
residuals r_i are close to the errors e_i.
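A minimal sketch of the residual bootstrap for the OLS full model, assuming a hypothetical data frame dat with response y:

full <- lm(y ~ ., data = dat)
X <- model.matrix(full); fits <- fitted(full); res <- residuals(full)
B <- 1000
betas <- matrix(NA, B, ncol(X))
for (b in 1:B) {
  ystar <- fits + sample(res, length(res), replace = TRUE)
  betas[b, ] <- lm.fit(X, ystar)$coefficients
}
apply(betas, 2, sd)     # residual bootstrap standard errors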
Efron (1982, p. 36) notes that for the residual bootstrap, the sample co-
variance matrix of the β̂*_i is estimating the population bootstrap matrix
[(n − p)/n] MSE (X^T X)^{-1} as B → ∞. Hence the residual bootstrap standard
error SE(β̂*_i) ≈ √((n − p)/n) SE(β̂_{i,OLS}).
If the z_i = (Y_i, x_i^T)^T are iid observations from some population, then a
sample of size n can be drawn with replacement from z_1, ..., z_n. Then the
response and predictor variables can be formed into the vector Y*_1 and design
matrix X*_1. Then Y*_1 is regressed on X*_1, resulting in the estimator β̂*_1. This
process is repeated B times, resulting in the estimators β̂*_1, ..., β̂*_B. If the
z_i are the rows of a matrix Z, then this nonparametric bootstrap uses the
empirical distribution of the z_i.
Following Seber and Lee (2003, p. 100), the classical test statistic for test-
ing H0: Aβ = c, where A is a full rank r × p matrix, is

$$\frac{n}{n-p}\,(A\hat{\beta} - c)^T\,[MSE\;A(X^TX)^{-1}A^T]^{-1}\,(A\hat{\beta} - c),$$

and we expect D_{(U_B)}^2 ≈ [n/(n − p)] χ²_{r,1−δ}, for large n and B, and p << n.
$$U = \sum_{i=1}^{K} \pi_i\, N_p\big(0,\; \sigma^2 W_{I_i,0}\big),$$

where 0 ≤ π_i ≤ 1, Σ_{i=1}^K π_i = 1, and K is the number of subsets I_i that contain S.
Inference techniques for the variable selection model have not had much
success. Efron (2014) lets t(Z) be a scalar valued statistic, based on all of the
data Z, that estimates a parameter of interest θ. Form a bootstrap sample
Z*_i and t(Z*_i) for i = 1, ..., B. Then θ̂ = s(Z) = (1/B) Σ_{i=1}^B t(Z*_i), a boot-
strap smoothing or bagging estimator. In the regression setting with vari-
able selection, Z*_i can be formed with the nonparametric or residual boot-
strap using the full model. The prediction region method can also be applied
to t(Z*). For example, when A is 1 × p, the prediction region method uses
θ = Aβ − c, t(Z) = Aβ̂ − c, and T̄* = θ̂. Efron (2014) used the confidence
interval T̄* ± z_{1−δ/2} SE(T̄*), which is symmetric about T̄*. The prediction re-
gion method uses T̄* ± S_{T*} D_{(U_B)}, which is also a symmetric interval centered
at T̄*. If both the prediction region method and Efron's method are large
sample confidence intervals for θ, then they have the same asymptotic length
(scaled by multiplying by √n), since otherwise the shorter interval will have
lower asymptotic coverage. Since the prediction region interval is a percentile
interval, the shorth(c) interval could have much shorter length than both the
Efron interval and the prediction region interval if the bootstrap distribution
is not symmetric.
The prediction region method can be used for vector valued statistics and
parameters, and may not need the statistic to be asymptotically normal.
These features are likely useful for variable selection models. Prediction in-
tervals and regions can have higher than the nominal coverage 1 − δ if the
distribution is discrete or a mixture of a discrete distribution and some other
distribution. In particular, coverage can be high if the w_i distribution is a
mixture of a point mass at 0 and the method checks whether 0 is in the
prediction region. Such a mixture often occurs for variable selection meth-
ods. The bootstrap sample for the W_i = β̂*_{ij} can contain many zeroes and be
highly skewed if the jth predictor is weak. Then the computer program may
fail because S_w is singular, but if all or nearly all of the β̂*_{ij} = 0, then there
is strong evidence that the jth predictor is not needed given that the other
predictors are in the variable selection method.
As an extreme simulation case, suppose β̂*_{ij} = 0 for i = 1, ..., B and for
each run in the simulation. Consider testing H0: β_j = 0. Then regardless of
the nominal coverage 1 − δ, the closed interval [0,0] will contain 0 for each
run and the observed coverage will be 1 > 1 − δ. Using the open interval
(0,0) would give observed coverage 0. Also intervals [0, b] and [a, 0] correctly
suggest failing to reject β_j = 0, while intervals (0, b) and (a, 0) incorrectly
suggest rejecting H0: β_j = 0. Hence closed regions and intervals make sense.
Olive (2016a) showed that applying the prediction region method results
in a large sample 100(1 − δ)% confidence region for θ for a wide variety of
problems, and used the method for variable selection where θ = β.
Example 3.11. Cook and Weisberg (1999a, pp. 351, 433, 447) give a
data set on 82 mussels sampled off the coast of New Zealand. Let the response
variable be the logarithm log(M) of the muscle mass, and the predictors are
the length L and height H of the shell in mm, the logarithm log(W) of the shell
width W, the logarithm log(S) of the shell mass S, and a constant. Inference
for the full model is shown along with the shorth(c) nominal 95% confidence
intervals for β_i computed using the nonparametric and residual bootstraps.
As expected, the residual bootstrap intervals are close to the classical least
squares confidence intervals β̂_i ± 2SE(β̂_i).
The minimum Cp model from all subsets variable selection uses a constant,
H, and log(S). The shorth(c) nominal 95% confidence intervals for β_i using
the residual bootstrap are shown. Note that the interval for H is right skewed
and contains 0 when closed intervals are used instead of open intervals. The
least squares output is also shown, but should only be used for inference if
the model was selected before looking at the data.
It was expected that log(S) may be the only predictor needed, along with
a constant, since log(S) and log(M) are both log(mass) measurements and
likely highly correlated. Hence we want to test H0: β_2 = β_3 = β_4 = 0 with
the I_min model selected by all subsets variable selection. (Of course this test
would be easy to do with the full model using least squares theory.) Then
H0: Aβ = (β_2, β_3, β_4)^T = 0. Using the prediction region method with the
full model gave an interval [0, 2.930] with D_0 = 1.641. Note that √χ²_{3,0.95} =
2.795. So fail to reject H0. Using the prediction region method with the I_min
variable selection model had [0, D_{(U_B)}] = [0, 3.293] while D_0 = 1.134. So fail
to reject H0.
library(leaps)
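# regboot, rowboot, vselboot, shorth3, and predreg are not base R functions;
# they are assumed to be loaded from the text's collection of R functions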
y <- log(mussels[,5]); x <- mussels[,1:4]
x[,4] <- log(x[,4]); x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000) #get bootstrap CIs,
apply(out$betas,2,shorth3);
apply(tem$betas,2,shorth3);
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
#test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4]
#prediction region method with residual bootstrap
predreg(Abeta)
Abeta <- outvs$betas[,2:4]
#prediction region method with Imin
predreg(Abeta)
Example 3.12. Consider the Gladstone (1905) data set where the vari-
ables are as in Problem 3.6. Output is shown below for the full model and the
bootstrapped minimum Cp forward selection estimator. Note that the shorth
intervals for length and sex are quite long. These variables are often in and
often deleted from the bootstrap forward selection model. Output for I_I is
also shown. For this data set, I_I = I_min.
The regression models used the residual bootstrap on the full model least
squares estimator and on the all subsets variable selection estimator for the
model I_min. The residuals were from least squares applied to the full model
in both cases. Results are shown for when the iid errors e_i ∼ N(0, 1). Ta-
ble 3.2 shows two rows for each model giving the observed confidence interval
coverages and average lengths of the confidence intervals. The term reg is
for the full model regression, and the term vs is for the all subsets variable
selection. The column for the test gives the length and coverage = P(fail
to reject H0) for the interval [0, D_{(U_B)}] where D_{(U_B)} is the cutoff for the
confidence region. The volume of the confidence region will decrease to 0 as
n → ∞. The cutoff will often be near √χ²_{r,0.95} if the statistic T is asymp-
totically normal. Note that √χ²_{2,0.95} = 2.448 is very close to 2.449 for the
full model regression bootstrap test. The coverages were near 0.95 for the
regression bootstrap on the full model. For I_min the coverages were near 0.95
for β_1 and β_2, but higher for the other 3 tests since zeroes often occurred for
β̂*_j for j = 3, 4. The average lengths and coverages were similar for the full
model and all subsets variable selection I_min for β_1 and β_2, but the lengths
were shorter for I_min for β_3 and β_4.
3.5 Diagnostics
Let u_i denote the vector of nontrivial predictors for the ith case. Let

$$\bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i \qquad (3.14)$$

and

$$C = \mathrm{Cov}(U) = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - \bar{u})(u_i - \bar{u})^T, \qquad (3.15)$$
Let

$$\hat{Y}_{(i)} = X\hat{\beta}_{(i)} \qquad (3.16)$$

denote the n × 1 vector of fitted values from estimating β with OLS without
the ith case. Denote the jth element of Ŷ_{(i)} by Ŷ_{(i),j}. It can be shown that
the variance of the ith residual is VAR(r_i) = σ²(1 − h_i). The usual estimator
of the error variance is

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} r_i^2}{n-p}.$$

The (internally) studentized residual

$$e_i = \frac{r_i}{\hat{\sigma}\sqrt{1 - h_i}}$$

has zero mean and approximately unit variance.
Cook's distance for case i is

$$CD_i = \frac{1}{p\,\hat{\sigma}^2}\sum_{j=1}^{n}\big(\hat{Y}_{(i),j} - \hat{Y}_j\big)^2 = \frac{r_i^2}{p\,\hat{\sigma}^2(1 - h_i)}\,\frac{h_i}{1 - h_i} = \frac{e_i^2}{p}\,\frac{h_i}{1 - h_i}.$$
When the statistics CD_i, h_i, and MD_i are large, case i may be an outlier or
influential case. Examining a stem plot or dot plot of these three statistics for
unusually large values can be useful for flagging influential cases. Cook and
Weisberg (1999a, p. 358) suggest examining cases with CD_i > 0.5 and that
cases with CD_i > 1 should always be studied. Since H = H^T and H = HH,
the hat matrix is symmetric and idempotent. Hence the eigenvalues of H are
zero or one, and trace(H) = Σ_{i=1}^n h_i = p. It can be shown that 0 ≤ h_i ≤ 1.
Rousseeuw and Leroy (1987, pp. 220, 224) suggest using h_i > 2p/n and
MD²_i > χ²_{p−1,0.95} as benchmarks for leverages and Mahalanobis distances,
where χ²_{p−1,0.95} is the 95th percentile of a chi-square distribution with p − 1
degrees of freedom.
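These diagnostics are available in base R; a sketch assuming a hypothetical data frame dat with response y:

fit <- lm(y ~ ., data = dat)
cd <- cooks.distance(fit)        # Cook's distances CD_i
h  <- hatvalues(fit)             # leverages h_i
u  <- model.matrix(fit)[, -1]    # nontrivial predictors
md2 <- mahalanobis(u, colMeans(u), cov(u))
which(cd > 0.5)                                   # cases worth studying
which(h > 2 * length(coef(fit)) / nrow(u))        # leverage benchmark 2p/n
which(md2 > qchisq(0.95, df = ncol(u)))           # Mahalanobis benchmark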
Note that Proposition 3.4c) implies that Cook's distance is the product
of the squared residual and a quantity that becomes larger the farther u_i
is from ū. Hence influence is roughly the product of leverage and distance
of Y_i from Ŷ_i (see Fox 1991, p. 21). Mahalanobis distances and leverages
both define hyperellipsoids based on a metric closely related to the sample
covariance matrix of the nontrivial predictors. All points u_i on the same
hyperellipsoidal contour are the same distance from ū and have the same
leverage (or the same Mahalanobis distance).
Cook's distances, leverages, and Mahalanobis distances can be effective for
finding influential cases when there is a single outlier, but can fail if there
are two or more outliers. Nevertheless, these numerical diagnostics combined
with response and residual plots are probably the most effective techniques for
detecting cases that affect the fitted values when the multiple linear regression
model is a good approximation for the bulk of the data.
A scatterplot of x versus y (recall the convention that a plot of x versus y
means that x is on the horizontal axis and y is on the vertical axis) is used to
visualize the conditional distribution y|x of y given x (see Cook and Weisberg
1999a, p. 31). For the simple linear regression model (with one nontrivial
predictor x_2), the most effective technique for checking the assumptions of
the model is to make a scatterplot of x2 versus Y and a residual plot of x2
versus ri . Departures from linearity in the scatterplot suggest that the simple
linear regression model is not adequate. The points in the residual plot should
scatter about the line r = 0 with no pattern. If curvature is present or if the
distribution of the residuals depends on the value of x2 , then the simple linear
regression model is not adequate.
In general there is more than one nontrivial predictor and in this setting
two plots are crucial for any multiple linear regression analysis, re-
gardless of the regression estimator (e.g., OLS, L1 etc.). The first plot is the
residual plot of the fitted values Ŷ_i versus the residuals r_i, and the second
plot is the response plot of the fitted values Ŷ_i versus the response Y_i.
Recalling Definitions 2.11 and 2.12, residual and response plots are plots of
w_i = x_i^T η versus r_i and Y_i, respectively, where η is a known p × 1 vector. The
most commonly used residual and response plots take η = β̂. Plots against
the individual predictors xj and potential predictors are also used. If the
residual plot is not ellipsoidal with zero slope, then the unimodal MLR model
(with iid errors from a unimodal distribution that is not highly skewed) is
not sustained. In other words, if the variables in the residual plot show some
type of dependency, e.g. increasing variance or a curved pattern, then the
multiple linear regression model may be inadequate. Proposition 2.1 showed
that the response plot simultaneously displays the fitted values, response, and
residuals. The plotted points in the response plot should scatter about the
identity line if the multiple linear regression model holds. Recall that residual
plots magnify departures from the model while the response plot emphasizes
how well the model fits the data.
When the bulk of the data follows the MLR model, the following rules of
thumb are useful for finding influential cases and outliers from the response
and residual plots. Look for points with large absolute residuals and for points
far away from Ȳ. Also look for gaps separating the data into clusters. To
determine whether small clusters are outliers or good leverage points, give
zero weight to the clusters, and fit an MLR estimator to the bulk of the
data. Denote the weighted estimator by β̂_w. Then plot Ŷ_w versus Y using
the entire data set. If the identity line passes through the bulk of the data
but not the cluster, then the cluster points may be outliers. In Figure 3.7,
the 5 infants are good leverage points in that the fit to the bulk of the data
passes through the cluster of infants. For the Buxton (1920) data, the cluster
of cases far from the bulk of the data in Figure 3.11 are outliers.
To see why gaps are important, recall that the coefficient of determination
R² is equal to the squared correlation (corr(Y, Ŷ))². R² overemphasizes the
strength of the MLR relationship when there are two clusters of data since
much of the variability of Y is due to the smaller cluster.
Definition 3.12. Outliers are cases that lie far from the bulk of the data.
Hence Y outliers are cases that have unusually large vertical distances from
the MLR fit to the bulk of the data while x outliers are cases with predictors
x that lie far from the bulk of the xi . Suppose that some analysis to detect
outliers is performed. Masking occurs if the analysis suggests that one or
more outliers are in fact good cases. Swamping occurs if the analysis suggests
that one or more good cases are outliers.
The residual and response plots are very useful for detecting outliers. If
there is a cluster of cases with outlying Y s, the identity line will often pass
through the outliers. If there are two clusters with similar Y s, then the two
plots may fail to show the clusters. Then using methods to detect x outliers
may be useful.
Let the q continuous predictors in the MLR model be collected into vectors
u_i for i = 1, ..., n. Let the n × q matrix W have n rows u_1^T, ..., u_n^T. Let the
q × 1 column vector T(W) be a multivariate location estimator, and let the
q × q symmetric positive definite matrix C(W) be a covariance estimator.
Often q = p − 1 and only the constant is omitted from x_i to create u_i. Define
the squared Mahalanobis distance

$$D_i^2 = D_i^2(T(W), C(W)) = (u_i - T(W))^T\, C(W)^{-1}\, (u_i - T(W))$$

for each point u_i. Notice that D_i^2 is a random variable (scalar valued). The
classical estimators are

$$T(W) = \bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i \quad \text{and} \quad C(W) = S = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - T(W))(u_i - T(W))^T.$$
Olive (2002) shows that the plotted points in the DD plot will follow the
identity line with zero intercept and unit slope if the predictor distribution
is multivariate normal (MVN), and will follow a line with zero intercept but
nonunit slope if the distribution is elliptically contoured with nonsingular
covariance matrix but not MVN. (Such distributions have linear scatterplot
matrices. See Chapter 10.) Hence if the plotted points in the DD plot follow
some line through the origin, then there is some evidence that outliers and
strong nonlinearities have been removed from the predictors.
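A sketch of a DD plot that uses the MCD estimator from the MASS package as the robust estimator (the text's DD plots are based on other robust estimators); u is assumed to hold the continuous predictors:

library(MASS)
u <- as.matrix(dat[, c("x1", "x2", "x3")])   # hypothetical predictor columns
MD <- sqrt(mahalanobis(u, colMeans(u), cov(u)))
rob <- cov.rob(u, method = "mcd")
RD <- sqrt(mahalanobis(u, rob$center, rob$cov))
plot(MD, RD); abline(0, 1)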
Fig. 3.11 Response plot and residual plot for the Buxton data, with cases 61–65 highlighted.
Figure 3.11 shows the response plot and residual plot for the Buxton
data. Although an index plot of Cook's distance CD_i may be useful for
flagging influential cases, the index plot provides no direct way of judg-
ing the model against the data. As a remedy, cases in the response plot
with CD_i > min(0.5, 2p/n) were highlighted. Notice that the OLS fit passes
through the outliers, but the response plot is resistant to Y outliers since Y
is on the vertical axis. Also notice that although the outlying cluster is far
from Ȳ, only two of the outliers had large Cook's distance. Hence masking
occurred for both Cook's distances and for OLS residuals, but not for OLS
fitted values.
Fig. 3.12 DD plots for the Buxton data: a) all cases, b) with the five outliers deleted.
Figure 3.12a shows the DD plot made from the four predictors head length,
nasal height, bigonal breadth, and cephalic index. The five massive outliers cor-
respond to head lengths that were recorded to be around 5 feet. Figure 3.12b
is the DD plot computed after deleting these points and suggests that the
predictor distribution is now much closer to a multivariate normal distribu-
tion.
High leverage outliers are a particular challenge to conventional numer-
ical MLR diagnostics such as Cook's distance, but can often be visualized
using the response and residual plots. The following techniques are useful for
detecting outliers when the multiple linear regression model is appropriate.
1. Find the OLS residuals and fitted values and make a response plot and
a residual plot. Look for clusters of points that are separated from the
bulk of the data and look for residuals that have large absolute values.
Beginners frequently label too many points as outliers. Try to estimate
the standard deviation of the residuals in both plots. In the residual plot,
look for residuals that are more than 5 standard deviations away from
the r = 0 line. The identity line and r = 0 line may pass right through a
cluster of outliers, but the cluster of outliers can often be detected because
there is a large gap between the cluster and the bulk of the data, as in
Figure 3.11.
2. Make a DD plot of the predictors that take on many values (the continuous
predictors).
3. Make a scatterplot matrix of several diagnostics such as leverages, Cook's
distances, and studentized residuals.
3.7 Summary
FF and RR plots cluster tightly about the identity line. In the RR plot, the
OLS line and identity line can be added to the plot as visual aids. It should
be difficult to see that the OLS and identity lines intersect at the origin, so
the two lines should nearly coincide at the origin. If the FF plot looks good
but the RR plot does not, the submodel may be good if the main goal of the
analysis is for prediction.
10) Forward selection Step 1) k = 1: Start with a constant w1 = x1 .
Step 2) k = 2: Compute Cp for all models with k = 2 containing a constant
and a single predictor xi . Keep the predictor w2 = xj , say, that minimizes Cp .
Step 3) k = 3: Fit all models with k = 3 that contain w1 and w2 . Keep the
predictor w3 that minimizes Cp . . . .
Step j) k = j: Fit all models with k = j that contain w_1, w_2, ..., w_{j−1}. Keep
the predictor wj that minimizes Cp . . . .
Step p): Fit the full model.
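Forward selection with the Cp criterion can be run with the regsubsets function from the leaps package; a sketch assuming a hypothetical data frame dat with response y:

library(leaps)
out <- regsubsets(y ~ ., data = dat, method = "forward", nvmax = ncol(dat) - 1)
cp <- summary(out)$cp
coef(out, which.min(cp))    # coefficients of the minimum Cp model Imin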
11) Let I_min correspond to the submodel with the smallest Cp. Find
the submodel I_I with the fewest number of predictors such that Cp(I_I) ≤
Cp(I_min) + 1. Then I_I is the initial submodel that should be examined. It
is possible that I_I = I_min or that I_I is the full model. Models I with fewer
predictors than I_I such that Cp(I) ≤ Cp(I_min) + 4 are interesting and should
also be examined. Models I with k predictors, including a constant and with
fewer predictors than I_I such that Cp(I_min) + 4 < Cp(I) ≤ min(2k, p) should
be checked.
12) There are several guidelines for building an MLR model. Suppose
that variable Z is of interest and variables W1 , . . . , Wr have been collected
along with Z. Make a scatterplot matrix of W1 , . . . , Wr and Z. (If r is large,
several matrices may need to be made. Each one should include Z.) Remove
or correct any gross outliers. It is often a good idea to transform the Wi
to remove any strong nonlinearities from the predictors. Eventually
3.8 Complements
With one data set, OLS is a great place to start but a bad place to end. If
n = 5kp where k > 2, it may be useful to take a random sample of n/k cases
to build the MLR model. Then check the model on the full data set.
Predictor Transformations
Cook (1993) shows that partial residual plots are useful for visualizing m
provided that the plots of x_i versus x_j are linear. More general ceres plots, in
particular ceres plots with smooth augmentation, can be used to visualize m
if Y = u^T β + m(x_j) + e but the linearity condition fails. Fitting the additive
model Y = β_1 + Σ_{j=2}^p S_j(x_j) + e or Y = β_1 + β_2 x_2 + ⋯ + β_{j−1} x_{j−1} + S(x_j) +
β_{j+1} x_{j+1} + ⋯ + β_p x_p + e and plotting Ŝ(x_j) can be useful. Similar ideas are
also useful for GLMs. See Chapter 13 and Olive (2013b) which also discusses
response plots for many regression models.
The assumption that all values of x1 and x2 are positive for power trans-
formation can be removed by using the modified power transformations of
Yeo and Johnson (2000).
Response Transformations
Application 3.1 was suggested by Olive (2004b, 2013b) for additive error
regression models Y = m(x) + e. An advantage of this graphical method is
that it works for linear models: that is, for multiple linear regression and for
many experimental design models. Notice that if the plotted points in the
transformation plot follow the identity line, then the plot is also a response
plot. The method is also easily performed for MLR methods other than least
squares.
A variant of the method would plot the residual plot or both the response
and the residual plot for each of the seven values of λ. Residual plots are also
useful, but they do not distinguish between nonlinear monotone relationships
and nonmonotone relationships. See Fox (1991, p. 55).
Cook and Olive (2001) also suggest a graphical method for selecting and
assessing response transformations under model (3.2). Cook and Weisberg
(1994) show that a plot of Z versus xT β̂ (swap the axes on the transformation
plot for λ = 1) can be used to visualize t if Y = t(Z) = xT β + e, suggesting
that t^{-1} can be visualized in a plot of xT β̂ versus Z.
If there is nonlinearity present in the scatterplot matrix of the nontrivial
predictors, then transforming the predictors to remove the nonlinear-
ity will often be a useful procedure. More will be said about response
transformations for experimental designs in Section 5.4.
In R, all subsets, forward, and backward variable selection can be performed
with the regsubsets function from the leaps package; see ?regsubsets.
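The sketch below (assuming the leaps package and a hypothetical data frame dat with response y) illustrates forward selection with the Cp criterion and the Cp(I_I) ≤ Cp(Imin) + 1 rule from points 10) and 11) of the Summary.

library(leaps)  # regsubsets: forward selection, one best model per size
out <- summary(regsubsets(y ~ ., data = dat, method = "forward", nvmax = 10))
cp <- out$cp                          # Cp for the best model of each size
imin <- which.min(cp)                 # the minimum Cp model Imin
II <- min(which(cp <= cp[imin] + 1))  # fewest predictors with Cp <= Cp(Imin) + 1
out$which[II, ]                       # terms (TRUE) in the initial submodel I_I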
Bootstrap
Olive (2016a,b,c) showed that the prediction region method for creating
a large sample 100(1 − δ)% confidence region for an r × 1 parameter vector
is a special case of the percentile method when r = 1, and gave sufficient
conditions for r > 1. The shorth method gives the shortest percentile method
intervals, asymptotically, and should be used when B ≥ 1000. Efron (2014)
reviews some alternative methods for variable selection inference.
Consider the residual bootstrap, and let rW denote an n × 1 random vector
of elements selected with replacement from the n residuals r1, . . . , rn. Then
there are K = n^n possible values for rW. Let rW_1, . . . , rW_K be the possible
values of rW. These values are equally likely, so each is selected with probability
1/K. Note that rW has a discrete distribution. Then

E(rW_j) = (E(r_{1j}), . . . , E(r_{nj}))^T.

Now the marginal distribution of r_{ij} takes on the n values r1, . . . , rn with
the same probability 1/n. So each of the n marginal distributions is the
empirical distribution of the residuals. Hence E(r_{ij}) = Σ_{i=1}^n ri/n = r̄, and
r̄ = 0 for least squares residuals for multiple linear regression when there
is a constant in the model. So for least squares, E(rW_j) = 0, and the jth
bootstrap estimator β̂*_j = (X^T X)^{-1} X^T (Ŷ + rW_j) satisfies

E(β̂*_j) = (X^T X)^{-1} X^T E(Ŷ + rW_j) = (X^T X)^{-1} X^T Ŷ
= (X^T X)^{-1} X^T HY = (X^T X)^{-1} X^T Y = β̂_OLS.
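A minimal R sketch of the residual bootstrap for OLS follows, using simulated data; the names and settings are illustrative only. The average of the bootstrap coefficient vectors is close to the OLS estimator, in line with the expectation computed above.

set.seed(1)
n <- 100; x <- matrix(rnorm(2 * n), n, 2)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(n)
fit <- lm(y ~ x)
yhat <- fitted(fit); res <- resid(fit)
B <- 1000
betas <- t(replicate(B, {
  ystar <- yhat + sample(res, n, replace = TRUE)  # resample residuals with replacement
  coef(lm(ystar ~ x))                             # OLS fit to Yhat + r^W
}))
colMeans(betas)   # approximately equal to coef(fit)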
Diagnostics
Outliers
Theory for the RMVN estimator and two related robust estimators is given in
Olive (2008: ch. 10, 2016c: ch. 4) and Olive and Hawkins (2010). These three
estimators are also used in Zhang et al. (2012).
Lasso and Other Variable Selection Techniques
Response plots, prediction intervals, and the bootstrap prediction region
method are also useful for other variable selection techniques such as lasso
and ridge regression. If n ≤ 400 and p ≤ 3000, Bertsimas et al. (2016) give a
fast all subsets variable selection method.
Recent theory for lasso assumes that the shrinkage parameter λ is selected
before looking at the data, rather than being estimated using k-fold cross
validation. See Hastie et al. (2015). The prediction region method appears to
be useful when n >> p if none of the βi = 0, but (in 2016) it takes a long time
to simulate lasso with k-fold cross validation.
Lasso seems to work under ASSUMPTION L: assume the predictors are
uncorrelated or the number of active predictors (predictors with nonzero
coefficients) is not much larger than 20. When n is fixed and p increases, the
lasso prediction intervals increase in length slowly provided that assumption
L holds. Methods are being developed that should work under more reasonable
assumptions. See Pelawa Watagoda (2017) and Pelawa Watagoda and Olive
(2017).
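For concreteness, the sketch below fits lasso with 10-fold cross validation using the glmnet package on simulated data. The package and the simulated example are assumptions for illustration; this is not the simulation behind the timing remarks above.

library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(n)
cvfit <- cv.glmnet(x, y, nfolds = 10)     # lambda chosen by k-fold cross validation
coef(cvfit, s = "lambda.min")             # lasso coefficients at the selected lambda
predict(cvfit, newx = x[1:5, ], s = "lambda.min")  # predictions for the first 5 cases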
3.9 Problems
3.1. From the above output from backward elimination, what terms
should be used in the MLR model to predict Y? (You can tell that the non-
trivial variables are finger to ground, nasal height, and sternal height from
the delete lines. DON'T FORGET THE CONSTANT!)
3.2. The table on the following page gives summary statistics for 4 MLR
models considered as final submodels after performing variable selection. The
response plot and residual plot for the full model L1 were good. Model L3 was
the minimum Cp model found. Which model should be used as the final
submodel? Explain briefly why each of the other 3 submodels should not be
used.
3.3. The above table gives summary statistics for 4 MLR models consid-
ered as final submodels after performing variable selection. The response plot
and residual plot for the full model L1 were good. Model L2 was the minimum
Cp model found.
a) Which model is I_I, the initial submodel to look at?
b) What other model or models, if any, should be examined?
3.4. The output below and on the following page is from software that
does all subsets variable selection. The data is from Ashworth (1842). The
predictors were A = log(1692 property value), B = log(1841 property value),
and C = log(percent increase in value), while the response variable is Y =
log(1841 population).
a) The top output corresponds to the data with 2 small outliers. From this
output, what is the best model? Explain briefly.
b) The bottom output corresponds to the data with the 2 outliers removed.
From this output, what is the best model? Explain briefly.
k  CP     ADJ R SQ  R SQ    RESID SS  VARIABLES
-- -----  --------  ------  --------  -------------
1 903.5 0.0000 0.0000 183.102 INTERCEPT ONLY
2 0.7 0.9052 0.9062 17.1785 B
2 406.6 0.4944 0.4996 91.6174 A
2 426.0 0.4748 0.4802 95.1708 C
3 2.1 0.9048 0.9068 17.0741 A C
3 2.6 0.9043 0.9063 17.1654 B C
3 2.6 0.9042 0.9062 17.1678 A B
4 4.0 0.9039 0.9069 17.0539 A B C
R Problems
zx <- cbrainx[,c(1,3,5,6,7,8,9,10)]        # keep a subset of the predictor columns
zbrain <- as.data.frame(cbind(cbrainy,zx)) # response plus predictors
zfull <- lm(cbrainy~.,data=zbrain)         # fit the full model
summary(zfull)
back <- step(zfull)                        # backward elimination (AIC)
To quit Arc, move the cursor to the x in the northeast corner and click.
Problems 3.7-3.11 use data sets that come with Arc (Cook and Weisberg
1999a).
3.7∗. a) In Arc enter the menu commands File>Load>Data and open
the file big-mac.lsp. Next use the menu commands Graph&Fit> Plot of to
obtain a dialog window. Double click on TeachSal and then double click on
BigMac. Then click on OK. These commands make a plot of x = TeachSal =
primary teacher salary in thousands of dollars versus y = BigMac = minutes
of labor needed to buy a Big Mac and fries. Include the plot in Word.
3.8∗. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. Use the commands Graph&Fit>Scatterplot Matrix of. In
the dialog window select H, L, W, S, and M (so select M last). Click on OK
and include the scatterplot matrix in Word. The response M is the edible part
of the mussel while the 4 predictors are shell measurements. Are any of the
marginal predictor relationships nonlinear? Is E(M|H) linear or nonlinear?
a) Fit the full model with Y = log(Vol), X1 = log(D), and X2 = log(Ht).
Add the output that has the LS coefficients to Word.
b) Fitting the full model will result in the menu L1. Use the commands
L1>AVP-All 2D. This will create a plot with a slider bar at the bottom
that says log[D]. This is the added variable plot for log(D). To make an added
variable plot for log(Ht), click on the slider bar. Add the OLS line to the AV
plot for log(Ht) by moving the OLS slider bar to 1, and add the zero line by
clicking on the Zero line box. Include the resulting plot in Word.
c) Fit the reduced model that drops log(Ht). Make an RR plot with the
residuals from the full model on the V axis and the residuals from the
submodel on the H axis. Add the LS line and the identity line as visual aids.
(Click on the Options menu to the left of the plot and type y=x in the
resulting dialog window to add the identity line.) Include the plot in Word.
d) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the plot in Word.
e) Next put the residuals from the submodel on the V axis and log(Ht)
on the H axis. Move the OLS slider bar to 1, and include this residual plot
in Word.
f) Next put the residuals from the submodel on the V axis and the fitted
values from the submodel on the H axis. Include this residual plot in Word.
g) Next put log(Vol) on the V axis and the fitted values from the submodel
on the H axis. Move the OLS slider bar to 1, and include this response plot
in Word.
c) From the Graph&Fit menu, select Fit linear LS. Use log[BigMac] as the
response and the other 9 log variables as the Terms. This model is the full
model. Include the output in Word.
d) Make a response plot (L1:Fit-Values in H and log(BigMac) in V) and
residual plot (L1:Fit-Values in H and L1:Residuals in V), and include both
plots in Word.
e) Using the L1 menu, select Examine submodels and try forward
selection and backward elimination. Using the Cp(I) ≤ min(2k, p) rule suggests
that the submodel using log[service], log[TeachSal], and log[TeachTax] may be
good. From the Graph&Fit menu, select Fit linear LS, fit the submodel,
and include the output in Word.
f) Make a response plot (L2:Fit-Values in H and log(BigMac) in V) and
residual plot (L2:Fit-Values in H and L2:Residuals in V) for the submodel,
and include the plots in Word.
g) Make an RR plot (L2:Residuals in H and L1:Residuals in V) and FF plot
(L2:Fit-Values in H and L1:Fit-Values in V) for the submodel, and include
the plots in Word. Move the OLS slider bar to 1 in each plot to add the
identity line. For the RR plot, click on the Options menu then type y = x in
the long horizontal box near the bottom of the window and click on OK to
add the identity line.
h) Do the plots and output suggest that the submodel is good? Explain.
Warning: The following problems use data from the book's web-
page (http://lagrange.math.siu.edu/Olive/lregbk.htm). Save the
data files on a flash drive G, say. Get in Arc and use the menu commands
File > Load and a window will appear. Click on Removable Disk (G:).
Then click twice on the data set name.
3.12∗. The following data set has 5 babies that are good leverage
points: they look like outliers but should not be deleted because they follow
the same model as the bulk of the data.
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cbrain.lsp. Select transform from the cbrain menu, and add
size^(1/3) using the power transformation option (p = 1/3). From
Graph&Fit, select Fit linear LS. Let the response be brnweight and as terms
include everything but size and Obs. Hence your model will include size^(1/3).
This regression will add L1 to the menu bar. From this menu, select Examine
submodels. Choose forward selection. You should get models including k = 2
to 12 terms including the constant. Find the model with the smallest Cp(I) =
C_I statistic and include all models with the same k as that model in Word.
That is, if k = 2 produced the smallest C_I, then put the block with k = 2
into Word. Next go to the L1 menu, choose Examine submodels and choose
Backward Elimination. Find the model with the smallest C_I and include all
of the models with the same value of k in Word.
g) For your submodel in f), make an RR plot with the residuals from the
full model on the V axis and the residuals from the submodel on the H axis.
Add the OLS line and the identity line y=x as visual aids. Include the RR
plot in Word.
h) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the FF plot in Word.
i) Using the submodel, include the response plot (of Ŷ versus Y) and
residual plot (of Ŷ versus the residuals) in Word.
j) Using results from f)-i), explain why your submodel is a good model.
3.13. Activate the cyp.lsp data set. Choosing no more than 3 nonconstant
terms, try to predict height with multiple linear regression. Include a plot with
the fitted values on the horizontal axis and height on the vertical axis. Is your
model linear? Also include a plot with the fitted values on the horizontal axis
and the residuals on the vertical axis. Does the residual plot suggest that the
linear model may be inappropriate? (There may be outliers in the plot. These
could be due to typos or because the error distribution has heavier tails than
the normal distribution.) State which model you use.
3.16. This problem gives a slightly simpler model than Problem 3.15 by
using the indicator variable x3 = 1 if standard cement (if x2 = 2) and x3 =
0 otherwise (if x2 is 0 or 1). Activate the cement.lsp data.
a) From the cement menu, select Transform, select x1, and place a 2 in the
p box. This should add x1^2 to the data set. From the cement menu, select
Make interactions and select x1 and x3.
b) From Graph&Fit select Fit linear LS, select x1, x1^2, x3, and x1*x3 as
the terms and y as the response. Include the output in Word.
c) Make the response and residual plots. When making these plots, place
x2 in the Mark by box. Include the plots in Word. Does the model seem ok?
3.17∗. Get the McDonald and Schwing (1973) data pollution.lsp from
(http://lagrange.math.siu.edu/Olive/lregbk.htm), and save the file on
a flash drive. Activate the pollution.lsp dataset with the menu commands
File > Load > Removable Disk (G:) > pollution.lsp. Scroll up the screen
to read the data description. Often simply using the log rule on the predictors
with max(x)/min(x) > 10 works wonders.
a) Make a scatterplot matrix of the first nine predictor variables and the
response Mort. The commands Graph&Fit > Scatterplot-Matrix of will bring
down a Dialog menu. Select DENS, EDUC, HC, HOUS, HUMID, JANT,
JULT, NONW, NOX, and MORT. Then click on OK.
A scatterplot matrix with slider bars will appear. Move the slider bars for
NOX, NONW, and HC to 0, providing the log transformation. In Arc, the
diagonals have the min and max of each variable, and these were the three
predictor variables satisfying the log rule. Open Word.
In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the scatterplot matrix into the Word
document. Print the graph.
b) Make a scatterplot matrix of the last six predictor variables and the
response Mort. The commands Graph&Fit > Scatterplot-Matrix of will
bring down a Dialog menu. Select OVR65, POOR, POPN, PREC, SO,
WWDRK, and MORT. Then click on OK. Move the slider bar of SO to 0
and copy the plot into Word. Print the plot as described in a).
c) Click on the pollution menu and select Transform. Click on the log
transformations button and select HC, NONW, NOX, and SO. Click on OK.
Then fit the full model with the menu commands Graph&Fit > Fit linear
LS. Select MORT for the response. For the terms, select DENS, EDUC,
log[HC], HOUS, HUMID, JANT, JULT, log[NONW], log[NOX], OVR65,
POOR, POPN, PREC, log[SO], and WWDRK. Click on OK.
This model is the full model. To make the response plot use the menu
commands Graph&Fit >Plot of. Select MORT for the V-box and L1:Fit-
Values for the H-box. Click on OK. When the graph appears, move the OLS
slider bar to 1 to add the identity line. Copy the plot into Word.
To make the residual plot use the menu commands Graph&Fit >Plot
of. Select L1:Residuals for the V-box and L1:Fit-Values for the H-box. Click
on OK. Copy the plot into Word. Print the two plots.
to count a woman's husband if he was not at home. Do not use the predictor
X2 in the full model. Do parts a), b), and c) above Problem 3.19.
3.24∗. For the file pop.lsp, described below, use Z = Y. Do parts a), b),
and c) above Problem 3.19.
This data set comes from Ashworth (1842). Try transforming all variables
to logs. Then the added variable plots show two outliers. Delete these two
cases. Notice the effect of these two outliers on the p-values for the coefficients
and on numerical methods for variable selection.
Note: then log(Y) and log(X2) make a good submodel.
3.25∗. For the file pov.lsp, described below, use i) Z = flife and ii)
Z = gnp2 = gnp + 2. This data set comes from Rouncefield (1995). Making
loc into a factor may be a good idea. Use the commands poverty>Make factors
and select the variable loc. For ii), try transforming to logs and deleting the 6
cases with gnp2 = 0. (These cases had missing values for gnp. The file povc.lsp
has these cases deleted.) Try your final submodel on the data that includes
the 6 cases with gnp2 = 0. Do parts a), b), and c) above Problem 3.19.
b) Write down your final model (e.g., a very poor final model is
exp(BigMac) = β1 + β2 exp(EngSal) + β3 (TeachSal)^3 + e).
c) Include the least squares output for your model and between 3 and 5
plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.
3.28. This is like Problem 3.27 with the BigMac data. Assume that a
multiple linear regression model holds for Y = t(Z) and for some terms
(usually powers or logs of the predictors). Using the techniques learned in
class, find such a model. Give output for the full model, output for the final
submodel, and use several plots to justify your choices. These data sets, as
well as the BigMac data set, come with Arc. See Cook and Weisberg (1999a).
(INSTRUCTOR: Allow 2 hours for each part.)
file "response" Z
a) allomet.lsp BRAIN
b) casuarin.lsp W
c) evaporat.lsp Evap
d) hald.lsp Y
e) haystack.lsp Vol
f) highway.lsp rate
(From the menu Highway, select "Add a variate" and
type sigsp1 = sigs + 1. Then you can transform
sigsp1.)
g) landrent.lsp Y
h) ozone.lsp ozone
i) paddle.lsp Weight
j) sniffer.lsp Y
k) water.lsp Y
i) Write down the full model that you use and include the full model
residual plot and response plot in Word. Give R2 for the full model.
iii) Include the least squares output for your model and between 3 and
5 plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.
3.29∗. a) Activate buxton.lsp (you need to download the file onto your
flash drive Removable Disk (G:)). From the Graph&Fit menu, select Fit
linear LS. Use height as the response variable and bigonal breadth, cephalic
index, head length, and nasal height as the predictors. Include the output in
Word.
b) Make a response plot (L1:Fit-Values in H and height in V) and residual
plot (L1:Fit-Values in H and L1:Residuals in V) and include both plots in
Word.
c) In the residual plot use the mouse to move the cursor just above and
to the left of the outliers. Hold the leftmost mouse button down and move
the mouse to the right and then down. This will make a box on the residual
plot that contains the outliers. Go to the Case deletions menu and click
on Delete selection from data set. From the Graph&Fit menu, select Fit
linear LS and fit the same model as in a) (the model should already be
entered, just click on OK). Include the output in Word.
d) Make a response plot (L2:Fit-Values in H and height in V) and residual
plot (L2:Fit-Values in H and L2:Residuals in V) and include both plots in
Word.
e) Explain why the outliers make the MLR relationship seem much
stronger than it actually is. (Hint: look at R2 .)
Variable Selection in SAS
3.30. Copy and paste the SAS program for this problem into the SAS
editor. Then perform the menu commands Run>Submit to obtain about
15 pages of output. Do not print out the output.
The data is from SAS Institute (1985, pp. 695-704, 717-718). Aerobic
fitness is being measured by the ability to consume oxygen. The response
Y = Oxygen (uptake rate) is expensive to measure, and it is hoped that
the OLS Ŷ can be used instead. The variables are Age in years, Weight in
kg, RunTime = time in minutes to run 1.5 miles, RunPulse = heart rate
when Y is measured, RestPulse = heart rate while resting, and MaxPulse =
maximum heart rate recorded while running.
The concepts of a random vector, the expected value of a random vector, and
the covariance matrix of a random vector are needed before covering generalized
least squares. Recall that for random variables Yi and Yj, the covariance of Yi and
Yj is Cov(Yi, Yj) ≡ σ_{i,j} = E[(Yi − E(Yi))(Yj − E(Yj))] = E(Yi Yj) − E(Yi)E(Yj)
provided the second moments of Yi and Yj exist. The covariance matrix of the
random vector Y is Cov(Y) = E[(Y − E(Y))(Y − E(Y))^T], where the ij entry of
Cov(Y) is Cov(Yi, Yj) = σ_{i,j} provided that each σ_{i,j} exists. Otherwise Cov(Y)
does not exist.
Also
E(AY) = AE(Y) and E(AY B) = AE(Y)B, (4.2)
and
Cov(a + AY) = Cov(AY) = A Cov(Y) A^T. (4.3)
For the OLS model, E(Y) = Xβ + E(e) = Xβ. The hat matrix H =
X(X^T X)^{-1} X^T satisfies H^T = H and HH = H. Recall that the vector of
residuals r_OLS = (I − H)Y = Y − Ŷ_OLS. Hence E(r_OLS) = E(Y) − E(Ŷ_OLS)
= E(Y) − E(Y) = 0. Using (4.3) and (4.4), Cov(r_OLS) can also be found.
Definition 4.3. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the generalized least squares (GLS)
model is
Y = Xβ + e, (4.5)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Also E(e) = 0 and Cov(e) = σ²V where V is a
known n × n positive definite matrix.
Definition 4.4. The GLS estimator is β̂_GLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y,
and the fitted values are Ŷ_GLS = X β̂_GLS. The weighted least squares (WLS)
model with weights w1, . . . , wn is the special case of the GLS model where V
is diagonal: V = diag(v1, . . . , vn) with wi = 1/vi. Hence the WLS model is
Y = Xβ + e, (4.7)
and the WLS estimator is
β̂_WLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y. (4.8)
The feasible generalized least squares (FGLS) estimator uses an estimator
V̂ = V(θ̂) in place of the unknown V, giving β̂_FGLS = (X^T V̂^{-1} X)^{-1} X^T V̂^{-1} Y.
The fitted values are Ŷ_FGLS = X β̂_FGLS. The feasible weighted least squares
(FWLS) estimator is the special case of the FGLS estimator where V̂ =
V(θ̂) is diagonal. Hence the estimated weights ŵi = 1/v̂i = 1/vi(θ̂). The
FWLS estimator and fitted values will be denoted by β̂_FWLS and Ŷ_FWLS,
respectively.
Notice that the ordinary least squares (OLS) model is a special case of
GLS with V = I_n, the n × n identity matrix. It can be shown that the GLS
estimator minimizes the GLS criterion Q_GLS(η) = (Y − Xη)^T V^{-1} (Y − Xη).
Notice that the FGLS and FWLS estimators have p + q + 1 unknown param-
eters, where θ is a q × 1 vector. These estimators can perform very poorly if
n < 10(p + q + 1).
The GLS and WLS estimators can be found from the OLS regression
(without an intercept) of a transformed model. Typically there will be a
constant in the model: the first column of X is a vector of ones. Following
Seber and Lee (2003, pp. 66-68), there is a nonsingular n × n matrix K
such that V = KK^T. Let Z = K^{-1}Y, U = K^{-1}X, and ε = K^{-1}e. This
method uses the fast, but rather unstable, Cholesky decomposition.
Proposition 4.1. a)
Z = Uβ + ε (4.10)
follows the OLS model since E(ε) = 0 and Cov(ε) = σ² I_n.
b) The GLS estimator β̂_GLS can be obtained from the OLS regression
(without an intercept) of Z on U.
c) For WLS, Yi = xi^T β + ei. The corresponding OLS model Z = Uβ + ε
is equivalent to Zi = ui^T β + εi for i = 1, . . . , n where ui^T is the ith row of U.
Then Zi = √wi Yi and ui = √wi xi. Hence β̂_WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on ui = √wi xi.
Proof. a) E(ε) = K^{-1}E(e) = 0 and
Cov(ε) = K^{-1} Cov(e) (K^{-1})^T = σ² K^{-1} V (K^{-1})^T
= σ² K^{-1} K K^T (K^{-1})^T = σ² I_n.
Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is K^{-1} 1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U.
Then
β̂_ZU = (U^T U)^{-1} U^T Z = (X^T (K^{-1})^T K^{-1} X)^{-1} X^T (K^{-1})^T K^{-1} Y
= (X^T V^{-1} X)^{-1} X^T V^{-1} Y = β̂_GLS since (K^{-1})^T K^{-1} = (KK^T)^{-1} = V^{-1}.
Following Johnson and Wichern (1988, p. 51) and Freedman (2005, p. 54),
there is a symmetric, nonsingular n × n square root matrix R = V^{1/2} such
that V = RR. Let Z = R^{-1}Y, U = R^{-1}X, and ε = R^{-1}e. This method
uses the spectral theorem (singular value decomposition) and has better com-
putational properties than the transformation based on the Cholesky decompo-
sition.
Proposition 4.2. a)
Z = Uβ + ε (4.11)
follows the OLS model since E(ε) = 0 and Cov(ε) = σ² I_n.
b) The GLS estimator β̂_GLS can be obtained from the OLS regression
(without an intercept) of Z on U.
c) For WLS, Yi = xi^T β + ei. The corresponding OLS model Z = Uβ + ε
is equivalent to Zi = ui^T β + εi for i = 1, . . . , n where ui^T is the ith row of U.
Then Zi = √wi Yi and ui = √wi xi. Hence β̂_WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on ui = √wi xi.
Proof. a) E(ε) = R^{-1}E(e) = 0 and, since R is symmetric,
Cov(ε) = R^{-1} Cov(e) (R^{-1})^T = σ² R^{-1} R R R^{-1} = σ² I_n.
Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is R^{-1} 1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U.
Then
β̂_ZU = (U^T U)^{-1} U^T Z = (X^T R^{-1} R^{-1} X)^{-1} X^T R^{-1} R^{-1} Y
= (X^T V^{-1} X)^{-1} X^T V^{-1} Y = β̂_GLS since R^{-1} R^{-1} = V^{-1}.
Remark 4.1. Standard software produces WLS output and the ANOVA
F test and Wald t tests are performed using this output.
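The equivalence in Proposition 4.1 c) can be checked numerically. The R sketch below (simulated data, illustrative only) computes WLS both with the weights argument of lm and as OLS without an intercept on √wi Yi and √wi xi.

set.seed(1)
n <- 50; x <- runif(n, 1, 10)
y <- 1 + 2 * x + x * rnorm(n)        # error variance grows with x
w <- 1 / x^2                          # weights w_i = 1/v_i, treated as known
fit1 <- lm(y ~ x, weights = w)        # WLS from standard software
X <- cbind(1, x)                      # design matrix including the constant
Z <- sqrt(w) * y; U <- sqrt(w) * X    # transformed response and predictors
fit2 <- lm(Z ~ U - 1)                 # OLS without an intercept
cbind(coef(fit1), coef(fit2))         # the two sets of coefficients agree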
Remark 4.2. The FGLS estimator can also be found from the OLS
regression (without an intercept) of Z on U where now R is defined by
V(θ̂) = RR. Similarly the FWLS estimator can be found from the OLS
regression (without an intercept) of Zi = √ŵi Yi on ui = √ŵi xi. But now U
is a random matrix instead of a constant matrix. Hence these estimators are
highly nonlinear. OLS output can be used for exploratory purposes, but the
p-values are generally not correct. The Olive (2016a,b) nonparametric bootstrap
tests may be useful for FGLS and FWLS. The nonparametric bootstrap could
also be applied to the OLS estimator.
Under regularity conditions, the OLS estimator β̂_OLS is a consistent esti-
mator of β when the GLS model holds, but GLS should be used because it
generally has higher efficiency.
Definition 4.8. Let β̂_ZU be the OLS estimator from regressing Z on U.
The vector of fitted values is Ẑ = U β̂_ZU and the vector of residuals is
r_ZU = Z − Ẑ. Then β̂_ZU = β̂_GLS for GLS, β̂_ZU = β̂_FGLS for FGLS,
β̂_ZU = β̂_WLS for WLS, and β̂_ZU = β̂_FWLS for FWLS. For GLS, FGLS,
WLS, and FWLS, a residual plot is a plot of Ẑi versus r_{ZU,i} and a response
plot is a plot of Ẑi versus Zi.
Notice that the residual and response plots are based on the OLS output
from the OLS regression without intercept of Z on U. If the model is good,
then the plotted points in the response plot should follow the identity line
in an evenly populated band while the plotted points in the residual plot
should follow the line r_{ZU,i} = 0 in an evenly populated band (at least if the
distribution of ε is not highly skewed).
Plots based on Ŷ_GLS = X β̂_ZU and on r_{i,GLS} = Yi − Ŷ_{i,GLS} should be
similar to those based on β̂_OLS. Although the plot of Ŷ_{i,GLS} versus Yi should
be linear, the plotted points will not scatter about the identity line in an
evenly populated band. Hence this plot cannot be used to check whether
the GLS model with V is a good approximation to the data. Moreover, the
r_{i,GLS} and Ŷ_{i,GLS} may be correlated and usually do not scatter about the
r = 0 line in an evenly populated band. The plots in Definition 4.8 are both
a check on linearity and on whether the model using V (or V̂) gives a good
approximation of the data, provided that n > k(p + q + 1) where k ≥ 5 and
preferably k ≥ 10.
For GLS and WLS (and for exploratory purposes for FGLS and FWLS),
plots and model building and variable selection should be based on Z and U.
Form Z and U and then use OLS software for model selection and variable
selection. If the columns of X are v1, . . . , vp, then the columns of U are
U1, . . . , Up where Uj = R^{-1} vj corresponds to the jth predictor Xj. For
example, the analog of the OLS residual plot of the jth predictor versus the
residuals is the plot of the jth predictor Uj versus r_ZU. The notation is
confusing but the idea is simple: form Z and U, then use OLS software and
the OLS techniques from Chapters 2 and 3 to build the model.
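A minimal R sketch of the transformation in Proposition 4.2 and Definition 4.8 follows, with a simulated V (an assumption used only for illustration): form R = V^{1/2} from the spectral decomposition, regress Z = R^{-1}Y on U = R^{-1}X without an intercept, and make the response plot of Ẑ versus Z.

set.seed(1)
n <- 30; x <- runif(n); X <- cbind(1, x)
V <- 0.5^abs(outer(1:n, 1:n, "-"))           # an assumed known AR(1)-type V
e <- drop(t(chol(V)) %*% rnorm(n))           # errors with Cov(e) = V
y <- drop(X %*% c(1, 2)) + e
ev <- eigen(V, symmetric = TRUE)             # spectral decomposition of V
Rinv <- ev$vectors %*% diag(1/sqrt(ev$values)) %*% t(ev$vectors)  # R^{-1} = V^{-1/2}
Z <- drop(Rinv %*% y); U <- Rinv %*% X       # transformed model
fitZU <- lm(Z ~ U - 1)                       # betahat_ZU = betahat_GLS
bgls <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y)
cbind(coef(fitZU), bgls)                     # the two computations agree
plot(fitted(fitZU), Z); abline(0, 1)         # response plot of Zhat versus Z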
Fig. 4.1 Plots for Draper and Smith Data: OLS response plot (FIT vs. Y), OLS
residual plot (FIT vs. RESID), FWLS response plot (ZFIT vs. Z), and FWLS
residual plot (ZFIT vs. ZRESID)
Example 4.2. Draper and Smith (1981, pp. 112-114) present an FWLS
example with n = 35 and p = 2. Hence Y = β1 + β2 x + e. Let v̂i =
vi(θ̂) = 1.5329 − 0.7334 xi + 0.0883 xi². Thus θ̂ = (1.5329, −0.7334, 0.0883)^T.
Figure 4.1a and b shows the response and residual plots based on the OLS
regression of Y on x. The residual plot has the shape of the right opening
megaphone, suggesting that the variance is not constant. Figure 4.1c and d
shows the response and residual plots based on FWLS with weights ŵi = 1/v̂i.
See Problem 4.2 to reproduce these plots. Software meant for WLS needs the
weights. Hence FWLS can be computed using WLS software with the es-
timated weights, but the software may print WLS instead of FWLS, as in
Figure 4.1c and d.
Warning. A problem with the response and residual plots for GLS and
FGLS given in Definition 4.8 is that some of the transformed cases (Zi, ui^T)^T
can be outliers or high leverage points.
iii) Find the pval = P(F_{dfR−dfF, dfF} > F_R). (On exams often an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
Assume that the GLS model contains a constant β1. The GLS ANOVA F
test of Ho: β2 = ... = βp = 0 versus Ha: not Ho uses the reduced model that
contains the first column of U. The GLS ANOVA F test of Ho: βi = 0
versus Ha: βi ≠ 0 uses the reduced model with the ith column of U deleted.
For the special case of WLS, the software will often have a weights option
that will also give correct output for inference.
Example 4.3. Suppose that the data from Example 4.2 has valid weights,
so that WLS can be used instead of FWLS. The R commands below per-
form WLS.
> ls.print(lsfit(dsx,dsy,wt=dsw))
Residual Standard Error=1.137
R-Square=0.9209
F-statistic (df=1, 33)=384.4139, p-value=0
Estimate Std.Err t-value Pr(>|t|)
Intercept -0.8891 0.3004 -2.9602 0.0057
X 1.1648 0.0594 19.6065 0.0000
> ls.print(lsfit(u[,1],z,intercept=F))
Residual Standard Error=3.9838, R-Square=0.7689
F-statistic (df=1, 34)=113.1055, p-value=0
Estimate Std.Err t-value Pr(>|t|)
X 4.5024 0.4234 10.6351 0
> ((34*(3.9838)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 384.4006
The WLS t-test for this data has t = 19.6065 which corresponds to F =
t² = 384.4 since this test is equivalent to the WLS ANOVA F test when there
is only one predictor. The WLS t-test for the intercept has F = t² = 8.76.
This test statistic can be found from the no intercept OLS model by leaving
the first column of U out of the model and then performing the partial F test
as shown below.
> ls.print(lsfit(u[,2],z,intercept=F))
Residual Standard Error=1.2601
F-statistic (df=1, 34)=1436.300
Estimate Std.Err t-value Pr(>|t|)
X 1.0038 0.0265 37.8985 0
> ((34*(1.2601)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 8.760723
4.4 Complements
The theory for GLS and WLS is similar to the theory for the OLS MLR
model, but the theory for FGLS and FWLS is often lacking or huge sam-
ple sizes are needed. However, FGLS and FWLS are often used in practice
because usually V is not known and V̂ must be used instead. Kariya and
Kurata (2004) is a PhD level text covering FGLS. Cook and Zhang (2015)
suggest an envelope method for WLS.
Shi and Chen (2009) describe numerical diagnostics for GLS. Long and
Ervin (2000) discuss methods for obtaining standard errors when the constant
variance assumption is violated.
Following Sheather (2009, ch. 9, ch. 10), many linear models with seri-
ally correlated errors (e.g. AR(1) errors) and many linear mixed models
can be fit with FGLS. Both Sheather (2009) and Houseman et al. (2004)
use the Cholesky decomposition and make the residual plots based on the
Cholesky residuals Z − Ẑ where V(θ̂) = KK^T. We recommend plots based
on Z − Ẑ where V(θ̂) = RR. In other words, use the transformation corre-
sponding to Proposition 4.2 instead of the transformation corresponding to
Proposition 4.1.
4.5 Problems
R Problems
4.2. Download the wlsplot function and the Draper and Smith (1981)
data dsx, dsy, dsw.
a) Enter the R command wlsplot(x=dsx, y = dsy, w = dsw) to re-
produce Figure 4.1. Once you have the plot you can print it out directly, but
it will generally save paper by placing the plots in the Word editor.
b) Activate Word (often by double clicking on a Word icon). Click on the
screen and type Problem 4.2. In R, click on the plot and then press the
keys Ctrl and c simultaneously. This procedure makes a temporary copy of
the plot. In Word, move the pointer to Edit and hold down the leftmost mouse
button. This will cause a menu to appear. Drag the pointer down to Paste.
In the future, these menu commands will be denoted by Edit>Paste. The
plot should appear on the screen. To save your output on your flash drive
(J, say), use the Word menu commands File > Save as. In the Save in box
select Removable Disk (J:) and in the File name box enter HW4d2.doc. To
exit from Word, click on the X in the upper right corner of the screen. In
Word a screen will appear and ask whether you want to save changes made
in your document. Click on No. To exit from R, type q() or click on the
X in the upper right corner of the screen and then click on No.
4.3. Download the fwlssim function. This creates WLS data if type
is 1 or 3 and FWLS data if type is 2 or 4. Let the sufficient predictor
SP = 25 + 2x2 + ... + 2xp. Then Y = SP + |SP − 25k| e where the xij and
ei are iid N(0, 1). Thus Y|SP ∼ N(SP, (SP − 25k)² σ²). If type is 1 or 2,
then k = 1/5, but k = 1 if type is 3 or 4. The default has σ² = 1.
The function creates the OLS response and residual plots and the FWLS
(or WLS) response and residual plots.
a) Type the following command several times. The OLS and WLS plots
tend to look the same.
fwlssim(type=1)
b) Type the following command several times. Now the FWLS plots often
have outliers.
fwlssim(type=2)
c) Type the following command several times. The OLS residual plots have
a saddle shape, but the WLS plots tend to have highly skewed fitted values.
fwlssim(type=3)
d) Type the following command several times. The OLS residual plots
have a saddle shape, but the FWLS plots tend to have outliers and highly
skewed fitted values.
fwlssim(type=4)
Chapter 5
One Way Anova
5.1 Introduction
n1 units treatment 1, the next n2 units treatment 2, . . . , and the final np units
treatment p.
Balanced designs have the group sizes the same: ni ≡ m = n/p. Label the
units alphabetically so Carroll gets 1, . . . , Xumong gets 9. The R function
sample can be used to draw a random permutation. Then the first 3 numbers
in the permutation correspond to group 1, the next 3 to group 2, and the final
3 to group 3. Using the output shown below gives the following 3 groups.
> sample(9)
[1] 6 7 9 5 1 4 2 8 3
> rand(9,3)
$perm
[1] 6 7 9 5 1 4 2 8 3
$groups
[1] 2 3 3 2 2 1 1 3 1
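In base R, a similar assignment can be produced by splitting a random permutation into equal groups, as in this small sketch (which is not the rand function used above):

set.seed(1)
perm <- sample(9)                          # random permutation of the 9 units
groups <- split(perm, rep(1:3, each = 3))  # first 3 to treatment 1, and so on
groups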
Definition 5.5. Replication means that for each treatment, the ni re-
sponse variables Y_{i,1}, . . . , Y_{i,ni} are approximately iid random variables.
Example 5.2. a) If ten students work two types of paper mazes three
times each, then there are 60 measurements that are not replicates. Each
student should work the six mazes in random order since speed increases
with practice. For the ith student, let Zi1 be the average time to complete
the three mazes of type 1, let Zi2 be the average time for mazes of type 2,
and let Di = Zi1 Zi2 . Then D1 , . . . , D10 are replicates.
b) Cobb (1998, p. 126) states that a student wanted to know if the shapes
of sponge cells depend on the color (green or white). He measured hundreds
of cells from one white sponge and hundreds of cells from one green sponge.
There were only two units so n1 = 1 and n2 = 1. The student should have
used a sample of n1 green sponges and a sample of n2 white sponges to get
more replicates.
c) Replication depends on the goals of the study. Box et al. (2005, pp. 215-
219) describe an experiment where the investigator times how long it takes
him to bike up a hill. Since the investigator is only interested in his perfor-
mance, each run up a hill is a replicate (the time for the ith run is a sample
from all possible runs up the hill by the investigator). If the interest had been
on the effect of eight treatment levels on student bicyclists, then replication
would need n = n1 + ... + n8 student volunteers where ni ride their bike up
the hill under the conditions of treatment i.
The one way Anova model is used to compare p treatments. Usually there
is replication and Ho: μ1 = μ2 = ... = μp is a hypothesis of interest. In-
vestigators may also want to rank the population means from smallest to
largest.
Definition 5.6. Let f_Z(z) be the pdf of Z. Then the family of pdfs
f_Y(y) = f_Z(y − μ) indexed by the location parameter μ, −∞ < μ < ∞,
is the location family for the random variable Y = μ + Z with standard
pdf f_Z(z).
Definition 5.7. A one way fixed effects Anova model has a single qualita-
tive predictor variable W with p categories a1, . . . , ap. There are p different
distributions for Y, one for each category ai. The distribution of
Y | (W = ai) ∼ f_Z(y − μi)
where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter μi and the
same variance σ².
Definition 5.8. The one way fixed effects normal Anova model is the spe-
cial case where
Y | (W = ai) ∼ N(μi, σ²).
Example 5.3. The pooled 2 sample t test is a special case of a one way
Anova model with p = 2. For example, one population could be ACT scores
for men and the second population ACT scores for women. Then W = gender
and Y = score.
Definition 5.9. The cell means model is the parameterization of the one
way fixed effects Anova model such that
Yij = μi + eij
where Yij is the value of the response variable for the jth trial of the ith
factor level. The μi are the unknown means and E(Yij) = μi. The eij are
iid from the location family with pdf f_Z(z) and unknown variance σ² =
VAR(Yij) = VAR(eij). For the normal cell means model, the eij are iid
N(0, σ²) for i = 1, . . . , p and j = 1, . . . , ni.
The cell means model is a linear model (without intercept) of the form
Y = X_c β_c + e, (5.1)
where Y = (Y_{11}, . . . , Y_{1,n1}, Y_{21}, . . . , Y_{2,n2}, . . . , Y_{p,1}, . . . , Y_{p,np})^T,
β_c = (μ1, . . . , μp)^T, e = (e_{11}, . . . , e_{p,np})^T, and X_c is the n × p matrix
whose ith column is the indicator of the ith treatment group: the first n1 rows
of X_c are (1, 0, . . . , 0), the next n2 rows are (0, 1, 0, . . . , 0), and the last np
rows are (0, 0, . . . , 1).
Notation. Let Y_{i0} = Σ_{j=1}^{ni} Yij and let
μ̂i = Ȳ_{i0} = Y_{i0}/ni = (1/ni) Σ_{j=1}^{ni} Yij. (5.2)
Thus E(Y) = X_c β_c = (μ1, . . . , μ1, μ2, . . . , μ2, . . . , μp, . . . , μp)^T, and
β̂_c = (X_c^T X_c)^{-1} X_c^T Y = (Ȳ_{10}, . . . , Ȳ_{p0})^T = (μ̂1, . . . , μ̂p)^T.
Since the cell means model is a linear model, there is an associated response
plot and residual plot. However, many of the interpretations of the OLS
quantities for Anova models differ from the interpretations for MLR models.
First, for MLR models, the conditional distribution Y|x makes sense even if
x is not one of the observed xi provided that x is not far from the xi. This
fact makes MLR very powerful. For MLR, at least one of the variables in x
is a continuous predictor. For the one way fixed effects Anova model, the p
distributions Y|xi make sense where xi^T is a row of X_c.
Also, the OLS MLR ANOVA F test for the cell means model tests H0:
β_c = 0, that is, H0: μ1 = ... = μp = 0, while the one way fixed effects ANOVA F
test given after Definition 5.13 tests H0: μ1 = ... = μp.
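A small R sketch (with simulated data, illustrative only) shows the cell means model fit as a linear model without an intercept, so that the coefficients are the group means μ̂i, and contrasts this with the usual one way ANOVA F test of equal means.

set.seed(1)
x <- factor(rep(1:3, each = 10))
y <- rnorm(30, mean = c(5, 6, 8)[x])
cellfit <- lm(y ~ x - 1)     # cell means model: coefficients are muhat_i
coef(cellfit)
tapply(y, x, mean)           # the group sample means agree with the coefficients
anova(lm(y ~ x))             # one way ANOVA F test of mu_1 = mu_2 = mu_3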
Definition 5.10. Consider the one way fixed effects Anova model. The
response plot is a plot of Ŷij ≡ μ̂i versus Yij and the residual plot is a plot of
Ŷij ≡ μ̂i versus rij.
The points in the response plot scatter about the identity line and the
points in the residual plot scatter about the r = 0 line, but the scatter need
not be in an evenly populated band. A dot plot of Z1, . . . , Zm consists of an
axis and m points each corresponding to the value of Zi. The response plot
consists of p dot plots, one for each value of μ̂i. The dot plot corresponding
to μ̂i is the dot plot of Y_{i1}, . . . , Y_{i,ni}. The p dot plots should have roughly
the same shape and spread.
Rule of thumb 5.2. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
The assumption of the Yij coming from the same location family with
different location parameters μi and the same constant variance σ² is a big
assumption and often does not hold. Another way to check this assumption is
to make a box plot of the Yij for each i. The box in the box plot corresponds
to the lower, middle, and upper quartiles of the Yij. The middle quartile
is just the sample median mi of the data: at least half of the Yij ≤ mi
and at least half of the Yij ≥ mi. The p boxes should be roughly the same
length and the median should occur in roughly the same position (e.g., in
the center) of each box. The whiskers in each plot should also be roughly
similar. Histograms for each of the p samples could also be made. All of the
histograms should look similar in shape.
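In R, the box plot check can be done with a single command. The sketch below reuses the simulated y and x from the earlier sketch (or any one way Anova data) and overlays jittered dot plots.

boxplot(y ~ x, xlab = "treatment", ylab = "Y")   # one box per treatment group
stripchart(y ~ x, vertical = TRUE, method = "jitter", add = TRUE, pch = 1)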
Example 5.4. Kuehl (1994, p. 128) gives data for counts of hermit crabs
on 25 different transects in each of six different coastline habitats. Let Z be
the count. Then the response variable Y = log10(Z + 1/6). Although the
counts Z varied greatly, each habitat had several counts of 0 and often there
were several counts of 1, 2, or 3. Hence Y is not a continuous variable. The cell
means model was fit with ni = 25 for i = 1, . . . , 6. Each of the six habitats
was a level. Figure 5.1a and b shows the response plot and residual plot.
There are 6 dot plots in each plot. Because several of the smallest values in
each plot are identical, it does not always look like the identity line is passing
through the six sample means Ȳ_{i0} for i = 1, . . . , 6. In particular, examine the
dot plot for the smallest mean (look at the 25 dots furthest to the left that
fall on the vertical line FIT ≈ 0.36). Random noise (jitter) has been added to
the response and residuals in Figure 5.1c and d. Now it is easier to compare
the six dot plots. They seem to have roughly the same spread.
Fig. 5.1 Plots for Example 5.4: a) response plot (FIT vs. Y), b) residual plot
(FIT vs. RESID), c) jittered response plot (FIT vs. JY), d) jittered residual plot
(FIT vs. JR)
The plots contain a great deal of information. The response plot can be
used to explain the model, check that the sample from each population (treat-
ment) has roughly the same shape and spread, and to see which populations
have similar means. Since the response plot closely resembles the residual plot
in Figure 5.1, there may not be much difference in the six populations. Lin-
earity seems reasonable since the samples scatter about the identity line. The
residual plot makes the comparison of similar shape and spread easier.
The total sum of squares is
SSTO = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ_{00})²,
the treatment sum of squares is
SSTR = Σ_{i=1}^p ni (Ȳ_{i0} − Ȳ_{00})²,
and the error sum of squares is
SSE = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ_{i0})².
Then
σ̂² = MSE = (1/(n − p)) Σ_{i=1}^p Σ_{j=1}^{ni} rij² = (1/(n − p)) Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ_{i0})²
= (1/(n − p)) Σ_{i=1}^p (ni − 1) Si² = S²_pool
where S²_pool is known as the pooled variance estimator.
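The sums of squares and the F statistic can be computed directly from these formulas. The sketch below (reusing the simulated y and x from the earlier sketch, or any one way Anova data) reproduces the output of anova(lm(y ~ x)).

ni <- tapply(y, x, length); ybari <- tapply(y, x, mean); ybar <- mean(y)
p <- nlevels(x); n <- length(y)
SSTR <- sum(ni * (ybari - ybar)^2)        # treatment sum of squares
SSE  <- sum((y - ybari[x])^2)             # error sum of squares
MSTR <- SSTR / (p - 1); MSE <- SSE / (n - p)
Fo <- MSTR / MSE
c(Fo = Fo, pval = 1 - pf(Fo, p - 1, n - p))   # matches anova(lm(y ~ x))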
The ANOVA F test tests whether the p means are equal. If Ho is not
rejected and the means are equal, then it is possible that the factor is unim-
portant, but it is also possible that the factor is important but the
level is not. For example, the factor might be type of catalyst. The yield
may be equally good for each type of catalyst, but there would be no yield if
no catalyst was used.
The ANOVA table is the same as that for MLR, except that SSTR re-
places the regression sum of squares. The MSE is again an estimator of σ².
The ANOVA F test tests whether all p means μi are equal. Shown below is
an ANOVA table given in symbols. Sometimes "Treatment" is replaced by
"Between treatments," "Between Groups," "Model," "Factor," or "Groups."
Sometimes "Error" is replaced by "Residual," or "Within Groups." Some-
times "p-value" is replaced by "P," "Pr(> F)," or "PR > F." The p-value
is nearly always an estimated p-value, denoted by pval.
Source    df   SS    MS    F                 p-value
Treatment p-1  SSTR  MSTR  Fo=MSTR/MSE       for Ho:
Error     n-p  SSE   MSE                     mu_1 = ... = mu_p
The 4 step fixed effects one way ANOVA F test has steps
i) Ho: μ1 = μ2 = ... = μp and Ha: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(F_{p−1, n−p} > Fo) is usually given by output.
iv) State whether you reject Ho or fail to reject Ho. If the pval ≤ δ, reject Ho
and conclude that the mean response depends on the factor level. (Hence not
all of the treatment means are equal.) Otherwise fail to reject Ho and conclude
that the mean response does not depend on the factor level. (Hence all of the
treatment means are equal, or there is not enough evidence to conclude that
the mean response depends on the factor level.) Give a nontechnical sentence.
Rule of thumb 5.3. If max(S1, . . . , Sp) ≤ 2 min(S1, . . . , Sp),
then the one way ANOVA F test results will be approximately correct if the
response and residual plots suggest that the remaining one way Anova model
assumptions are reasonable. See Moore (2007, p. 634). If all of the ni ≥ 5,
replace the standard deviations by the ranges of the dot plots when exam-
ining the response and residual plots. The range Ri = max(Y_{i,1}, . . . , Y_{i,ni}) −
min(Y_{i,1}, . . . , Y_{i,ni}) = length of the ith dot plot for i = 1, . . . , p.
The assumption that the zero mean iid errors have constant variance
V(eij) ≡ σ² is much stronger for the one way Anova model than for the mul-
tiple linear regression model. The assumption implies that the p population
distributions have pdfs from the same location family with different means
μ1, . . . , μp but the same variances σ1² = ... = σp² ≡ σ². The one way ANOVA
F test has some resistance to the constant variance assumption, but confi-
dence intervals have much less resistance to the constant variance assumption.
Consider confidence intervals for μi such as Ȳ_{i0} ± t_{ni−1, 1−δ/2} √(MSE/ni).
MSE is a weighted average of the Si². Hence MSE overestimates small σi²
and underestimates large σi² when the σi² are not equal. Hence using √MSE
instead of Si will make the CI too long or too short, and Rule of thumb 5.3
does not apply to confidence intervals based on MSE.
Remark 5.2. When the assumption that the p groups come from the
same location family with finite variance σ² is violated, the one way ANOVA
F test may not make much sense because unequal means may not imply the
superiority of one category over another. Suppose Y is the time in minutes
until relief from a headache and that Y_{1j} ∼ N(60, 1) while Y_{2j} ∼ N(65, σ²).
If σ² = 1, then the type 1 medicine gives headache relief 5 minutes faster, on
average, and is superior, all other things being equal. But if σ² = 100, then
many patients taking medicine 2 experience much faster pain relief than those
taking medicine 1, and many experience much longer time until pain relief.
In this situation, predictor variables that would identify which medicine is
faster for a given patient would be very useful.
Example 5.5. The output below represents grams of fat (minus 100
grams) absorbed by doughnuts using 4 types of fat. See Snedecor and Cochran
(1967, p. 259). Let μi denote the mean amount of fat i absorbed by doughnuts,
i = 1, 2, 3 and 4. a) Find μ̂1. b) Perform a 4 step ANOVA F test.
Solution: a) μ̂1 = Ȳ_{10} = Y_{10}/n1 = Σ_{j=1}^{n1} Y_{1j}/n1 =
(64 + 72 + 68 + 77 + 56 + 95)/6 = 432/6 = 72.
b) i) H0: μ1 = μ2 = μ3 = μ4 and Ha: not H0
ii) F = 5.41
iii) pval = 0.0069
iv) Reject H0; the mean amount of fat absorbed by doughnuts depends on
the type of fat.
Notice that the strain of clover 3dok1 appears to have the highest mean
nitrogen content. There are 4 pairs of means that are not significantly differ-
ent. The letter B suggests 3dok5 and 3dok7, the letter C suggests 3dok7 and
compos, the letter D suggests compos and 3dok4, while the letter E suggests
3dok4 and 3dok13 are not significantly different.
B 23.980 5 3dok5
B
C B 19.920 5 3dok7
C
C D 18.700 5 compos
D
E D 14.640 5 3dok4
E
E 13.260 5 3dok13
Remark 5.3. Two graphical methods can also be used. Recall from Chap-
ter 1 that a response plot is an estimated sufficient summary plot. If n is not
too small, each ni ≥ 5, and the sample mean (where the dot plot crosses
the identity line) for one dot plot is below or above another dot plot, then
conclude that the population mean corresponding to the higher dot plot is
greater than the population mean corresponding to the lower dot plot. As the
ni increase, the sample mean of one dot plot only needs to be above or be-
low most of the cases in the other dot plot. The p population means may or
may not be equal if all p of the dot plots have lots of overlap. This will hap-
pen, for example, if the response plot looks like the residual plot. Hence this
graphical method is inconclusive for Figure 5.1a. Remark 5.2 gives another
situation where this graphical method can fail. An advantage of this graphi-
cal method is that the p populations do not need to come from populations
with the same variance or from the same location scale family as long as OLS
gives a consistent estimator of (μ1, . . . , μp)^T. The second graphical method
is given in Definition 5.15.
Example 5.6, continued: Figure 5.2 shows the response and residual
plots for the clover data. The plots suggest the constant variance assumption
is not reasonable. The population means may or may not differ for the groups
with the two smallest sample means, but these two groups appear to have
smaller population means than the other groups. Similarly, the population
means may or may not differ for the two groups with sample means near
Fig. 5.2 Response and Residual Plots for Clover Data
20, but these two groups appear to have population means that are smaller
than the two groups with the largest sample means. The population means
of these last two groups may or may not differ. Figure 5.2 was made with the
following commands, using the lregpack function aovplots.
x<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,
5,6,6,6,6,6)
y<-c(19.4,32.6,27.0,32.1,33.0,17.7,24.8,27.9,25.2,
24.3,17.0,19.4,9.1,11.9,15.8,20.7,21.0,20.5,18.8,
18.6,14.3,14.4,11.8,11.6,14.2,17.3,19.4,19.1,16.9,
20.8)
x <- factor(x)
z <- aov(y~x)
aovplots(Y=y,FIT=fitted(z),RES=resid(z))
#right click stop twice
Definition 5.15. Graphical Anova for the one way model uses the
residuals as a reference set instead of a t, F, or normal distribution. The
scaled treatment deviations or scaled effects c(Ȳ_{i0} − Ȳ_{00}) = c(μ̂i − Ȳ_{00})
are scaled to have the same variability as the residuals. A dot plot of the
scaled deviations is placed above the dot plot of the residuals. Assume that
each ni ≥ 5. Scaled deviations that lie outside the range of the residuals
correspond to significant effects.
For n ≥ 100, let r_{(1)} ≤ r_{(2)} ≤ ... ≤ r_{(n)} be the order statistics of the
residuals. Then instead of the range, use r_{(⌈0.975n⌉)} − r_{(⌈0.025n⌉)} as the dis-
tance, where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects outside
of the interval (r_{(⌈0.025n⌉)}, r_{(⌈0.975n⌉)}) are significant. See Box et al. (2005,
pp. 136, 166). A derivation of the scaling constant c = (n − p)/(p − 1) is
given in Section 5.6.
ganova(x,y)
smn 0.0296 0.0661 -0.0508 -0.0449
Treatments "A" "B" "C" "D"
Example 5.7. Cobb (1998) describes a one way Anova design used to
study the amount of calcium in the blood. For many animals, the body's
ability to use calcium depends on the level of certain hormones in the blood.
The response was 1/(level of plasma calcium). The four groups were A: Fe-
male controls, B: Male controls, C: Females given hormone, and D: Males
given hormone. There were 10 birds of each gender, and five from each gen-
der were given the hormone. The output above uses the lregpack function
ganova to produce Figure 5.3.
In Figure 5.3, the top dot plot has the scaled treatment deviations. From
left to right, these correspond to C, D, A, and B since the output shows that
the deviation corresponding to C is the smallest with value −0.0508. Since the
deviations corresponding to C and D are much closer than the range of the
residuals, the C and D effects yielded similar mean response values. A and
B appear to be significantly different from C and D. The distance between
the scaled A and B treatment deviations is about the same as the distance
between the smallest and largest residuals, so there is only marginal evidence
that the A and B effects are significantly different.
Since all 4 scaled deviations lie outside of the range of the residuals, all
effects A, B, C, and D appear to be significant.
Definition 5.16. For the random effects one way Anova, the levels of
the factor are a random sample of levels from some population of levels F.
The cell means model for the random effects one way Anova is Yij = μi + eij
for i = 1, . . . , p and j = 1, . . . , ni. The μi are randomly selected from some
population with mean μ and variance σ_μ², where selecting the level i from F
is equivalent to selecting μi from this population. The eij and μi are
independent, and the eij are iid from a location family with pdf f, mean 0,
and variance σ². The Yij | μi ∼ f(y − μi), the location family with location
parameter μi and variance σ². Unconditionally, E(Yij) = μ and V(Yij) =
σ_μ² + σ².
For the random effects model, the μi are independent random variables
with E(μi) = μ and V(μi) = σ_μ². The cell means model for fixed effects one
way Anova is very similar to that for the random effects model, but the μi
are fixed constants rather than random variables.
Definition 5.17. For the normal random effects one way Anova model,
μi ∼ N(μ, σ_μ²). Thus the μi are independent N(μ, σ_μ²) random variables. The
eij are iid N(0, σ²), and the eij and μi are independent. For this model,
Yij | μi ∼ N(μi, σ²) for i = 1, . . . , p. Note that the conditional variance σ² is
the same for each μi. Unconditionally, Yij ∼ N(μ, σ_μ² + σ²).
The ANOVA tables for the fixed and random effects one way Anova models
are exactly the same, and the two F tests are very similar. The main difference
is that the conclusions for the random effects model can be generalized to
the entire population of levels. For the fixed effects model, the conclusions
only hold for the p fixed levels. If Ho: σ_μ² = 0 is true and the random effects
model holds, then the Yij are iid with pdf f(y − μ). So the F statistic for
the random effects test has an approximate F_{p−1, n−p} distribution if the ni
are large by the results for the fixed effects one way ANOVA test. For both
tests, the pval is an estimate of the population p-value.
Source df SS MS F P
brand 5 854.53 170.906 238.71 0.0000
error 42 30.07 0.716
Example 5.8. Data is from Kutner et al. (2005, problem 25.7). A re-
searcher is interested in the amount of sodium in beer. She selects 6 brands
of beer at random from 127 brands and the response is the average sodium
content measured from 8 cans of each brand.
a) State whether this is a random or fixed effects one way Anova. Explain
briefly.
b) Using the output above, perform the appropriate 4 step ANOVA F
test.
Solution: a) Random effects since the beer brands were selected at random
from a population of brands.
b) i) H0: σ_μ² = 0 and Ha: σ_μ² > 0
ii) F0 = 238.71
iii) pval = 0.0
iv) Reject H0, so σ_μ² > 0 and the mean amount of sodium depends on the
beer brand.
Remark 5.4. The response and residual plots for the random effects mod-
els are interpreted in the same way as for the fixed effects model, except that
the dot plots are from a random sample of p levels instead of from p fixed
levels.
Definition 5.18. Assume that all of the values of the response Zi are
positive. A power transformation has the form Y = t_λ(Z) = Z^λ for λ ≠ 0
and Y = t_0(Z) = log(Z) for λ = 0 where λ ∈ Λ_L = {−1, −1/2, 0, 1/2, 1}.
In the following example, the plots show t_λ(Z) on the vertical axis. The
label TZHAT on the horizontal axis denotes the fitted values that result from
using t_λ(Z) as the response in the software.
(Figure: transformation plots of 1/Z, 1/sqrt(Z), LOG(Z), sqrt(Z), and Z versus
the fitted values TZHAT.)
5.5 Summary
1) The fixed effects one way Anova model has one qualitative explanatory
variable called a factor and a quantitative response variable Yij. The factor
variable has p levels, E(Yij) = μi and V(Yij) = σ² for i = 1, . . . , p and
j = 1, . . . , ni. Experimental units are randomly assigned to the treatment
levels.
2) Let n = n1 + ... + np. In an experiment, the investigators use random-
ization to randomly assign n units to treatments. Draw a random permutation
of {1, . . . , n}. Assign the first n1 units to treatment 1, the next n2 units to
treatment 2, . . . , and the final np units to treatment p. Use ni ≡ m = n/p if
possible. Randomization washes out the effect of lurking variables.
3) The 4 step fixed effects one way ANOVA F test has steps
i) Ho: μ1 = μ2 = ... = μp and Ha: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(F_{p−1, n−p} > Fo) is usually given by output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Hence all of the treatment
means are equal, or there is not enough evidence to conclude that the mean
response depends on the factor level.) Give a nontechnical sentence.
Source    df   SS    MS    F                 p-value
Treatment p-1  SSTR  MSTR  Fo=MSTR/MSE       for Ho:
Error     n-p  SSE   MSE                     mu_1 = ... = mu_p
Y | (W = ai) ∼ f_Z(y − μi)
where the location family has second moments. Hence all p distributions
come from the same location family with different location parameter μi and
the same variance σ². The one way fixed effects normal Anova model is the
special case where Y | (W = ai) ∼ N(μi, σ²).
7) The response plot is a plot of Ŷ versus Y. For the one way Anova model,
the response plot is a plot of Ŷij = μ̂i versus Yij. Often the identity line with
unit slope and zero intercept is added as a visual aid. Vertical deviations
from the identity line are the residuals rij = Yij − Ŷij = Yij − μ̂i. The plot
will consist of p dot plots that scatter about the identity line with similar
shape and spread if the fixed effects one way Anova model is appropriate.
The ith dot plot is a dot plot of Y_{i,1}, . . . , Y_{i,ni}. Assume that each ni ≥ 10. If
the response plot looks like the residual plot, then a horizontal line fits the p
dot plots about as well as the identity line, and there is not much difference
in the μi. If the identity line is clearly superior to any horizontal line, then
at least some of the means differ.
8) The residual plot is a plot of Ŷ versus residual r = Y − Ŷ. The plot will consist of p dot plots that scatter about the r = 0 line with similar shape and spread if the fixed effects one way Anova model is appropriate. The ith dot plot is a dot plot of ri,1, ..., ri,ni. Assume that each ni ≥ 10. Under the assumption that the Yij are from the same location family with different parameters μi, each of the p dot plots should have roughly the same shape and spread. This assumption is easier to judge with the residual plot than with the response plot.
9) Rule of thumb: If max(S1, ..., Sp) ≤ 2 min(S1, ..., Sp), then the one way ANOVA F test results will be approximately correct if the response and residual plots suggest that the remaining one way Anova model assumptions are reasonable. Replace the Si by the ranges Ri of the dot plots in the residual and response plots.
10) In an experiment, the investigators assign units to treatments. In an observational study, investigators simply observe the response, and the treatment groups need to be p random samples from p populations (the levels). The effects of lurking variables are present in observational studies.
11) If a qualitative variable has c levels, represent it with c − 1 or c indicator variables. Given a qualitative variable, know how to represent the data with indicator variables.
12) The cell means model for the fixed effects one way Anova is Yij = μi + eij where Yij is the value of the response variable for the jth trial of the ith factor level for i = 1, ..., p and j = 1, ..., ni. The μi are the unknown means and E(Yij) = μi. The eij are iid from the location family with pdf fZ(z), zero mean, and unknown variance σ² = V(Yij) = V(eij). For the normal cell means model, the eij are iid N(0, σ²). The estimator μ̂i = Ȳi0 = Σ_{j=1}^{ni} Yij/ni. The ith residual is rij = Yij − Ȳi0, and Ȳ00 is the sample mean of all of the Yij where n = Σ_{i=1}^p ni. The total sum of squares SSTO = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ00)², the treatment sum of squares SSTR = Σ_{i=1}^p ni(Ȳi0 − Ȳ00)², and the error sum of squares SSE = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳi0)². The MSE is an estimator of σ². The ANOVA table is the same as that for multiple linear regression, except that SSTR replaces the regression sum of squares and that SSTO, SSTR, and SSE have n − 1, p − 1, and n − p degrees of freedom.
13) Let Yi0 = Σ_{j=1}^{ni} Yij and let μ̂i = Ȳi0 = Yi0/ni = (1/ni) Σ_{j=1}^{ni} Yij. Hence the dot notation means sum over the subscript corresponding to the 0, e.g. j. Similarly, Y00 = Σ_{i=1}^p Σ_{j=1}^{ni} Yij is the sum of all of the Yij. Be able to find μ̂i from data. (A short R sketch at the end of this summary illustrates these computations.)
14) If the p treatment groups have the same pdf (so μi ≡ μ in the location family) with finite variance σ², and if the one way ANOVA F test statistic is computed from all n!/(n1! ··· np!) ways of assigning ni of the response variables to treatment i, then the histogram of the F test statistic is approximately Fp−1,n−p for large ni.
15) For the one way Anova, the fitted values Ŷij = Ȳi0 and the residuals rij = Yij − Ŷij.
16) Know that for the random effects one way Anova, the levels of the factor are a random sample of levels from some population of levels. Assume the μi are iid with mean μ and variance σ²μ. The cell means model for the random effects one way Anova is Yij = μi + eij for i = 1, ..., p and j = 1, ..., ni. The sample size n = n1 + ··· + np and often ni ≡ m so n = pm. The μi and eij are independent. The eij have mean 0 and variance σ². The Yij|μi ~ f(y − μi), a location family with variance σ², while eij ~ f(y). In the test below, if H0: σ²μ = 0 is true, then the Yij are iid with pdf f(y − μ), so the F statistic ≈ Fp−1,n−p if the ni are large.
17) Know that the 4 step random effects one way Anova test is
i) H0: σ²μ = 0  HA: σ²μ > 0
ii) F0 = MSTR/MSE is usually obtained from output.
iii) The pval = P(Fp−1,n−p > F0) is usually obtained from output.
iv) If pval ≤ δ, reject Ho, conclude that σ²μ > 0 and that the mean response depends on the factor level. Otherwise, fail to reject Ho, conclude that σ²μ = 0 and that the mean response does not depend on the factor level. (Or there is not enough evidence to conclude that the mean response depends on the factor level.)
18) Know how to tell whether the experiment is a fixed or random effects one way Anova. (Were the levels fixed or a random sample from a population of levels?)
19) Y = tλo(Z) = E(Y) + e = x^T β + e where the subscripts (e.g., Yij) have been suppressed. If λo was known, then Y = tλo(Z) would follow the DOE model. Assume that all of the values of the response Z are positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0 and Y = t0(Z) = log(Z) for λ = 0 where λ ∈ ΛL = {−1, −1/2, 0, 1/2, 1}.
20) A graphical method for response transformations computes the fitted values Ŵ from the DOE model using W = tλ(Z) as the response for each of the five values of λ ∈ ΛL. Let T̂ = Ŵ = TZHAT and plot TZHAT vs. tλ(Z) for λ ∈ {−1, −1/2, 0, 1/2, 1}. These plots are called transformation plots. The residual or error degrees of freedom used to compute the MSE should not be too small. Choose the transformation Y = tλ(Z) that has the best plot. Consider the one way Anova model with ni ≥ 5 for i = 1, ..., p. i) The dot plots should spread about the identity line with similar shape and spread. ii) Dot plots that are approximately symmetric are better than skewed dot plots. iii) Spread that increases or decreases with TZHAT (the shape of the plotted points is similar to a right or left opening megaphone) is bad.
21) The transformation plot for the selected transformation is also the
response plot for that model (e.g., for the model that uses Y = log(Z) as
the response). Make all of the usual checks on the DOE model (residual and
response plots) after selecting the response transformation.
22) The log rule says try Y = log(Z) if max(Z)/min(Z) > 10 where Z > 0 and the subscripts have been suppressed (so Z ≡ Zij for the one way Anova model).
23) A contrast C = Σ_{i=1}^p ki μi where Σ_{i=1}^p ki = 0. The estimated contrast is Ĉ = Σ_{i=1}^p ki Ȳi0.
24) Consider a family of null hypotheses for contrasts {Ho: Σ_{i=1}^p ki μi = 0 where Σ_{i=1}^p ki = 0 and the ki may satisfy other constraints}. Let δS denote the probability of a type I error for a single test from the family. The family level δF is an upper bound on the (usually unknown) size δT. Know how to interpret δF ≥ δT = P(of making at least one type I error among the family of contrasts) where a type I error is a false rejection.
25) Two important families of contrasts are the family of all possible contrasts and the family of pairwise differences Cij = μi − μj where i ≠ j. The Scheffé multiple comparisons procedure has a δF for the family of all possible contrasts, while the Tukey multiple comparisons procedure has a δF for the family of all p(p − 1)/2 pairwise contrasts.
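A short R sketch illustrating the computations in points 12), 13), and 15), using made-up data with hypothetical names y and group:

y <- c(10, 12, 11, 20, 22, 19, 30, 29, 31)      # made-up responses
group <- factor(rep(1:3, each = 3))             # p = 3 levels, each ni = 3
muhat <- tapply(y, group, mean)                 # muhat_i = Ybar_i0
fit <- muhat[group]                             # fitted values Yhat_ij
res <- y - fit                                  # residuals r_ij
SSTO <- sum((y - mean(y))^2)
SSTR <- sum(tapply(y, group, length) * (muhat - mean(y))^2)
SSE <- sum(res^2)
F0 <- (SSTR/(nlevels(group) - 1)) / (SSE/(length(y) - nlevels(group)))   # MSTR/MSE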
5.6 Complements
Often the data does not consist of samples from p populations, but consists
of a group of n = mp units where m units are randomly assigned to each of
the p treatments. Then the Anova models can still be used to compare treat-
ments, but statistical inference to a larger population cannot be made. Of
course a nonstatistical generalization to larger populations can be made. The
nonstatistical generalization from the group of units to a larger population
is most compelling if several experiments are done with similar results. For
example, generalizing the results of an experiment for psychology students
to the population of all of the university students is less compelling than the
following generalization. Suppose one experiment is done for psychology stu-
dents, one for engineers, and one for English majors. If all three experiments
give similar results, then generalize the results to the population of all of the
university's students.
Four good texts on the design and analysis of experiments are Box et al.
(2005), Cobb (1998), Kuehl (1994), and Ledolter and Swersey (2007). Also
see Dean and Voss (2000), Kirk (2012), Maxwell and Delaney (2003), Mont-
gomery (2012), and Oehlert (2000).
A randomization test has H0: the different treatments have no effect. This null hypothesis is also true if all p pdfs Y|(W = ai) ~ fZ(y − μ) are the same. An impractical randomization test uses all M = n!/(n1! ··· np!) ways of assigning ni of the Yij to treatment i for i = 1, ..., p. Let F0 be the usual F statistic. The F statistic is computed for each of the M permutations and H0 is rejected if the proportion of the M F statistics that are larger than F0 is less than δ. The distribution of the M F statistics is approximately Fp−1,n−p for large n when H0 is true. The power of the randomization test is also similar to that of the usual F test. See Hoeffding (1952). These results suggest that the usual F test is semiparametric: the p-value is approximately correct if n is large and if all p pdfs Y|(W = ai) ~ fZ(y − μ) are the same.
Let [x] be the integer part of x, e.g. [7.7] = 7. Olive (2014, section 9.3)
shows that practical randomization tests that use a random sample of
max(1000, [n log(n)]) permutations have level and power similar to the tests
that use all M possible permutations. See Ernst (2009) and the lregpack func-
tion rand1way for R code.
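A rough R sketch of such a practical randomization test (not the lregpack rand1way function), assuming a response y and factor group:

randF <- function(y, group, B = max(1000, floor(length(y) * log(length(y))))){
  Fstat <- function(yy) summary(aov(yy ~ group))[[1]][1, "F value"]  # treatment F
  F0 <- Fstat(y)
  Fperm <- replicate(B, Fstat(sample(y)))   # randomly reassign responses to treatments
  mean(Fperm >= F0)                         # randomization p-value
}
# randF(y = mydata$y, group = factor(mydata$group))   # hypothetical call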
All of the parameterizations of the one way fixed effects Anova model yield the same predicted values, residuals, and ANOVA F test, but the interpretations of the parameters differ. The cell means model is a linear model (without intercept) of the form Y = Xc βc + e that can be fit using OLS. The OLS MLR output gives the correct fitted values and residuals but an incorrect ANOVA table. An equivalent linear model (with intercept) with correct OLS MLR ANOVA table as well as residuals and fitted values can be formed by replacing any column of the cell means model by a column of ones 1. Removing the last column of the cell means model and making the first column 1 gives the model Y = β0 + β1 x1 + ··· + βp−1 xp−1 + e given in matrix form by (5.5) below.
It can be shown that the OLS estimators corresponding to (5.5) are β̂0 = Ȳp0 = μ̂p, and β̂i = Ȳi0 − Ȳp0 = μ̂i − μ̂p for i = 1, ..., p − 1. The cell means model has β̂i = μ̂i = Ȳi0.
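A minimal R sketch with made-up data illustrating that the cell means (no intercept) and intercept parameterizations give the same fitted values and residuals. Note that R's default treatment coding drops the first level rather than the last column as in (5.5), but the fitted values, residuals, and ANOVA F test agree.

y <- c(4, 5, 6, 10, 11, 9, 15, 16, 14)     # made-up responses
x <- factor(rep(1:3, each = 3))
cellmeans <- lm(y ~ x - 1)                 # coefficients are the group means Ybar_i0
withint <- lm(y ~ x)                       # intercept parameterization
all.equal(fitted(cellmeans), fitted(withint))   # TRUE: same fitted values
coef(cellmeans)                            # muhat_1, muhat_2, muhat_3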
Wilcox (2012) gives an excellent discussion of the problems that outliers and skewness can cause for the one and two sample t-intervals, the t-test, tests for comparing 2 groups, and the ANOVA F test. Wilcox (2012) replaces ordinary population means by truncated population means and uses trimmed means to create analogs of one way Anova and multiple comparisons.
\[
\begin{pmatrix}
Y_{11}\\ \vdots\\ Y_{1,n_1}\\ Y_{21}\\ \vdots\\ Y_{2,n_2}\\ \vdots\\ Y_{p,1}\\ \vdots\\ Y_{p,n_p}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
1 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix}
\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}
\end{pmatrix}
+
\begin{pmatrix}
e_{11}\\ \vdots\\ e_{1,n_1}\\ e_{21}\\ \vdots\\ e_{2,n_2}\\ \vdots\\ e_{p,1}\\ \vdots\\ e_{p,n_p}
\end{pmatrix}
\quad (5.5)
\]
Graphical Anova uses scaled treatment effects = scaled treatment deviations d̃i = c di = c(Ȳi0 − Ȳ00) for i = 1, ..., p. Following Box et al. (2005, p. 166), suppose ni ≡ m = n/p for i = 1, ..., p. If Ho: μ1 = ··· = μp is true, we want the sample variance of the scaled deviations to be approximately equal to the sample variance of the residuals. So we want

\[
\frac{\frac{1}{p}\sum_{i=1}^p c^2 d_i^2}{\frac{1}{n}\sum_{i=1}^n r_i^2}
\approx F_0 = \frac{MSTR}{MSE} = \frac{SSTR/(p-1)}{SSE/(n-p)}
= \frac{\sum_{i=1}^p m\, d_i^2/(p-1)}{\sum_{i=1}^n r_i^2/(n-p)}
\]

since SSTR = Σ_{i=1}^p m(Ȳi0 − Ȳ00)² = Σ_{i=1}^p m d_i². So

\[
\frac{n\sum_{i=1}^p c^2 d_i^2}{p\sum_{i=1}^n r_i^2}
\approx F_0 = \frac{m(n-p)\sum_{i=1}^p d_i^2}{(p-1)\sum_{i=1}^n r_i^2}.
\]

Hence

\[
c^2 = \frac{mp}{n}\,\frac{(n-p)}{(p-1)} = \frac{(n-p)}{(p-1)}
\]

since mp/n = 1. Thus c = √((n − p)/(p − 1)).
For Graphical Anova, see Box et al. (2005, pp. 136, 150, 164, 166) and Hoaglin et al. (1991). The R package granova, available from (http://streaming.stat.iastate.edu/CRAN/), and authored by R.M. Pruzek and J.E. Helmreich, may be useful.
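A rough R sketch of the scaled treatment deviations (not the lregpack ganova function), using made-up balanced data with hypothetical names y and group:

y <- c(60, 63, 59, 62, 71, 68, 70, 69, 66, 64, 65, 67)   # made-up responses, m = 4
group <- factor(rep(c("A", "B", "C"), each = 4))
p <- nlevels(group); n <- length(y)
d <- tapply(y, group, mean) - mean(y)          # treatment deviations d_i
cc <- sqrt((n - p)/(p - 1))                    # scaling constant c
scaled <- cc * d                               # scaled treatment deviations
res <- y - tapply(y, group, mean)[group]       # residuals
stripchart(list(residuals = res, scaled.effects = scaled), method = "stack")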
The modified power transformation family

Yi = tλ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ

for λ ≠ 0 and t0(Zi) = log(Zi) for λ = 0 where λ ∈ ΛL.
Box and Cox (1964) give a numerical method for selecting the response transformation for the modified power transformations. Although the method gives a point estimator λ̂o, often an interval of reasonable values is generated (either graphically or using a profile likelihood to make a confidence interval), and λ ∈ ΛL is used if it is also in the interval.
There are several reasons to use a coarse grid ΛL of powers. First, several of the powers correspond to simple transformations such as the log, square root, and reciprocal. These powers are easier to interpret than λ = 0.28, for example. Secondly, if the estimator λ̂n can only take values in ΛL, then sometimes λ̂n will converge in probability to λo ∈ ΛL. Thirdly, Tukey (1957) showed that neighboring modified power transformations are often very similar, so restricting the possible powers to a coarse grid is reasonable.
The graphical method for response transformations is due to Olive (2004b). A variant of the method would plot the residual plot or both the response and the residual plot for each of the five values of λ. Residual plots are also useful, but they do not distinguish between nonlinear monotone relationships and nonmonotone relationships. See Fox (1991, p. 55). Alternative methods are given by Cook and Olive (2001) and Box et al. (2005, p. 321).
An alternative to one way Anova is to use FWLS (see Chapter 4) on the cell means model with σ²V = diag(σ1², ..., σp²) where σi² occurs ni times on the diagonal and σi² is the variance of the ith group for i = 1, ..., p. Then V̂ = diag(S1², ..., Sp²) where Si² = (1/(ni − 1)) Σ_{j=1}^{ni} (Yij − Ȳi0)² is the sample variance of the Yij. Hence the estimated weights for FWLS are ŵij ≡ ŵi = 1/Si². Then the FWLS cell means model has Y = Xc βc + e as in (5.1) except Cov(e) = diag(σ1², ..., σp²).
Hence Z = Uc βc + ε. Then Uc^T Uc = diag(n1ŵ1, ..., npŵp), (Uc^T Uc)^{−1} = diag(S1²/n1, ..., Sp²/np) = (Xc^T V̂^{−1} Xc)^{−1}, and Uc^T Z = (ŵ1 Y10, ..., ŵp Yp0)^T. Thus from Chapter 4,

β̂FWLS = (Ȳ10, ..., Ȳp0)^T = β̂c.
That is, the FWLS estimator equals the one way Anova estimator of β based on OLS applied to the cell means model. The ANOVA F test generalizes the pooled t test in that the two tests are equivalent for p = 2. The FWLS procedure is also known as the Welch one way Anova and generalizes the Welch t test. The Welch t test is thought to be much better than the pooled t test if n1 ≠ n2 and σ1² ≠ σ2². See Brown and Forsythe (1974a,b), Kirk (1982, pp. 100, 101, 121, 122), Olive (2014, pp. 278–279), Welch (1947, 1951), and Problem 5.11.
In matrix form Z = Uc βc + ε becomes

\[
\begin{pmatrix}
\sqrt{w_1}\,Y_{1,1}\\ \vdots\\ \sqrt{w_1}\,Y_{1,n_1}\\ \sqrt{w_2}\,Y_{2,1}\\ \vdots\\ \sqrt{w_2}\,Y_{2,n_2}\\ \vdots\\ \sqrt{w_p}\,Y_{p,1}\\ \vdots\\ \sqrt{w_p}\,Y_{p,n_p}
\end{pmatrix}
=
\begin{pmatrix}
\sqrt{w_1} & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
\sqrt{w_1} & 0 & \cdots & 0\\
0 & \sqrt{w_2} & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & \sqrt{w_2} & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & \sqrt{w_p}\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & \sqrt{w_p}
\end{pmatrix}
\begin{pmatrix}
\mu_1\\ \mu_2\\ \vdots\\ \mu_p
\end{pmatrix}
+
\begin{pmatrix}
\epsilon_{1,1}\\ \vdots\\ \epsilon_{1,n_1}\\ \epsilon_{2,1}\\ \vdots\\ \epsilon_{2,n_2}\\ \vdots\\ \epsilon_{p,1}\\ \vdots\\ \epsilon_{p,n_p}
\end{pmatrix}
\quad (5.6)
\]
The Welch test statistic is

\[
F_W = \frac{\sum_{i=1}^p w_i(\overline{Y}_{i0} - \tilde{Y}_{00})^2/(p-1)}
{1 + \dfrac{2(p-2)}{p^2-1}\sum_{i=1}^p \dfrac{(1 - w_i/u)^2}{n_i - 1}}
\]

where wi = ni/Si², u = Σ_{i=1}^p wi, and Ỹ00 = Σ_{i=1}^p wi Ȳi0/u. Then the test statistic is compared to an Fp−1,dW distribution where dW = f̂ and

\[
1/f = \frac{3}{p^2-1}\sum_{i=1}^p \frac{(1 - w_i/u)^2}{n_i - 1}.
\]

For the modified Welch (1947) test, the test statistic is compared to an Fp−1,dMW distribution where dMW = f̂ and

\[
f = \frac{\left[\sum_{i=1}^p S_i^2/n_i\right]^2}{\sum_{i=1}^p \dfrac{(S_i^2/n_i)^2}{n_i - 1}}
= \frac{\left[\sum_{i=1}^p 1/w_i\right]^2}{\sum_{i=1}^p \dfrac{(1/w_i)^2}{n_i - 1}}
\]

where

\[
c_i = \left(1 - \frac{n_i}{n}\right) S_i^2 \Big/ \sum_{i=1}^p \left(1 - \frac{n_i}{n}\right) S_i^2.
\]
The lregpack function anovasim can be used to simulate and compare
the four tests with the usual one way ANOVA test. Some simulation results
are in Haenggi (2009).
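The Welch one way Anova is available in base R as oneway.test with var.equal = FALSE. A minimal sketch with made-up data:

y <- c(12, 14, 13, 25, 31, 28, 40, 38, 45)   # made-up responses
group <- factor(rep(1:3, each = 3))
oneway.test(y ~ group, var.equal = FALSE)    # Welch F_W test
oneway.test(y ~ group, var.equal = TRUE)     # usual one way ANOVA F test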
5.7 Problems
got 1000, four got 5000, and four got 10000. These four groups are denoted by
none, n1000, n5000, and n10000, respectively. The seedling growths
were all recorded and the table below gives the one way ANOVA results.
a) What is μ̂none?
b) Do a four step test for whether the four mean growths are equal. (So Ho: μnone = μn1000 = μn5000 = μn10000.)
c) Examine the Bonferroni comparison of means. Which groups of means are not significantly different?
> sample(11)
[1] 7 10 9 8 1 6 3 11 2 4 5
y1 y5 y2 y3 y4
9.8 10.8 15.4 17.6 21.6
5.6. The tensile strength of a cotton-nylon fiber used to make women's shirts is believed to be affected by the percentage of cotton in the fiber. The 5 levels of cotton percentage that are of interest are tabled above. Also shown is a (Tukey pairwise) comparison of means. Which groups of means are not significantly different? Data is from Montgomery (1984, pp. 51, 66).
5.8. Ledolter and Swersey (2007, p. 49) describe a one way Anova design used to study the effectiveness of 3 product displays (A, B, and C). Fifteen stores were used and each display was randomly assigned to 5 stores. The response Y was the sales volume for the week during which the display was present compared to the base sales for that store.
a) Find μ̂2 = μ̂B using output on the previous page.
b) Perform a 4 step ANOVA F test.
Fig. 5.5 Graphical Anova for Problem 5.9
ganova(x,y)
smn -3.2333 -3.0374 6.2710
Treatments "A" "B" "C"
5.9. Ledolter and Swersey (2007, p. 49) describe a one way Anova design used to study the effectiveness of 3 product displays (A, B, and C). Fifteen stores were used and each display was randomly assigned to 5 stores. The response Y was the sales volume for the week during which the display was present compared to the base sales for that store. Figure 5.5 is the Graphical Anova plot found using the lregpack function ganova.
a) Which two displays (from A, B, and C) yielded similar mean sales volume?
b) Which effect (from A, B, and C) appears to be significant?
Source df SS MS F P
treatment 3 89.19 29.73 15.68 0.0002
error 12 22.75 1.90
Problems using R.
5.11. The pooled t procedures are a special case of one way Anova with p = 2. Consider the pooled t CI for μ1 − μ2. Let X1, ..., Xn1 be iid with mean μ1 and variance σ1². Let Y1, ..., Yn2 be iid with mean μ2 and variance σ2². Assume that the two samples are independent (or that n1 + n2 units were randomly assigned to two groups) and that ni → ∞ for i = 1, 2 in such a way that ρ̂ = n1/(n1 + n2) → ρ ∈ (0, 1). Let θ = σ2²/σ1², let the pooled sample variance Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/[n1 + n2 − 2], and let τ² = (1 − ρ + ρθ)/(ρ + (1 − ρ)θ). Then

\[
\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \xrightarrow{D} N(0, 1) \quad \text{and}
\]

\[
\frac{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\;
\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}
= \frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\xrightarrow{D} N(0, \tau^2).
\]
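A small simulation sketch (with made-up settings) comparing the coverage of the pooled and Welch t intervals when n1 ≠ n2 and σ1² ≠ σ2²:

set.seed(2)
cover <- function(pooled, nsim = 2000){
  hits <- replicate(nsim, {
    x <- rnorm(10, mean = 0, sd = 1)          # group 1: n1 = 10, sigma1 = 1
    y <- rnorm(40, mean = 0, sd = 5)          # group 2: n2 = 40, sigma2 = 5
    ci <- t.test(x, y, var.equal = pooled)$conf.int
    ci[1] < 0 && 0 < ci[2]                    # does the CI cover mu1 - mu2 = 0?
  })
  mean(hits)                                  # observed coverage of the nominal 95% CI
}
cover(pooled = TRUE)    # pooled t interval
cover(pooled = FALSE)   # Welch t interval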
y <- ycrab+1/6
aovtplt(crabhab,y)
5.14. The following data set considers the number of warp breaks per
loom, where the factor is tension (low, medium, or high).
a) Copy and paste the commands for this problem into R.
Highlight the ANOVA table by pressing the left mouse key and dragging
the cursor over the ANOVA table. Then use the menu commands Edit>
Copy. Enter Word and use the menu command Paste.
b) To place the residual plot in Word, get into R and click on the plot, hit the Ctrl and c keys at the same time. Enter Word and use the menu command Paste or hit the Ctrl and v keys at the same time.
c) Copy and paste the commands for this part into R.
Click on the response plot, hit the Ctrl and c keys at the same time. Enter
Word and use the menu command Paste.
5.15. Obtain the Box et al. (2005, p. 134) blood coagulation data from
lregdata and the R program ganova from lregpack. The program does graph-
ical Anova for the one way Anova model.
a) Enter the following command and include the plot in Word by simulta-
neously pressing the Ctrl and c keys, then using the menu command Paste
in Word, or hit the Ctrl and v keys at the same time.
ganova(bloodx,bloody)
The scaled treatment deviations are on the top of the plot. As a rule of thumb, if all of the scaled treatment deviations are within the spread of the residuals, then population treatment means are not significantly different (they all give response near the grand mean). If some deviations are outside of the spread of the residuals, then not all of the population treatment means are equal. Box et al. (2005, p. 137) state that "The graphical analysis discourages overreaction to high significance levels and avoids underreaction to very nearly significant differences."
b) From the output, which two treatment means were approximately the same?
z<-rand1way(y=bloody,group=bloodx,B=1000)
hist(z$rdist)
z$Fpval
z$randpval
5.16. Cut and paste the SAS program for this problem into the SAS
Editor.
To execute the program, use the top menu commands Run>Submit. An
output window will appear if successful.
(If you were not successful, look at the log window for hints on errors. A single typo can cause failure. Reopen your file in Word or Notepad and make corrections. Occasionally you cannot find your error. Then find your instructor or wait a few hours and reenter the program.)
Data is from SAS Institute (1985, pp. 126–129). See Example 5.6.
a) In SAS, use the menu commands Edit>Select All then Edit>Copy. In Word, use the menu command Paste. Highlight the first page of output and use the menu command Cut. (SAS often creates too much output. These commands reduce the output from 4 pages to 3 pages.)
You may want to save your SAS output as the file HW5d16.doc on your flash drive.
b) Perform the 4 step test for Ho: μ1 = μ2 = ··· = μ6.
c) From the residual and response plots, does the assumption of equal population standard deviations (σi = σ for i = 1, ..., 6) seem reasonable?
5.17. To get into ARC, you need to find the ARC icon. Suppose the ARC icon is in a math progs folder. Move the cursor to the math progs folder, click the right mouse button twice, move the cursor to ARC, double click, move the cursor to ARC, double click. These menu commands will be written math progs > ARC > ARC. To quit ARC, move the cursor to the x in the northeast corner and click.
This Cook and Weisberg (1999a, p. 289) data set contains IQ scores on
27 pairs of identical twins, one raised by foster parents IQf and the other
by biological parents IQb. C gives the social class of the biological parents:
C = 1 for upper class, 2 for middle class and 3 for lower class. Hence the
Anova test is for whether mean IQ depends on class.
a) Activate twins.lsp dataset with the menu commands
File > Load > Data > twins.lsp.
b) Use the menu commands Twins>Make factors, select C and click on OK. The line {F}C Factor 27 Factor - first level dropped should appear on the screen.
c) Use the menu commands Twins>Description to see a description of
the data.
d) Enter the menu commands Graph&Fit>Fit linear LS and select {F}C
as the term and IQb as the response. Highlight the output by pressing the
left mouse key and dragging the cursor over the output. Then use the menu
commands Edit> Copy. Enter Word and use the menu command Paste.
This McKenzie and Goldman (1999, p. T-234) data set has 30 three-month-old infants randomized into five groups of 6 each. Each infant is shown a mobile of one of five multicolored designs, and the goal of the study is to see if the infant attention span varies with the type of design of the mobile. The times that each infant spent watching the mobile are recorded.
b) Choose Stat>Basic Statistics>Display Descriptive Statistics, select
C1 Time as the Variable, click the By variable option and press Tab.
Select C2 Design as the By variable. c) From the window in b), click on
Graphs the Boxplots of data option, and OK twice. Click on the plot
and then click on the printer icon to get a plot of the boxplots.
d) Select Stat>ANOVA>One-way, select C1-time as the response and C2-Design as the factor. Click on Store residuals and click on Store fits. Then click on OK. Click on the output and then click on the printer icon.
e) To make a residual plot, select Graph>Plot. Select Resi1 for Y
and Fits1 for X and click on OK. Click on the plot and then click on
the printer icon to get the residual plot.
f) To make a response plot, select Graph>Plot. Select C1 Time for
Y and Fits1 for X and click on OK. Click on the plot and then click
on the printer icon to get the response plot.
g) Do the 4 step test for Ho: μ1 = μ2 = ··· = μ5.
To get out of Minitab, move your cursor to the x in the NE corner of
the screen. When asked whether to save changes, click on no.
Chapter 6
The K Way Anova Model
For a K way Anova model, A1, ..., AK are the factors with li levels for i = 1, ..., K. Hence there are l1 l2 ··· lK treatments where each treatment uses exactly one level from each factor. First the two way Anova model is discussed and then the model with K > 2. Interactions between the K factors are important.
Definition 6.1. The fixed effects two way Anova model has two factors A and B plus a response Y. Factor A has a levels and factor B has b levels. There are ab treatments.
Definition 6.2. The cell means model for two way Anova is Yijk = μij + eijk where i = 1, ..., a; j = 1, ..., b; and k = 1, ..., m. The sample size n = abm. The μij are constants and the eijk are iid from a location family with mean 0 and variance σ². Hence the Yijk ~ f(y − μij) come from a location family with location parameter μij. The fitted values are Ŷijk = Ȳij0 = μ̂ij while the residuals rijk = Yijk − Ŷijk.
For one way Anova models, the cell sizes ni need not be equal. For K way Anova models with K ≥ 2 factors, the statistical theory is greatly simplified if all of the cell sizes are equal. Such designs are called balanced designs.
Definition 6.3. A balanced design has all of the cell sizes equal: for the two way Anova model, nij ≡ m.
Denition 6.4. A two way Anova design uses factorial crossing if each
combination of an A level and a B level is used and called a treatment. There
are ab treatments for the two way Anova model.
Remark 6.1. If A and B are factors, then there are 5 possible models.
i) The two way Anova model has terms A, B, and AB.
ii) The additive model or main effects model has terms A and B.
iii) The one way Anova model that uses factor A.
iv) The one way Anova model that uses factor B.
v) The null model does not use any of the three terms A, B, or AB. If the null model holds, then Yijk ~ f(y − μ00) so the Yijk form a random sample of size n from a location family, and the distribution of the response is the same for all ab treatments. For models i)–iv), the distribution of the response is not the same for all ab treatments.
Remark 6.2. The response plot, residual plot, and transformation plots for response transformations are used in the same way as in Chapter 5. The plots work best if the MSE degrees of freedom ≥ max(10, n/5). The model is overfitting if 1 ≤ MSE df < max(10, n/5), and then the plots may only be useful for detecting large deviations from the model. For the model that contains A, B, and AB, there will be ab dot plots of size m, and we need m ≥ 5 to check for similar shape and spread. For the additive model, the response and residual plots often look like those for multiple linear regression. Then the plotted points should scatter about the identity line or r = 0 line in a roughly evenly populated band if the additive two way Anova model is reasonable. We want n ≥ 5(number of parameters in the model) for inference. So we want n ≥ 5ab or m ≥ 5 when all interactions and main effects are in the two way Anova model.
Shown below is an ANOVA table for the two way Anova model given in symbols. Sometimes "Error" is replaced by "Residual" or "Within Groups." A and B are the main effects while AB is the interaction. Sometimes "p-value" is replaced by "P," "Pr(> F)," or "PR > F." The p-value corresponding to FA is for Ho: μ10 = ··· = μa0. The p-value corresponding to FB is for Ho: μ01 = ··· = μ0b. The p-value corresponding to FAB is for Ho: there is no interaction. The sample p-value pval is an estimator of the population p-value.
Source  df                 SS     MS     F                 p-value
A       a − 1              SSA    MSA    FA = MSA/MSE      pval
B       b − 1              SSB    MSB    FB = MSB/MSE      pval
AB      (a − 1)(b − 1)     SSAB   MSAB   FAB = MSAB/MSE    pval
Error   n − ab = ab(m − 1) SSE    MSE
does not depend on the level of B. (Or there is not enough evidence to
conclude that the mean response depends on the level of B.)
The interaction plot is rather hard to use, especially if the nij = m are small. For small m, the curves can be far from parallel, even if there is no interaction. The further the curves are from being parallel, the greater the evidence of interaction. Intersection of curves suggests interaction unless the two curves are nearly the same. The two curves may be nearly the same if two levels of one factor give nearly the same mean response for each level of the other factor. Then the curves could cross several times even though there is no interaction. Software fills space, so the vertical axis needs to be checked to see whether the sample means for two curves are close with respect to the standard error √(MSE/m) for the means.
The interaction plot is the most useful if the conclusions for the plot agree
with the conclusions for the F test for no interaction.
Definition 6.7. The overparameterized two way Anova model has Yijk = μij + eijk with μij = μ00 + αi + βj + (αβ)ij where the interaction parameters (αβ)ij = μij − μi0 − μ0j + μ00. The A main effects are αi = μi0 − μ00 for i = 1, ..., a. The B main effects are βj = μ0j − μ00 for j = 1, ..., b. Here Σi αi = 0, Σj βj = 0, Σi (αβ)ij = 0 for j = 1, ..., b and Σj (αβ)ij = 0 for i = 1, ..., a. Thus Σi Σj (αβ)ij = 0.
Fig. 6.1 Interaction Plot for Example 6.1.
with line segments, then there will be b parallel curves with curve height depending on βj. If there is interaction, then not all of the b curves will be parallel. The interaction plot replaces the μij by the μ̂ij = Ȳij0.
Example 6.2. The output below uses data from Kutner et al. (2005, problems 19.14–19.15). The output is from an experiment on hay fever, and 36 volunteers were given medicine. The two active ingredients (factors A and B) in the medicine were varied at three levels each (low, medium, and high). The response is the number of hours of relief. (The factor names for this problem are A and B.)
a) Give a four step test for the A*B interaction.
b) Give a four step test for the A main effects.
c) Give a four step test for the B main effects.
Source DF SS MS F P
A 2 220.0200 110.0100 1827.86 0.000
B 2 123.6600 61.8300 1027.33 0.000
Interaction 4 29.4250 7.3562 122.23 0.000
Error 27 1.6250 0.0602
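A sketch of how such a table can be produced in R, with made-up data standing in for the Kutner et al. hay fever data (the factor and response names below are hypothetical):

A <- factor(rep(rep(c("low", "med", "high"), each = 4), times = 3))   # 3 levels, m = 4
B <- factor(rep(c("low", "med", "high"), each = 12))                  # 3 levels
set.seed(3)
relief <- rnorm(36, mean = 2 + 2 * as.integer(A) * as.integer(B))     # made-up cell means with interaction
out <- aov(relief ~ A * B)
summary(out)                          # FA, FB, FAB and their p-values
interaction.plot(A, B, relief)        # curves far from parallel suggest interaction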
Use factorial crossing to compare the effects (main effects, pairwise interactions, ..., K-fold interaction if there are K factors) of two or more factors. If A1, ..., AK are the factors with li levels for i = 1, ..., K, then there are l1 l2 ··· lK treatments where each treatment uses exactly one level from each factor.
Source                                    df  SS            MS        F        p-value
K main effects                                e.g. SSA      MSA       FA       pA
(K choose 2) 2 factor interactions            e.g. SSAB     MSAB      FAB      pAB
(K choose 3) 3 factor interactions            e.g. SSABC    MSABC     FABC     pABC
  ...                                          ...           ...       ...      ...
(K choose K−1) K − 1 factor interactions
the K factor interaction                      SSA···L       MSA···L   FA···L   pA···L
Error                                         SSE           MSE
Shown above is a partial ANOVA table for a K way Anova design with the degrees of freedom left blank. For A, use H0: μ10···0 = ··· = μl10···0. The other main effects have similar null hypotheses. For interaction, use H0: no interaction.
These models get complex rapidly as K and the number of levels li increase. As K increases, there are a large number of models to consider. For experiments, usually the 3 way and higher order interactions are not significant. Hence a full model that includes all K main effects and the (K choose 2) two way interactions is a useful starting point for response, residual, and transformation plots. The higher order interactions can be treated as potential terms and checked for significance. As a rule of thumb, significant interactions tend to involve significant main effects.
The sample size n = m ∏_{i=1}^K li ≥ m 2^K is minimized by taking li = 2 for i = 1, ..., K. Hence the sample size grows exponentially fast with K. Designs that use the minimum number of levels 2 are discussed in Section 8.1.
6.3 Summary
1) The fixed effects two way Anova model has two factors A and B plus a response Y. Factor A has a levels and factor B has b levels. There are ab treatments. The cell means model is Yijk = μij + eijk where i = 1, ..., a; j = 1, ..., b; and k = 1, ..., m. The sample size n = abm. The μij are constants and the eijk are iid with mean 0 and variance σ². Hence the Yijk ~ f(y − μij) come from a location family with location parameter μij. The fitted values are Ŷijk = Ȳij0 = μ̂ij while the residuals rijk = Yijk − Ŷijk.
2) Know that the 4 step test for AB interaction is
i) Ho: no interaction  HA: there is an interaction
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A and B; otherwise fail to reject Ho and conclude that there is no interaction between A and B.
3) Keep A and B in the model if there is an AB interaction.
4) Know that the 4 step test for A main effects is
i) Ho: μ10 = ··· = μa0  HA: not Ho
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the level of A; otherwise fail to reject Ho and conclude that the mean response does not depend on the level of A.
5) Know that the 4 step test for B main effects is
i) Ho: μ01 = ··· = μ0b  HA: not Ho
ii) FB is obtained from output.
6) Shown is an ANOVA table for the two way Anova model given in symbols. Sometimes "Error" is replaced by "Residual" or "Within Groups." A and B are the main effects while AB is the interaction. Sometimes "p-value" is replaced by "P," "Pr(> F)," or "PR > F." The p-value corresponding to FA is for Ho: μ10 = ··· = μa0. The p-value corresponding to FB is for Ho: μ01 = ··· = μ0b. The p-value corresponding to FAB is for Ho: there is no interaction.
Source  df                 SS     MS     F                 p-value
A       a − 1              SSA    MSA    FA = MSA/MSE      pval
B       b − 1              SSB    MSB    FB = MSB/MSE      pval
AB      (a − 1)(b − 1)     SSAB   MSAB   FAB = MSAB/MSE    pval
Error   n − ab = ab(m − 1) SSE    MSE
6.4 Complements
Four good texts on the design and analysis of experiments are mentioned in the Complements of Chapter 5. The software for K way Anova is often used to fit block designs. Each block is entered as if it were a factor and the main effects model is fit. The one way block design treats the block like one factor and the treatment factor as another factor and uses two way Anova software without interaction to get the correct sum of squares, F statistic, and p-value. The Latin square design treats the row block as one factor, the column block as a second factor, and the treatment factor as another factor. Then the three way Anova software for main effects is used to get the correct sum of squares, F statistic, and p-value. These two designs are described in Chapter 7. The K way software is also used to get output for the split plot designs described in Chapter 9.
Consider nding a model using pretesting or variable selection, and then
acting as if that model was selected before examining the data. This method
does not lead to valid inference. See Fabian (1991) for results on the 2 way
Anova model. If the method can be automated, the bootstrap method of Olive
(2016a) is conjectured to be useful for inference. This bootstrap method may
also be useful for unbalanced designs where the nij are not all equal to m.
Gail (1996) explains why it took so long to use double blinded completely
randomized controlled experiments to test new vaccines.
6.5 Problems
a) Copy and paste the SAS program into SAS, and use the menu command Run>Submit.
b) Click on the Graph1 window and scroll down to the second interaction
plot of tmp vs ymn. Press the printer icon to get the plot.
c) Is interaction present?
d) Click on the output window then click on the printer icon. This will
produce 5 pages of output, but only hand in the ANOVA table, response plot,
and residual plots.
(Cutting and pasting the output into Word resulted in bad plots. Using
Notepad gave better plots, but the printer would not easily put the ANOVA
table and two plots on one page each.)
e) Do the residual and response plots look ok?
6.4. a) Copy the SAS data for problem 6.3 into Notepad. Then hit Enter
every three numbers so that the data is in 3 columns.
1 50 130
1 50 155
1 50 74
1 50 180
1 65 34
. . .
. . .
. . .
3 80 60
b) Copy and paste the data into Minitab using the menu commands
Edit>Paste Cells and click on OK. Right below C1 type material, below
C2 type temp and below C3 type mvoltage.
c) Select Stat>ANOVA>Two-way, select C3 mvoltage as the response and C1 material as the row factor and C2 temp as the column factor. Click on Store residuals and click on Store fits. Then click on OK. Click on the output and then click on the printer icon.
d) To make a residual plot, select Graph>Plot. Select Resi1 for Y and
Fits1 for X and click on OK. Click on the printer icon to get a plot of
the graph.
e) To make a response plot, select Graph>Plot. Select C3 mvoltage for
Y and Fits1 for X and click on OK. Click on the printer icon to get
a plot of the graph.
R Problem
6.5. The Box et al. (2005, p. 318) poison data has 4 types of treatments
(1,2,3,4) and 3 types of poisons (1,2,3). Each animal is given a poison and a
treatment, and the response is survival in hours. Get the poison data from
lregdata.
a) Type the following commands to see that the output for the three
models is the same. Print the output.
out1<-aov(stime~ptype*treat,poison)
summary(out1)
out2<-aov(stime~ptype + treat + ptype*treat,poison)
summary(out2)
out3<-aov(stime~.^2,poison)
summary(out3)
#The three models are the same.
b) Type the following commands to see the residual plot. Include the plot
in Word.
plot(fitted(out1),resid(out1))
title("Residual Plot")
c) Type the following commands to see the response plot. Include the plot
in Word.
attach(poison)
out4 <- aov((1/stime)~ptype*treat,poison)
summary(out4)
f) Type the following commands to get the residual plot. Copy the plot
into Word.
plot(fitted(out4),resid(out4))
title("Residual Plot")
g) Type the following commands to get the response plot. Copy the plot
into Word.
h) Type the following commands to get the interaction plot. Copy the plot
into Word.
interaction.plot(treat,ptype,(1/stime))
detach(poison)
Chapter 7
Block Designs
Blocks are groups of similar units, and blocking can yield experimental designs that are more efficient than designs that do not block. One way block designs and Latin square designs will be discussed.
Denition 7.1. A block is a group of mk similar or homogeneous units.
In a block design, each unit in a block is randomly assigned to one of k
treatments with each treatment getting m units from the block. The meaning
of similar is that the units are likely to have similar values of the response
when given identical treatments.
In agriculture, adjacent plots of land are often used as blocks since adjacent
plots tend to give similar yields. Litter mates, siblings, twins, time periods
(e.g., different days), and batches of material are often used as blocks.
Following Cobb (1998, p. 247), there are 3 ways to get blocks. i) Sort units
into groups (blocks) of mk similar units. ii) Divide large chunks of material
(blocks) into smaller pieces (units). iii) Reuse material or subjects (blocks)
several times. Then the time slots are the units.
Example 7.1. For i), to study the effects of k different medicines, sort n = bk people into b groups of size k according to similar age and weight. For ii), suppose there are b plots of land. Divide each plot into k subplots. Then each plot is a block and the subplots are units. For iii), give the k different treatments to each person over k months. Then each person has a block of time slots and the ith month = time slot is the unit.
Suppose there are b blocks and n = kb. The one way Anova design randomly
assigns b of the units to each of the k treatments. Blocking places a constraint
on the randomization, since within each block of units, exactly one unit is
randomly assigned to each of the k treatments.
Hence a one way Anova design would use the R command sample(n) and
the first b units would be assigned to treatment 1, the second b units to treatment 2, ..., and the last b units would be assigned to treatment k.
For the completely randomized block designs, described below, the com-
mand sample(k) is done b times: once for each block. The ith command is
for the units of the ith block. If k = 5 and the sample(5) command yields
2 5 3 1 4, then the 2nd unit in the ith block is assigned to treatment
1, the 5th unit to treatment 2, the 3rd unit to treatment 3, the 1st unit to
treatment 4, and the 4th unit to treatment 5.
Remark 7.1. Blocking and randomization often makes the iid error
assumption hold to a useful approximation.
Definition 7.2. For the one way block design or completely randomized block design (CRBD), there is a factor A with k levels and there are b blocks. The CRBD model is

Yij = μij + eij = μ + τi + βj + eij

for i = 1, ..., k and j = 1, ..., b, where τi is the ith treatment effect and βj is the jth block effect with Σ_{j=1}^b βj = 0. The ith treatment mean

μi = μi0/b = (1/b) Σ_{j=1}^b (μ + τi + βj) = μ + τi.

So the μi are all equal if the τi are all equal. The errors eij are iid with 0 mean and constant variance σ².
Notice that the CRBD model is additive: there is no block treatment in-
teraction. The ANOVA table for the CRBD is like the ANOVA table for a
two way Anova main eects model. Shown below is a CRBD ANOVA table in
symbols. Sometimes Treatment is replaced by Factor or Model. Some-
times Blocks is replaced by the name of the blocking variable. Sometimes
Error is replaced by Residual.
Source     df              SS    MS    F               p-value
Blocks     b − 1           SSB   MSB   Fblock          pblock
Treatment  k − 1           SSTR  MSTR  F0 = MSTR/MSE   pval for Ho
Error      (k − 1)(b − 1)  SSE   MSE
Rule of thumb 7.1. If pblock ≥ 0.1, then blocking was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking was useful.
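A minimal R sketch of fitting a CRBD with made-up data (hypothetical names); the additive aov fit produces a table like the one above:

y <- c(12, 15, 11, 14, 20, 24, 19, 22, 31, 35, 30, 33)     # made-up responses
block <- factor(rep(1:3, each = 4))                        # b = 3 blocks
treatment <- factor(rep(c("A", "B", "C", "D"), times = 3)) # k = 4 treatments
out <- aov(y ~ block + treatment)
summary(out)     # gives pblock and the pval for Ho: mu_1 = ... = mu_4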
Remark 7.2. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova model, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.4 for these plots and the following plot.
Definition 7.3. The block response scatterplot plots blocks versus the response. The plot will have b dot plots of size k with a symbol corresponding to the treatment. Dot plots with clearly different means suggest that blocking was useful. A symbol pattern within the blocks (e.g., symbols A and B are always highest while C and D are always lowest) suggests that the response depends on the factor level.
Definition 7.4. Graphical Anova for the CRBD model uses the residuals as a reference set instead of an F distribution. The scaled treatment deviations √(b − 1)(Ȳi0 − Ȳ00) have about the same variability as the residuals if Ho is true. The scaled block deviations √(k − 1)(Ȳ0j − Ȳ00) also have about the same variability as the residuals if blocking is ineffective. A dot plot of the scaled block deviations is placed above the dot plot of the scaled treatment deviations, which is placed above the dot plot of the residuals. For small n ≤ 40, suppose the distance between two scaled deviations (A and B, say) is greater than the range of the residuals = max(rij) − min(rij). Then declare A and B to be significantly different. If the distance is less than the range, do not declare A and B to be significantly different. Scaled deviations that lie outside the range of the residuals are significant: the corresponding treatment means are significantly different from the overall mean. For n ≥ 100, let r(1) ≤ r(2) ≤ ··· ≤ r(n) be the order statistics of the residuals. Then instead of the range, use r(⌈0.975n⌉) − r(⌈0.025n⌉) as the distance where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects outside of the interval (r(⌈0.025n⌉), r(⌈0.975n⌉)) are significant. See Box et al. (2005, pp. 150–151).
Example 7.2. Ledolter and Swersey (2007, p. 60) give completely ran-
domized block design data. The block variable = market had 4 levels (1
Binghamton, 2 Rockford, 3 Albuquerque, 4 Chattanooga) while the treat-
ment factor had 4 levels (A no advertising, B $6 million, C $12 million, D
$18 million advertising dollars in 1973). The response variable was average
cheese sales (in pounds per store) sold in a 3-month period.
a) From the graphical Anova in Figure 7.1, were the blocks useful?
b) Perform an appropriate 4 step test for whether advertising helped cheese
sales.
Solution: a) In Figure 7.1, the top dot plot is for the scaled block deviations. The leftmost dot corresponds to blocks 4 and 1, the middle dot to block 3 and the rightmost dot to block 1 (see output from the lregpack function ganova2). Yes, the blocks were useful since some (actually all) of the dots corresponding to the scaled block deviations fall outside the range of the residuals. This result also agrees with pblock = 4.348e−06 < 0.05.
b) i) Ho: μ1 = μ2 = μ3 = μ4  HA: not Ho
ii) Fo = 1.313
iii) pval = 0.3292
iv) Fail to reject Ho; the mean sales do not depend on the advertising level.
In Figure 7.1, the middle dot plot is for the scaled treatment deviations. From left to right, these correspond to B, A, D, and C since the output shows that the deviation corresponding to C is the largest with value 733.3.
[Figure 7.1: Graphical Anova for Example 7.2, showing dot plots of the scaled block deviations, the scaled treatment deviations, and the residuals.]
Since the four scaled treatment deviations all lie within the range of the residuals, the four treatments again do not appear to be significant.
Example 7.3. Snedecor and Cochran (1967, p. 300) give a data set with 5 types of soybean seed. The response frate = number of seeds out of 100 that failed to germinate. Five blocks were used. In the block response scatterplot for these data, A, B, C, D, and E refer to seed type, and the 2 in the second block indicates that A and C both had values 10. Which type of seed has the highest germination failure rate?
a) A  b) B  c) C  d) D  e) E
Solution: a) A since A is on the top for blocks 2–5 and second for block 1.
Fig. 7.2 One Way Block Design Does Not Fit All of the Data
Note: The response and residual plots in Figure 7.2 suggest that one case is not fit well by the model. The Bs and Es in the block response plot suggest that there may be a block treatment interaction, which is not allowed by the completely randomized block design. Figure 7.2 was made with the lregpack function aovplots.
Blocking is used to reduce the MSE so that inference such as tests and confidence intervals are more precise. Below is a partial ANOVA table for a k way Anova design with one block where the degrees of freedom are left blank. For A, use H0: μ10···0 = ··· = μl10···0. The other main effects have similar null hypotheses. For interaction, use H0: no interaction.
These models get complex rapidly as k and the number of levels li increase. As k increases, there are a large number of models to consider. For experiments, usually the 3 way and higher order interactions are not significant. Hence a full model that includes the blocks, all k main effects, and all (k choose 2) two way interactions is a useful starting point for response, residual, and transformation plots. The higher order interactions can be treated as potential terms and checked for significance. As a rule of thumb, significant interactions tend to involve significant main effects.
Source                                 df  SS            MS         F        p-value
block                                      SSblock       MSblock    Fblock   pblock
k main effects                             e.g. SSA      MSA        FA       pA
(k choose 2) 2 way interactions            e.g. SSAB     MSAB       FAB      pAB
(k choose 3) 3 way interactions            e.g. SSABC    MSABC      FABC     pABC
  ...                                       ...           ...        ...      ...
(k choose k−1) k − 1 way interactions
the k way interaction                      SSA···L       MSA···L    FA···L   pA···L
Error                                      SSE           MSE
The following example has one block and 3 factors. Hence there are 3 two
way interactions and 1 three way interaction.
Example 7.4. Snedecor and Cochran (1967, pp. 361–364) describe a block design (2 levels) with three factors: food supplements Lysine (4 levels), Methionine (3 levels), and Protein (2 levels). Male pigs were fed the supplements in a 4×3×2 factorial arrangement and the response was average daily weight gain. The ANOVA table is shown below. The model could be
Solution: a) Randomly.
b) Yes, 0.0379 < 0.05.
c) H0: μ0010 = μ0020  HA: not H0
FP = 19.47
pval = 0.0002
Reject H0; the mean weight gain depends on the protein level.
d) None.
Source df SS MS F pvalue
block 1 0.1334 0.1334 4.85 0.0379
L 3 0.0427 0.0142 0.5164 0.6751
M 2 0.0526 0.0263 0.9564 0.3990
P 1 0.5355 0.5355 19.47 0.0002
LM 6 0.2543 0.0424 1.54 0.2099
LP 3 0.2399 0.0800 2.91 0.0562
MP 2 0.0821 0.0410 1.49 0.2463
LMP 6 0.0685 0.0114 0.4145 0.8617
error 23 0.6319 0.0275
7.3 Latin Square Designs
Latin square designs have a lot of structure. The design contains a row block factor, a column block factor, and a treatment factor, each with a levels. The two blocking factors and the treatment factor are crossed, but it is assumed that there is no interaction. A capital letter is used for each of the a treatment levels. So a = 3 uses A, B, C while a = 4 uses A, B, C, D.
Five Latin squares are shown below. The first, third, and fifth are standard. If a = 5, there are 56 standard Latin squares.

A B C    A B C    A B C D    A B C D E    A B C D E
B C A    C A B    B A D C    E A B C D    B A E C D
C A B    B C A    C D A B    D E A B C    C D A E B
                  D C B A    C D E A B    D E B A C
                             B C D E A    E C D B A
Yijk = μ + τi + βj + γk + eijk

where τi is the ith treatment effect, βj is the jth row block effect, and γk is the kth column block effect with i, j, and k = 1, ..., a. The errors eijk are iid with 0 mean and constant variance σ². The ith treatment mean μi = μ + τi.
Shown below is an ANOVA table for the Latin square model given in
symbols. Sometimes Error is replaced by Residual, or Within Groups.
Sometimes rblocks and cblocks are replaced by the names of the blocking
factors. Sometimes p-value is replaced by P, P r(> F ), or PR > F.
Source      df              SS    MS    F               p-value
rblocks     a − 1           SSRB  MSRB  Frow            prow
cblocks     a − 1           SSCB  MSCB  Fcol            pcol
treatments  a − 1           SSTR  MSTR  Fo = MSTR/MSE   pval
Error       (a − 1)(a − 2)  SSE   MSE
Rule of thumb 7.2. Let pblock be prow or pcol. If pblock ≥ 0.1, then blocking was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking was useful.
Be able to perform the 4 step ANOVA F test for the Latin square design. This test is similar to the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = ··· = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fa−1,(a−1)(a−2) > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends on the factor level. Otherwise fail to reject Ho and conclude that the mean response does not depend on the factor level. (Or there is not enough evidence to conclude that the mean response depends on the factor level.) Give a nontechnical sentence. Use δ = 0.05 if δ is not given.
Remark 7.4. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova models, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.5 and the following example.
Source df SS MS F P
rblocks 3 774.335 258.1117 2.53 0.1533
cblocks 3 133.425 44.4750 0.44 0.7349
fertilizer 3 1489.400 496.4667 4.87 0.0476
error 6 611.100 101.8500
Example 7.5. Dunn and Clark (1974, p. 129) examine a study of four fertilizers on yields of wheat. The row blocks were 4 types of wheat. The column blocks were 4 plots of land. Each plot was divided into 4 subplots and a Latin square design was used. (To illustrate the inference for Latin square designs, ignore the fact that the data had an outlier. Case 14 had a yield of 64.5 while the next highest yield was 35.5. For the response plot in Figure 7.3, note that both Y and Ŷ are large for the high yield. Also note that Ŷ underestimates Y by about 10 for this case.)
a) Were the row blocks useful? Explain briefly.
b) Were the column blocks useful? Explain briefly.
c) Do an appropriate 4 step test.
Solution:
a) No, prow = 0.1533 > 0.1.
b) No, pcol = 0.7349 > 0.1.
c) i) H0: μ1 = μ2 = μ3 = μ4  HA: not H0
ii) F0 = 4.87
iii) pval = 0.0476
iv) Reject H0. The mean yield depends on the fertilizer level.
Figure 7.3 was made with the following commands using the lregpack func-
tion aovplots.
rblocks <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
cblocks <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
fertilizer <- c(1,2,3,4,2, 3, 4, 1, 3, 4, 1, 2, 4, 1, 2, 3)
yield <- c(35.5,24.5,14.7,35.5, 14.4, 6.2, 13.7, 24.5, 14.1,
16.2, 34.3, 19.7, 15.0, 64.5, 34.6, 19.0)
rblocks <- factor(rblocks)
cblocks <- factor(cblocks)
fertilizer <- factor(fertilizer)
dcls <- data.frame(yield,rblocks,cblocks,fertilizer)
rm(yield,rblocks,cblocks,fertilizer)
Fig. 7.3 Latin Square Data
attach(dcls)
z <- aov(yield~rblocks+cblocks+fertilizer)
summary(z)
aovplots(Y=yield,FIT=fitted(z),RES=resid(z))
#right click Stop twice, drag the plots to make them square
detach(dcls)
Remark 7.5. The Latin square model is additive, but the model is often
incorrectly used to study nuisance factors that can interact. Factorial or
fractional factorial designs should be used when interaction is possible.
Example 7.6. In the social sciences, often a blocking factor is time: the
levels are a time slots. Following Cobb (1998, p. 254), a Latin square design
was used to study the response Y = blood sugar level, where the row blocks
were 4 rabbits, the column blocks were 4 time slots, and the treatments were
4 levels of insulin. Label the rabbits as I, II, III, and IV; the dates as 1, 2, 3,
4; and the 4 insulin levels i1 < i2 < i3 < i4 as 1, 2, 3, 4. Suppose the random
permutation for the rabbits was 3, 1, 4, 2; the permutation for the dates 1,
4, 3, 2; and the permutation for the insulin levels was 2, 3, 4, 1. Then i2 is
the northwest corner of the square gets B = variety 2, the northeast corner
gets D = variety 4, the southwest corner gets A = variety 3, the southeast
corner gets C = variety 5, et cetera.
7.4 Summary
Source     df              SS    MS    F               p-value
Blocks     b − 1           SSB   MSB   Fblock          pblock
Treatment  k − 1           SSTR  MSTR  F0 = MSTR/MSE   pval for Ho
Error      (k − 1)(b − 1)  SSE   MSE
6) Rule of thumb: If pblock ≥ 0.1, then blocking was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking was useful.
7) The response, residual, and transformation plots for CRBD models are
used almost in the same way as for the one and two way Anova model, but all
of the dot plots have sample size m = 1. Look for the plotted points falling
in roughly evenly populated bands about the identity line and r = 0 line.
8) The block response scatterplot plots blocks versus the response.
The plot will have b dot plots of size k with a symbol corresponding to the
treatment. Dot plots with clearly dierent means suggest that blocking was
useful. A symbol pattern within the blocks suggests that the response depends
on the factor level.
9) Shown is an ANOVA table for the Latin square model given in sym-
bols. Sometimes Error is replaced by Residual, or Within Groups.
Sometimes rblocks and cblocks are replaced by the blocking factor name.
Sometimes p-value is replaced by P, P r(> F ), or PR > F.
Source      df              SS    MS    F               p-value
rblocks     a − 1           SSRB  MSRB  Frow            prow
cblocks     a − 1           SSCB  MSCB  Fcol            pcol
treatments  a − 1           SSTR  MSTR  Fo = MSTR/MSE   pval
Error       (a − 1)(a − 2)  SSE   MSE
10) Let pblock be prow or pcol. Rule of thumb: If pblock ≥ 0.1, then blocking was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking was useful.
11) The ANOVA F test for the Latin square design with a treatments is nearly the same as the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = ··· = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fa−1,(a−1)(a−2) > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends on the factor level. Otherwise fail to reject Ho and conclude that the mean response does not depend on the factor level. Give a nontechnical sentence.
12) The response, residual, and transformation plots for Latin square de-
signs are used almost in the same way as for the one and two way Anova
models, but all of the dot plots have sample size m = 1. Look for the plotted
points falling in roughly evenly populated bands about the identity line and
r = 0 line.
13) The randomization is done in 3 steps. Draw 3 random permutations
of 1, . . . , a. Use the 1st permutation to randomly assign row block levels to
the numbers 1, . . . , a. Use the 2nd permutation to randomly assign column
block levels to the numbers 1, . . . , a. Use the 3rd permutation to randomly
assign treatment levels to the 1st a letters (A, B, C, and D if a = 4).
14) Graphical Anova for the completely randomized block design
makes a dot plot of the scaled block deviations √(k − 1) β̂j = √(k − 1)(ȳ0j0 − ȳ000)
on top, a dot plot of the scaled treatment deviations (effects)
√(b − 1) τ̂i = √(b − 1)(ȳi00 − ȳ000) in the middle, and a dot plot of the
residuals on the bottom. Here k is the number of treatments and b is the
number of blocks.
15) Graphical Anova uses the residuals as a reference distribution. Suppose
the dot plot of the residuals looks good. Rules of thumb: i) An effect is
marginally significant if its scaled deviation is as big as the biggest residual
or as negative as the most negative residual. ii) An effect is significant if it is
well beyond the minimum or maximum residual. iii) Blocking was effective
if at least one scaled block deviation is beyond the range of the residuals.
iv) The treatments are different if at least one scaled treatment effect is
beyond the range of the residuals. (These rules depend on the number of
residuals n. If n is very small, say 8, then the scaled effect should be well
beyond the range of the residuals to be significant. If n is 40, the value
of the minimum residual and the value of the maximum residual correspond
to a 1/40 + 1/40 = 1/20 = 0.05 critical value for significance.)
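Below is a minimal R sketch (my own illustration, not the lregpack ganova2 function mentioned in Section 7.5) of the plot described in points 14) and 15). The scalings follow point 14); the commented-out call assumes a data frame pen with columns yield, block, and treat, as in the penicillin data of the problems.
ganova.sketch <- function(y, block, treat) {
  block <- factor(block); treat <- factor(treat)
  k <- nlevels(treat); b <- nlevels(block)
  grand <- mean(y)
  bdev <- sqrt(k - 1) * (tapply(y, block, mean) - grand)  # scaled block deviations
  tdev <- sqrt(b - 1) * (tapply(y, treat, mean) - grand)  # scaled treatment deviations
  res <- residuals(aov(y ~ block + treat))                # reference distribution
  stripchart(list(blocks = bdev, treatments = tdev, residuals = res),
             method = "stack", xlab = "scaled deviations and residuals")
}
# ganova.sketch(pen$yield, pen$block, pen$treat)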
7.5 Complements
Box et al. (2005, pp. 150–156) explain Graphical Anova for the CRBD and
why randomization combined with blocking often makes the iid error assumption
hold to a reasonable approximation.
It is easier to see model deficiencies if the response and residual plots are
square. In R, drag the plots so the plots look square. Matched pairs tests are
a special case of CRBD with k = 2.
The R package granova may be useful for graphical Anova. It is available
from (http://streaming.stat.iastate.edu/CRAN/) and authored by R.M.
Pruzek and J.E. Helmreich. Also see Hoaglin et al. (1991).
A randomization test has H0: the different treatments have no effect.
This null hypothesis is also true if, within each block, all k pdfs are from the
same location family. Let j = 1, . . . , b index the b blocks. There are b pdfs, one
for each block, that come from the same location family but possibly different
location parameters: fZ(y − μ0j). Let A be the treatment factor with k levels
ai. Then Yij|(A = ai) ∼ fZ(y − μ0j) where j is fixed and i = 1, . . . , k.
Thus the levels ai have no effect on the response, and the Yij are iid within
each block if H0 holds. Note that there are k! ways to assign Y1j, . . . , Ykj
to the k treatments within each block. An impractical randomization test
uses all M = [k!]^b ways of assigning responses to treatments. Let F0 be the
usual CRBD F statistic. The F statistic is computed for each of the M
permutations and H0 is rejected if the proportion of the M F statistics that
are larger than F0 is less than δ. The distribution of the M F statistics is
approximately F_{k−1,(k−1)(b−1)} for large n under H0. The randomization test
and the usual CRBD F test also have the same power, asymptotically. See
Hoeffding (1952) and Robinson (1973). These results suggest that the usual
CRBD F test is semiparametric: the pvalue is approximately correct if n is
large and if all k pdfs Yij|(A = ai) ∼ fZ(y − μ0j) are the same for each block
where j is fixed and i = 1, . . . , k. If H0 does not hold, then there are kb pdfs
Yij|(A = ai) ∼ fZ(y − μij) from the same location family. Hence the location
parameter depends on both the block and treatment.
Olive (2014, section 9.3) shows that practical randomization tests that
use a random sample of max(1000, [n log(n)]) randomizations have level and
power similar to the tests that use all M possible randomizations. Here each
randomization uses b randomly drawn permutations of 1, . . . , k.
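A hedged R sketch of such a practical randomization test (my own illustration): permute the responses within each block, recompute the CRBD F statistic for treatments, and estimate the pvalue. The data frame dat below is simulated purely for illustration.
set.seed(1)
b <- 5; k <- 3
dat <- data.frame(block = factor(rep(1:b, each = k)),
                  treat = factor(rep(1:k, times = b)),
                  y = rnorm(b * k))
Ftreat <- function(d) anova(aov(y ~ block + treat, data = d))["treat", "F value"]
F0 <- Ftreat(dat)                          # observed CRBD F statistic
n <- nrow(dat)
N <- max(1000, ceiling(n * log(n)))        # number of random within-block permutations
Fperm <- replicate(N, {
  d <- dat
  d$y <- ave(d$y, d$block, FUN = sample)   # permute responses within each block
  Ftreat(d)
})
mean(Fperm >= F0)                          # randomization pval for Ho: no treatment effect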
Hunter (1989) discusses some problems with the Latin square design.
Welch (1990) suggests that the ANOVA F test is not a good approxima-
tion for the permutation test for the Latin square design.
7.6 Problems
7.4. This problem is for a one way block design and uses data from Box
et al. (2005, p. 146).
a) Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). Print out the out-
put but only turn in the ANOVA table, residual plot, and response plot.
b) Do the plots look ok?
c) Copy the SAS data into Minitab much as done for Problem 6.4. Right
below C1 type block, below C2 type treat, and below C3 type yield.
d) Select Stat>ANOVA>Two-way, select C3 yield as the response and
C1 block as the row factor and C2 treat as the column factor. Click on
Fit additive model, click on Store residuals, and click on Store fits.
Then click on OK.
e) block response scatterplot: Use the file commands Edit>Command
Line Editor and write the following lines in the window.
GSTD
LPLOT yield vs block codes for treat
f) Click on the submit commands box and print the plot. Click on the
output and then click on the printer icon.
g) Copy (http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Type the following commands to get the following ANOVA table.
z<-aov(yield~block+treat,pen)
summary(z)
7.5. This problem is for a Latin square design and uses data from Box
et al. (2005, pp. 157–160).
Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Click on the output and use the menu commands Edit>Select All
and Edit>Copy. In Word use the menu command Paste, then use the
left mouse button to highlight the first page of output. Then use the menu
command Cut. Then there should be one page of output including the
ANOVA table. Print out this page.
b) Copy the data for this problem from
(http://lagrange.math.siu.edu/Olive/lregdata.txt)
into R. Use the following commands to create a residual plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
z<-aov(emissions~rblocks+cblocks+additives,auto)
summary(z)
plot(fitted(z),resid(z))
title("Residual Plot")
abline(0,0)
c) Use the following commands to create a response plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
attach(auto)
FIT <- auto$emissions - z$resid
plot(FIT,auto$emissions)
title("Response Plot")
abline(0,1)
detach(auto)
d) Do the plots look ok?
e) Were the column blocks useful? Explain briey.
f) Were the row blocks useful? Explain briey.
g) Do an appropriate 4 step test.
7.6. Obtain the Box et al. (2005, p. 146) penicillin data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) and the R pro-
gram ganova2 from (http://lagrange.math.siu.edu/Olive/lregpack.
txt). The program does graphical Anova for completely randomized block
designs.
a) Copy and paste the R commands for this problem into R. Include the
plot in Word by simultaneously pressing the Ctrl and c keys, then using the
menu command Paste in Word.
b) Blocking seems useful because some of the scaled block deviations are
outside of the spread of the residuals. The scaled treatment deviations are in
the middle of the plot. Do the treatments appear to be significantly different?
Chapter 8
Orthogonal Designs
Orthogonal designs for factors with two levels can be fit using least squares.
The orthogonality of the contrasts allows each coefficient to be estimated
independently of the other variables in the model.
This chapter covers 2^k factorial designs, 2^{k−f}_R fractional factorial designs,
and Plackett Burman PB(n) designs. The entries in the design matrix X are
either −1 or 1. The columns of the design matrix X are orthogonal: cᵢᵀcⱼ = 0
for i ≠ j where cᵢ is the ith column of X. Also cᵢᵀcᵢ = n, and the absolute
values of the column entries sum to n.
The first column of X is 1, the vector of ones, but the remaining columns
of X are the coefficients of a contrast. Hence the ith column cᵢ has entries
that are −1 or 1, and the entries of the ith column cᵢ sum to 0 for i > 1.
Factorial designs are a special case of the k way Anova designs of Chapter 6,
and these designs use factorial crossing to compare the effects (main effects,
pairwise interactions, . . . , k-fold interaction) of the k factors. If A1, . . . , Ak are
the factors with li levels for i = 1, . . . , k, then there are l1 l2 · · · lk treatments
where each treatment uses exactly one level from each factor. The sample
size n = m ∏_{i=1}^k li ≥ m 2^k. Hence the sample size grows exponentially fast
with k. Often the number of replications m = 1.
Often each run is expensive, for example, in industry and medicine. A goal
is to improve the product in terms of higher quality or lower cost. Often the
subject matter experts can think of many factors that might improve the
product. The number of runs n is minimized by taking li = 2 for i = 1, . . . , k.
Rule of thumb 8.1. Do not spend more than 25% of the budget on the
initial experiment. It may be a good idea to plan for four experiments, each
taking 25% of the budget.
Definition 8.3. Recall that a contrast C = ∑_{i=1}^p di μi where ∑_{i=1}^p di = 0,
and the estimated contrast is Ĉ = ∑_{i=1}^p di Ȳi0 where μi and Ȳi0 are
appropriate population and sample means. In a table of contrasts, the
coefficients di of the contrast are given where a − corresponds to −1 and a +
corresponds to 1. Sometimes a column I corresponding to the overall mean
is given where each entry is a +. The column corresponding to I is not a
contrast.
To make a table of contrasts there is a rule for main effects and a rule for
interactions.
 I  A  B  C  AB  AC  BC  ABC    y
 +  -  -  -  +   +   +   -     ȳ1110
 +  +  -  -  -   -   +   +     ȳ2110
 +  -  +  -  -   +   -   +     ȳ1210
 +  +  +  -  +   -   -   -     ȳ2210
 +  -  -  +  +   -   -   +     ȳ1120
 +  +  -  +  -   +   -   -     ȳ2120
 +  -  +  +  -   -   +   -     ȳ1220
 +  +  +  +  +   +   +   +     ȳ2220
 divisor 8  4  4  4   4   4   4
The table of contrasts for a 2^4 design is shown on the following page. The
column of ones corresponding to I was omitted. Again rows correspond to
runs and the levels of the main effects A, B, C, and D completely specify the
run. The first row of the table corresponds to the low levels of A, B, C, and
D. In the second row, the level of A is high while B, C, and D are low. Note
that the interactions are obtained by multiplying the component columns
where + = 1 and − = −1. Hence the first row of the column corresponding
to the ABC entry is (−)(−)(−) = −.
Randomization for a 2^k design: The runs are determined by the levels
of the k main effects in the table of contrasts. So a 2^3 design is determined by
the levels of A, B, and C. Similarly, a 2^4 design is determined by the levels
of A, B, C, and D. Randomly assign units to the m2^k runs. Often the units
are time slots. If possible, perform the m2^k runs in random order.
Genuine run replicates need to be used. A common error is to take m
measurements per run, and act as if the m measurements are from m runs.
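As a small illustration that I am adding (not from the text), the table of contrasts for a 2^3 design can be built in R with expand.grid, the interaction columns obtained by multiplying the main effect columns, and the run order randomized with sample():
d <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))  # the 8 runs in standard order
d$AB <- d$A * d$B; d$AC <- d$A * d$C; d$BC <- d$B * d$C     # interaction contrasts
d$ABC <- d$A * d$B * d$C
m <- 2                                  # replications per run
run.order <- sample(rep(1:8, m))        # random order for the m * 2^3 = 16 runs
d[run.order, c("A", "B", "C")]          # levels of A, B, C for each run, in run order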
Definition 8.4. If the response depends on the two levels of the factor,
then the factor is called active. If the response does not depend on the two
levels of the factor, then the factor is called inert.
Active factors appear to change the mean response as the level of the factor
changes from −1 to 1. Inert factors do not appear to change the response as
the level of the factor changes from −1 to 1. An inert factor could be needed
but the level low or high is not important, or the inert factor may not be
needed and so can be omitted from future studies. Often subject matter
experts can tell whether the inert factor is needed or not.
The 2^k designs are used for exploratory data analysis: they provide
answers to the following questions.
i) Which combinations of levels are best?
ii) Which factors are active and which are inert? That is, use the 2^k design
to screen for factors where the response depends on whether the level is high
or low.
iii) How should the levels be modified to improve the response?
If all 2^k runs give roughly the same response, then choose the levels that
are cheapest to increase profit. Also the system tends to be robust to changes
in the factor space, so managers do not need to worry about the exact values
of the levels of the factors.
In an experiment, there will be an interaction between management, sub-
ject matter experts (often engineers), and the data analyst (statistician).
Remark 8.1. If m = 1, then there is one response per run but k main
effects, (k choose 2) 2 factor interactions, . . . , (k choose j) j factor interactions,
. . . , and 1 k way interaction. Then the MSE df = 0 unless at least one high
order interaction is assumed to be zero. A full model that includes all k main
effects and all (k choose 2) two way interactions is a useful starting point for
response, residual, and transformation plots. The higher order interactions
can be treated as potential terms and checked for significance. As a rule of
thumb, significant interactions tend to involve significant main effects.
Rule of thumb 8.2. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case
is an outlier if it is well beyond these 2 lines. This rule often fails for large
outliers since often the identity line goes through or near a large outlier so
its residual is near zero. Often such outliers are still far from the bulk of
the data, and there will be a gap in the response plot (along the identity
line) separating the bulk of the data from the outliers. Such gaps appear in
Figures 3.7, 3.10b) (in an FF plot), 3.11, and 7.3 where the gap would be
easier to see if the plot was square. A response that is far from the bulk of
the data in the response plot is a large outlier (large in magnitude).
Rule of thumb 8.3. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
Definition 8.6. A critical mix is a single combination of levels, out of
the 2^k, that gives good results. Hence a critical mix produces good outliers (or a
single outlier if m = 1).
Be able to pick out active and inert factors and good (or the best) combinations
of factors (cells or runs) from the table of contrasts = table of runs.
Often the table will only contain the contrasts for the main effects. If high
values of the response are desirable, look for high values of ȳ for m > 1. If
m = 1, then ȳ = y. The following two examples help illustrate the process.
 O  H  C    y
 -  -  -   5.9
 +  -  -   4.0
 -  +  -   3.9
 +  +  -   1.2
 -  -  +   5.3
 +  -  +   4.8
 -  +  +   6.3
 +  +  +   0.8
Solution: a) The two lowest values of y are 0.8 and 1.2, which correspond
to + + + and + + −. (Note that if the 1.2 was 4.2, then + + + corresponding
to 0.8 would be a critical mix.)
b) C would be inert since O and H should be at their high + levels.
run  R  T  C  D    y
 1   -  -  -  -   14
 2   +  -  -  -   16
 3   -  +  -  -    8
 4   +  +  -  -   22
 5   -  -  +  -   19
 6   +  -  +  -   37
 7   -  +  +  -   20
 8   +  +  +  -   38
 9   -  -  -  +    1
10   +  -  -  +    8
11   -  +  -  +    4
12   +  +  -  +   10
13   -  -  +  +   12
14   +  -  +  +   30
15   -  +  +  +   13
16   +  +  +  +   30
Suppose the model using all of the columns of X is used. If some columns
are removed (e.g. those corresponding to the insignificant effects), then for
2^k designs the following quantities remain unchanged for the terms that were
not deleted: the effects, the coefficients, and SS(effect) = MS(effect). The
MSE, SE(effect), F and t statistics, pvalues, fitted values, and residuals do
change.
The regression equation corresponding to the significant effects (e.g. found
with a QQ plot of Definition 8.9) can be used to form a reduced model. For
example, suppose the full (least squares) fitted model is Ŷi = β̂0 + β̂1 xi1 +
β̂2 xi2 + β̂3 xi3 + β̂12 xi12 + β̂13 xi13 + β̂23 xi23 + β̂123 xi123. Suppose the A, B, and
AB effects are significant. Then the reduced (least squares) fitted model is
Ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂12 xi12 where the coefficients (β̂s) for the reduced
model can be taken from the full model since the 2^k design is orthogonal.
The coefficient β̂0 corresponding to I is equal to the I effect, but the
coefficient β̂ of a factor x corresponding to an effect is β̂ = 0.5 effect. Consider
significant effects and assume interactions can be ignored.
i) If a large response Y is desired and β̂ > 0, use x = 1. If β̂ < 0, use
x = −1.
ii) If a small response Y is desired and β̂ > 0, use x = −1. If β̂ < 0, use
x = 1.
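A minimal R sketch of this point (my own illustration, using simulated, hypothetical data for a replicated 2^3 design): the coefficients of the terms kept in the reduced model equal their full model values because the columns are orthogonal.
set.seed(1)
d <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
dat <- d[rep(1:8, each = 2), ]                      # m = 2 replications
dat$y <- 10 + 3 * dat$A + 2 * dat$B + 1.5 * dat$A * dat$B + rnorm(nrow(dat))
full <- lm(y ~ A * B * C, data = dat)               # all main effects and interactions
red  <- lm(y ~ A + B + A:B, data = dat)             # keep the A, B, and AB terms
cbind(full = coef(full)[c("(Intercept)", "A", "B", "A:B")], reduced = coef(red))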
Rule of thumb 8.4. To predict Y with Ŷ, the number of coefficients =
the number of β̂s in the model should be ≤ n/2, where the sample size n =
the number of runs. Otherwise the model is overfitting.
From the regression equation Ŷ = xᵀβ̂, be able to predict Y given x. Be
able to tell whether x = 1 or x = −1 should be used. Given the x values
of the main effects, get the x values of the interactions by multiplying the
columns corresponding to the main effects.
Least squares output in symbols is shown below. Often Estimate is replaced
by Coef or Coefficient. Often Intercept is replaced by Constant.
The t statistic and pvalue are for whether the term or effect is significant.
So t12 and p12 are for testing whether the x12 term or AB effect is significant.
The least squares coefficient = 0.5 (effect). The sum of squares for an x
corresponding to an effect is equal to SS(effect).
SE(coef) = SE(β̂) = 0.5 SE(effect) = √(MSE/n). Also SE(β̂0) = √(MSE/n).
Example 8.3. a) The biggest possible model for the 2^3 design is Y =
β0 + β1 x1 + β2 x2 + β3 x3 + β12 x12 + β13 x13 + β23 x23 + β123 x123 + e with least
squares fitted or predicted values given by Ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂3 xi3 +
β̂12 xi12 + β̂13 xi13 + β̂23 xi23 + β̂123 xi123.
The second order model is Y = β0 + β1 x1 + β2 x2 + β3 x3 + β12 x12 + β13 x13 +
β23 x23 + e. The main effects model is Y = β0 + β1 x1 + β2 x2 + β3 x3 + e.
b) A typical least squares output for the 2^3 design with m = 2 is shown
below. Often Estimate is replaced by Coef.
Residual Standard Error=2.8284 = sqrt(MSE)
R-Square=0.9763 F-statistic (df=7, 8)=47.054 pvalue=0
There are several advantages to least squares over 2^k software. The disadvantage
of the following four points is that the design will no longer be
orthogonal: the estimated coefficients and hence the estimated effects will
depend on the terms in the model. i) If there are several missing values or
outliers, delete the corresponding rows from the design matrix X and the
vector of responses y as long as the number of rows of the design matrix ≥
the number of columns. ii) If the exact quantitative levels are not observed,
replace them by the observed levels cx in the design matrix. iii) If the wrong
levels are used in a run, replace the corresponding row in the design matrix
by a row corresponding to the levels actually used. iv) The number of
replications per run i can be mi, that is, we do not need mi ≡ m.
If the number of replications m ≥ 2, then the standard error for the effect is
SE(effect) = √( MSE / (m 2^{k−2}) ).     (8.2)
Sometimes MSE is replaced by σ̂².
SE(mean) = √( MSE / (m 2^k) )     (8.3)
where m 2^k = n, m ≥ 2, and sometimes MSE is replaced by σ̂².
The sum of squares for an effect is also the mean square for the effect since
df = 1.
MS(effect) = SS(effect) = m 2^{k−2} (effect)²     (8.4)
for m ≥ 1.
A 95% confidence interval (CI) for an effect is
effect ± t_{dfe,0.975} SE(effect)
where dfe is the MSE degrees of freedom. Use t_{dfe,0.975} ≈ z_{0.975} = 1.96 if
dfe > 30. The effect is significant if the CI does not contain 0, while the effect
is not significant if the CI contains 0.
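To make (8.2)–(8.4) and the confidence interval concrete, here is a short R check that I am adding, using the Example 8.4 cell means and MSE = 0.54 with m = 3 that appear later in this section:
ybar <- c(6.333, 4.667, 9, 6.667, 4.333, 2.333, 7.333, 4.667)  # cell means, standard order
A <- rep(c(-1, 1), 4)                       # contrast column for A
k <- 3; m <- 3; MSE <- 0.54
effA  <- sum(A * ybar) / 2^(k - 1)          # -2.166, the A effect
SEeff <- sqrt(MSE / (m * 2^(k - 2)))        # 0.3, equation (8.2)
MSA   <- m * 2^(k - 2) * effA^2             # 28.16, equation (8.4)
dfe   <- (m - 1) * 2^k                      # 16
effA + c(-1, 1) * qt(0.975, dfe) * SEeff    # 95% CI for the A effect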
Source   df           SS      MS      F      p-value
A        1            SSA     MSA     FA     pA
B        1            SSB     MSB     FB     pB
C        1            SSC     MSC     FC     pC
AB       1            SSAB    MSAB    FAB    pAB
AC       1            SSAC    MSAC    FAC    pAC
BC       1            SSBC    MSBC    FBC    pBC
ABC      1            SSABC   MSABC   FABC   pABC
Error    (m − 1)2^k   SSE     MSE
One can use t statistics for effects with t0 = effect/SE(effect) ∼ t_dfe where dfe
is the MSE df. Then t0² = MS(effect)/MSE = F0 ∼ F_{1,dfe}.
Source                                   df           SS            MS        F        p-value
k main effects                           1 each       e.g. SSA =    MSA       FA       pA
(k choose 2) 2 factor interactions       1 each       e.g. SSAB =   MSAB      FAB      pAB
(k choose 3) 3 factor interactions       1 each       e.g. SSABC =  MSABC     FABC     pABC
  ...                                     ...          ...           ...       ...      ...
(k choose k−1) k−1 factor interactions
the k factor interaction                 1            SSA···L =     MSA···L   FA···L   pA···L
Error                                    (m − 1)2^k   SSE           MSE
 I  A  B  C  AB  AC  BC  ABC     y
 +  -  -  -  +   +   +   -     6.333
 +  +  -  -  -   -   +   +     4.667
 +  -  +  -  -   +   -   +     9.0
 +  +  +  -  +   -   -   -     6.667
 +  -  -  +  +   -   -   +     4.333
 +  +  -  +  -   +   -   -     2.333
 +  -  +  +  -   -   +   -     7.333
 +  +  +  +  +   +   +   +     4.667
 divisor 8  4  4  4   4   4   4
Fig. 8.1 Normal QQ plot of the effects
The lregpack functions twocub and twofourth can be used to find the effects,
SE(effect), and QQ plots for 2^3 and 2^4 designs. If m = 1, the twofourth
function also makes the response and residual plots based on the second order
model for 2^4 designs.
For the data in Example 8.4, the output below and on the following page
shows that the A and C effects have values −2.166 and −2.000 while the B
effect is 2.500. These are the three significant effects shown in the QQ plot
in Figure 8.1. The two commands below produced the output.
z<-c(6.333,4.667,9,6.667,4.333,2.333,7.333,4.667)
twocub(z,m=3,MSE=0.54)
$Aeff
[1] -2.16625
$Beff
[1] 2.50025
$Ceff
[1] -2.00025
$ABeff
[1] -0.33325
$ACeff
[1] -0.16675
$BCeff
[1] 0.16675
$ABCeff
[1] 0.00025
$MSA
[1] 28.15583
$MSB
[1] 37.5075
$MSC
[1] 24.006
$MSAB
[1] 0.6663334
$MSAC
[1] 0.1668334
$MSABC
[1] 3.75e-07
$MSE
[1] 0.54
$SEeff
[1] 0.3
Factorial designs are expensive since n = m2^k when there are k factors and m
replications. A fractional factorial design uses n = m2^{k−f} where f is defined
below, and so costs much less. Such designs can be useful when the higher
order interactions are not significant.
Definition 8.10. A 2^{k−f}_R fractional factorial design has k factors and
takes m2^{k−f} runs where the number of replications m is usually 1. The design
is an orthogonal design and each factor has two levels low = −1 and high =
1. R is the resolution of the design.
Remark 8.2. A 2^{k−f}_R design has no q factor interaction (or main effect for
q = 1) confounded with any other effect consisting of less than R − q factors.
So a 2^{k−f}_III design has R = 3 and main effects are confounded with 2 factor
interactions. In a 2^{k−f}_IV design, R = 4 and main effects are not confounded
with 2 factor interactions but 2 factor interactions are confounded with other
2 factor interactions. In a 2^{k−f}_V design, R = 5 and main effects and 2 factor
interactions are only confounded with 4 and 3 way or higher interactions
respectively. The R = 4 and R = 5 designs are good because the 3 way and
higher interactions are rarely significant, but these designs are more expensive
than the R = 3 designs.
In a 2^{k−f}_R design, each effect is confounded or aliased with 2^f − 1 other
effects. Thus the Mth main effect is really an estimate of the Mth main effect
plus 2^f − 1 other effects. If R ≥ 3 and none of the two factor interactions are
significant, then the Mth main effect is typically a useful estimator of the
population Mth main effect.
Rule of thumb 8.8. Main effects tend to be larger than q factor interaction
effects, and the lower order interaction effects tend to be larger than
the higher order interaction effects. So two way interaction effects tend to be
larger than three way interaction effects.
Rule of thumb 8.9. Significant interactions tend to have significant
component main effects. Hence if A, B, C, and D are factors, B and D are
inert and A and C are active, then the AC effect is the two factor interaction
most likely to be active. If only A was active, then the two factor interactions
containing A (AB, AC, and AD) are the ones most likely to be active.
Suppose each run costs $1000 and m = 1. The 2^k factorial designs need 2^k
runs while fractional factorial designs need 2^{k−f} runs. These designs use the
fact that three way and higher interactions tend to be inert for experiments.
2^3           A  B  C  AB   AC   BC   ABC
2^{4−1}_IV    A  B  C  AB+  AC+  BC+  D
2^{5−2}_III   A  B  C  D    E    BC+  BE+
2^{6−3}_III   A  B  C  D    E    F    AF+
2^{7−4}_III   A  B  C  D    E    F    G
Consider the designs given in Remarks 8.3 and 8.4. Least squares estimates
for the 2^{k−f}_R designs with ko = 3 use the design matrix corresponding to a 2^3
design while the designs with ko = 4 use the design matrix corresponding to
the 2^4 design given in Section 8.1.
Randomly assign units to runs. Do runs in random order if possible. In in-
dustry, units are often time slots (periods of time), so randomization consists
Assume none of the interactions are significant. Then the 2^{7−4}_III fractional
factorial design allows estimation of 7 main effects in 2^3 = 8 runs. The 2^{15−11}_III
fractional factorial design allows estimation of 15 main effects in 2^4 = 16 runs.
The 2^{31−26}_III fractional factorial design allows estimation of 31 main effects in
2^5 = 32 runs.
Fractional factorial designs with k − f = ko can be fit with software meant
for 2^{ko} designs. Hence the lregpack functions twocub and twofourth can
be used for the ko = 3 and ko = 4 designs that use the standard table
of contrasts. The response and residual plots given by twofourth are not
appropriate, but the QQ plot and the remaining output are relevant. Some
of the interactions will correspond to main effects for the fractional factorial
design.
For example, if the Example 8.4 data was from a 2^{4−1}_IV design, then the
A, B, and C effects would be the same, but the D effect is the effect labelled
ABC. So the D effect ≈ 0.
Fig. 8.2 Normal QQ plot of the effects for Example 8.5
Example 8.5. Montgomery (1984, pp. 344–346) gives data from a 2^{7−4}_III
design with the QQ plot shown in Figure 8.2. The goal was to study eye focus
time with factors A = sharpness of vision, B = distance of target from eye,
C = target shape, D = illumination level, E = target size, F = target density,
and G = subject. The lregpack function twocub gave the effects above.
a) What is the D effect?
b) What effects are significant?
Solution: By the last line in the table given in Remark 8.3, note that for
this design, A, B, C, AB, AC, BC, ABC correspond to A, B, C, D, E, F, G. So
the AB effect from the output is the D effect.
8.3 Plackett Burman Designs
 I  A  B  C  AB  AC  BC  ABC     y
 +  -  -  -  +   +   +   -     86.8
 +  +  -  -  -   -   +   +     85.9
 +  -  +  -  -   +   -   +     79.4
 +  +  +  -  +   -   -   -     60.0
 +  -  -  +  +   -   -   +     94.6
 +  +  -  +  -   +   -   -     85.4
 +  -  +  +  -   -   +   -     84.5
 +  +  +  +  +   +   +   +     80.3
used to screen k main effects when the number of runs n is small. Often
k = n − 4, n − 3, n − 2, or n − 1 is used. We will assume that the number of
replications m = 1.
A contrast matrix for the PB(12) design is shown below. Again the column
of plusses corresponding to I is omitted. If k = 8, then effects A to H are
used but effects J, K, and L are empty. As a convention, the mean square
and sum of squares for factor E will be denoted as MSe and SSe while MSE
= σ̂².
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
The PB(n) designs are k factor 2 level orthogonal designs. So finding
quantities such as effects, MS, SS, least squares estimates, et cetera for PB(n)
designs is similar to finding the corresponding quantities for the 2^k and 2^{k−f}_R
designs. Randomize units (often time slots) to runs and least squares can be
used.
Remark 8.6. For the PB(n) design, let c be a column from the table of
contrasts where + = 1 and − = −1. Let y be the column of responses since
m = 1. If k < n − 1, pool the last J = n − 1 − k empty effects into the
MSE with df = J as the full model. This procedure is done before looking
at the data, so is not data snooping. The MSE can also be given or found
by pooling insignificant MSs into the MSE, but the latter method uses data
snooping. This pooling needs to be done if k = n − 1 since then there is no
df for MSE. The following formulas ignore the I effect.
a) The effect corresponding to c is effect = cᵀy/(n/2) = 2cᵀy/n.
b) The standard error for the effect is SE(effect) = √(MSE/(n/4)) = √(4 MSE/n).
c) SE(mean) = √(MSE/n).
d) The sum of squares and mean sum of squares for an effect is
MS(effect) = SS(effect) = (n/4)(effect)².
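A short R sketch of a) and d) that I am adding (the response vector y below is hypothetical, since no PB(12) responses are listed at this point):
cA <- c(1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1)  # column A of the PB(12) matrix above
y  <- c(3, 5, 2, 6, 7, 8, 4, 3, 2, 6, 4, 1)        # hypothetical responses, m = 1
n  <- length(y)                                    # 12 runs
effA <- 2 * sum(cA * y) / n                        # effect corresponding to column A
MSa  <- (n / 4) * effA^2                           # MS(effect) = SS(effect)
# with k < n - 1 factors, MSE comes from pooling the MSs of the empty columns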
Fig. 8.3 Normal QQ plot of the effects for the PB(12) data of Example 8.7
For the PB(n) design, the least squares coefficient = 0.5 (effect). The sum
of squares for an x corresponding to an effect is equal to SS(effect). SE(coef)
= SE(β̂) = 0.5 SE(effect) = √(MSE/n). Also SE(β̂0) = √(MSE/n).
Example 8.7. Shown below is least squares output using PB(12) data
from Ledolter and Swersey (2007, pp. 244–256). There were k = 10 factors so
the MSE has 1 df and there are too many terms in the model. In this case the
QQ plot shown in Figure 8.3 is more reliable for finding significant effects.
a) Which effects, if any, appear to be significant from the QQ plot?
b) Let the reduced model be Ŷ = β̂0 + β̂r1 xr1 + · · · + β̂rj xrj where j is the
number of significant terms found in a). Write down the reduced model.
c) Want large Y. Using the model in b), choose the x values that will give
large Ŷ, and predict Y.
Estimate Std.Err t-value Pr(>|t|)
Intercept 6.7042 2.2042 3.0416 0.2022
c1 8.5792 2.2042 3.8922 0.1601
c2 -1.7958 2.2042 -0.8147 0.5648
c3 2.3125 2.2042 1.0491 0.4847
c4 4.1208 2.2042 1.8696 0.3127
c5 3.1542 2.2042 1.4310 0.3883
c6 -3.3958 2.2042 -1.5406 0.3665
c7 0.9542 2.2042 0.4329 0.7399
c8 -1.1208 2.2042 -0.5085 0.7005
c9 1.3125 2.2042 0.5955 0.6581
c10 1.7875 2.2042 0.8110 0.5662
Solution: a) The most significant effects are either in the top right or
bottom left corner. Although the points do not all scatter closely about the
line, the point in the bottom left is not significant. So none of the effects
corresponding to the bottom left of the plot are significant. A is the significant
effect with value 2(8.5792) = 17.1584. See the top right point of Figure 8.3.
b) Ŷ = 6.7042 + 8.5792x1.
c) Ŷ = 6.7042 + 8.5792(1) = 15.2834.
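A two line R check of the prediction in c), using the coefficients from the output above:
b0 <- 6.7042; b1 <- 8.5792   # intercept and c1 coefficient from the least squares output
b0 + b1 * 1                  # 15.2834; take x1 = 1 since b1 > 0 and a large Y is wanted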
The lregpack function pb12 can be used to find effects and MS(effect) for
PB(12) data. Least squares output and a QQ plot are also given.
8.4 Summary
for m ≥ 1.
7) If a single run out of the 2^k cells gives good values for the response, then
that run is called a critical mix.
8) A factor is active if the response depends on the two levels of the factor,
and is inert otherwise.
9) Randomization for a 2^k design: randomly assign units to the m2^k runs.
The runs are determined by the levels of the k main effects in the table of
contrasts. So a 2^3 design is determined by the levels of A, B, and C. Similarly,
a 2^4 design is determined by the levels of A, B, C, and D. Perform the m2^k
runs in random order if possible.
10) A table of contrasts for a 2^3 design is shown below. The first column
is for the mean and is not a contrast. The last column corresponds to the
cell means. Note that ȳ1110 = y111 if m = 1. So ȳ might be replaced by y if
m = 1.
 I  A  B  C  AB  AC  BC  ABC    y
 +  -  -  -  +   +   +   -     ȳ1110
 +  +  -  -  -   -   +   +     ȳ2110
 +  -  +  -  -   +   -   +     ȳ1210
 +  +  +  -  +   -   -   -     ȳ2210
 +  -  -  +  +   -   -   +     ȳ1120
 +  +  -  +  -   +   -   -     ȳ2120
 +  -  +  +  -   -   +   -     ȳ1220
 +  +  +  +  +   +   +   +     ȳ2220
 divisor 8  4  4  4   4   4   4
11) Be able to pick out active and inert factors and good (or the best)
combinations of factors (cells or runs) from the table of contrasts = table of
runs.
12) Plotted points far away from the identity line and r = 0 line are
potential outliers, but often the identity line goes through or near an outlier
that is large in magnitude. Then the case has a small residual. Look for gaps
in the response and residual plots.
13) A 95% confidence interval (CI) for an effect is
effect ± t_{dfe,0.975} SE(effect)
where dfe is the MSE degrees of freedom. Use t_{dfe,0.975} ≈ z_{0.975} = 1.96 if
dfe > 30. The effect is significant if the CI does not contain 0, while the effect
is not significant if the CI contains 0.
14) Suppose there is no replication so m = 1. Find J interaction mean
squares that are small compared to the bulk of the mean squares. Add them
up (pool them) to make MSE with dfe = J, so MSE = (sum of the J pooled
mean squares)/J.
Source   df           SS      MS      F      p-value
A        1            SSA     MSA     FA     pA
B        1            SSB     MSB     FB     pB
C        1            SSC     MSC     FC     pC
AB       1            SSAB    MSAB    FAB    pAB
AC       1            SSAC    MSAC    FAC    pAC
BC       1            SSBC    MSBC    FBC    pBC
ABC      1            SSABC   MSABC   FABC   pABC
Error    (m − 1)2^k   SSE     MSE
18) Below is the ANOVA table for a 2^k design. For A, use H0: μ100 =
μ200. The other main effects have similar null hypotheses. For interaction,
use H0: no interaction. If m = 1, use a procedure similar to point 14) for
exploratory purposes.
Source                                   df           SS            MS        F        p-value
k main effects                           1 each       e.g. SSA =    MSA       FA       pA
(k choose 2) 2 factor interactions       1 each       e.g. SSAB =   MSAB      FAB      pAB
(k choose 3) 3 factor interactions       1 each       e.g. SSABC =  MSABC     FABC     pABC
  ...                                     ...          ...           ...       ...      ...
(k choose k−1) k−1 factor interactions
the k factor interaction                 1            SSA···L =     MSA···L   FA···L   pA···L
Error                                    (m − 1)2^k   SSE           MSE
2^3           A  B  C  AB   AC   BC   ABC
2^{4−1}_IV    A  B  C  AB+  AC+  BC+  D
2^{5−2}_III   A  B  C  D    E    BC+  BE+
2^{6−3}_III   A  B  C  D    E    F    AF+
2^{7−4}_III   A  B  C  D    E    F    G
44) Suppose the full model using all of the columns of X is used. If some
columns are removed (e.g. those corresponding to the insignificant effects),
then for the orthogonal designs in point 43) the following quantities remain
unchanged for the terms that were not deleted: the effects, the coefficients,
and SS(effect) = MS(effect).
50) The least squares coefficient = 0.5 (effect). The sum of squares for an
x corresponding to an effect is equal to SS(effect). SE(coef) = SE(β̂) = 0.5
SE(effect) = √(MSE/n). Also SE(β̂0) = √(MSE/n).
51) The Plackett Burman PB(n) designs have k factors where 2 ≤ k ≤
n − 1. The factors have 2 levels and orthogonal contrasts like the 2^k and 2^{k−f}_R
designs. The PB(n) designs are resolution 3 designs, but the confounding of
main effects with 2 factor interactions is complex. The PB(n) designs use n
runs where n is a multiple of 4. The values n = 12, 20, 24, 28, and 36 are
especially common.
52) The PB(n) designs are usually used with main effects, so assume that all
interactions are insignificant. So they are main effects designs used to screen k
main effects when the number of runs n is small. Often k = n − 4, n − 3, n − 2,
or n − 1 is used. We will assume that the number of replications m = 1.
53) If k = n − 1 there is no df for MSE. If k < n − 1, pool the last
J = n − 1 − k empty effects into the MSE with df = J as the full model.
This procedure is done before looking at the data, so is not data snooping.
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
54) The contrast matrix for the PB(12) design is shown above. Again the
column of plusses corresponding to I is omitted. If k = 8, then effects A to
H are used but effects J, K, and L are empty. As a convention, the mean
square and sum of squares for factor E will be denoted as MSe and SSe while
MSE = σ̂².
55) The PB(n) designs are k factor 2 level orthogonal designs. So finding
effects, MS, SS, least squares estimates, et cetera for PB(n) designs is similar
to finding the corresponding quantities for the 2^k and 2^{k−f}_R designs.
56) For the PB(n) design, let c be a column from the table of contrasts
where + = 1 and − = −1. Let y be the column of responses since m = 1.
For k < n − 1, MSE can be found for the full model as in 53). MSE can also
be given or found by pooling insignificant MSs into the MSE, but the latter
method uses data snooping.
a) The effect corresponding to c is effect = cᵀy/(n/2) = 2cᵀy/n.
b) The standard error for the effect is SE(effect) = √(MSE/(n/4)) = √(4 MSE/n).
c) SE(mean) = √(MSE/n).
d) The sum of squares and mean square for an effect is
MS(effect) = SS(effect) = (n/4)(effect)².
57) For the PB(n) design, the least squares coefficient = 0.5 (effect). The
sum of squares for an x corresponding to an effect is equal to SS(effect).
SE(coef) = SE(β̂) = 0.5 SE(effect) = √(MSE/n). Also SE(β̂0) = √(MSE/n).
8.5 Complements
Box et al. (2005) and Ledolter and Swersey (2007) are excellent references
for k factor 2 level orthogonal designs.
Suppose it is desired to increase the response Y and that A, B, C, . . . are
the k factors. The main effects for A, B, . . . measure
∂Y/∂A, ∂Y/∂B,
et cetera. The interaction effect AB measures
∂²Y/(∂A ∂B).
Hence
∂Y/∂A ≈ 0, ∂Y/∂B ≈ 0, and ∂²Y/(∂A ∂B) large
8.6 Problems
8.1. From the above least squares output, what is the AB effect?
 I  A  B  C  AB  AC  BC  ABC     Y
 +  -  -  -  +   +   +   -     3.81
 +  +  -  -  -   -   +   +     4.28
 +  -  +  -  -   +   -   +     3.74
 +  +  +  -  +   -   -   -     4.10
 +  -  -  +  +   -   -   +     3.75
 +  +  -  +  -   +   -   -     3.66
 +  -  +  +  -   -   +   -     3.82
 +  +  +  +  +   +   +   +     3.68
 I  A  B  C  AB  AC  BC  ABC     y
 +  -  -  -  +   +   +   -     86.8
 +  +  -  -  -   -   +   +     85.9
 +  -  +  -  -   +   -   +     79.4
 +  +  +  -  +   -   -   -     60.0
 +  -  -  +  +   -   -   +     94.6
 +  +  -  +  -   +   -   -     85.4
 +  -  +  +  -   -   +   -     84.5
 +  +  +  +  +   +   +   +     80.3
8.4. Suppose that for 2^3 data with m = 2, the MSE = 407.5625. Find
SE(effect).
 I  A  B  C  AB  AC  BC  ABC     y
 +  -  -  -  +   +   +   -     63.6
 +  +  -  -  -   -   +   +     76.8
 +  -  +  -  -   +   -   +     60.3
 +  +  +  -  +   -   -   -     80.3
 +  -  -  +  +   -   -   +     67.2
 +  +  -  +  -   +   -   -     71.3
 +  -  +  +  -   -   +   -     68.3
 +  +  +  +  +   +   +   +     74.3
 divisor 8  4  4  4   4   4   4
 I  A  B  C  AB  AC  BC  ABC     y
 +  -  -  -  +   +   +   -     32
 +  +  -  -  -   -   +   +     35
 +  -  +  -  -   +   -   +     28
 +  +  +  -  +   -   -   -     31
 +  -  -  +  +   -   -   +     48
 +  +  -  +  -   +   -   -     39
 +  -  +  +  -   -   +   -     28
 +  +  +  +  +   +   +   +     29
 divisor 8  4  4  4   4   4   4
8.7. Suppose the B effect = 5, SE(effect) = 2, and dfe = 8.
i) Find a 95% confidence interval for the B effect.
ii) Is the B effect significant? Explain briefly.
R (along with 1 SAS and 1 Minitab) Problems
8.8. Copy the Box et al. (2005, p. 199) product development data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Then type the following commands.
8.9. Get the SAS program for this problem from (http://lagrange.math.siu.
edu/Olive/lreghw.txt). The data is the pilot plant example from Box et al.
(2005, pp. 177–186). The response variable is Y = yield, while the three
predictors (T = temp, C = concentration, K = catalyst) are at two levels.
a) Print out the output but do not turn in the first page.
b) Do the residual and response plots look ok?
8.10. Get the data for this problem. The data is the pilot plant example
from Box et al. (2005, pp. 177–186) examined in Problem 8.9. Minitab needs
the levels for the factors and the interactions.
Highlight the data and use the menu commands Edit>Copy. In Minitab,
use the menu command Edit>PasteCells. After a window appears, click on
ok.
Below C1 type A, below C2 type B, below C3 type C and below
C8 type yield.
a) Use the menu command STAT>ANOVA>Balanced Anova put
yield in the responses box and
A|B|C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
b) Next highlight the bottom 8 lines and use the menu commands
Edit>Delete Cells. Then the data set does not have replication. Use the
menu command STAT>ANOVA>Balanced Anova put yield in the re-
sponses box and
A B C A*C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
(The model A|B|C would have resulted in an error message, not enough
data.)
c) Print the output by clicking on the top window and then clicking on
the printer icon.
d) Make a response plot with the menu commands Graph>Plot with
yield in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
e) Make a residual plot with the menu commands Graph>Plot with
RESI2 in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
f) Do the plots look ok?
8.11. Get the R code and data for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). The data is the pilot plant
example from Box et al. (2005, pp. 177–186) examined in Problems 8.9 and
8.10.
a) Copy and paste the code into R. Then copy and paste the output into
Notepad. Print out the page of output.
b) The least squares estimate = coefficient for x1 is half the A effect. So
what is the A effect?
This is the Ledolter and Swersey (2007, p. 80) cracked pots 2^4 data and
the response and residual plots are from the model without 3 and 4 factor
interactions.
b) Copy the plots into Word and print the plots. Do the response and
residual plots look ok?
8.15. Download lregpack into R. The data is the PB(12) example from
Box et al. (2005, p. 287).
a) Type the following commands. Copy and paste the QQ plot into Word
and print the plot.
b) Copy and paste the output into Notepad and print the output.
c) As a 2^5 design, the effects B, D, BD, E, and DE were thought to be real.
The PB(12) design works best when none of the interactions is significant.
From the QQ plot and the output for the PB(12) design, which factors, if
any, appear to be significant?
d) The output gives the A, B, C, D, and E effects along with the
corresponding least squares coefficients β̂1, . . . , β̂5. What is the relationship
between the coefficients and the effects?
For parts e) to g), act as if the PB(12) design with 5 factors is
appropriate.
e) The full model has Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3 + β̂4 x4 + β̂5 x5. The
reduced model is Ŷ = β̂0 + β̂j xj where xj is the significant term found in c).
Give the numerical formula for the reduced model.
f) Compute Ŷ using the full model if xi = 1 for i = 1, . . . , 5. Then compute
Ŷ using the reduced model if xj = 1.
g) If the goal of the experiment is to produce large values of Y, should
xj = 1 or xj = −1 in the reduced model? Explain briefly.
Chapter 9
More on Experimental Designs
This chapter considers split plot designs briefly in Section 9.1 and reviews the
ten designs considered in Chapters 5 through 8 in Section 9.2. The one and
two way Anova designs, completely randomized block design, and split plot
designs are the building blocks for more complicated designs. Some split plot
designs can be written as a linear model, Y = xᵀβ + e, but the errors are
dependent with a complicated correlation structure.
Denition 9.1. Split plot designs have two units. The large units are
called whole plots and contain blocks of small units called subplots. The
whole plots get assigned to factor A while the subplots get assigned to factor
B (randomly if the units are experimental but not randomly if the units are
observational). A and B are crossed so the AB interaction can be studied.
The split plot design depends on how whole plots are assigned to A. Three
common methods are described below, and methods a) and b) are described
in more detail in the following subsections. The randomization and split plot
ANOVA table depend on the design used for assigning the whole plots to
factor A.
a) The whole plots are assigned to A completely at random, as in a one
way Anova.
b) The whole plots are assigned to A and to a blocking variable as in a
completely randomized block design (if the whole plots are experimental, but
a complete block design is used if the whole plots are observational).
c) The whole plots are assigned to A, to row blocks, and to column blocks
as in a Latin square.
The key feature of a split plot design is that there are two units of different
sizes: one size for each of the 2 factors of interest. The larger units are assigned
to A. The large units contain blocks of small units assigned to factor B. Also
factors A and B are crossed.
Shown below is the split plot ANOVA table when the whole plots are assigned
to factor A as in a one way Anova design. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = M SA/M SEW, FB = M SB/M SES, and FAB = M SAB/M SES. R
computes the three test statistics and pvalues correctly, but for SAS FA and
the pvalue pA need to be computed using MSA, MSEW, dfA , and dfeW ob-
tained from the ANOVA table. Sometimes error(W) is also denoted as
residuals. There are ma whole plots, and each whole plot contains b sub-
plots. Thus there are mab subplots. As always, the pvalue column actually
gives pval, an estimate of the pvalue.
Source                  df               SS    MS    F    p-value
A                       a − 1            SSA   MSA   FA   pA
error(W) or A*repl      a(m − 1)         SSEW  MSEW
B                       b − 1            SSB   MSB   FB   pB
AB                      (a − 1)(b − 1)   SSAB  MSAB  FAB  pAB
residuals or error(S)   a(m − 1)(b − 1)  SSES  MSES
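A hedged R sketch of fitting this model with aov and an Error() term (my own illustration; the data frame spdat, the whole plot factor wplot, and the responses are all hypothetical):
set.seed(1)
a <- 3; b <- 4; m <- 2                          # a levels of A, m whole plots per level of A
spdat <- expand.grid(B = factor(1:b), wplot = factor(1:(a * m)))
spdat$A <- factor(rep(1:a, each = b * m))       # whole plots 1-2 get A = 1, 3-4 get A = 2, ...
spdat$y <- rnorm(nrow(spdat))                   # hypothetical responses
out <- aov(y ~ A * B + Error(wplot), data = spdat)
summary(out)   # A is tested against the whole plot error; B and A:B against error(S)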
The tests of interest for this split plot design are nearly identical to those of
a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b, and k = 1, . . . , m.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA: there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main effects is
i) Ho: μ100 = · · · = μa00 HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)
Source df SS MS F p-value
variety 7 763.16 109.02 1.232 0.3421
MSEW 16 1415.83 88.49
treatment 3 30774.3 10258.1 423.44 0.00
variety*treatment 21 2620.1 124.8 5.150 0.00
error(S) 48 1162.8 24.2
Example 9.1. This split plot data is from Chambers and Hastie (1993,
p. 158). There were 8 varieties of guayule (rubber plant) and 4 treatments
were applied to seeds. The response was the rate of germination. The whole
plots were greenhouse flats and the subplots were 4 subplots of the flats. Each
flat received seeds of one variety (A). Each subplot contained 100 seeds and
was treated with one of the treatments (B). There were m = 3 replications
so each variety was planted in 3 flats for a total of 24 flats and 4(24) = 96
observations.
Factorial crossing: Variety and treatments (A and B) are crossed since all
combinations of variety and treatment occur. Hence the AB interaction can
be measured.
Blocking: The whole plots are the 24 greenhouse flats. Each flat is a block
of 4 subplots. Each of the 4 subplots gets one of the 4 treatments.
Randomization: The 24 flats are assigned to the 8 varieties completely at
random. Use the sample(24) command to generate a random permutation.
The first 3 numbers of the permutation get variety one, the next 3 get variety
2, . . . , the last 3 get variety 8. Use the sample(4) command 24 times, once
for each flat. If 2, 4, 1, 3 was the permutation for the ith flat, then the 1st
subplot gets treatment 3, the 2nd gets treatment 1, the 3rd gets treatment
4, and the 4th subplot gets treatment 2.
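A short R sketch of this randomization (my own illustration of the sample() calls described above):
set.seed(1)
flat.perm <- sample(24)                    # random permutation of the 24 flats
variety <- rep(1:8, each = 3)              # first 3 flats in the permutation get variety 1, ...
cbind(flat = flat.perm, variety = variety)
trt <- replicate(24, sample(4))            # one random treatment arrangement per flat
trt[, 1]                                   # random assignment of the 4 treatments in flat 1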
Fail to reject Ho, the mean rate of germination does not depend on variety.
(This test would make more sense if there was no variety * treatment
interaction.)
b) Ho: μ010 = · · · = μ040 Ha: not Ho
FB = 423.44
pval = 0.00
Reject Ho, the mean rate of germination depends on treatment.
c) Ho: no interaction Ha: there is an interaction
FAB = 5.15
pval = 0.00
Reject Ho, there is a variety * treatment interaction.
Shown below is the split plot ANOVA table when the whole plots are
assigned to factor A and a blocking variable as in a completely random-
ized block design. The whole plot error is error(W) and can be obtained as
a block*A interaction. The subplot error is error(S). FA = M SA/M SEW,
FB = M SB/M SES, and FAB = M SAB/M SES. Factor A has a levels
and factor B has b levels. There are r blocks of a whole plots. Each whole
plot contains b subplots, and each block contains a whole plots and thus ab
subplots. Hence there are ra whole plots and rab subplots.
SAS computes the last two test statistics and pvalues correctly, and the
last line of SAS output gives FA and the pvalue pA . The initial line of output
for A is not correct. The output for blocks is probably not correct.
Source                df                SS    MS    F    p-value
blocks                r − 1
A                     a − 1             SSA   MSA   FA   pA
error(W) or block*A   (r − 1)(a − 1)    SSEW  MSEW
B                     b − 1             SSB   MSB   FB   pB
AB                    (a − 1)(b − 1)    SSAB  MSAB  FAB  pAB
error(S)              a(r − 1)(b − 1)   SSES  MSES
The tests of interest for this split plot design are nearly identical to those
of a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b and k = 1, . . . , r.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA: there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main effects is
i) Ho: μ100 = · · · = μa00 HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)
c) The 4 step test for B main effects is
i) Ho: μ010 = · · · = μ0b0 HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B. (Or there is not enough evidence to
conclude that the response depends on the level of B.)
Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028
Example 9.2. The ANOVA table above is for the Snedecor and Cochran
(1967, pp. 369–372) split plot data where the whole plots are assigned to
factor A and to blocks in a completely randomized block design. Factor A =
variety of alfalfa (ladak, cossack, ranger). Each field had two cuttings, with
the second cutting on July 7, 1943. Factor B = date of third cutting (none,
Sept. 1, Sept. 20, Oct. 7) in 1943. The response variable was yield (tons per
acre) in 1944. The 6 blocks were fields of land divided into 3 plots of land,
one for each variety. Each of these 3 plots was divided into 4 subplots for
date of third cutting. So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Warning: Although the split plot model can be written as a linear model,
the errors are not iid and have a complicated correlation structure. It is also
difficult to get fitted values and residuals from the software, so the model
can't easily be checked with response and residual plots. These facts make
the split plot model very hard to use for most researchers.
b) For a random effects one way Anova model, the levels are a random
sample from a population of levels.
Randomization: Let n = ∑_{i=1}^p mi and do the sample(n) command. Assign
the first m1 units to treatment (level) 1, the next m2 units to treatment 2,
. . . , the last mp units to treatment p.
II) Two way Anova: Factor A has a levels and factor B has b levels. The
two factors are crossed, forming ab cells.
Randomization: Let n = mab and do the sample(n) command. Randomly
assign m units to each of the ab cells. Assign the rst m units to the (A, B) =
(1, 1) cell, the next m units to the (1,2) cell, . . . , the last m units to the (a, b)
cell.
III) k way Anova: There are k factors A1, . . . , Ak with a1, . . . , ak levels,
respectively. The k factors are crossed, forming ∏_{i=1}^k ai cells.
Randomization: Let n = m ∏_{i=1}^k ai and do the sample(n) command.
Randomly assign m units to each cell. Each cell is a combination of levels, so the
(1, 1, . . . , 1, 1) cell gets the 1st m units.
IV) Completely randomized block design: Factor A has k levels (treat-
ments), and there are b blocks (a blocking variable has b levels) of k units.
Randomization: Let n = kb and do the sample(k) command b times.
Within each block of k units, randomly assign 1 unit to each treatment.
V) Latin squares: Factor A has a levels (treatments), the row blocking
variable has a blocks of a units, and the column blocking variable has a blocks
of a units. There are a² units since the row and column blocking variables are
crossed. The treatment factor, row blocking variable, and column blocking
variable are also crossed. A Latin square is such that each of the a treatments
occurs once in each row and once in each column.
Randomization: Pick an a × a Latin square. Use the sample(a) command
to assign row levels to numbers 1 to a. Use the sample(a) command to assign
column levels to numbers 1 to a. Use the sample(a) command to assign
treatment levels to the first a capital letters. If possible, use the sample(a²)
command to assign units, 1 unit to each cell of the Latin square.
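A minimal R sketch of this randomization for a = 4 (my own illustration; it starts from a cyclic Latin square rather than a randomly chosen one):
a <- 4
std <- outer(1:a, 1:a, function(i, j) ((i + j - 2) %% a) + 1)  # cyclic a x a Latin square
square <- matrix(LETTERS[std], a, a)   # each of A, ..., D once in each row and column
rows <- sample(a)                      # row block level assigned to each row of the square
cols <- sample(a)                      # column block level assigned to each column
trts <- sample(a)                      # treatment level assigned to each letter A, ..., D
square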
VI) 2^k factorial design: There are k factors, each with 2 levels.
Randomization: Let n = m2^k and do the sample(n) command. Randomly
assign m units to each cell. Each cell corresponds to a run which is determined
by a string of k +s and −s corresponding to the k main effects.
VII) 2^{k−f}_R fractional factorial design: There are k factors, each with 2
levels.
Randomization: Let n = 2^{k−f} and do the sample(n) command. Randomly
assign 1 unit to each run, which is determined by a string of k +s and −s
corresponding to the k main effects.
VIII) Plackett Burman PB(n) design: There are k factors, each with 2
levels.
Try to become familiar with the designs and their randomization so that
you can recognize a design given a story problem.
9.3 Summary
1) The analysis of the response, not that of the residuals, is of primary importance.
The response plot can be used to analyze the response in the background
of the fitted model. For linear models such as experimental designs,
the estimated mean function is the identity line and should be added as a
visual aid.
2) Assume that the residual degrees of freedom are large enough for testing.
Then the response and residual plots contain much information. Linearity and
constant variance may be reasonable if the plotted points scatter about the
identity line in a (roughly) evenly populated band. Then the residuals should
scatter about the r = 0 line in an evenly populated band. It is easier to check
linearity with the response plot and constant variance with the residual plot.
Curvature is often easier to see in a residual plot, but the response plot can
be used to check whether the curvature is monotone or not. The response plot
is more effective for determining whether the signal to noise ratio is strong
or weak, and for detecting outliers, influential cases, or a critical mix.
3) The three basic principles of DOE (design of experiments) are
i) use randomization to assign units to treatments.
ii) Use factorial crossing to compare the effects (main effects, pairwise
interactions, . . . , J-fold interaction) for J ≥ 2 factors. If A1, . . . , AJ are the
factors with li levels for i = 1, . . . , J, then there are l1 l2 · · · lJ treatments
where each treatment uses exactly one level from each factor.
iii) Blocking is used to divide units into blocks of similar units where
similar means the units are likely to have similar values of the response
when given the same treatment. Within each block randomly assign units to
treatments.
4) Split plot designs have two units. The large units are called whole plots
and contain blocks of small units called subplots. The whole plots get assigned
to factor A while the subplots get assigned to factor B (randomly if the units
are experimental but not randomly if the units are observational). A and B
are crossed so the AB interaction can be studied.
5) The split plot design depends on how whole plots are assigned to A.
Three common methods are a) the whole plots are assigned to A completely
at random, as in a one way Anova, b) the whole plots are assigned to A
and to a blocking variable as in a completely randomized block design (if the
whole plots are experimental, a complete block design is used if the whole
plots are observational), c) the whole plots are assigned to A, to row blocks,
and to column blocks as in a Latin square.
6) The split plot ANOVA table when whole plots are assigned to levels of
A as in a one way Anova is shown below. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = M SA/M SEW, FB = M SB/M SES, and FAB = M SAB/M SES. R
computes the three test statistics and pvalues correctly, but for SAS FA
and the pvalue pA need to be computed using MSA, MSEW, dfA , and dfeW
obtained from the ANOVA table.
Source                  df               SS    MS    F    p-value
A                       a − 1            SSA   MSA   FA   pA
error(W) or A*repl      a(m − 1)         SSEW  MSEW
B                       b − 1            SSB   MSB   FB   pB
AB                      (a − 1)(b − 1)   SSAB  MSAB  FAB  pAB
residuals or error(S)   a(m − 1)(b − 1)  SSES  MSES

Source                df                SS    MS    F    p-value
blocks                r − 1
A                     a − 1             SSA   MSA   FA   pA
error(W) or block*A   (r − 1)(a − 1)    SSEW  MSEW
B                     b − 1             SSB   MSB   FB   pB
AB                    (a − 1)(b − 1)    SSAB  MSAB  FAB  pAB
error(S)              a(r − 1)(b − 1)   SSES  MSES
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.
9.4 Complements
9.5 Problems
Source df SS MS F p-value
Block 2 77.55 38.78
Method 2 128.39 64.20 7.08 0.0485
Block*Method 4 36.28 9.07
Temp 3 434.08 144.69 41.94 0.00
Method*Temp 6 75.17 12.53 2.96 0.0518
error(S) 12 50.83 4.24
9.1. The ANOVA table above is for the Montgomery (1984, pp. 386–389)
split plot data where the whole plots are assigned to factor A and to blocks
in a completely randomized block design. The response variable is tensile
strength of paper. Factor A is (preparation) method with 3 levels (1, 2, 3).
Factor B is temperature with 4 levels (200, 225, 250, 275). The pilot plant
can make 12 runs a day and the experiment is repeated each day, with days
as blocks. A batch of pulp is made by one of the 3 preparation methods. Then
the batch of pulp is divided into 4 samples, and each sample is cooked at one
of the four temperatures.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Source df SS MS F p-value
Block 1 0.051 0.051
Nitrogen 3 37.32 12.44 29.62 0.010
Block*Nitrogen 3 1.26 0.42
Thatch 2 3.82 1.91 9.10 0.009
Nitrogen*Thatch 6 4.15 0.69 3.29 0.065
error(S) 12 1.72 0.21
9.2. The ANOVA table above is for the Kuehl (1994, pp. 473–481) split
plot data where the whole plots are assigned to factor A and to blocks in
a completely randomized block design. The response variable is the average
chlorophyll content (mg/gm of turf grass clippings). Factor A is nitrogen
fertilizer with 4 levels (1, 2, 3, 4). Factor B is length of time that thatch was
allowed to accumulate with 3 levels (2, 5, or 8 years).
There were 2 blocks of 4 whole plots to which the levels of factor A were
assigned. The 2 blocks formed a golf green which was seeded with turf grass.
The 8 whole plots were plots of golf green. Each whole plot had 3 subplots
to which the levels of factor B were randomly assigned.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028
9.3. The ANOVA table above is for the Snedecor and Cochran (1967, pp.
369–372) split plot data where the whole plots are assigned to factor A and to
blocks in a completely randomized block design. Factor A = variety of alfalfa
(ladak, cossack, ranger). Each eld had two cuttings, with the second cutting
on July 7, 1943. Factor B = date of third cutting (none, Sept. 1, Sept. 20,
Oct. 7) in 1943. The response variable was yield (tons per acre) in 1944. The
6 blocks were elds of land divided into 3 plots of land, one for each variety.
Each of these 3 plots was divided into 4 subplots for date of third cutting.
So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
attach(steel)
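# whole plot factor = heat, subplot factor = coating;
# Error(wplots) puts the whole plot (furnace run) error in its own stratum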
out<-aov(resistance~heat*coating + Error(wplots),steel)
summary(out)
detach(steel)
This split plot steel data is from Box et al. (2005, p. 336). The whole plots
are time slots to use a furnace, which can hold 4 steel bars at one time. Factor
A = heat has 3 levels (360, 370, and 380°F). Factor B = coating has 4 levels
(4 types of coating: c1, c2, c3, and c4). The response was corrosion resistance.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
9.7. This is the same data as in Problem 9.6, using SAS. Copy and paste
the SAS program from (http://lagrange.math.siu.edu/Olive/lrsashw.txt)
into SAS, run the program, then print the output. Only include the second
page of output.
To get the correct F statistic for heat, you need to divide MS heat by MS
wplots.
f(z | μ, Σ)
where the ith row of W is xTi and the jth column is v j . Each column v j of
W corresponds to a variable. For example, the data may consist of n visitors
to a hospital where the p = 2 variables height and weight of each individual
were measured.
There are some differences in the notation used in multiple linear regression
and multivariate location dispersion models. Notice that W could be used
as the design matrix in multiple linear regression, although usually the first
column of the regression design matrix is a vector of ones. The n × p design
matrix in the multiple linear regression model was denoted by X, and x_i^T
was the ith row of X. In the multivariate location dispersion model, X and
X_i will be used to denote a p × 1 random vector with observed value x_i,
and x_i^T is the ith row of the data matrix W. Johnson and Wichern (1988,
pp. 7, 53) uses X to denote both the n × p data matrix and a random
vector, relying on the context to indicate whether X is a random vector or
data matrix. Software tends to use different notation. For example, R will
use commands such as
var(x)
to compute the sample covariance matrix of the data. Hence x corresponds
to W, x[,1] is the first column of x, and x[4, ] is the 4th row of x.
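For example, a minimal sketch with simulated (hypothetical) data where n = 100 and p = 2:

x <- matrix(rnorm(100*2), ncol = 2)  # plays the role of the data matrix W
var(x)    # 2 x 2 sample covariance matrix
x[, 1]    # first column = first variable
x[4, ]    # fourth row = fourth case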
and
E(AX) = AE(X) and E(AXB) = AE(X)B. (10.3)
Thus
Cov(a + AX) = Cov(AX) = A Cov(X) A^T.   (10.4)
Some important properties of multivariate normal (MVN) distributions are
given in the following three propositions. These propositions can be proved
using results from Johnson and Wichern (1988, pp. 127–132).

Cov(X) = Σ.

X_1 | X_2 = x_2 ~ N_q( μ_1 + Σ_{12} Σ_{22}^{-1}(x_2 − μ_2),  Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} ).
Example 10.1. Let p = 2 and let (Y, X)T have a bivariate normal distri-
bution. That is,
(Y, X)^T ~ N_2( (μ_Y, μ_X)^T, Σ ) where Σ has diagonal entries σ_Y^2 and σ_X^2
and off diagonal entries Cov(Y, X) = Cov(X, Y).

f(x, y) = ½ f_1(x, y) + ½ f_2(x, y) where

f_i(x, y) = [1/(2π √(1 − ρ^2))] exp( −(x^2 ∓ 2ρxy + y^2) / (2(1 − ρ^2)) )

(with the minus sign for i = 1 and the plus sign for i = 2),
where x and y are real and 0 < ρ < 1. Since both marginal distributions
of f_i(x, y) are N(0,1) for i = 1 and 2 by Proposition 10.2 a), the marginal
distributions of X and Y are N(0,1). Since ∫∫ x y f_i(x, y) dx dy = ρ for i = 1
and −ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent
since f(x, y) ≠ f_X(x) f_Y(y).
E(X) = μ   (10.7)
and
Cov(X) = c_X Σ   (10.8)
where
c_X = −2ψ′(0).

U ≡ D^2 = D^2(μ, Σ) = (X − μ)^T Σ^{-1} (X − μ).   (10.9)

h(u) = [π^{p/2} / Γ(p/2)] k_p u^{p/2 − 1} g(u).   (10.10)

E(X | B^T X) = μ + M_B B^T(X − μ) = a_B + M_B B^T X   (10.11)
where
a_B = μ − M_B B^T μ = (I_p − M_B B^T) μ,
and
M_B = Σ B (B^T Σ B)^{-1}.
See Problem 10.11. Notice that in the formula for M_B, Σ can be replaced
by cΣ where c > 0 is a constant. In particular, if the EC distribution has
2nd moments, Cov(X) can be used instead of Σ.
b) Even if the first moment does not exist, the conditional median

MED(Y | X) = α + β^T X

where α and β are given in a).

Then B^T Σ B = Σ_XX and

Σ B = ( Σ_YX )
      ( Σ_XX ).

Now

E[ (Y, X^T)^T | X ] = E[ (Y, X^T)^T | B^T (Y, X^T)^T ]

= μ + Σ B (B^T Σ B)^{-1} B^T ( (Y, X^T)^T − μ )

by Lemma 10.4. The right-hand side of the last equation is equal to

( μ_Y ) + ( Σ_YX ) Σ_XX^{-1} (X − μ_X) = ( μ_Y − Σ_YX Σ_XX^{-1} μ_X + Σ_YX Σ_XX^{-1} X )
( μ_X )   ( Σ_XX )                       (                   X                      )

X ~ (1 − γ) N_p(μ, Σ) + γ N_p(μ, cΣ)

where c > 0 and 0 < γ < 1. Since the multivariate normal distribution is
elliptically contoured,

E(X | B^T X) = (1 − γ)[μ + M_1 B^T(X − μ)] + γ[μ + M_2 B^T(X − μ)]

= μ + [(1 − γ) M_1 + γ M_2] B^T(X − μ) ≡ μ + M B^T(X − μ).

Since M_B only depends on B and Σ, it follows that M_1 = M_2 = M =
M_B. Hence X has an elliptically contoured distribution by Lemma 10.4. See
Problem 10.4 for a related result.

for each point X_i. Notice that D_i^2 is a random variable (scalar valued),
and that the term Σ^{-1/2}(X − μ) is the p-dimensional analog to the z-score
used to transform a univariate N(μ, σ^2) random variable into an N(0, 1)
random variable. Hence the sample Mahalanobis distance D_i = √(D_i^2) is an
analog of the absolute value |Z_i| of the sample z-score Z_i = (X_i − X̄)/σ̂. Also
notice that the Euclidean distance of x_i from the estimate of center T(W)
is D_i(T(W), I_p) where I_p is the p × p identity matrix.
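A minimal R sketch of the classical sample Mahalanobis distances, using simulated (hypothetical) data and the classical estimators (X̄, S):

x  <- matrix(rnorm(100*3), ncol = 3)                       # hypothetical data, p = 3
D2 <- mahalanobis(x, center = colMeans(x), cov = var(x))   # squared distances D_i^2
D  <- sqrt(D2)                                             # classical distances D_i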
{x : (x − μ)^T Σ^{-1} (x − μ) ≤ χ^2_p(α)} = {x : D_x^2(μ, Σ) ≤ χ^2_p(α)}
T(W) = X̄ = (1/n) Σ_{i=1}^n X_i,

and

C(W) = S = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T

RD_i = D_i(T_A, C_A) for i = 1, . . . , n.
The DD plot can be used simultaneously as a diagnostic for whether the
data arise from a multivariate normal distribution or from another EC dis-
tribution with nonsingular covariance matrix. EC data will cluster about a
straight line through the origin; MVN data in particular will cluster about the
identity line. Thus the DD plot can be used to assess the success of numerical
transformations towards elliptical symmetry. This application is important
since many statistical methods assume that the underlying data distribution
is MVN or EC. If the plotted points do not cluster tightly about some line
through the origin, then the data may not have an EC distribution. Plotted
points that are far from the bulk of the plotted points tend to be outliers. A
DD plot of the continuous predictors is useful for detecting outliers in these
variables. See Example 3.14 and Section 3.6. For regression, the DD plot of
the residuals can be useful. See Chapter 12. The lregpack function ddplot4 will
make the DD plot using the robust RMVN estimator. See Olive and Hawkins
(2010) and Olive (2016c, ch. 5).
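A rough sketch of a DD plot in base R, using MASS::cov.rob only as a stand-in robust estimator (the text's ddplot4 uses the RMVN estimator, so the robust distances below are merely illustrative):

library(MASS)
x   <- matrix(rnorm(100*3), ncol = 3)                 # hypothetical clean data
MD  <- sqrt(mahalanobis(x, colMeans(x), var(x)))      # classical distances
rob <- cov.rob(x)                                     # a robust (T, C) estimator
RD  <- sqrt(mahalanobis(x, rob$center, rob$cov))      # robust distances
plot(MD, RD); abline(0, 1)   # MVN data should cluster about the identity line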
10.4 Complements
Johnson and Wichern (1988, 2007), Mardia et al. (1979), and Olive (2016c)
are good references for multivariate statistical analysis. The elliptically con-
toured distributions generalize the multivariate normal distribution and are
discussed in Johnson (1987). Cambanis et al. (1981), Chmielewski (1981),
and Eaton (1986) are also important references. Olive (2002) discussed uses
of the DD plot. Robust estimators are discussed in Olive (2004a, 2016c) and
Zhang et al. (2012).
10.5 Problems
a) If σ_{12} = 10, find E(Y | X).

10.5. In Proposition 10.5b, show that if the second moments exist, then
Σ can be replaced by Cov(X).

10.7. Using the notation in Proposition 10.6, show that if the second
moments exist, then

Σ_{XX}^{-1} Σ_{XY} = [Cov(X)]^{-1} Cov(X, Y).
10.8. Using the notation under Lemma 10.4, show that if X is elliptically
contoured, then the conditional distribution of X 1 given that X 2 = x2 is
also elliptically contoured.
constant vector.
10.10. Recall that Cov(X, Y) = E[(X − E(X))(Y − E(Y))^T]. Using the
notation of Proposition 10.6, let (Y, X^T)^T be EC_{p+1}(μ, Σ, g) where Y is a
random variable. Let the covariance matrix of (Y, X^T)^T be

Cov((Y, X^T)^T) = c ( Σ_{YY}  Σ_{YX} )  =  ( VAR(Y)     Cov(Y, X) )
                    ( Σ_{XY}  Σ_{XX} )     ( Cov(X, Y)  Cov(X)    ).

Show that E(Y | X) = α + β^T X where

α = μ_Y − β^T μ_X and
β = [Cov(X)]^{-1} Cov(X, Y).

10.11. (Due to R.D. Cook.) Let X be a p × 1 random vector with E(X) =
0 and Cov(X) = Σ. Let B be any constant full rank p × r matrix where
1 ≤ r ≤ p. Suppose that for all such conforming matrices B,

E(X | B^T X) = M_B B^T X
R Problems
10.12. a) Download the maha function that creates the classical Maha-
lanobis distances.
10.13. a) Assuming that you have done the two source commands above
Problem 10.12 (and the library(MASS) command), type the command
ddcomp(buxx). This will make 4 DD plots based on the DGK, FCH, FMCD,
and median ball estimators. The DGK and median ball estimators are the
two attractors used by the FCH estimator. With the leftmost mouse button,
move the cursor to each outlier and click. This data is the Buxton (1920)
data and cases with numbers 61, 62, 63, 64, and 65 were the outliers with
head lengths near 5 feet. After identifying the outliers in each plot, hold the
rightmost mouse button down and click on Stop to advance to the next plot.
When done, hold down the Ctrl and c keys to make a copy of the plot. Then
paste the plot in Word.
b) Repeat a) but use the command ddcomp(cbrainx). This data is the
Gladstone (1905) data and some infants are multivariate outliers.
c) Repeat a) but use the command ddcomp(museum[,-1]). This data is the
Schaaffhausen (1878) skull measurements and cases 48–60 were apes while
the first 47 cases were humans.
Theory for linear models is used to show that linear models have good
statistical properties. Linear model theory previously proved in the text in-
cludes Propositions 2.1, 2.2, 2.3, 2.10, 3.1, 3.2, 4.1, 4.2, and Theorem 3.3.
Some matrix manipulations are illustrated in Example 4.1. Unproved results
include Propositions 2.4, 2.5, 2.6, 2.11, Theorems 2.6, 2.7, and 2.8.
Warning: This chapter is much harder than the previous chapters. Often
a linear model theory course is taught at the Masters level.
Vector spaces, subspaces, and column spaces should be familiar from linear
algebra, but are reviewed below.
Two important vector spaces are Rk and V = {0}. Showing that a set M
is a subspace is a common method to show that M is a vector space.
The space spanned by the rows of A is the row space of A. The row space
of A is the column space C(A^T) of A^T. Note that

Aw = [a_1 a_2 . . . a_m] (w_1, . . . , w_m)^T = Σ_{i=1}^m w_i a_i.

With the design matrix X, different notation is used to denote the columns
of X since both the columns and the rows of X are important. Let

X = [v_1 v_2 . . . v_p]

be an n × p matrix with ith row x_i^T. Note that C(X) = {y ∈ R^n : y = Xb for
some b ∈ R^p}. Hence Xb is a typical element of C(X) and Aw is a typical
element of C(A). Note that

Xb = (x_1^T b, . . . , x_n^T b)^T = [v_1 v_2 . . . v_p] (b_1, . . . , b_p)^T = Σ_{i=1}^p b_i v_i.
Generalized inverses are useful for the non-full rank linear model and for
dening projection matrices.
Some results from linear algebra are needed to prove parts of the above
theorem. Unless told otherwise, matrices in this text are real. Then the
eigenvalues of a symmetric matrix A are real. If A is symmetric, then
rank(A) = number of nonzero eigenvalues of A. Recall that if AB is
a square matrix, then tr(AB) = tr(BA). Similarly, if A_1 is m_1 × m_2,
A_2 is m_2 × m_3, . . . , A_{k−1} is m_{k−1} × m_k, and A_k is m_k × m_1, then
tr(A_1 A_2 · · · A_k) = tr(A_k A_1 A_2 · · · A_{k−1}) = tr(A_{k−1} A_k A_1 A_2 · · · A_{k−2}) =
· · · = tr(A_2 A_3 · · · A_k A_1). Also note that a scalar is a 1 × 1 matrix, so
tr(a) = a.
For part d), note that if y = w + z, then (I_n − P_X)y = z ∈ [C(X)]^⊥.
Hence the result follows from the definition of a projection matrix by inter-
changing the roles of w and z. Part e) follows from the definition of a pro-
jection matrix since if y ∈ C(X), then y = y + 0 where y = w and 0 = z. If
y ∈ [C(X)]^⊥ then y = 0 + y where 0 = w and y = z. Part g) is a special case
of f). In k), P_X is singular unless p = n since rank(X) = r ≤ p < n unless
p = n, and P_X is an n × n matrix. We need rank(P_X) = n for P_X to be nonsin-
gular. For h), P_X P_{X_R} = P_{X_R} by f) since each column of P_{X_R} ∈ C(P_X).
Taking transposes and using symmetry shows P_{X_R} P_X = P_{X_R}. For i), if λ is
an eigenvalue of P_X, then for some x ≠ 0, λx = P_X x = P_X^2 x = λ^2 x
since P_X is idempotent by c). Hence λ = λ^2 is real since P_X is symmetric,
so λ = 0 or λ = 1. Then j) follows from i) since rank(P_X) = number of
nonzero eigenvalues of P_X = tr(P_X).
A^{-1} = T Λ^{-1} T^T = Σ_{i=1}^n (1/λ_i) t_i t_i^T.
The following theorem is often useful. Both the expected value and trace
are linear operators. Hence tr(A + B) = tr(A) + tr(B), and E[tr(X)] =
tr(E[X]) when the expected value of the random matrix X exists.
Proof. Two proofs are given. i) Searle (1971, p. 55): Note that E(xx^T) =
Σ + μ μ^T. Since the quadratic form is a scalar and the trace is a linear
operator, E[x^T A x] = E[tr(x^T A x)] = E[tr(A x x^T)] = tr(E[A x x^T]) =
tr(AΣ + A μ μ^T) = tr(AΣ) + tr(A μ μ^T) = tr(AΣ) + μ^T A μ.
ii) Graybill (1976, p. 140): Using E(x_i x_j) = σ_{ij} + μ_i μ_j, E[x^T A x] =
Σ_{i=1}^n Σ_{j=1}^n a_{ij} E(x_i x_j) = Σ_{i=1}^n Σ_{j=1}^n a_{ij} (σ_{ij} + μ_i μ_j) = tr(AΣ) + μ^T A μ.

Many of the theoretical results for quadratic forms assume that the e_i
are iid N(0, σ^2). These exact results are often special cases of large sample
theory that holds for a large class of iid zero mean error distributions that
have V(e_i) = σ^2. For linear models, Y is typically an n × 1 random vector.
The following theorem from statistical inference will be useful:

Y^T A Y / σ^2 ~ χ^2_r   or   Y^T A Y ~ σ^2 χ^2_r

iff A is idempotent of rank r.
c) If Y ~ N_n(0, Σ) where Σ > 0, then Y^T A Y ~ χ^2_r iff AΣ is idempotent
of rank r = rank(A).

Y^T A Y / σ^2 = Z^T A Z ~ χ^2_r

iff A is idempotent of rank r.
The following theorem is a corollary of Craig's Theorem.
Proof. Two proofs are given. a) i) From the above remarks, and
using e^x = Σ_{j=0}^∞ x^j/j!,

m_Y(t) = Σ_{j=0}^∞ [e^{−γ} γ^j / j!] (1 − 2t)^{−(n+2j)/2} = (1 − 2t)^{−n/2} Σ_{j=0}^∞ e^{−γ} [γ/(1 − 2t)]^j / j!

= (1 − 2t)^{−n/2} exp( −γ + γ/(1 − 2t) ) = (1 − 2t)^{−n/2} exp( 2γt/(1 − 2t) ).

ii) Let W ~ N(√δ, 1) where δ = 2γ. Then W^2 ~ χ^2(1, δ/2) = χ^2(1, γ).
Let W ⊥ X where X ~ χ^2_{n−1} ~ χ^2(n − 1, 0), and let Y = W^2 + X ~ χ^2(n, γ)
by b). Then m_{W^2}(t) =

E(e^{tW^2}) = ∫ e^{tw^2} (1/√(2π)) exp( −(w − √δ)^2 / 2 ) dw =

∫ (1/√(2π)) exp( tw^2 − (w^2 − 2√δ w + δ)/2 ) dw =

∫ (1/√(2π)) exp( −(w^2 − 2tw^2 − 2√δ w + δ)/2 ) dw =

∫ (1/√(2π)) exp( −(w^2(1 − 2t) − 2√δ w + δ)/2 ) dw = ∫ (1/√(2π)) exp( −A/2 ) dw

where A = [√(1 − 2t) (w − b)]^2 + c with

b = √δ/(1 − 2t)   and   c = −2δt/(1 − 2t)

after algebra. Hence m_{W^2}(t) =

e^{−c/2} ∫ (1/√(2π)) exp( −(1 − 2t)(w − b)^2/2 ) dw = e^{−c/2} [1/√(1 − 2t)]

since the integral equals 1/√(1 − 2t) times ∫ f(w) dw = 1 where f(w) is the
N(b, 1/(1 − 2t)) pdf. Thus

m_{W^2}(t) = [1/√(1 − 2t)] exp( δt/(1 − 2t) ).

So m_Y(t) = m_{W^2 + X}(t) = m_{W^2}(t) m_X(t) =

[1/√(1 − 2t)] exp( δt/(1 − 2t) ) (1 − 2t)^{−(n−1)/2} = (1 − 2t)^{−n/2} exp( δt/(1 − 2t) )

= (1 − 2t)^{−n/2} exp( 2γt/(1 − 2t) ).

b) i) By a), m_{Σ_{i=1}^k Y_i}(t) =

Π_{i=1}^k m_{Y_i}(t) = Π_{i=1}^k (1 − 2t)^{−n_i/2} exp( −γ_i [1 − (1 − 2t)^{−1}] ) =

(1 − 2t)^{−Σ_{i=1}^k n_i / 2} exp( −Σ_{i=1}^k γ_i [1 − (1 − 2t)^{−1}] ),

the χ^2( Σ_{i=1}^k n_i, Σ_{i=1}^k γ_i ) mgf.

ii) Let Y_i = Z_i^T Z_i where the Z_i ~ N_{n_i}(θ_i, I_{n_i}) are independent. Let

Z = (Z_1^T, Z_2^T, . . . , Z_k^T)^T ~ N_{Σ n_i}( (θ_1^T, . . . , θ_k^T)^T, I_{Σ n_i} ) ≡ N_{Σ n_i}(θ_Z, I_{Σ n_i}).

Then Z^T Z = Σ_{i=1}^k Z_i^T Z_i = Σ_{i=1}^k Y_i ~ χ^2( Σ_{i=1}^k n_i, γ_Z ) where

γ_Z = θ_Z^T θ_Z / 2 = Σ_{i=1}^k θ_i^T θ_i / 2 = Σ_{i=1}^k γ_i.

Σ_{i=1}^n [E(Z_i^4) − (E[Z_i^2])^2] = Σ_{i=1}^n [θ_i^4 + 6θ_i^2 + 3 − θ_i^4 − 2θ_i^2 − 1] = Σ_{i=1}^n [4θ_i^2 + 2]

= 2n + 4 θ^T θ = 2n + 8γ.
For the following theorem, see Searle (1971, p. 57). Most of the results in
Theorem 11.14 are corollaries of Theorem 11.13. Recall that the matrix in a
quadratic form is symmetric, unless told otherwise.
The following theorem is useful for constructing ANOVA tables. See Searle
(1971, pp. 60–61).
‖Y − η‖^2 = ‖Y − η̂‖^2 + ‖η̂ − η‖^2 + 2(Y − η̂)^T(η̂ − η) ≥ ‖Y − η̂‖^2

X^T X β̂ = X^T Y.
Remark 11.1. a) Know how to find the max and min of a function h that
is continuous on an interval [a,b] and differentiable on (a, b). Solve h′(x) = 0
and find the places where h′(x) does not exist. These values are the critical
points. Evaluate h at a, b, and the critical points. One of these values will
be the min and one the max.
b) Assume h is continuous. Then a critical point θ_o is a local max of h(θ)
if h is increasing for θ < θ_o in a neighborhood of θ_o and if h is decreasing for
θ > θ_o in a neighborhood of θ_o. The first derivative test is often used.
c) If h is strictly concave ( d^2 h(θ)/dθ^2 < 0 for all θ ), then any local max
of h is a global max.
d) Suppose h′(θ_o) = 0. The 2nd derivative test states that if d^2 h(θ_o)/dθ^2 < 0,
then θ_o is a local max.
e) If h(θ) is a continuous function on an interval with endpoints a < b
(not necessarily finite), and differentiable on (a, b) and if the critical point
is unique, then the critical point is a global maximum if it is a local
maximum (because otherwise there would be a local minimum and the critical
point would not be unique). To show that θ̂ is the MLE (the global maximizer
of h(θ) = log L(θ)), show that log L(θ) is differentiable on (a, b). Then show
that θ̂ is the unique solution to the equation d log L(θ)/dθ = 0 and that the
2nd derivative evaluated at θ̂ is negative: d^2 log L(θ)/dθ^2 |_{θ=θ̂} < 0. Similar remarks
hold for finding σ̂^2 using the profile likelihood.
Let τ = σ^2. Then

log(L_p(σ^2)) = c − (n/2) log(σ^2) − Q/(2σ^2),

and

log(L_p(τ)) = c − (n/2) log(τ) − Q/(2τ).

Hence

d log(L_p(τ))/dτ = −n/(2τ) + Q/(2τ^2)  set=  0,

or −nτ + Q = 0 or nτ = Q or

τ̂ = Q/n = σ̂^2 = Σ_{i=1}^n r_i^2 / n = [(n − p)/n] MSE,

which is a unique solution.
Now

d^2 log(L_p(τ))/dτ^2 |_{τ=τ̂} = n/(2τ^2) − 2Q/(2τ^3) |_{τ=τ̂} = n/(2τ̂^2) − 2nτ̂/(2τ̂^3) = −n/(2τ̂^2) < 0.

Σ̂_u = (1/(n − 1)) Σ_{i=1}^n (u_i − ū)(u_i − ū)^T   and   Σ̂_{uY} = (1/(n − 1)) Σ_{i=1}^n (u_i − ū)(Y_i − Ȳ).

Definition 11.19. The notation θ̂ →P θ as n → ∞ means that θ̂ is a
consistent estimator of θ, or, equivalently, that θ̂ converges in probability
to θ.

Lemma 11.18: Seber and Lee (2003, p. 106). Let X = (1 X_1). Then

X^T Y = ( nȲ     )  =  ( nȲ                 ),      X^T X = ( n      n ū^T    )
        ( X_1^T Y )     ( Σ_{i=1}^n u_i Y_i )                ( n ū    X_1^T X_1 ),

and

(X^T X)^{-1} = ( 1/n + ū^T D^{-1} ū    −ū^T D^{-1} )
               ( −D^{-1} ū             D^{-1}      )

where the (p − 1) × (p − 1) matrix D^{-1} = [(n − 1) Σ̂_u]^{-1} = Σ̂_u^{-1}/(n − 1).
The following theorem gives some properties of the least squares estimators
β̂ and MSE under the normal least squares model. Similar properties will be
developed without the normality assumption.
Proof. Let P = P_X.
a) Since A = (X^T X)^{-1} X^T is a constant matrix,

(β̂ − β)^T X^T X (β̂ − β) / σ^2 = (β̂ − β)^T [Cov(β̂)]^{-1} (β̂ − β) ~ χ^2_p

by Theorem 11.11.
c) Since Cov(β̂, r) = Cov((X^T X)^{-1} X^T Y, (I − P)Y) =
σ^2 (X^T X)^{-1} X^T I (I − P) = 0, β̂ ⊥ r. Thus β̂ ⊥ RSS = ‖r‖^2, and β̂ ⊥ MSE.
d) Since P X = X and X^T P = X^T, it follows that X^T(I − P) = 0 and
(I − P)X = 0. Thus RSS = r^T r = Y^T(I − P)Y =
Lβ̂ − c ~ N_r( Lβ − c, σ^2 L(X^T X)^{-1} L^T ).

Let rF_R = σ^2 rF_1/MSE. If H_0 is true, rF_R →D χ^2_r for a large class of zero
mean error distributions. See Theorem 11.25 c).
D
Denition 11.21. If Z n Z as n , then Z n converges in distri-
bution to the random vector Z, and Z is the limiting distribution of Z n
means that the distribution of Z is the limiting distribution of Z n . The
D
notation Z n Nk (, ) means Z Nk (, ).
X1 /d1
Fd1 ,d2 .
X2 /d2
k
If Ui 21 are iid, then i=1 Ui 2k . Let d1 = r and k = d2 = dn . Hence if
X2 2dn , then
d n
X2 Ui P
= i=1 = U E(Ui ) = 1
dn dn
332 11 Theory for Linear Models
D
by the law of large numbers. Hence if W Fr,dn , then rWn 2r .
1
(L c)T [L(X T X)1 LT ]1 (L c) 2r
D
rFR = (11.6)
M SE
as n .
Definition 11.22. A test with test statistic T_n is a large sample right tail
δ test if the test rejects H_0 if T_n > a_n and P(T_n > a_n) = δ_n → δ as n → ∞
when H_0 is true.
T1 X T1 (I P 1 )Y + T1 X T1 (I P 1 )X 1 1 Y T (I P 1 )X 1 1 ,
(X_1/d_1) / (X_2/d_2) ~ F_{d_1,d_2}.

Hence

[Y^T(P − P_1)Y / r] / [Y^T(I − P)Y / (n − p)] = Y^T(P − P_1)Y / (r MSE) ~ F_{r,n−p}

when H_0 is true. Since RSS = Y^T(I − P)Y and RSS(R) = Y^T(I − P_1)Y,
RSS(R) − RSS = Y^T(I − P_1 − [I − P])Y = Y^T(P − P_1)Y, and thus

F_R = Y^T(P − P_1)Y / (r MSE) ~ F_{r,n−p}.
Let X ~ t_{n−p}. Then X^2 ~ F_{1,n−p}. The two tail Wald t test for H_0:
β_j = 0 versus H_1: β_j ≠ 0 is equivalent to the corresponding right tailed F
test since rejecting H_0 if |X| > t_{n−p}(1 − δ/2) is equivalent to rejecting H_0 if
X^2 > F_{1,n−p}(1 − δ).
Under the OLS model where F_R ~ F_{q,n−p} when H_0 is true (so the e_i are
iid N(0, σ^2)), the pvalue = P(W > F_R) where W ~ F_{q,n−p}. In general, we
can only estimate the pvalue. Let pval be the estimated pvalue. Then pval
= P(W > F_R) where W ~ F_{q,n−p}, and pval − pvalue →P 0 as n → ∞ for the
large sample partial F test. The pvalues in output are usually actually pvals
(estimated pvalues).
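For example, pval for the large sample partial F test can be computed in R from F_R, r, and n − p (the numbers below are hypothetical):

FR <- 2.37; r <- 3; n <- 100; p <- 6   # hypothetical test statistic and dimensions
pval <- 1 - pf(FR, r, n - p)           # estimated pvalue = P(W > FR), W ~ F(r, n - p)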
If H_0: β_2 = 0 is true, then γ = 0.
Proof. Note that the denominator is the MSE, and (n − p)MSE/σ^2 ~
χ^2_{n−p} by the proof of Theorem 11.25. By Theorem 11.14 f),

Y^T(P − P_1)Y / σ^2 ~ χ^2( r, γ = β^T X^T(P − P_1)X β / (2σ^2) ),
11.5 Summary
Y^T A Y ~ σ^2 χ^2_r iff A is idempotent of rank r.
12) Often theorems are given for when Y ~ N_n(0, I). If Y ~ N_n(0, σ^2 I),
then apply the theorem using Z = Y/σ ~ N_n(0, I).
13) Suppose Y_1, . . . , Y_n are independent N(θ_i, 1) random variables so that
Y = (Y_1, . . . , Y_n)^T ~ N_n(θ, I_n). Then Y^T Y = Σ_{i=1}^n Y_i^2 ~ χ^2(n, γ =
θ^T θ/2), a noncentral χ^2(n, γ) distribution, with n degrees of freedom and
noncentrality parameter γ = θ^T θ/2 = ½ Σ_{i=1}^n θ_i^2 ≥ 0. The noncentrality
parameter δ = θ^T θ = 2γ is also used.
14) Theorem 11.16. Let η = Xβ ∈ C(X) where Y_i = x_i^T β + r_i(β) and
the residual r_i(β) depends on β. The least squares estimator β̂ is the
value of β ∈ R^p that minimizes the least squares criterion
Σ_{i=1}^n r_i^2(β) = ‖Y − Xβ‖^2.
15) Let x_i^T = (1, u_i^T), and let β^T = (β_1, β_S^T) where β_1 is the intercept and
the slopes vector β_S = (β_2, . . . , β_p)^T. Let the population covariance matrices
Cov(u) = Σ_u, and Cov(u, Y) = Σ_{uY}. If the (Y_i, u_i^T)^T are iid, then the
population coefficients from an OLS regression of Y on u are
11.6 Complements
where

C_OLS = Σ_u^{-1} E[(Y − β_S^T u)^2 (u − E(u))(u − E(u))^T] Σ_u^{-1}.

iii) Chen and Li (1998): Let A be a known full rank constant k × (p − 1)
matrix. If the null hypothesis H_0: A β_U = 0 is true, then

√n (A β̂_S − c A β_U) = √n A β̂_S →D N_k(0, A C_OLS A^T)

and

A C_OLS A^T = τ^2 A Σ_u^{-1} A^T.

To create test statistics, the estimator

τ̂^2 = MSE = (1/(n − p)) Σ_{i=1}^n r_i^2 = (1/(n − p)) Σ_{i=1}^n (Y_i − α̂ − β̂_S^T u_i)^2

can also be useful. Notice that for general 1D regression models, the OLS
MSE estimates τ^2 rather than the error variance σ^2.
iv) Result iii) suggests that a test statistic for H_0: A β_U = 0 is

W_OLS = n β̂_S^T A^T [A Σ̂_u^{-1} A^T]^{-1} A β̂_S / τ̂^2 →D χ^2_k.
11.7 Problems
11.2. Suppose Y_i = x_i^T β + e_i where the errors are iid double exponential
(0, λ) where λ > 0. Then the likelihood function is

L(β, λ) = (1/2^n) (1/λ^n) exp( −(1/λ) Σ_{i=1}^n |Y_i − x_i^T β| ).

Suppose that β̂ is a minimizer of Σ_{i=1}^n |Y_i − x_i^T β|.
a) By direct maximization, show that β̂ is an MLE of β regardless of the
value of λ.
b) Find an MLE λ̂ of λ by maximizing

L(λ) ≡ L(β̂, λ) = (1/2^n) (1/λ^n) exp( −(1/λ) Σ_{i=1}^n |Y_i − x_i^T β̂| ).
i=1
2 n 2 2 i=1
a) Suppose that β̂_W minimizes Σ_{i=1}^n w_i (y_i − x_i^T β)^2. Show that β̂_W is the
MLE of β.
b) Then find the MLE σ̂^2 of σ^2.
11.5. Find the vector a such that aT Y is an unbiased estimator for E(Yi )
if the usual linear model holds.
This chapter will show that multivariate linear regression with m ≥ 2 re-
sponse variables is nearly as easy to use, at least if m is small, as multiple
linear regression which has m = 1 response variable. Plots for checking the
model are given, and prediction regions that are robust to nonnormality are
developed. For hypothesis testing, it is shown that the Wilks' lambda statis-
tic, Hotelling Lawley trace statistic, and Pillai's trace statistic are robust to
nonnormality.
Some of the proofs in this chapter are at a higher level than the rest of
the book.
12.1 Introduction
Denition 12.1. The response variables are the variables that you want
to predict. The predictor variables are the variables used to predict the
response variables.
where v_1 = 1.
The p × m matrix

B = [β_1 β_2 . . . β_m] with (i, j) entry β_{i,j}.

The n × m matrix

E = [e_1 e_2 . . . e_m] with (i, j) entry ε_{i,j} and ith row ε_i^T.

Y = Xβ + e,   (12.2)

The e_i are iid with zero mean and variance σ^2, and multiple linear regression
is used to estimate the unknown parameters β and σ^2.

or y_i = B^T x_i + ε_i = E(y_i) + ε_i where

E(y_i) = B^T x_i = (x_i^T β_1, x_i^T β_2, . . . , x_i^T β_m)^T.
The notation y i |xi and E(y i |xi ) is more accurate, but usually the condi-
tioning is suppressed. Taking xi to be a constant (or condition on xi if the
predictor variables are random variables), y i and i have the same covariance
matrix. In the multivariate regression model, this covariance matrix does
not depend on i. Observations from dierent cases are uncorrelated (often
independent), but the m errors for the m dierent response variables for the
same case are correlated. If X is a random matrix, then assume X and E
are independent and that expectations are conditional on X.
Definition 12.4. Least squares is the classical method for fitting multi-
variate linear regression. The least squares estimators are

B̂ = (X^T X)^{-1} X^T Z = [β̂_1 β̂_2 . . . β̂_m].

The residuals Ê = Z − Ẑ = Z − XB̂ = [r_1 r_2 . . . r_m] form the n × m matrix
with (i, j) entry ε̂_{i,j} and ith row ε̂_i^T. Then

Σ̂_{ε,d} = (Z − Ẑ)^T(Z − Ẑ)/(n − d) = (Z − XB̂)^T(Z − XB̂)/(n − d) = Ê^T Ê/(n − d)
= (1/(n − d)) Σ_{i=1}^n ε̂_i ε̂_i^T.
and

Ê = [I − X(X^T X)^{-1} X^T] Z.
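A minimal sketch in R: lm() with a matrix response fits each response column by least squares, giving B̂ and the residual matrix Ê (the data and coefficient values below are hypothetical):

set.seed(1)
n <- 100; p <- 3; m <- 2
x <- matrix(rnorm(n*(p - 1)), ncol = p - 1)       # nontrivial predictors
z <- cbind(1 + x %*% c(1, 2) + rnorm(n),          # Y_1
           2 + x %*% c(3, 0) + rnorm(n))          # Y_2
out  <- lm(z ~ x)
Bhat <- coef(out)        # p x m matrix of least squares estimators
Ehat <- residuals(out)   # n x m matrix of residual vectors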
The following two theorems show that the least squares estimators are
fairly good. Also see Theorem 12.7 in Section 12.4. Theorem 12.2 can also be
used for Σ̂_{ε,d} = [(n − 1)/(n − d)] S_r.

Theorem 12.2. S_r = Σ_ε + O_P(n^{-1/2}) and (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T = Σ_ε +
O_P(n^{-1/2}) if the following three conditions hold: B̂ − B = O_P(n^{-1/2}),
(1/n) Σ_{i=1}^n ε_i x_i^T = O_P(1), and (1/n) Σ_{i=1}^n x_i x_i^T = O_P(n^{1/2}).
Proof. Note that y_i = B^T x_i + ε_i = B̂^T x_i + ε̂_i. Hence ε̂_i = (B − B̂)^T x_i + ε_i.
Thus

Σ_{i=1}^n ε̂_i ε̂_i^T = Σ_{i=1}^n (ε̂_i − ε_i + ε_i)(ε̂_i − ε_i + ε_i)^T
= Σ_{i=1}^n [ε_i ε_i^T + ε_i(ε̂_i − ε_i)^T + (ε̂_i − ε_i)ε_i^T + (ε̂_i − ε_i)(ε̂_i − ε_i)^T]

= Σ_{i=1}^n ε_i ε_i^T + (Σ_{i=1}^n ε_i x_i^T)(B − B̂) + (B − B̂)^T (Σ_{i=1}^n x_i ε_i^T) +

(B − B̂)^T (Σ_{i=1}^n x_i x_i^T)(B − B̂).

Thus (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T = (1/n) Σ_{i=1}^n ε_i ε_i^T +
O_P(1)O_P(n^{-1/2}) + O_P(n^{-1/2})O_P(1) + O_P(n^{-1/2})O_P(n^{1/2})O_P(n^{-1/2}),

and the result follows since (1/n) Σ_{i=1}^n ε_i ε_i^T = Σ_ε + O_P(n^{-1/2}) and

S_r = [n/(n − 1)] (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T.

S_r and Σ̂_ε are also √n consistent estimators of Σ_ε by Su and Cook (2012,
p. 692). See Theorem 12.7.
This section suggests using residual plots, response plots, and the DD plot
to examine the multivariate linear model. The residual plots are often used
to check for lack of t of the multivariate linear model. The response plots
are used to check linearity and to detect inuential cases and outliers. The
response and residual plots are used exactly as in the m = 1 case correspond-
ing to multiple linear regression and experimental design models. See earlier
chapters of this book, Olive et al. (2015), Olive and Hawkins (2005), and
Cook and Weisberg (1999a, p. 432; 1999b).
Definition 12.5. A response plot for the jth response variable is a plot
of the fitted values Ŷ_ij versus the response Y_ij. The identity line with slope
one and zero intercept is added to the plot as a visual aid. A residual plot
corresponding to the jth response variable is a plot of Ŷ_ij versus r_ij.
Remark 12.1. Make the m response and residual plots for any multi-
variate linear regression. In a response plot, the vertical deviations from the
identity line are the residuals r_ij = Y_ij − Ŷ_ij. Suppose the model is good, the
jth error distribution is unimodal and not highly skewed for j = 1, . . . , m,
and n ≥ 10p. Then the plotted points should cluster about the identity line
If the model is good, then each of the m residual plots should be ellipsoidal
with no trend and should be centered about the r = 0 line. There should not
be any pattern in the residual plot: as a narrow vertical strip is moved from
left to right, the behavior of the residuals within the strip should show little
change. Outliers and patterns such as curvature or a fan shaped plot are bad.
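Continuing the lm() sketch above, the m response and residual plots can be made with a loop (assuming z, out, Ehat, and m from that sketch):

Zhat <- fitted(out)
for(j in 1:m){
  plot(Zhat[, j], z[, j], xlab = "FIT", ylab = "Y")      # response plot
  abline(0, 1)                                            # identity line
  plot(Zhat[, j], Ehat[, j], xlab = "FIT", ylab = "RES")  # residual plot
  abline(h = 0)                                           # r = 0 line
}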
Rule of thumb 12.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)
Remark 12.2. Residual plots magnify departures from the model while
the response plots emphasize how well the multivariate linear regression model
fits the data.
The RMVN DD plot of the residual vectors i is used to check the error
distribution, to detect outliers, and to display the nonparametric prediction
region developed in Section 12.3. The DD plot suggests that the error dis-
tribution is elliptically contoured if the plotted points cluster tightly about
a line through the origin as n . The plot suggests that the error distri-
bution is multivariate normal if the line is the identity line. If n is large and
the plotted points do not cluster tightly about a line through the origin, then
the error distribution may not be elliptically contoured. These applications
of the DD plot for iid multivariate data are discussed in Olive (2002, 2008,
2013a) and Section 10.3. The RMVN estimator has not yet been proven to be
a consistent estimator when computed from residual vectors, but simulations
suggest that the RMVN DD plot of the residual vectors is a useful diagnostic
plot. The lregpack function mregddsim can be used to simulate the DD plots
of the residual vectors for various distributions.
Predictor transformations for the continuous predictors can be made ex-
actly as in Section 3.1, while response transformations can be made as in
Section 3.2 for each of the m response variables.
Warning: The Rule of thumb 3.2 does not always work. For example, the
log rule may fail. If the relationships in the scatterplot matrix are already
linear or if taking the transformation does not increase the linearity, then
no transformation may be better than taking a transformation. For the Arc
data set evaporat.lsp with m = 1, the log rule suggests transforming the
response variable Evap, but no transformation works better.
The classical large sample 100(1 − δ)% prediction region for a future value
x_f given iid data x_1, . . . , x_n is {x : D_x^2(x̄, S) ≤ χ^2_{p,1−δ}}, while for multi-
variate linear regression, the classical large sample 100(1 − δ)% prediction
region for a future value y_f given x_f and past data (x_1, y_1), . . . , (x_n, y_n) is
{y : D_y^2(ŷ_f, Σ̂_ε) ≤ χ^2_{m,1−δ}}. See Johnson and Wichern (1988, pp. 134, 151,
312). By equation (10.10), these regions may work for multivariate normal
x_i or ε_i, but otherwise tend to have undercoverage. Olive (2013a) replaced
χ^2_{p,1−δ} by the order statistic D^2_{(U_n)} where U_n decreases to ⌈n(1 − δ)⌉. This
section will use a similar technique from Olive (2016b) to develop possibly
the first practical large sample prediction region for the multivariate linear
model with unknown error distribution. The following technical theorem will
be needed to prove Theorem 12.4.
Proof. Let B_n denote the subset of the sample space on which Σ̂_n has
an inverse. Then P(B_n) → 1 as n → ∞. Now

D_x^2(μ̂_n, Σ̂_n) = (x − μ̂_n)^T Σ̂_n^{-1} (x − μ̂_n) =

(x − μ̂_n)^T ( Σ^{-1}/a − Σ^{-1}/a + Σ̂_n^{-1} ) (x − μ̂_n) =

(x − μ̂_n)^T ( −Σ^{-1}/a + Σ̂_n^{-1} ) (x − μ̂_n) + (x − μ̂_n)^T ( Σ^{-1}/a ) (x − μ̂_n) =

(1/a) (x − μ̂_n)^T ( −Σ^{-1} + a Σ̂_n^{-1} ) (x − μ̂_n) +

(1/a) (x − μ + μ − μ̂_n)^T Σ^{-1} (x − μ + μ − μ̂_n)

= (1/a) (x − μ)^T Σ^{-1} (x − μ) + (2/a) (x − μ)^T Σ^{-1} (μ − μ̂_n) +

(1/a) (μ − μ̂_n)^T Σ^{-1} (μ − μ̂_n) + (1/a) (x − μ̂_n)^T ( a Σ̂_n^{-1} − Σ^{-1} ) (x − μ̂_n)

on B_n, and the last three terms are o_P(1) under a) and O_P(n^{−δ}) under b).
{z : D_z^2(ŷ_f, Σ̂_ε) ≤ D^2_{(U_n)}} = {z : D_z(ŷ_f, Σ̂_ε) ≤ D_{(U_n)}}.   (12.4)

a) Consider the n prediction regions for the data where (y_{f,i}, x_{f,i}) =
(y_i, x_i) for i = 1, . . . , n. If the order statistic D_{(U_n)} is unique, then U_n of the
n prediction regions contain y_i where U_n/n → 1 − δ as n → ∞.
b) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), then (12.4) is a
large sample 100(1 − δ)% prediction region for y_f.
c) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), and the ε_i come
from an elliptically contoured distribution such that the highest density re-
gion is {z : D_z(0, Σ_ε) ≤ D_{1−δ}}, then the prediction region (12.4) is asymp-
totically optimal.
D^2_{y_i}(ŷ_i, Σ̂_ε) = (y_i − ŷ_i)^T Σ̂_ε^{-1} (y_i − ŷ_i) = ε̂_i^T Σ̂_ε^{-1} ε̂_i = D^2_{ε̂_i}(0, Σ̂_ε).

Notice that if Σ̂_ε^{-1} exists, then 100q_n% of the n training data y_i are in their
corresponding prediction region with x_f = x_i, and q_n → 1 − δ even if (ŷ_i, Σ̂_ε)
is not a good estimator or if the regression model is misspecified. Hence the
coverage q_n of the training data is robust to model assumptions. Of course the
volume of the prediction region could be large if a poor estimator (ŷ_i, Σ̂_ε) is
used or if the ε_i do not come from an elliptically contoured distribution. The
response, residual, and DD plots can be used to check model assumptions.
If the plotted points in the RMVN DD plot cluster tightly about some line
through the origin and if n ≥ max[3(m + p)^2, mp + 30], we expect the volume
of the prediction region to be fairly low for the least squares estimators.
If n is too small, then multivariate data is sparse and the covering ellipsoid
for the training data may be far too small for future data, resulting in severe
undercoverage. Also notice that q_n = 1 − δ/2 or q_n = 1 − δ + 0.05 for n ≤ 20p.
At the training data, the coverage ≈ q_n ≥ 1 − δ, and q_n converges to the
nominal coverage 1 − δ as n → ∞. Suppose n ≤ 20p. Then the nominal 95%
prediction region uses q_n = 0.975 while the nominal 50% prediction region
uses q_n = 0.55. Prediction distributions depend both on the error distribution
and on the variability of the estimator (ŷ_f, Σ̂_ε). This variability is typically
unknown but converges to 0 as n → ∞. Also, residuals tend to underestimate
errors for small n. For moderate n, ignoring estimator variability and using
q_n = 1 − δ resulted in undercoverage as high as min(0.05, δ/2). Letting the
coverage q_n decrease to the nominal coverage 1 − δ inflates the volume of
the prediction region for small n, compensating for the unknown variability
of (ŷ_f, Σ̂_ε).
Theorem 12.5 will show that the prediction region (12.4) can also be found
by applying the nonparametric prediction region, described below, on the z_i.
Olive (2013a, 2016c: ch. 5) derived prediction regions for a future observation
x_f given n iid p × 1 random vectors x_i. These regions are reviewed below
and then similar regions are used for multivariate linear regression. Suppose
(T, C) is an estimator of multivariate location and dispersion (μ, Σ) such as
the classical estimator (x̄, S). For h > 0, consider the hyperellipsoid

{z : (z − T)^T C^{-1} (z − T) ≤ h^2} = {z : D_z^2 ≤ h^2} = {z : D_z ≤ h}.   (12.5)

A future observation x_f is in region (12.5) if D_{x_f} ≤ h. If (T, C) is a consistent
estimator of (μ, dΣ), then (12.5) is a large sample (1 − δ)100% prediction
region if h = D_{(U_n)} where D_{(U_n)} is the q_n th sample quantile of the D_i with
p replacing m. The classical parametric multivariate normal large sample
prediction region uses (T, C) = (x̄, S) and h^2 = χ^2_{p,1−δ}. The nonparametric
region uses the classical estimator (T, C) = (x̄, S) and h = D_{(U_n)}. The
semiparametric region uses (T, C) = (T_RMVN, C_RMVN) and h = D_{(U_n)}. The
parametric MVN region uses (T, C) = (T_RMVN, C_RMVN) and h^2 = χ^2_{p,q_n}
where P(W ≤ χ^2_{p,q_n}) = q_n if W ~ χ^2_p.
Consider the multivariate linear regression model. The semiparametric and
parametric regions are only conjectured to be large sample prediction re-
gions, but are useful as diagnostics. Let Σ̂_ε = Σ̂_{ε,d=p}, z_i = ŷ_f + ε̂_i, and
D_i^2(ŷ_f, S_r) = (z_i − ŷ_f)^T S_r^{-1} (z_i − ŷ_f) for i = 1, . . . , n. Then the large sample
nonparametric 100(1 − δ)% prediction region is

{z : D_z^2(ŷ_f, S_r) ≤ D^2_{(U_n)}} = {z : D_z(ŷ_f, S_r) ≤ D_{(U_n)}},   (12.6)

while the (Johnson and Wichern 1988: p. 312) classical large sample 100(1 −
δ)% prediction region is

{z : D_z^2(ŷ_f, Σ̂_ε) ≤ χ^2_{m,1−δ}} = {z : D_z(ŷ_f, Σ̂_ε) ≤ √(χ^2_{m,1−δ})}.   (12.7)

The nonparametric prediction region (12.6) uses (T, C) equal to the
sample mean and sample covariance matrix (defined above Definition 10.7)
applied to the z_i. The sample mean and sample covariance matrix of the resid-
ual vectors is (0, S_r) since least squares was used. Hence the z_i = ŷ_f + ε̂_i have
sample covariance matrix S_r, and sample mean ŷ_f. Hence (T, C) = (ŷ_f, S_r),
and the D_i(ŷ_f, S_r) are used to compute D_{(U_n)}.
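A minimal sketch of the cutoff D_(Un), continuing the earlier lm() sketch (the qn formula below is a simplified version of the text's rule and should be treated as an assumption):

del <- 0.1                                   # nominal error rate for a 90% region
qn  <- min(1 - del/2, 1 - del + 0.05)        # simplified qn, as in the two examples above
Sr  <- var(Ehat)                             # sample covariance of the residual vectors
D2  <- mahalanobis(Ehat, center = rep(0, m), cov = Sr)
Un  <- ceiling(n * qn)
cutoff <- sqrt(sort(D2)[Un])                 # D_(Un)
# a future y_f is in the region if D(y_f; yhat_f, Sr) <= cutoff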
The RMVN DD plot of the residuals will be used to display the prediction
regions for multivariate linear regression. See Example 12.3. The nonparamet-
ric prediction region for multivariate linear regression of Theorem 12.5 uses
(T, C) = (ŷ_f, S_r) in (12.4), and has simple geometry. Let R_r be the non-
parametric prediction region (12.6) applied to the residuals ε̂_i with ŷ_f = 0.
Then R_r is a hyperellipsoid with center 0, and the nonparametric prediction
region is the hyperellipsoid R_r translated to have center ŷ_f. Hence in a DD
plot, all points to the left of the line MD = D_{(U_n)} correspond to y_i that are
in their prediction region, while points to the right of the line are not in their
prediction region.
The nonparametric prediction region has some interesting properties. This
prediction region is asymptotically optimal if the ε_i are iid for a large class
of elliptically contoured EC_m(0, Σ, g) distributions. Also, if there are 100
different values (x_{jf}, y_{jf}) to be predicted, we only need to update ŷ_{jf} for
j = 1, . . . , 100; we do not need to update the covariance matrix S_r.
It is common practice to examine how well the prediction regions work on
the training data. That is, for i = 1, . . . , n, set x_f = x_i and see if y_i is in
the region with probability near 1 − δ with a simulation study. Note that
ŷ_f = ŷ_i if x_f = x_i. Simulation is not needed for the nonparametric prediction
region (12.6) for the data since the prediction region (12.6) centered at ŷ_i
contains y_i iff R_r, the prediction region centered at 0, contains ε̂_i since
ε̂_i = y_i − ŷ_i. Thus 100q_n% of prediction regions corresponding to the data
(y_i, x_i) contain y_i, and 100q_n% ≥ 100(1 − δ)%. Hence the prediction regions
work well on the training data and should work well on (x_f, y_f) similar to the
training data. Of course simulation should be done for (x_f, y_f) that are not
equal to training data cases. See Section 12.5.
This training data result holds provided that the multivariate linear regres-
sion using least squares is such that the sample covariance matrix S r of the
residual vectors is nonsingular, the multivariate regression model need
not be correct. Hence the coverage at the n training data cases (xi , y i )
is robust to model misspecification. Of course, the prediction regions may
be very large if the model is severely misspecified, but severity of misspec-
ification can be checked with the response and residual plots. Coverage for
a future value y_f can also be arbitrarily bad if there is extrapolation or if
(x_f, y_f) comes from a different population than that of the data.
Note that T/(n − 1) is the usual sample covariance matrix Σ̂_y if all n of the
y_i are iid, e.g. if B = 0. The regression sum of squares and cross products
matrix is

R = Z^T [X(X^T X)^{-1} X^T − (1/n) 1 1^T] Z = Z^T X B̂ − (1/n) Z^T 1 1^T Z.

Let H = B̂^T L^T [L(X^T X)^{-1} L^T]^{-1} L B̂. The error or residual sum of squares
and cross products matrix is W_e = Ê^T Ê = (Z − Ẑ)^T (Z − Ẑ).

Source                    matrix   df
Regression or Treatment   R        p − 1
Error or Residual         W_e      n − p
Total (corrected)         T        n − 1

The Hotelling-Lawley trace statistic is U(L) = tr[W_e^{-1} H] = Σ_{i=1}^m λ_i.
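A minimal sketch computing U(L) and the MANOVA F test statistic from the earlier lm() sketch, with L = [0 I_{p−1}] (so r = p − 1); the chi-square pval uses the large sample result discussed below:

X  <- cbind(1, x)                            # design matrix
L  <- cbind(0, diag(p - 1))                  # tests the nontrivial predictors
We <- t(Ehat) %*% Ehat                       # error sum of squares and cross products
H  <- t(L %*% Bhat) %*% solve(L %*% solve(t(X) %*% X) %*% t(L)) %*% (L %*% Bhat)
U  <- sum(diag(solve(We) %*% H))             # Hotelling Lawley trace statistic
r  <- p - 1
Fstat <- (n - p) * U / (r * m)               # MANOVA F test statistic
pval  <- 1 - pchisq((n - p) * U, r * m)      # large sample pval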
Some notation is useful to show (12.8) and to show that (n − p)U(L) →D χ^2_{rm}
under mild conditions if H_0 is true. Following Henderson and Searle (1979),
let matrix A = [a_1 a_2 . . . a_p]. Then the vec operator stacks the columns
of A on top of one another so

vec(A) = (a_1^T, a_2^T, . . . , a_p^T)^T.

placed by Ŵ = n(X^T X)^{-1} and Σ̂_ε. Hence under H_0 and using the proof of
Theorem 12.6,

T = (n − p)U(L) = [vec(LB̂)]^T [Σ̂_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}][vec(LB̂)] →D χ^2_{rm}.

Some more details on the above results may be useful. Consider testing a
linear hypothesis H_0: LB = 0 versus H_1: LB ≠ 0 where L is a full rank
r × p matrix. For now assume the error distribution is multivariate normal
N_m(0, Σ_ε). Then

vec(B̂ − B) = ((β̂_1 − β_1)^T, (β̂_2 − β_2)^T, . . . , (β̂_m − β_m)^T)^T ~ N_{pm}(0, Σ_ε ⊗ (X^T X)^{-1})

where

C = Σ_ε ⊗ (X^T X)^{-1} =

( σ_{11}(X^T X)^{-1}  σ_{12}(X^T X)^{-1}  . . .  σ_{1m}(X^T X)^{-1} )
( σ_{21}(X^T X)^{-1}  σ_{22}(X^T X)^{-1}  . . .  σ_{2m}(X^T X)^{-1} )
(        ...                 ...                      ...          )
( σ_{m1}(X^T X)^{-1}  σ_{m2}(X^T X)^{-1}  . . .  σ_{mm}(X^T X)^{-1} ).

Hence under H_0,

[vec(LB̂)]^T [Σ_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}][vec(LB̂)] ~ χ^2_{rm},

and

T = [vec(LB̂)]^T [Σ̂_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}][vec(LB̂)] →D χ^2_{rm}.   (12.9)

Since least squares estimators are asymptotically normal, if the ε_i are iid
for a large class of distributions,
√n vec(B̂ − B) = √n ((β̂_1 − β_1)^T, . . . , (β̂_m − β_m)^T)^T →D N_{pm}(0, Σ_ε ⊗ W)

where

(X^T X)/n →P W^{-1}.

Then under H_0,

√n vec(LB̂) = √n ((Lβ̂_1)^T, . . . , (Lβ̂_m)^T)^T →D N_{rm}(0, Σ_ε ⊗ L W L^T),

and

n [vec(LB̂)]^T [Σ̂_ε^{-1} ⊗ (L Ŵ L^T)^{-1}][vec(LB̂)] →D χ^2_{rm}.

Hence (12.9) holds, and (12.10) gives a large sample level δ test if the least
squares estimators are asymptotically normal.
Kakizawa (2009) shows, under stronger assumptions than Theorem 12.8,
that for a large class of iid error distributions, the following test statistics
have the same χ^2_{rm} limiting distribution when H_0 is true, and the same non-
central χ^2_{rm}(ω^2) limiting distribution with noncentrality parameter ω^2 when
H_0 is false under a local alternative. Hence the three tests are robust to the
assumption of normality. The limiting null distribution is well known when
the zero mean errors are iid from a multivariate normal distribution. See
Khattree and Naik (1999, p. 68): (n − p)U(L) →D χ^2_{rm}, (n − p)V(L) →D χ^2_{rm},
and −[n − p − 0.5(m − r + 3)] log(Λ(L)) →D χ^2_{rm}. Results from Kshirsagar
(1972, p. 301) suggest that the third chi-square approximation is very good
if n ≥ 3(m + p)^2 for multivariate normal errors.
Theorems 12.6 and 12.8 are useful for relating multivariate tests with
the partial F test for multiple linear regression that tests whether a reduced
model that omits some of the predictors can be used instead of the full model
that uses all p predictors. The partial F test statistic is

F_R = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / MSE(F)

where the residual sums of squares SSE(F) and SSE(R) and degrees
of freedom df_F and df_R are for the full and reduced model while the
mean square error MSE(F) is for the full model. Let the null hypothe-
sis for the partial F test be H_0: Lβ = 0 where L sets the coefficients
of the predictors in the full model but not in the reduced model to 0.
Seber and Lee (2003, p. 100) show that
Hence the Hotelling Lawley test will have the most power and Pillai's test
will have the least power.
Following Khattree and Naik (1999, pp. 67–68), there are several ap-
proximations used by the SAS software. For Roy's largest root test, if
h = max(r, m), use

[(n − p − h + r)/h] λ_max(L) ≈ F(h, n − p − h + r).

The simulations in Section 12.5 suggest that this approximation is good for
r = 1 but poor for r > 1. Anderson (1984, p. 333) states that Roy's largest
root test has the greatest power if r = 1 but is an inferior test for r > 1. Let
g = n − p − (m − r + 1)/2, u = (rm − 2)/4 and t = √[(r^2 m^2 − 4)/(m^2 + r^2 − 5)] for
m^2 + r^2 − 5 > 0 and t = 1, otherwise. Assume H_0 is true. Thus U →P 0, V →P 0,
and Λ →P 1 as n → ∞. Then

[(gt − 2u)/(rm)] (1 − Λ^{1/t})/Λ^{1/t} ≈ F(rm, gt − 2u)   or   (n − p)t (1 − Λ^{1/t}) ≈ χ^2_{rm}.
then it is possible that the approximate χ^2_{rm} distribution may be the limiting
distribution for only a small class of iid error distributions. When the ε_i are
iid N_m(0, Σ_ε), there are some exact results. For r = 1,

[(n − p − m + 1)/m] (1 − Λ)/Λ ~ F(m, n − p − m + 1).

For r = 2,

[2(n − p − m + 1)/(2m)] (1 − Λ^{1/2})/Λ^{1/2} ~ F(2m, 2(n − p − m + 1)).

For m = 2,

[2(n − p)/(2r)] (1 − Λ^{1/2})/Λ^{1/2} ~ F(2r, 2(n − p)).

Let s = min(r, m), m_1 = (|r − m| − 1)/2 and m_2 = (n − p − m − 1)/2. Note
that s(|r − m| + s) = min(r, m) max(r, m) = rm. Then

[(n − p)/(rm)] V/(1 − V/s) = [(n − p)/(s(|r − m| + s))] V/(1 − V/s) ≈ [(2m_2 + s + 1)/(2m_1 + s + 1)] V/(s − V)

and conclude that LB ≠ 0. If pval > δ, fail to reject H_0 and conclude that
LB = 0 or that there is not enough evidence to conclude that LB ≠ 0.
The MANOVA test of H_0: B = 0 versus H_1: B ≠ 0 is the special case
corresponding to L = I and H = B̂^T X^T X B̂ = Ẑ^T Ẑ, but is usually not a
test of interest.
The analog of the ANOVA F test for multiple linear regression is the
MANOVA F test that uses L = [0 I_{p−1}] to test whether the nontrivial
predictors are needed in the model. This test should reject H_0 if the response
and residual plots look good, n is large enough, and at least one response
plot does not look like the corresponding residual plot. A response plot for
Y_j will look like a residual plot if the identity line appears almost horizontal,
hence the range of Ŷ_j is small. Response and residual plots are often useful
for n ≥ 10p.
The 4 step MANOVA F test of hypotheses uses L = [0 I_{p−1}].
i) State the hypotheses H_0: the nontrivial predictors are not needed in the
mreg model H_1: at least one of the nontrivial predictors is needed.
ii) Find the test statistic F_0 from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H_0. If pval > δ, fail to reject H_0. If H_0 is rejected,
conclude that there is a mreg relationship between the response variables
Y_1, . . . , Y_m and the predictors x_2, . . . , x_p. If you fail to reject H_0, conclude
that there is not a mreg relationship between Y_1, . . . , Y_m and the predictors
x_2, . . . , x_p. (Or there is not enough evidence to conclude that there is a
mreg relationship between the response variables and the predictors. Get the
variable names from the story problem.)
conclude that xj is needed in the model. Get the variable names from the
story problem.)
The Hotelling Lawley statistic

F_j = (1/d_j) B̂_j^T Σ̂_ε^{-1} B̂_j = (1/d_j) (β̂_{j1}, β̂_{j2}, . . . , β̂_{jm}) Σ̂_ε^{-1} (β̂_{j1}, β̂_{j2}, . . . , β̂_{jm})^T

where B̂_j^T is the jth row of B̂ and d_j = (X^T X)^{-1}_{jj}, the jth diagonal entry of
(X^T X)^{-1}. The statistic F_j could be used for forward selection and backward
elimination in variable selection.
The 4 step MANOVA partial F test of hypotheses has a full model using
all of the variables and a reduced model where r of the variables are deleted.
The ith row of L has a 1 in the position corresponding to the ith variable
to be deleted. Omitting the jth variable corresponds to the Fj test while
omitting variables x2 , . . . , xp corresponds to the MANOVA F test. Using
L = [0 I k ] tests whether the last k predictors are needed in the multivariate
linear regression model given that the remaining predictors are in the model.
i) State the hypotheses H0 : the reduced model is good H1 : use the full
model.
ii) Find the test statistic FR from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H_0 and conclude that the full model should be used.
If pval > δ, fail to reject H_0 and conclude that the reduced model is good.
The lregpack function mltreg produces the m response and residual plots,
gives B̂, Σ̂_ε, the MANOVA partial F test statistic and pval corresponding
to the reduced model that leaves out the variables given by indices (so x2
and x4 in the output below with F = 0.77 and pval = 0.614), Fj and the
pval for the Fj test for variables 1, 2, . . . , p (where p = 4 in the output below
so F2 = 1.51 with pval = 0.284), and F0 and pval for the MANOVA F test
(in the output below F0 = 3.15 and pval= 0.06). Right click Stop on the plots
m times to advance the plots and to get the cursor back on the command
line in R.
The command out <- mltreg(x,y,indices=c(2)) would produce a
MANOVA partial F test corresponding to the F2 test while the command
out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA par-
tial F test corresponding to the MANOVA F test for a data set with p = 4
predictor variables. The Hotelling Lawley trace statistic is used in the tests.
$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447
$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
$Ftable
Fj pvals
[1,] 4.35326807 0.02870083
[2,] 600.57002201 0.00000000
[3,] 0.08819810 0.91597268
[4,] 0.06531531 0.93699302
$MANOVA
MANOVAF pval
[1,] 295.071 1.110223e-16
Example 12.2. The above output is for the Hebbler (1847) data from
the 1843 Prussia census. Sometimes if the wife or husband was not at the
household, then s/he would not be counted. Y1 = number of married civilian
men in the district, Y2 = number of women married to civilians in the district,
x2 = population of the district in 1843, x3 = number of married military men
in the district, and x4 = number of women married to military men in the
district. The reduced model deletes x3 and x4 . The constant uses x1 = 1.
Solution:
a) i) H0 : the nontrivial predictors are not needed in the mreg model
H1 : at least one of the nontrivial predictors is needed
ii) F0 = 295.071
iii) pval = 0
iv) Reject H0 , the nontrivial predictors are needed in the mreg model.
b) i) H0 : x2 is not needed in the model H1 : x2 is needed
ii) F2 = 600.57
iii) pval = 0
iv) Reject H0 , population of the district is needed in the model.
c) i) H0 : x4 is not needed in the model H1 : x4 is needed
ii) F4 = 0.065
iii) pval = 0.937
iv) Fail to reject H0 , number of women married to military men is not
needed in the model given that the other predictors are in the model.
d) i) H0 : the reduced model is good H1 : use the full model.
ii) FR = 0.200
iii) pval = 0.935
iv) Fail to reject H0 , so the reduced model is good.
e) i) H0 : the reduced model is good H1 : use the full model.
ii) FR = 569.6
iii) pval = 0.00
iv) Reject H0 , so use the full model.
Fig. 12.1 Scatterplot Matrix of the Mussels Data: L, log(W), H, log(S), and log(M).
Example 12.3. Cook and Weisberg (1999a, pp. 351, 433, 447) give a data
set on 82 mussels sampled o the coast of New Zealand. Let Y1 = log(S)
and Y2 = log(M ) where S is the shell mass and M is the muscle mass.
The predictors are X2 = L, X3 = log(W ), and X4 = H: the shell length,
log(width), and height.
a) First use the multivariate location and dispersion model for this data.
Figure 12.1 shows a scatterplot matrix of the data and Figure 12.2 shows a
Fig. 12.2 DD Plot of the Mussels Data, MLD Model.
Fig. 12.3 Response and Residual Plots for Y1 = log(S), Mussels Data.
Fig. 12.4 Response and Residual Plots for Y2 = log(M), Mussels Data.
Fig. 12.5 DD Plot of the Residual Vectors for the Mussel Data.
DD plot of the data with multivariate prediction regions added. These plots
suggest that the data may come from an elliptically contoured distribution
that is not multivariate normal. The semiparametric and nonparametric 90%
prediction regions consist of the cases below the RD = 5.86 line and to the
left of the M D = 4.41 line. These two lines intersect on a line through the
origin that is followed by the plotted points. The parametric MVN prediction
region is given by the points below the RD = 3.33 line and does not contain
enough cases.
b) Now consider the multivariate linear regression model. Figures 12.3
and 12.4 give the response and residual plots for Y1 and Y2 . The response
plots show strong linear relationships. For Y_1, case 79 sticks out while for Y_2,
cases 8, 25, and 48 are not fit well. Highlighted cases had Cook's distance
> min(0.5, 2p/n). See Cook (1977). A residual vector ε̂ = (ε̂ − ε) + ε is a
combination of ε and a discrepancy ε̂ − ε that tends to have an approximate
multivariate normal distribution. The ε̂ − ε term can dominate for small to
moderate n when ε is not multivariate normal, incorrectly suggesting that
the distribution of the error ε is closer to a multivariate normal distribution
than is actually the case. Figure 12.5 shows the DD plot of the residual vec-
tors. The plotted points are highly correlated but do not cover the identity
line, suggesting an elliptically contoured error distribution that is not mul-
tivariate normal. The nonparametric 90% prediction region for the residuals
consists of the points to the left of the vertical line M D = 2.60. Comparing
Figures 12.2 and 12.5, the residual distribution is closer to a multivariate
normal distribution. Cases 8, 48, and 79 have especially large distances. The
four Hotelling Lawley Fj statistics were greater than 5.77 with pvalues less
than 0.005, and the MANOVA F statistic was 337.8 with pvalue 0.
The response, residual, and DD plots are eective for nding inuential
cases, for checking linearity, for checking whether the error distribution is
multivariate normal or some other elliptically contoured distribution, and
for displaying the nonparametric prediction region. Note that cases to the
right of the vertical line correspond to cases with y i that are not in their
prediction region. These are the cases corresponding to residual vectors with
large Mahalanobis distances. Adding a constant does not change the distance,
so the DD plot for the residuals is the same as the DD plot for the z i .
c) Now suppose the same model is used except Y2 = M . Then the response
and residual plots for Y1 remain the same, but the plots shown in Figure 12.6
show curvature about the identity and r = 0 lines. Hence the linearity con-
dition is violated. Figure 12.7 shows that the plotted points in the DD plot
have correlation well less than one, suggesting that the error distribution is
no longer elliptically contoured. The nonparametric 90% prediction region
for the residual vectors consists of the points to the left of the vertical line
M D = 2.52, and contains 95% of the training data. Note that the plots can
be used to quickly assess whether power transformations have resulted in a
linear model, and whether inuential cases are present. R code for producing
the seven gures is shown below.
Fig. 12.6 Response and Residual Plots for Y2 = M, Mussels Data.

Fig. 12.7 DD Plot of the Residual Vectors when Y2 = M.
y <- log(mussels)[,4:5]
x <- mussels[,1:3]
x[,2] <- log(x[,2])
z<-cbind(x,y)
pairs(z, labels=c("L","log(W)","H","log(S)","log(M)"))
ddplot4(z) #right click Stop
out <- mltreg(x,y) #right click Stop 4 times
ddplot4(out$res) #right click Stop
y[,2] <- mussels[,5]
tem <- mltreg(x,y) #right click Stop 4 times
ddplot4(tem$res) #right click Stop
A small simulation was used to study the Wilks' test, Pillai's trace
test, the Hotelling Lawley trace test, and Roy's largest root test for the
F_j tests and the MANOVA F test for multivariate linear regression. The first
row of B was always 1^T and the last row of B was always 0^T. When the null
hypothesis for the MANOVA F test is true, all but the first row corresponding
to the constant are equal to 0^T. When p ≥ 3 and the null hypothesis for the
MANOVA F test is false, then the second to last row of B is (1, 0, . . . , 0),
the third to last row is (1, 1, 0, . . . , 0) et cetera as long as the first row is
not changed from 1^T. First m × 1 error vectors w_i were generated such that
the m errors are iid with variance σ^2. Let the m × m matrix A = (a_{ij}) with
a_{ii} = 1 and a_{ij} = ψ where 0 ≤ ψ < 1 for i ≠ j. Then ε_i = A w_i so that
Σ_ε = σ^2 A A^T = (σ_{ij}) where the diagonal entries σ_{ii} = σ^2[1 + (m − 1)ψ^2] and
the off diagonal entries σ_{ij} = σ^2[2ψ + (m − 2)ψ^2] where ψ = 0.10. Hence the
correlations are (2ψ + (m − 2)ψ^2)/(1 + (m − 1)ψ^2). As ψ gets close to 1, the error
vectors cluster about the line in the direction of (1, . . . , 1)^T. See Maronna and
Zamar (2002). We used w_i ~ N_m(0, I), w_i ~ (1 − τ)N_m(0, I) + τ N_m(0, 25I)
with 0 < τ < 1 and τ = 0.25 in the simulation, w_i ~ multivariate t_d with
d = 7 degrees of freedom, or w_i ~ lognormal − E(lognormal): where the m
components of w_i were iid with distribution e^z − E(e^z) where z ~ N(0, 1).
Only the lognormal distribution is not elliptically contoured.
The simulation used 5000 runs, and H_0 was rejected if the F statistic
was greater than F_{d_1,d_2}(0.95) where P(F_{d_1,d_2} < F_{d_1,d_2}(0.95)) = 0.95 with
d_1 = rm and d_2 = n − mp for the test statistics

−[n − p − 0.5(m − r + 3)]/(rm) log(Λ(L)),   [(n − p)/(rm)] V(L),   and   [(n − p)/(rm)] U(L),

while d_1 = h = max(r, m) and d_2 = n − p − h + r for the test statistic

[(n − p − h + r)/h] λ_max(L).
Denote these statistics by W, P, HL, and R. Let the coverage be the proportion
of times that H0 is rejected. We want coverage near 0.05 when H0 is true
and coverage close to 1 for good power when H0 is false. With 5000 runs,
coverage outside of (0.04, 0.06) suggests that the true coverage is not 0.05.
Coverages are tabled for the F_1, F_2, F_{p−1}, and F_p test and for the MANOVA
F test denoted by F_M. The null hypothesis H0 was always true for the F_p
test and always false for the F_1 test. When the MANOVA F test was true,
H0 was true for the F_j tests with j ≠ 1. When the MANOVA F test was
false, H0 was false for the F_j tests with j ≠ p, but the F_{p−1} test should be
hardest to reject for j ≠ p by construction of B and the error vectors.
When the null hypothesis H0 was true, simulated values started to get close
to nominal levels for n ≥ 0.8(m + p)², and were fairly good for n ≥ 1.5(m + p)².
The exception was Roy's test which rejects H0 far too often if r > 1. See
Table 12.1 where we want values for the F_1 test to be close to 1 since H0 is
false for the F_1 test, and we want values close to 0.05 otherwise. Roy's test
was very good for the F_j tests but very poor for the MANOVA F test. Results
are shown for m = p = 10. As expected from Berndt and Savin (1977), Pillai's
test rejected H0 less often than Wilks' test which rejected H0 less often than
the Hotelling Lawley test. Based on a much larger simulation study, Pelawa
Watagoda (2013, pp. 111–112), using the four types of error distributions
and m = p, the tests had approximately correct level if n ≥ 0.83(m + p)² for
the Hotelling Lawley test, if n ≥ 2.80(m + p)² for Wilks' test (agreeing
with the Kshirsagar (1972) recommendation n ≥ 3(m + p)² for multivariate normal data), and if
n ≥ 4.2(m + p)² for Pillai's test.
In Table 12.2, H0 is only true for the F_p test where p = m, and we want
values in the F_p column near 0.05. We want values near 1 for high power
otherwise. If H0 is false, often H0 will be rejected for small n. For example,
if n ≥ 10p, then the m residual plots should start to look good, and the
MANOVA F test should be rejected. For the simulated data, the test had
fair power for n not much larger than mp. Results are shown for the lognormal
distribution.
Some R output for reproducing the simulation is shown below. The lregpack
function is mregsim and etype = 1 uses data from an MVN distribution. The
fcov line computed the Hotelling Lawley statistic using equation (12.8) while
the hotlawcov line used Definition 12.10. The mnull=T part of the command
means we want the first value near 1 for high power and the next three
numbers near the nominal level 0.05, except for mancv where we want all
of the MANOVA F test statistics to be near the nominal level of 0.05. The
mnull=F part of the command means we want all values near 1 for high power
except for the last column (for the terms other than mancv) corresponding to
the F_p test where H0 is true, so we want values near the nominal level of 0.05.
The coverage is the proportion of times that H0 is rejected, so coverage
is short for power and level: we want the coverage near 1 for high power
when H0 is false, and we want the coverage near the nominal level 0.05 when
H0 is true. Also see Problem 12.10.
mregsim(nruns=5000,etype=1,mnull=T)
$wilkcov
[1] 1.0000 0.0450 0.0462 0.0430
$pilcov
[1] 1.0000 0.0414 0.0432 0.0400
$hotlawcov
[1] 1.0000 0.0522 0.0516 0.0490
$roycov
[1] 1.0000 0.0512 0.0500 0.0480
$fcov
[1] 1.0000 0.0522 0.0516 0.0490
$mancv
wcv pcv hlcv rcv fcv
[1,] 0.0406 0.0332 0.049 0.1526 0.049
mregsim(nruns=5000,etype=2,mnull=F)
$wilkcov
[1] 0.9834 0.9814 0.9104 0.0408
$pilcov
[1] 0.9824 0.9804 0.9064 0.0372
$hotlawcov
[1] 0.9856 0.9838 0.9162 0.0480
$roycov
[1] 0.9848 0.9834 0.9156 0.0462
$fcov
[1] 0.9856 0.9838 0.9162 0.0480
$mancv
wcv pcv hlcv rcv fcv
[1,] 0.993 0.9918 0.9942 0.9978 0.9942
The same type of data and 5000 runs were used to simulate the prediction
regions for y_f given x_f for multivariate regression. With n = 100, m = 2, and
p = 4, the nominal coverage of the prediction region is 90%, and 92% of the
training data is covered. Following Olive (2013a), consider a prediction
region of the form {z : (z − T)^T C^{-1} (z − T) ≤ h²} = {z : D²_z ≤ h²} = {z : D_z ≤ h}.
Then the ratio of the prediction region volumes is

h_i^m √det(C_i) / [h_2^m √det(C_2)].
mpredsim(nruns=5000,etype=1)
$ncvr
[1] 0.9162
$scvr
[1] 0.916
$mcvr
[1] 0.9138
$voln
[1] 0.9892485
$vols
[1] 1
$volm
[1] 1.004964
$up
[1] 0.94
12.6 Summary
$Bhat
[,1] [,2] [,3]
[1,] 47.96841291 623.2817463 179.8867890
[2,] 0.07884384 0.7276600 -0.5378649
[3,] -1.45584256 -17.3872206 0.2337900
[4,] -0.01895002 0.1393189 -0.3885967
$Covhat
[,1] [,2] [,3]
[1,] 21.91591 123.2557 132.339
[2,] 123.25566 2619.4996 2145.780
[3,] 132.33902 2145.7797 2954.082
$partial
partialF Pval
[1,] 0.7703294 0.6141573
$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447
$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
19) Σ̂_ε = E^T E/(n − p) = [1/(n − p)] Σ_{i=1}^n ε̂_i ε̂_i^T while the sample covariance matrix of
the residuals is S_r = [(n − p)/(n − 1)] Σ̂_ε = E^T E/(n − 1). Both Σ̂_ε and S_r are √n consistent
estimators of Σ_ε for a large class of error distributions for ε_i.
20) The 100(1 − δ)% nonparametric prediction region could be applied to
the residual vectors ε̂_i or to z_i = ŷ_f + ε̂_i = B̂^T x_f + ε̂_i for i = 1, ..., n. This
takes the data cloud of the n residual vectors ε̂_i and centers the cloud at ŷ_f.
Let

D_i²(ŷ_f, S_r) = (z_i − ŷ_f)^T S_r^{-1} (z_i − ŷ_f).

Then the nonparametric prediction region is

{y : (y − ŷ_f)^T S_r^{-1} (y − ŷ_f) ≤ D²_{(U_n)}} = {y : D_y(ŷ_f, S_r) ≤ D_{(U_n)}}.
a) Consider the n prediction regions for the data where (y_{f,i}, x_{f,i}) =
(y_i, x_i) for i = 1, ..., n. If the order statistic D_{(U_n)} is unique, then U_n of the
n prediction regions contain y_i where U_n/n → 1 − δ as n → ∞.
b) If (ŷ_f, S_r) is a consistent estimator of (E(y_f), Σ_ε), then the nonparametric
prediction region is a large sample 100(1 − δ)% prediction region for y_f.
c) If (ŷ_f, S_r) is a consistent estimator of (E(y_f), Σ_ε), and the ε_i come
from an elliptically contoured distribution such that the highest density region
is {y : D_y(0, Σ_ε) ≤ D_{1−δ}}, then the nonparametric prediction region
is asymptotically optimal.
21) On the DD plot for the residuals, the cases to the left of the vertical
line correspond to cases that would have y_f = y_i in the nonparametric
prediction region if x_f = x_i, while the cases to the right of the line would
not have y_f = y_i in the nonparametric prediction region.
12.7 Complements
12.8 Problems
Let

Ŵ = [(X^T X)/n]^{-1} = n(X^T X)^{-1}.

Show T = [vec(L B̂)]^T [Σ̂_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}] [vec(L B̂)].
Let L = L_j = [0, ..., 0, 1, 0, ..., 0] have a 1 in the jth position, and let b̂_j^T = L_j B̂
be the jth row of B̂ (so b̂_j is an m × 1 vector). Let d̂_j = L_j (X^T X)^{-1} L_j^T = (X^T X)^{-1}_{jj}, the jth diagonal
entry of (X^T X)^{-1}. Then T_j = d̂_j^{-1} b̂_j^T Σ̂_ε^{-1} b̂_j. The Hotelling Lawley statistic

U = tr([(n − p) Σ̂_ε]^{-1} B̂^T L^T [L(X^T X)^{-1} L^T]^{-1} L B̂).

Hence if L = L_j, then U_j = [1/(d̂_j (n − p))] tr(Σ̂_ε^{-1} b̂_j b̂_j^T).
Using tr(ABC) = tr(CAB) and tr(a) = a for scalar a, show that
(n − p) U_j = T_j.
12.3. Consider the Hotelling Lawley test statistic. Using the Searle (1982,
p. 333) identity
$Ftable
Fj pvals
[1,] 82.147221 0.000000e+00
[2,] 58.448961 0.000000e+00
[3,] 15.700326 4.258563e-09
[4,] 9.072358 1.281220e-05
[5,] 45.364862 0.000000e+00
$MANOVA
MANOVAF pval
[1,] 67.80145 0
12.4. The above output is for the R Seatbelts data set where Y_1 =
drivers = number of drivers killed or seriously injured, Y_2 = front = number
of front seat passengers killed or seriously injured, and Y_3 = back = number
of back seat passengers killed or seriously injured. The predictors were
x_2 = kms = distance driven, x_3 = price = petrol price, x_4 = van = number
of van drivers killed, and x_5 = law = 0 if the law was in effect that month
and 1 otherwise. The data consists of 192 monthly totals in Great Britain
from January 1969 to December 1984, and the compulsory wearing of seat
belts law was introduced in February 1983.
a) Do the MANOVA F test.
b) Do the F4 test.
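One base-R way to carry out tests of this type is sketched below. This is only an illustrative approach using lm and anova.mlm with the column names of the R Seatbelts data (drivers, front, rear, kms, PetrolPrice, VanKilled, law); it does not use the lregpack function mltreg that produced the output above, so the F approximations need not match the displayed Ftable exactly.

dat  <- as.data.frame(datasets::Seatbelts)
full <- lm(cbind(drivers, front, rear) ~ kms + PetrolPrice + VanKilled + law, data = dat)
fit0 <- lm(cbind(drivers, front, rear) ~ 1, data = dat)
anova(fit0, full, test = "Hotelling-Lawley")  # a) MANOVA-type test for all nontrivial predictors
anova(full, test = "Hotelling-Lawley")        # sequential tests; law is entered last, so its
                                              # row gives an F4-type test adjusted for the others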
y<-USJudgeRatings[,c(9,10,12)]
x<-USJudgeRatings[,-c(9,10,12)]
mltreg(x,y,indices=c(2,5,6,7,8))
$partial
partialF Pval
[1,] 1.649415 0.1855314
$MANOVA
MANOVAF pval
[1,] 340.1018 1.121325e-14
12.6. The above output is for the R judge ratings data set consisting of
lawyer ratings for n = 43 judges. Y_1 = oral = sound oral rulings, Y_2 = writ =
sound written rulings, and Y_3 = rten = worthy of retention. The predictors
were x_2 = cont = number of contacts of lawyer with judge, x_3 = intg =
judicial integrity, x_4 = dmnr = demeanor, x_5 = dilg = diligence, x_6 =
cfmg = case flow managing, x_7 = deci = prompt decisions, x_8 = prep =
preparation for trial, x_9 = fami = familiarity with law, and x_10 = phys =
physical ability.
a) Do the MANOVA F test.
b) Do the MANOVA partial F test for the reduced model that deletes
x2 , x5 , x6 , x7 , and x8 .
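A similar base-R sketch for these two tests is shown below; it assumes that the Hotelling Lawley approximation in anova.mlm is an acceptable stand-in for the mltreg partial F test, so the numbers may differ slightly from the output above.

y <- USJudgeRatings[, c(9, 10, 12)]   # ORAL, WRIT, RTEN
x <- USJudgeRatings[, -c(9, 10, 12)]
full <- lm(as.matrix(y) ~ ., data = x)
red  <- lm(as.matrix(y) ~ . - CONT - DILG - CFMG - DECI - PREP, data = x)
anova(lm(as.matrix(y) ~ 1, data = x), full, test = "Hotelling-Lawley")  # a) MANOVA test
anova(red, full, test = "Hotelling-Lawley")   # b) partial test for the deleted predictors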
12.10. This problem uses the lregpack function mregsim to simulate
Wilks' test, Pillai's trace test, the Hotelling Lawley trace test, and Roy's largest
root test for the F_j tests and the MANOVA F test for multivariate linear
regression. When mnull = T the first row of B is 1^T while the remaining
rows are equal to 0^T. Hence the null hypothesis for the MANOVA F test is
true. When mnull = F the null hypothesis is true for p = 2, but false for
p > 2. Now the first row of B is 1^T and the last row of B is 0^T. If p > 2,
then the second to last row of B is (1, 0, ..., 0), the third to last row is
(1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1^T. First
m iid errors z_i are generated such that the m errors are iid with variance
12.11. This problem uses the lregpack function mpredsim to simulate the
prediction regions for y_f given x_f for multivariate regression. With 5000 runs
this simulation may take several minutes. The R command for this problem
generates iid lognormal errors then subtracts the mean, producing z_i. Then
the ε_i = A z_i are generated as in Problem 12.10 with n = 100, m = 2, and p = 4.
The nominal coverage of the prediction region is 90%, and 92% of the training
data is covered. The ncvr output gives the coverage of the nonparametric
region. What was ncvr?
Chapter 13
GLMs and GAMs
This chapter contains some extensions of the multiple linear regression model.
See Definition 1.1 for the 1D regression model, sufficient predictor (SP =
h(x)), estimated sufficient predictor (ESP = ĥ(x)), generalized linear model
(GLM), and the generalized additive model (GAM). When using a GAM
to check a GLM, the notation ESP may be used for the GLM, and EAP
(estimated additive predictor) may be used for the ESP of the GAM.
Definition 1.2 defines the response plot of ESP versus Y.
Suppose the sufficient predictor SP = h(x). Often SP = β^T x if β_1 corresponds
to the constant and x_1 ≡ 1. If x only contains the nontrivial predictors,
then SP = α + β^T x is often used. Much of Chapter 1 examines this
special case, including response plots, variable selection, interactions, factors,
and the interpretation of the parameters β_i.
13.1 Introduction
Definition 13.2. The BBR model states that Y_1, ..., Y_n are independent
random variables where Y_i | SP_i ∼ BB(m_i, ρ(SP_i), θ). Hence E(Y_i | SP_i) =
m_i ρ(SP_i) and V(Y_i | SP_i) = m_i ρ(SP_i)[1 − ρ(SP_i)][1 + (m_i − 1)θ/(1 + θ)].
The BBR model has the same mean function as the binomial regression
model, but allows for overdispersion. As θ → 0, it can be shown that the
BBR model converges to the binomial regression model.
for y = 0, 1, 2, ... where μ > 0 and κ > 0. Then E(Y) = μ and V(Y) =
μ + μ²/κ. (This distribution is a generalization of the negative binomial (κ, ρ)
distribution where ρ = κ/(μ + κ) and κ > 0 is an unknown real parameter
rather than a known integer.)
The NBR model has the same mean function as the PR model but allows
for overdispersion. Following Agresti (2002, p. 560), as 1/κ → 0, it can
be shown that the NBR model converges to the PR model.
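If software for fitting the NBR model is wanted, one option is the MASS package; a brief sketch with simulated data is shown below. This is an assumption about tooling rather than the text's own code: the theta component of the glm.nb fit plays the role of the dispersion parameter κ, so the fitted variance function is μ̂ + μ̂²/θ̂, and a large θ̂ means the NBR fit is close to the PR fit.

library(MASS)   # for glm.nb
set.seed(1)
x1 <- runif(100); x2 <- rbinom(100, 1, 0.5)
mu <- exp(1 + x1 - 0.5 * x2)
y  <- rnbinom(100, size = 2, mu = mu)    # overdispersed counts with kappa = 2
fitp  <- glm(y ~ x1 + x2, family = poisson)   # PR fit
fitnb <- glm.nb(y ~ x1 + x2)                  # NBR fit
fitnb$theta                                   # estimate of the dispersion parameter kappa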
where k(θ) ≥ 0 and h(y) ≥ 0. The functions h, k, t, and w are real valued
functions.
where S(y) = log(g(y)), d(θ) = log(k(θ)), and the support Y does not depend
on θ. Here the indicator function I_Y(y) = 1 if y ∈ Y and I_Y(y) = 0, otherwise.
Definition 13.5. Assume that the data is (Y_i, x_i) for i = 1, ..., n. An
important type of generalized linear model (GLM) for the data states that
the Y_1, ..., Y_n are independent random variables from a 1-parameter exponential
family with pdf or pmf

f(y_i | θ(x_i)) = k(θ(x_i)) h(y_i) exp[ (c(θ(x_i))/a(φ)) y_i ].   (13.3)

μ(x_i) = g^{-1}(α + β^T x_i).   (13.4)

t(y_i) = y_i and w(θ) = c(θ)/a(φ),
The end of Section 1.1 discusses several things to check and consider after
selecting a 1D regression model. The following three sections illustrate three
of the most important generalized linear models. Inference and variable selec-
tion for these GLMs are discussed in Sections 13.5 and 13.6. Their generalized
additive model analogs are discussed in Section 13.7.
Note that the conditional mean function E(Y_i | SP_i) = m_i ρ(SP_i) and the
conditional variance function V(Y_i | SP_i) = m_i ρ(SP_i)[1 − ρ(SP_i)].
Thus the binary logistic regression model says that Y | SP ∼ binomial(1, ρ(SP))
where

ρ(SP) = exp(SP) / (1 + exp(SP))

for the LR model. Note that the conditional mean function E(Y | SP) =
ρ(SP) and the conditional variance function V(Y | SP) = ρ(SP)[1 − ρ(SP)].
For the LR model, the Y_i are independent and

Y | x ≈ binomial(1, exp(ESP)/(1 + exp(ESP))),

g^{-1}(α + β^T x) = exp(α + β^T x) / (1 + exp(α + β^T x)) = ρ(x) = μ(x).
β = Σ^{-1}(μ_1 − μ_0)   (13.7)

and α = log(π_1/π_0) − 0.5(μ_1 − μ_0)^T Σ^{-1}(μ_1 + μ_0).
The logistic regression (maximum likelihood) estimator also tends to perform
well for this type of data. An exception is when the Y = 0 cases and
Y = 1 cases can be perfectly or nearly perfectly classified by the ESP. Let the
logistic regression ESP = α̂ + β̂^T x. Consider the response plot of the ESP
versus Y. If the Y = 0 values can be separated from the Y = 1 values by
the vertical line ESP = 0, then there is perfect classification. See Figure 13.1
b). In this case the maximum likelihood estimator for the logistic regression
parameters (α, β) does not exist because the logistic curve cannot approximate
a step function perfectly. See Atkinson and Riani (2000, pp. 251–254).
If only a few cases need to be deleted in order for the data set to have perfect
classification, then the amount of overlap is small and there is nearly
perfect classification.
Ordinary least squares (OLS) can also be useful for logistic regression. The
ANOVA F test, partial F test, and OLS t tests are often asymptotically valid
when the conditions in Definition 13.7 are met, and the OLS ESP and LR
ESP are often highly correlated. See Haggstrom (1983) and Theorem 13.1
below. Assume that Cov(x) ≡ Σ_x and that Cov(x, Y) = Σ_{x,Y}. Let μ_j =
E(x | Y = j) for j = 0, 1. Let N_i be the number of Y_s that are equal to i for
i = 0, 1. Then

μ̂_i = (1/N_i) Σ_{j: Y_j = i} x_j
on a constant and x (using software originally meant for multiple linear regression).
Then

β̂_OLS = Σ̂_x^{-1} Σ̂_{x,Y} = π̂_0 π̂_1 Σ̂_x^{-1}(μ̂_1 − μ̂_0) →_P β_OLS = π_0 π_1 Σ_x^{-1}(μ_1 − μ_0) as n → ∞.

Proof. From Theorem 11.19,

β̂_OLS = Σ̂_x^{-1} Σ̂_{x,Y} →_P β_OLS as n → ∞

and Σ̂_{x,Y} = (1/n) Σ_{i=1}^n x_i Y_i − x̄ Ȳ. Thus

Σ̂_{x,Y} = (1/n)[ Σ_{j: Y_j = 1} x_j (1) + Σ_{j: Y_j = 0} x_j (0) ] − x̄ π̂_1 =

(1/n) N_1 μ̂_1 − (1/n)(N_1 μ̂_1 + N_0 μ̂_0) π̂_1 = π̂_1 μ̂_1 − π̂_1² μ̂_1 − π̂_1 π̂_0 μ̂_0 =

π̂_1(1 − π̂_1) μ̂_1 − π̂_1 π̂_0 μ̂_0 = π̂_1 π̂_0 (μ̂_1 − μ̂_0)

and the result follows.
β̂_D = [n²/(N_0 N_1)] Σ̂^{-1} Σ̂_x β̂_OLS.
Now when the conditions of Definition 13.7 are met and if μ_1 − μ_0 is small
enough so that there is not perfect classification, then β_LR = Σ^{-1}(μ_1 − μ_0).
Empirically, the OLS ESP and LR ESP are highly correlated for many LR
data sets where the conditions are not met, e.g. when some of the predictors
are factors. This suggests that β_LR ≈ d Σ_x^{-1}(μ_1 − μ_0) for many LR data
sets where d is some constant depending on the data.
Definition 13.8. For binary logistic regression, the response plot or estimated
sufficient summary plot is the plot of the ESP = α̂ + β̂^T x_i versus Y_i
with the estimated mean function

ρ̂(ESP) = exp(ESP) / (1 + exp(ESP))

added as a visual aid. A lowess curve or a step function of observed proportions
is also often added as a visual aid.
Both the lowess curve and step function are simple nonparametric estimators
of the mean function ρ(SP). If the lowess curve or step function tracks
the logistic curve (the estimated mean) closely, then the LR mean function
is a reasonable approximation to the data.
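A base-R sketch of such a response plot is given below for a small simulated binary data set; this is illustrative only and is not a plotting function from the text or from lregpack.

set.seed(2)
X <- matrix(rnorm(200 * 2), ncol = 2)
y <- rbinom(200, 1, 1 / (1 + exp(-(X[, 1] + X[, 2]))))
fit <- glm(y ~ X, family = binomial)
ESP <- fit$linear.predictors                 # ESP_i = alpha.hat + beta.hat^T x_i
plot(ESP, y, main = "Response Plot")
curve(exp(x) / (1 + exp(x)), add = TRUE)     # estimated LR mean function rho.hat(ESP)
lines(lowess(ESP, y), lty = 2)               # scatterplot smoother as a nonparametric check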
Checking the LR model in the nonbinary case is more difficult because
the binomial distribution is not the only distribution appropriate for data
that takes on values 0, 1, ..., m if m ≥ 2. Hence both the mean and variance
functions need to be checked. Often the LR mean function is a good approximation
to the data, the LR MLE is a consistent estimator of β, but the
LR model is not appropriate. The problem is that for many data sets where
E(Y_i | x_i) = m_i ρ(SP_i), it turns out that V(Y_i | x_i) > m_i ρ(SP_i)[1 − ρ(SP_i)].
This phenomenon is called overdispersion. The BBR model of Definition 13.2
is a useful alternative to LR.
For both the LR and BBR models, the conditional distribution of Y | x can
still be visualized with a response plot of the ESP versus Z_i = Y_i/m_i with the
estimated mean function Ê(Z_i | x_i) = ρ(ESP) and a step function
or lowess curve added as visual aids.
Since the binomial regression model is simpler than the BBR model, graphical
diagnostics for the goodness of fit of the LR model would be useful. The
following plot was suggested by Olive (2013b) to check for overdispersion.
Overdispersion is suggested if the vertical scale of the OD plot is more
than 10 times that of the horizontal, or if the percentage of points above the
slope 4 line through the origin is much larger than 5%.
If the binomial LR OD plot is used but the data follows a beta-binomial
regression model, then V̂_mod = V̂(Y_i | SP) ≈ m_i ρ(ESP)[1 − ρ(ESP)] while V̂ =
[Y_i − m_i ρ(ESP)]² ≈ (Y_i − E(Y_i))². Hence E(V̂) ≈ V(Y_i) ≈ m_i ρ(ESP)[1 −
ρ(ESP)][1 + (m_i − 1)θ/(1 + θ)], so the plotted points with m_i = m should
scatter about a line with slope 1 + (m − 1)θ/(1 + θ) = (1 + mθ)/(1 + θ).
[Figure 13.1: panels a), b), and c) are response plots (ESSP) and panel d) is an OD Plot (Vhat); plot residue omitted.]
The first example is for binary data. For binary data, G² is not approximately
χ² and some plots of residuals have a pattern whether the model is
correct or not. For binary data the OD plot is not needed, and the plotted
points follow a curve rather than falling in a wedge. The response plot is
very useful if the logistic curve and step function of observed proportions are
added as visual aids. The logistic curve gives the estimated LR probability
of success. For example, when ESP = 0, the estimated probability is 0.5.
Fig. 13.2 Visualizing the Death Penalty Data: a) ESSP (Z versus ESP), b) OD Plot (Vhat versus Vmodhat)
predictor head height and perfectly classifies the data since the ape skulls
can be separated from the human skulls with a vertical line at ESP = 0.
Christmann and Rousseeuw (2001) also used the response plot to visualize
overlap. The response plot in Figure 13.1c uses predictors lower jaw length,
face length, and upper jaw length. None of the predictors is good individually,
but together they provide a good LR model since the observed proportions (the
step function) track the model proportions (logistic curve) closely. The OD
plot in Figure 13.1d) is curved and is not needed for a binary response.
Example 13.3. Collett (1999, pp. 216–219) describes a data set where
the response variable is the number of rotifers that remain in suspension in
a tube. A rotifer is a microscopic invertebrate. The two predictors were the
density of a stock solution of Ficoll and the species of rotifer coded as 1
for Polyarthra major and 0 for Keratella cochlearis. Figure 13.3a shows the
response plot (ESSP). Both the observed proportions and the step function
track the logistic curve well, suggesting that the LR mean function is a good
approximation to the data. The OD plot suggests that there is overdispersion
since the vertical scale is about 30 times the horizontal scale. The OLS line
has slope much larger than 4 and two outliers seem to be present.

Fig. 13.3 Plots for Rotifer Data: a) ESSP (Z versus ESP), b) OD Plot (Vhat versus Vmodhat)
The Poisson pmf can be written as

f(y) = P(Y = y) = e^{−μ} μ^y / y! = e^{−μ} (1/y!) exp[log(μ) y],

where k(μ) = e^{−μ} ≥ 0, h(y) = 1/y! ≥ 0, and c(μ) = log(μ), so the Poisson
distribution is a 1-parameter exponential family. For Poisson regression,

g^{-1}(α + β^T x) = exp(α + β^T x) = μ(x).
In the response plot for Poisson regression, the shape of the estimated
mean function μ̂(ESP) = exp(ESP) depends strongly on the range of the
ESP. The variety of shapes occurs because the plotting software attempts
to fill the vertical axis. Hence if the range of the ESP is narrow, then the
exponential function will be rather flat. If the range of the ESP is wide, then
the exponential curve will look flat in the left of the plot but will increase
sharply in the right of the plot.
If the exponential curve clearly fits the lowess curve better than the line
Y = Ȳ, then H0 should be rejected, but if the line Y = Ȳ fits the lowess
curve about as well as the exponential curve (which should only happen if
the exponential curve is approximately linear with a small slope), then Y
may be independent of the predictors. See Figure 13.6a).
Warning: For many count data sets where the PR mean function is good,
the PR model is not appropriate but the PR MLE is still a consistent estimator
of β. The problem is that for many data sets where E(Y | x) = μ(x) =
exp(SP), it turns out that V(Y | x) > exp(SP). This phenomenon is called
overdispersion. Adding parametric and nonparametric estimators of the
standard deviation function to the response plot can be useful. See Cook
and Weisberg (1999a, pp. 401–403). The NBR model of Definition 13.3 is a
useful alternative to PR.
Since the Poisson regression model is simpler than the NBR model, graphical
diagnostics for the goodness of fit of the PR model would be useful. The
following plot was suggested by Winkelmann (2000, p. 110).
Overdispersion is suggested if the vertical scale of the OD plot is more
than 10 times that of the horizontal, or if the percentage of points above
the slope 4 line through the origin is much larger than 5%. Hence the identity
line and slope 4 line are added to the OD plot as visual aids, and one should
check whether the scale of the vertical axis is more than 10 times that of the
horizontal.
Combining the response plot with the OD plot is a powerful method for
assessing the adequacy of the Poisson regression model. It is easier to use the
OD plot to check the variance function than the response plot since judging
the variance function with the straight lines of the OD plot is simpler than
judging two curves. Also outliers are often easier to spot with the OD plot.
For Poisson regression, judging the mean function from the response plot
may be rather difficult for large counts since the mean function is curved
and lowess does not track the exponential function very well for large counts.
Simple diagnostic plots for the Poisson regression model can be made using
weighted least squares (WLS). To see this, assume that all n of the counts
Y_i are large. Then

log(μ(x_i)) = log(μ(x_i)) + log(Y_i) − log(Y_i) = α + β^T x_i,

or log(Y_i) = α + β^T x_i + e_i where e_i = log(Y_i/μ(x_i)). The error e_i does not have
zero mean or constant variance, but if μ(x_i) is large, then

[Y_i − μ(x_i)] / √μ(x_i) ≈ N(0, 1)

by the central limit theorem. Recall that log(1 + x) ≈ x for |x| < 0.1. Then,
heuristically,

e_i = log( [μ(x_i) + Y_i − μ(x_i)] / μ(x_i) ) ≈ [Y_i − μ(x_i)] / μ(x_i) =
[1/√μ(x_i)] [Y_i − μ(x_i)] / √μ(x_i) ≈ N(0, 1/μ(x_i)).

This suggests that for large μ(x_i), the errors e_i are approximately 0 mean
with variance 1/μ(x_i). If the μ(x_i) were known, and all of the Y_i were large,
then a weighted least squares regression of log(Y_i) on x_i with weights w_i = μ(x_i) should
produce good estimates of (α, β). Since the μ(x_i) are unknown, the estimated
weights w_i = Y_i could be used. Since P(Y_i = 0) > 0, the estimators given in
the following definition are used. Let Z_i = Y_i if Y_i > 0, and let Z_i = 0.5 if
Y_i = 0.
See Agresti (2002, pp. 611–612). However, the two estimators are often close
for many data sets.
The basic idea of the following two plots for Poisson regression is to transform
the data towards a linear model, then make the response plot of Ŵ
versus W and residual plot of the residuals W − Ŵ for the transformed
response variable W. The mean function is the identity line and the vertical
deviations from the identity line are the WLS residuals. The plots are
based on weighted least squares (WLS) regression. Use the equivalent OLS
regression (without intercept) of W = √Z_i log(Z_i) on √Z_i (1, x_i^T)^T. Then
the plot of the fitted values Ŵ = √Z_i (α̂_M + β̂_M^T x_i) versus the response
√Z_i log(Z_i) should have points that scatter about the identity line. These
results and the equivalence of the minimum chi-square estimator to an OLS
estimator suggest the following diagnostic plots.
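The "equivalent OLS regression" can be written out directly in R; the sketch below uses simulated Poisson counts (the data and names are illustrative, not from the text) and follows the definition of Z_i above, so the two plots produced are versions of the weighted forward response plot and weighted residual plot.

set.seed(3)
x <- matrix(rnorm(100 * 2), ncol = 2)
Y <- rpois(100, exp(1 + x[, 1]))
Z <- ifelse(Y > 0, Y, 0.5)                        # Z_i = Y_i, or 0.5 if Y_i = 0
sZ <- sqrt(Z)
fit <- lm(sZ * log(Z) ~ cbind(sZ, sZ * x) - 1)    # OLS without intercept of W on sqrt(Z)(1, x^T)^T
MWFIT <- fit$fitted.values                        # sqrt(Z_i)(alpha_M.hat + beta_M.hat^T x_i)
plot(MWFIT, sZ * log(Z)); abline(0, 1)            # weighted forward response plot
plot(MWFIT, fit$residuals); abline(h = 0)         # weighted residual plot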
Example 13.4. For the Ceriodaphnia data of Myers et al. (2002, pp.
136–139), the response variable Y is the number of Ceriodaphnia organisms
counted in a container. The sample size was n = 70 and seven concentra-
tions of jet fuel (x1 ) and an indicator for two strains of organism (x2 ) were
used as predictors. The jet fuel was believed to impair reproduction so high
concentrations should have smaller counts. Figure 13.4 shows 4 plots for this
data. In the response plot of Figure 13.4a, the lowess curve is represented as
a jagged curve to distinguish it from the estimated PR mean function (the
exponential curve). The horizontal line corresponds to the sample mean Y .
The OD plot in Figure 13.4b suggests that there is little evidence of overdis-
persion. These two plots as well as Figures 13.4c and 13.4d suggest that the
Poisson regression model is a useful approximation to the data.
[Figure 13.4 (Ceriodaphnia data): a) ESSP (Y versus ESP), b) OD Plot (Vhat versus Ehat), c) and d) weighted forward response and residual plots (MWFIT, MWRES); plot residue omitted.]
Example 13.5. For the crab data, the response Y is the number of satellites
(male crabs) near a female crab. The sample size n = 173 and the predictor
variables were the color, spine condition, carapace width, and weight
of the female crab. Agresti (2002, pp. 126–131) first uses Poisson regression,
and then uses the NBR model with estimated dispersion parameter 0.98 ≈ 1.
Figure 13.5a suggests that there is one case with an unusually large value of
the ESP. The lowess curve does not track the exponential curve all that well.
Figure 13.5b suggests that overdispersion is present since the vertical scale is
about 10 times that of the horizontal scale and too many of the plotted points
are large and greater than the slope 4 line. Figure 13.5c also suggests that the
Poisson regression mean function is a rather poor fit since the plotted points
fail to cover the identity line. Although the exponential mean function fits
the lowess curve better than the line Y = Ȳ, an alternative model to the NBR
model may fit the data better. In later chapters, Agresti uses binomial
regression models for this data.
Example 13.6. For the popcorn data of Myers et al. (2002, p. 154), the
response variable Y is the number of inedible popcorn kernels. The sample
size was n = 15 and the predictor variables were temperature (coded as 5,
6, or 7), amount of oil (coded as 2, 3, or 4), and popping time (75, 90, or
105). One batch of popcorn had more than twice as many inedible kernels
as any other batch and is an outlier. Ignoring the outlier in Figure 13.6a
suggests that the line Y = Ȳ will fit the data and lowess curve better than
the exponential curve. Hence Y seems to be independent of the predictors.
Notice that the outlier sticks out in Figure 13.6b and that the vertical scale is
well over 10 times that of the horizontal scale. If the outlier was not detected,
then the Poisson regression model would suggest that temperature and time
are important predictors, and overdispersion diagnostics such as the deviance
would be greatly inflated. However, we probably need to delete the high
temperature, low oil, and long popping time combination to conclude that
the response is independent of the predictors.

[Figures 13.5 and 13.6: panels a) ESSP (Y versus ESP), b) OD Plot (Vhat versus Ehat), c) and d) weighted forward response and residual plots (MWFIT, MWRES); plot residue omitted.]
13.5 Inference
This section gives a very brief discussion of inference for the logistic regression
(LR) and Poisson regression (PR) models. Inference for these two models is
very similar to inference for the multiple linear regression (MLR) model. For
all three of these models, Y is independent of the k × 1 vector of predictors
x = (x_1, ..., x_k)^T given the sufficient predictor α + β^T x: Y ⫫ x | (α + β^T x).
To perform inference for LR and PR, computer output is needed. Shown
below is output using symbols and Arc output from a real data set with
k = 2 nontrivial predictors. This data set is the banknote data set described
in Cook and Weisberg (1999a, p. 524). There were 200 Swiss bank notes of
which 100 were genuine (Y = 0) and 100 counterfeit (Y = 1). The goal of the
analysis was to determine whether a selected bill was genuine or counterfeit
from physical measurements of the bill.
Binomial Regression
Kernel mean function = Logistic
Response = Status
Terms = (Bottom Left)
Trials = Ones
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -389.806 104.224 -3.740 0.0002
Bottom 2.26423 0.333233 6.795 0.0000
Left 2.83356 0.795601 3.562 0.0004
Scale factor: 1.
Number of cases: 200
Degrees of freedom: 197
Pearson X2: 179.809
Deviance: 99.169
Point estimators for the mean function are important. Given values of
x = (x_1, ..., x_k)^T, a major goal of binary logistic regression is to estimate the
success probability P(Y = 1 | x) = ρ(x) with the estimator

ρ̂(x) = exp(α̂ + β̂^T x) / (1 + exp(α̂ + β̂^T x)).   (13.8)

The Wald confidence interval (CI) for β_j can also be obtained using the
output: the large sample 100(1 − δ)% CI for β_j is β̂_j ± z_{1−δ/2} se(β̂_j).
The Wald test and CI tend to give good results if the sample size n is large.
Here 1 − δ refers to the coverage of the CI. A 90% CI uses z_{1−δ/2} = 1.645, a
95% CI uses z_{1−δ/2} = 1.96, and a 99% CI uses z_{1−δ/2} = 2.576.
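In R these Wald intervals can be read off the glm coefficient table; the sketch below uses a small simulated logistic regression fit (an illustrative stand-in for real output) so that it runs on its own.

set.seed(4)
x1 <- rnorm(50); y <- rbinom(50, 1, 1 / (1 + exp(-x1)))
fit <- glm(y ~ x1, family = binomial)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(est - 1.96 * se, est + 1.96 * se)   # large sample 95% Wald CIs
confint.default(fit)                      # same intervals from the built-in Wald method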
For a GLM, often 3 models are of interest: the full model that uses all k of
the predictors x^T = (x_R^T, x_O^T), the reduced model that uses the r predictors
x_R, and the saturated model that uses n parameters μ_1, ..., μ_n where n is
the sample size. For the full model the k + 1 parameters α, β_1, ..., β_k are
estimated while the reduced model has r + 1 parameters. Let l_SAT(μ_1, ..., μ_n)
be the likelihood function for the saturated model and let l_FULL(α, β) be the
likelihood function for the full model. Let L_SAT = log l_SAT(μ̂_1, ..., μ̂_n) be the
log likelihood function for the saturated model evaluated at the maximum
likelihood estimator (MLE) (μ̂_1, ..., μ̂_n) and let L_FULL = log l_FULL(α̂, β̂) be
the log likelihood function for the full model evaluated at the MLE (α̂, β̂).
Then the deviance D = G² = −2(L_FULL − L_SAT). The degrees of freedom
for the deviance = df_FULL = n − k − 1 where n is the number of parameters
for the saturated model and k + 1 is the number of parameters for the full
model.
The saturated model for logistic regression states that for i = 1, ..., n, the
Y_i | x_i are independent binomial(m_i, ρ_i) random variables where ρ̂_i = Y_i/m_i.
The saturated model is usually not very good for binary data (all m_i = 1)
or if the m_i are small. The saturated model can be good if all of the m_i are
large or if ρ̂_i is very close to 0 or 1 whenever m_i is not large.
The saturated model for Poisson regression states that for i = 1, ..., n,
the Y_i | x_i are independent Poisson(μ_i) random variables where μ̂_i = Y_i. The
saturated model is usually not very good for Poisson data, but the saturated
model may be good if n is fixed and all of the counts Y_i are large.
If X ∼ χ²_d, then E(X) = d and VAR(X) = 2d. An observed value of
X > d + 3√d is unusually large and an observed value of X < d − 3√d is
unusually small.
When the saturated model is good, a rule of thumb is that the logistic
or Poisson regression model is ok if G² ≤ n − k − 1 (or if G² ≤ n − k −
1 + 3√(n − k − 1)). For binary LR, the χ²_{n−k−1} approximation for G² is rarely
good even for large sample sizes n. For LR, the response plot is often a much
better diagnostic for goodness of fit, especially when ESP = α + β^T x_i takes
on many values and when k + 1 << n. For PR, both the response plot and
G² ≤ n − k − 1 + 3√(n − k − 1) should be checked.
Response = Y
Terms = (X1, ..., Xk)
Sequential Analysis of Deviance
                                    Total                 Change
Predictor    df                     Deviance    df        Deviance
Ones         n - 1 = dfo            G^2_o
X1           n - 2                              1
X2           n - 3                              1
 :            :                                 :
Xk           n - k - 1 = dfFULL     G^2_FULL    1
-----------------------------------------
Data set = cbrain, Name of Fit = B1
Response = sex
Terms = (cephalic size log[size])
Sequential Analysis of Deviance
Total Change
Predictor df Deviance | df Deviance
Ones 266 363.820 |
cephalic 265 363.605 | 1 0.214643
size 264 315.793 | 1 47.8121
log[size] 263 305.045 | 1 10.7484
The above Arc output, shown in symbols and for a real data set, is used
for the deviance test described below. Assume that the response plot has
been made and that the logistic or Poisson regression model fits the data
well in that the nonparametric step or lowess estimated mean function follows
the estimated model mean function closely and there is no evidence of
overdispersion. The deviance test is used to test whether β = 0. If this is the
case, then the predictors are not needed in the GLM model. If Ho: β = 0
is not rejected, then for Poisson regression the estimator μ̂ = Ȳ should be
used while for logistic regression ρ̂ = Σ_{i=1}^n Y_i / Σ_{i=1}^n m_i should be used. Note that
ρ̂ = Ȳ for binary logistic regression since m_i ≡ 1 for i = 1, ..., n.
This test can be performed in R by obtaining output from the full and
null model.
outf <- glm(Y~x1 + x2 + ... + xk, family = binomial)
outn <- glm(Y~1,family = binomial)
anova(outn,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** k G^2(0|F) pvalue
The output below, shown both in symbols and for a real data set, can be
used to perform the change in deviance test. If the reduced model leaves out
a single variable X_i, then the change in deviance test becomes H0: β_i = 0
versus HA: β_i ≠ 0. This test is a competitor of the Wald test. This change in
deviance test is usually better than the Wald test if the sample size n is not
large, but the Wald test is often easier for software to produce. For large n
the test statistics from the two tests tend to be very similar (asymptotically
equivalent tests).
If the reduced model is good, then the EE plot of ESP(R) = α̂_R + β̂_R^T x_Ri
versus ESP = α̂ + β̂^T x_i should cluster tightly about the identity line
with unit slope and zero intercept.
SP = α + β_1 x_1 + ⋯ + β_k x_k = α + β^T x = α + β_R^T x_R + β_O^T x_O

where the reduced model uses r of the predictors used by the full model and
x_O denotes the vector of k − r predictors that are in the full model but not
the reduced model. For logistic regression, the reduced model is Y_i | x_Ri ∼
independent binomial(m_i, ρ(x_Ri)) while for Poisson regression the reduced
model is Y_i | x_Ri ∼ independent Poisson(μ(x_Ri)) for i = 1, ..., n.
Assume that the response plot looks good. Then we want to test H0: the
reduced model is good (can be used instead of the full model) versus HA:
use the full model (the full model is significantly better than the reduced
model). Fit the full model and the reduced model to get the deviances G²_FULL
and G²_RED.
This test can be performed in R by obtaining output from the full and
reduced model.
outf <- glm(Y~x1 + x2 + ... + xk, family = binomial)
outr <- glm(Y~ x3 + x5 + x7,family = binomial)
anova(outr,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** k-r G^2(R|F) pvalue
Interpretation of coefficients: if x_1, ..., x_{i−1}, x_{i+1}, ..., x_k can be held fixed,
then increasing x_i by 1 unit increases the sufficient predictor SP by β_i units.
As a special case, consider logistic regression. Let ρ(x) = P(success | x) = 1 −
P(failure | x) where a success is what is counted and a failure is what is not
counted (so if the Y_i are binary, ρ(x) = P(Y_i = 1 | x)). Then the estimated
odds of success is ρ̂(x)/[1 − ρ̂(x)] = exp(α̂ + β̂^T x). In logistic regression,
increasing a predictor x_i by 1 unit (while holding all other predictors fixed)
multiplies the estimated odds of success by a factor of exp(β̂_i).
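The estimated odds ratios and their Wald intervals can therefore be obtained by exponentiating the logistic regression output; the small self-contained sketch below uses simulated data with illustrative names.

set.seed(5)
x1 <- rnorm(100); x2 <- rbinom(100, 1, 0.4)
y  <- rbinom(100, 1, 1 / (1 + exp(-(0.5 * x1 + x2))))
fit <- glm(y ~ x1 + x2, family = binomial)
exp(coef(fit))              # multiplicative change in estimated odds per unit increase
exp(confint.default(fit))   # Wald intervals for the odds ratios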
ρ̂(x) = e^ESP / (1 + e^ESP) = 1.1384 / (1 + 1.1384) = 0.5324.
b) i) H0: the reduced model is good   HA: use the full model
ii) G²(R|F) = 313.457 − 234.792 = 78.665
iii) Now df = 264 − 257 = 7, and comparing 78.665 with χ²_{7, 0.999} = 24.32
shows that the pval = 0 < 1 − 0.999 = 0.001.
iv) Reject H0, use the full model.
Response = y
Sequential Analysis of Deviance
All fits include an intercept.
Total Change
Predictor df Deviance | df Deviance
Ones 999 1221.73 |
x1 998 1177.11 | 1 44.6148
x2 997 1176.55 | 1 0.561629
x3 996 1168.33 | 1 8.21723
x4 995 1168.20 | 1 0.137583
x5 994 1163.44 | 1 4.75625
x6 993 1158.22 | 1 5.21846
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -5.84211 1.74259 -3.353 0.0008
jaw ht 0.103606 0.0383650 ? ??
Example 13.9. A museum has 60 skulls, some of which are human and
some of which are from apes. Consider trying to estimate whether the skull
type is human or ape from the height of the lower jaw. Use the above logistic
regression output to answer the following problems. The museum data is
available from the text's website as file museum.lsp, and is from Schaaffhausen
(1878).
ρ̂(x) = e^ESP / (1 + e^ESP) = 0.1830731 / (1 + 0.1830731) = 0.1547.
Example 13.10. Use the above output to perform inference on the number
of locations where the aircraft was damaged. The output is from a Poisson
regression. The variable exper = total months of aircrew experience while
type of aircraft was coded as 0 or 1. There were n = 30 cases. Data is from
Montgomery et al. (2001).
a) Predict μ̂(x) if bombload = x_1 = 7.0, exper = x_2 = 80.2, and type
= x_3 = 1.0.
b) Perform the 4 step Wald test for Ho: β_2 = 0.
c) Find a 95% confidence interval for β_3.
Solution: a) ESP = α̂ + β̂_1 x_1 + β̂_2 x_2 + β̂_3 x_3 = −0.406023 + 0.165426(7)
− 0.0135223(80.2) + 0.568773(1) = 0.2362. So μ̂(x) = exp(ESP) =
exp(0.2362) = 1.2665.
b) i) Ho: β_2 = 0   Ha: β_2 ≠ 0
ii) t_{o,2} = −1.633.
iii) pval = 0.1024
iv) Fail to reject Ho; exper is not needed in the PR model for the number of
locations given that bombload and type are in the model.
c) β̂_3 ± 1.96 SE(β̂_3) = 0.568773 ± 1.96(0.504297) = 0.568773 ± 0.9884 =
[−0.4196, 1.5572].
13.6 Variable Selection
This section gives some rules of thumb for variable selection for logistic and
Poisson regression when SP = α + β^T x. Before performing variable selection,
a useful full model needs to be found. The process of finding a useful full
model is an iterative process. Given a predictor x, sometimes x is not used
by itself in the full model. Suppose that Y is binary. Then to decide what
functions of x should be in the model, look at the conditional distribution of
x | Y = i for i = 0, 1. The rules shown in Table 13.1 are used if x is an indicator
variable or if x is a continuous variable. Replace normality by "symmetric
with similar spreads" and "symmetric with different spreads" in the second
and third lines of the table. See Cook and Weisberg (1999a, p. 501) and
Kay and Little (1987).
The full model will often contain factors and interactions. If w is a nominal
variable with J levels, make w into a factor by using J − 1 (indicator or)
dummy variables x_{1,w}, ..., x_{J−1,w} in the full model. For example, let x_{i,w} = 1
if w is at its ith level, and let x_{i,w} = 0, otherwise. An interaction is a product
of two or more predictor variables. Interactions are difficult to interpret.
Often interactions are included in the full model, and then the reduced model
without any interactions is tested. The investigator is often hoping that the
interactions are not needed.
Suppose that all values of the variable x are positive. The log rule
says add log(x) to the full model if max(x_i)/min(x_i) > 10. For the binary
logistic regression model, it is often useful to mark the plotted points by a 0
if Y = 0 and by a + if Y = 1.
To make a full model, use the above discussion and then make a response
plot to check that the full model is good. The number of predictors in the
full model should be much smaller than the number of data cases n. Suppose
that the Y_i are binary for i = 1, ..., n. Let N_1 = Σ Y_i = the number of 1s and
N_0 = n − N_1 = the number of 0s. A rough rule of thumb is that the full model
should use no more than min(N_0, N_1)/5 predictors and the final submodel
should have r predictor variables where r is small with r ≤ min(N_0, N_1)/10.
For Poisson regression, a rough rule of thumb is that the full model should
use no more than n/5 predictors and the final submodel should use no more
than n/10 predictors.
Variable selection, also called subset or model selection, is the search for
a subset of predictor variables that can be deleted without important loss of
information. A model for variable selection for a GLM can be described by

SP = α + β^T x = α + β_S^T x_S + β_E^T x_E = α + β_S^T x_S   (13.10)

where x_S denotes the important predictors and x_E the extraneous predictors
that can be deleted. A candidate submodel I has

SP = α + β_I^T x_I + β_O^T x_O,   (13.11)

where x_{I/S} denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β_O = 0 if the set of predictors S is
a subset of I. Let (α̂, β̂) and (α̂_I, β̂_I) be the estimates of (α, β) and (α, β_I)
obtained from fitting the full model and the submodel, respectively. Denote
the ESP from the full model by ESP = α̂ + β̂^T x_i and denote the ESP from
the submodel by ESP(I) = α̂_I + β̂_I^T x_Ii.
Backward elimination starts with the full model with k nontrivial variables,
and the predictor that optimizes some criterion is deleted. Then there
are k − 1 variables left, and the predictor that optimizes some criterion is
deleted. This process continues for models with k − 2, k − 3, ..., 2, and 1 predictors.
Forward selection starts with the model with 0 variables, and the predictor
that optimizes some criterion is added. Then there is 1 variable in the
model, and the predictor that optimizes some criterion is added. This process
continues for models with 2, 3, ..., k − 2, and k − 1 predictors. Both forward
selection and backward elimination result in a sequence, often different, of k
models {x_1}, {x_1, x_2}, ..., {x_1, x_2, ..., x_{k−1}}, {x_1, x_2, ..., x_k} = full model.
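For GLMs fit with glm in R, one way to carry out forward selection or backward elimination with the AIC criterion is the step function; the sketch below uses simulated data and illustrative variable names rather than any data set from the text.

set.seed(6)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100))
dat$y <- rbinom(100, 1, 1 / (1 + exp(-(dat$x1 - dat$x2))))
outf <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = dat)   # full model
back <- step(outf, direction = "backward")                          # backward elimination, AIC
out0 <- glm(y ~ 1, family = binomial, data = dat)
forw <- step(out0, scope = formula(outf), direction = "forward")    # forward selection, AIC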
All subsets variable selection can be performed with the following procedure.
Compute the ESP of the GLM and compute the OLS ESP found by
the OLS regression of Y on x. Check that |corr(ESP, OLS ESP)| ≥ 0.95. This
high correlation will exist for many data sets. Then perform multiple linear
regression and the corresponding all subsets OLS variable selection with the
C_p(I) criterion. If the sample size n is large and C_p(I) ≤ 2(r + 1) where the
subset I has r + 1 variables including a constant, then corr(OLS ESP, OLS
ESP(I)) will be high by the proof of Proposition 3.1c, and hence corr(ESP,
ESP(I)) will be high. In other words, if the OLS ESP and GLM ESP are
highly correlated, then performing multiple linear regression and the corresponding
MLR variable selection (e.g., forward selection, backward elimination,
or all subsets selection) based on the C_p(I) criterion may provide many
interesting submodels.
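A sketch of this OLS based shortcut is given below; it assumes the leaps package is used for the all subsets C_p search (any all subsets OLS program would do), and the simulated data and names are illustrative only.

library(leaps)
set.seed(7)
X <- matrix(rnorm(200 * 5), ncol = 5); colnames(X) <- paste0("x", 1:5)
y <- rbinom(200, 1, 1 / (1 + exp(-(X[, 1] + X[, 2]))))
fitglm <- glm(y ~ X, family = binomial)
fitols <- lm(y ~ X)
cor(fitglm$linear.predictors, fitols$fitted.values)   # check |corr(ESP, OLS ESP)| >= 0.95
allsub <- regsubsets(X, y)                            # all subsets OLS variable selection
summary(allsub)$cp                                    # C_p(I) for the best subset of each size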
Know how to find good models from output. The following rules of thumb
(roughly in order of decreasing importance) may be useful. It is often not
possible to have all 12 rules of thumb hold simultaneously. Let submodel I
have r_I + 1 predictors, including a constant. Do not use more predictors than
submodel I_I, which has no more predictors than the minimum AIC model.
It is possible that I_I = I_min = I_full. Assume the response plot for the full
model is good. Then the submodel I is good if
i) the response plot for the submodel looks like the response plot for the full
model.
ii) corr(ESP, ESP(I)) ≥ 0.95.
iii) The plotted points in the EE plot cluster tightly about the identity line.
iv) Want the pval ≥ 0.01 for the change in deviance test that uses I as the
reduced model.
v) For binary LR want r_I + 1 ≤ min(N_1, N_0)/10. For PR, want r_I + 1 ≤ n/10.
vi) Fit OLS to the full and reduced models. The plotted points in the plot of
the OLS residuals from the submodel versus the OLS residuals from the full
model should cluster tightly about the identity line.
vii) Want the deviance G²(I) ≥ G²(full) but close. (G²(I) ≥ G²(full) since
adding predictors to I does not increase the deviance.)
viii) Want AIC(I) ≤ AIC(I_min) + 7 where I_min is the minimum AIC model
found by the variable selection procedure.
ix) Want hardly any predictors with pvals > 0.05.
x) Want few predictors with pvals between 0.01 and 0.05.
xi) Want G²(I) ≤ n − r_I − 1 + 3√(n − r_I − 1).
xii) The OD plot should look good.
Heuristically, forward selection tries to add the variable that will decrease
the deviance the most. A decrease in deviance less than 4 (if the predictor
has 1 degree of freedom) may be troubling in that a bad predictor may have
been added. In practice, the forward selection program may add the variable
such that the submodel I with j nontrivial predictors has a) the smallest
AIC(I), b) the smallest deviance G²(I), or c) the smallest pval (preferably
from a change in deviance test but possibly from a Wald test) in the test
Ho: β_i = 0 versus Ha: β_i ≠ 0 where the current model with j terms plus the
predictor x_i is treated as the full model (for all variables x_i not yet in the
model).
Suppose that the full model is good and is stored in M1. Let M2, M3,
M4, and M5 be candidate submodels found after forward selection, backward
elimination, etc. Make a scatterplot matrix of the ESPs for M2, M3, M4,
M5, and M1. Good candidates should have estimated sufficient predictors
that are highly correlated with the full model estimated sufficient predictor
(the correlation should be at least 0.9 and preferably greater than 0.95). For
binary logistic regression, mark the symbols (0 and +) using the response
variable Y.
The final submodel should have few predictors, few variables with large
Wald pvals (0.01 to 0.05 is borderline), a good response plot, and an EE plot
that clusters tightly about the identity line. If a factor has I − 1 dummy
variables, either keep all I − 1 dummy variables or delete all I − 1 dummy
variables; do not delete just some of the dummy variables.
Example 13.11. The following output is for forward selection and backward
elimination. All models use a constant. For forward selection, the min
AIC model uses {F}LOC, TYP, AGE, CAN, SYS, PCO, and PH. Model I_I
uses {F}LOC, TYP, AGE, CAN, and SYS. Let model I use {F}LOC, TYP,
AGE, and CAN. This model may be good, so for forward selection, models
I_I and I are the first models to examine.
Example 13.13. The ICU data is available from the text's website and
from STATLIB (http://lib.stat.cmu.edu/DASL/Datafiles/ICU.html).
Also see Hosmer and Lemeshow (2000, pp. 23–25). The survival of 200 patients
following admission to an intensive care unit was studied with logistic
regression. The response variable was STA (0 = Lived, 1 = Died). Predictors
were AGE, SEX (0 = Male, 1 = Female), RACE (1 = White, 2 = Black, 3 =
Other), SER = Service at ICU admission (0 = Medical, 1 = Surgical), CAN =
Is cancer part of the present problem? (0 = No, 1 = Yes), CRN = History
of chronic renal failure (0 = No, 1 = Yes), INF = Infection probable at ICU
admission (0 = No, 1 = Yes), CPR = CPR prior to ICU admission (0 = No, 1
= Yes), SYS = Systolic blood pressure at ICU admission (in mm Hg), HRA =
Heart rate at ICU admission (beats/min), PRE = Previous admission to an
Fig. 13.7 Visualizing the ICU Data (Response Plot: Y versus ESP)
Fig. 13.8 EE Plot Suggests Race is an Important Predictor (horizontal axis ESPS)
Fig. 13.9 EE Plot Suggests Race is an Important Predictor (horizontal axis ESPS)
Factors LOC and RACE had two indicator variables to model the three
levels. The response plot in Figure 13.7 shows that the logistic regression
model using the 19 predictors is useful for predicting survival, although the
output has ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases. Note that the
step function of slice proportions tracks the model logistic curve fairly well.
Variable selection, using forward selection and backward elimination with
the AIC criterion, suggested the submodel using AGE, CAN, SYS, TYP, and
LOC. The EE plot of ESP(sub) versus ESP(full) is shown in Figure 13.8.
The plotted points in the EE plot should cluster tightly about the identity
line if the full model and the submodel are good. Since this clustering did
not occur, the submodel seems to be poor. The lowest cluster of points and
the case on the right nearest to the identity line correspond to black patients.
The main cluster and upper right cluster correspond to patients who are not
black.
Figure 13.9 shows the EE plot when RACE is added to the submodel.
Then all of the points cluster about the identity line. Although numerical
variable selection did not suggest that RACE is important, perhaps since the
output had ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases, the two EE plots
suggest that RACE is important. Also the RACE variable could be replaced
by an indicator for black. This example illustrates how the plots can be
used to quickly improve and check the models obtained by following logistic
regression with variable selection even if the LR MLE does not exist.
                                         P1       P2       P3       P4
df                                      144      147      148      149
# of predictors                           6        3        2        1
# with 0.01 <= Wald p-value <= 0.05       1        0        0        0
# with Wald p-value > 0.05                3        0        1        0
G^2                                 127.506  131.644  147.151  149.861
AIC                                 141.506  139.604  153.151  153.861
corr(P1:ETAU, Pi:ETAU)                  1.0    0.954    0.810    0.792
p-value for change in deviance test     1.0    0.247   0.0006      0.0
Example 13.14. The above table gives summary statistics for 4 models
considered as final submodels after performing variable selection. Poisson
regression was used. The response plot for the full model P1 was good. Model
P2 was the minimum AIC model found.
Which model is the best candidate for the final submodel? Explain briefly
why each of the other 3 submodels should not be used.
Solution: P2 is best. P1 has too many predictors with large pvalues and
more predictors than the minimum AIC model. P3 and P4 have corr and
pvalue too low and AIC too high.
Variable selection for GLMs is very similar to that for multiple linear
regression. Finding a model I_I from variable selection, and using GLM output
for model I_I, does not give valid tests and confidence intervals. If there is a
good full model that was found before examining the response, and if I_I is
the minimum AIC model, then the Olive (2016a,b,c) bootstrap tests may
be useful. These tests are similar to those for the minimum C_p model for
multiple linear regression described in Section 3.4.1.
There are many alternatives to the binomial and Poisson regression GLMs.
Alternatives to the binomial GLM of Definition 13.6 include the discriminant
function model of Definition 13.7, the quasi-binomial model, the binomial
generalized additive model (GAM), and the beta-binomial model of Definition
13.2.
13.7 Generalized Additive Models
Note that a GLM is a special case of the GAM using S_j(x_j) = β_j x_j for
j = 1, ..., p. A GLM with SP = α + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 is a special case of a
GAM with x_3 ≡ x_1 x_2. A GLM with SP = α + β_1 x_1 + β_2 x_1² + β_3 x_2 is a special
case of a GAM with S_1(x_1) = β_1 x_1 + β_2 x_1² and S_2(x_2) = β_3 x_2. A GLM with
p terms may be equivalent to a GAM with k terms w_1, ..., w_k where k < p.
The plotted points in the EE plot defined below should scatter tightly
about the identity line if the GLM is appropriate and if the sample size is
large enough so that the ESP is a good estimator of the SP and the EAP is a
good estimator of the AP. If the clustering is not tight but the GAM gives a
reasonable approximation to the data, as judged by the EAP response plot,
then examine the Ŝ_j of the GAM to see if some simple terms such as x_i² can
be added to the GLM so that the modified GLM has a good ESP response
plot. (This technique is easiest if the GLM and GAM have the same p terms
x_1, ..., x_p. The technique is more difficult, for example, if the GLM has terms
x_1, x_1², and x_2 while the GAM has terms x_1 and x_2.)
For the quasi-binomial model, the conditional mean and variance functions
are similar to those of the binomial distribution, but it is not assumed that
Y |SP has a binomial distribution. Similarly, it is not assumed that Y |SP
has a Poisson distribution for the quasi-Poisson model.
Next, some notation is needed to derive the zero truncated Poisson regression
model. Y has a zero truncated Poisson distribution, Y ∼ ZTP(μ),
if the probability mass function (pmf) of Y is

f(y) = e^{−μ} μ^y / [(1 − e^{−μ}) y!]

for y = 1, 2, 3, ... where μ > 0. The ZTP pmf is obtained from a Poisson
distribution where y = 0 values are truncated, so not allowed. If W ∼ Poisson(μ)
with pmf f_W(y), then P(W = 0) = e^{−μ}, so Σ_{y=1}^∞ f_W(y) = 1 − e^{−μ}.
So the ZTP pmf f(y) = f_W(y)/(1 − e^{−μ}) for y ≠ 0.
Now E(Y) = Σ_{y=1}^∞ y f(y) = Σ_{y=0}^∞ y f(y) = Σ_{y=0}^∞ y f_W(y)/(1 − e^{−μ}) =
E(W)/(1 − e^{−μ}) = μ/(1 − e^{−μ}).
Similarly, E(Y²) = Σ_{y=1}^∞ y² f(y) = Σ_{y=0}^∞ y² f(y) = Σ_{y=0}^∞ y² f_W(y)/(1 −
e^{−μ}) = E(W²)/(1 − e^{−μ}) = [μ² + μ]/(1 − e^{−μ}). So

V(Y) = E(Y²) − (E(Y))² = (μ² + μ)/(1 − e^{−μ}) − [μ/(1 − e^{−μ})]².

For the zero truncated Poisson regression model, μ = μ(x) = exp(SP), so

E(Y | x) = exp(SP) / [1 − exp(−exp(SP))]

and

V(Y | SP) = ([exp(SP)]² + exp(SP)) / [1 − exp(−exp(SP))] − [exp(SP)/(1 − exp(−exp(SP)))]².
It is well known that the residual plot of ESP or EAP versus the residuals
(on the vertical axis) is useful for checking the model, but there are several
other plots using the ESP that can be generalized to a GAM by replacing the
ESP by the EAP. The response plots are used to visualize the 1D regression
model or GAM in the background of the data. For 1D regression, a response
plot is the plot of the ESP versus the response Y with the estimated model
conditional mean function and a scatterplot smoother often added as visual
aids. Note that the response plot is used to visualize Y | SP while for the
additive error regression model, a residual plot of the ESP versus the residuals
is used to visualize e | SP. For a GAM, these two plots replace the ESP by
the EAP. Assume that the ESP or EAP takes on many values.
Suppose the zero mean constant variance errors e_1, ..., e_n are iid from a
unimodal distribution that is not highly skewed. For additive error regression,
see Definition 13.1 i), the estimated mean function is the identity line with
unit slope and zero intercept. If the sample size n is large, then the plotted
points should scatter about the identity line and the residual = 0 line in
an evenly populated band for the response and residual plots, with no other
pattern. To avoid overfitting, assume n ≥ 10d where d is the model degrees
of freedom. Hence d = p for multiple linear regression with OLS.
If Z_i = Y_i/m_i, then the conditional distribution Z_i | x_i of the binomial
GAM can be visualized with a response plot of the EAP versus Z_i with
the estimated mean function of the Z_i,

Ê(Z | AP) = exp(EAP) / (1 + exp(EAP)),

and a scatterplot smoother added to the plot as a visual aid. Instead of adding a
lowess curve to the plot, consider the following alternative. Divide the EAP
into J slices with approximately the same number of cases in each slice. Then
compute ρ̂_s = Σ_s Y_i / Σ_s m_i where the sum is over the cases in slice s. Then
plot the resulting step function. For binary data the step function is simply
the sample proportion in each slice. The response plot for the beta-binomial
GAM is similar.
The lowess curve and step function are simple nonparametric estimators
of the mean function ρ(AP) or ρ(SP). If the lowess curve or step function
tracks the logistic curve (the estimated conditional mean function) closely,
then the logistic conditional mean function is a reasonable approximation to
the data.
The Poisson GAM response plot is a plot of EAP versus Y with
Ê(Y | AP) = exp(EAP) and lowess added as visual aids. For both the
GAM and the GLM response plots, the lowess curve should be close to the
exponential curve, except possibly for the largest values of the ESP or EAP
in the upper right corner of the plot. Here, lowess often underestimates the
exponential curve because lowess downweights the largest Y values too much.
Similar plots can be made for a negative binomial regression or GAM.
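A sketch of a Poisson GAM response plot is given below using the mgcv package; this is an assumption about software (the text's GAM fits could equally be produced by other GAM programs), and the simulated data and names are illustrative. The EAP is obtained by evaluating the fit on the link scale.

library(mgcv)
set.seed(8)
x1 <- runif(200); x2 <- rnorm(200)
y  <- rpois(200, exp(1 + sin(2 * pi * x1) + 0.5 * x2))
gamfit <- gam(y ~ s(x1) + x2, family = poisson)
EAP <- predict(gamfit, type = "link")          # estimated additive predictor
plot(EAP, y, main = "GAM Response Plot")
lines(lowess(EAP, y), lty = 2)                 # scatterplot smoother
o <- order(EAP); lines(EAP[o], exp(EAP[o]))    # estimated mean function exp(EAP)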
Following the discussion above Definition 13.15, the weighted forward response
plot is a plot of √Z_i EAP versus √Z_i log(Z_i). The weighted residual
plot is a plot of √Z_i EAP versus the WLS residuals r_{Wi} = √Z_i log(Z_i) −
√Z_i EAP. These plots can also be used for the negative binomial GAM. If the
counts Y_i are large and Ê(Y | AP) = exp(EAP) is a good approximation to
the conditional mean function E(Y | AP) = exp(AP), then the plotted points
in the weighted forward response plot and weighted residual plot should scatter
about the identity line and the r = 0 line in roughly evenly populated bands.
See Examples 13.4, 13.5, and 13.6.
Variable selection is the search for a subset of variables that can be deleted
without important loss of information. Olive and Hawkins (2005) make an
EE plot of ESP (I) versus ESP where ESP (I) is for a submodel I and ESP
is for the full model. This plot can also be used to complement the hypothesis
test that the reduced model I (which is selected before gathering data) can
be used instead of the full model. The obvious extension to GAMs is to make
the EE plot of EAP (I) versus EAP . If the tted full model and submodel
I are good, then the plotted points should follow the identity line with high
correlation (use correlation ≥ 0.95 as a benchmark).
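In R the EE plot takes only a few lines; espI and esp are assumed to hold the estimated sufficient (or additive) predictors from the submodel and the full model, e.g. espI <- predict(fitI) and esp <- predict(fit) for glm or gam fits.

plot(espI, esp, xlab = "ESP(I)", ylab = "ESP")
abline(0, 1)          # identity line
cor(espI, esp)        # want roughly 0.95 or higher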
To justify this claim, assume that there exists a subset S of predictor
variables such that if xS is in the model, then none of the other predictors
is needed in the model. Write E for these (extraneous) variables not in S,
partitioning x = (xS^T, xE^T)^T. Then

AP = α + Σ_{j=1}^p Sj(xj) = α + Σ_{j∈S} Sj(xj) + Σ_{k∈E} Sk(xk) = α + Σ_{j∈S} Sj(xj).   (13.13)
The extraneous terms that can be eliminated given that the subset S is in
the model have Sk(xk) = 0 for k ∈ E.
Now suppose that I is a candidate subset of predictors and that S ⊆ I.
Then

AP = α + Σ_{j=1}^p Sj(xj) = α + Σ_{j∈S} Sj(xj) = α + Σ_{k∈I} Sk(xk) = AP(I),

(if I includes predictors from E, these will have Sk(xk) = 0).
I that includes all relevant predictors, the correlation corr(AP, AP(I)) = 1.
Hence if the full model and submodel are reasonable and if EAP and EAP(I)
are good estimators of AP and AP(I), then the plotted points in the EE plot
of EAP(I) versus EAP will follow the identity line with high correlation.
13.7.4 Examples
For the binary logistic GAM, the EAP will not be a consistent estimator
of the AP if the estimated probability ρ̂(AP) = ρ(EAP) is exactly zero or
one. The following example will show that GAM output and plots can still
be used for exploratory data analysis. The example also illustrates that EE
plots are useful for detecting cases with high leverage and clusters of cases.
Numerical diagnostics, such as analogs of Cook's distances (Cook 1977), tend
to fail if there is a cluster of two or more influential cases.
Example 13.15. For the ICU data of Example 13.13, a binary generalized
additive model was fit with unspecified functions for AGE, SYS, and
HRA, and linear functions for the remaining 16 variables. Output suggested
that functions for SYS and HRA are linear but the function for AGE may
be slightly curved.

Fig. 13.10 Visualizing the ICU GAM (response plot of EAP versus Y)

Fig. 13.11 GAM and GLM give Similar Success Probabilities (plot of EAP versus ESP)

Several cases had ρ̂(AP) equal to zero or one, but the
response plot in Figure 13.10 suggests that the full model is useful for pre-
dicting survival. Note that the ten slice step function closely tracks the logistic
curve. To visualize the model with the response plot, use Y|x ≈ binomial[1,
ρ(EAP) = e^EAP/(1 + e^EAP)]. When x is such that EAP < −5, ρ(EAP) ≈ 0.
If EAP > 5, ρ(EAP) ≈ 1, and if EAP = 0, then ρ(EAP) = 0.5. The logistic
curve gives ρ(EAP) ≈ P(Y = 1|x) = ρ(AP). The different estimated binomial
distributions have ρ̂(AP) = ρ(EAP) that increases according to the
logistic curve as EAP increases. If the step function tracks the logistic curve
closely, the binary GAM gives useful smoothed estimates of ρ(AP) provided
that the number of 0s and 1s are both much larger than the model degrees
of freedom so that the GAM is not overfitting.
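A binary GAM of this type can be fit with the mgcv package. The sketch below is hedged: the data frame icu and the variable names are assumptions, and only a few of the 16 linear terms are shown.

library(mgcv)
outgam <- gam(STA ~ s(AGE) + s(SYS) + s(HRA) + CPR + TYP,  # remaining linear terms omitted
              family = binomial, data = icu)
EAP <- predict(outgam)                      # estimated additive predictor (link scale)
plot(EAP, icu$STA, xlab = "EAP", ylab = "Y")
curve(exp(x)/(1 + exp(x)), add = TRUE)      # estimated success probabilities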
A binary logistic regression was also fit, and Figure 13.11 shows the plot of
EAP versus ESP. The plot shows that the near zero and near one probabilities
are handled differently by the GAM and GLM, but the estimated success
probabilities for the two models are similar: ρ(ESP) ≈ ρ(EAP). Hence we
used the GLM and performed variable selection as in Example 13.13.
Example 13.16. For binary data, Kay and Little (1987) suggest examining
the two distributions x|Y = 0 and x|Y = 1. Use predictor x if the two
distributions are roughly symmetric with similar spread. Use x and x² if the
distributions are roughly symmetric with different spread. Use x and log(x)
if one or both of the distributions are skewed. The log rule says add log(x)
to the model if min(x) > 0 and max(x)/min(x) > 10. The Gladstone (1905)
data is useful for illustrating these suggestions. The response was gender with
Y = 1 for male and Y = 0 for female. The predictors were age, height, and
the head measurements circumference, length, and size. When the GAM was
fit without log(age) or log(size), the Ŝj for age, height, and circumference
were nonlinear. The log rule suggested adding log(age), and log(size) was
added because size is skewed. The GAM for this model had plots of Ŝj(xj)
that were fairly linear. The response plot is not shown but was similar to
Figure 13.10, and the step function tracked the logistic curve closely. When
EAP = 0, the estimated probability of Y = 1 (male) is 0.5. When EAP > 5
the estimated probability is near 1, but near 0 for EAP < −5. The response
plot for the binomial GLM, not shown, is similar. See Problem 13.14 for
another analysis of this data set.
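The log rule is easy to check in R. A minimal sketch, where cbrain is an assumed data frame containing Gladstone-type variables such as age and size:

logrule <- function(x) min(x) > 0 && max(x)/min(x) > 10   # TRUE suggests adding log(x)
logrule(cbrain$age)
logrule(cbrain$size)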
Example 13.17. Wood (2006, pp. 82–86) describes heart attack data
where the response Y is the number of heart attacks for mi patients suspected
of suffering a heart attack. The enzyme ck (creatine kinase) was measured for
the patients and it was determined whether the patient had a heart attack
or not. A binomial GLM with predictors x1 = ck, x2 = [ck]², and x3 = [ck]³
was fit and had AIC = 33.66. The binomial GAM with predictor x1 was fit
in R, and Figure 13.12 shows that the EE plot for the GLM was not too
good. The log rule suggests using ck and log(ck), but ck was not significant.
Fig. 13.12 EE plot for cubic GLM for Heart Attack Data (EAP versus ESPp)

Fig. 13.13 EE plot with log(ck) in the GLM (EAP versus ESPl)

Fig. 13.14 Response Plot for Heart Attack Data (ESPl versus Z)
Hence a GLM with the single predictor log(ck) was fit. Figure 13.13 shows
the EE plot, and Figure 13.14 shows the response plot where the Zi = Yi/mi
track the logistic curve closely. There was no evidence of overdispersion and
the model had AIC = 33.45. The GAM using log(ck) had a linear Ŝ, and
the correlation of the plotted points in the EE plot, not shown, was one. See
Problem 13.22.
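A hedged sketch of the two binomial GLMs compared in this example is given below; the data frame heart with columns ck, y (heart attacks), and m (patients) is an assumption. The final plot compares the two GLM ESPs in the same way the EE plots in Figures 13.12 and 13.13 compare an EAP with an ESP.

fit3 <- glm(cbind(y, m - y) ~ ck + I(ck^2) + I(ck^3), family = binomial, data = heart)
fitl <- glm(cbind(y, m - y) ~ log(ck), family = binomial, data = heart)
c(AIC(fit3), AIC(fitl))                       # compare the two fits
plot(predict(fitl), predict(fit3),
     xlab = "ESP, log(ck) model", ylab = "ESP, cubic model")
abline(0, 1)                                  # identity line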
13.8 Overdispersion
The OD plot has been used by Winkelmann (2000, p. 110) for the Poisson
regression model where V̂M(Y|SP) = ÊM(Y|SP) = exp(ESP). For binomial
and Poisson regression, the OD plot can be used to complement tests and
diagnostics for overdispersion such as those given in Cameron and Trivedi
(2013), Collett (1999, ch. 6), and Winkelmann (2000). See discussion below
Definitions 13.10 and 13.13 for how to interpret the OD plot with the identity
line, OLS line, and slope 4 line added as visual aids, and for discussion of the
numerical summaries G² and X² for GLMs.
Definition 13.1, with SP = AP, gives EM(Y|AP) = m(AP) and VM(Y|AP)
= v(AP) for several models. Often m̂(AP) = m(EAP) and v̂(AP) =
v(EAP), but additional parameters sometimes need to be estimated.
Hence v̂(AP) = mi ρ(EAPi)(1 − ρ(EAPi))[1 + θ̂(mi − 1)/(1 + θ̂)],
v̂(AP) = exp(EAP) + τ̂ exp(2 EAP), and v̂(AP) = [m(EAP)]²/ν̂ for
the beta-binomial, negative binomial, and gamma GAMs, respectively, where
θ̂, τ̂, and ν̂ are estimated dispersion parameters. The
beta-binomial regression model is often used if the binomial regression is
inadequate because of overdispersion, and the negative binomial GAM is
often used if the Poisson GAM is inadequate.
Since the Poisson regression (PR) model is simpler than the negative binomial
regression (NBR) model, and the binomial logistic regression (LR)
model is simpler than the beta-binomial regression (BBR) model, the graphical
diagnostics for the goodness of fit of the PR and LR models are very useful.
Combining the response plot with the OD plot is a powerful method for
assessing the adequacy of the Poisson and logistic regression models. NBR
and BBR models should also be checked with response and OD plots. See
Examples 13.2–13.6.
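A hedged sketch of an OD plot for Poisson regression is given below: the modeled variance (which for Poisson regression equals the estimated mean) is plotted against the squared residuals, with the identity, OLS, and slope 4 lines added. The data frame dat and predictors x1, x2 are illustrative.

fit <- glm(y ~ x1 + x2, family = poisson, data = dat)
Ehat <- predict(fit, type = "response")   # estimated mean; also the model variance
Vhat <- (dat$y - Ehat)^2                  # squared residuals estimate V(Y|SP)
plot(Ehat, Vhat, xlab = "modeled variance", ylab = "Vhat")
abline(0, 1)                 # identity line
abline(lm(Vhat ~ Ehat))      # OLS line
abline(0, 4, lty = 2)        # slope 4 line through the origin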
Example 13.18. The species data is from Cook and Weisberg (1999a,
pp. 285–286) and Johnson and Raven (1973). The response variable is the
total number of species recorded on each of 29 islands in the Galapagos.

Fig. 13.15 Response Plot for Negative Binomial GAM (EAP versus Y)
The response plot with the exponential and lowess curves added as visual
aids is shown in Figure 13.15. The interpretation is that Y|x ≈ negative
binomial with E(Y|x) ≈ exp(EAP). Hence if EAP = 0, E(Y|x) ≈ 1. The
negative binomial and Poisson GAM have the same conditional mean function.
If the plot was for a Poisson GAM, the interpretation would be that
Y|x ≈ Poisson(exp(EAP)). Hence if EAP = 0, Y|x ≈ Poisson(1).
Figure 13.16 shows the OD plot for the negative binomial GAM with
the identity line and slope 4 line through the origin added as visual aids.
The plotted points fall within the slope 4 wedge, suggesting that the neg-
ative binomial regression model has successfully dealt with overdispersion.
Here Ê(Y|AP) = exp(EAP) and V̂(Y|AP) = exp(EAP) + τ̂ exp(2 EAP)
where τ̂ = 1/37.
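A negative binomial GAM like this one can be fit with the nb() family of the mgcv package. The sketch below is hedged: the data frame gala and the predictor names are assumptions.

library(mgcv)
outnb <- gam(Species ~ s(Area) + s(Elevation), family = nb(), data = gala)
EAP <- predict(outnb)
plot(EAP, gala$Species, xlab = "EAP", ylab = "Y")
curve(exp(x), add = TRUE)                     # exponential mean function
lines(lowess(EAP, gala$Species), lty = 2)     # lowess as a check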
13.9 Complements
GLMs were introduced by Nelder and Wedderburn (1972). Also see McCullagh
and Nelder (1989), Myers et al. (2002), Olive (2010), Andersen and
Skovgaard (2010), Agresti (2013, 2015), and Cook and Weisberg (1999a, ch.
21–23). Collett (2003) and Hosmer and Lemeshow (2000) are excellent texts
on logistic regression while Cameron and Trivedi (2013) and Winkelmann
(2008) cover Poisson regression. Alternatives to Poisson regression mentioned
in Section 13.7 are covered by Zuur et al. (2009), Simonoff (2003), and Hilbe
(2011). See Hillis and Davis (1994) for a widely used algorithm to compute
the GLM. Cook and Zhang (2015) show that envelope methods have the
potential to significantly improve GLMs.
Olive (2010, ch. 16) discusses plots useful for visualizing the Cox proportional
hazards regression model, Weibull proportional hazards regression
model, and accelerated failure time models. It is also shown that inference
and variable selection for these models are very similar to those for GLMs. Again
the Olive (2016a,b,c) bootstrap tests may be useful after variable selection
with AIC.
Plots were made in R and Splus; see R Core Team (2016). The Wood
(2006) library mgcv was used for fitting a GAM, and the Venables and Ripley
(2010) library MASS was used for the negative binomial family. The Lesnoff
and Lancelot (2010) R package aod has the function betabin for beta-binomial
regression and is also useful for fitting negative binomial regression. SAS
has proc genmod, proc gam, and proc countreg, which are useful for fitting
GLMs such as Poisson regression, GAMs such as the Poisson GAM,
and overdispersed count regression models. The lregpack R functions include
lrplot, which makes response and OD plots for binomial regression; lrplot2,
which makes the response plot for binary regression; prplot, which makes
the response, weighted forward response, weighted residual, and OD plots
for Poisson regression; and prsim, which makes the last 4 plots for simulated
Poisson or negative binomial regression models.
13.10 Problems
13.2 . Now the data is as in Problem 13.1, but try to estimate the pro-
portion of males by measuring the circumference and the length of the head.
Use the above logistic regression output to answer the following problems.
13.3 . A museum has 60 skulls of apes and humans. Lengths of the lower
jaw, upper jaw, and face are the explanatory variables. The response variable
is ape (= 1 if ape, 0 if human). Using the output above, perform the four step
deviance test for whether there is an LR relationship between the response
variable and the predictors.
Number of cases: 60
Degrees of freedom: 56
Pearson X2: 16.782
Deviance: 13.532
Reduced Model
Response = ape
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant 8.71977 4.09466 2.130 0.0332
lower jaw -0.376256 0.115757 -3.250 0.0012
upper jaw 0.295507 0.0950855 3.108 0.0019
Number of cases: 60
Degrees of freedom: 57
Pearson X2: 28.049
Deviance: 17.185
13.4 . Suppose the full model is as in Problem 13.3, but the reduced
model omits the predictor face length. Perform the 4 step change in deviance
test to examine whether the reduced model can be used.
The following three problems use the possums data from Cook and Weis-
berg (1999a).
13.5 . Use the above output to perform inference on the number of pos-
sums in a given tract of land. The output is from a Poisson regression.
a) Predict μ̂(x) if habitat = x1 = 5.8 and stags = x2 = 8.2.
13.6 . Perform the 4 step deviance test for the same model as in Prob-
lem 13.5 using the output above.
Output for Problem 13.7
Terms = (Acacia Bark Habitat Shrubs Stags Stumps)
Label Estimate Std. Error Est/SE p-value
Constant -1.04276 0.247944 -4.206 0.0000
Acacia 0.0165563 0.0102718 1.612 0.1070
Bark 0.0361153 0.0140043 2.579 0.0099
Habitat 0.0761735 0.0374931 2.032 0.0422
Shrubs 0.0145090 0.0205302 0.707 0.4797
Stags 0.0325441 0.0102957 3.161 0.0016
Stumps -0.390753 0.286565 -1.364 0.1727
Number of cases: 151
Degrees of freedom: 144
Deviance: 127.506
13.7 . Let the reduced model be as in Problem 13.5 and let the output
for the full model be shown above. Perform a 4 step change in deviance test.
                                      B1       B2       B3       B4
df                                    945      956      968      974
# of predictors                       54       43       31       25
# with 0.01 ≤ Wald p-value ≤ 0.05     5        3        2        1
# with Wald p-value > 0.05            8        4        1        0
G²                                    892.96   902.14   929.81   956.92
AIC                                   1002.96  990.14   993.81   1008.912
corr(B1:ETAU, Bi:ETAU)                1.0      0.99     0.95     0.90
p-value for change in deviance test   1.0      0.605    0.034    0.0002
13.8 . The above table gives summary statistics for 4 models considered
as final submodels after performing variable selection. (Several of the predictors
were factors, and a factor was considered to have a bad Wald p-value >
0.05 if all of the dummy variables corresponding to the factor had p-values
> 0.05. Similarly the factor was considered to have a borderline p-value with
0.01 ≤ p-value ≤ 0.05 if none of the dummy variables corresponding to the
factor had a p-value < 0.01 but at least one dummy variable had a p-value
between 0.01 and 0.05.) The response was binary and logistic regression was
used. The response plot for the full model B1 was good. Model B2 was the
minimum AIC model found. There were 1000 cases: for the response, 300
were 0s and 700 were 1s.
a) For the change in deviance test, if the p-value ≥ 0.07, there is little
evidence that Ho should be rejected. If 0.01 < p-value < 0.07, then there is
moderate evidence that Ho should be rejected. If p-value ≤ 0.01, then there
is strong evidence that Ho should be rejected. For which models, if any, is
there strong evidence that Ho: the reduced model is good should be rejected?
c) Which model should be used as the final submodel? Explain briefly why
each of the other 3 submodels should not be used.
Arc Problems
The following four problems use data sets from Cook and Weisberg (1999a).
From Graph&Fit select Fit binomial response. Select Top as the predictor,
Status as the response, and ones as the number of trials.
e) From Graph&Fit select Fit binomial response. Select Top and Diagonal
as predictors, Status as the response, and ones as the number of trials. Include
the output in Word.
b) From Graph&Fit select Fit linear LS. Select Diagonal and Top for pre-
dictors, and Status for the response. From Graph&Fit select Plot of and select
L2:Fit-Values for H, B1:EtaU for V, and Status for Mark by. Include the plot
in Word. Is the plot linear? How are α̂_OLS + β̂_OLS^T x and α̂_logistic + β̂_logistic^T x
related (approximately)?
b) Response plot: From Graph&Fit select Plot of. Select P1:EtaU for the
H box and y for the V box. From the OLS popup menu select Poisson and
move the slider bar to 1. Move the lowess slider bar until the lowess curve
tracks the exponential curve well. Include the response plot in Word.
c) From Graph&Fit select Fit Poisson response. Select y as the response
and select bark, habitat, stags, and stumps as the predictors. Include the
output in Word.
d) Response plot: From Graph&Fit select Plot of. Select P2:EtaU for the
H box and y for the V box. From the OLS popup menu select Poisson and
move the slider bar to 1. Move the lowess slider bar until the lowess curve
tracks the exponential curve well. Include the response plot in Word.
e) Deviance test. From the P2 menu, select Examine submodels and click
on OK. Include the output in Word and perform the 4 step deviance test.
g) EE plot. From Graph&Fit select Plot of. Select P2:EtaU for the H box
and P1:EtaU for the V box. Move the OLS slider bar to 1. Click on the
Options popup menu and type y=x. Include the plot in Word. Is the plot
linear?
13.12 . In this problem you will find a good submodel for the possums
data.
From Graph&Fit select Fit Poisson response. Select y as the response and
select Acacia, bark, habitat, shrubs, stags, and stumps as the predictors.
In Problem 13.11, you showed that this was a good full model.
a) Using what you have learned in class find a good submodel and include
the relevant output in Word.
(Hints: Use forward selection and backward elimination and find the model
Imin with the smallest AIC. Let Δ(I) = AIC(I) − AIC(Imin). Then find the
model II with the fewest number of predictors such that Δ(II) ≤ 2. Then
submodel II is the initial submodel to examine. Fit model II and look at the
Wald test p-values. Try to eliminate predictors with large p-values but make
sure that the deviance does not increase too much. Also examine submodels
I with fewer predictors than II with Δ(I) ≤ 7. You may have several models,
say P2, P3, P4, and P5 to look at. Make a scatterplot matrix of the Pi:ETAU
from these models and from the full model P1. Make the EE and response
plots for each model. The correlation in the EE plot should be at least 0.9 and
preferably greater than 0.95. As a very rough guide for Poisson regression,
the number of predictors in the full model should be less than n/5 and the
number of predictors in the final submodel should be less than n/10. An R
sketch of this AIC search is given after this problem.)
b) Make a response plot for your final submodel, say P2. From Graph&Fit
select Plot of. Select P2:EtaU for the H box and y for the V box. From
the OLS popup menu select Poisson and move the slider bar to 1. Move the
lowess slider bar until the lowess curve tracks the exponential curve well.
Include the response plot in Word.
c) Suppose that P1 contains your full model and P2 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select P1:EtaU for the V box and P2:EtaU for the H box. After
the plot appears, click on the options popup menu. A window will appear.
Type y = x and click on OK. This action adds the identity line to the plot.
Also move the OLS slider bar to 1. Include the plot in Word.
d) Using a), b), c), and any additional output that you desire (e.g.,
AIC(full), AIC(Imin), AIC(II), and AIC(final submodel)), explain why your
final submodel is good.
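Although this problem uses Arc, the ΔAIC search in the hints can be sketched in R with stepAIC from the MASS package. Here full is assumed to be a fitted Poisson glm for the possums data, e.g. full <- glm(y ~ Acacia + bark + habitat + shrubs + stags + stumps, family = poisson, data = possums); the object and data frame names are illustrative.

library(MASS)
back <- stepAIC(full, direction = "backward", trace = 0)
forw <- stepAIC(update(full, . ~ 1), direction = "forward", trace = 0,
                scope = formula(full))
aicmin <- min(AIC(back), AIC(forw))          # AIC(Imin)
# for any candidate submodel fitI, check Delta(I) = AIC(fitI) - aicmin
AIC(back) - aicmin; AIC(forw) - aicmin       # want Delta(I) <= 2 (or <= 7)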
13.13 . (Response Plot): Activate cbrain.lsp in Arc with the menu com-
mands File > Load > Removable Disk (G:) > cbrain.lsp. Scroll up the
screen to read the data description. From Graph&Fit select Fit binomial
response. Select brnweight, cephalic, breadth, cause, size, and headht as pre-
dictors, sex as the response, and ones as the number of trials. Perform the
logistic regression and from Graph&Fit select Plot of. Place sex on V and
B1:EtaU on H. From the OLS popup menu, select Logistic and move the
slider bar to 1. From the lowess popup menu select SliceSmooth and move
the slider bar until the fit is good. Include your plot in Word. Are the slice
means (observed proportions) tracking the logistic curve (fitted proportions)
very well? Use lowess if SliceSmooth does not work.
13.14 . Suppose that you are given a data set, told the response, and
asked to build a logistic regression model with no further help. In this prob-
lem, we use the cbrain data to illustrate the process.
b) Use the menu commands cbrain>Make factors and select cause. This
makes cause into a factor with 2 degrees of freedom. Use the menu commands
cbrain>Transform and select age and the log transformation.
Why was the log transformation chosen?
c) From Graph&Fit select Plot of and select size in H. Also place sex in
the Mark by box. A plot will come up. From the GaussKerDen menu (the
triangle to the left) select Fit by marks, move the sliderbar to 0.9, and include
the plot in Word.
d) Use the menu commands cbrain>Transform and select size and the
log transformation. From Graph&Fit select Fit binomial response. Select age,
log(age), breadth, {F}cause, cephalic, circum, headht, height, length, size, and
log(size) as predictors, sex as the response, and ones as the number of trials.
This is the full model B1. Perform the logistic regression and include the
relevant output for testing in Word.
f) From B1 select Examine submodels and select Add to base model (For-
ward Selection). Include the output with the header Base terms: ... and
from Add: length 259 to Add: {F}cause 258 in Word.
g) From B1 select Examine submodels and select Delete from full model
(Backward Elimination). Include the output with df corresponding to the
minimum AIC model in Word. What predictors does this model use?
h) As a final submodel B2, use the model from f): from Graph&Fit select
Fit binomial response. Select age, log(age), circum, height, length, size, and
log(size) as predictors, sex as the response, and ones as the number of trials.
Perform the logistic regression and include the relevant output for testing in
Word.
k) Perform the 4 step change in deviance test using the full model in d)
and the reduced submodel in h).
l) From B2 select Examine submodels, click OK, and include the output in
Word. Then use the output to perform a 4 step deviance test on the submodel.
13.15 . In this problem you will find a good submodel for the ICU data
obtained from STATLIB or the text's website. This data set will violate some
of the rules of thumb: the model II does not have enough predictors to make
a good EE plot. See Example 13.13.
b) Use the menu commands ICU>Make factors and select loc and race.
c) From Graph&Fit select Fit binomial response. Select STA as the re-
sponse and ones as the number of trials. The full model will use every predic-
tor except ID, LOC, and RACE (the latter 2 are replaced by their factors):
select AGE, Bic, CAN, CPR, CRE, CRN, FRA, HRA, INF, {F}LOC , PCO,
PH, PO2 , PRE , {F}RACE , SER, SEX, SYS, and TYP as predictors.
Perform the logistic regression and include the relevant output for testing in
Word.
d) Make the response plot for the full model: from Graph&Fit select Plot
of. Place STA on V and B1:EtaU on H. From the OLS popup menu, select
Logistic and move the slider bar to 1. From the lowess popup menu select
SliceSmooth and move the slider bar until the fit is good. Use lowess if
SliceSmooth does not work. Include your plot in Word. Is the full model good?
e) Using what you have learned in class, find a good submodel and include
the relevant output in Word.
[Hints: Use forward selection and backward elimination and find the model
with the minimum AIC. Let Δ(I) = AIC(I) − AIC(Imin). Then find the
model II with the fewest number of predictors such that Δ(II) ≤ 2. Then
submodel II is the initial submodel to examine. Fit model II and look at the
Wald test p-values. Try to eliminate predictors with large p-values but make
sure that the deviance does not increase too much. Also examine submodels
I with fewer predictors than II with Δ(I) ≤ 7. WARNING: do not delete
part of a factor. Either keep all I − 1 = 2 factor dummy variables or delete all
I − 1 = 2 factor dummy variables. You may have several models, say B2, B3, B4, and
B5 to look at. Make the EE and response plots for each model. WARNING:
if a useful factor is in the full model but not the reduced model, then the EE
plot may have I = 3 lines if the factor should be in the model. See part h)
below.]
g) Suppose that B1 contains your full model and B5 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select B1:EtaU for the V box and B5:EtaU for the H box. After
the plot appears, click on the options popup menu. A window will appear.
Type y = x and click on OK. This action adds the identity line to the plot.
Include the plot in Word.
If the EE plot is good, then the plotted points will cluster about the
identity line. For model II , some points are far away from the identity line.
At least one variable needs to be added to model II to get a good submodel
and EE plot, violating the rule of thumb that submodels with more predictors
than II should not be examined. Variable selection may be suggesting poor
submodels because of clusters of cases that are given exact probabilities of 0
or 1. Try adding {F}RACE to the predictors in II .
h) Using e), f), g), and any additional output that you desire [e.g.
AIC(full), AIC(Imin), AIC(II), and AIC(final submodel)], explain why your
final submodel is good.
13.16. In this problem you will examine the museum skull data.
a) From Graph&Fit select Fit binomial response. Select ape as the response
and ones as the number of trials. Select x5 as the predictor. Perform the
logistic regression and include the relevant output for testing in Word.
b) Make the response plot and place it in Word (the response variable is
ape not y). Is the LR model good?
Now you will examine logistic regression when there is perfect classification
of the sample response variables. Assume that the model used in c)–g) is in
menu B2.
c) From Graph&Fit select Fit binomial response. Select ape as the response
and ones as the number of trials. Select x3 as the predictor. Perform the
logistic regression and include the relevant output for testing in Word.
d) Make the response plot and place it in Word (the response variable is
ape not y). Is the LR model good?
13.17. In this problem you will find a good submodel for the credit data
from Fahrmeir and Tutz (2001).
b) Make the response plot for the full model: from Graph&Fit select Plot
of. Place y on V and B1:EtaU on H. From the OLS popup menu, select
Logistic and move the slider bar to 1. From the lowess popup menu select
SliceSmooth and move the slider bar until the fit is good. Include your plot
in Word. Is the full model good? Use lowess if SliceSmooth does not work.
c) Using what you have learned in class, find a good submodel and include
the relevant output in Word. (See hints below Problem 13.15e.)
e) Suppose that B1 contains your full model and B5 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select B1:EtaU for the V box and B5:EtaU for the H box. Place y
in the Mark by box. After the plot appears, click on the options popup menu.
A window will appear. Type y = x and click on OK. This action adds the
identity line to the plot. Also move the OLS slider bar to 1. Include the plot
in Word.
f) Using c), d), e), and any additional output that you desire (e.g.,
AIC(full), AIC(min), and AIC(final submodel)), explain why your final sub-
model is good.
13.18 . a) This problem uses a data set from Myers et al. (2002). Activate
popcorn.lsp in Arc with the menu commands File > Load > Removable
Disk (G:) > popcorn.lsp. Scroll up the screen to read the data description.
From Graph&Fit select Fit Poisson response. Use oil, temp, and time as
the predictors and y as the response. From Graph&Fit select Plot of. Select
P1:EtaU for the H box and y for the V box. From the OLS popup menu
select Poisson and move the slider bar to 1. Move the lowess slider bar until
the lowess curve tracks the exponential curve. Include the response plot in
Word.
c) Test whether β1 = β2 = β3 = 0.
d) From the popcorn menu, select Transform and select y. Put 1/2 in the p
box and click on OK. From the popcorn menu, select Add a variate and type
yt = sqrt(y)*log(y) in the resulting window. Repeat three times adding the
variates oilt = sqrt(y)*oil, tempt = sqrt(y)*temp, and timet = sqrt(y)*time.
From Graph&Fit select Fit linear LS and choose y^(1/2), oilt, tempt, and timet
as the predictors, yt as the response and click on the Fit intercept box to
remove the check. Then click on OK. From Graph&Fit select Plot of. Select
L2:Fit-Values for the H box and yt for the V box. A plot should appear.
Click on the Options menu and type y = x to add the identity line. Include
the weighted fit response plot in Word.
e) From Graph&Fit select Plot of. Select L2:Fit-Values for the H box and
L2:Residuals for the V box. Include the weighted residual plot in
Word.
f) For the plot in e), highlight the case in the upper right corner of the plot
by using the mouse to move the arrow just above and to the left of the case.
Then hold the rightmost mouse button down and move the mouse to the
right and down. From the Case deletions menu select Delete selection from
data set, then from Graph&Fit select Fit Poisson response. Use oil, temp, and
time as the predictors, and y as the response. From Graph&Fit select Plot
of. Select P3:EtaU for the H box and y for the V box. From the OLS popup
menu select Poisson and move the slider bar to 1. Move the lowess slider bar
until the lowess curve tracks the exponential curve. Include the response plot
in Word.
h) Test whether β1 = β2 = β3 = 0.
i) From Graph&Fit select Fit linear LS. Make sure that y^(1/2), oilt, tempt,
and timet are the predictors, yt is the response, and that the Fit intercept
box does not have a check. Then click on OK. From Graph&Fit select Plot
of. Select L4:Fit-Values for the H box and yt for the V box. A plot should
appear. Click on the Options menu and type y = x to add the identity line.
Include the weighted fit response plot in Word.
j) From Graph&Fit select Plot of. Select L4:Fit-Values for the H box and
L4:Residuals for the V box. Include the weighted residual plot in Word.
l) From Graph&Fit select Plot of. Select P3:EtaU for the H box and
P3:Dev-Residuals for the V box. Include the deviance residual plot in Word.
m) Is the weighted residual plot from part j) a better lack of fit plot than
the deviance residual plot from part l)? Explain briefly.
R problems
Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. lrdata, will display the code for the function. Use the args command,
e.g. args(lrdata), to display the needed arguments for the function. For some
of the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
13.19. Obtain the function lrdata from lregpack.txt. Enter the com-
mands
out <- lrdata()
x <- out$x
y <- out$y
Obtain the function lressp from lregpack.txt. Enter the commands
lressp(x,y) and include the resulting plot in Word.
13.20. Obtain the function prdata from lregpack.txt. Enter the com-
mands
out <- prdata()
x <- out$x
y <- out$y
a) Obtain the function pressp from lregpack.txt. Enter the commands
pressp(x,y) and include the resulting plot in Word.
13.21. The Rousseeuw and Leroy (1987, p. 26) Belgian telephone
data has response Y = number of international phone calls (in tens of millions)
made per year in Belgium. The predictor variable is x = year (1950–1973).
From 1964 to 1969 the total number of minutes of calls was recorded instead, and
years 1963 and 1970 were also partially affected. Hence there are 6 large
outliers and 2 additional cases that have been corrupted.
a) The simple linear regression model is Y = α + βx + e = SP + e. The
R commands from the URL above Problem 13.19 make a response plot of
ESP = Ŷ = α̂ + β̂x versus Y for this model. Include the plot in Word.
b) The additive model is Y = α + S(x) + e = AP + e where S is some
unknown function of x. The R commands make a response plot of EAP =
α̂ + Ŝ(x) versus Y for this model. Include the plot in Word.
c) The simple linear regression model is a special case of the additive model
with S(x) = βx. The additive model is a special case of the additive error
regression model Y = m(x) + e where m(x) = α + S(x). The response plots
for these three models are used in the same way as the response plot for the
multiple linear regression model: if the model is good, then the plotted points
should cluster about the identity line with no other pattern. Which response
plot is better for showing that something is wrong with the model? Explain
briefly.
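One way these two response plots could be produced (not necessarily the commands from the URL) is sketched below, assuming x and Y for the Belgian telephone data have been loaded, e.g. from lregdata.txt.

ols <- lm(Y ~ x)
plot(fitted(ols), Y, xlab = "ESP", ylab = "Y"); abline(0, 1)   # SLR response plot
library(mgcv)
am <- gam(Y ~ s(x))
plot(fitted(am), Y, xlab = "EAP", ylab = "Y"); abline(0, 1)    # additive model response plot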
ii) SAS needs an end of file marker to determine when the data ends.
SAS uses a period as the end of file marker. Add a period on the line after
the last line of data in Word and save the file as cbrain.dat.txt on your flash
drive (in the save as type box menu, select plain text and in the File name
box, type cbrain.dat). If these commands fail, get the file cbrain.dat.txt from
the URL and save it on your flash drive. Assume that the flash drive is
Removable Disk (J:). Warning: make sure that the file has been saved
as cbrain.dat.txt.
a) To copy and paste relevant output into Word, click on the output win-
dow and use the top menu commands Edit>Select All and then the menu
commands Edit>Copy.
Interesting models have C(p) ≤ 2k where k = number in model.
The only SAS output for this problem that should be included
in Word are two header lines (Number in model, R-square, C(p), Variables
in Model) and the first line with Number in Model = 6 and C(p) = 7.0947.
You may want to copy all of the SAS output into Word, and then cut and
paste the relevant two lines of output into another window of Word.
b) Activate cbrain.lsp in Arc with the menu commands
File > Load >. Then use the upper box to navigate to where cbrain.lsp is
stored, for example Removable Disk (G:). From Graph&Fit select Fit binomial
response. Select age = X2, breadth = X6, cephalic = X10, circum = X9, headht
= X4, height = X3, length = X5, and size = X7 as predictors, sex as the
response, and ones as the number of trials. This is the full logistic regression
model. Include the relevant output in Word. (A better full model was used
in Problem 13.14 .)
c) Response plot. From Graph&Fit select Plot of. Place sex on V and
B1:EtaU on H. From the OLS popup menu, select Logistic and move the
slider bar to 1. From the lowess popup menu select SliceSmooth and move
the slider bar until the t is good. Include your plot in Word. Are the slice
means (observed proportions) tracking the logistic curve (tted proportions)
fairly well? Use lowess if SliceSmooth does not work.
d) From Graph&Fit select Fit binomial response. Select breadth = X6,
cephalic = X10, circum = X9, headht = X4, height = X3, and size = X7 as
predictors, sex as the response, and ones as the number of trials. This is the
best submodel. Include the relevant output in Word.
e) Put the EE plot H B2:EtaU versus V B1:EtaU in Word. Is the plot
linear?
This chapter gives some information about R and Arc, and some hints for
selected homework problems. As of August 2016, the author's personal com-
puter has Version 3.3.1 (June 21, 2016) of R, and Version 1.06 (July 2004)
of Arc.
can be used to download the R functions and data sets into R. Type ls().
Nearly 70 R functions from lregpack.txt should appear. In R, enter the com-
mand q(). A window asking Save workspace image? will appear. Click on
No to remove the functions from the computer (clicking on Yes saves the func-
tions in R, but the functions and data are easily obtained with the source
commands).
This section gives tips on using R, but is no replacement for books such
as Becker et al. (1988), Chambers (2008), Crawley (2005, 2013), Fox and
Weisberg (2010), or Venables and Ripley (2010). Also see MathSoft (1999a,b)
and use the website (www.google.com) to search for useful websites. For
example, enter the search words R documentation.
To put a graph in Word, hold down the Ctrl and c buttons simulta-
neously. Then select Paste from the Word menu, or hit Ctrl and v at the
same time.
To enter data, open a data set in Notepad or Word. You need to know
the number of rows and the number of columns. Assume that each case is
entered in a row. For example, assuming that the file cyp.lsp has been saved
on your flash drive from the webpage for this book, open cyp.lsp in Word. It
has 76 rows and 8 columns. In R, write the following command.
To save data or a function in R, when you exit, click on Yes when the
Save workspace image? window appears. When you reenter R, type ls().
This will show you what is saved. You should rarely need to save anything
for this book. To remove unwanted items from the workspace, e.g. x, type
rm(x),
pairs(x) makes a scatterplot matrix of the columns of x,
hist(y) makes a histogram of y,
boxplot(y) makes a boxplot of y,
The following commands are useful for a scatterplot created by the com-
mand plot(x,y).
lines(x,y), lines(lowess(x,y,f=.2))
identify(x,y)
abline(out$coef ), abline(0,1)
12 61 1.49
13 62 1.61
14 63 2.12
15 64 11.9
16 65 12.4
17 66 14.2
18 67 15.9
19 68 18.2
20 69 21.2
21 70 4.3
22 71 2.4
23 72 2.7073
24 73 2.9
To enter a data set into Arc, use the following template new.lsp.
dataset=new
begin description
Artificial data.
Contributed by David J. Olive.
end description
begin variables
col 0 = x1
col 1 = x2
col 2 = x3
col 3 = y
end variables
begin data
Next open new.lsp in Notepad. (Or use the vi editor in Unix. Sophisticated
editors like Word will often work, but they sometimes add things like page
breaks that do not allow the statistics software to use the file.) Then copy
the data lines from R and paste them below new.lsp. Then modify the file
new.lsp and save it on a flash drive as the file belg.lsp. (Or save it in mdata
where mdata is a data folder added within the Arc data folder.) The header
of the new file belg.lsp is shown below.
dataset=belgium
begin description
Belgium telephone data from
Rousseeuw and Leroy (1987, p. 26)
end description
begin variables
col 0 = case
col 1 = x = year
col 2 = y = number of calls in tens of millions
end variables
begin data
1 50 0.44
. . .
. . .
. . .
24 73 2.9
The file above also shows the first and last lines of data. The header file needs
a data set name, description, variable list, and a begin data command. Often
the description can be copied and pasted from the source of the data, e.g.
from the STATLIB website. Note that the first variable starts with Col 0.
To transfer a data set from Arc to R, select the item Display data
from the datasets menu. Select the variables you want to save, and then
push the button for Save in R/Splus format. You will be prompted to give
a file name. If you select bodfat, then two files bodfat.txt and bodfat.Rd will
be created. The file bodfat.txt can be read into either R or Splus using the
read.table command. The file bodfat.Rd saves the documentation about the
data set in a standard format for R.
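A minimal example of reading the exported file back into R; the header = TRUE layout is an assumption about how Arc writes the file.

bodfat <- read.table("bodfat.txt", header = TRUE)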
Warning: R is free but not fool proof. If you have an old version of R
and want to download a library, you may need to update your version of
R. The libraries for robust statistics may be useful for outlier detection, but
the methods have not been shown to be consistent or high breakdown. All
software has some bugs. For example, Version 1.1.1 (August 15, 2000) of R
had a random generator for the Poisson distribution that produced variates
with too small of a mean for μ ≥ 10. Hence simulated 95% confidence
intervals might contain μ 0% of the time. This bug seems to have been fixed
in Versions 2.4.1 and later. Also, some functions in lregpack may no longer
work in new versions of R.
1.1 β^T x = x^T β
2.1 Fo = 0.904, pvalue > 0.1, fail to reject Ho, so the reduced model is
good.
2.2 a) 25.970
b) Fo = 0.600, pvalue > 0.5, fail to reject Ho, so the reduced model is
good.
2.7 No, since 0 is in the CI. X could be a very useful predictor for Y, e.g.
if Y = X².
2.11 a) 7 + βXi
b) β̂ = Σ(Yi − 7)Xi / ΣXi²
2.14 a) β̂3 = ΣX3i(Yi − 10 − 2X2i)/ΣX3i². The second partial derivative
= 2ΣX3i² > 0.
3.1 The model uses constant, finger to ground, and sternal height. (You
can tell what the variables are by looking at which variables are deleted.)
3.2 Use L3. L1 and L2 have more predictors and higher Cp than L3 while
L4 does not satisfy the Cp ≤ 2k screen.
3.3 a) L2.
b) Examine L3 since L1 has too many predictors while L4 does not satisfy
the Cp ≤ 2k screen.
3.4 Use a constant, A, B, and C since this is the only model that satisfies
the Cp ≤ 2k screen.
b) Use the model with a constant and B since it has the smallest Cp and
the smallest k such that the Cp ≤ 2k screen is satisfied.
3.8 Several of the marginal relationships are nonlinear, including E(M |H).
3.9 This problem has the student reproduce Example 3.3. Hence log(Y )
is the appropriate response transformation.
3.10 Plots b), c), and e) suggest that log(ht) is needed while plots d),
f), and g) suggest that log(ht) is not needed. Plots c) and d) show that the
residuals from both models are quite small compared to the fitted values.
Plot d) suggests that the two models produce approximately the same fitted
values. Hence if the goal is prediction, the expensive log(ht) measurement
does not seem to be needed.
3.11 h) The submodel is ok, but the response and residual plots found in
f) for the submodel do not look as good as those for the full model found in
d). Since the submodel residuals do not look good, more terms are probably
needed in the model.
3.12 b) Forward selection gives constant, (size)^(1/3), age, sex, breadth, and
cause.
c) Backward elimination gives constant, age, cause, cephalic, headht,
length, and sex.
d) Forward selection is better because it has fewer terms and a smaller Cp.
e) The variables are highly correlated. Hence backward elimination quickly
eliminates the single best predictor (size)^(1/3) and cannot get a good model
that only has a few terms.
f) Although the model in b) could be used, a better model uses constant,
age, sex, and (size)^(1/3).
j) The FF and RR plots are good and so are the response and residual
plots if you ignore the good leverage points corresponding to the 5 babies.
b) (X1, X3)^T ~ N2( (49, 17)^T, Σ ), where Σ has rows (3, −1) and (−1, 4).
c) X1 ⊥ X4 and X3 ⊥ X4.
d) ρ(X1, X3) = Cov(X1, X3)/√(VAR(X1) VAR(X3)) = −1/√(3 · 4) = −0.2887.
10.2 a) Y|X ~ N(49, 16) since Y ⊥ X. (Or use E(Y|X) = μY +
Σ12 Σ22^(-1) (X − μX) = 49 + 0(1/25)(X − 100) = 49 and VAR(Y|X) =
Σ11 − Σ12 Σ22^(-1) Σ21 = 16 − 0(1/25)0 = 16.)
b) E(Y|X) = μY + Σ12 Σ22^(-1) (X − μX) = 49 + 10(1/25)(X − 100) = 9 + 0.4X.
c) VAR(Y|X) = Σ11 − Σ12 Σ22^(-1) Σ21 = 16 − 10(1/25)10 = 16 − 4 = 12.
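A quick numeric check of the 10.2 b) and c) formulas, using the values that appear in the hint (μY = 49, μX = 100, Σ11 = 16, Σ12 = 10, Σ22 = 25); the object names are illustrative.

muY <- 49; muX <- 100; S11 <- 16; S12 <- 10; S22 <- 25
EYgX <- function(X) muY + S12 * (1/S22) * (X - muX)   # = 9 + 0.4*X
VYgX <- S11 - S12 * (1/S22) * S12                     # = 12
EYgX(100); VYgX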
10.4 The proof is identical to that given in Example 10.2. (In addition, it
is fairly simple to show that M1 = M2 ≡ M. That is, M depends on Σ but
not on c or g.)
10.6 a) Sort each column, then nd the median of each column. Then
MED(W ) = (1430, 180, 120)T .
b) The sample mean of (X1 , X2 , X3 )T is found by nding the sample mean
of each column. Hence x = (1232.8571, 168.00, 112.00)T .
10.13 a) The 4 plots should look nearly identical with the five cases 61–65
appearing as outliers.
13.4 G²(R|F) = 17.1855 − 13.5325 = 3.653, df = 1, 0.05 < pvalue < 0.1,
fail to reject Ho, the reduced model is good.
13.8 a) B4
b) EE plot
c) B3 is best. B3 has 12 fewer predictors than B2 but the AIC increased
by less than 3. B1 has too many predictors with large Wald pvalues, B2 =
II still has too many predictors (want ≤ 300/10 = 30 predictors), while B4
has too small of a pvalue for the change in deviance test.
13.14 b) Use the log rule: (max age)/(min age) = 1400 > 10.
e) The slice means track the logistic curve very well if 8 slices are used.
i) The EE plot is linear.
j) The slice means track the logistic curve very well if 8 slices are used.
13.16 b) The response plot (e.g., with 4 slices) is bad, so the LR model is
bad.
d) Now the response plot (e.g., with 12 slices) is good in that slice smooth
and the logistic curve are close where there is data (also the LR model is
good at classifying 0s and 1s).
f) For this problem, G²(O|F) = 62.7188 − 0.00419862 = 62.7146, df = 1,
pvalue = 0.00, so reject Ho and conclude that there is an LR relationship
between ape and the predictor x3.
g) The MLE does not exist since there is perfect classication (and the
logistic curve can get close to but never equal a discontinuous step function).
Hence Wald pvalues tend to have little meaning; however, the change in
deviance test tends to correctly suggest that there is an LR relationship
when there is perfect classication.
13.19 The response plot should look ok, but the function uses a default
number of slices rather than allowing the user to select the number of slices
using a slider bar (a useful feature of Arc).
13.20 a) Since this is simulated PR data, the response plot should look
ok, but the function uses a default lowess smoothing parameter rather than
allowing the user to select the smoothing parameter using a slider bar (a
useful feature of Arc).
b) The data should follow the identity line in the weighted t response
plots. In about 1 in 20 plots there will be a very large count that looks like
an outlier. The weighted residual plot based on the MLE usually looks better
than the plot based on the minimum chi-square estimator (the MLE plot
tends to have less of a left opening megaphone shape).
13.22 b) Model I1 is better since it has fewer predictors and lower AIC
than model I2 .
13.23 a)
Number in Model Rsquare C(p) Variables in model
6 0.2316 7.0947 X3 X4 X6 X7 X9 X10
c) The slice means follow the logistic curve fairly well with 8 slices.
e) The EE plot is linear.
f) The slice means follow the logistic curve fairly well with 8 slices.
14.3 Tables
Tabled values are F(k, d, 0.95) where P(F < F(k, d, 0.95)) = 0.95.
00 stands for ∞. Entries were produced with the qf(0.95,k,d) command in
R. The numerator degrees of freedom are k while the denominator degrees of
freedom are d.
k 1 2 3 4 5 6 7 8 9 00
d
1 161 200 216 225 230 234 237 239 241 254
2 18.5 19.0 19.2 19.3 19.3 19.3 19.4 19.4 19.4 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.41
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 1.84
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 1.62
00 3.84 3.00 2.61 2.37 2.21 2.10 2.01 1.94 1.88 1.00
Tabled values are tα,d where P(t < tα,d) = α where t has a t distribution
with d degrees of freedom. If d > 29, use the N(0, 1) cutoffs given in the d = Z = ∞ row.
alpha pvalue
d 0.005 0.01 0.025 0.05 0.5 0.95 0.975 0.99 0.995 left tail
1 -63.66 -31.82 -12.71 -6.314 0 6.314 12.71 31.82 63.66
2 -9.925 -6.965 -4.303 -2.920 0 2.920 4.303 6.965 9.925
3 -5.841 -4.541 -3.182 -2.353 0 2.353 3.182 4.541 5.841
4 -4.604 -3.747 -2.776 -2.132 0 2.132 2.776 3.747 4.604
5 -4.032 -3.365 -2.571 -2.015 0 2.015 2.571 3.365 4.032
6 -3.707 -3.143 -2.447 -1.943 0 1.943 2.447 3.143 3.707
7 -3.499 -2.998 -2.365 -1.895 0 1.895 2.365 2.998 3.499
8 -3.355 -2.896 -2.306 -1.860 0 1.860 2.306 2.896 3.355
9 -3.250 -2.821 -2.262 -1.833 0 1.833 2.262 2.821 3.250
10 -3.169 -2.764 -2.228 -1.812 0 1.812 2.228 2.764 3.169
11 -3.106 -2.718 -2.201 -1.796 0 1.796 2.201 2.718 3.106
12 -3.055 -2.681 -2.179 -1.782 0 1.782 2.179 2.681 3.055
13 -3.012 -2.650 -2.160 -1.771 0 1.771 2.160 2.650 3.012
14 -2.977 -2.624 -2.145 -1.761 0 1.761 2.145 2.624 2.977
15 -2.947 -2.602 -2.131 -1.753 0 1.753 2.131 2.602 2.947
16 -2.921 -2.583 -2.120 -1.746 0 1.746 2.120 2.583 2.921
17 -2.898 -2.567 -2.110 -1.740 0 1.740 2.110 2.567 2.898
18 -2.878 -2.552 -2.101 -1.734 0 1.734 2.101 2.552 2.878
19 -2.861 -2.539 -2.093 -1.729 0 1.729 2.093 2.539 2.861
20 -2.845 -2.528 -2.086 -1.725 0 1.725 2.086 2.528 2.845
21 -2.831 -2.518 -2.080 -1.721 0 1.721 2.080 2.518 2.831
22 -2.819 -2.508 -2.074 -1.717 0 1.717 2.074 2.508 2.819
23 -2.807 -2.500 -2.069 -1.714 0 1.714 2.069 2.500 2.807
24 -2.797 -2.492 -2.064 -1.711 0 1.711 2.064 2.492 2.797
25 -2.787 -2.485 -2.060 -1.708 0 1.708 2.060 2.485 2.787
26 -2.779 -2.479 -2.056 -1.706 0 1.706 2.056 2.479 2.779
27 -2.771 -2.473 -2.052 -1.703 0 1.703 2.052 2.473 2.771
28 -2.763 -2.467 -2.048 -1.701 0 1.701 2.048 2.467 2.763
29 -2.756 -2.462 -2.045 -1.699 0 1.699 2.045 2.462 2.756
Z -2.576 -2.326 -1.960 -1.645 0 1.645 1.960 2.326 2.576
CI 90% 95% 99%
0.995 0.99 0.975 0.95 0.5 0.05 0.025 0.01 0.005 right tail
0.01 0.02 0.05 0.10 1 0.10 0.05 0.02 0.01 two tail
References
Chen, A., Bengtsson, T., & Ho, T. K. (2009). A regression paradox for linear models: Sufficient conditions and relation to Simpson's paradox. The American Statistician, 63, 218–225.
Chen, C. H., & Li, K. C. (1998). Can SIR be as popular as multiple linear regression? Statistica Sinica, 8, 289–316.
Chihara, L., & Hesterberg, T. (2011). Mathematical statistics with resampling and R. Hoboken, NJ: Wiley.
Chmielewski, M. A. (1981). Elliptically symmetric distributions: A review and bibliography. International Statistical Review, 49, 67–74.
Christensen, R. (2013). Plane answers to complex questions: The theory of linear models (4th ed.). New York, NY: Springer.
Christmann, A., & Rousseeuw, P. J. (2001). Measuring overlap in binary regression. Computational Statistics & Data Analysis, 37, 65–75.
Claeskens, G., & Hjort, N. L. (2003). The focused information criterion (with discussion). Journal of the American Statistical Association, 98, 900–916.
Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. New York, NY: Cambridge University Press.
Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Cobb, G. W. (1998). Introduction to design and analysis of experiments. Emeryville, CA: Key College Publishing.
Cody, R. P., & Smith, J. K. (2006). Applied statistics and the SAS programming language (5th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lea, Inc.
Collett, D. (1999). Modelling binary data (1st ed.). Boca Raton, FL: Chapman & Hall/CRC.
Collett, D. (2003). Modelling binary data (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Cook, R. D. (1977). Deletion of influential observations in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1993). Exploring partial residual plots. Technometrics, 35, 351–362.
Cook, R. D. (1998). Regression graphics: Ideas for studying regression through graphics. New York, NY: Wiley.
Cook, R. D., Helland, I. S., & Su, Z. (2013). Envelopes and partial least squares regression. Journal of the Royal Statistical Society, B, 75, 851–877.
Cook, R. D., & Nachtsheim, C. J. (1994). Reweighting to achieve elliptically contoured covariates in regression. Journal of the American Statistical Association, 89, 592–599.
Cook, R. D., & Olive, D. J. (2001). A note on visualizing response transformations in regression. Technometrics, 43, 443–449.
Cook, R. D., & Setodji, C. M. (2003). A model-free test for reduced rank in multivariate regression. Journal of the American Statistical Association, 98, 340–351.
Cook, R. D., & Su, Z. (2013). Scaled envelopes: Scale-invariant and efficient estimation in multivariate linear regression. Biometrika, 100, 929–954.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. London: Chapman & Hall.
Cook, R. D., & Weisberg, S. (1994). Transforming a response variable for linearity. Biometrika, 81, 731–737.
Cook, R. D., & Weisberg, S. (1997). Graphics for assessing the adequacy of regression models. Journal of the American Statistical Association, 92, 490–499.
Cook, R. D., & Weisberg, S. (1999a). Applied regression including computing and graphics. New York, NY: Wiley.
Cook, R. D., & Weisberg, S. (1999b). Graphs in statistical analysis: Is the medium the message? The American Statistician, 53, 29–37.
Cook, R. D., & Zhang, X. (2015). Foundations of envelope models and methods. Journal of the American Statistical Association, 110, 599–611.
Copas, J. B. (1983). Regression prediction and shrinkage (with discussion). Journal of the Royal Statistical Society, B, 45, 311–354.
Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Crawley, M. J. (2005). Statistics: An introduction using R. Hoboken, NJ: Wiley.
Crawley, M. J. (2013). The R book (2nd ed.). Hoboken, NJ: Wiley.
Croux, C., Dehon, C., Rousseeuw, P. J., & Van Aelst, S. (2001). Robust estimation of the conditional median function at elliptical models. Statistics & Probability Letters, 51, 361–368.
Daniel, C., & Wood, F. S. (1980). Fitting equations to data (2nd ed.). New York, NY: Wiley.
Darlington, R. B. (1969). Deriving least-squares weights without calculus. The American Statistician, 23, 41–42.
Datta, B. N. (1995). Numerical linear algebra and applications. Pacific Grove, CA: Brooks/Cole Publishing Company.
David, H. A. (1995). First (?) occurrences of common terms in mathematical statistics. The American Statistician, 49, 121–133.
David, H. A. (2006–2007). First (?) occurrences of common terms in statistics and probability. Publications and Preprint Series, Iowa State University, www.stat.iastate.edu/preprint/hadavid.html
Dean, A. M., & Voss, D. (2000). Design and analysis of experiments. New York, NY: Springer.
Dongarra, J. J., Moler, C. B., Bunch, J. R., & Stewart, G. W. (1979). Linpack users' guide. Philadelphia, PA: SIAM.
Draper, N. R. (2002). Applied regression analysis bibliography update 2000–2001. Communications in Statistics: Theory and Methods, 31, 2051–2075.
478 References
Draper, N. R., & Smith, H. (1966). Applied regression analysis (1st ed.).
New York, NY: Wiley.
Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.).
New York, NY: Wiley.
Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New
York, NY: Wiley.
Driscoll, M. F., & Krasnicka, B. (1995). An accessible proof of Craigs theorem
in the general case. The American Statistician, 49, 5962.
Dunn, O. J., & Clark, V. A. (1974). Applied statistics: Analysis of variance
and regression. New York, NY: Wiley.
Eaton, M. L. (1986). A characterization of spherical distributions. Journal of
Multivariate Analysis, 20, 272276.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans.
Philadelphia, PA: SIAM.
Efron, B. (2014), Estimation and accuracy after model selection (with dis-
cussion). Journal of the American Statistical Association, 109, 9911007.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle
regression (with discussion). The Annals of Statistics, 32, 407451.
Ernst, M. D. (2009). Teaching inference for randomized experiments. Journal
of Statistical Education, 17 (online).
Ezekiel, M. (1930). Methods of correlation analysis. New York, NY: Wiley.
Ezekiel, M., & Fox, K. A. (1959). Methods of correlation and regression anal-
ysis. New York, NY: Wiley.
Fabian, V. (1991). On the problem of interactions in the analysis of variance
(with discussion). Journal of the American Statistical Association, 86, 362
375.
Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on
generalized linear models (2nd ed.). New York, NY: Springer.
Ferrari, D., & Yang, Y. (2015). Condence sets for model selection by
F testing. Statistica Sinica, 25, 16371658.
Fox, J. (1991). Regression diagnostics. Newbury Park, CA: Sage Publications.
Fox, J. (2015). Applied regression analysis and generalized linear models (3rd
ed.). Thousand Oaks, CA: Sage Publications.
Fox, J., & Weisberg, S. (2010). An R companion to applied regression. Thou-
sand Oaks, CA: Sage Publications.
Freedman, D. A. (1983). A note on screening regression equations. The Amer-
ican Statistician, 37, 152155.
Freedman, D. A. (2005). Statistical models theory and practice. New York,
NY: Cambridge University Press.
Frey, J. (2013). Data-driven nonparametric prediction intervals. Journal of
Statistical Planning and Inference, 143, 1039–1048.
Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-
dimensional AIC-type and Cp-type criteria in multivariate linear regres-
sion. Journal of Multivariate Analysis, 123, 184–200.
Furnival, G., & Wilson, R. (1974). Regression by leaps and bounds. Techno-
metrics, 16, 499–511.
Gail, M. H. (1996). Statistics in action. Journal of the American Statistical
Association, 91, 1–13.
Gelman, A. (2005). Analysis of variance: Why it is more important than
ever (with discussion). The Annals of Statistics, 33, 1–53.
Ghosh, S. (1987). Note on a common error in regression diagnostics using
residual plots. The American Statistician, 41, 338.
Gilmour, S. G. (1996). The interpretation of Mallows's Cp-statistic. The
Statistician, 45, 49–56.
Gladstone, R. J. (1905). A study of the relations of the brain to the size of
the head. Biometrika, 4, 105–123.
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations (2nd ed.).
Baltimore, MD: Johns Hopkins University Press.
Graybill, F. A. (1976). Theory and application of the linear model. North
Scituate, MA: Duxbury Press.
Gunst, R. F., & Mason, R. L. (1980). Regression analysis and its application:
A data oriented approach. New York, NY: Marcel Dekker.
Guttman, I. (1982). Linear models: An introduction. New York, NY: Wiley.
Haenggi, J. C. (2009). Plots for the Design and Analysis of Experiments. Master's
Research Paper, Southern Illinois University, at http://lagrange.
math.siu.edu/Olive/sjenna.pdf
Haggstrom, G. W. (1983). Logistic regression and discriminant analysis by
ordinary least squares. Journal of Business & Economic Statistics, 1, 229–238.
Hahn, G. J. (1982). Design of experiments: An annotated bibliography. In S.
Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 2,
pp. 359–366). New York, NY: Wiley.
Hamilton, L. C. (1992). Regression with graphics: A second course in applied
statistics. Belmont, CA: Wadsworth.
Harrell, F. E. (2015). Regression modeling strategies with applications to lin-
ear models, logistic and ordinal regression, and survival analysis (2nd ed.).
New York, NY: Springer.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic prices and the demand
for clean air. Journal of Environmental Economics and Management, 5,
81–102.
Harter, H. L. (1974a). The method of least squares and some alternatives.
Part I. International Statistical Review, 42, 147–174.
Harter, H. L. (1974b). The method of least squares and some alternatives.
Part II. International Statistical Review, 42, 235–264.
Harter, H. L. (1975a). The method of least squares and some alternatives.
Part III. International Statistical Review, 43, 1–44.
Harter, H. L. (1975b). The method of least squares and some alternatives.
Part IV. International Statistical Review, 43, 125–190, 273–278.
Marden, J. I., & Muyot, E. T. (1995). Rank tests for main and interaction
effects in analysis of variance. Journal of the American Statistical Associ-
ation, 90, 1388–1398.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis.
London, UK: Academic Press.
Maronna, R. A., & Zamar, R. H. (2002). Robust estimates of location and
dispersion for high-dimensional datasets. Technometrics, 44, 307–317.
MathSoft (1999a). S-Plus 2000 user's guide. Seattle, WA: Data Analysis
Products Division, MathSoft.
MathSoft (1999b). S-Plus 2000 guide to statistics (Vol. 2). Seattle, WA: Data
Analysis Products Division, MathSoft.
Maxwell, S. E., & Delaney, H. D. (2003). Designing experiments and analyzing
data (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.).
London, UK: Chapman & Hall.
McDonald, G. C., & Schwing, R. C. (1973). Instabilities of regression esti-
mates relating air pollution to mortality. Technometrics, 15, 463–482.
McKenzie, J. D., & Goldman, R. (1999). The student edition of MINITAB.
Reading, MA: Addison Wesley Longman.
Mendenhall, W., & Sincich, T. L. (2011). A second course in statistics:
Regression analysis (7th ed.). Boston, MA: Pearson.
Merriman, M. (1907). A text book on the method of least squares (8th ed.).
New York, NY: Wiley.
Mickey, R. M., Dunn, O. J., & Clark, V. A. (2004). Applied statistics: Analysis
of variance and regression (3rd ed.). Hoboken, NJ: Wiley.
Miller, D. M. (1984). Reducing transformation bias in curve fitting. The
American Statistician, 38, 124–126.
Montgomery, D. C. (1984). Design and analysis of experiments (2nd ed.).
New York, NY: Wiley.
Montgomery, D. C. (2012). Design and analysis of experiments (8th ed.).
New York, NY: Wiley.
Montgomery, D. C., Peck, E. A., & Vining, G. (2001). Introduction to linear
regression analysis (3rd ed.). Hoboken, NJ: Wiley.
Montgomery, D. C., Peck, E. A., & Vining, G. (2012). Introduction to linear
regression analysis (5th ed.). Hoboken, NJ: Wiley.
Moore, D. S. (2000). The basic practice of statistics (2nd ed.). New York,
NY: W.H. Freeman.
Moore, D. S. (2007). The basic practice of statistics (4th ed.). New York, NY:
W.H. Freeman.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression. Reading,
MA: Addison-Wesley.
Myers, R. H., & Milton, J. S. (1990). A first course in the theory of linear
statistical models. Belmont, CA: Duxbury.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier de-
tection. New York, NY: Wiley.
Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the mini-
mum covariance determinant estimator. Technometrics, 41, 212–223.
Ryan, T. (2009). Modern regression methods (2nd ed.). Hoboken, NJ: Wiley.
Sadooghi-Alvandi, S. M. (1990). Simultaneous prediction intervals for re-
gression models with intercept. Communications in Statistics: Theory and
Methods, 19, 1433–1441.
Sall, J. (1990). Leverage plots for general linear hypotheses. The American
Statistician, 44, 308–315.
Santner, T. J., & Duffy, D. E. (1986). A note on A. Albert's and J. A.
Anderson's conditions for the existence of maximum likelihood estimates in
logistic regression models. Biometrika, 73, 755–758.
SAS Institute (1985). SAS user's guide: Statistics. Version 5. Cary, NC: SAS
Institute.
Schaaffhausen, H. (1878). Die Anthropologische Sammlung Des Anatomischen
Der Universität Bonn. Archiv für Anthropologie, 10, 1–65. Appendix.
Scheffé, H. (1959). The analysis of variance. New York, NY: Wiley.
Schmoyer, R. L. (1992). Asymptotically valid prediction intervals for linear
models. Technometrics, 34, 399–408.
Searle, S. R. (1971). Linear models. New York, NY: Wiley.
Searle, S. R. (1982). Matrix algebra useful for statistics. New York, NY: Wiley.
Searle, S. R. (1988). Parallel lines in residual plots. The American Statisti-
cian, 42, 211.
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis (2nd ed.).
New York, NY: Wiley.
Selvin, H. C., & Stuart, A. (1966). Data-dredging procedures in survey anal-
ysis. The American Statistician, 20 (3), 20–23.
Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics: An
introduction with applications. New York, NY: Chapman & Hall.
Severini, T. A. (1998). Some properties of inferences in misspecified linear
models. Statistics & Probability Letters, 40, 149–153.
Sheather, S. J. (2009). A modern approach to regression with R. New York,
NY: Springer.
Shi, L., & Chen, G. (2009). Influence measures for general linear models with
correlated errors. The American Statistician, 63, 40–42.
Simonoff, J. S. (2003). Analyzing categorical data. New York, NY: Springer.
Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods (6th ed.).
Ames, IA: Iowa State College Press.
Steinberg, D. M., & Hunter, W. G. (1984). Experimental design: Review and
comment. Technometrics, 26, 71–97.
Stigler, S. M. (1986). The history of statistics: The measurement of uncer-
tainty before 1900. Cambridge, MA: Harvard University Press.
Su, Z., & Cook, R. D. (2012). Inner envelopes: Efficient estimation in multi-
variate linear regression. Biometrika, 99, 687–702.
Su, Z., & Yang, S.-S. (2006). A note on lack-of-fit tests for linear models
without replication. Journal of the American Statistical Association, 101,
205–210.
Tremearne, A. J. N. (1911). Notes on some Nigerian tribal marks. Journal
of the Royal Anthropological Institute of Great Britain and Ireland, 41,
162–178.
Tukey, J. W. (1957). Comparative anatomy of transformations. Annals of
Mathematical Statistics, 28, 602–632.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-
Wesley Publishing Company.
Velilla, S. (1993). A note on the multivariate Box-Cox transformation to
normality. Statistics & Probability Letters, 17, 259–263.
Velleman, P. F., & Welsch, R. E. (1981). Efficient computing of regression
diagnostics. The American Statistician, 35, 234–242.
Venables, W. N., & Ripley, B. D. (2010). Modern applied statistics with S
(4th ed.). New York, NY: Springer.
Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012).
Regression methods in biostatistics: Linear, logistic, survival, and repeated
measures models (2nd ed.). New York, NY: Springer.
Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical
statistics with applications (7th ed.). Belmont, CA: Thomson Brooks/Cole.
Walls, R. C., & Weeks, D. L. (1969). A note on the variance of a predicted
response in regression. The American Statistician, 23, 24–26.
Wang, L., Liu, X., Liang, H., & Carroll, R. J. (2011). Estimation and vari-
able selection for generalized additive partial linear models. The Annals of
Statistics, 39, 1827–1851.
Weisberg, S. (2002). Dimension reduction regression in R. Journal of Statis-
tical Software, 7, webpage www.jstatsoft.org
Weisberg, S. (2014). Applied linear regression (4th ed.). Hoboken, NJ: Wiley.
Welch, B. L. (1947). The generalization of Student's problem when several
different population variances are involved. Biometrika, 34, 28–35.
Welch, B. L. (1951). On the comparison of several mean values: An alternative
approach. Biometrika, 38, 330–336.
Welch, W. J. (1990). Construction of permutation tests. Journal of the Amer-
ican Statistical Association, 85, 693–698.
Weld, L. D. (1916). Theory of errors and least squares. New York, NY:
Macmillan.
Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing
(3rd ed.). New York, NY: Academic Press, Elsevier.
Winkelmann, R. (2000). Econometric analysis of count data (3rd ed.).
New York, NY: Springer.
Winkelmann, R. (2008). Econometric analysis of count data (5th ed.).
New York, NY: Springer.
Wood, S. N. (2006). Generalized additive models: An introduction with R.
Boca Raton, FL: Chapman & Hall/CRC.